
Find The Length Of A Sentence With English Words And Chinese Characters

The sentence may include non-English characters, e.g. Chinese. For 你好,hello world the expected length is 5 (2 Chinese characters, 2 English words, and 1 comma).

Solution 1:

You can use the fact that most Chinese characters fall in the Unicode range 0x4e00 - 0x9fcc.

# -*- coding: utf-8 -*-
import re

s = '你好 hello, world'
s = s.decode('utf-8')

# First find all 'normal' words and punctuation.
# '[\x21-\x2f]' includes most ASCII punctuation; change it to ',' if you only need to match a comma.
# (Python 2: without re.UNICODE, \w matches only ASCII word characters.)
count = len(re.findall(r'\w+|[\x21-\x2f]', s))

# Then add one for every Chinese character.
# See https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed.
for ch in s:
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print count
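The snippet above is Python 2. As a rough sketch of the same idea for Python 3 (an assumption, the answer does not target it), the word, punctuation and Chinese-character patterns can be combined into one regex. The names `CJK_RANGE`, `TOKEN_RE` and `sentence_length` are made up for illustration, and `[A-Za-z0-9_]` stands in for `\w`, which matches Chinese characters too in Python 3.

```python
import re

# Assumed range, taken from the answer above: most common Chinese characters.
CJK_RANGE = "\u4e00-\u9fcc"

# One token per ASCII word, per ASCII punctuation mark in \x21-\x2f,
# or per single character in the assumed Chinese range.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\x21-\x2f]|[" + CJK_RANGE + "]")


def sentence_length(text):
    """Count English words, punctuation marks and Chinese characters."""
    return len(TOKEN_RE.findall(text))


print(sentence_length("你好 hello, world"))  # 5
```

Note that a full-width comma (U+FF0C) lies outside both ranges, so it would not be counted unless the punctuation class is extended.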

Solution 2:

If you're happy to treat each Chinese character as a separate word, even though that isn't always the case, you could do this by examining the Unicode category of each character with the unicodedata module.

For example, if you run this code on your example text:

# -*- coding: utf-8 -*-
import unicodedata

s = u"你好,hello world"
for c in s:
  print unicodedata.category(c)

You'll see that the Chinese characters are reported as Lo (Letter, other), unlike Latin characters, which are typically reported as Ll (lowercase) or Lu (uppercase).

Knowing that, you could consider anything that is Lo to be an individual word, even if it isn't separated by whitespace or punctuation.

This almost certainly won't work in all cases for all languages, but it may be good enough for your needs.
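To see the categories side by side, here is a small sketch, assuming Python 3 (the answer's own code is Python 2). The helper name `categories` is made up for illustration; `unicodedata.category` is the standard-library call used above.

```python
import unicodedata


def categories(text):
    """Return (character, Unicode general category) pairs for text."""
    return [(ch, unicodedata.category(ch)) for ch in text]


# Print each character of the example sentence with its category.
for ch, cat in categories("你好,hello world"):
    print(repr(ch), cat)
```

The two Chinese characters come out as Lo, the comma as Po, the Latin letters as Ll and the space as Zs. Letters from several other scripts, such as Japanese kana, Hebrew and Arabic, are also Lo, which is one reason the approach does not carry over to every language.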

Update

Here is a more complete example of how you could do it:

# -*- coding: utf-8 -*-
import unicodedata

s = u"你好,hello world"

wordcount = 0
start = True
for c in s:
  cat = unicodedata.category(c)
  if cat == 'Lo':        # Letter, other
    wordcount += 1       # each letter counted as a word
    start = True
  elif cat[0] == 'P':    # Some kind of punctuation
    wordcount += 1       # each punctuation mark counted as a word
    start = True
  elif cat[0] == 'Z':    # Some kind of separator
    start = True
  else:                  # Everything else
    if start:
      wordcount += 1     # Only count at the start of a word
    start = False

print wordcount

Solution 3:

There is a problem with the logic here:

你好
,

These are all characters, not words. For the Chinese characters you will probably need to do something with a regex.

The problem here is that Chinese characters might be whole words or only parts of words.

大好

To a regex, is that one word or two? Each character alone is a word, but together they also form one word.

hello world

If you split this on spaces, you get 2 words, but then your Chinese regex might not work.

I think the only way you can make this work for "words" is to handle the Chinese and the English separately.
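One way to read "separately" is to run two independent passes and report both counts. This is only a sketch, assuming Python 3, the range U+4E00 to U+9FCC for Chinese, and runs of ASCII letters and digits for English words; `count_separately` and the two pattern names are made up for illustration.

```python
import re

# Assumption: "Chinese" means one character in U+4E00 to U+9FCC.
CHINESE_CHAR = re.compile("[\u4e00-\u9fcc]")
# Assumption: an English word is a run of ASCII letters or digits.
ENGLISH_WORD = re.compile(r"[A-Za-z0-9]+")


def count_separately(text):
    """Return (chinese_characters, english_words) counted independently."""
    chinese = len(CHINESE_CHAR.findall(text))
    english = len(ENGLISH_WORD.findall(text))
    return chinese, english


print(count_separately("大好 hello world"))  # (2, 2)
```

This counts Chinese characters, not Chinese words: 大好 is reported as 2 even if it is meant as a single word. Grouping characters into words would need a word segmenter, which this sketch does not attempt.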
