How to distinguish whether a Unicode code point is an alphabetic character or an ideogram?

I am trying to build tables of the most frequent words for many languages. I read Wikipedia text and isolate the words. To detect whether a character is alphanumeric I use u_isalnum from ICU (C++), which takes a 32-bit code point as its parameter. It works correctly for Latin characters (English) and extended Latin (Polish), and I think it will also work for Greek, Russian, Hebrew, Arabic, etc.
But what about Chinese and Japanese? There I must collect single characters, not runs of characters delimited by spaces and punctuation. How can I detect that a Unicode code point is an ideogram?
My first simple solution is to check manually whether the code point falls in the Chinese and Japanese ranges, but there may be more ideograph ranges than the ones I know of.

There is 1 best solution below

East Asian characters mostly have the Unicode general category Lo (Other Letter). According to the documentation, this is sufficient for u_isalnum to return true. So it should be perfectly fine to keep using u_isalnum in a first pass to match runs of word characters.

To then split those runs into single words, you might need a word list for comparison. Search for “Chinese word segmentation”; I would be surprised if at least part of the problem weren't already solved. But beware that it might lead you swiftly into natural language processing territory.
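If counting single characters is enough, as the question suggests, the splitting can happen in the same pass that collects alphanumeric runs. The following is only a minimal sketch of that idea, not a definitive implementation. The names `tokenize`, `demo_is_alnum` and `demo_is_ideograph` are hypothetical, and the two `demo_` predicates are deliberately crude stand-ins so that the example compiles without ICU. In real code you would presumably pass small lambdas that wrap ICU's `u_isalnum(c)` and `u_hasBinaryProperty(c, UCHAR_IDEOGRAPHIC)` instead.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in predicates (assumptions for this sketch, not ICU's logic).
// With ICU, wrap u_hasBinaryProperty(c, UCHAR_IDEOGRAPHIC) and u_isalnum(c).
inline bool demo_is_ideograph(char32_t c) {
    // Only the basic CJK Unified Ideographs block, U+4E00..U+9FFF.
    return c >= 0x4E00 && c <= 0x9FFF;
}

inline bool demo_is_alnum(char32_t c) {
    // ASCII letters and digits, plus the stand-in ideograph range.
    return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z') || demo_is_ideograph(c);
}

using CodePointPredicate = std::function<bool(char32_t)>;

// Splits text into tokens: a run of alphanumeric code points forms one token,
// but every ideograph becomes a token of its own.
std::vector<std::u32string> tokenize(const std::u32string& text,
                                     const CodePointPredicate& is_alnum,
                                     const CodePointPredicate& is_ideograph) {
    std::vector<std::u32string> tokens;
    std::u32string current;

    // Emits the pending run, if any, and starts a new one.
    auto flush = [&] {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };

    for (char32_t c : text) {
        if (is_ideograph(c)) {
            flush();                                 // end any pending word
            tokens.push_back(std::u32string(1, c));  // ideograph stands alone
        } else if (is_alnum(c)) {
            current.push_back(c);                    // extend the current word
        } else {
            flush();                                 // space, punctuation, etc.
        }
    }
    flush();
    return tokens;
}
```

Calling `tokenize(U"Hello \u4E16\u754C, abc1", demo_is_alnum, demo_is_ideograph)` yields four tokens: `Hello`, the two ideographs as separate tokens, and `abc1`. Note that Japanese kana are not ideographs, so a property check like this would still collect them as runs; if they need special handling, checking the script with ICU's `uscript_getScript` may be one option.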