I am trying to build tables of the most frequent words for many languages. I read Wikipedia text and isolate the words. To detect whether a character is alphanumeric I use u_isalnum from ICU (C++), which takes a 32-bit code point as its parameter. It works correctly for Latin characters (English) and extended Latin (Polish), and I think it will also work for Greek, Russian, Hebrew, Arabic, etc.
But what about Chinese and Japanese? There I must collect single characters, not runs of characters delimited by spaces and punctuation. How can I detect whether a Unicode code point is an ideogram?
My first simple solution is to check manually whether the code point falls in the Chinese and Japanese ranges, but there may be more ideogram code points than the ranges I know about.
How to distinguish whether a Unicode code point is a character or an ideogram?
39 Views Asked by Saku At
East Asian characters mostly have the Unicode general category Lo (Other Letter). According to the documentation, this is sufficient for u_isalnum to return true. This means it should be perfectly fine for you to keep using u_isalnum for a first iteration to match strings of words.
To then split them up into single words you might need a word list for comparison. Search for “chinese word segmentation”. I would be surprised if at least part of the problem weren't already solved. But beware that it might lead you swiftly into natural language processing territory.
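
For the single-character requirement in the question, here is a minimal sketch of how the two ideas could be combined: ideographs are emitted one by one, everything else is grouped into runs. The names is_cjk_ideograph and tokenize are hypothetical helpers, not ICU functions, and the block list is an assumption that is deliberately incomplete (it omits the later extension blocks, and kana are not ideographs, so Japanese kana end up in runs). The predicate passed to tokenize stands in for u_isalnum. With ICU available, u_hasBinaryProperty(c, UCHAR_IDEOGRAPHIC) from unicode/uchar.h would probably be a better test than hand-written ranges; it is left out here only so the sketch compiles without ICU.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: true if cp lies in one of a few well-known CJK
// ideograph blocks. Assumption: this list is NOT exhaustive.
bool is_cjk_ideograph(char32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF)     // CJK Unified Ideographs
        || (cp >= 0x3400 && cp <= 0x4DBF)     // CJK Unified Ideographs Extension A
        || (cp >= 0x20000 && cp <= 0x2A6DF)   // CJK Unified Ideographs Extension B
        || (cp >= 0xF900 && cp <= 0xFAFF);    // CJK Compatibility Ideographs
}

// Hypothetical helper: splits text into tokens. Each ideograph becomes its
// own token; other characters accepted by is_word_char (a stand-in for
// u_isalnum) are grouped into runs; anything else acts as a separator.
std::vector<std::u32string> tokenize(
        const std::u32string& text,
        const std::function<bool(char32_t)>& is_word_char) {
    std::vector<std::u32string> tokens;
    std::u32string current;

    // Push the pending run, if any, and start a new one.
    auto flush = [&] {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };

    for (char32_t cp : text) {
        if (is_cjk_ideograph(cp)) {
            flush();                                  // end any run in progress
            tokens.push_back(std::u32string(1, cp));  // ideograph stands alone
        } else if (is_word_char(cp)) {
            current.push_back(cp);                    // extend the current run
        } else {
            flush();                                  // space, punctuation, etc.
        }
    }
    flush();
    return tokens;
}
```

In the real program the second argument would be a small lambda that forwards to u_isalnum. Counting single ideographs gives character frequencies, not word frequencies, so the word segmentation caveat above still applies.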