How to distinguish whether a Unicode code point is a character or an ideogram?

49 Views Asked by Saku At 28 June 2025 at 06:12

I am trying to build tables of the most frequent words for many languages. I read Wikipedia text and isolate the words. To detect whether a character is alphanumeric, I use u_isalnum from ICU (C++). This function takes a 32-bit code point as its parameter. It works correctly for Latin characters (English) and extended Latin (Polish), and I think it will also work for Greek, Russian, Hebrew, Arabic, etc.
But what about Chinese and Japanese? There I must collect single characters, not runs of characters delimited by spaces and punctuation. How can I detect whether a Unicode code point is an ideogram?
My first simple solution: manually check whether the code point falls in the Chinese and Japanese ranges, but there may be more ideogram ranges than the ones I know.


There is 1 best solution below

Boldewyn On 24 July 2023 at 13:17

Most East Asian characters have the Unicode general category Lo (Letter, other). According to the documentation, this is sufficient for u_isalnum to return true. So it should be perfectly fine to keep using u_isalnum as a first pass to match runs of word characters.
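To illustrate the range-check idea from the question, here is a minimal sketch that emits each CJK ideograph as its own token and groups everything else into runs. It is deliberately ICU-free so it compiles on its own: `is_cjk_ideograph`, `tokenize`, and the `is_word_char` callback are hypothetical names, and the block list is a non-exhaustive assumption that would need extending for newer Unicode versions. In real code the callback would presumably wrap `u_isalnum`, and the hand-written table could likely be replaced by ICU's `u_hasBinaryProperty(c, UCHAR_IDEOGRAPHIC)`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Inclusive code point range.
struct Range { char32_t first, last; };

// Assumption: a non-exhaustive table of CJK ideograph blocks.
constexpr Range kIdeographRanges[] = {
    {0x3400, 0x4DBF},    // CJK Unified Ideographs Extension A
    {0x4E00, 0x9FFF},    // CJK Unified Ideographs
    {0xF900, 0xFAFF},    // CJK Compatibility Ideographs
    {0x20000, 0x2A6DF},  // Extension B
    {0x2A700, 0x2EBEF},  // Extensions C, D, E, F (adjacent blocks)
    {0x2F800, 0x2FA1F},  // CJK Compatibility Ideographs Supplement
    {0x30000, 0x323AF},  // Extensions G, H (adjacent blocks)
};

// Returns true if c falls in one of the ranges listed above.
bool is_cjk_ideograph(char32_t c) {
    for (const Range& r : kIdeographRanges) {
        if (c >= r.first && c <= r.last) return true;
    }
    return false;
}

// Splits text into tokens: every ideograph is a token by itself,
// other word characters are grouped into runs, and anything else
// (spaces, punctuation) acts as a separator.
std::vector<std::u32string> tokenize(
        const std::u32string& text,
        const std::function<bool(char32_t)>& is_word_char) {
    std::vector<std::u32string> tokens;
    std::u32string current;
    auto flush = [&] {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };
    for (char32_t c : text) {
        if (is_cjk_ideograph(c)) {
            flush();                                  // end any pending run
            tokens.push_back(std::u32string(1, c));   // single-character token
        } else if (is_word_char(c)) {
            current.push_back(c);
        } else {
            flush();
        }
    }
    flush();
    return tokens;
}
```

Note that Japanese kana (hiragana and katakana) are not ideographs, so with this sketch they would stay grouped in runs rather than being split into single characters.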

To then split those runs into single words, you might need a word list for comparison. Search for “Chinese word segmentation”. I would be surprised if at least part of the problem weren't already solved. But beware that it might lead you swiftly into natural language processing territory.
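As a rough illustration of what a word-list comparison could look like, the sketch below uses forward maximum matching: at each position it takes the longest dictionary entry that fits and falls back to a single character when nothing matches. This is only one simple segmentation heuristic, not necessarily what a production segmenter does. The function name `segment_max_match`, the `dictionary` set, and the `max_word_len` parameter are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy forward maximum matching over a run of characters.
// dictionary: known words (assumption: supplied by the caller).
// max_word_len: longest candidate length to try, in code points.
std::vector<std::u32string> segment_max_match(
        const std::u32string& run,
        const std::unordered_set<std::u32string>& dictionary,
        std::size_t max_word_len) {
    std::vector<std::u32string> words;
    const std::size_t limit = std::max<std::size_t>(max_word_len, 1);
    std::size_t pos = 0;
    while (pos < run.size()) {
        // Start with the longest candidate that still fits in the run.
        std::size_t len = std::min(limit, run.size() - pos);
        // Shrink until the candidate is a known word or a single character.
        while (len > 1 && dictionary.count(run.substr(pos, len)) == 0) {
            --len;
        }
        words.push_back(run.substr(pos, len));
        pos += len;
    }
    return words;
}
```

Characters that are not covered by the word list come out as single-character tokens, so the quality of the result depends entirely on the dictionary.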
