ICU4C does not tokenize Japanese correctly

Asked by Govind Salvi on 28 July 2025 at 05:23 (52 views)

I am using the ICU4C library to tokenize Japanese text into individual words, but the tokenization gives wrong results.

Example: アーティスティック is broken into 5 tokens: ア, ー, テ, ィ, スティック

However, アーティスティック is a single word and should not be split.

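#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// searchQuery is the UTF-8 input; words is a std::vector<std::string> that collects the tokens.
icu::UnicodeString s = icu::UnicodeString::fromUTF8(icu::StringPiece(searchQuery));

std::cout << "In listWordBoundaries" << std::endl;

UErrorCode status = U_ZERO_ERROR;
// Hold the iterator in a unique_ptr so it is released on every path.
std::unique_ptr<icu::BreakIterator> bi(icu::BreakIterator::createWordInstance("ja_JP", status));
if (U_FAILURE(status)) {
    // Report the failure instead of dereferencing a null iterator.
    std::cerr << "createWordInstance failed: " << u_errorName(status) << std::endl;
    return;  // assumes the enclosing function returns void
}
std::cout << "BreakIterator = " << bi.get() << std::endl;

bi->setText(s);
// Begin at the second boundary so the first token is not the empty range [0, 0).
int32_t prevBoundary = bi->first();
for (int32_t p = bi->next(); p != icu::BreakIterator::DONE; prevBoundary = p, p = bi->next())
{
    const icu::UnicodeString word = s.tempSubStringBetween(prevBoundary, p);
    std::string converted;
    word.toUTF8String(converted);

    words.emplace_back(converted);
}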
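
Separately from the wrong segmentation itself, a loop that starts at bi->first() with prevBoundary set to 0 visits the pair (0, 0) first, which yields an empty leading token. The boundary pairing can be checked without ICU using a small sketch like the one below. The helper name splitAtBoundaries is hypothetical (it is not an ICU function), and the offsets are assumed to be ascending UTF-16 code unit indices, which is how BreakIterator reports them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of ICU: cut `text` at the given boundary
// offsets (ascending UTF-16 code unit indices, assumed to lie within `text`).
std::vector<std::u16string> splitAtBoundaries(const std::u16string& text,
                                              const std::vector<int32_t>& boundaries) {
    std::vector<std::u16string> tokens;
    // Pair each boundary with the one before it. Starting at index 1 makes
    // the first pair (boundaries[0], boundaries[1]) rather than (0, 0).
    for (std::size_t i = 1; i < boundaries.size(); ++i) {
        const int32_t start = boundaries[i - 1];
        const int32_t end = boundaries[i];
        if (end > start) {  // skip empty ranges defensively
            tokens.push_back(text.substr(static_cast<std::size_t>(start),
                                         static_cast<std::size_t>(end - start)));
        }
    }
    return tokens;
}
```

For the 9 code units of アーティスティック, the boundaries {0, 1, 2, 3, 4, 9} reproduce the five tokens reported above, while {0, 9} would give the single expected word. This only verifies the slicing logic; it says nothing about why ICU chooses the boundaries it does.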
