ICU4C does not tokenize Japanese correctly


I am using the ICU4C library to tokenize Japanese text into individual words, but the tokenization gives wrong results.

Example: the word アーティスティック is broken into 5 pieces -> ア, ー, テ, ィ, スティック

However, アーティスティック is a single word and should not be split.

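// Needs <unicode/brkiter.h>, <unicode/unistr.h>, <iostream>, <memory>, <string>,
// and `using namespace icu;`. `searchQuery` (UTF-8 input) and `words`
// (a container of std::string) are declared elsewhere.
UnicodeString s = UnicodeString::fromUTF8(StringPiece(searchQuery));

std::cout << "In listWordBoundaries" << std::endl;

UErrorCode status = U_ZERO_ERROR;
// unique_ptr releases the iterator, which the caller owns.
std::unique_ptr<BreakIterator> bi(BreakIterator::createWordInstance(Locale("ja_JP"), status));
std::cout << "BreakIterator = " << bi.get() << std::endl;

if (U_FAILURE(status)) {
    // Without this check, a failed creation leaves bi null and setText() would crash.
    std::cerr << "createWordInstance failed: " << u_errorName(status) << std::endl;
} else {
    bi->setText(s);
    // Begin at the second boundary so the first token is not an empty string.
    int32_t prevBoundary = bi->first();
    for (int32_t p = bi->next(); p != BreakIterator::DONE; prevBoundary = p, p = bi->next())
    {
        const UnicodeString word = s.tempSubStringBetween(prevBoundary, p);
        std::string converted;
        word.toUTF8String(converted);

        words.emplace_back(converted);
    }
}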
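
Separately from the Japanese segmentation itself, the loop above pairs each boundary with the previous one to cut out a token. A minimal sketch of that pairing without ICU, assuming the boundaries are already available as a sorted list of offsets into a std::string (the helper name `splitAtBoundaries` is hypothetical, not part of ICU), may help to check the loop logic in isolation:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU API): returns the substrings that lie
// between consecutive boundary offsets. Offsets are assumed to be sorted
// and expressed in the same units as std::string indices.
std::vector<std::string> splitAtBoundaries(const std::string& text,
                                           const std::vector<std::size_t>& boundaries) {
    std::vector<std::string> tokens;
    // Start at index 1 so the first boundary is only ever used as a start,
    // which avoids emitting an empty leading token.
    for (std::size_t i = 1; i < boundaries.size(); ++i) {
        const std::size_t start = boundaries[i - 1];
        const std::size_t end = boundaries[i];
        // Skip empty or out-of-range spans instead of throwing.
        if (end > start && end <= text.size()) {
            tokens.push_back(text.substr(start, end - start));
        }
    }
    return tokens;
}
```

With boundaries {0, 2, 6} over "abcdef", this yields the two tokens "ab" and "cdef". It only illustrates how boundaries become tokens; it says nothing about where ICU places the boundaries for Japanese text.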