BreakIterator not working correctly with Chinese text


I used BreakIterator.getWordInstance to split Chinese text into words. Here is my example:

import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";

        //print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);

        printEachForward(boundary, stringToExamine);
    }

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}

My example text is taken from https://stackoverflow.com/a/42219474/954439

The output that I get is

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我喜欢吃苹果
28: 。

The expected output, however, is

0 I
1  
2 like
6  
7 to
9  
10 eat
13  
14 apples
20 .
21  
22 我
23 喜欢
25 吃
26 苹果
28 。

I also tried pure Chinese text, but it is split only at whitespace and punctuation characters, never between words.

I am programming for a server, so the jar file size is not a big concern. I am trying to count the words in a given content that differ from a sample content, using Longest Common Subsequence (on words rather than characters).
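To make the goal concrete, here is a minimal sketch of the word-level comparison, assuming the text has already been split into words correctly. The class name WordDiff, the method names, and the definition of "differing words" (content words not covered by the longest common subsequence) are illustrative assumptions, not a fixed requirement.

```java
import java.util.List;

public class WordDiff {
    // Length of the longest common subsequence of two word lists,
    // computed with the classic O(n * m) dynamic-programming table.
    public static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed definition: words of the content that are not part of the
    // longest common subsequence shared with the sample.
    public static int differingWords(List<String> sample, List<String> content) {
        return content.size() - lcsLength(sample, content);
    }

    public static void main(String[] args) {
        List<String> sample = List.of("我", "喜欢", "吃", "苹果");
        List<String> content = List.of("我", "喜欢", "吃", "香蕉");
        System.out.println(differingWords(sample, content)); // prints 1
    }
}
```

This only gives meaningful counts if the word lists come from a segmenter that actually separates Chinese words, which is the part that is failing.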

What am I doing wrong?

Best answer

The standard java.text.BreakIterator does not detect "word" boundaries within unbroken strings of CJK ideographs, which is why the whole Chinese sentence comes back as a single token. There is a bug report on this subject, but it was closed in 2006 as "Won't Fix".

Instead, you'll need to use the ICU implementation. If you're developing on Android, you already have this as android.icu.text.BreakIterator. Otherwise, you'll need to download the ICU4J library from http://site.icu-project.org/download, which has it as com.ibm.icu.text.BreakIterator.
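As a rough sketch of the switch, assuming the ICU4J jar is on the classpath, the loop from the question can stay the same and only the BreakIterator import changes. The class name IcuWordSplit and the split helper are made up for illustration, and the exact segmentation depends on the dictionary shipped with the ICU version in use.

```java
import com.ibm.icu.text.BreakIterator; // ICU4J, not java.text

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IcuWordSplit {
    // Collects every segment between consecutive word boundaries,
    // including whitespace and punctuation segments.
    public static List<String> split(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        boundary.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : split("I like to eat apples. 我喜欢吃苹果。")) {
            System.out.println(token);
        }
    }
}
```

On Android the same code should work with android.icu.text.BreakIterator in place of the com.ibm.icu import.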