How to compare Chinese characters in Java using 'equals()'


I want to compare a portion of a string (i.e. a single character) against a Chinese character. I assumed that, due to the Unicode encoding, it counts as two characters, so I'm looping through the string in increments of two. Now I've hit a roadblock: I'm trying to detect the '兒' character, but equals() doesn't match it. What am I missing? This is the code snippet:

for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {

   // Account for 'r' like in dianr/huir
   if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

Also, feel free to suggest a more elegant way to parse this ...

[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy and paste issue (unless the unicode gets lost along the way)


Oh, dang, apparently simply copying and pasting the character does not work.


There are 3 best solutions below

1
On

Use CharSequence.codePoints(), which returns an IntStream of the code points, so you don't have to deal with individual chars:

tmpChar.codePoints().forEach(c -> {
  if (c == '兒') {
    // ...
  }
});

(Of course, you could also use tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ }) instead.)
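For anyone who wants something to paste and run, here is a minimal sketch of the same idea. The class name CodePointScan, the method countEr and the sample input are made up for illustration. The target character 兒 is written as the code point 0x5152 so the example does not depend on the encoding of the source file.

```java
public class CodePointScan {
    // Code point U+5152, the character from the question. Kept as a number
    // so the source file stays pure ASCII.
    static final int ER = 0x5152;

    // Counts how often the target code point occurs in the text.
    static long countEr(String text) {
        return text.codePoints().filter(c -> c == ER).count();
    }

    public static void main(String[] args) {
        // Hypothetical input: U+5152 twice, each preceded by another character.
        String sample = "\u4E00\u5152\u4E8C\u5152";
        System.out.println(countEr(sample));
    }
}
```

Because the stream yields whole code points, the same loop also behaves correctly for characters that need two chars.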

0
On

Either work with chars and treat the character you are looking for as a substring:

String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
    int position2 = position + "兒".length();
    s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
    // At position i there is a 兒.
}

Or work with code points, where 兒 would be a single code point. As that is not really easier, variable-length substrings seem fine.
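A self-contained version of the substring approach might look like the sketch below. The class SubstringMatch and its methods maskFirst and hasErAt are hypothetical names. The search string is written as the escape \u5152 (兒) to sidestep any source-encoding issues.

```java
public class SubstringMatch {
    // The character from the question, U+5152, as a one-char string.
    static final String ER = "\u5152";

    // Replaces the first occurrence of the target with "*", if there is one.
    static String maskFirst(String s) {
        int position = s.indexOf(ER);
        if (position == -1) {
            return s; // nothing to replace
        }
        // Using ER.length() keeps this correct even for a two-char target.
        int end = position + ER.length();
        return s.substring(0, position) + "*" + s.substring(end);
    }

    // True if the target starts at char index i.
    static boolean hasErAt(String s, int i) {
        return s.startsWith(ER, i);
    }

    public static void main(String[] args) {
        System.out.println(maskFirst("dian\u5152"));
        System.out.println(hasErAt("dian\u5152", 4));
    }
}
```

Since the length of the match comes from the search string itself, there is no need to guess whether a character occupies one char or two.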

0
On
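This line is your problem:

if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

兒 (U+5152) is only one UTF-16 code unit, i.e. one Java char, so a two-char substring can never equal the one-char string "兒". Java strings are UTF-16, and many Chinese characters can be represented in one code unit. However, other characters (those outside the Basic Multilingual Plane) take two code units, a so-called surrogate pair.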
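To see the difference in practice, here is a small sketch. The class CodeUnitDemo and its helper methods are made up for illustration. It uses U+5152 (兒) as the one-char case and U+20000, a CJK Extension B character, as an example that needs two chars.

```java
public class CodeUnitDemo {
    // U+5152: fits in a single UTF-16 code unit.
    static final String ER = "\u5152";
    // U+20000: outside the BMP, stored as a surrogate pair (two chars).
    static final String SUPPLEMENTARY = "\uD840\uDC00";

    // The comparison from the question: a two-char window against ER.
    static boolean twoCharWindowMatches(String text, int index) {
        return text.substring(index, index + 2).equals(ER);
    }

    // The same comparison with a one-char window.
    static boolean oneCharWindowMatches(String text, int index) {
        return text.substring(index, index + 1).equals(ER);
    }

    public static void main(String[] args) {
        System.out.println(ER.length());            // chars used by U+5152
        System.out.println(SUPPLEMENTARY.length()); // chars used by U+20000
        System.out.println(SUPPLEMENTARY.codePointCount(0, SUPPLEMENTARY.length()));
        System.out.println(twoCharWindowMatches(ER + ER, 0));
        System.out.println(oneCharWindowMatches(ER + ER, 0));
    }
}
```

The two-char window always holds a string of length 2, so it cannot be equal to a string of length 1, whatever the characters are.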

The String class offers a variety of APIs for coping with this, such as codePointAt(), codePointCount() and offsetByCodePoints().

As offered in another answer, obtaining the IntStream from codePoints() gives you an int code point for each character. You can compare that to the code point value of the character you are looking for.
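If you also need to know where each match sits in the string, an index-based loop is one option. The sketch below is only an illustration: CodePointLoop and indicesOf are hypothetical names. It steps through the string with String.codePointAt() and Character.charCount(), so the loop advances by one or two chars as needed.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointLoop {
    // Returns the char indices at which the target code point starts.
    static List<Integer> indicesOf(String text, int target) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == target) {
                hits.add(i);
            }
            // Advance by 1 for BMP characters, by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical input: U+20000 (two chars), U+5152, 'a', U+5152.
        String sample = "\uD840\uDC00\u5152a\u5152";
        System.out.println(indicesOf(sample, 0x5152));
    }
}
```

This replaces the fixed step of two from the question with a step that depends on the character actually found at each position.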

Or you can use the ICU4J library, which has a richer set of facilities for all of this.