Different code points for the same character on macOS and Windows

I have a small piece of code in which I check the code point of the character Ü.

Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));

I get different code point values when I run this code on macOS and on Windows 10; see the output below.

Output on macOS

en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220

Output on Windows

en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195

I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set, where the code point for Ü is 220. Why do I get code point 195 on Windows for String glyph = "Ü";? As I understand it, glyph should have been rendered properly and its code point should have been 220, since Ü is defined in Windows-1252.

If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and its code point is 220. Is this a correct and efficient way to make String behave the same on any OS, irrespective of locale and charset?

Best answer

195 is 0xC3 in hex.

In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
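To check those two bytes yourself, a minimal sketch along these lines should do. The class and method names (Utf8BytesDemo, utf8Hex) are made up for illustration, and the character is written as the escape \u00dc so the result does not depend on how the source file itself is encoded:

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesDemo {
    // Formats the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape keeps this literal independent of the source encoding.
        System.out.println(utf8Hex("\u00dc")); // prints 0xC3 0x9C
    }
}
```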

System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but your Java file is clearly encoded in UTF-8. The fact that println() outputs glyph ?? (note the 2 ? characters, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.

glyph should have a single char whose value is 0x00DC, not 2 chars whose values are 0x00C3 0x0153. codePointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as characters 0x00C3 0x0153 (Windows-1252 maps byte 0x9C to œ, U+0153) instead of as character 0x00DC.
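The mis-decoding can be reproduced without involving the compiler at all. This is only a sketch: MisdecodeDemo and decode are illustrative names, and it assumes your JVM ships the windows-1252 charset (typical JDK builds do, but only a handful of charsets such as UTF-8 are guaranteed):

```java
import java.nio.charset.Charset;

final class MisdecodeDemo {
    // The two bytes a UTF-8 encoded source file contains for U+00DC.
    static final byte[] UTF8_BYTES = {(byte) 0xC3, (byte) 0x9C};

    // Decodes the same bytes with the named charset.
    static String decode(String charsetName) {
        return new String(UTF8_BYTES, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String right = decode("UTF-8");
        String wrong = decode("windows-1252");
        // prints: 1 char(s), first code point 220
        System.out.println(right.length() + " char(s), first code point " + right.codePointAt(0));
        // prints: 2 char(s), first code point 195
        System.out.println(wrong.length() + " char(s), first code point " + wrong.codePointAt(0));
    }
}
```

The second line matches what you see on Windows: two chars, the first of which is 195.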

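You need to specify the actual file encoding when running Java, e.g.:

java -Dfile.encoding=UTF-8 ...

If you compile the source yourself with javac, the corresponding compiler option is -encoding UTF-8.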
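As for the new String("Ü".getBytes(), Charset.forName("UTF-8")) workaround from the question, I would not rely on it: it only undoes the damage when the literal was misread in the first place. The sketch below passes the default charset in explicitly so both cases can be shown on one machine. RoundTripDemo and roundTrip are illustrative names, and it again assumes the windows-1252 charset is available:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // The workaround from the question, with the default charset made explicit.
    static String roundTrip(String literal, Charset defaultCharset) {
        return new String(literal.getBytes(defaultCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // What the literal holds when the UTF-8 source is misread as windows-1252.
        String misread = "\u00c3\u0153";
        // What the literal holds when the source is read with the right encoding.
        String correct = "\u00dc";

        System.out.println(roundTrip(misread, cp1252).codePointAt(0)); // 220, repaired
        System.out.println(roundTrip(correct, cp1252).codePointAt(0)); // 65533 (U+FFFD), broken
        System.out.println(roundTrip(correct, StandardCharsets.UTF_8).codePointAt(0)); // 220, no-op
    }
}
```

So once the source is read with the right encoding, the same workaround would corrupt the string on a machine whose default charset is windows-1252. Writing the literal as "\u00dc", as your inUnicode variable already does, sidesteps the source encoding entirely.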