I'm finally having to deal with supplementary Unicode characters (the ones that need surrogate pairs) in Java. I'm running into a problem trying to print them in a uxterm launched from Cygwin.
Here's a sample program. It is meant to print the mathematical italic small i (code point 119894, U+1D456):
public class PrintChar {
    public static void main(String[] args) throws Exception {
        int codepoint = 119894; // Mathematical italic small i
        String s = new StringBuilder().appendCodePoint(codepoint).toString();
        System.out.println("Codepoint " + codepoint + "=" + s);
    }
}
When I run this, I get the output:
$ java -Dfile.encoding=UTF-8 -cp bin PrintChar
Codepoint 119894=?
But when I pipe the output through cat, I get the expected result:
$ java -Dfile.encoding=UTF-8 -cp bin PrintChar | cat
Codepoint 119894=𝑖
Can someone explain why? It also happens with a regular cygwin terminal. It does not happen in a terminal run through VMware's vSphere. On that terminal I don't need the pipe to see the italic i.
When working with encodings, you should always remember that there are two operations: encoding (characters to bytes) and decoding (bytes back to characters).
Here is my guess at what happened. The ? you see on your terminal is probably a substitute ("joker") character, emitted when the character you provided cannot be handled in the encoding in use. Note that in UTF-8 this code point (119894, U+1D456) is encoded as the four bytes F0 9D 91 96, not as the decimal digits 11, 98, 94. When you pipe through cat, the bytes that reach the terminal are evidently different. cat itself copies bytes unchanged, so the difference most likely lies in which encoding (UTF-8 versus something like CP1252 or ISO 8859-1) is applied when the output goes to a pipe rather than directly to the terminal.
You can put the result of cat in a file, open that file in a hex editor, and check the bytes. If they are not F0 9D 91 96, the values you find should help you work out which encoding was used.
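To know what to look for in the hex dump, it can help to print the expected bytes from Java itself. The following is only a minimal sketch: the class name CodePointBytes and its helper methods are made up for illustration and are not part of the original program.

```java
import java.nio.charset.StandardCharsets;

public class CodePointBytes {
    // Hex dump of the UTF-8 encoding of a single code point.
    public static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Hex dump of the UTF-16 code units (the surrogate pair, for
    // code points above U+FFFF) that Java stores internally.
    public static String utf16Hex(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%04X", (int) c));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int codepoint = 119894; // Mathematical italic small i
        System.out.println("UTF-8 bytes:  " + utf8Hex(codepoint));  // F0 9D 91 96
        System.out.println("UTF-16 units: " + utf16Hex(codepoint)); // D835 DC56
    }
}
```

If the file written through cat contains F0 9D 91 96 at the position of the character, the output was valid UTF-8. A single 3F byte there would mean a ? was already substituted before the bytes left the JVM.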
To solve your problem (although strictly speaking there isn't one, since the character is printed correctly in UTF-8), I think you just have to change Cygwin's default charset.
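If changing the terminal's charset is not convenient, something you could try from the Java side is to stop relying on the default encoding of System.out and wrap standard output in a PrintStream with an explicit charset. This is a sketch under that assumption, not a confirmed fix for the Cygwin case. The class name PrintCharUtf8 and the helper utf8Stream are hypothetical.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintCharUtf8 {
    // Wraps any OutputStream in an auto-flushing PrintStream that
    // always encodes characters as UTF-8, whatever the JVM default is.
    public static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int codepoint = 119894; // Mathematical italic small i
        String s = new StringBuilder().appendCodePoint(codepoint).toString();

        // Write to the process's standard output through the UTF-8 stream.
        PrintStream out = utf8Stream(new FileOutputStream(FileDescriptor.out));
        out.println("Codepoint " + codepoint + "=" + s);
    }
}
```

The terminal still has to decode UTF-8 and have a font with the glyph, so this only removes the Java side as a source of the ?.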
EDIT: Imagine you have to write a text editor. First you need a graphical representation of each character (a font). Then you need to decode a binary file. You can either ask the user which charset to decode it with, or try to work it out yourself (really hard). When you decode the file, if you encounter a byte sequence you do not understand, you can either crash the program or print a joker character (? for example).
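The "joker character" behaviour can be seen in Java's own charset handling, in both directions. The sketch below is only an illustration, and the class name JokerChar and its methods are invented for the example. Decoding malformed UTF-8 with new String(bytes, charset) substitutes U+FFFD, and encoding a character that the target charset cannot represent with getBytes substitutes a literal ?.

```java
import java.nio.charset.StandardCharsets;

public class JokerChar {
    // Decoding: malformed input is not an error here; each bad
    // sequence is replaced by the replacement character U+FFFD.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encoding: characters that ISO 8859-1 cannot represent are
    // replaced by the charset's replacement byte, which is '?'.
    public static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in valid UTF-8.
        byte[] broken = { 'a', (byte) 0xFF, 'b' };
        String decoded = decodeUtf8(broken);
        System.out.printf("decoded middle char: U+%04X%n", (int) decoded.charAt(1)); // U+FFFD

        // Mathematical italic small i has no ISO 8859-1 representation.
        byte[] encoded = encodeLatin1(new String(Character.toChars(119894)));
        System.out.println("encoded first byte: " + (char) encoded[0]); // ?
    }
}
```

The second case is worth noting: a ? can be produced by the encoder on the sending side, before the terminal ever sees the bytes.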