The G clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also carries only 16 bits of data. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?
Update
I have asked how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, nor do I want to read a line.
It is possible to build a code point with Character.toCodePoint(), but this function requires a char. On the other hand, reading a char is not possible because read() returns an int. My best workaround so far is this, but it still contains unsafe casts:
public int read_code_point (Reader input) throws java.io.IOException
{
    int ch16 = input.read();
    if (Character.isHighSurrogate((char)ch16))
        return Character.toCodePoint((char)ch16, (char)input.read());
    else
        return (int)ch16;
}
How to do it better?
Update 2
Another version returning a String but still using casts:
public String readchar (Reader input) throws java.io.IOException
{
    int i16 = input.read(); // UTF-16 as int
    if (i16 == -1) return null;
    char c16 = (char)i16; // UTF-16
    if (Character.isHighSurrogate(c16)) {
        int low_i16 = input.read(); // low surrogate UTF-16 as int
        if (low_i16 == -1)
            throw new java.io.IOException ("Can not read low surrogate");
        char low_c16 = (char)low_i16;
        int codepoint = Character.toCodePoint(c16, low_c16);
        return new String (Character.toChars(codepoint));
    }
    else
        return Character.toString(c16);
}
The remaining question: are the casts safe, or how can I avoid them?
The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, then you can guarantee that the other (char) casts are safe, as Reader.read() is specified to return either -1 or a value that is within the range of char (0 - 0xFFFF).
This still isn't ideal. Really, you need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate, in which case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then you can call mark() before the second read and reset() when the second char turns out not to be a low surrogate.
Or you could wrap the original reader in a PushbackReader and use unread(secondChar).
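Here is a minimal sketch of both approaches, the mark()/reset() one and the PushbackReader one. The class name CodePointReader and the method names readCodePointMark and readCodePointPushback are made up for illustration. Both return -1 at EOF and return an unpaired surrogate unchanged, which is one possible policy, not the only one.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class CodePointReader {

    /** Variant 1 (hypothetical name): needs a reader that supports mark()/reset(). */
    public static int readCodePointMark(Reader input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("reader must support mark()");
        }
        int first = input.read();
        if (first == -1) {
            return -1;                        // EOF, checked before any cast
        }
        char high = (char) first;             // safe: read() returned 0..0xFFFF
        if (!Character.isHighSurrogate(high)) {
            return first;                     // BMP char (or a lone low surrogate)
        }
        input.mark(1);                        // remember the position after the high surrogate
        int second = input.read();
        if (second != -1 && Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint(high, (char) second);
        }
        input.reset();                        // not a pair: the next read sees this char again
        return first;                         // unpaired high surrogate, returned as-is
    }

    /** Variant 2 (hypothetical name): works with any reader wrapped in a PushbackReader. */
    public static int readCodePointPushback(PushbackReader input) throws IOException {
        int first = input.read();
        if (first == -1) {
            return -1;
        }
        char high = (char) first;
        if (!Character.isHighSurrogate(high)) {
            return first;
        }
        int second = input.read();
        if (second == -1) {
            return first;                     // stream ended right after a high surrogate
        }
        if (Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint(high, (char) second);
        }
        input.unread(second);                 // push the char back for the next call
        return first;
    }

    public static void main(String[] args) throws IOException {
        // G clef as a surrogate pair, 'A', a lone high surrogate, then 'B'
        Reader r = new StringReader("\uD834\uDD1EA\uD834B");
        for (int cp; (cp = readCodePointMark(r)) != -1; ) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

With this policy no (char) cast is ever applied to -1, and a malformed sequence never swallows the character that follows it. If you would rather reject unpaired surrogates, throw an IOException where the sketch returns first.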