How to read non-BMP (astral) Unicode supplementary characters (code points)


The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means it requires more than 16 bits. Almost all of Java's read functions return only a char or an int that likewise carries only 16 bits of character data. Which function reads complete Unicode symbols, including those from the SMP, SIP, TIP, SSP and the supplementary PUA planes?
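To illustrate the problem: in Java's UTF-16 representation, U+1D11E is encoded as a surrogate pair, so it occupies two chars but counts as a single code point (these are standard java.lang.Character facts):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Build a String from the single code point U+1D11E (G-Clef)
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 (two UTF-16 code units)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 (one code point)
        System.out.println(Character.charCount(0x1D11E));           // 2
    }
}
```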

Update

I asked how to read a single Unicode symbol (or code point) from an input stream. I have no integer array, nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint(), but this function requires char arguments. On the other hand, reading a char directly is not possible because read() returns an int. My best workaround so far is this, but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How to do it better?

Update 2

Another version, returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 code unit as int
  if (i16 == -1) return null;
  char c16 = (char)i16;
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate as int
    if (low_i16 == -1)
      throw new java.io.IOException("Cannot read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String(Character.toChars(codepoint));
  }
  else
    return Character.toString(c16);
}

The remaining question: are the casts safe, or how can they be avoided?

BEST ANSWER

My best workaround so far is this, but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, you can guarantee that the other (char) casts are safe, since Reader.read() is specified to return either -1 or a value within the range of char (0 to 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
      return ch16; // or possibly throw an exception
    else 
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}
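A quick sanity check of this approach (the method body copied from above into a wrapper class; \uD834\uDD1E is the UTF-16 surrogate pair for the G-Clef):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadDemo {
    // Same logic as read_code_point above
    static int readCodePoint(Reader input) throws IOException {
        int ch16 = input.read();
        if (ch16 < 0 || !Character.isHighSurrogate((char) ch16))
            return ch16;
        int loSurr = input.read();
        if (loSurr < 0 || !Character.isLowSurrogate((char) loSurr))
            return ch16; // unpaired high surrogate, or EOF after it
        return Character.toCodePoint((char) ch16, (char) loSurr);
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("a\uD834\uDD1Eb"); // 'a', G-Clef, 'b'
        System.out.printf("U+%04X%n", readCodePoint(r)); // U+0061
        System.out.printf("U+%04X%n", readCodePoint(r)); // U+1D11E
        System.out.printf("U+%04X%n", readCodePoint(r)); // U+0062
        System.out.println(readCodePoint(r));            // -1 at EOF
    }
}
```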

This still isn't ideal. Really you need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate; in that case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use unread(secondChar).
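That alternative can be sketched like this (same logic as the mark()/reset() version, but PushbackReader.unread() pushes the stray char back, so it works for any underlying Reader):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class PushbackDemo {
    static int readCodePoint(PushbackReader input) throws IOException {
        int firstChar = input.read();
        if (firstChar < 0 || !Character.isHighSurrogate((char) firstChar))
            return firstChar;
        int secondChar = input.read();
        if (secondChar < 0)
            return firstChar; // EOF right after a high surrogate
        if (!Character.isLowSurrogate((char) secondChar)) {
            input.unread(secondChar); // un-read it for the next call
            return firstChar;
        }
        return Character.toCodePoint((char) firstChar, (char) secondChar);
    }

    public static void main(String[] args) throws IOException {
        // Default PushbackReader buffer size of 1 is enough here
        PushbackReader r = new PushbackReader(new StringReader("\uD834\uDD1EA"));
        System.out.printf("U+%04X%n", readCodePoint(r)); // U+1D11E
        System.out.printf("U+%04X%n", readCodePoint(r)); // U+0041
    }
}
```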

SECOND ANSWER

Full Unicode can be represented in both UTF-8 and UTF-16, as sequences of bytes or of byte pairs ("Java chars") respectively. A full Unicode code point can be extracted from a String with:

int[] codePoints = { 0x1d11e };
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp);
}
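Since Java 8, the same iteration is also available as a stream via CharSequence.codePoints(), which combines surrogate pairs automatically:

```java
public class CodePointsStream {
    public static void main(String[] args) {
        String s = "hi " + new String(Character.toChars(0x1D11E));
        // codePoints() yields an IntStream of full code points,
        // with surrogate pairs already combined into one value
        s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        // prints: U+0068 U+0069 U+0020 U+1D11E
    }
}
```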

For a file containing mostly Latin characters, UTF-8 is a good choice.

The following reads a full standard Unicode file (in UTF-8):

try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        // ... do something with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    // ...
}

A function that delivers a Java String from one or more Unicode code points:

String s = unicodeToString(0x1d11e);
String s = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

public static String unicodeToString(int... codePoints) {
    return new String(codePoints, 0, codePoints.length);
}
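A self-contained check of that helper (note that the parameter name and the name used in the body must match; codePoints throughout here):

```java
public class UnicodeToStringDemo {
    public static String unicodeToString(int... codePoints) {
        // String has a constructor taking an int[] of code points
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String hello = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);
        System.out.println(hello);          // "hello" followed by the G-Clef
        System.out.println(hello.length()); // 7: five chars plus a surrogate pair
        System.out.println(hello.codePointCount(0, hello.length())); // 6
    }
}
```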