How to read non-BMP (astral) Unicode supplementary characters (code points)


The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that still carries only a 16-bit value. Which function reads complete Unicode symbols, including SMP, SIP, TIP, SSP and PUA?

Update

I asked how to read a single Unicode symbol (or code point) from an input stream. I neither have an integer array nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint(), but this function requires two chars. On the other hand, reading a char directly is not possible, because read() returns an int. My best workaround so far is the following, but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How to do it better?

Update 2

Another version returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Cannot read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

The remaining question: are the casts safe, or how can they be avoided?

2 answers

Ian Roberts (accepted answer)

My best workaround so far is the following, but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, then you can guarantee that the other (char) casts are safe, as Reader.read() is specified to return either -1 or a value within the range of char (0 to 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16; // EOF (-1) or a plain BMP char, returned as-is
  else {
    int loSurr = input.read();
    if (loSurr < 0 || !Character.isLowSurrogate((char)loSurr))
      return ch16; // unpaired high surrogate; or possibly throw an exception
    else
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}

This still isn't ideal. Really you need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate. In that case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use unread(secondChar), as in the sketch below.
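For completeness, here is a minimal sketch of that PushbackReader variant. It assumes the caller wraps the original reader once, e.g. new java.io.PushbackReader(reader); the default one-character pushback buffer is enough here:

public int read_code_point (PushbackReader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar))
    return firstChar; // EOF (-1) or an ordinary BMP char
  int secondChar = input.read();
  if (secondChar < 0)
    return firstChar; // EOF after an unpaired high surrogate
  if (!Character.isLowSurrogate((char)secondChar)) {
    input.unread(secondChar); // un-read it; the next call will see it again
    return firstChar;         // return the unpaired surrogate as-is
  }
  return Character.toCodePoint((char)firstChar, (char)secondChar);
}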

Joop Eggen

Full Unicode can be represented in both UTF-8 and UTF-16, as sequences of bytes and of byte pairs (Java chars), respectively. From a String, a full Unicode code point can be extracted like this:

int[] codePoints = { 0x1d11e }; // the G-Clef
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);    // full code point, combining a surrogate pair if needed
    i += Character.charCount(cp); // advance by 1 or 2 chars
}
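Since Java 8, the same loop can also be written with the codePoints() stream view of the String; a short sketch (the printf output is only for illustration):

// Equivalent iteration as an IntStream of code points (Java 8+):
s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp)); // prints U+1D11E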

For a file of mostly Latin characters, UTF-8 is a fine choice.

The following reads a complete standard Unicode file (in UTF-8):

try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        ... do something with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    ...
}
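On Java 7+ the same reader can be obtained more concisely through java.nio.file.Files, with a Charset constant instead of the charset name string; a sketch assuming path is a java.nio.file.Path to the input file:

try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line;
    while ((line = in.readLine()) != null) {
        // ... do something with a Unicode line ...
    }
}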

A function that delivers a Java String from one (or more) Unicode code points:

String s = unicodeToString(0x1d11e);
String s = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

public static String unicodeToString(int... codePoints) {
    return new String(codePoints, 0, codePoints.length);
}