Convert UCS-4 to UCS-2


The Unicode value of the UCS-4 character '🤣' is 0001f923; it is automatically changed to the corresponding value \uD83E\uDD23 when copied into Java code in IntelliJ IDEA.

Java only supports UCS-2, so a transformation from UCS-4 to UCS-2 takes place.

I want to understand the logic of this transformation, but I couldn't find any material about it.


There are 2 best solutions below


U+010000 to U+10FFFF

  • 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
  • The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
  • The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.

Now with the input code point U+1F923:

  • 0x1F923 - 0x10000 = 0xF923
  • 0xF923 = 1111100100100011 = 00001111100100100011 (padded to 20 bits) = [0000111110][0100100011] = [0x3E][0x123]
  • 0xD800 + 0x3E = 0xD83E
  • 0xDC00 + 0x123 = 0xDD23
  • The result: \uD83E\uDD23
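
Running the same steps backwards recovers the code point from a surrogate pair. The following is only a minimal sketch of that inverse; the class and method names (`SurrogateDecode`, `decode`) are made up for illustration and are not part of any library:

```java
final class SurrogateDecode {
    // Hypothetical helper: inverse of the steps above. Strip the surrogate
    // base values, rejoin the two 10-bit halves, then add back 0x10000.
    static int decode(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        int highTenBits = high - 0xD800;
        int lowTenBits = low - 0xDC00;
        return ((highTenBits << 10) | lowTenBits) + 0x10000;
    }

    public static void main(String[] args) {
        // Prints 1f923
        System.out.println(String.format("%x", decode(0xD83E, 0xDD23)));
    }
}
```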


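// The class name is arbitrary; it is only needed to make the snippet compile.
public class SurrogatePair {
    public static void main(String[] args) {
        int input = 0x1f923;
        // Subtract 0x10000 to get the 20-bit value U'.
        int x = input - 0x10000;

        // Split U' into its high and low ten bits.
        int highTenBits = x >> 10;
        int lowTenBits = x & ((1 << 10) - 1);

        // Add the surrogate base values.
        int high = highTenBits + 0xd800;
        int low = lowTenBits + 0xdc00;

        // Prints [d83e][dd23]
        System.out.println(String.format("[%x][%x]", high, low));
    }
}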
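
The manual computation can be cross-checked against helpers that the JDK already provides on `java.lang.Character`. A small sketch, assuming Java 7 or later (for `highSurrogate` and `lowSurrogate`); the class name `SurrogateBuiltins` and the method `describe` are invented here:

```java
final class SurrogateBuiltins {
    // Hypothetical helper: formats the UTF-16 code units of a supplementary
    // code point in the same style as the manual version.
    static String describe(int codePoint) {
        // Character.toChars returns the UTF-16 code units for a code point.
        char[] units = Character.toChars(codePoint);
        return String.format("[%x][%x]", (int) units[0], (int) units[1]);
    }

    public static void main(String[] args) {
        int codePoint = 0x1f923;
        System.out.println(describe(codePoint)); // prints [d83e][dd23]

        // The two halves are also available individually, and
        // Character.toCodePoint combines them again.
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        System.out.println(Character.toCodePoint(high, low) == codePoint); // prints true
    }
}
```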

Although a String holds Unicode text as a sequence of char values, where each char is a two-byte UTF-16 code unit, Java also supports UCS-4.

UCS-4: UTF-32, "code points":

Unicode code points (UCS-4) are represented in Java as int.

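// Build a String from an array of code points...
int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
// ...and convert the String back to an array of code points.
ucs4 = s.codePoints().toArray();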
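
To make the difference between char values and code points visible, the sketch below compares the two views of the same one-character string. The class name `CodePointCounts` and the helper `fromCodePoint` are made up for this example:

```java
final class CodePointCounts {
    // Hypothetical helper: builds a String holding a single code point.
    static String fromCodePoint(int codePoint) {
        return new String(new int[] {codePoint}, 0, 1);
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1f923);

        // length() counts UTF-16 code units, not code points.
        System.out.println(s.length());                            // prints 2
        System.out.println(s.codePointCount(0, s.length()));       // prints 1

        // codePointAt reads the whole pair; charAt reads one code unit.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints 1f923
        System.out.println(Integer.toHexString(s.charAt(0)));      // prints d83e
    }
}
```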

UTF-16 and UTF-8 are encodings (transformations) of code points that may need longer sequences of 2-byte or 1-byte values, respectively. The encodings are chosen so that the values inside such a sequence differ from any value that stands for a character on its own. That means that part of a sequence will not erroneously match "/" or any other search string. This is realized by high bits starting with 1... followed by bits of the code point in big-endian order (most significant first).
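
As an illustration, the sketch below prints the encoded bytes of U+1F923 next to those of "/". The class name `Utf8Bytes` and its helpers are invented for this example; only `String.getBytes` and `StandardCharsets` come from the JDK:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    // Hypothetical helper: renders bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: UTF-8 bytes of a single code point, as hex.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1f923));
        System.out.println(utf8Hex(0x1f923));                           // prints f0 9f a4 a3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // prints d8 3e dd 23
        System.out.println(utf8Hex('/'));                               // prints 2f
    }
}
```

In the UTF-8 output every byte has its top bit set, so none of them can equal an ASCII byte such as 0x2f. For UTF-16 the same guarantee holds per 16-bit unit (0xD83E, 0xDD23) rather than per byte.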

Rather than searching for UCS-4 and UCS-2, search for UTF-16; that will yield information on the algorithms used.