Why does UTF-16 only support 2^20 code points?


Well, I'm starting to study Unicode now, and I have several doubts. At the moment I'm learning what a plane is: I saw that a plane is a set of 2^16 code points, and that UTF-16 supports 17 planes, numbered from 0 to 16. My question is the following: if UTF-16 can use up to 32 bits, why in practice does it only encode up to 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? So the final question is: where does this 2^20 come from, and not 2^32? Thanks :)


3 Answers

BEST ANSWER

Have a look at how surrogate pairs encode a character U >= 0x10000:

U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

(source)

As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.

(Technically, you additionally have the code points below 0x10000 that a single 16-bit unit encodes directly, minus the 2^11 values reserved for the high and low surrogates. That puts the total slightly above 2^20 code points encodable by UTF-16 (2^16 - 2^11 + 2^20 = 1,112,064, still well below 2^21), which is why the highest code point supported by UTF-16 is U+10FFFF rather than 2^20 = 0x100000.)
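If it helps, here is a minimal Python sketch of that arithmetic (the function names are mine, chosen just for illustration):

def to_surrogate_pair(cp):
    # Encode a code point >= 0x10000 as a UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    u = cp - 0x10000            # 20-bit value yyyyyyyyyyxxxxxxxxxx
    w1 = 0xD800 + (u >> 10)     # high surrogate: top 10 bits
    w2 = 0xDC00 + (u & 0x3FF)   # low surrogate: bottom 10 bits
    return w1, w2

def from_surrogate_pair(w1, w2):
    # Decode a surrogate pair back to its code point.
    return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00)

print([hex(w) for w in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
print(hex(from_surrogate_pair(0xD83D, 0xDE00)))      # 0x1f600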

ANSWER 2

The original form of Unicode only supported 64k code points (16 bits). The intention was to support all commonly used, modern characters, and 64k really is enough for that (yes, even including Chinese). As the introduction notes (emphasis mine):

Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange.

But Unicode grew to encompass almost all human writing, including historic and lesser-used writing systems, and 64k characters was too small to handle that. (Unicode 14 has ~145k characters.) As the Unicode 2.0 introduction says (again, emphasis mine):

The Unicode Standard, Version 2.0 contains 38,885 characters from the world's scripts. These characters are more than sufficient not only for modern communication, but also for the classical forms of many languages.

In Unicode 1.x, the typical encoding was UCS-2, which is just a simple 16-bit number defining the code-point. When they decided that they were going to need more (during the Unicode 1.1 timeframe), there were only ~34k code points assigned.

Originally the thought was to create a 32-bit encoding (UCS-4) that could encode 2^31 values with one bit left over, but this would have doubled the size of the encoding, wasting a lot of space, and wouldn't have been backward compatible with UCS-2.

So they decided for Unicode 2.0 to invent a system backward-compatible with all defined UCS-2 code points, but that allowed them to scale larger. That's why they invented the surrogate pair system (which LMD's answer explains well). This created the UTF-16 encoding which completely replaces UCS-2.

The full thinking on how much space was needed for various areas is explained in the Unicode 2.0 Introduction:

There are over 18,000 unassigned code positions that are available for future allocation. This number far exceeds anticipated character coding requirements for modern and most archaic characters.

One million additional characters are accessible through the surrogate extension mechanism.... This number far exceeds anticipated encoding requirements for all world characters and symbols.

The goal was to keep "common" characters in the Basic Multilingual Plane (BMP), and to place lesser-used characters into the surrogate extension area.

The surrogate system "wastes" a lot of code points that could be used for real characters. You could imagine replacing it with a more naïve system with a single "the next character is in the surrogate space" code point. But that would create ambiguity between byte sequences. You couldn't just search for 0x0041 to find the letter A. You'd have to scan backwards to make sure it wasn't a surrogate character, making certain kinds of problems much harder.
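To make that contrast concrete, here is a small Python sketch (my own illustration, not from the standard) of the property the real surrogate design gives you: every 16-bit unit identifies its role by its value alone, so a unit equal to 0x0041 can only ever be the letter A:

def classify_unit(u):
    # Classify a UTF-16 code unit by its value alone -- no context needed.
    if 0xD800 <= u <= 0xDBFF:
        return "high surrogate (first half of a pair)"
    if 0xDC00 <= u <= 0xDFFF:
        return "low surrogate (second half of a pair)"
    return "complete BMP character"

# "A" followed by U+1F600: the unit 0x0041 can only ever mean the letter A,
# because surrogate units live in their own reserved range (0xD800-0xDFFF).
for u in (0x0041, 0xD83D, 0xDE00):
    print(hex(u), "->", classify_unit(u))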

That design choice has been pretty solid. In 20 years, with steady additions of more and more obscure scripts and characters, we've used less than 15% of the available space. We definitely didn't need another 10 bits.

ANSWER 3

Thinking in terms of multiples and powers of 4 helps a lot with understanding UTF-8 and UTF-16:

BMP/ASCII    start  :                       =          0
Supp plane   start  :         4 ^ ( 4 + 4 ) =     65,536

Size of BMP         :         4 ^ ( 4 + 4 ) =     65,536 ( 4 ^  8 )
Size of Supp plane  : 4 * 4 * 4 ^ ( 4 + 4 ) =  1,048,576 ( 4 ^ 10 )
————————————————————————————————————————————————————————
Unicode (unadj)   ( 4*4 + 4^4 ) * ( 4 + 4 )^4
                                            = 4^8 + 4^10
                                            =  1,114,112
UTF-8

2-byte UTF-8 start  :   4 * 4   * ( 4 + 4 ) =        128
3-byte UTF-8 start  : ( 4 ^ 4 ) * ( 4 + 4 ) =      2,048
4-byte UTF-8 start  :       4   ^ ( 4 + 4 ) =     65,536

UTF-8 Multi-byte scale factors 

trailing x 1 : 4 ^ 3  =  4 * (   4   ) * 4  =         64
trailing x 2 : 4 ^ 6  =      ( 4 + 4 ) ^ 4  =      4,096
trailing x 3 : 4 ^ 9  =  4 ^ ( 4 + 4 ) * 4  =    262,144

UTF-16 

Hi surrogate start  : ( 4 ^ 5 ) *     54    =     55,296 ( 0xD800 )
per surrogate width : ( 4 ^ 5 )             =      1,024 ( 0x 400 )
Lo surrogate start  : ( 4 ^ 5 ) *     55    =     56,320 ( 0xDC00 )
Total surr. combos  : ( 4 ^ 5 ) * ( 4 ^ 5 ) =  1,048,576 ( 4 ^ 10 )
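For what it's worth, here is a quick Python check of the UTF-16 surrogate arithmetic above; it just recomputes the same numbers:

HI_START = (4 ** 5) * 54   # 0xD800 = 55,296
LO_START = (4 ** 5) * 55   # 0xDC00 = 56,320
WIDTH    = 4 ** 5          # 1,024 code units per surrogate half

combos = WIDTH * WIDTH                   # every high/low pairing
print(combos == 4 ** 10 == 1_048_576)    # True: the 2^20 supplementary code points
print(combos + 4 ** 8 == 1_114_112)      # True: plus the BMP = all 17 planes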