What if we used the extra bit as a flag? If the flag is set (1), it indicates that the character continues into the next byte. If not (0), it’s the end of the character.
Where UTF-8 uses
| Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|
| 0xxxxxxx | |||
| 110xxxxx | 10xxxxxx | ||
| 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
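To convince myself the table works, here's a tiny Python sketch that hand-packs a two-byte character into the `110xxxxx 10xxxxxx` pattern and checks it against Python's own encoder (the variable names are just mine):

```python
cp = ord("é")                            # U+00E9 needs 8 bits, so two bytes
byte1 = 0b1100_0000 | (cp >> 6)          # 110xxxxx carries the upper payload bits
byte2 = 0b1000_0000 | (cp & 0b11_1111)   # 10xxxxxx carries the low 6 payload bits
assert bytes([byte1, byte2]) == "é".encode("utf-8")   # b'\xc3\xa9'
```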
The flag scheme I'm describing would instead use
| Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|
| 0xxxxxxx | |||
| 1xxxxxxx | 0xxxxxxx | ||
| 1xxxxxxx | 1xxxxxxx | 0xxxxxxx | |
| 1xxxxxxx | 1xxxxxxx | 1xxxxxxx | 0xxxxxxx |
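Decoding under my scheme would look something like this rough Python sketch (`decode_flagged` is just a name I made up, nothing standard): you have to walk byte by byte until you hit a byte whose flag is 0, because the first byte never tells you how long the character is.

```python
def decode_flagged(data, start=0):
    """Decode one character under the flag scheme: 7 payload bits per byte,
    high bit set means more bytes follow, high bit clear means last byte.
    Returns (code_point, index_of_next_character)."""
    code_point = 0
    i = start
    while True:
        byte = data[i]
        code_point = (code_point << 7) | (byte & 0b0111_1111)
        i += 1
        if byte & 0b1000_0000 == 0:      # flag is 0: this was the last byte
            return code_point, i

# U+00E9 (é) takes two 7-bit groups here; we only discover that once we
# reach the byte whose flag is clear.
assert decode_flagged(bytes([0b1_0000001, 0b0_1101001])) == (0xE9, 2)
```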
Is it speed, because in UTF-8 you can tell a character's length from its first byte alone? Is this a trade-off between size and speed?
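The UTF-8 side of the speed argument is easy to sketch (again, `utf8_sequence_length` is just my own throwaway name): the length falls out of the first byte alone, so a decoder can skip ahead without even looking at the continuation bytes.

```python
def utf8_sequence_length(first_byte):
    """How many bytes a UTF-8 character occupies, read off its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 leading byte")

encoded = "é".encode("utf-8")                     # b'\xc3\xa9'
assert utf8_sequence_length(encoded[0]) == 2      # known before reading byte 2
```

With my scheme, no equivalent function can exist: the first byte of a two-, three-, or four-byte character looks exactly the same (`1xxxxxxx`), so you always pay for the scan. On the size side, 7 payload bits per byte beats UTF-8's 7/11/16/21, so everything up to U+10FFFF would fit in three bytes instead of four.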
I tried to find an answer in the early history of Unicode. I went back to the original Unicode 88 paper, but I couldn't find a definitive answer there.