What is an Unusual Octet Order BOM?

In the XML specification and in the various implementations of the Mozilla Universal Character Set Detector (UCSD), there is a BOM sequence in which either the byte order or the word order is reversed, but not both. They call it 'unusual octet order':

XML docs:

F.1 Detection Without External Encoding Information
...
00 00 FF FE     UCS-4, unusual octet order (2143)
FE FF 00 00     UCS-4, unusual octet order (3412)

Universal Character Set Detector (UCSD) source (just an example):

  if (('\xFF' == aBuf[1]) && ('\x00' == aBuf[2]) && ('\x00' == aBuf[3]))
    // FE FF 00 00 UCS-4, unusual octet order BOM (3412)
    mDetectedCharset = "X-ISO-10646-UCS-4-3412";

  else if (('\x00' == aBuf[1]) && ('\xFF' == aBuf[2]) && ('\xFE' == aBuf[3]))
    // 00 00 FF FE UCS-4, unusual octet order BOM (2143)
    mDetectedCharset = "X-ISO-10646-UCS-4-2143";
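
For reference, the four 4-byte patterns from Appendix F.1 can be checked in one place. This is only a minimal sketch, not the UCSD implementation: the helper name `detectUcs4Bom` is hypothetical, and the labels "UTF-32BE" and "UTF-32LE" for the two usual orders are an assumption (only the two X-ISO-10646 labels come from the excerpt above).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns a charset label if the buffer starts with
// one of the four 4-byte UCS-4 BOM patterns, or nullptr otherwise.
const char* detectUcs4Bom(const unsigned char* buf, std::size_t len) {
    if (len < 4) return nullptr;

    struct Entry {
        unsigned char bom[4];
        const char* name;
    };
    static const Entry kTable[] = {
        {{0x00, 0x00, 0xFE, 0xFF}, "UTF-32BE"},               // 1234 (label assumed)
        {{0xFF, 0xFE, 0x00, 0x00}, "UTF-32LE"},               // 4321 (label assumed)
        {{0x00, 0x00, 0xFF, 0xFE}, "X-ISO-10646-UCS-4-2143"}, // unusual order 2143
        {{0xFE, 0xFF, 0x00, 0x00}, "X-ISO-10646-UCS-4-3412"}, // unusual order 3412
    };

    for (const Entry& e : kTable) {
        if (std::memcmp(buf, e.bom, 4) == 0) return e.name;
    }
    return nullptr;
}
```

Note that FF FE 00 00 also begins with the UTF-16 little-endian BOM, so a real detector has to decide which of the two it prefers; this sketch does not address that.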

Universal Character Set Detector (UCSD) docs:

Known character sets
...
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412

Does any existing hardware use this endianness? Is there such an encoding, or an ISO standard for it? Do any popular libraries support encoding or decoding it? And why aren't these sequences simply ignored like any other invalid sequence?

Best answer:

ISO 10646 and Unicode define only big-endian and little-endian UCS-4/UTF-32, not middle-endian variants. To my knowledge, no software in existence uses these encodings; they are practically irrelevant. Why, then, does the XML standard mention them? I don't know, but I guess it was driven by a desire for theoretical completeness rather than by any practical value; the same likely applies to character detection/conversion software that includes support for them.

Historically, some systems have used a middle-endian byte order. PDP-11s stored 32-bit numbers as two 16-bit words, most significant word first, with each word little-endian. System headers conventionally label that layout 3412 (counting octets from the least significant one), but in the XML specification's numbering, where 1234 is big-endian, it corresponds to 2143. So if you were to process UCS-4/UTF-32 on a PDP-11, the UCS-4-2143 format might be useful. But in practice no one does that: PDP-11s were past their heyday by the time Unicode arrived, and since they are only 16-bit machines, UCS-4 is not the best Unicode format to use with them.
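
To make the four octet orders concrete, here is a minimal sketch of decoding one 4-byte unit under the XML specification's digit notation. It assumes that digit 1 denotes the most significant octet, which is consistent with the specification calling 1234 big-endian. The function name `decodeUcs4` and its signature are hypothetical and not taken from any library.

```cpp
#include <cstdint>

// Hypothetical helper: decodes one 4-byte UCS-4 code unit.
// `order` is a 4-character string such as "1234", "4321", "2143" or "3412".
// The digit at position i gives the significance of stream byte i
// (1 = most significant octet, 4 = least significant octet).
std::uint32_t decodeUcs4(const unsigned char* p, const char* order) {
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        int rank = order[i] - '1';  // 0 = most significant octet
        value |= static_cast<std::uint32_t>(p[i]) << (8 * (3 - rank));
    }
    return value;
}
```

With this, the bytes 00 00 FF FE decoded as "2143" and the bytes FE FF 00 00 decoded as "3412" both yield U+FEFF, which is why those two sequences can serve as byte order marks for the unusual orders.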