Why is the hexdump of my Unicode text file different from the byte sequence I manually entered?

Why does the following lead to such a different byte sequence in the hexdump?

$ echo -e "\u0f67\u0fb9\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f83\u0f0b" > uni
$ hexdump uni
0000000 bde0 e0a7 b9be bee0 e0a8 b3be bee0 e0ba
0000010 bcbe bee0 e0bb 83be bce0 0a8b
000001c

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The locale is correctly set to en_US.UTF-8, and the actual Unicode output is correct: ཧྐྵྨླྺྼྻྃ་

There is 1 best solution below

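My misconception was about what hexdump prints, not about the file. The \u escapes name Unicode code points, and in a UTF-8 locale echo -e writes each of them out as UTF-8 (not UTF-16). Looking up the first character, U+0F67, its UTF-8 encoding is

 e0 bd a7

which is exactly how the file begins. By default, however, hexdump displays the input as two-byte words in host byte order, so on a little-endian machine every pair of bytes appears swapped: e0 bd is shown as bde0, a7 e0 as e0a7, and so on. The bytes in the file are in the expected order; only the display groups them differently. To see the bytes one at a time in file order, run hexdump with the -C option.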
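The effect can be reproduced on a single character. The following is a minimal sketch, not the only way to check this: it uses od rather than hexdump (an assumption made here because od is specified by POSIX), and the file name uni_sample is made up for the example. od -tx1 prints one byte at a time, like hexdump -C, while od -tx2 groups bytes into 16-bit words in host byte order, like hexdump's default output.

```shell
# Write the UTF-8 bytes of U+0F67 (e0 bd a7) plus a newline.
# Octal escapes are used because they work in every POSIX printf.
printf '\340\275\247\n' > uni_sample

# One byte per column, in file order (comparable to: hexdump -C uni_sample).
bytes=$(od -An -tx1 uni_sample | tr -s ' \n' ' ')

# Two bytes per column, read as 16-bit words in host byte order
# (comparable to plain: hexdump uni_sample).
words=$(od -An -tx2 uni_sample | tr -s ' \n' ' ')

echo "bytes:$bytes"
echo "words:$words"
```

On a little-endian machine such as x86, the first line should list e0 bd a7 0a and the second should show the same four bytes as the swapped words bde0 0aa7, which matches the pattern in the question's dump.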