GNU sort UTF-8 incorrect collation order


I am using GNU sort on Linux to sort a UTF-8 file, and some strings are not being sorted correctly. I have the LC_COLLATE variable set to en_US.UTF-8 in Bash. Here is a hex dump showing the problem:

5f ef ac 82 0a
5f ef ac 81 0a
5f ef ac 82 0a
5f ef ac 82 0a

These are four consecutive lines of the sorted output. The 0a is the end of line. The order on the fourth byte is incorrect: the byte value 81 should not fall between the 82 bytes. When this is displayed in a terminal window, the second line shows a different character from the other three.
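For reference, the two byte sequences in the dump can be decoded to code points without reading hex by hand. This is a minimal sketch that assumes an `iconv` supporting the `UTF-16BE` encoding name is installed; the variable names are illustrative only. The sequences correspond to U+FB01 and U+FB02, the "fi" and "fl" ligatures.

```shell
# Convert each 3-byte UTF-8 sequence to UTF-16BE. For characters in the
# Basic Multilingual Plane, the two resulting bytes are the code point.
# Octal escapes are used so that any POSIX printf accepts them:
#   \357\254\201 = ef ac 81, \357\254\202 = ef ac 82
cp_line2=$(printf '\357\254\201' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1)
cp_other=$(printf '\357\254\202' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1)

# Strip the spaces od inserts and print the code points.
echo "ef ac 81 -> U+$(echo $cp_line2 | tr -d ' ')"
echo "ef ac 82 -> U+$(echo $cp_other | tr -d ' ')"
```

So the second line of the dump holds a different ligature character from the other three, which matches what the terminal shows.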

I doubt that this is a problem with the sort command because it is a GNU core utility, and it should be rock solid. Any ideas why this could be occurring? And why do I have to use hexdump to track down this problem; it's the 21st century already!


There are 2 best solutions below

0
On

Using LC_COLLATE=C appears to be the only solution.

You can set this up system-wide by editing /etc/default/locale (the location used on Debian-based systems).
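To try the byte-order behaviour for a single command before changing any system-wide setting, something like the following sketch may help. It feeds the four lines from the question to sort and prints the result as hex. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set, takes precedence over LC_COLLATE; the variable names `input` and `sorted_hex` are illustrative only.

```shell
# The four lines from the question, written with octal escapes so that any
# POSIX printf accepts them (\357\254\201 = ef ac 81, \357\254\202 = ef ac 82).
input='_\357\254\202\n_\357\254\201\n_\357\254\202\n_\357\254\202\n'

# LC_ALL=C applies only to this one sort invocation and forces plain
# byte-wise comparison; od then prints the sorted bytes as hex.
sorted_hex=$(printf "$input" | LC_ALL=C sort | od -An -tx1)
echo "$sorted_hex"
```

With byte-wise comparison the line containing 81 comes first, followed by the three lines containing 82. If LC_COLLATE=C alone seems to have no effect, it is worth checking whether LC_ALL is set in the environment.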

Unfortunately, this loses a lot of useful aspects of UTF-8 sorting, such as putting accented characters next to their base characters. But it is far less objectionable than the complete hideous mess the libc developers and the Unicode Consortium have made of collation. They fail to understand the purpose of sorting: the need to preserve sort order when strings are concatenated, the need to always produce the same order, and how virtually every program in the world relies on this. Instead, they seem to feel it is important to "sort" typos, such as spaces inserted into the middle of names, by ignoring them (!).

0
On

This was probably a bug in the version you used. When I run sort (the version from GNU coreutils 8.30), it behaves as follows:

$ printf '\x5f\xef\xac\x82\x0a\x5f\xef\xac\x81\x0a\x5f\xef\xac\x82\x0a\x5f\xef\xac\x82\x0a' | LC_COLLATE=en_US.UTF-8 sort
_ﬁ
_ﬂ
_ﬂ
_ﬂ

which appears to work as expected. I did not test whether it correctly handles NFC vs. NFD normalization forms, because I only use NFC myself.
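As a small illustration of why the normalization form can matter for sorting, the sketch below prints the bytes of the same visible character, "é", in both forms. It only shows that the two forms are different byte sequences; it does not test how any particular UTF-8 locale collates them. The variable names are illustrative only.

```shell
# NFC: one precomposed code point, U+00E9 (bytes c3 a9).
nfc=$(printf '\303\251' | od -An -tx1)
# NFD: base letter "e" (U+0065) followed by the combining acute accent
# U+0301 (bytes 65 cc 81).
nfd=$(printf 'e\314\201' | od -An -tx1)

echo "NFC bytes:$nfc"
echo "NFD bytes:$nfd"
```

Under a byte-wise comparison such as LC_ALL=C, these two spellings are simply different strings, so a file that mixes both forms can place visually identical lines far apart.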