I am using GNU sort on Linux to sort a UTF-8 file, and some strings are not being sorted correctly. I have the LC_COLLATE variable set to en_US.UTF-8 in Bash. Here is a hex dump showing the problem:
5f ef ac 82 0a
5f ef ac 81 0a
5f ef ac 82 0a
5f ef ac 82 0a
These are four consecutive lines of the sorted output. The 0a is the end of line. The order on the fourth byte is incorrect: the line with byte value 81 should not fall between the lines with byte value 82. When this is displayed in the terminal window, the second line shows a different character from the other three (ef ac 81 is U+FB01, the "fi" ligature; ef ac 82 is U+FB02, the "fl" ligature).
I doubt that this is a problem with the sort command because it is a GNU core utility, and it should be rock solid. Any ideas why this could be occurring? And why do I have to use hexdump to track down this problem; it's the 21st century already!
Using LC_COLLATE=C appears to be the only solution. Note that LC_ALL, if it is set, overrides LC_COLLATE.
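As a minimal sketch of the byte-order sort, assuming a POSIX shell with GNU sort, od and tr available: the helper name sample and the variable sorted_hex are made up for this illustration, and the input is just the four lines from the hex dump in the question.

```shell
# The four lines from the question's hex dump, written with octal escapes:
# 5f = "_", ef ac 81 = U+FB01, ef ac 82 = U+FB02, 0a = newline.
sample() {
    printf '_\357\254\202\n_\357\254\201\n_\357\254\202\n_\357\254\202\n'
}

# Force byte-value collation for this one command only.
# LC_ALL is used here because, if set, it takes precedence over LC_COLLATE.
sorted_hex=$(sample | LC_ALL=C sort | od -An -tx1 | tr -d ' \n')

# Print the sorted bytes as one continuous hex string.
echo "$sorted_hex"
```

With byte-value collation the line containing 81 comes first, followed by the three lines containing 82. Setting the variable on the command line like this affects only that one sort invocation, so the rest of the session keeps its normal locale.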
You can make this the default for everything by editing /etc/default/locale (the location used on Debian-style systems).
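For illustration only, the file might end up looking something like the fragment below. The LANG line is an assumption standing in for whatever your system already has; only the LC_COLLATE line is the change being suggested, and it takes effect at the next login.

```shell
# /etc/default/locale (illustrative contents, not a complete file)
# Keep UTF-8 for character handling and messages.
LANG=en_US.UTF-8
# Collate by raw byte value instead of the locale's collation rules.
LC_COLLATE=C
```

If LC_ALL is also set anywhere in the environment, it will override this setting.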
Unfortunately, this loses a lot of useful aspects of UTF-8 sorting, such as putting accented characters next to their base characters. But it is far less objectionable than the complete hideous mess the libc developers and the Unicode Consortium made. They fail to understand the purpose of sorting, the need to preserve sort order when strings are concatenated, the need to always produce the same order, and how virtually every program in the world relies on this. Instead, they seem to feel it is important to "sort" typos, such as spaces inserted into the middle of names, by ignoring them (!).