Bash string lexicographical comparisons inconsistency

Asked by Michael Chen on 27 June 2021 at 06:00

Bash manual section 6.4 describes [[ string1 < string2 ]] as

True if string1 sorts before string2 lexicographically in the current locale.

I am using a stock English-language Linux and expected my current locale to follow ASCII order, where period [.] is lexicographically less than [0-9A-Za-z]. However, take a look at these:

$ echo $BASH_VERSION
4.3.11(1)-release
$ [[ "." < "1" ]] && echo "yes"
yes
$ [[ "A" < "B" ]] && echo "yes"
yes
$ [[ ".A" < "1B" ]] && echo "yes"
$

The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?

Here is the output of locale:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

**oguz ismail** · Answer 1 · 2021-06-27T13:23:27.460000

This doesn't have much to do with your shell. To perform a locale-dependent lexicographic comparison of .A and 1B, bash simply calls strcoll(".A", "1B"), and interprets the return value, that's all.

    {
#if defined (HAVE_STRCOLL)
      if (shell_compatibility_level > 40 && flags & TEST_LOCALE)
        return ((op[0] == '>') ? (strcoll (arg1, arg2) > 0) : (strcoll (arg1, arg2) < 0));
      else
#endif
        return ((op[0] == '>') ? (strcmp (arg1, arg2) > 0) : (strcmp (arg1, arg2) < 0));
    }

(copied from test.c)

The excerpt above also reveals that, to force a byte-by-byte comparison without altering locale settings, you need to set the shell compatibility level to 40 (which stands for 4.0, the last version of bash that behaves the way you expected by default).

$ shopt -s compat40
$ [[ .A < 1B ]] && echo yes
yes
$
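If changing the compatibility level is not desirable, the other route is to alter the locale for just the one comparison. The following is a minimal sketch, not taken from the bash sources: `bytewise_lt` is a hypothetical helper name, and it assumes a bash recent enough that [[ ... < ... ]] honors the locale. The function body runs in a subshell, so the caller's locale settings are left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): succeed if $1 sorts before $2
# in plain byte order, whatever locale the caller is using.
# The ( ... ) body is a subshell, so LC_ALL=C does not leak to the caller.
bytewise_lt() (
    LC_ALL=C
    [[ "$1" < "$2" ]]
)

bytewise_lt ".A" "1B" && echo "yes"   # prints "yes": '.' (0x2E) precedes '1' (0x31)
```

The subshell costs a fork per comparison, which is fine for occasional checks but may matter inside a tight loop.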

Now, as to your question (The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?): it is apparently your locale's collation order. Under "What Collation is NOT", the Unicode Collation Algorithm (UCA) specification says:

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

Which, I think, corroborates that this is not a bug but a feature.
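To check these properties on a given machine, something like the following sketch may help. `collate` is a hypothetical helper name, and the en_US.UTF-8 results depend entirely on the locale data installed (if that locale is missing, bash prints a warning and the comparison is not meaningful), so only the C-locale lines carry an expected output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): report how two strings
# compare under the locale given as $1. The ( ... ) body is a subshell,
# so the caller's locale is not modified.
collate() (
    LC_ALL=$1
    if   [[ "$2" < "$3" ]]; then echo "$2 < $3"
    elif [[ "$2" > "$3" ]]; then echo "$2 > $3"
    else                         echo "$2 = $3"
    fi
)

# Byte order (C locale): the first differing byte decides, so appending
# characters after it cannot flip the result.
collate C . 1      # prints ". < 1"
collate C .A 1B    # prints ".A < 1B"

# The locale from the question: output depends on the installed locale data.
collate en_US.UTF-8 . 1
collate en_US.UTF-8 .A 1B
```

On the asker's system, the last call would presumably not report `.A < 1B`, which is exactly the "x < y does not imply xz < yz" case quoted above.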

**Gordon Davisson** · Answer 2 · 2021-06-27T21:14:43.120000

UTF-8 collation order doesn't go character-by-character, like traditional ASCIIbetical collation does. It uses a multi-level comparison, in which some types of differences are prioritized over others even if they occur later in the string. In this case, what you're seeing is the result of "base character" order ("A" > "1B", since digits sort before letters) being prioritized over a punctuation difference. Here's a quote from the standard:

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, the most important feature is the identity of the base letters—for example, the difference between an A and a B. Accent differences are typically ignored, if the base letters differ. Case differences (uppercase versus lowercase), are typically ignored, if the base letters or their accents differ. Treatment of punctuation varies. In some situations a punctuation character is treated like a base letter. In other situations, it should be ignored if there are any base, accent, or case differences. [...]

Here's an example showing the prioritization of punctuation vs "base characters":

$ printf '%s\n' {,.,-}{,1,A,AB,B,BA} | LANG=en_US.UTF-8 sort
-
.
-1
.1
1
-A
.A
A
-AB
.AB
AB
-B
.B
B
-BA
.BA
BA

Note that the punctuation only matters to break ties between lines containing the same base characters. You can also see similar effects involving capitalization and accents:

$ printf '%s\n' {a,A,B}{A,Å,B} | LANG=en_US.UTF-8 sort
aA
AA
aÅ
AÅ
aB
AB
BA
BÅ
BB

Note that the accent on the second character has higher priority than the capitalization of the first character (and punctuation anywhere in the string would have lower priority than either).

(And, of course, there are lots of other complications beyond this.)
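For contrast with the locale-aware listings above, here is a small sketch of the same kind of sort forced into byte order with LC_ALL=C. The five input strings are picked for illustration, and the variable name `sorted` is an assumption. In this mode punctuation is no longer a tie-breaker: every string starting with "." sorts before every string starting with a digit or letter.

```shell
#!/usr/bin/env bash
# Sort a few strings in plain byte order by forcing the C locale for sort only.
sorted=$(printf '%s\n' .A 1B A 1 . | LC_ALL=C sort)
printf '%s\n' "$sorted"
# Prints, one per line:  .  .A  1  1B  A
```

Setting LC_ALL on the sort command alone affects only that process; the rest of the shell session keeps its locale.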
