On 26.06.24 17:59, Laurenz Albe wrote:
> Interestingly, glibc and ICU disagree about that:
>
> SELECT '|1' < '01' COLLATE "de_DE.utf8" AS glibc,
> '|1' < '01' COLLATE "de-AT-x-icu" AS icu;
>
> glibc │ icu
> ═══════╪═════
> f │ t
> (1 row)
Let's work this out.
The collation weights of the characters involved are:
0030 ; [.209F.0020.0002] # DIGIT ZERO
0031 ; [.20A0.0020.0002] # DIGIT ONE
007C ; [*06AF.0020.0002] # VERTICAL LINE
So under the ICU default "noignore" behavior, the sort keys are:
'|1' => 06AF 20A0 0000 0020 0020 0000 0002 0002
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002
Therefore, '|1' < '01'.
Under "shifted", the effective collation weights are:
0030 ; [.209F.0020.0002.FFFF] # DIGIT ZERO
0031 ; [.20A0.0020.0002.FFFF] # DIGIT ONE
007C ; [.0000.0000.0000.06AF] # VERTICAL LINE
and the sort keys are:
'|1' => 20A0 0000 0020 0000 0002 0000 06AF FFFF
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002 0000 FFFF FFFF
Therefore, '|1' > '01'. That is what you get in glibc. (Older glibc
use "shift-trimmed" behavior, which trims the trailing FFFF's, which
doesn't affect the result in this case.)
Under ICU (or rather CLDR), the '|' character is not actually a variable
collation element by default, so you actually get the same result if you
ask for "noignore" or "shifted" behavior. But you can make symbols
variable with 'und-u-ka-shifted-kv-symbol', and then you will also get
the same result as glibc.
Further reading:
-
https://peter.eisentraut.org/blog/2023/04/12/how-collation-of-punctuation-and-whitespace-works
-
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings