Re: BUG #18523: String compare not consistent - Mailing list pgsql-bugs

From Peter Eisentraut
Subject Re: BUG #18523: String compare not consistent
Date
Msg-id 30c9e733-9573-4500-b191-5a8a67b3795f@eisentraut.org
Whole thread Raw
In response to Re: BUG #18523: String compare not consistent  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-bugs
On 26.06.24 17:59, Laurenz Albe wrote:
> Interestingly, glibc and ICU disagree about that:
> 
> SELECT '|1' < '01' COLLATE "de_DE.utf8"  AS glibc,
>         '|1' < '01' COLLATE "de-AT-x-icu" AS icu;
> 
>   glibc │ icu
> ═══════╪═════
>   f     │ t
> (1 row)

Let's work this out.

The collation weights of the characters involved are:

0030  ; [.209F.0020.0002] # DIGIT ZERO
0031  ; [.20A0.0020.0002] # DIGIT ONE
007C  ; [*06AF.0020.0002] # VERTICAL LINE

So under the ICU default "noignore" behavior, the sort keys are:

'|1' => 06AF 20A0 0000 0020 0020 0000 0002 0002
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002

Therefore, '|1' < '01'.

Under "shifted", the effective collation weights are:

0030  ; [.209F.0020.0002.FFFF] # DIGIT ZERO
0031  ; [.20A0.0020.0002.FFFF] # DIGIT ONE
007C  ; [.0000.0000.0000.06AF] # VERTICAL LINE

and the sort keys are:

'|1' => 20A0 0000 0020 0000 0002 0000 06AF FFFF
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002 0000 FFFF FFFF

Therefore, '|1' > '01'.  That is what you get in glibc.  (Older glibc
use "shift-trimmed" behavior, which trims the trailing FFFF's, which 
doesn't affect the result in this case.)

Under ICU (or rather CLDR), the '|' character is not actually a variable 
collation element by default, so you actually get the same result if you 
ask for "noignore" or "shifted" behavior.  But you can make symbols 
variable with 'und-u-ka-shifted-kv-symbol', and then you will also get 
the same result as glibc.

Further reading:
- 
https://peter.eisentraut.org/blog/2023/04/12/how-collation-of-punctuation-and-whitespace-works
- 
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings




pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18523: String compare not consistent
Next
From: Heikki Linnakangas
Date:
Subject: Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows