Thread: BUG #18523: String compare not consistent

BUG #18523: String compare not consistent

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18523
Logged by:          Patrick van Dijk
Email address:      patrick.van.dijk@gmail.com
PostgreSQL version: 16.3
Operating system:   Windows vs Linux/Azure
Description:

When I try the following on a Azure/Linux version of PostgreSQL i see some
strange results that are not correct.
On Windows the behavior is correct.

select '|' < '0';        -- true, true, this is correct
select '|1' < '01';    -- false, true, Azure/Linux is wrong
select '||' < '0|';    -- true, true, this is correct
select '||1' < '0|1';    -- false, true, Azure/Linux is wrong

When the first character compares <, then no matter what follows, it should
be <...


Re: BUG #18523: String compare not consistent

From
Patrick van Dijk
Date:
On Windows: PostgreSQL 16.2, compiled by Visual C++ build 1937, 64-bit
On Azure: PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.0, 64-bit




On Wed, Jun 26, 2024 at 11:44 AM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18523
Logged by:          Patrick van Dijk
Email address:      patrick.van.dijk@gmail.com
PostgreSQL version: 16.3
Operating system:   Windows vs Linux/Azure
Description:       

When I try the following on a Azure/Linux version of PostgreSQL i see some
strange results that are not correct.
On Windows the behavior is correct.

select '|' < '0';               -- true, true, this is correct
select '|1' < '01';     -- false, true, Azure/Linux is wrong
select '||' < '0|';     -- true, true, this is correct
select '||1' < '0|1';   -- false, true, Azure/Linux is wrong

When the first character compares <, then no matter what follows, it should
be <...

Re: BUG #18523: String compare not consistent

From
Patrick van Dijk
Date:

I am getting more curious, WIndows is consistent, but Azure/Linux is not.

select '.' < '0'; -- true, true
select '.x' < '0x'; -- false, true
select '.1' < '01'; -- false, true
select '..' < '0.'; -- true, true
select '..1' < '0.1'; -- false, true


On Wed, Jun 26, 2024 at 11:44 AM Patrick van Dijk <patrick.van.dijk@gmail.com> wrote:
On Windows: PostgreSQL 16.2, compiled by Visual C++ build 1937, 64-bit
On Azure: PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.0, 64-bit




On Wed, Jun 26, 2024 at 11:44 AM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18523
Logged by:          Patrick van Dijk
Email address:      patrick.van.dijk@gmail.com
PostgreSQL version: 16.3
Operating system:   Windows vs Linux/Azure
Description:       

When I try the following on a Azure/Linux version of PostgreSQL i see some
strange results that are not correct.
On Windows the behavior is correct.

select '|' < '0';               -- true, true, this is correct
select '|1' < '01';     -- false, true, Azure/Linux is wrong
select '||' < '0|';     -- true, true, this is correct
select '||1' < '0|1';   -- false, true, Azure/Linux is wrong

When the first character compares <, then no matter what follows, it should
be <...

Re: BUG #18523: String compare not consistent

From
Laurenz Albe
Date:
On Wed, 2024-06-26 at 09:43 +0000, PG Bug reporting form wrote:
> PostgreSQL version: 16.3
> Operating system:   Windows vs Linux/Azure
>
> When I try the following on a Azure/Linux version of PostgreSQL i see some
> strange results that are not correct.
> On Windows the behavior is correct.
>
> select '|' < '0';        -- true, true, this is correct
> select '|1' < '01';    -- false, true, Azure/Linux is wrong
> select '||' < '0|';    -- true, true, this is correct
> select '||1' < '0|1';    -- false, true, Azure/Linux is wrong
>
> When the first character compares <, then no matter what follows, it should
> be <...

That is certainly not a PostgreSQL bug, because PostgreSQL uses collations
defined by the C library (or the ICU library, if you use that).

So you'd have to complain to the authors of the respective library.

Interestingly, glibc and ICU disagree about that:

SELECT '|1' < '01' COLLATE "de_DE.utf8"  AS glibc,
       '|1' < '01' COLLATE "de-AT-x-icu" AS icu;

 glibc │ icu
═══════╪═════
 f     │ t
(1 row)

Yours,
Laurenz Albe



Re: BUG #18523: String compare not consistent

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> When I try the following on a Azure/Linux version of PostgreSQL i see some
> strange results that are not correct.
> On Windows the behavior is correct.

"Correct" is in the eye of the beholder, and non-C collations
tend to act in ways far more complicated than you seem to expect.
Check the locale settings on the two machines; unless they are
both "C", there's not really any guarantee of consistency
between Windows and anybody else.

As a demonstration that this isn't some Postgres-specific
weirdness, you can try asking sort(1) what it thinks.
On a RHEL8 machine I get

$ echo -e '|\n0' | LANG=C sort
0
|
$ echo -e '|1\n01' | LANG=C sort
01
|1
$ echo -e '|\n0' | LANG=en_US.utf8 sort
|
0
$ echo -e '|1\n01' | LANG=en_US.utf8 sort
01
|1

and I'd expect the same from any Linux machine with a similar
vintage of glibc.

            regards, tom lane



Re: BUG #18523: String compare not consistent

From
Peter Eisentraut
Date:
On 26.06.24 17:59, Laurenz Albe wrote:
> Interestingly, glibc and ICU disagree about that:
> 
> SELECT '|1' < '01' COLLATE "de_DE.utf8"  AS glibc,
>         '|1' < '01' COLLATE "de-AT-x-icu" AS icu;
> 
>   glibc │ icu
> ═══════╪═════
>   f     │ t
> (1 row)

Let's work this out.

The collation weights of the characters involved are:

0030  ; [.209F.0020.0002] # DIGIT ZERO
0031  ; [.20A0.0020.0002] # DIGIT ONE
007C  ; [*06AF.0020.0002] # VERTICAL LINE

So under the ICU default "noignore" behavior, the sort keys are:

'|1' => 06AF 20A0 0000 0020 0020 0000 0002 0002
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002

Therefore, '|1' < '01'.

Under "shifted", the effective collation weights are:

0030  ; [.209F.0020.0002.FFFF] # DIGIT ZERO
0031  ; [.20A0.0020.0002.FFFF] # DIGIT ONE
007C  ; [.0000.0000.0000.06AF] # VERTICAL LINE

and the sort keys are:

'|1' => 20A0 0000 0020 0000 0002 0000 06AF FFFF
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002 0000 FFFF FFFF

Therefore, '|1' > '01'.  That is what you get in glibc.  (Older glibc
use "shift-trimmed" behavior, which trims the trailing FFFF's, which 
doesn't affect the result in this case.)

Under ICU (or rather CLDR), the '|' character is not actually a variable 
collation element by default, so you actually get the same result if you 
ask for "noignore" or "shifted" behavior.  But you can make symbols 
variable with 'und-u-ka-shifted-kv-symbol', and then you will also get 
the same result as glibc.

Further reading:
- 
https://peter.eisentraut.org/blog/2023/04/12/how-collation-of-punctuation-and-whitespace-works
- 
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings