Thread: sort order

sort order

From
Marc Millas
Date:
Hi,

context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252


create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');

differences include a french é character, and some white spaces at the beginning.

then select * from test order by ble;

centos result:
  ble  | id
-------+----
       |  3
 bobo  |  2
  bobo |  1
 élise |  6
 marc  |  5
  marc |  4
 Marc  |  8
  Marc |  7
(8 lignes)

w10 result:
  ble  | id
-------+----
       |  3
  bobo |  1
  marc |  4
  Marc |  7
 élise |  8
 bobo  |  2
 marc  |  5
 Marc  |  6
(8 lignes)

so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the beginning white space.

I didn't see anything about this behaviour on the doc, unless the reference at the libc should be understood as please read and test libc doc on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10 one ??
ie. knowing about french characters AND taking beginning white spaces into account ?

thanks,

regards,
 

Marc MILLAS
Senior Architect
+33607850334

Re: sort order

From
Marc Millas
Date:
Re-reading my post, I see that even the élise is not sorted correctly on w10...

Marc MILLAS
Senior Architect
+33607850334



On Tue, Jul 27, 2021 at 7:07 PM Marc Millas <marc.millas@mokadb.com> wrote:
Hi,

context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252


create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');

differences include a french é character, and some white spaces at the beginning.

then select * from test order by ble;

centos result:
  ble  | id
-------+----
       |  3
 bobo  |  2
  bobo |  1
 élise |  6
 marc  |  5
  marc |  4
 Marc  |  8
  Marc |  7
(8 lignes)

w10 result:
  ble  | id
-------+----
       |  3
  bobo |  1
  marc |  4
  Marc |  7
 élise |  8
 bobo  |  2
 marc  |  5
 Marc  |  6
(8 lignes)

so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the beginning white space.

I didn't see anything about this behaviour on the doc, unless the reference at the libc should be understood as please read and test libc doc on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10 one ??
ie. knowing about french characters AND taking beginning white spaces into account ?

thanks,

regards,
 

Marc MILLAS
Senior Architect
+33607850334

Re: sort order

From
Tom Lane
Date:
Marc Millas <marc.millas@mokadb.com> writes:
> Re-reading my post, I see that even the élise is not sorted correctly on
> w10...

Sort orders aren't even particularly guaranteed on different releases
of the same platform, let alone totally different platforms.  glibc
made major changes to their collation rules not long ago:

https://wiki.postgresql.org/wiki/Locale_data_changes

At one time we thought that migrating to ICU sorting might bring
a little more stability to this, but I think what we've found so
far is that it's just different :-(

            regards, tom lane



Re: sort order

From
Peter Eisentraut
Date:
On 27.07.21 19:07, Marc Millas wrote:
> so, obviously, both lc_collate knows about the é
> but obviously, too, they do behave differently on the impact of the 
> beginning white space.
> 
> I didn't see anything about this behaviour on the doc, unless the 
> reference at the libc should be understood as please read and test libc 
> doc on each platform.
> So my first question is: why ?
> My second question is: how to make the centos postgres behave like the 
> w10 one ??
> ie. knowing about french characters AND taking beginning white spaces 
> into account ?

There are multiple standard ways to deal with space and punctuation 
characters when sorting.  See 
<https://unicode-org.github.io/icu/userguide/collation/customization/ignorepunct.html> 
for a description.  Not all collation providers implement all of them, 
but the behavior you happen to get is usually one of them.  The centos 7 
behavior corresponds to "shift-trimmed", the Windows one appears to 
match "non-ignorable".  If you want to get that latter one on Linux as 
well, you can use the ICU locales, which also default to non-ignorable. 
For example

     select * from test order by ble collate "fr-x-icu";

matches your Windows output for me.