Thread: sort order for UTF-8 char column with Japanese UTF-8

sort order for UTF-8 char column with Japanese UTF-8

From
Matthias Apitz
Date:
Hello,

With ESQL/C on a PostgreSQL 13.1 server I see the result of this query:

select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC;

coming out in this order:

query: fetch swd_anzeige_seq
RESULT: A
query: fetch swd_anzeige_seq
RESULT: ゲアハルト・A・リッター
query: fetch swd_anzeige_seq
RESULT: ゲルハルト・A・リッター
query: fetch swd_anzeige_seq
RESULT: チャールズ・A・ビアード
query: fetch swd_anzeige_seq
RESULT: A010STRUKTUR
query: fetch swd_anzeige_seq
RESULT: A010STRUKTUR
query: fetch swd_anzeige_seq
RESULT: A010STRUKTUR
query: fetch swd_anzeige_seq
RESULT: A0150SUPRALEITER

I loaded the same table in my server, but can't get the same order with
psql:


 katkey |                        normform
--------+---------------------------------------------------------
 233871 | A
 ...
  92938 | ゲアハルト・A・リッター
  92938 | ゲルハルト・A・リッター
  92938 | リッター0ゲルハルト・A

i.e. the Japanese rows are sorted at the end after all others. Why?

Thanks

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
August 13, 1961: Better a wall than a war. And, while the GDR was still existing,
no German troups and bombs have been killed in Yugoslavia, Afghanistan, Afrika...



Re: sort order for UTF-8 char column with Japanese UTF-8

From
Matthias Apitz
Date:
El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió:

> 
> Hello,
> 
> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query:
> 
> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC;
> 
> coming out in this order:
> 
> query: fetch swd_anzeige_seq
> RESULT: A
> query: fetch swd_anzeige_seq
> RESULT: ゲアハルト・A・リッター
> query: fetch swd_anzeige_seq
> RESULT: ゲルハルト・A・リッター
> query: fetch swd_anzeige_seq
> RESULT: チャールズ・A・ビアード
> query: fetch swd_anzeige_seq
> RESULT: A010STRUKTUR
> query: fetch swd_anzeige_seq
> RESULT: A010STRUKTUR
> query: fetch swd_anzeige_seq
> RESULT: A010STRUKTUR
> query: fetch swd_anzeige_seq
> RESULT: A0150SUPRALEITER
> 
> I loaded the same table in my server, but can't get the same order with
> psql:
> 
> 
>  katkey |                        normform
> --------+---------------------------------------------------------
>  233871 | A
>  ...
>   92938 | ゲアハルト・A・リッター
>   92938 | ゲルハルト・A・リッター
>   92938 | リッター0ゲルハルト・A
> 
> i.e. the Japanese rows are sorted at the end after all others. Why?
> 

I forgot to say, both database have:

psql -l
...
 ueisis    | sisis      | UTF8 | de_DE.UTF-8  | de_DE.UTF-8      |

    matthias
-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: sort order for UTF-8 char column with Japanese UTF-8

From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes:
> El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió:
>> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query:
>> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC;
>> coming out in this order:
>> ...
>> I loaded the same table in my server, but can't get the same order with
>> psql:

Do the two machines produce the same results if you sort the data in
question with sort(1)?  (Being careful to set LANG=de_DE.UTF-8 of
course.)  I rather doubt this has anything to do with Postgres as such;
there are lots of inter-system and inter-release discrepancies in
collation behavior.

            regards, tom lane



Re: sort order for UTF-8 char column with Japanese UTF-8

From
Matthias Apitz
Date:
El día jueves, febrero 03, 2022 a las 10:00:37 -0500, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió:
> >> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query:
> >> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC;
> >> coming out in this order:
> >> ...
> >> I loaded the same table in my server, but can't get the same order with
> >> psql:
>
> Do the two machines produce the same results if you sort the data in
> question with sort(1)?  (Being careful to set LANG=de_DE.UTF-8 of
> course.)  I rather doubt this has anything to do with Postgres as such;
> there are lots of inter-system and inter-release discrepancies in
> collation behavior.

I've created a small text file 'swd' with some of the UTF-8 lines in
question. On my Linux mobile phone (a Debian Linux) it gives:

purism@pureos:~$ LANG=de_DE.UTF-8 sort swd
A
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード

On my FreeBSD laptop the same file sorts as

guru@c720-r368166:~ $ LANG=de_DE.UTF-8 sort swd
A
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER

I'll check tomorrow with the admins of the remote server (a SuSE Linux).

Thanks for the hint.

    matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

Attachment

Re: sort order for UTF-8 char column with Japanese UTF-8

From
Thomas Munro
Date:
On Fri, Feb 4, 2022 at 8:11 AM Matthias Apitz <guru@unixarea.de> wrote:
> On my FreeBSD laptop the same file sorts as
>
> guru@c720-r368166:~ $ LANG=de_DE.UTF-8 sort swd
> A
> ゲアハルト・A・リッター
> ゲルハルト・A・リッター
> チャールズ・A・ビアード
> A010STRUKTUR
> A010STRUKTUR
> A010STRUKTUR
> A0150SUPRALEITER

Wow, so it's one thing to have a different default "script order" than
glibc and ICU (which is something you can customise IIRC), but isn't
something broken here if the Japanese text comes between "A" and
"A0..."??  Hmm, it's almost as if it completely ignored the Japanese
text.  From my FreeBSD box:

tmunro=> select * from t order by x collate "de_DE.UTF-8";
            x
--------------------------
 ゲアハルト
 A
 ゲアハルト・A・リッター
 A0
 A010STRUKTUR
 AA
 ゲアハルト・AA・リッター
 ゲアハルト・B・リッター
(8 rows)

tmunro=> select * from t order by x collate "ja_JP.UTF-8";
            x
--------------------------
 A
 A0
 A010STRUKTUR
 AA
 ゲアハルト
 ゲアハルト・AA・リッター
 ゲアハルト・A・リッター
 ゲアハルト・B・リッター
(8 rows)

Seems like something to investigate in FreeBSD land.

Re: sort order for UTF-8 char column with Japanese UTF-8

From
Matthias Apitz
Date:
El día Donnerstag, Februar 03, 2022 a las 10:00:37 -0500, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió:
> >> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query:
> >> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC;
> >> coming out in this order:
> >> ...
> >> I loaded the same table in my server, but can't get the same order with
> >> psql:
> 
> Do the two machines produce the same results if you sort the data in
> question with sort(1)?  (Being careful to set LANG=de_DE.UTF-8 of
> course.)  I rather doubt this has anything to do with Postgres as such;
> there are lots of inter-system and inter-release discrepancies in
> collation behavior.

No, they do not. I gathered from the admin of the remote (customer)
server the output of sort(1) in dependency of LANG and LC_ALL (see
below). Only when I let her unset the UTF-8 env vars, the result is with
the Japanese lines at the end, with UTF-8 env they're sorted at the
beginning.

On my own server (the only difference is that my server is
SUSE Linux Enterprise Server 15 SP3, while the customer still runs SP2)
I never get the Japanese lines on top with the same commands as used by
the remote admin. I requested now in addition the output of

ls -l /lib64/libc.* /usr/lib/locale/de_DE.utf8

to see if the libc version is different, mine is libc-2.31.so

Said that, does the SORT done by the server depends on the environment
(LANG, LC_*) in which the PostgreSQL server is started or only of the
sp_catalog information of the database in question?

Thanks

    matthias



LC_ALL=de_DE.UTF-8 sort swd
A
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER


LANG=de_DE.UTF-8 sort swd
A
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER


sort swd
A
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER


env | grep LC
LC_ALL=de_DE.UTF-8


env | grep LANG
LANG=de_DE.UTF-8


unset LC_ALL LC_COLLATE LANG
sort swd
A
A010STRUKTUR
A010STRUKTUR
A010STRUKTUR
A0150SUPRALEITER
ゲアハルト・A・リッター
ゲルハルト・A・リッター
チャールズ・A・ビアード

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!



Re: sort order for UTF-8 char column with Japanese UTF-8

From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes:
> Said that, does the SORT done by the server depends on the environment
> (LANG, LC_*) in which the PostgreSQL server is started or only of the
> sp_catalog information of the database in question?

We set LC_COLLATE and LC_CTYPE from the database's relevant
pg_database fields.  The remaining LC environment variables
shouldn't affect this (and I think we force them to "C"
anyway).

            regards, tom lane