Thread: sort order for UTF-8 char column with Japanese UTF-8
Hello, With ESQL/C on a PostgreSQL 13.1 server I see the result of this query: select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC; coming out in this order: query: fetch swd_anzeige_seq RESULT: A query: fetch swd_anzeige_seq RESULT: ゲアハルト・A・リッター query: fetch swd_anzeige_seq RESULT: ゲルハルト・A・リッター query: fetch swd_anzeige_seq RESULT: チャールズ・A・ビアード query: fetch swd_anzeige_seq RESULT: A010STRUKTUR query: fetch swd_anzeige_seq RESULT: A010STRUKTUR query: fetch swd_anzeige_seq RESULT: A010STRUKTUR query: fetch swd_anzeige_seq RESULT: A0150SUPRALEITER I loaded the same table in my server, but can't get the same order with psql: katkey | normform --------+--------------------------------------------------------- 233871 | A ... 92938 | ゲアハルト・A・リッター 92938 | ゲルハルト・A・リッター 92938 | リッター0ゲルハルト・A i.e. the Japanese rows are sorted at the end after all others. Why? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub August 13, 1961: Better a wall than a war. And, while the GDR was still existing, no German troups and bombs have been killed in Yugoslavia, Afghanistan, Afrika...
El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió: > > Hello, > > With ESQL/C on a PostgreSQL 13.1 server I see the result of this query: > > select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC; > > coming out in this order: > > query: fetch swd_anzeige_seq > RESULT: A > query: fetch swd_anzeige_seq > RESULT: ゲアハルト・A・リッター > query: fetch swd_anzeige_seq > RESULT: ゲルハルト・A・リッター > query: fetch swd_anzeige_seq > RESULT: チャールズ・A・ビアード > query: fetch swd_anzeige_seq > RESULT: A010STRUKTUR > query: fetch swd_anzeige_seq > RESULT: A010STRUKTUR > query: fetch swd_anzeige_seq > RESULT: A010STRUKTUR > query: fetch swd_anzeige_seq > RESULT: A0150SUPRALEITER > > I loaded the same table in my server, but can't get the same order with > psql: > > > katkey | normform > --------+--------------------------------------------------------- > 233871 | A > ... > 92938 | ゲアハルト・A・リッター > 92938 | ゲルハルト・A・リッター > 92938 | リッター0ゲルハルト・A > > i.e. the Japanese rows are sorted at the end after all others. Why? > I forgot to say, both database have: psql -l ... ueisis | sisis | UTF8 | de_DE.UTF-8 | de_DE.UTF-8 | matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Matthias Apitz <guru@unixarea.de> writes: > El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió: >> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query: >> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC; >> coming out in this order: >> ... >> I loaded the same table in my server, but can't get the same order with >> psql: Do the two machines produce the same results if you sort the data in question with sort(1)? (Being careful to set LANG=de_DE.UTF-8 of course.) I rather doubt this has anything to do with Postgres as such; there are lots of inter-system and inter-release discrepancies in collation behavior. regards, tom lane
El día jueves, febrero 03, 2022 a las 10:00:37 -0500, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió: > >> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query: > >> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC; > >> coming out in this order: > >> ... > >> I loaded the same table in my server, but can't get the same order with > >> psql: > > Do the two machines produce the same results if you sort the data in > question with sort(1)? (Being careful to set LANG=de_DE.UTF-8 of > course.) I rather doubt this has anything to do with Postgres as such; > there are lots of inter-system and inter-release discrepancies in > collation behavior. I've created a small text file 'swd' with some of the UTF-8 lines in question. On my Linux mobile phone (a Debian Linux) it gives: purism@pureos:~$ LANG=de_DE.UTF-8 sort swd A A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード On my FreeBSD laptop the same file sorts as guru@c720-r368166:~ $ LANG=de_DE.UTF-8 sort swd A ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER I'll check tomorrow with the admins of the remote server (a SuSE Linux). Thanks for the hint. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Attachment
On Fri, Feb 4, 2022 at 8:11 AM Matthias Apitz <guru@unixarea.de> wrote: > On my FreeBSD laptop the same file sorts as > > guru@c720-r368166:~ $ LANG=de_DE.UTF-8 sort swd > A > ゲアハルト・A・リッター > ゲルハルト・A・リッター > チャールズ・A・ビアード > A010STRUKTUR > A010STRUKTUR > A010STRUKTUR > A0150SUPRALEITER Wow, so it's one thing to have a different default "script order" than glibc and ICU (which is something you can customise IIRC), but isn't something broken here if the Japanese text comes between "A" and "A0..."?? Hmm, it's almost as if it completely ignored the Japanese text. From my FreeBSD box: tmunro=> select * from t order by x collate "de_DE.UTF-8"; x -------------------------- ゲアハルト A ゲアハルト・A・リッター A0 A010STRUKTUR AA ゲアハルト・AA・リッター ゲアハルト・B・リッター (8 rows) tmunro=> select * from t order by x collate "ja_JP.UTF-8"; x -------------------------- A A0 A010STRUKTUR AA ゲアハルト ゲアハルト・AA・リッター ゲアハルト・A・リッター ゲアハルト・B・リッター (8 rows) Seems like something to investigate in FreeBSD land.
El día Donnerstag, Februar 03, 2022 a las 10:00:37 -0500, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > El día jueves, febrero 03, 2022 a las 11:14:55 +0100, Matthias Apitz escribió: > >> With ESQL/C on a PostgreSQL 13.1 server I see the result of this query: > >> select katkey,normform from swd_anzeige where normform >= 'A' ORDER BY ASC; > >> coming out in this order: > >> ... > >> I loaded the same table in my server, but can't get the same order with > >> psql: > > Do the two machines produce the same results if you sort the data in > question with sort(1)? (Being careful to set LANG=de_DE.UTF-8 of > course.) I rather doubt this has anything to do with Postgres as such; > there are lots of inter-system and inter-release discrepancies in > collation behavior. No, they do not. I gathered from the admin of the remote (customer) server the output of sort(1) in dependency of LANG and LC_ALL (see below). Only when I let her unset the UTF-8 env vars, the result is with the Japanese lines at the end, with UTF-8 env they're sorted at the beginning. On my own server (the only difference is that my server is SUSE Linux Enterprise Server 15 SP3, while the customer still runs SP2) I never get the Japanese lines on top with the same commands as used by the remote admin. I requested now in addition the output of ls -l /lib64/libc.* /usr/lib/locale/de_DE.utf8 to see if the libc version is different, mine is libc-2.31.so Said that, does the SORT done by the server depends on the environment (LANG, LC_*) in which the PostgreSQL server is started or only of the sp_catalog information of the database in question? Thanks matthias LC_ALL=de_DE.UTF-8 sort swd A ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER LANG=de_DE.UTF-8 sort swd A ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER sort swd A ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER env | grep LC LC_ALL=de_DE.UTF-8 env | grep LANG LANG=de_DE.UTF-8 unset LC_ALL LC_COLLATE LANG sort swd A A010STRUKTUR A010STRUKTUR A010STRUKTUR A0150SUPRALEITER ゲアハルト・A・リッター ゲルハルト・A・リッター チャールズ・A・ビアード -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!
Matthias Apitz <guru@unixarea.de> writes: > Said that, does the SORT done by the server depends on the environment > (LANG, LC_*) in which the PostgreSQL server is started or only of the > sp_catalog information of the database in question? We set LC_COLLATE and LC_CTYPE from the database's relevant pg_database fields. The remaining LC environment variables shouldn't affect this (and I think we force them to "C" anyway). regards, tom lane