Thread: Wrong sorting on docker image
Greetings
Starting from version 12.0 official docker image switched from Debian-stretch to Debian-bullseye and from that point we have a huge pain with sorting issues on Russian collation.
Dockerfile:
FROM postgres:14
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8
ENV LANG ru_RU.utf8
postgres=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------
PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
Issue:
postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name;
name
------
ёа
ея
(2 строки)
еqя should go before ёqа
postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU";
?column?
----------
f
(1 строка)
And should be TRUE here
Any idea how to fix that?
We are not able manage this for 3 years already (((
On 2021-10-16 13:50:31 +0300, Oleksandr Voytsekhovskyy wrote: > Starting from version 12.0 official docker image switched from Debian-stretch > to Debian-bullseye and from that point we have a huge pain with sorting issues > on Russian collation. [...] > Issue: > > postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name; > name > ------ > ёа > ея > (2 строки) > > > > еqя should go before ёqа Same with the sort command in the shell. > postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU"; > ?column? > ---------- > f > (1 строка) > > And should be TRUE here > > Any idea how to fix that? Since the collation is defined by the OS (or rather its C library) I think this should be reported to Debian or possibly the glibc maintainers. PostgreSQL can also use the ICU locales instead of those provided by the OS. Have you tried that? hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
Oleksandr Voytsekhovskyy <young.inbox@gmail.com> writes: > Starting from version 12.0 official docker image switched from Debian-stretch to Debian-bullseye and from that point wehave a huge pain with sorting issues on Russian collation. Yeah, Debian versions after stretch adopted the significant glibc locale data changes (sorting rule changes) that are discussed at [1]. > ея should go before ёа I'm not qualified to have an opinion on that point, but one would hope that the glibc people who changed the sorting rules are qualified. If you disagree, you need to go discuss it with glibc. Postgres doesn't define any text sorting rules, we just use what libc or ICU tells us. Speaking of ICU, if you are using an ICU-enabled Postgres build, maybe you could find an ICU collation that acts the way you want. This wouldn't be a perfect solution, because we don't yet have the ability to set an ICU collation as a database's default. But you can attach ICU collations to individual text columns, and maybe that would be a good enough workaround. regards, tom lane [1] https://wiki.postgresql.org/wiki/Locale_data_changes
On Sun, Oct 17, 2021 at 4:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Speaking of ICU, if you are using an ICU-enabled Postgres build, > maybe you could find an ICU collation that acts the way you want. > This wouldn't be a perfect solution, because we don't yet have > the ability to set an ICU collation as a database's default. > But you can attach ICU collations to individual text columns, > and maybe that would be a good enough workaround. For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with glibc on these sort orders, so I suspect this might be coming from CLDR/UCA/DUCET/ISO 14651 common/synchronised data. It does look quite suspicious to me, but I don't know Russian and I'm only speculating wildly here: it does look as if ё is perhaps getting a lower weight than it should. That said, it seems strange that something so basic should be wrong. Nosing around in the unicode.org issue tracker, it seems as though some people might think there is something funny about Ё (and I wonder if there are/were similar issues with й/Й): https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22 https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22 (and more) It's probably not a great idea, but for the record, you can build your own collation for glibc and other POSIX-oid systems. For example, see glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they previously had customisations on top of the iso14651_t1 file to reorder a special Ukrainian character in ru_RU, so in theory you could reorder ё/Ё with a similar local hack and call it ru_RU_X... I also wonder if there is some magic switch you can put after an @ symbol on ICU collations that would change this, perhaps some way to disable the "contractions" that are potentially implicated here. Not sure.