Thread: BUG #16570: Collation not working
The following bug has been logged on the website: Bug reference: 16570 Logged by: Arnaud Perrier Email address: arnaud.perrier@gmail.com PostgreSQL version: 12.3 Operating system: Windows / Linux Description: I've installed Postgresql 12 on both Wnindows and Linux CentOS 8. For my project, I needed to create my own ICU Collation and apply it to all character columns (either in column creation or order by requests = preferred). Before doing so, I tried to make a simple test case to check if it's works as expected ... and it's not. A simple table with some numeric and alphabetic data DROP TABLE IF EXISTS TBL; CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) ); INSERT INTO TBL VALUES ('aaaaa', 'aaaaa') ,('aaaaa', '00000') ,('aaaaa', 'bbbbb') ,('aaaaa', '11111') ,('bbbbb', '22222') ,('00000', '22222') ,('ccccc', '22222') ,('11111', '22222'); I've installed Postgresql 12 on both Wnindows and Linux CentOS 8. For my project, I needed to create my own ICU Collation and apply it to all character columns (either in column creation or order by requests = preferred). Before doing so, I tried to make a simple test case to check if it's works as expected ... and it's not. A simple table with some numeric and alphabetic data DROP TABLE IF EXISTS TBL; CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) ); INSERT INTO TBL VALUES ('aaaaa', 'aaaaa') ,('aaaaa', '00000') ,('aaaaa', 'bbbbb') ,('aaaaa', '11111') ,('bbbbb', '22222') ,('00000', '22222') ,('ccccc', '22222') ,('11111', '22222'); The collation to order digits after latin characters from the official documentation https://www.postgresql.org/docs/12/collation.html CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit'); CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit'); Sort digits after Latin letters. (The default is digits before letters.) I've installed Postgresql 12 on both Wnindows and Linux CentOS 8. For my project, I needed to create my own ICU Collation and apply it to all character columns (either in column creation or order by requests = preferred). Before doing so, I tried to make a simple test case to check if it's works as expected ... and it's not. A simple table with some numeric and alphabetic data DROP TABLE IF EXISTS TBL; CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) ); INSERT INTO TBL VALUES ('aaaaa', 'aaaaa') ,('aaaaa', '00000') ,('aaaaa', 'bbbbb') ,('aaaaa', '11111') ,('bbbbb', '22222') ,('00000', '22222') ,('ccccc', '22222') ,('11111', '22222'); The collation to order digits after latin characters from the official documentation https://www.postgresql.org/docs/12/collation.html CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit'); CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit'); Sort digits after Latin letters. (The default is digits before letters.) The testing requests 1/ SELECT * FROM TBL; 2/ SELECT * FROM TBL ORDER BY TEXT1, TEXT2; 3/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast; 4/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast; 5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE digitslast; 6/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast, TEXT1 COLLATE digitslast; The results on Windows = collation not works at all 1/ OK = TEXT1 + TEXT2 = digits before letters 2/ OK = TEXT1 + TEXT2 = digits before letters 3/ KO = TEXT1 + TEXT2 = digits before letters 4/ KO = TEXT1 + TEXT2 = digits before letters 5/ KO = TEXT1 + TEXT2 = digits before letters 6/ KO = TEXT1 + TEXT2 = digits before letters The results on Linux Centos 8 = collation works only for 1st column from ORDER BY 1/ OK = TEXT1 + TEXT2 = digits before letters 2/ OK = TEXT1 + TEXT2 = digits before letters 3/ OK = TEXT1 = digits after letters + TEXT2 = digits before letters 4/ OK = TEXT1 = digits before letters + TEXT2 = digits after letters 5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters 6/ KO = TEXT1 = digits before letters + TEXT2 = digits after letters If I applied the COLLATE clause on the table columns, it's not working also. ICU 53 is packaged to all Postgresql 12 installation and --with-icu argument is present on Postgresql configuration. Does anybody has any clue to make this simple test case works ? Post also on stack overflow : https://stackoverflow.com/questions/63054543/postgresql-12-collation-not-working-windows-linux
PG Bug reporting form wrote: > The collation to order digits after latin characters from the official > documentation https://www.postgresql.org/docs/12/collation.html > > CREATE COLLATION digitslast (provider = icu, locale = > 'en-u-kr-latn-digit'); The doc has this caveat: "The examples using the k* subtags require at least ICU version 54." Since you're using ICU 53, the syntax mentioned above for the locale argument doesn't work. It doesn't error out, it's just being ignored. You want to use exclusively the "old" syntax with option=value instead of tags, for instance: locale = 'en@colReorder=latn-digit', or build with a newer ICU version. Unfortunately the EDB builds for Windows (maybe for linux too I didn't check) ship with ICU 53, although this version is 6 years old. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Hi Daniel,
Thanks for your reply.
It does not work with locale = 'en@colReorder=latn-digit' on Windows / Linux also.
I've got the same results as mentioned previously.
I cannot make my own build because my project will probably rely on a Postgresql service cloud provider (default installation, no extension like icu_ext, ...).
1/ Is-it possible to request a newer ICU version for Postgresql 13 builds (for all platforms) and how to do it ?
2/ Is-it possible to request for a "debug" mode for ICU feature to see when it is applied/ignored ?
Best regards,
Arnaud
Le lun. 3 août 2020 à 19:35, Daniel Verite <daniel@manitou-mail.org> a écrit :
PG Bug reporting form wrote:
> The collation to order digits after latin characters from the official
> documentation https://www.postgresql.org/docs/12/collation.html
>
> CREATE COLLATION digitslast (provider = icu, locale =
> 'en-u-kr-latn-digit');
The doc has this caveat:
"The examples using the k* subtags require at least ICU version 54."
Since you're using ICU 53, the syntax mentioned above for the locale
argument doesn't work. It doesn't error out, it's just being ignored.
You want to use exclusively the "old" syntax with option=value
instead of tags, for instance:
locale = 'en@colReorder=latn-digit',
or build with a newer ICU version.
Unfortunately the EDB builds for Windows (maybe for linux
too I didn't check) ship with ICU 53, although this version
is 6 years old.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
Arnaud Perrier wrote: > It does not work with locale = 'en@colReorder=latn-digit' on Windows / > Linux also. > I've got the same results as mentioned previously. For ICU pre-54 versions, the parsing of collation attributes are done by Postgres. Checking with the current source code, it appears not to support colReorder. See src/backend/utils/adt/pg_locale.c: /* * Parse collation attributes and apply them to the open collator. This takes * a string like "und@colStrength=primary;colCaseLevel=yes" and parses and * applies the key-value arguments. * * Starting with ICU version 54, the attributes are processed automatically by * ucol_open(), so this is only necessary for emulating this behavior on older * versions. */ pg_attribute_unused() static void icu_set_collation_attributes(UCollator *collator, const char *loc) { [...] if (strcmp(name, "colstrength") == 0) uattr = UCOL_STRENGTH; else if (strcmp(name, "colbackwards") == 0) uattr = UCOL_FRENCH_COLLATION; else if (strcmp(name, "colcaselevel") == 0) uattr = UCOL_CASE_LEVEL; else if (strcmp(name, "colcasefirst") == 0) uattr = UCOL_CASE_FIRST; else if (strcmp(name, "colalternate") == 0) uattr = UCOL_ALTERNATE_HANDLING; else if (strcmp(name, "colnormalization") == 0) uattr = UCOL_NORMALIZATION_MODE; else if (strcmp(name, "colnumeric") == 0) uattr = UCOL_NUMERIC_COLLATION; else /* ignore if unknown */ continue; [...] So with ICU-53, it seems to me that there's no way to use colReorder. > 1/ Is-it possible to request a newer ICU version for Postgresql 13 builds > (for all platforms) and how to do it ? I've asked in [1] for Windows. Other platforms tend to provide ICU as a separate package, and Postgres gets built with whatever is current for that platform/OS version/distribution at the time of its release. BTW Centos 8 is reasonably recent, it seems to go with ICU-60, so if it there fails too, that should be investigated separately. > 2/ Is-it possible to request for a "debug" mode for ICU feature to see when > it is applied/ignored ? Some kind of "EXPLAIN COLLATION" would be useful indeed. [1] https://www.postgresql.org/message-id/43f9b6b6-6fae-47bb-8f4d-8e84af354523@manitou-mail.org Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > PG Bug reporting form wrote: >> The collation to order digits after latin characters from the official >> documentation https://www.postgresql.org/docs/12/collation.html >> >> CREATE COLLATION digitslast (provider = icu, locale = >> 'en-u-kr-latn-digit'); > The doc has this caveat: > "The examples using the k* subtags require at least ICU version 54." > Since you're using ICU 53, the syntax mentioned above for the locale > argument doesn't work. It doesn't error out, it's just being ignored. It seems from this bug report that the OP copied-and-pasted *both* variants of the example CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit'); CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit'); without noticing the caveat (which is a good deal further down anyway). I wonder if we could improve this by dropping the separate caveat and writing each example like: CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit'); or for ICU versions before 54, CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit'); That'd be a little repetitive, but there are only five such examples, and it'd be way harder to mistake the meaning then. regards, tom lane
Daniel,
Thanks for your precise explanation!
I double checked and Windows installer comes with ICU 53 while Linux Centos 8 comes with ICU 60.
On Linux Centos 8 with ICU 60, the collation works but only for the first argument of the ORDER BY clause (see below test requests 5 and 6).
Do I need to create a separate issue for that ?
1/ SELECT * FROM TBL;
1/ OK = TEXT1 + TEXT2 = digits before letters
1/ OK = TEXT1 + TEXT2 = digits before letters
2/ SELECT * FROM TBL ORDER BY TEXT1, TEXT2;
2/ OK = TEXT1 + TEXT2 = digits before letters
2/ OK = TEXT1 + TEXT2 = digits before letters
3/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast;
3/ OK = TEXT1 = digits after letters + TEXT2 = digits before letters
3/ OK = TEXT1 = digits after letters + TEXT2 = digits before letters
4/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast;
4/ OK = TEXT1 = digits before letters + TEXT2 = digits after letters
4/ OK = TEXT1 = digits before letters + TEXT2 = digits after letters
5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE
digitslast;
5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters
digitslast;
5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters
6/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast, TEXT1 COLLATE
digitslast;
digitslast;
6/ KO = TEXT1 = digits before letters + TEXT2 = digits after letters
Best regards,
Arnaud
Le mar. 4 août 2020 à 16:32, Daniel Verite <daniel@manitou-mail.org> a écrit :
Arnaud Perrier wrote:
> It does not work with locale = 'en@colReorder=latn-digit' on Windows /
> Linux also.
> I've got the same results as mentioned previously.
For ICU pre-54 versions, the parsing of collation attributes are
done by Postgres. Checking with the current source code,
it appears not to support colReorder.
See src/backend/utils/adt/pg_locale.c:
/*
* Parse collation attributes and apply them to the open collator. This
takes
* a string like "und@colStrength=primary;colCaseLevel=yes" and parses and
* applies the key-value arguments.
*
* Starting with ICU version 54, the attributes are processed automatically
by
* ucol_open(), so this is only necessary for emulating this behavior on
older
* versions.
*/
pg_attribute_unused()
static void
icu_set_collation_attributes(UCollator *collator, const char *loc)
{
[...]
if (strcmp(name, "colstrength") == 0)
uattr = UCOL_STRENGTH;
else if (strcmp(name, "colbackwards") == 0)
uattr = UCOL_FRENCH_COLLATION;
else if (strcmp(name, "colcaselevel") == 0)
uattr = UCOL_CASE_LEVEL;
else if (strcmp(name, "colcasefirst") == 0)
uattr = UCOL_CASE_FIRST;
else if (strcmp(name, "colalternate") == 0)
uattr = UCOL_ALTERNATE_HANDLING;
else if (strcmp(name, "colnormalization") == 0)
uattr = UCOL_NORMALIZATION_MODE;
else if (strcmp(name, "colnumeric") == 0)
uattr = UCOL_NUMERIC_COLLATION;
else
/* ignore if unknown */
continue;
[...]
So with ICU-53, it seems to me that there's no way to use colReorder.
> 1/ Is-it possible to request a newer ICU version for Postgresql 13 builds
> (for all platforms) and how to do it ?
I've asked in [1] for Windows. Other platforms tend to provide ICU as
a separate package, and Postgres gets built with whatever is current
for that platform/OS version/distribution at the time of its release.
BTW Centos 8 is reasonably recent, it seems to go with ICU-60,
so if it there fails too, that should be investigated separately.
> 2/ Is-it possible to request for a "debug" mode for ICU feature to see when
> it is applied/ignored ?
Some kind of "EXPLAIN COLLATION" would be useful indeed.
[1]
https://www.postgresql.org/message-id/43f9b6b6-6fae-47bb-8f4d-8e84af354523@manitou-mail.org
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
Arnaud Perrier wrote: > 5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE > digitslast; > 5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters I'm seeing wrong results too, and I don't really understand what's going on with this collation. Here's a simpler test with the current dev branch (14devel) and ICU 60.2-3ubuntu. =# CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit'); =# select * from (values('1'),('a')) as list(x) order by x collate digitslast; x --- a 1 The digit is sorted after the letter, OK. But then why would an inequality test report the opposite? =# select 'a' < '1' collate digitslast; ?column? ---------- f I'm wondering whether there is a discrepancy between the binary sort keys and results of direct comparisons by ucol_strcoll() for this collation. Maybe that could explain why the second level of ORDER BY produces results that are inconsistent with the first level. With icu_ext we can visualize the sort keys and see that key('a') if lower than key('1'). =# select icu_sort_key('a' collate digitslast) as sort_key_a, icu_sort_key('1' collate digitslast) as sort_key_1 sort_key_a | sort_key_1 --------------+-------------- \x0f01050105 | \x4901050105 But then a direct comparison says the opposite, just like the '<' operator as above. =# select icu_compare('a', '1', 'en@colReorder=latn-digit'); icu_compare ------------- 1 icu_compare() is a simple interface on top of ucol_strcollUTF8() or ucol_strcoll(), and 1 as a result maps to UCOL_GREATER, so unless I'm missing something, this is looking more like an ICU bug than a Postgres bug. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Hi, On Wed, 05 Aug 2020 16:49:49 +0200 "Daniel Verite" <daniel@manitou-mail.org> wrote: [...] > I'm wondering whether there is a discrepancy between the binary > sort keys and results of direct comparisons by ucol_strcoll() > for this collation. [...] This bug has been discussed back in 2018 and a customer hit it few weeks ago as well. Here is my last analysis, various link to histories, a workaround and a even an humble patch proposal to discuss: https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d I even sent a patch to the ICU community to add a regression test exposing the bug, but I had no answer. Hopefully the workaround is good enough for you. Regards,
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes: > This bug has been discussed back in 2018 and a customer hit it few weeks ago > as well. Here is my last analysis, various link to histories, a workaround and a > even an humble patch proposal to discuss: > https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d I don't see that listed in the commitfest queue, would you add it to make sure we don't forget it? regards, tom lane
On Wed, 05 Aug 2020 17:30:59 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes: > > This bug has been discussed back in 2018 and a customer hit it few weeks ago > > as well. Here is my last analysis, various link to histories, a workaround > > and a even an humble patch proposal to discuss: > > https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d > > I don't see that listed in the commitfest queue, would you add it to > make sure we don't forget it? I just created it using the mail I bounced to pgsql-hackers three weeks ago: https://commitfest.postgresql.org/29/2672/ It seems we can not link a thread from pgsql-bugs, where the main history and various link exists for this bug :/ Thanks,