Thread: BUG #16570: Collation not working

BUG #16570: Collation not working

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16570
Logged by:          Arnaud Perrier
Email address:      arnaud.perrier@gmail.com
PostgreSQL version: 12.3
Operating system:   Windows / Linux
Description:

I've installed Postgresql 12 on both Wnindows and Linux CentOS 8.

For my project, I needed to create my own ICU Collation and apply it to all
character columns (either in column creation or order by requests =
preferred).

Before doing so, I tried to make a simple test case to check if it's works
as expected ... and it's not.

A simple table with some numeric and alphabetic data

DROP TABLE IF EXISTS TBL;
CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) );
INSERT INTO TBL VALUES
 ('aaaaa', 'aaaaa')
,('aaaaa', '00000')
,('aaaaa', 'bbbbb')
,('aaaaa', '11111')

,('bbbbb', '22222')
,('00000', '22222')
,('ccccc', '22222')
,('11111', '22222');

I've installed Postgresql 12 on both Wnindows and Linux CentOS 8.

For my project, I needed to create my own ICU Collation and apply it to all
character columns (either in column creation or order by requests =
preferred).

Before doing so, I tried to make a simple test case to check if it's works
as expected ... and it's not.

A simple table with some numeric and alphabetic data

DROP TABLE IF EXISTS TBL;
CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) );
INSERT INTO TBL VALUES
 ('aaaaa', 'aaaaa')
,('aaaaa', '00000')
,('aaaaa', 'bbbbb')
,('aaaaa', '11111')

,('bbbbb', '22222')
,('00000', '22222')
,('ccccc', '22222')
,('11111', '22222');

The collation to order digits after latin characters from the official
documentation https://www.postgresql.org/docs/12/collation.html

CREATE COLLATION digitslast (provider = icu, locale =
'en-u-kr-latn-digit');
CREATE COLLATION digitslast (provider = icu, locale =
'en@colReorder=latn-digit');
    Sort digits after Latin letters. (The default is digits before
letters.)

I've installed Postgresql 12 on both Wnindows and Linux CentOS 8.

For my project, I needed to create my own ICU Collation and apply it to all
character columns (either in column creation or order by requests =
preferred).

Before doing so, I tried to make a simple test case to check if it's works
as expected ... and it's not.

A simple table with some numeric and alphabetic data

DROP TABLE IF EXISTS TBL;
CREATE TABLE TBL ( TEXT1 CHARACTER(5), TEXT2 CHARACTER(5) );
INSERT INTO TBL VALUES
 ('aaaaa', 'aaaaa')
,('aaaaa', '00000')
,('aaaaa', 'bbbbb')
,('aaaaa', '11111')

,('bbbbb', '22222')
,('00000', '22222')
,('ccccc', '22222')
,('11111', '22222');
The collation to order digits after latin characters from the official
documentation https://www.postgresql.org/docs/12/collation.html

CREATE COLLATION digitslast (provider = icu, locale =
'en-u-kr-latn-digit');
CREATE COLLATION digitslast (provider = icu, locale =
'en@colReorder=latn-digit');
    Sort digits after Latin letters. (The default is digits before
letters.)

The testing requests

1/ SELECT * FROM TBL;
2/ SELECT * FROM TBL ORDER BY TEXT1, TEXT2;
3/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast;
4/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast;
5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE
digitslast;
6/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast, TEXT1 COLLATE
digitslast;
The results on Windows = collation not works at all

1/ OK = TEXT1 + TEXT2 = digits before letters
2/ OK = TEXT1 + TEXT2 = digits before letters
3/ KO = TEXT1 + TEXT2 = digits before letters
4/ KO = TEXT1 + TEXT2 = digits before letters
5/ KO = TEXT1 + TEXT2 = digits before letters
6/ KO = TEXT1 + TEXT2 = digits before letters
The results on Linux Centos 8 = collation works only for 1st column from
ORDER BY

1/ OK = TEXT1 + TEXT2 = digits before letters
2/ OK = TEXT1 + TEXT2 = digits before letters
3/ OK = TEXT1 = digits after letters + TEXT2 = digits before letters
4/ OK = TEXT1 = digits before letters + TEXT2 = digits after letters
5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters
6/ KO = TEXT1 = digits before letters + TEXT2 = digits after letters
If I applied the COLLATE clause on the table columns, it's not working
also.

ICU 53 is packaged to all Postgresql 12 installation and --with-icu argument
is present on Postgresql configuration.

Does anybody has any clue to make this simple test case works ?

Post also on stack overflow :
https://stackoverflow.com/questions/63054543/postgresql-12-collation-not-working-windows-linux


Re: BUG #16570: Collation not working

From
"Daniel Verite"
Date:
    PG Bug reporting form wrote:

> The collation to order digits after latin characters from the official
> documentation https://www.postgresql.org/docs/12/collation.html
>
> CREATE COLLATION digitslast (provider = icu, locale =
> 'en-u-kr-latn-digit');

The doc has this caveat:
"The examples using the k* subtags require at least ICU version 54."

Since you're using ICU 53,  the syntax mentioned above for the locale
argument doesn't work. It doesn't error out, it's just being ignored.

You want to use exclusively the "old" syntax with option=value
instead of tags, for instance:
   locale = 'en@colReorder=latn-digit',
or build with a newer ICU version.

Unfortunately the EDB builds for Windows (maybe for linux
too I didn't check) ship with ICU 53, although this version
is 6 years old.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: BUG #16570: Collation not working

From
Arnaud Perrier
Date:
Hi Daniel,

Thanks for your reply.

It does not work with locale = 'en@colReorder=latn-digit' on Windows / Linux also.
I've got the same results as mentioned previously.

I cannot make my own build because my project will probably rely on a Postgresql service cloud provider (default installation, no extension like icu_ext, ...).

1/ Is-it possible to request a newer ICU version for Postgresql 13 builds (for all platforms) and how to do it ?
2/ Is-it possible to request for a "debug" mode for ICU feature to see when it is applied/ignored ?

Best regards,
Arnaud


Le lun. 3 août 2020 à 19:35, Daniel Verite <daniel@manitou-mail.org> a écrit :
        PG Bug reporting form wrote:

> The collation to order digits after latin characters from the official
> documentation https://www.postgresql.org/docs/12/collation.html
>
> CREATE COLLATION digitslast (provider = icu, locale =
> 'en-u-kr-latn-digit');

The doc has this caveat:
"The examples using the k* subtags require at least ICU version 54."

Since you're using ICU 53,  the syntax mentioned above for the locale
argument doesn't work. It doesn't error out, it's just being ignored.

You want to use exclusively the "old" syntax with option=value
instead of tags, for instance:
   locale = 'en@colReorder=latn-digit',
or build with a newer ICU version.

Unfortunately the EDB builds for Windows (maybe for linux
too I didn't check) ship with ICU 53, although this version
is 6 years old.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: BUG #16570: Collation not working

From
"Daniel Verite"
Date:
    Arnaud Perrier wrote:

> It does not work with locale = 'en@colReorder=latn-digit' on Windows /
> Linux also.
> I've got the same results as mentioned previously.

For ICU pre-54 versions, the parsing of collation attributes are
done by Postgres. Checking with the current source code,
it appears not to support colReorder.

See src/backend/utils/adt/pg_locale.c:

/*
 * Parse collation attributes and apply them to the open collator.  This
takes
 * a string like "und@colStrength=primary;colCaseLevel=yes" and parses and
 * applies the key-value arguments.
 *
 * Starting with ICU version 54, the attributes are processed automatically
by
 * ucol_open(), so this is only necessary for emulating this behavior on
older
 * versions.
 */
pg_attribute_unused()
static void
icu_set_collation_attributes(UCollator *collator, const char *loc)
{
[...]

      if (strcmp(name, "colstrength") == 0)
          uattr = UCOL_STRENGTH;
      else if (strcmp(name, "colbackwards") == 0)
          uattr = UCOL_FRENCH_COLLATION;
      else if (strcmp(name, "colcaselevel") == 0)
          uattr = UCOL_CASE_LEVEL;
      else if (strcmp(name, "colcasefirst") == 0)
          uattr = UCOL_CASE_FIRST;
      else if (strcmp(name, "colalternate") == 0)
          uattr = UCOL_ALTERNATE_HANDLING;
      else if (strcmp(name, "colnormalization") == 0)
          uattr = UCOL_NORMALIZATION_MODE;
      else if (strcmp(name, "colnumeric") == 0)
          uattr = UCOL_NUMERIC_COLLATION;
      else
          /* ignore if unknown */
          continue;
[...]

So with ICU-53, it seems to me that there's no way to use colReorder.


> 1/ Is-it possible to request a newer ICU version for Postgresql 13 builds
> (for all platforms) and how to do it ?

I've asked in [1] for Windows. Other platforms tend to provide ICU as
a separate package, and Postgres gets built with whatever is current
for that platform/OS version/distribution at the time of its release.
BTW Centos 8 is reasonably recent, it seems to go with ICU-60,
so if it there fails too, that should be investigated separately.

> 2/ Is-it possible to request for a "debug" mode for ICU feature to see when
> it is applied/ignored ?

Some kind of "EXPLAIN COLLATION" would be useful indeed.


[1]
https://www.postgresql.org/message-id/43f9b6b6-6fae-47bb-8f4d-8e84af354523@manitou-mail.org


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: BUG #16570: Collation not working

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     PG Bug reporting form wrote:
>> The collation to order digits after latin characters from the official
>> documentation https://www.postgresql.org/docs/12/collation.html
>>
>> CREATE COLLATION digitslast (provider = icu, locale =
>> 'en-u-kr-latn-digit');

> The doc has this caveat:
> "The examples using the k* subtags require at least ICU version 54."
> Since you're using ICU 53,  the syntax mentioned above for the locale
> argument doesn't work. It doesn't error out, it's just being ignored.

It seems from this bug report that the OP copied-and-pasted *both* variants
of the example

    CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit');
    CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit');

without noticing the caveat (which is a good deal further down anyway).

I wonder if we could improve this by dropping the separate caveat and
writing each example like:

    CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit');
   or for ICU versions before 54,
    CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit');

That'd be a little repetitive, but there are only five such examples,
and it'd be way harder to mistake the meaning then.

            regards, tom lane



Re: BUG #16570: Collation not working

From
Arnaud Perrier
Date:
Daniel,

Thanks for your precise explanation!

I double checked and Windows installer comes with ICU 53 while Linux Centos 8 comes with ICU 60.

On Linux Centos 8 with ICU 60, the collation works but only for the first argument of the ORDER BY clause (see below test requests 5 and 6).

Do I need to create a separate issue for that ?

1/ SELECT * FROM TBL;
1/ OK = TEXT1 + TEXT2 = digits before letters

2/ SELECT * FROM TBL ORDER BY TEXT1, TEXT2;
2/ OK = TEXT1 + TEXT2 = digits before letters

3/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast;
3/ OK = TEXT1 = digits after letters + TEXT2 = digits before letters

4/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast;
4/ OK = TEXT1 = digits before letters + TEXT2 = digits after letters

5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE
digitslast;
5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters

6/ SELECT * FROM TBL ORDER BY TEXT2 COLLATE digitslast, TEXT1 COLLATE
digitslast; 
6/ KO = TEXT1 = digits before letters + TEXT2 = digits after letters  

Best regards,
Arnaud


Le mar. 4 août 2020 à 16:32, Daniel Verite <daniel@manitou-mail.org> a écrit :
        Arnaud Perrier wrote:

> It does not work with locale = 'en@colReorder=latn-digit' on Windows /
> Linux also.
> I've got the same results as mentioned previously.

For ICU pre-54 versions, the parsing of collation attributes are
done by Postgres. Checking with the current source code,
it appears not to support colReorder.

See src/backend/utils/adt/pg_locale.c:

/*
 * Parse collation attributes and apply them to the open collator.  This
takes
 * a string like "und@colStrength=primary;colCaseLevel=yes" and parses and
 * applies the key-value arguments.
 *
 * Starting with ICU version 54, the attributes are processed automatically
by
 * ucol_open(), so this is only necessary for emulating this behavior on
older
 * versions.
 */
pg_attribute_unused()
static void
icu_set_collation_attributes(UCollator *collator, const char *loc)
{
[...]

      if (strcmp(name, "colstrength") == 0)
              uattr = UCOL_STRENGTH;
      else if (strcmp(name, "colbackwards") == 0)
              uattr = UCOL_FRENCH_COLLATION;
      else if (strcmp(name, "colcaselevel") == 0)
              uattr = UCOL_CASE_LEVEL;
      else if (strcmp(name, "colcasefirst") == 0)
              uattr = UCOL_CASE_FIRST;
      else if (strcmp(name, "colalternate") == 0)
              uattr = UCOL_ALTERNATE_HANDLING;
      else if (strcmp(name, "colnormalization") == 0)
              uattr = UCOL_NORMALIZATION_MODE;
      else if (strcmp(name, "colnumeric") == 0)
              uattr = UCOL_NUMERIC_COLLATION;
      else
              /* ignore if unknown */
              continue;
[...]

So with ICU-53, it seems to me that there's no way to use colReorder.


> 1/ Is-it possible to request a newer ICU version for Postgresql 13 builds
> (for all platforms) and how to do it ?

I've asked in [1] for Windows. Other platforms tend to provide ICU as
a separate package, and Postgres gets built with whatever is current
for that platform/OS version/distribution at the time of its release.
BTW Centos 8 is reasonably recent, it seems to go with ICU-60,
so if it there fails too, that should be investigated separately.

> 2/ Is-it possible to request for a "debug" mode for ICU feature to see when
> it is applied/ignored ?

Some kind of "EXPLAIN COLLATION" would be useful indeed.


[1]
https://www.postgresql.org/message-id/43f9b6b6-6fae-47bb-8f4d-8e84af354523@manitou-mail.org


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: BUG #16570: Collation not working

From
"Daniel Verite"
Date:
    Arnaud Perrier wrote:

> 5/ SELECT * FROM TBL ORDER BY TEXT1 COLLATE digitslast, TEXT2 COLLATE
> digitslast;
> 5/ KO = TEXT1 = digits after letters + TEXT2 = digits before letters

I'm seeing wrong results too, and I don't really understand what's going
on with this collation.

Here's a simpler test with the current dev branch (14devel)
and ICU 60.2-3ubuntu.

=# CREATE COLLATION digitslast (provider = icu, locale =
'en@colReorder=latn-digit');

=# select * from (values('1'),('a')) as list(x) order by x collate
digitslast;
 x
---
 a
 1

The digit is sorted after the letter, OK.

But then why would an inequality test report the opposite?

=# select 'a' < '1' collate digitslast;
 ?column?
----------
 f

I'm wondering whether there is a discrepancy between the binary
sort keys and results of direct comparisons by ucol_strcoll()
for this collation.
Maybe that could explain why the second level of ORDER BY
produces results that are inconsistent with the first level.

With icu_ext we can visualize the sort keys and see that key('a')
if lower than key('1').

=# select icu_sort_key('a' collate digitslast) as sort_key_a,
  icu_sort_key('1' collate digitslast) as sort_key_1

   sort_key_a  |  sort_key_1
--------------+--------------
 \x0f01050105 | \x4901050105

But then a direct comparison says the opposite, just like the
'<' operator as above.

=# select icu_compare('a', '1', 'en@colReorder=latn-digit');
 icu_compare
-------------
       1

icu_compare() is a simple interface on top of ucol_strcollUTF8()
or ucol_strcoll(), and 1 as a result maps to UCOL_GREATER,  so unless
I'm missing something, this is looking more like an ICU bug than a
Postgres bug.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: BUG #16570: Collation not working

From
Jehan-Guillaume de Rorthais
Date:
Hi, 

On Wed, 05 Aug 2020 16:49:49 +0200
"Daniel Verite" <daniel@manitou-mail.org> wrote:
[...]
> I'm wondering whether there is a discrepancy between the binary
> sort keys and results of direct comparisons by ucol_strcoll()
> for this collation.
[...]

This bug has been discussed back in 2018 and a customer hit it few weeks ago
as well. Here is my last analysis, various link to histories, a workaround and a
even an humble patch proposal to discuss:

https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d

I even sent a patch to the ICU community to add a regression test exposing the
bug, but I had no answer.

Hopefully the workaround is good enough for you.

Regards,



Re: BUG #16570: Collation not working

From
Tom Lane
Date:
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> This bug has been discussed back in 2018 and a customer hit it few weeks ago
> as well. Here is my last analysis, various link to histories, a workaround and a
> even an humble patch proposal to discuss:
> https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d

I don't see that listed in the commitfest queue, would you add it to
make sure we don't forget it?

            regards, tom lane



Re: BUG #16570: Collation not working

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 05 Aug 2020 17:30:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> > This bug has been discussed back in 2018 and a customer hit it few weeks ago
> > as well. Here is my last analysis, various link to histories, a workaround
> > and a even an humble patch proposal to discuss:
> > https://www.postgresql.org/message-id/flat/20200610002933.6a6d482b%40firost#acba071ed158e0c03ea9a106d9b0fd6d  
> 
> I don't see that listed in the commitfest queue, would you add it to
> make sure we don't forget it?

I just created it using the mail I bounced to pgsql-hackers three weeks ago:
https://commitfest.postgresql.org/29/2672/

It seems we can not link a thread from pgsql-bugs, where the main history and
various link exists for this bug :/

Thanks,