Thread: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit

The following bug has been logged on the website:

Bug reference:      17571
Logged by:          Daniel Halsey
Email address:      daniel.halsey@vectorsolutions.com
PostgreSQL version: Unsupported/Unknown
Operating system:   Debian (Docker PG15b2 Bullseye)
Description:

I'm using an "official" 15beta2 bullseye Docker image for this: When using a
nondeterministic ICU collation as the default collation for a database,
sorting (ORDER BY) without a COLLATE subclause acts differently than the
same query with one. Instead of treating differently cased characters as
truly equivalent, it appears to treat casing as a tiebreaker within a
column. Multi-column ORDER BY clauses reveal this difference, as shown in
results from select statements in the demo code below. This doesn't appear
to duplicate an existing TODO, as far as I can determine.

-- Test code:
-----
-- as sa (postgres)

create collation if not exists "und-sorttest-x-icu" 
    (provider=icu, 
     locale='und-u-ks-level2-kc-false', 
     deterministic=false);

create database sorttest with
    owner = postgres
    encoding = 'UTF8'
    locale_provider = 'icu'
    icu_locale = 'und-sorttest-x-icu'
    connection limit = -1
    template = template0;


-- Connect to new "sorttest" db

create table sort_test (
    id bigserial primary key,
    sortableOne varchar(50),
    sortableTwo varchar(50),
    sortableThree int
);

insert into sort_test
(sortableOne, sortableTwo, sortableThree)
values 
('Abc', 'B', 3),
('AbC', 'A', 3),
('AbC', 'B', 1),
('ABc', 'C', 3),
('AbC', 'C', 2),
('ABC', 'C', 1),
('ABc', 'A', 2)
;

-- Natural/index ordering
select * from sort_test;

-- Lower before upper in sortableOne, without regards to sortableTwo
secondary ordering
select * from sort_test
order by sortableOne, sortableTwo;

-- Truly non-deterministic sort on sortableOne, following secondary ordering
on sortableTwo
select * from sort_test
order by sortableOne collate "und-sorttest-x-icu", sortableTwo;

-- Additional test using tertiary sort
select * from sort_test
order by sortableOne, sortableTwo, sortableThree;

-- Additional test using tertiary sort
select * from sort_test
order by sortableOne collate "und-sorttest-x-icu", sortableTwo,
sortableThree;

-- LIKE clause treating default collation as deterministic
-- This is fine/preferred, since we can use it without specifying a separate
COLLATE subclause for the LIKE
select * from sort_test
where sortableOne like 'Ab%'
order by sortableOne collate "und-sorttest-x-icu", sortableTwo,
sortableThree;

-- ILIKE working as desired
select * from sort_test
where sortableOne ilike 'ab%'
order by sortableOne collate "und-sorttest-x-icu", sortableTwo,
sortableThree;

----- END Test code


Hi,

On Thu, Aug 04, 2022 at 02:56:32PM +0000, PG Bug reporting form wrote:
>
> Bug reference:      17571
> Logged by:          Daniel Halsey
> Email address:      daniel.halsey@vectorsolutions.com
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Debian (Docker PG15b2 Bullseye)
> Description:
>
> I'm using an "official" 15beta2 bullseye Docker image for this: When using a
> nondeterministic ICU collation as the default collation for a database,
> sorting (ORDER BY) without a COLLATE subclause acts differently than the
> same query with one.

This is working as expected.  As mentioned in CREATE DATABASE documentation
(https://www.postgresql.org/docs/15/sql-createdatabase.html):

There is currently no option to use a database locale with nondeterministic
comparisons (see CREATE COLLATION for an explanation). If this is needed, then
per-column collations would need to be used.

And indeed:

> create collation if not exists "und-sorttest-x-icu"
>     (provider=icu,
>      locale='und-u-ks-level2-kc-false',
>      deterministic=false);

This is an collation created in a specific database, and doesn't exist outside.

> create database sorttest with
>     owner = postgres
>     encoding = 'UTF8'
>     locale_provider = 'icu'
>     icu_locale = 'und-sorttest-x-icu'
>     connection limit = -1
>     template = template0;

Here "und-sorttest-x-icu "is just a string passed to the ICU library, which
probably understands it as the root collation, definitely not what you defined
in the other database.



Thank you for the quick reply, Julien,

I'd missed seeing that note in the CREATE docs. I also found an issue with my test block, in that when running through
this,I must have inadvertently re-run the create collation statement in the newly created db, which is why it was
available.I re-ran it as written just now, and got an error that I should have in the first place that the named
collationdoesn't exist for that db. 

This will be frustrating to work around, since it'll require injecting COLLATE sub-clauses for all order by clauses (or
likeclauses, if we re-define our columns to use a non-deterministic collation).  
Is support for non-deterministic collation at the db level on the roadmap?
Is there a query to determine what the actual/effective collation settings are (per the underlying provider) for a
givendb (since pg_database.daticulocale x pg_catalog.pg_collation.collname doesn't necessarily return what's going to
beused)? 

Thank you again,
Daniel

-----Original Message-----
From: Julien Rouhaud <rjuju123@gmail.com>
Sent: Thursday, August 4, 2022 11:18 AM
To: Daniel Halsey <daniel.halsey@vectorsolutions.com>; pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs
implicit

Hi,

On Thu, Aug 04, 2022 at 02:56:32PM +0000, PG Bug reporting form wrote:
>
> Bug reference:      17571
> Logged by:          Daniel Halsey
> Email address:      daniel.halsey@vectorsolutions.com
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Debian (Docker PG15b2 Bullseye)
> Description:
>
> I'm using an "official" 15beta2 bullseye Docker image for this: When
> using a nondeterministic ICU collation as the default collation for a
> database, sorting (ORDER BY) without a COLLATE subclause acts
> differently than the same query with one.

This is working as expected.  As mentioned in CREATE DATABASE documentation

(https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.postgresql.org%2Fdocs%2F15%2Fsql-createdatabase.html&data=05%7C01%7Cdaniel.halsey%40vectorsolutions.com%7Ceed514c046194c339cdc08da762c7eed%7Cfd01ebd7e586432592d27ad43688e011%7C0%7C0%7C637952230721474583%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=S32wfQ3r05xhhIirx2c7eEmvnn46G5KDBKR9%2FIfrTaU%3D&reserved=0):

There is currently no option to use a database locale with nondeterministic comparisons (see CREATE COLLATION for an
explanation).If this is needed, then per-column collations would need to be used. 

And indeed:

> create collation if not exists "und-sorttest-x-icu"
>       (provider=icu,
>        locale='und-u-ks-level2-kc-false',
>        deterministic=false);

This is an collation created in a specific database, and doesn't exist outside.

> create database sorttest with
>     owner = postgres
>     encoding = 'UTF8'
>     locale_provider = 'icu'
>     icu_locale = 'und-sorttest-x-icu'
>     connection limit = -1
>     template = template0;

Here "und-sorttest-x-icu "is just a string passed to the ICU library, which probably understands it as the root
collation,definitely not what you defined in the other database. 



On Thu, Aug 04, 2022 at 03:59:12PM +0000, Daniel Halsey wrote:
>
> This will be frustrating to work around, since it'll require injecting
> COLLATE sub-clauses for all order by clauses (or like clauses, if we
> re-define our columns to use a non-deterministic collation).
> Is support for non-deterministic collation at the db level on the roadmap?

No one is working on that at the moment as far as I know.

The problem is that we would need to support LIKE and regex for non
deterministic collation first, as those are use in system views.

> Is there a query to determine what the actual/effective collation settings
> are (per the underlying provider) for a given db (since
> pg_database.daticulocale x pg_catalog.pg_collation.collname doesn't
> necessarily return what's going to be used)?

I think there's no misunderstanding here, pg_database.daticulocale isn't
supposed to refer to a collation name in the database, it's the actual locale
string passed to ICU, ie. the same as pg_collation.colliculocale.

So you have to check what ICU will exactly be doing for a given locale string,
which is a bit troublesome as it tends to accept anything and fallback to its
"root" locale.

You can also use Daniel Vérité's excellent icu_ext extension at
https://github.com/dverite/icu_ext.  This provides a lot of useful function,
including icu_collation_attributes(), which for the "und-sorttest-x-icu" locale
string returns:

=# SELECT * FROM icu_collation_attributes('und-sorttest-x-icu');
  attribute  |                    value
-------------+----------------------------------------------
 displayname | Unknown language (SORTTEST, Private-Use=icu)
 kn          | false
 kb          | false
 kk          | false
 ka          | noignore
 ks          | level3
 kf          | false
 kc          | false
 kv          | punct
 version     | 153.112
(10 rows)