Thread: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit
BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17571 Logged by: Daniel Halsey Email address: daniel.halsey@vectorsolutions.com PostgreSQL version: Unsupported/Unknown Operating system: Debian (Docker PG15b2 Bullseye) Description: I'm using an "official" 15beta2 bullseye Docker image for this: When using a nondeterministic ICU collation as the default collation for a database, sorting (ORDER BY) without a COLLATE subclause acts differently than the same query with one. Instead of treating differently cased characters as truly equivalent, it appears to treat casing as a tiebreaker within a column. Multi-column ORDER BY clauses reveal this difference, as shown in results from select statements in the demo code below. This doesn't appear to duplicate an existing TODO, as far as I can determine. -- Test code: ----- -- as sa (postgres) create collation if not exists "und-sorttest-x-icu" (provider=icu, locale='und-u-ks-level2-kc-false', deterministic=false); create database sorttest with owner = postgres encoding = 'UTF8' locale_provider = 'icu' icu_locale = 'und-sorttest-x-icu' connection limit = -1 template = template0; -- Connect to new "sorttest" db create table sort_test ( id bigserial primary key, sortableOne varchar(50), sortableTwo varchar(50), sortableThree int ); insert into sort_test (sortableOne, sortableTwo, sortableThree) values ('Abc', 'B', 3), ('AbC', 'A', 3), ('AbC', 'B', 1), ('ABc', 'C', 3), ('AbC', 'C', 2), ('ABC', 'C', 1), ('ABc', 'A', 2) ; -- Natural/index ordering select * from sort_test; -- Lower before upper in sortableOne, without regards to sortableTwo secondary ordering select * from sort_test order by sortableOne, sortableTwo; -- Truly non-deterministic sort on sortableOne, following secondary ordering on sortableTwo select * from sort_test order by sortableOne collate "und-sorttest-x-icu", sortableTwo; -- Additional test using tertiary sort select * from sort_test order by sortableOne, sortableTwo, sortableThree; -- Additional test using tertiary sort select * from sort_test order by sortableOne collate "und-sorttest-x-icu", sortableTwo, sortableThree; -- LIKE clause treating default collation as deterministic -- This is fine/preferred, since we can use it without specifying a separate COLLATE subclause for the LIKE select * from sort_test where sortableOne like 'Ab%' order by sortableOne collate "und-sorttest-x-icu", sortableTwo, sortableThree; -- ILIKE working as desired select * from sort_test where sortableOne ilike 'ab%' order by sortableOne collate "und-sorttest-x-icu", sortableTwo, sortableThree; ----- END Test code
Re: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit
From
Julien Rouhaud
Date:
Hi, On Thu, Aug 04, 2022 at 02:56:32PM +0000, PG Bug reporting form wrote: > > Bug reference: 17571 > Logged by: Daniel Halsey > Email address: daniel.halsey@vectorsolutions.com > PostgreSQL version: Unsupported/Unknown > Operating system: Debian (Docker PG15b2 Bullseye) > Description: > > I'm using an "official" 15beta2 bullseye Docker image for this: When using a > nondeterministic ICU collation as the default collation for a database, > sorting (ORDER BY) without a COLLATE subclause acts differently than the > same query with one. This is working as expected. As mentioned in CREATE DATABASE documentation (https://www.postgresql.org/docs/15/sql-createdatabase.html): There is currently no option to use a database locale with nondeterministic comparisons (see CREATE COLLATION for an explanation). If this is needed, then per-column collations would need to be used. And indeed: > create collation if not exists "und-sorttest-x-icu" > (provider=icu, > locale='und-u-ks-level2-kc-false', > deterministic=false); This is an collation created in a specific database, and doesn't exist outside. > create database sorttest with > owner = postgres > encoding = 'UTF8' > locale_provider = 'icu' > icu_locale = 'und-sorttest-x-icu' > connection limit = -1 > template = template0; Here "und-sorttest-x-icu "is just a string passed to the ICU library, which probably understands it as the root collation, definitely not what you defined in the other database.
RE: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit
From
Daniel Halsey
Date:
Thank you for the quick reply, Julien, I'd missed seeing that note in the CREATE docs. I also found an issue with my test block, in that when running through this,I must have inadvertently re-run the create collation statement in the newly created db, which is why it was available.I re-ran it as written just now, and got an error that I should have in the first place that the named collationdoesn't exist for that db. This will be frustrating to work around, since it'll require injecting COLLATE sub-clauses for all order by clauses (or likeclauses, if we re-define our columns to use a non-deterministic collation). Is support for non-deterministic collation at the db level on the roadmap? Is there a query to determine what the actual/effective collation settings are (per the underlying provider) for a givendb (since pg_database.daticulocale x pg_catalog.pg_collation.collname doesn't necessarily return what's going to beused)? Thank you again, Daniel -----Original Message----- From: Julien Rouhaud <rjuju123@gmail.com> Sent: Thursday, August 4, 2022 11:18 AM To: Daniel Halsey <daniel.halsey@vectorsolutions.com>; pgsql-bugs@lists.postgresql.org Subject: Re: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit Hi, On Thu, Aug 04, 2022 at 02:56:32PM +0000, PG Bug reporting form wrote: > > Bug reference: 17571 > Logged by: Daniel Halsey > Email address: daniel.halsey@vectorsolutions.com > PostgreSQL version: Unsupported/Unknown > Operating system: Debian (Docker PG15b2 Bullseye) > Description: > > I'm using an "official" 15beta2 bullseye Docker image for this: When > using a nondeterministic ICU collation as the default collation for a > database, sorting (ORDER BY) without a COLLATE subclause acts > differently than the same query with one. This is working as expected. As mentioned in CREATE DATABASE documentation (https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.postgresql.org%2Fdocs%2F15%2Fsql-createdatabase.html&data=05%7C01%7Cdaniel.halsey%40vectorsolutions.com%7Ceed514c046194c339cdc08da762c7eed%7Cfd01ebd7e586432592d27ad43688e011%7C0%7C0%7C637952230721474583%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=S32wfQ3r05xhhIirx2c7eEmvnn46G5KDBKR9%2FIfrTaU%3D&reserved=0): There is currently no option to use a database locale with nondeterministic comparisons (see CREATE COLLATION for an explanation).If this is needed, then per-column collations would need to be used. And indeed: > create collation if not exists "und-sorttest-x-icu" > (provider=icu, > locale='und-u-ks-level2-kc-false', > deterministic=false); This is an collation created in a specific database, and doesn't exist outside. > create database sorttest with > owner = postgres > encoding = 'UTF8' > locale_provider = 'icu' > icu_locale = 'und-sorttest-x-icu' > connection limit = -1 > template = template0; Here "und-sorttest-x-icu "is just a string passed to the ICU library, which probably understands it as the root collation,definitely not what you defined in the other database.
Re: BUG #17571: PostgreSQL 15b2 ICU collation defined at db level sorts differently in collate clause vs implicit
From
Julien Rouhaud
Date:
On Thu, Aug 04, 2022 at 03:59:12PM +0000, Daniel Halsey wrote: > > This will be frustrating to work around, since it'll require injecting > COLLATE sub-clauses for all order by clauses (or like clauses, if we > re-define our columns to use a non-deterministic collation). > Is support for non-deterministic collation at the db level on the roadmap? No one is working on that at the moment as far as I know. The problem is that we would need to support LIKE and regex for non deterministic collation first, as those are use in system views. > Is there a query to determine what the actual/effective collation settings > are (per the underlying provider) for a given db (since > pg_database.daticulocale x pg_catalog.pg_collation.collname doesn't > necessarily return what's going to be used)? I think there's no misunderstanding here, pg_database.daticulocale isn't supposed to refer to a collation name in the database, it's the actual locale string passed to ICU, ie. the same as pg_collation.colliculocale. So you have to check what ICU will exactly be doing for a given locale string, which is a bit troublesome as it tends to accept anything and fallback to its "root" locale. You can also use Daniel Vérité's excellent icu_ext extension at https://github.com/dverite/icu_ext. This provides a lot of useful function, including icu_collation_attributes(), which for the "und-sorttest-x-icu" locale string returns: =# SELECT * FROM icu_collation_attributes('und-sorttest-x-icu'); attribute | value -------------+---------------------------------------------- displayname | Unknown language (SORTTEST, Private-Use=icu) kn | false kb | false kk | false ka | noignore ks | level3 kf | false kc | false kv | punct version | 153.112 (10 rows)