RE: Supporting non-deterministic collations with tailoring rules. - Mailing list pgsql-hackers

From Todd Lang
Subject RE: Supporting non-deterministic collations with tailoring rules.
Date
Msg-id YT2PPF95923661846DF142CDE265CA2F781BE1FA@YT2PPF959236618.CANPRD01.PROD.OUTLOOK.COM
In response to Re: Supporting non-deterministic collations with tailoring rules.  ("Daniel Verite" <daniel@manitou-mail.org>)
Responses RE: Supporting non-deterministic collations with tailoring rules.
List pgsql-hackers
Ah, somehow I missed your email on this.  This is, in fact, exactly what should happen.  The ICU folks are updating
their documentation to reflect this with https://github.com/unicode-org/icu/pull/3684/files .

Is this small change a reasonable thing to include given the update in guidance from the ICU team?

-----Original Message-----
From: Daniel Verite <daniel@manitou-mail.org>
Sent: Wednesday, September 24, 2025 6:17 AM
To: Todd Lang <Todd.Lang@D2L.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Supporting non-deterministic collations with tailoring rules.


        Todd Lang wrote:

> When creating a collation, in
> https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/pg_locale_icu.c#L461
> it is opening the collator with the tailoring rules
> supplied. However, it has hardcoded the strength level
> UCOL_DEFAULT_STRENGTH. This has the effect of ignoring the
> "deterministic=false" you may have specified in your CREATE COLLATION
> call.


This is related to BUG #18771 previously reported at [1], where the reporter notes that passing UCOL_DEFAULT works for
him whereas UCOL_DEFAULT_STRENGTH does not.
It looks like a documentation bug in ICU [2]. It says:

  strength: The default collation strength; one of UCOL_PRIMARY,
  UCOL_SECONDARY, UCOL_TERTIARY, UCOL_IDENTICAL,UCOL_DEFAULT_STRENGTH
  - can be also set in the rules.

But UCOL_DEFAULT_STRENGTH is an alias for UCOL_TERTIARY.
UCOL_DEFAULT is what should normally be passed to not override the collation strength.

Here, "it works" means that the strength expressed in the rules (rules = '[strength 1]' in the case of the
OP) takes effect.
This syntax is described at [3] (see the "Rule Syntax" column).
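
To make the alias problem concrete, here is a small self-contained sketch. It is a toy model, not ICU or PostgreSQL code: the SK_* names and effective_strength() are hypothetical, and the numeric values are assumed to mirror the constants in ICU's ucol.h. It only models the behavior described above, where passing the "default" marker leaves the strength from the rules alone and any other value overrides it.

```c
#include <assert.h>

/* Hypothetical stand-ins for the ICU constants discussed above.
 * Assumption: the values mirror ICU's ucol.h, where
 * UCOL_DEFAULT_STRENGTH is defined as UCOL_TERTIARY. */
typedef enum
{
    SK_DEFAULT = -1,                    /* "do not override" marker */
    SK_PRIMARY = 0,
    SK_SECONDARY = 1,
    SK_TERTIARY = 2,
    SK_DEFAULT_STRENGTH = SK_TERTIARY,  /* an alias, not a marker */
    SK_IDENTICAL = 15
} SketchStrength;

/* Toy model: the strength a collator ends up with, given the strength
 * argument passed when opening it and the strength set by the rules. */
SketchStrength
effective_strength(SketchStrength arg, SketchStrength from_rules)
{
    return (arg == SK_DEFAULT) ? from_rules : arg;
}

/* Sanity check that the alias really is indistinguishable from tertiary. */
void
check_alias(void)
{
    assert(SK_DEFAULT_STRENGTH == SK_TERTIARY);
}
```

In this model, rules containing '[strength 1]' (primary) combined with SK_DEFAULT_STRENGTH end up at tertiary strength, whereas the same rules combined with SK_DEFAULT stay at primary.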


There is a second problem: when the strength is specified in the locale and not specified in the rules (as you did), it
would also be expected to take effect. It does not appear to be the case, as if the rules were resetting the collation
settings.
As mentioned in the thread at [1], Peter Eisentraut has submitted this as a bug [4], but there hasn't been any
follow-up to it in 2.5 years.

> If, instead of UCOL_DEFAULT_STRENGTH, the code understood the
> deterministic parameter and passed either UCOL_PRIMARY for
> "deterministic=true", and UCOL_SECONDARY for "deterministic=false",
> this would preserve the attempt to obtain case-insensitivity in the
> locale while simultaneously allowing tailoring as expected.

We can't hardcode that deterministic=false implies that the strength is 2. deterministic=false only says that the
collation can have equal strings that are not binary-equal.
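
A rough illustration of that distinction, as a toy model rather than PostgreSQL code: toy_collate() is a hypothetical stand-in for a collator that ignores ASCII case, and toy_compare() assumes that a deterministic comparison breaks ties byte-wise. The deterministic flag only decides whether that tie-break happens; it says nothing about which strength the collator itself uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a collator: compares ASCII strings while
 * ignoring case. Any strength could sit behind this function. */
int
toy_collate(const char *a, const char *b)
{
    for (; *a && *b; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
    }
    if (*a)
        return 1;               /* a is longer */
    if (*b)
        return -1;              /* b is longer */
    return 0;
}

/* Assumed model of the flag: a deterministic comparison breaks collator
 * ties byte-wise, a non-deterministic one keeps the strings equal. */
int
toy_compare(const char *a, const char *b, bool deterministic)
{
    int r;

    assert(a != NULL && b != NULL);
    r = toy_collate(a, b);
    if (r == 0 && deterministic)
        r = strcmp(a, b);
    return r;
}
```

With this model, "abc" and "ABC" compare as equal only when deterministic is false, even though the collator is the same in both cases.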

To me, the most plausible fix on the Postgres side would be to pass UCOL_DEFAULT instead of UCOL_DEFAULT_STRENGTH as in
the attached, which lets the user specify the strength in the rule, as the OP did in [1].


[1]:
https://www.postgresql.org/message-id/flat/18771-98bb23e455b0f367%40postgresql.org
[2]:
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a0cb1ddd81f322ed24e389f208eb35c8a
[3]: https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[4]: https://unicode-org.atlassian.net/browse/ICU-22456


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/


