Todd Lang wrote:
> When creating a collation, in
> https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/pg_locale_icu.c#L461
> it is opening the collator with the tailoring rules supplied. However, it
> has hardcoded the strength level UCOL_DEFAULT_STRENGTH. This has the effect
> of ignoring the "deterministic=false" you may have specified in your CREATE
> COLLATION call.
This is related to BUG #18771 previously reported at [1],
where the reporter notes that passing UCOL_DEFAULT works
for him whereas UCOL_DEFAULT_STRENGTH does not.
It looks like a documentation bug in ICU [2]. The documentation of
the strength argument says:
strength: The default collation strength; one of UCOL_PRIMARY,
UCOL_SECONDARY, UCOL_TERTIARY, UCOL_IDENTICAL,UCOL_DEFAULT_STRENGTH
- can be also set in the rules.
But UCOL_DEFAULT_STRENGTH is an alias for UCOL_TERTIARY.
UCOL_DEFAULT is what should normally be passed to not override the
collation strength.
Here, "it works" means that the strength expressed in the rules
(with rules = '[strength 1]' in the case of the OP) takes effect.
This syntax is described at [3] (see the "Rule Syntax" column).
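
To make the difference between the two constants concrete, here is a
small toy model in C. It is not ICU code: the MODEL_ constants are
local copies of the values I believe unicode/ucol.h uses, and
model_effective_strength() is a hypothetical helper that only encodes
the behavior described in this thread (an explicit strength argument
overrides the rules, UCOL_DEFAULT does not). Please check it against
your ICU version before relying on it.

```c
#include <assert.h>

/*
 * Local copies of the relevant constants from ICU's unicode/ucol.h.
 * Assumption: the values match the ICU header; the MODEL_ prefix marks
 * them as stand-ins, not the real identifiers.
 */
enum model_strength
{
    MODEL_UCOL_DEFAULT = -1,
    MODEL_UCOL_PRIMARY = 0,
    MODEL_UCOL_SECONDARY = 1,
    MODEL_UCOL_TERTIARY = 2,
    /* The "default strength" constant is just another name for tertiary. */
    MODEL_UCOL_DEFAULT_STRENGTH = MODEL_UCOL_TERTIARY
};

/*
 * Hypothetical helper modeling which strength ends up in effect.
 *
 * strength_arg:  what the caller passes as the strength argument.
 * rule_strength: the strength set by the rules string (for example
 *                MODEL_UCOL_PRIMARY for '[strength 1]'), or
 *                MODEL_UCOL_DEFAULT if the rules do not set one.
 *
 * Strength given through locale keywords is deliberately not modeled.
 */
static int
model_effective_strength(int strength_arg, int rule_strength)
{
    assert(strength_arg >= MODEL_UCOL_DEFAULT);

    if (strength_arg != MODEL_UCOL_DEFAULT)
        return strength_arg;        /* explicit argument wins over the rules */
    if (rule_strength != MODEL_UCOL_DEFAULT)
        return rule_strength;       /* the rules decide */
    return MODEL_UCOL_TERTIARY;     /* nothing specified: tertiary */
}
```

In this model, model_effective_strength(MODEL_UCOL_DEFAULT_STRENGTH,
MODEL_UCOL_PRIMARY) yields tertiary, so the '[strength 1]' rule is
lost, whereas passing MODEL_UCOL_DEFAULT keeps the primary strength
requested by the rule.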
There is a second problem: when the strength is specified in
the locale and not in the rules (as you did), it would also
be expected to take effect. That does not appear to be the case;
it is as if the rules were resetting the collation settings.
As mentioned in the thread at [1], Peter Eisentraut has submitted
this as a bug [4], but there hasn't been any follow-up to it
in 2.5 years.
> If, instead of UCOL_DEFAULT_STRENGTH, the code understood the
> deterministic parameter and passed either UCOL_PRIMARY for
> "deterministic=true", and UCOL_SECONDARY for "deterministic=false",
> this would preserve the attempt to obtain case-insensitivity in the
> locale while simultaneously allowing tailoring as expected.
We can't hardcode that deterministic=false implies a strength of 2
(secondary). deterministic=false only says that the collation can
treat strings as equal even when they are not binary-equal.
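
As a rough sketch of why the two settings are independent, consider
the following C fragment. It is only an illustration, not Postgres
code: model_collate() is a hypothetical stand-in for the collator
(POSIX strcasecmp() plays the role of a collation under which "abc"
and "ABC" compare equal), and model_compare() shows the
"deterministic" flag acting only as a bytewise tie-break applied after
the collator has answered.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>            /* strcasecmp (POSIX) */

/*
 * Hypothetical stand-in for the collator's comparison. Whatever strength
 * the collator uses is decided in here, not by the deterministic flag.
 */
static int
model_collate(const char *a, const char *b)
{
    return strcasecmp(a, b);
}

/*
 * Hypothetical comparison wrapper. With deterministic = true, strings
 * that the collator considers equal are further ordered bytewise, so
 * only byte-identical strings compare equal. With deterministic = false,
 * the collator's answer is returned unchanged.
 */
static int
model_compare(const char *a, const char *b, bool deterministic)
{
    int         result;

    assert(a != NULL && b != NULL);

    result = model_collate(a, b);
    if (result == 0 && deterministic)
        result = strcmp(a, b);  /* tie-break on the raw bytes */
    return result;
}
```

With this sketch, model_compare("abc", "ABC", false) is 0 while
model_compare("abc", "ABC", true) is nonzero. The flag changes what
happens to ties, and says nothing about which strength produced them.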
To me, the most plausible fix on the Postgres side would be to pass
UCOL_DEFAULT instead of UCOL_DEFAULT_STRENGTH, as in the attached
patch. That lets the user specify the strength in the rules, as the
OP did in [1].
[1]:
https://www.postgresql.org/message-id/flat/18771-98bb23e455b0f367%40postgresql.org
[2]:
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a0cb1ddd81f322ed24e389f208eb35c8a
[3]: https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[4]: https://unicode-org.atlassian.net/browse/ICU-22456
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/