On Fri, 2025-09-19 at 17:21 -0700, Jeff Davis wrote:
> --------
> Proposal
> --------
>
> New builtin case-insensitive collation PG_UNICODE_CI, where the
> ordering semantics are just:
>
> strcmp(CASEFOLD(arg1), CASEFOLD(arg2))
>
> and the character semantics are the same as PG_UNICODE_FAST.
I think that this is interesting.
> ----------
> Motivation
> ----------
>
> Non-deterministic collations cannot be used by SIMILAR TO, and may
> cause problems for ILIKE and regexes. The reason is that pattern
> matching often depends on the character-by-character semantics, but ICU
> collations aren't constrained enough for these semantics to work. See:
>
> However, PG_UNICODE_CI collation does have character-by-character
> semantics which are well-defined for pattern matching.
>
> That takes us a step closer to allowing the database default collation
> to be case-insensitive.
What is still missing for that? Pattern matching?
> ----------
> Versioning
> ----------
>
> Unlike other built-in collations, the order does depend on the version
> of Unicode, so the collation is given a version equal to the version of
> Unicode. (Other builtin collations have a version of "1".)
>
> That means that indexes, including primary keys, can become
> inconsistent after a major version upgrade if the version of Unicode
> has changed. The conditions where this can happen are much narrower
> than with libc or ICU collations:
>
> (a) The database in the prior version must contain code points
> unassigned as of that version; and
> (b) Some of those previously-unassigned code points must be assigned
> to a Cased character in the newer version.
That's an improvement for people who are ready to perform a test upgrade
and check if any indexes are corrupted - they will likely see that none
are, so no index needs to be rebuilt.
I tried your patch.
It works as advertised, and I didn't manage to break it.
Yours,
Laurenz Albe