Re: Built-in case-insensitive collation pg_unicode_ci - Mailing list pgsql-hackers

From	Laurenz Albe
Subject	Re: Built-in case-insensitive collation pg_unicode_ci
Date	September 24, 2025 18:10:45
Msg-id	f3b42d3ccef71f431f3c8ea436422f3b87867527.camel@cybertec.at
In response to	Built-in case-insensitive collation pg_unicode_ci (Jeff Davis <pgsql@j-davis.com>)
List	pgsql-hackers

On Fri, 2025-09-19 at 17:21 -0700, Jeff Davis wrote:
> --------
> Proposal
> --------
>
> New builtin case-insensitive collation PG_UNICODE_CI, where the
> ordering semantics are just:
>
>    strcmp(CASEFOLD(arg1), CASEFOLD(arg2))
>
> and the character semantics are the same as PG_UNICODE_FAST.

I think that this is interesting.
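
To illustrate the proposed ordering, here is a minimal sketch in Python rather than PostgreSQL C. It assumes that Python's str.casefold() (Unicode full case folding, as of whatever Unicode version the interpreter ships) is a reasonable stand-in for CASEFOLD(), and that strcmp() on UTF-8 is equivalent to a byte-wise comparison. The function name casefold_compare is made up for this example and is not part of the patch.

```python
def casefold_compare(a: str, b: str) -> int:
    """Return -1, 0, or 1, in the spirit of strcmp(CASEFOLD(a), CASEFOLD(b))."""
    # Assumption: str.casefold() approximates CASEFOLD(); the Unicode
    # version used by Python may differ from the one PostgreSQL uses.
    ka = a.casefold().encode("utf-8")
    kb = b.casefold().encode("utf-8")
    # Byte-wise comparison of UTF-8 sorts in code point order.
    return (ka > kb) - (ka < kb)
```

With this sketch, casefold_compare("Straße", "STRASSE") returns 0, because both arguments fold to "strasse", while casefold_compare("apple", "Banana") returns -1.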

> ----------
> Motivation
> ----------
>
> Non-deterministic collations cannot be used by SIMILAR TO, and may
> cause problems for ILIKE and regexes. The reason is that pattern
> matching often depends on the character-by-character semantics, but ICU
> collations aren't constrained enough for these semantics to work. See:
>
> However, PG_UNICODE_CI collation does have character-by-character
> semantics which are well-defined for pattern matching.
>
> That takes us a step closer to allowing the database default collation
> to be case-insensitive.

What is still missing for that?  Pattern matching?


> ----------
> Versioning
> ----------
>
> Unlike other built-in collations, the order does depend on the version
> of Unicode, so the collation is given a version equal to the version of
> Unicode. (Other builtin collations have a version of "1".)
>
> That means that indexes, including primary keys, can become
> inconsistent after a major version upgrade if the version of Unicode
> has changed. The conditions where this can happen are much narrower
> than with libc or ICU collations:
>
>   (a) The database in the prior version must contain code points
> unassigned as of that version; and
>   (b) Some of those previously-unassigned code points must be assigned
> to a Cased character in the newer version.

That's an improvement for people who are willing to perform a test upgrade
and check whether any indexes are corrupted - they will likely find that
none are, so no index needs to be rebuilt.
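
Condition (a) can also be checked on exported data before an upgrade. The following is only a rough sketch under stated assumptions: it uses Python's unicodedata module, so "unassigned" means general category Cn in the Unicode version bundled with that Python, which is not necessarily the version PostgreSQL uses. Cn also covers noncharacters. The function name unassigned_code_points is hypothetical.

```python
import unicodedata


def unassigned_code_points(s: str) -> list[str]:
    """List code points in s that are unassigned (category Cn)."""
    # Assumption: Python's Unicode tables stand in for the server's.
    return [
        f"U+{ord(ch):04X}"
        for ch in s
        if unicodedata.category(ch) == "Cn"
    ]
```

A value for which this returns an empty list cannot satisfy condition (a), at least with respect to the Unicode version Python was built with.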

I tried your patch.
It works as advertised, and I didn't manage to break it.

Yours,
Laurenz Albe
