On Wed, 2023-06-14 at 15:55 -0700, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale;
> it's implemented internally with memcmp for collation and
> pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin",
> which only supports "C" and "POSIX".
Rebased patch attached.

I got some generally positive comments, but it needs some more feedback
on the specifics to be committable.

This might be a good time to summarize my thoughts on collation after
my work in v16:
* Picking a database default collation other than UCS_BASIC (a.k.a.
"C", a.k.a. memcmp(), a.k.a. provider=builtin) is something that should
be done intentionally. It's an impactful choice that affects semantics,
performance, and upgrades/deployment. Beyond that, our implementation
still lacks a good way to manage versions of collation provider
libraries and track object dependencies in a safe way to prevent index
corruption, so the safest choice is really just to use stable memcmp()
semantics.
* The defaults for initdb seem bad in a number of ways, but it's too
hard to change that default now (I tried in v16 and reverted it). So
the job of making reasonable choices is left to higher-level tools and
documentation.
* We can handle the collation and character classification
independently. The main use case is to set the collation to memcmp()
semantics (for stability and performance) and set the character
classification to something interesting (on the grounds that it's more
likely to be stable and less likely to be used in an index than a
collation). Right now the only way to do that is to use the libc
provider and set the collation to C and the ctype to a libc locale. But
there is also a use case for having ICU as the provider for character
classification. One option is to have separate datcolprovider=b
(builtin provider) and datctypeprovider=i, so that the collation would
be handled with memcmp and the character classification with daticulocale.
It feels like we're growing the fields in pg_database a little too
much, but the use case seems valid, and perhaps we can reorganize the
catalog representation a bit.
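
To make the collation/ctype split in the last point concrete, here is a
minimal C sketch of what "memcmp() semantics" and ASCII-only case folding
amount to. It is illustrative only: bytewise_cmp and ascii_tolower are
hypothetical names, not functions from the patch or from the PostgreSQL
source, and the real code paths handle more cases than this.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: order two byte strings by memcmp() over the common
 * prefix, with the shorter string sorting first on a tie. No locale
 * library is consulted, so the result cannot change when libc or ICU is
 * upgraded.
 */
static int
bytewise_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t minlen = (alen < blen) ? alen : blen;
    int    r = memcmp(a, b, minlen);

    if (r != 0)
        return (r < 0) ? -1 : 1;
    if (alen == blen)
        return 0;
    return (alen < blen) ? -1 : 1;
}

/*
 * Hypothetical helper: ASCII-only lowercasing, in the spirit of
 * pg_ascii_tolower. Bytes outside 'A'..'Z' are returned unchanged, so
 * non-ASCII characters are never case-folded.
 */
static unsigned char
ascii_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}
```

With these definitions, "Zebra" sorts before "apple" because 'Z' (0x5A) is
less than 'a' (0x61), which is the kind of ordering a linguistic collation
would not produce. A separate ctype provider would replace only the second
function and leave the comparison untouched.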
--
Jeff Davis
PostgreSQL Contributor Team - AWS