Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Msg-id 6b1370d5eaba5e8c42f54c05f7bc2b8e27b8db12.camel@j-davis.com
In response to Re: Built-in CTYPE provider  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
>
> But C.UTF-8 is not available everywhere, and there's still the
> problem that Unicode updates through libc are not aligned
> with Postgres releases.

Attached is an implementation of a built-in provider for the "C.UTF-8"
locale. That way applications (and tests!) can count on C.UTF-8 always
being available on any platform; and it also aligns with the Postgres
Unicode updates. Documentation is sparse and the patch is a bit rough,
but feedback is welcome -- it does have some basic tests which can be
used as a guide.

The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple
collation semantics (code point order) but rich ctype semantics
(lower/upper/initcap and regexes). This locale is for users who want
proper Unicode semantics for character operations (upper/lower,
regexes), but don't need a specific natural-language string sort order
to apply to all queries and indexes in their system. One might use it
as the database default collation, and use COLLATE clauses (e.g.
COLLATE UNICODE) where more specific behavior is needed.
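
As a rough illustration of that split between simple collation and rich
ctype behavior, here is a small Python sketch. It is only an analogy:
Python's str type and its own Unicode tables stand in for the locale, and
nothing here calls Postgres or reflects the patch's code. The word list is
made up for the example.

```python
# Illustrative analogy only; Python's str stands in for the locale.
words = ["zebra", "Zebra", "apple", "\u00c9clair", "\u00e9clair", "banana"]

# Python compares str values code point by code point, which is the
# ordering rule described above for C.UTF-8 collation. Note that all
# ASCII uppercase letters sort before lowercase, and accented letters
# sort after plain ASCII: simple and stable, but not a natural-language order.
code_point_order = sorted(words)

# Case mapping is Unicode-aware, so non-ASCII letters are mapped too
# (an ASCII-only ctype would leave U+00E9 untouched).
uppercased = [w.upper() for w in words]

print(code_point_order)
print(uppercased)
```

A natural-language collation would instead place "apple" before "Zebra"
and keep the accented words near "e", which is the kind of behavior one
would ask for explicitly with a COLLATE clause.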

The builtin C.UTF-8 locale has the following advantages over using the
libc C.UTF-8 locale:

  * Collation performance: the builtin provider uses memcmp and
abbreviated keys. With the libc provider, these optimizations are only
available for the C locale.

  * Unicode version is aligned with other parts of Postgres, like
normalization.

  * Available on all platforms with exactly the same semantics.

  * Testable and documentable.

  * Avoids index corruption risks. In theory libc C.UTF-8 should also
have stable collation, but in practice that is not guaranteed. In the
builtin provider the collation is fully stable.
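
To make the memcmp and abbreviated-key point concrete, here is a minimal
Python sketch of the general technique. It assumes valid UTF-8 input and
relies on the fact that bytewise order of UTF-8 matches code point order.
The helper names (abbreviated_key, compare) and the 8-byte key width are
hypothetical choices for illustration, not the patch's or Postgres's actual
implementation.

```python
import functools


def abbreviated_key(s: str, width: int = 8) -> int:
    """Pack the first `width` UTF-8 bytes into a big-endian integer.

    Hypothetical helper: shorter strings are padded with zero bytes, so
    comparing two keys never contradicts a full bytewise comparison.
    """
    prefix = s.encode("utf-8")[:width].ljust(width, b"\x00")
    return int.from_bytes(prefix, "big")


def compare(a: str, b: str) -> int:
    """Compare by abbreviated key first, then fall back to all bytes."""
    ka, kb = abbreviated_key(a), abbreviated_key(b)
    if ka != kb:
        # Cheap integer comparison resolves most pairs.
        return -1 if ka < kb else 1
    # Keys tie (equal or long shared prefix): compare the full byte
    # strings, which is the analog of memcmp on the encoded values.
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ba > bb) - (ba < bb)


sample = ["zebra", "Zebra", "\u00e9clair", "\U0001F600", "\uffff",
          "apple", "applesauce", "applesauces"]

by_bytes = sorted(sample, key=lambda s: s.encode("utf-8"))
by_abbrev = sorted(sample, key=functools.cmp_to_key(compare))
by_code_point = sorted(sample)

# All three orderings agree for valid UTF-8 text.
print(by_bytes == by_code_point, by_abbrev == by_code_point)
```

Because the order depends only on the encoded bytes, it cannot change when
a library or Unicode version is updated, which is the stability property
the last bullet relies on.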

Regards,
    Jeff Davis


