Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 4a69d067374d2f6bfb66f5bfb2ab9a020493d49f.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers
On Tue, 2024-02-13 at 07:24 +0100, Peter Eisentraut wrote:
> It is my understanding that "correct" Unicode case conversion needs to
> use at least some parts of SpecialCasing.txt.
...
> I think we need to use the "Unconditional" mappings and the
> "Conditional Language-Insensitive" mappings (which is just Greek sigma).
> Obviously, skip the "Language-Sensitive" mappings.

Attached is a new patch series.

Overall I'm quite happy with this feature as well as the recent
updates. It expands a lot on what behavior we can actually document;
the character semantics are nearly as good as ICU; it's fast; and it
eliminates what is arguably the last reason to use libc ("C collation
combined with some other CTYPE").

Changes:

 * Added a doc update for the "standard collations" (tiny patch, mostly
separate) which clarifies the collations that are always available, and
describes them a bit better

 * Added built-in locale "UCS_BASIC" (is that name confusing?) which
uses full case mapping and the standard properties:
   - "ß" uppercases to "SS"
   - "Σ" usually lowercases to "σ", except when the Final_Sigma
condition is met, in which case it lowercases to "ς"
   - initcap() uses titlecase variants ("dž" changes to "Dž")
   - in patterns/regexes, symbols (like "=") are not treated as
punctuation

 * Changed the UCS_BASIC collation to use the builtin "UCS_BASIC"
locale with Unicode semantics. At first I was skeptical because it's a
behavior change, and I am still not sure we want to do that. But doing
so would take us closer to both the SQL spec and Unicode, and this kind
of character behavior change is less likely to cause a problem than a
collation behavior change.

 * The built-in locale "C.UTF-8" still exists, which uses Unicode
simple case mapping and the POSIX compatible properties (no change
here).
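
The full case mapping behaviors listed above can be seen outside
PostgreSQL as well. As a rough illustration only (this is Python's
Unicode support, not the patch's C code, and the variable names are
made up for the example), the snippet below shows the same Unicode
rules: a mapping that changes string length, the Final_Sigma
condition, a titlecase variant, and "=" being a symbol rather than
punctuation in Unicode terms while POSIX-style classification treats
it as punctuation.

```python
import string
import unicodedata

# Full case mapping may change the length: U+00DF (sharp s) uppercases to "SS".
sharp_s_upper = "\u00df".upper()

# A lone capital sigma lowercases to the regular small sigma (U+03C3).
lone_sigma = "\u03a3".lower()

# At the end of a word (here the Greek word "ΟΔΟΣ") the Final_Sigma
# condition is met, so the result ends in U+03C2 instead.
word_final_sigma = "\u039f\u0394\u039f\u03a3".lower()

# Titlecasing the digraph U+01C6 yields the titlecase variant U+01C5,
# not the uppercase form U+01C4.
titlecased = "\u01c6".title()

# Unicode general category of "=" is Sm (Symbol, math), not a P* category.
equals_category = unicodedata.category("=")

# ASCII/POSIX-style punctuation sets do include "=".
equals_is_posix_punct = "=" in string.punctuation

print(sharp_s_upper, lone_sigma, word_final_sigma, titlecased,
      equals_category, equals_is_posix_punct)
```

This says nothing about how the built-in provider is implemented; it
only shows which results the Unicode rules call for.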

Implementation-wise:

 * I introduced the CaseKind enum, which seemed to clean up a few
things and reduce code duplication between upper/lower/titlecase. It
also leaves room for introducing case folding later.

 * Introduced a "case-ignorable" table to properly implement the
Final_Sigma rule.
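
To make the role of the case-ignorable table concrete, here is a
minimal sketch of the Final_Sigma check in Python. It is not the
patch's code. The helper names are hypothetical, and both
is_case_ignorable() and is_cased() are approximations built from
general categories plus a handful of explicit characters; the real
Case_Ignorable and Cased properties come from the Unicode data files.

```python
import unicodedata

CAPITAL_SIGMA = "\u03a3"  # Σ
SMALL_SIGMA = "\u03c3"    # σ
FINAL_SIGMA = "\u03c2"    # ς

# Approximation (assumption): general categories that count as
# case-ignorable, plus a few MidLetter/MidNumLet/Single_Quote characters.
_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
_IGNORABLE_EXTRA = {"'", ".", ":", "\u00b7", "\u2019"}


def is_case_ignorable(ch):
    return (ch in _IGNORABLE_EXTRA
            or unicodedata.category(ch) in _IGNORABLE_CATEGORIES)


def is_cased(ch):
    # Approximation (assumption): treat Lu, Ll and Lt letters as cased.
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}


def is_final_sigma(text, i):
    """True if the sigma at text[i] meets the Final_Sigma condition."""
    # Look backward, skipping case-ignorable characters, for a cased letter.
    j = i - 1
    while j >= 0 and is_case_ignorable(text[j]):
        j -= 1
    if j < 0 or not is_cased(text[j]):
        return False
    # Look forward, skipping case-ignorable characters; the sigma is final
    # only if no cased letter follows.
    k = i + 1
    while k < len(text) and is_case_ignorable(text[k]):
        k += 1
    return k == len(text) or not is_cased(text[k])


def lower_with_final_sigma(text):
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            out.append(FINAL_SIGMA if is_final_sigma(text, i) else SMALL_SIGMA)
        else:
            out.append(ch.lower())
    return "".join(out)
```

The point of the table is visible in the two skip loops: without
knowing which characters are case-ignorable, a trailing apostrophe or
combining mark would wrongly decide whether the sigma is word-final.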

Loose ends:

 * Right now you can't mix all of the full case mapping behavior with
INITCAP(); it just does simple titlecase mapping. I'm not sure we want
to get too fancy here; after all, INITCAP() is not a SQL standard
function and it's documented in a narrow fashion that doesn't seem to
leave a lot of room to be very smart. ICU does a few extra things
beyond what I did:
  - its case conversion function accepts a word break iterator
  - it provides some built-in word break iterators
  - it also has some configurable "break adjustment" behavior[1][2]
which re-aligns the start of the word, and I'm not entirely sure why
that isn't done in the word break iterator or the titlecasing rules
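
For comparison, the narrow behavior described for INITCAP() (words are
runs of alphanumeric characters; the first character of each word gets
a simple titlecase mapping and the rest are lowercased) can be
sketched in a few lines of Python. This is an illustration under
assumptions, not the patch: the function names are hypothetical, and
"simple" mapping is approximated by ignoring any mapping that would
expand a character into more than one.

```python
def simple_title(ch):
    # Approximate a one-to-one ("simple") titlecase mapping: keep the
    # character unchanged if the full mapping would expand it.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def simple_lower(ch):
    # Same approximation for lowercase.
    mapped = ch.lower()
    return mapped if len(mapped) == 1 else ch


def initcap_sketch(text):
    out = []
    at_word_start = True
    for ch in text:
        if ch.isalnum():
            # First alphanumeric character of a word is titlecased,
            # the remaining ones are lowercased.
            out.append(simple_title(ch) if at_word_start else simple_lower(ch))
            at_word_start = False
        else:
            # Any non-alphanumeric character ends the current word.
            out.append(ch)
            at_word_start = True
    return "".join(out)


print(initcap_sketch("hello wORLD"))
```

A pluggable word break iterator would replace the fixed isalnum()
test with a caller-supplied notion of where words begin.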

Regards,
    Jeff Davis


[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e


