Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Update Unicode data to Unicode 16.0.0
Msg-id 042a9d1890c2704b415f9051ee6ca5437b22bc19.camel@j-davis.com
In response to Re: Update Unicode data to Unicode 16.0.0  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Update Unicode data to Unicode 16.0.0
List pgsql-hackers
On Tue, 2025-03-18 at 14:45 -0400, Robert Haas wrote:
> I think Joe has the right idea. The way to actually provide the
> stability that people want here is to continue supporting old
> versions while adding support for new versions. Anything else we do
> works subject to assumptions: you can either assume that people
> don't use code points before they're assigned (as Tom proposes) or
> you can assume that not updating to newer Unicode specs will not
> cause any problems for anyone. Joe's proposal is unique in requiring
> no assumptions about what users will do or what they care about. It
> just works, period. The only disadvantage is that it's more work for
> us, but this problem has caused enough suffering over the years that
> I bet we can find hackers to implement it and maintain it on an
> ongoing basis without great difficulty.

I've already put a fair amount of work into this approach, but it is a
lot of work, and I could use some help. Here's a quick dump of some of
my notes on what we can do going forward:

* builtin provider with stable primary keys: done

* collation behavior as method tables: done

* have support for STRICT_UNICODE, or something like it, to allow users
to mitigate their upgrade risks by rejecting unassigned code points: I
submitted a proposal for a database-level option, which got no
discussion

* ctype behavior as method tables: patch written, discussion trailed
off. There was a really minor performance regression, so I held off
committing it. I don't think it's an actual problem, though, so if
people are in general agreement that we want it, I have no problem
committing it.

* separate "collation provider" from "ctype provider" internally. Have
pg_open_collation() and pg_open_ctype(), and deprecate
pg_newlocale_from_collation(). This is a fair amount of work, but it's
important for dependency and version tracking, as well as an
organizational improvement.

* turn providers into method tables: not too hard. We'd still need to
have the notion of a "provider kind" (builtin, ICU, libc) so that we
know how to interpret the syntax and store things in the other catalogs
(for instance, only ICU accepts ICU_RULES, only libc allows LC_COLLATE
and LC_CTYPE to be different, etc.).

* put providers into new shared catalogs pg_collation_provider and
pg_ctype_provider, which would each have handlers that know how to
instantiate a specific collation or ctype

* add new function markers COLLATE and CTYPE (or some other names),
meaning that the function is sensitive to the collation or ctype of its
arguments.
  - for example: LOWER() would be marked CTYPE, ">" would be marked
COLLATE, and "||" wouldn't need any mark.
  - When creating some object that has an expression in it, let's say
an index, we already walk the expression and add dependencies on the
functions in the expression. If one of those functions has such a
marker, we would look at the inferred collation of the function, find
its provider, and add a dependency on the provider's shared catalog
entry.
  - must work even on "pinned" functions
  - queries with ORDER BY, say as part of an MV definition, would be
implicitly treated like functions marked with COLLATE

* (optional) have some kind of runtime check that detects when a UDF
missing the appropriate COLLATE or CTYPE marker opens a collation or
ctype, and throws a WARNING or ERROR

* throw away the idea of collation-specific versions, or make it more
of an additional check. Versions would be attached to the provider
entries in the shared catalogs. The only provider that differentiates
collation versions by locale is ICU, and people were highly skeptical
of that before we found bugs in it, and more skeptical afterward.
Per-collation versions will just be a source of confusion in the long
term.

* Have some new functions and DDL commands that can find and fix
objects by following the dependency links.

* Allow extensions to be loaded at initdb time, and initialize their
own providers and their own lists of collations.

* Provide a contrib that implements the builtin provider with Unicode
15.1.

* If we want multiple versions of a provider in the same running
server, that would take more work. I have my doubts about how many
people would really, actually use that, but it's possible.
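
To make the STRICT_UNICODE item in the notes above concrete, rejecting
unassigned code points boils down to a check like the one below. This
is only a Python sketch, not PostgreSQL code: the function names
find_unassigned() and reject_unassigned() are hypothetical, and the
check uses whatever Unicode version the Python interpreter's
unicodedata module ships with, standing in for the version a provider
would be built against.

```python
import unicodedata


def find_unassigned(text):
    """Return (position, code point) pairs for characters in category Cn.

    Cn is the Unicode general category for unassigned code points (it
    also covers noncharacters). The answer depends on the Unicode
    version of this interpreter: unicodedata.unidata_version.
    """
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cn"
    ]


def reject_unassigned(text):
    """Hypothetical strict-mode check: raise on the first Cn character."""
    bad = find_unassigned(text)
    if bad:
        pos, cp = bad[0]
        raise ValueError(
            f"unassigned code point U+{cp:04X} at position {pos}"
        )
    return text
```

A string that passes this check contains only code points that were
assigned in the checking version, so a later Unicode update cannot
newly assign any of them.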

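The COLLATE/CTYPE marker item in the notes above describes walking an
expression and recording a dependency on the provider behind each
marked function. A toy Python model of that walk might look like the
following. Everything here is an assumption made for illustration: the
Marker enum, the FuncCall node, the COLLATION_PROVIDER lookup table,
and the shape of the dependency tuples are hypothetical and do not
correspond to existing catalogs or node types.

```python
from dataclasses import dataclass, field
from enum import Enum


class Marker(Enum):
    NONE = "none"        # e.g. "||": not sensitive to collation or ctype
    COLLATE = "collate"  # e.g. ">": sensitive to collation
    CTYPE = "ctype"      # e.g. LOWER(): sensitive to ctype


@dataclass
class FuncCall:
    name: str
    marker: Marker
    collation: str                      # inferred collation of the call
    args: list = field(default_factory=list)


# Hypothetical lookup from collation name to its provider.
COLLATION_PROVIDER = {
    "en-x-icu": "icu",
    "pg_c_utf8": "builtin",
    "en_US.utf8": "libc",
}


def provider_dependencies(expr):
    """Collect (kind, provider) pairs for every marked call in expr."""
    deps = set()
    stack = [expr]
    while stack:
        node = stack.pop()
        if not isinstance(node, FuncCall):
            continue  # column references and constants carry no marker
        if node.marker is not Marker.NONE:
            kind = "collation" if node.marker is Marker.COLLATE else "ctype"
            deps.add((kind, COLLATION_PROVIDER[node.collation]))
        stack.extend(node.args)
    return deps


# Index expression: lower(a || b) > 'x', all under an ICU collation.
EXAMPLE = FuncCall(">", Marker.COLLATE, "en-x-icu", [
    FuncCall("lower", Marker.CTYPE, "en-x-icu", [
        FuncCall("||", Marker.NONE, "en-x-icu", ["a", "b"]),
    ]),
    "'x'",
])
```

In this model, provider_dependencies(EXAMPLE) yields one collation
dependency and one ctype dependency, both on the ICU provider, while
"||" contributes nothing.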

I plan to submit some proposals in a few weeks as this CF settles down,
and then have an unconference session on this topic at pgconf.dev. If
anyone is motivated for these problems to be fixed, please jump into
those discussions on list or at the conference, and take on a task or
two.

I am not trying to be dismissive of the concerns raised in this thread,
but I'd like others to understand that what they are asking for is a
lot of work, and that the builtin collation provider solves 99% of it
already. All this effort is to solve that last 1%.

Regards,
    Jeff Davis



