Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | 042a9d1890c2704b415f9051ee6ca5437b22bc19.camel@j-davis.com |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Update Unicode data to Unicode 16.0.0 |
List | pgsql-hackers |
On Tue, 2025-03-18 at 14:45 -0400, Robert Haas wrote:
> I think Joe has the right idea. The way to actually provide the
> stability that people want here is to continue supporting old
> versions while adding support for new versions. Anything else we do
> works subject to assumptions: you can either assume that people don't
> use code points before they're assigned (as Tom proposes) or you can
> assume that not updating to newer Unicode specs will not cause any
> problems for anyone. Joe's proposal is unique in requiring no
> assumptions about what users will do or what they care about. It just
> works, period. The only disadvantage is that it's more work for us,
> but this problem has caused enough suffering over the years that I
> bet we can find hackers to implement it and maintain it on an ongoing
> basis without great difficulty.

I've already put a fair amount of work into this approach, but it is a
lot of work, and I could use some help.

Here's a quick dump of some of my notes on what we can do going
forward:

* builtin provider with stable primary keys: done

* collation behavior as method tables: done

* have support for STRICT_UNICODE, or something like it, to allow users
  to mitigate their upgrade risks by rejecting unassigned code points:
  I submitted a proposal for a database-level option, which got no
  discussion

* ctype behavior as method tables: patch written, discussion trailed
  off. There was a very minor performance regression, so I held off
  committing it, but I don't think it's an actual problem, so if people
  are in general agreement that we want it then I have no problem
  committing it.

* separate "collation provider" from "ctype provider" internally. Have
  pg_open_collation() and pg_open_ctype(), and deprecate
  pg_newlocale_from_collation(). This is a fair amount of work, but
  it's important for dependency and version tracking, as well as being
  an organizational improvement.

* turn providers into method tables: not too hard. We'd still need to
  have the notion of a "provider kind" (builtin, ICU, libc) so that we
  know how to interpret the syntax and store things in the other
  catalogs (for instance, only ICU accepts ICU_RULES, only libc allows
  LC_COLLATE and LC_CTYPE to be different, etc.).

* put providers into new shared catalogs pg_collation_provider and
  pg_ctype_provider, which would each have handlers that know how to
  instantiate a specific collation or ctype

* add new function markers COLLATE and CTYPE (or some other names),
  meaning that the function is sensitive to the collation or ctype of
  its arguments.
  - for example: LOWER() would be marked CTYPE, ">" would be marked
    COLLATE, and "||" wouldn't need any mark.
  - When creating some object that has an expression in it, let's say
    an index, we already walk the expression and add dependencies on
    the functions in the expression. If one of those functions has such
    a marker, we would look at the inferred collation of the function,
    find its provider, and add a dependency on the provider's shared
    catalog entry.
  - must work even on "pinned" functions
  - queries with ORDER BY, say as part of an MV definition, would be
    implicitly treated like functions marked with COLLATE

* (optional) have some kind of runtime check so that UDFs that are
  missing the appropriate COLLATE or CTYPE markers detect that a
  collation or ctype is being opened, and throw a WARNING or ERROR

* throw away the idea of collation-specific versions, or make it more
  of an additional check. Versions would be attached to the provider
  entries in the shared catalogs. The only provider that differentiates
  collation versions by locale is ICU, and people were highly skeptical
  of that before we found bugs in it, and more skeptical afterward.
  Per-collation versions will just be a source of confusion in the long
  term.

* Have some new functions and DDL commands that can find and fix
  objects by following the dependency links.

* Allow extensions to be loaded at initdb time, and initialize their
  own providers and their own lists of collations.

* Provide a contrib that implements the builtin provider with Unicode
  15.1.

* If we want multiple versions of a provider in the same running
  server, that would take more work. I have my doubts about how many
  people would really, actually use that, but it's possible.

I plan to submit some proposals in a few weeks as this CF settles down,
and then have an unconference session on this topic at pgconf.dev. If
anyone is motivated for these problems to be fixed, please jump into
those discussions on list or at the conference, and take on a task or
two.

I am not trying to be dismissive of the concerns raised in this thread,
but I'd like others to understand that what they are asking for is a
lot of work, and that the builtin collation provider solves 99% of it
already. All this effort is to solve that last 1%.

Regards,
	Jeff Davis
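
The function-marker idea in the notes above (walk an expression, and for each function marked COLLATE or CTYPE record a dependency on the provider behind its inferred collation) could be sketched roughly as follows. This is only an illustrative model in Python, not PostgreSQL code: the `Marker` enum, the `Expr` class, the two lookup tables, and `provider_dependencies()` are all hypothetical names invented for the sketch, and the catalog names are the ones proposed in the notes, which do not exist yet.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Marker(Enum):
    """Hypothetical per-function sensitivity marker."""
    NONE = "none"
    COLLATE = "collate"
    CTYPE = "ctype"


# Assumed marker assignments, following the examples given in the notes.
FUNCTION_MARKERS = {"lower": Marker.CTYPE, ">": Marker.COLLATE, "||": Marker.NONE}

# Assumed mapping: collation name -> (collation provider, ctype provider).
COLLATION_PROVIDERS = {
    "pg_c_utf8": ("builtin", "builtin"),
    "en-x-icu": ("icu", "icu"),
}


@dataclass
class Expr:
    """A node in a toy expression tree; func is None for leaves."""
    func: Optional[str] = None
    collation: Optional[str] = None  # inferred collation, if any
    args: list["Expr"] = field(default_factory=list)


def provider_dependencies(expr: Expr) -> set[tuple[str, str]]:
    """Collect (proposed shared catalog, provider) pairs for an expression."""
    deps: set[tuple[str, str]] = set()
    marker = FUNCTION_MARKERS.get(expr.func, Marker.NONE)
    if marker is not Marker.NONE and expr.collation is not None:
        coll_provider, ctype_provider = COLLATION_PROVIDERS[expr.collation]
        if marker is Marker.COLLATE:
            deps.add(("pg_collation_provider", coll_provider))
        else:
            deps.add(("pg_ctype_provider", ctype_provider))
    # Recurse into arguments, as the existing dependency walk does for functions.
    for arg in expr.args:
        deps |= provider_dependencies(arg)
    return deps


# Toy index expression: lower(a || b) with an ICU collation.
index_expr = Expr("lower", "en-x-icu", [Expr("||", "en-x-icu", [Expr(), Expr()])])
deps = provider_dependencies(index_expr)
print(deps)
```

In this toy model only `lower` contributes a dependency, since `||` carries no marker; a comparison such as `>` would instead add an entry for the collation provider.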