Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | CA+TgmoahWh6zxBFUygOUwrdkGogp47ZVoa8GVkxEORRc7+EwWA@mail.gmail.com |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Update Unicode data to Unicode 16.0.0
|
List | pgsql-hackers |
On Fri, Mar 21, 2025 at 2:45 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> > * When the collation/ctype/whatever definitions upon which you are
> > relying change, you can either decide to switch to the new ones
> > without rebuilding your indexes and risk wrong results until you
> > reindex, or you can decide to create new indexes using the new
> > definitions and drop the old ones.
>
> Would newly-created objects pick up the new Unicode version, or stick
> with the old one?

Hmm, I hadn't thought about that. I'm assuming that the Unicode version would need, in this scheme, to be coupled to the object that depends on it. For example, an index that uses a Unicode collation would need to store a Unicode version. But for a new index, how would that be set?

Maybe the Unicode version would be treated as part of the collation. I'm guessing that an index defaults to the column collation, and I think the column collation defaults to the database default collation. We might need a way for ALTER DATABASE to allow the database default to be adjusted. I'm not quite sure here, but my general feeling is that the Unicode version feels like part of the collation and that we should avoid introducing a separate mechanism if possible. What are your thoughts?

> Supporting built-in natural language sort orders would be a much larger
> scope. And I don't think we need that, but that's a larger discussion.

I'm curious why you think this. My own feeling (as I think you probably know, but just to be clear) is that relatively few people need extremely precise control over their collation behavior, but there are some who do. However, I think there are many people for whom a code-point sort won't be good enough. If you want to leave this discussion for another time, that's fine.

> What if we were able to tell, for instance, that your database has none
> of the codepoints affected by the most recent update. Then updating
> would be less risky than not updating: if you don't update Unicode,
> then the code points could end up in the database treated as
> unassigned, and then cause a problem for future updates.

The problem with this is that it requires scanning the whole database. That's not to say it's useless. Some people can afford to scan the whole database, and some people might even WANT to scan the whole database just to give themselves peace of mind. But there are also plenty of people for whom this is a major downside, or even makes the approach unusable. I'd like to have a solution that is based on metadata.

Maybe we should actually move in the direction of having encodings that are essentially specific versions of Unicode. Instead of just having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or whatever, which would only accept code points known to that version of Unicode. Or maybe this shouldn't be entirely new encodings but something vaguely akin to a typmod, so that you could have columns of type text[limited_to_unicode_v16_0_0] or whatever.

If we actually exclude unassigned code points, then we know they aren't there, and we can make deductions about what is safe to do based on that information. I'm not quite sure how useful that is, but I tend to think that enforcing rules when the data goes in has a decent shot at being better than letting anything go in and then having to scan it later to see how it all turned out.

--
Robert Haas
EDB: http://www.enterprisedb.com
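
To make the "enforce rules when the data goes in" idea concrete, here is a minimal sketch in Python. It is not PostgreSQL code and not a proposed implementation. It assumes the Unicode version in force is whatever Python's `unicodedata` module bundles, and the function names `unassigned_code_points` and `check_input` are made up for illustration. The point is only that a per-value check at input time is cheap compared with scanning existing data afterwards.

```python
import unicodedata

# Assumption: the "known" Unicode version is the one bundled with this
# Python build (see unicodedata.unidata_version), not a PostgreSQL setting.


def unassigned_code_points(text: str) -> list[str]:
    """List code points in `text` whose general category is Cn.

    Cn covers code points unassigned in the bundled Unicode version
    (it also covers noncharacters such as U+FFFF).
    """
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Cn"]


def check_input(text: str) -> str:
    """Hypothetical input check: reject text containing unassigned code points."""
    bad = unassigned_code_points(text)
    if bad:
        raise ValueError(
            f"code points not assigned in Unicode {unicodedata.unidata_version}: "
            + ", ".join(bad)
        )
    return text
```

With a check like this on the way in, the stored data is known to contain only code points assigned in that version, which is the kind of metadata-level guarantee discussed above.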