Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | d04688cb65619f3c006352763bc8285e1ce3537a.camel@j-davis.com |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Update Unicode data to Unicode 16.0.0 |
List | pgsql-hackers |
On Fri, 2025-03-21 at 10:45 -0400, Robert Haas wrote:
> We might need a way for ALTER DATABASE to allow the
> database default to be adjusted. I'm not quite sure here, but my
> general feeling is that Unicode version feels like part of the
> collation and that we should avoid introducing a separate mechanism
> if possible. What are your thoughts?

My (early stage) plans are to have two new shared catalogs, pg_ctype_provider and pg_collation_provider. Objects would depend on records in those shared catalogs, which would each have a version. We'd eventually allow multiple records with providerkind=icu, for instance, and have some way to choose which one to use (perhaps new objects get the default version, old objects keep the old version, or something).

The reason to have two shared catalogs is because some objects depend on collation behavior and some on ctype behavior. If there's an index on "t COLLATE PG_C_UTF8" then there would be no direct dependency from the index to the builtin provider in either catalog, because collation behavior in the builtin provider is unversioned memcmp. But if there's an index on "LOWER(t COLLATE PG_C_UTF8)", then it would have a dependency entry to the builtin provider's entry in pg_ctype_provider.

> I'm curious why you think this. My own feeling (as I think you
> probably know, but just to be clear) is that relatively few people
> need extremely precise control over their collation behavior, but
> there are some who do. However, I think there are many people for
> whom a code-point sort won't be good enough.

You can use ICU for sorting without using it for the index comparators. Using ICU in the index comparators is an implementation detail that's only required for unique indexes over non-deterministic collations. And if it's not used for the index comparators, then most of the problems go away, and versioning is not nearly so important.
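To make the two-catalog dependency rule above concrete, here is a minimal Python sketch. It is only a model of the early-stage plan as described in this message, not PostgreSQL code: the class `IndexedExpr` and the function `provider_dependencies` are hypothetical names, and the treatment of non-builtin providers (a collation dependency whenever the provider is not builtin) is an assumption extrapolated from the two builtin examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedExpr:
    """Hypothetical description of an indexed expression."""
    provider: str      # "builtin", "icu", or "libc"
    uses_ctype: bool   # True if the expression calls e.g. LOWER() or UPPER()


def provider_dependencies(expr):
    """Return the (catalog, provider) entries the index would depend on."""
    deps = set()
    # Ordering in the builtin provider is unversioned memcmp, so it needs
    # no entry. Assumption: every other provider has versioned ordering.
    if expr.provider != "builtin":
        deps.add(("pg_collation_provider", expr.provider))
    # Case mapping and character classification are ctype behavior, which
    # is versioned even for the builtin provider.
    if expr.uses_ctype:
        deps.add(("pg_ctype_provider", expr.provider))
    return deps


# Index on "t COLLATE PG_C_UTF8": no dependency in either catalog.
plain = provider_dependencies(IndexedExpr("builtin", uses_ctype=False))

# Index on "LOWER(t COLLATE PG_C_UTF8)": depends on pg_ctype_provider only.
lowered = provider_dependencies(IndexedExpr("builtin", uses_ctype=True))
```

Under this model only the second index would need attention when the builtin provider's Unicode data is updated.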
Sure, there are some cases where using ICU in the index comparator is important, and I'm not suggesting that we remove functionality. But I believe that using libc or ICU for index comparators is the wrong default behavior -- high downsides and low upsides for most text indexes that have ever been created.

Even if there is an ORDER BY, using an index is often the wrong thing unless it's an index only scan. Text indexes are rarely correlated with the heap, so it would lead to a lot of random heap fetches, and it's often better to just execute the query and do a final sort. The situations where ICU in the comparator is a good idea are special cases of special cases.

I've posted about this in the past, and got universal disagreement. But I believe others will eventually come to the same conclusion that I did.

> Maybe we should actually move in the direction of having encodings
> that are essentially specific versions of Unicode. Instead of just
> having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> whatever, which would only accept code points known to that version
> of Unicode. Or maybe this shouldn't be entirely new encodings but
> something vaguely akin to a typmod, so that you could have columns of
> type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> exclude unassigned code points, then we know they aren't there, and
> we can make deductions about what is safe to do based on that
> information.

I like this line of thinking, vaguely similar to my STRICT_UNICODE database option proposal. Maybe these aren't exactly the right things to do, but I think there are some possibilities here, and we shouldn't give up and assume there's a problem when usually there is not.

It reminds me of fast-path locking: sure, there *might* be DDL happening while I'm trying to do a simple SELECT query.
But probably not, so let's make it the responsibility of DDL to warn others that it's doing something, rather than the responsibility of the SELECT query.

Regards,
	Jeff Davis
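The idea quoted above, of only accepting code points known to a given Unicode version, can be illustrated with a short Python sketch. This is not how PostgreSQL or the STRICT_UNICODE proposal would implement it. It uses Python's standard `unicodedata` module, which is tied to whatever Unicode version the interpreter ships with; the function names `unassigned_code_points` and `accepts` are hypothetical.

```python
import unicodedata

# The Unicode version this check is pinned to (depends on the Python build).
UNICODE_VERSION = unicodedata.unidata_version


def unassigned_code_points(text):
    """Return the characters in text that have general category Cn.

    Cn covers code points not assigned in UNICODE_VERSION; it also
    includes noncharacters such as U+FFFF.
    """
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


def accepts(text):
    """True if every code point in text is assigned in UNICODE_VERSION."""
    return not unassigned_code_points(text)
```

A value that passes `accepts` contains no code point whose properties could change from "unassigned" to something else in a later Unicode version, which is the kind of deduction the quoted paragraph has in mind. For example, U+0378 is reserved in the Greek block, so `accepts("a\u0378b")` is false.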