Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | b7c9dafa10ba3ad7fa201d9c3a2d8ac5b7aa923d.camel@j-davis.com |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Jeremy Schneider <schneider@ardentperf.com>) |
Responses |
Re: Update Unicode data to Unicode 16.0.0
|
List | pgsql-hackers |
On Tue, 2025-03-18 at 09:28 -0700, Jeremy Schneider wrote:
> We think case-insensitive indexes are probably uncommon, so as
> long as its "rare" we can let them break.

Let's define "break" in this context to mean that the constraints are not enforced, or that the query doesn't return the results that the user is expecting.

Let's say a user has an index on LOWER(t) in PG17 (Unicode 15.1). Then Unicode 16.0 comes out, introducing the newly-assigned U+A7DC, which lowercases to U+019B. The rest of the world moves on and starts using U+A7DC. There are only two ways that Postgres can prevent breakage:

1. Update the database to Unicode 16.0 before U+A7DC is encountered, so that it's properly lowercased to U+019B, and a query on LOWER(t) = U&'\019B' will correctly return the record containing it.

2. Prevent U+A7DC from going into the database at all.

Continuing on with Unicode 15.1 and accepting the unassigned code point *cannot* prevent breakage.

A truly paranoid user would want a combination of both solutions: regular Unicode updates; and something like STRICT_UNICODE ( https://commitfest.postgresql.org/patch/4876/ ) to protect the user between the time Unicode assigns the code point and the time they can deploy a version of Postgres that understands it.

You are rightfully concerned that updating Unicode can create its own inconsistencies, and if nothing is done that can lead to breakage as well. The upgrade-time check in this thread is one solution to that problem, but we could do a lot more.

You are also right that we should be more skeptical of an internal inconsistency (e.g. different results for seqscan vs indexscan) than a wider definition of inconsistency. But the user created a unicode-based case-folded index there for a reason, and we shouldn't lose sight of that.
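To make the U+A7DC scenario concrete, here is a minimal Python sketch of the three situations described above. It is an illustration only, not how Postgres implements LOWER() or its indexes: the two mapping tables, `lower`, and `build_index` are hypothetical stand-ins, and each table holds only the one mapping this message mentions.

```python
# Hypothetical, drastically reduced lowercase tables (assumption: the real
# Unicode tables are far larger; only the mapping from this message is shown).
LOWER_15_1 = {}                      # U+A7DC is unassigned, so it maps to itself
LOWER_16_0 = {"\uA7DC": "\u019B"}    # Unicode 16.0 adds U+A7DC -> U+019B


def lower(text, table):
    """Lowercase text using the given per-character mapping table."""
    return "".join(table.get(ch, ch) for ch in text)


def build_index(rows, table):
    """Simulate an expression index on LOWER(t): key -> list of row ids."""
    index = {}
    for row_id, text in rows.items():
        index.setdefault(lower(text, table), []).append(row_id)
    return index


rows = {1: "\uA7DC", 2: "\u019B"}
target = "\u019B"

# Staying on Unicode 15.1: the row containing U+A7DC is never found.
old_index = build_index(rows, LOWER_15_1)
old_hits = old_index.get(target, [])

# Index built under Unicode 16.0: both rows are found.
new_index = build_index(rows, LOWER_16_0)
new_hits = new_index.get(target, [])

# Upgrade without rebuilding: a scan using the 16.0 rules disagrees with
# the stale index built under 15.1 (the seqscan vs indexscan inconsistency).
seqscan_hits = [rid for rid, t in rows.items() if lower(t, LOWER_16_0) == target]
stale_index_hits = old_index.get(target, [])
```

In this toy model, the stale-index case is the internal inconsistency that an upgrade-time check is meant to catch, while the first case is the wider breakage that staying on Unicode 15.1 cannot avoid.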
> I'm not asking for an extreme definition of "IMMUTABLE" but I'd be
> very happy with a GUC "data_safety=radical_like_jeremy" where Postgres
> simply won't start if the control file says it was from a different
> operating system or architecture or ICU/glibc collation version. I can
> disable the GUC (like a maintenance mode) to rebuild my indexes and
> update my collation versions, and ideally this GUC would also mean that
> indexes simply aren't allowed to be created on functions that might
> change within the guarantees that are made. (And range-based partitions
> can't use them, and FDWs can't rely on them for query planning, etc.)

Does the upgrade check patch in this thread accomplish that for you? If not, what else does it need? It's an upgrade-time check rather than a GUC, but it basically seems to match what you want. See:

https://www.postgresql.org/message-id/16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com

Regards,
Jeff Davis