Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | CALDaNm3V7-pW_=Fb-DChdKS_mDiYCmxmvepfNQHP+SiVr3KaYg@mail.gmail.com Whole thread Raw |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Update Unicode data to Unicode 16.0.0
|
List | pgsql-hackers |
On Sat, 8 Mar 2025 at 02:41, Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote: > > I see. Do we provide any suggested next steps for users to assess > > the > > potentially-affected relations? > > I don't know exactly where we should document it, but I've attached a > SQL file that demonstrates what can happen for a PG17->PG18 upgrade, > assuming that we've updated Unicode to 16.0.0 in PG18. > > The change in Unicode that I'm focusing on is the addition of U+A7DC, > which is unassigned in Unicode 15.1 and assigned in Unicode 16, which > lowercases to U+019B. The examples assume that the user is using > unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation. > > The exmaple table 'test' has a single row with U+A7DC and an index > test_idx on LOWER(t). Because the codepoint is unassigned in PG17, > LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'. > > The first thing the user should do if they see the warnings during the > --check phase is to rerun with "--check --retain" so they can keep the > file. Maybe that's not great, and we should keep the file regardless > for these particular kinds of warnings. > > Then go ahead and upgrade. After upgrade, the user should: > > REINDEX INDEX CONCURRENTLY test_idx; > > each potentially-affected index. If they have amcheck installed, they > can do: > > select bt_index_check(index=>'test_idx', heapallindexed=>true); > > first, and only REINDEX if there's an error to fix. Do the same for > other indexes listed in the warning file. > > The next example is tuniq, which is similar except that the index is a > unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In > PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former > because it's already lowercase, and the latter because it's unassigned > and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'. > So, a REINDEX will encounter a uniqueness violation, and the user will > need to decide how to handle the duplicate values. > > The next example is tcheck, which does CHECK(LOWER(t) = t), which is > the most plausible check constraint I can think of that would be > affected. In theory, the user might have to reason through what should > be done about a tuple that violates it, but in practice it might be as > simple as: > > UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t); > > The last example is partitioning where the table is range partitioned > on LOWER(t), and the tuple ends up in the wrong partition on PG18. > After fixing the indexes with: > > -- tpart0_lower_idx is fine, no REINDEX needed > select bt_index_check(index=>'tpart0_lower_idx', > heapallindexed=>true); > select bt_index_check(index=>'tpart1_lower_idx', > heapallindexed=>true); > REINDEX INDEX CONCURRENTLY tpart1_lower_idx; > > See the problem with: > > SET enable_partition_pruning=off; > -- correctly finds one tuple > SELECT * FROM tpart WHERE LOWER(t) = U&'\019B'; > > SET enable_partition_pruning TO DEFAULT; > -- incorrectly finds zero tuples > SELECT * FROM tpart WHERE LOWER(t) = U&'\019B'; > > Now move the tuples to the right partitions: > > BEGIN; > WITH d AS ( > DELETE FROM tpart0 > WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000') > RETURNING * > ) INSERT INTO tpart SELECT * FROM d; > WITH d AS ( > DELETE FROM tpart1 > WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF') > RETURNING * > ) INSERT INTO tpart SELECT * FROM d; > COMMIT; > > The order of operations should be to fix indexes, unique constraints, > and check constraints first; and then to fix partitioned tables. That > way the above partitioned table queries get correct results using the > indexes. > > Granted, this is not a great user experience, and we could do better in > the future (e.g. add some new utility commands). But this is an > improvement in the sense that it's a small number of affected objects, > unlike a collation version change, which affects anything related to > text. We currently have two Commitfest entries for the same thread at [1] and [2]. Are both still necessary, or can we consolidate tracking into a single entry? [1] - https://commitfest.postgresql.org/patch/5472/ [2] - https://commitfest.postgresql.org/patch/5563/ Regards, Vignesh
pgsql-hackers by date: