Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Update Unicode data to Unicode 16.0.0 |
Date | |
Msg-id | 16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com Whole thread Raw |
In response to | Re: Update Unicode data to Unicode 16.0.0 (Nathan Bossart <nathandbossart@gmail.com>) |
List | pgsql-hackers |
On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote: > I see. Do we provide any suggested next steps for users to assess > the > potentially-affected relations? I don't know exactly where we should document it, but I've attached a SQL file that demonstrates what can happen for a PG17->PG18 upgrade, assuming that we've updated Unicode to 16.0.0 in PG18. The change in Unicode that I'm focusing on is the addition of U+A7DC, which is unassigned in Unicode 15.1 and assigned in Unicode 16, which lowercases to U+019B. The examples assume that the user is using unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation. The exmaple table 'test' has a single row with U+A7DC and an index test_idx on LOWER(t). Because the codepoint is unassigned in PG17, LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'. The first thing the user should do if they see the warnings during the --check phase is to rerun with "--check --retain" so they can keep the file. Maybe that's not great, and we should keep the file regardless for these particular kinds of warnings. Then go ahead and upgrade. After upgrade, the user should: REINDEX INDEX CONCURRENTLY test_idx; each potentially-affected index. If they have amcheck installed, they can do: select bt_index_check(index=>'test_idx', heapallindexed=>true); first, and only REINDEX if there's an error to fix. Do the same for other indexes listed in the warning file. The next example is tuniq, which is similar except that the index is a unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former because it's already lowercase, and the latter because it's unassigned and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'. So, a REINDEX will encounter a uniqueness violation, and the user will need to decide how to handle the duplicate values. The next example is tcheck, which does CHECK(LOWER(t) = t), which is the most plausible check constraint I can think of that would be affected. In theory, the user might have to reason through what should be done about a tuple that violates it, but in practice it might be as simple as: UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t); The last example is partitioning where the table is range partitioned on LOWER(t), and the tuple ends up in the wrong partition on PG18. After fixing the indexes with: -- tpart0_lower_idx is fine, no REINDEX needed select bt_index_check(index=>'tpart0_lower_idx', heapallindexed=>true); select bt_index_check(index=>'tpart1_lower_idx', heapallindexed=>true); REINDEX INDEX CONCURRENTLY tpart1_lower_idx; See the problem with: SET enable_partition_pruning=off; -- correctly finds one tuple SELECT * FROM tpart WHERE LOWER(t) = U&'\019B'; SET enable_partition_pruning TO DEFAULT; -- incorrectly finds zero tuples SELECT * FROM tpart WHERE LOWER(t) = U&'\019B'; Now move the tuples to the right partitions: BEGIN; WITH d AS ( DELETE FROM tpart0 WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000') RETURNING * ) INSERT INTO tpart SELECT * FROM d; WITH d AS ( DELETE FROM tpart1 WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF') RETURNING * ) INSERT INTO tpart SELECT * FROM d; COMMIT; The order of operations should be to fix indexes, unique constraints, and check constraints first; and then to fix partitioned tables. That way the above partitioned table queries get correct results using the indexes. Granted, this is not a great user experience, and we could do better in the future (e.g. add some new utility commands). But this is an improvement in the sense that it's a small number of affected objects, unlike a collation version change, which affects anything related to text. Regards, Jeff Davis
Attachment
pgsql-hackers by date: