Re: Update Unicode data to Unicode 16.0.0 - Mailing list pgsql-hackers

From vignesh C
Subject Re: Update Unicode data to Unicode 16.0.0
Date
Msg-id CALDaNm3V7-pW_=Fb-DChdKS_mDiYCmxmvepfNQHP+SiVr3KaYg@mail.gmail.com
Whole thread Raw
In response to Re: Update Unicode data to Unicode 16.0.0  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Update Unicode data to Unicode 16.0.0
List pgsql-hackers
On Sat, 8 Mar 2025 at 02:41, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see.  Do we provide any suggested next steps for users to assess
> > the
> > potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> The change in Unicode that I'm focusing on is the addition of U+A7DC,
> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
> lowercases to U+019B. The examples assume that the user is using
> unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation.
>
> The exmaple table 'test' has a single row with U+A7DC and an index
> test_idx on LOWER(t). Because the codepoint is unassigned in PG17,
> LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'.
>
> The first thing the user should do if they see the warnings during the
> --check phase is to rerun with "--check --retain" so they can keep the
> file. Maybe that's not great, and we should keep the file regardless
> for these particular kinds of warnings.
>
> Then go ahead and upgrade. After upgrade, the user should:
>
>   REINDEX INDEX CONCURRENTLY test_idx;
>
> each potentially-affected index. If they have amcheck installed, they
> can do:
>
>   select bt_index_check(index=>'test_idx', heapallindexed=>true);
>
> first, and only REINDEX if there's an error to fix. Do the same for
> other indexes listed in the warning file.
>
> The next example is tuniq, which is similar except that the index is a
> unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In
> PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former
> because it's already lowercase, and the latter because it's unassigned
> and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'.
> So, a REINDEX will encounter a uniqueness violation, and the user will
> need to decide how to handle the duplicate values.
>
> The next example is tcheck, which does CHECK(LOWER(t) = t), which is
> the most plausible check constraint I can think of that would be
> affected. In theory, the user might have to reason through what should
> be done about a tuple that violates it, but in practice it might be as
> simple as:
>
>    UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t);
>
> The last example is partitioning where the table is range partitioned
> on LOWER(t), and the tuple ends up in the wrong partition on PG18.
> After fixing the indexes with:
>
>    -- tpart0_lower_idx is fine, no REINDEX needed
>    select bt_index_check(index=>'tpart0_lower_idx',
>      heapallindexed=>true);
>    select bt_index_check(index=>'tpart1_lower_idx',
>      heapallindexed=>true);
>    REINDEX INDEX CONCURRENTLY tpart1_lower_idx;
>
> See the problem with:
>
>    SET enable_partition_pruning=off;
>    -- correctly finds one tuple
>    SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';
>
>    SET enable_partition_pruning TO DEFAULT;
>     -- incorrectly finds zero tuples
>    SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';
>
> Now move the tuples to the right partitions:
>
>    BEGIN;
>    WITH d AS (
>      DELETE FROM tpart0
>      WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000')
>      RETURNING *
>    ) INSERT INTO tpart SELECT * FROM d;
>    WITH d AS (
>      DELETE FROM tpart1
>      WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF')
>      RETURNING *
>    ) INSERT INTO tpart SELECT * FROM d;
>    COMMIT;
>
> The order of operations should be to fix indexes, unique constraints,
> and check constraints first; and then to fix partitioned tables. That
> way the above partitioned table queries get correct results using the
> indexes.
>
> Granted, this is not a great user experience, and we could do better in
> the future (e.g. add some new utility commands). But this is an
> improvement in the sense that it's a small number of affected objects,
> unlike a collation version change, which affects anything related to
> text.

We currently have two Commitfest entries for the same thread at [1]
and [2]. Are both still necessary, or can we consolidate tracking into
a single entry?
[1] - https://commitfest.postgresql.org/patch/5472/
[2] - https://commitfest.postgresql.org/patch/5563/

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Implement waiting for wal lsn replay: reloaded
Next
From: Daniel Gustafsson
Date:
Subject: Re: Adding support for SSLKEYLOGFILE in the frontend