Thread: Re: Update Unicode data to Unicode 16.0.0

Re: Update Unicode data to Unicode 16.0.0

From
Peter Eisentraut
Date:
On 05.02.25 22:47, Jeff Davis wrote:
>    (b) we should make reasonable attempts to mitigate potential
> problems.
> 
> One idea for (b) resurfaced, which was to make a best-effort check at
> pg_upgrade time for affected indexes. The check would not be
> bulletproof, because we can't catch dependencies that are hidden inside
> SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
> most potential problems.
> 
> Patch attached. A few notes:
> 
>    * The dependency entries don't exist because LOWER(), etc., are
> system objects (pinned); so it queries the indexprs, indpreds,
> partexprs, and conbin.
>    * The query is large and perhaps too clever, but it seems to work. I
> tried to add inline comments to the SQL, and pgindent had its own ideas
> about how to format them -- suggestions welcome.
>    * We haven't actually done the Unicode update yet, so it will notice
> that the PG17 and PG18 Unicode versions are the same, and return early.
> Either apply on top of the Unicode update patch, or comment out the
> early return for testing.
>    * It emits a warning rather than an error, so you need to specify
> pg_upgrade with "-r" to see the output file.
>    * I didn't adapt the query to run on pre-17 versions, even though it
> could find some potential problem cases (like an index on NORMALIZE()).
> I can add that if someone thinks it's worthwhile.

This all looks quite reasonable to me.  The code could obviously use a 
bit closer inspection, but the way you've structured it looked quite 
okay to me.





Re: Update Unicode data to Unicode 16.0.0

From
Nathan Bossart
Date:
On Mon, Feb 17, 2025 at 11:46:43AM -0800, Jeff Davis wrote:
> Attached a version that rebases both patches. In my patch, I added a
> report_status().

I briefly looked at v2-0002, and the UpgradeTask usage looks correct to me.
Did you find it easy enough to use?

+    /*
+     * The builtin provider did not exist prior to version 17. While there are
+     * still problems that could potentially be caught from earlier versions,
+     * such as an index on NORMALIZE(), we don't check for that here.
+     */
+    if (GET_MAJOR_VERSION(cluster->major_version) < 1700)
+        return;

nitpick: In most cases, I think this check is done in
check_and_dump_old_cluster() before actually calling the checking function.
I don't think there's any big problem here, except you might strand the
memory allocated for the task.

+    if (!unicode_version_changed(cluster))
+    {
+        check_ok();
+        return;
+    }

Same nitpick here about stranding the task memory.

+        report_status(PG_WARNING, "warning");
+        pg_log(PG_WARNING, "Your installation contains relations that may be affected by a new version of Unicode.\n"
+               "A list of potentially-affected relations is in the file:\n"
+               "    %s", report.path);

This may have been discussed upthread, but is a warning enough?  That seems
like something that could very easily be missed.

-- 
nathan



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Wed, 2025-03-05 at 14:33 -0600, Nathan Bossart wrote:
> +               report_status(PG_WARNING, "warning");
> +               pg_log(PG_WARNING, "Your installation contains
> relations that may be affected by a new version of Unicode.\n"
> +                          "A list of potentially-affected relations
> is in the file:\n"
> +                          "    %s", report.path);
>
> This may have been discussed upthread, but is a warning enough?  That
> seems
> like something that could very easily be missed.

There can be false positives, because even if such an expression index
exists, it's often not an actual problem. Do we want to stop an upgrade
from happening in that case? I doubt it, but if so, we'd need some kind
of option to bypass it.

Regards,
    Jeff Davis


Re: Update Unicode data to Unicode 16.0.0

From
Nathan Bossart
Date:
On Wed, Mar 05, 2025 at 03:34:06PM -0800, Jeff Davis wrote:
> On Wed, 2025-03-05 at 14:33 -0600, Nathan Bossart wrote:
>> +               report_status(PG_WARNING, "warning");
>> +               pg_log(PG_WARNING, "Your installation contains
>> relations that may be affected by a new version of Unicode.\n"
>> +                          "A list of potentially-affected relations
>> is in the file:\n"
>> +                          "    %s", report.path);
>> 
>> This may have been discussed upthread, but is a warning enough?  That
>> seems
>> like something that could very easily be missed.
> 
> There can be false positives, because even if such an expression index
> exists, it's often not an actual problem. Do we want to stop an upgrade
> from happening in that case? I doubt it, but if so, we'd need some kind
> of option to bypass it.

I see.  Do we provide any suggested next steps for users to assess the
potentially-affected relations?

-- 
nathan



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> I see.  Do we provide any suggested next steps for users to assess
> the
> potentially-affected relations?

I don't know exactly where we should document it, but I've attached a
SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
assuming that we've updated Unicode to 16.0.0 in PG18.

The change in Unicode that I'm focusing on is the addition of U+A7DC,
which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
lowercases to U+019B. The examples assume that the user is using
unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation.

The example table 'test' has a single row with U+A7DC and an index
test_idx on LOWER(t). Because the codepoint is unassigned in PG17,
LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'.
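
A minimal sketch of that setup (assuming the column uses the PG_C_UTF8
collation; the attached file may differ in details):

   CREATE TABLE test (t text COLLATE PG_C_UTF8);
   INSERT INTO test VALUES (U&'\A7DC');
   CREATE INDEX test_idx ON test (LOWER(t));

   -- PG17/Unicode 15.1: returns U+A7DC unchanged (unassigned code point)
   -- PG18/Unicode 16.0: returns U+019B
   SELECT LOWER(U&'\A7DC' COLLATE PG_C_UTF8);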

The first thing the user should do if they see the warnings during the
--check phase is to rerun with "--check --retain" so they can keep the
file. Maybe that's not great, and we should keep the file regardless
for these particular kinds of warnings.

Then go ahead and upgrade. After upgrade, the user should run:

  REINDEX INDEX CONCURRENTLY test_idx;

for each potentially-affected index. If they have amcheck installed, they
can do:

  select bt_index_check(index=>'test_idx', heapallindexed=>true);

first, and only REINDEX if there's an error to fix. Do the same for
other indexes listed in the warning file.

The next example is tuniq, which is similar except that the index is a
unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In
PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former
because it's already lowercase, and the latter because it's unassigned
and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'.
So, a REINDEX will encounter a uniqueness violation, and the user will
need to decide how to handle the duplicate values.
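
A sketch of that setup (DDL inferred from the description; the attached
file may differ):

   CREATE TABLE tuniq (t text COLLATE PG_C_UTF8);
   INSERT INTO tuniq VALUES (U&'\019B'), (U&'\A7DC');
   CREATE UNIQUE INDEX tuniq_idx ON tuniq (LOWER(t));

   -- builds fine on PG17; on PG18 it fails because both rows now
   -- lowercase to U+019B:
   REINDEX INDEX tuniq_idx;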

The next example is tcheck, which does CHECK(LOWER(t) = t), which is
the most plausible check constraint I can think of that would be
affected. In theory, the user might have to reason through what should
be done about a tuple that violates it, but in practice it might be as
simple as:

   UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t);
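
For completeness, a minimal 'tcheck' definition matching that
description (again inferred, not necessarily the attached version):

   CREATE TABLE tcheck (t text COLLATE PG_C_UTF8 CHECK (LOWER(t) = t));
   -- accepted on PG17, where LOWER() returns the unassigned code point
   -- unchanged; violates the constraint under PG18 semantics:
   INSERT INTO tcheck VALUES (U&'\A7DC');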

The last example is partitioning where the table is range partitioned
on LOWER(t), and the tuple ends up in the wrong partition on PG18.
After fixing the indexes with:

   -- tpart0_lower_idx is fine, no REINDEX needed
   select bt_index_check(index=>'tpart0_lower_idx',
     heapallindexed=>true);
   select bt_index_check(index=>'tpart1_lower_idx',
     heapallindexed=>true);
   REINDEX INDEX CONCURRENTLY tpart1_lower_idx;

See the problem with:

   SET enable_partition_pruning=off;
   -- correctly finds one tuple
   SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

   SET enable_partition_pruning TO DEFAULT;
   -- incorrectly finds zero tuples
   SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

Now move the tuples to the right partitions:

   BEGIN;
   WITH d AS (
     DELETE FROM tpart0
     WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000')
     RETURNING *
   ) INSERT INTO tpart SELECT * FROM d;
   WITH d AS (
     DELETE FROM tpart1
     WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF')
     RETURNING *
   ) INSERT INTO tpart SELECT * FROM d;
   COMMIT;

The order of operations should be to fix indexes, unique constraints,
and check constraints first; and then to fix partitioned tables. That
way the above partitioned table queries get correct results using the
indexes.
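
For reference, a plausible definition of the partitioned table in this
example (hypothetical DDL; the bounds match the queries above):

   CREATE TABLE tpart (t text COLLATE PG_C_UTF8)
     PARTITION BY RANGE (LOWER(t));
   CREATE TABLE tpart0 PARTITION OF tpart
     FOR VALUES FROM (U&'\0001') TO (U&'\8000');
   CREATE TABLE tpart1 PARTITION OF tpart
     FOR VALUES FROM (U&'\8000') TO (U&'\FFFF');
   CREATE INDEX tpart_lower_idx ON tpart (LOWER(t));

   -- On PG17, U+A7DC lowercases to itself and is routed to tpart1;
   -- on PG18 it lowercases to U+019B, which belongs in tpart0:
   INSERT INTO tpart VALUES (U&'\A7DC');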

Granted, this is not a great user experience, and we could do better in
the future (e.g. add some new utility commands). But this is an
improvement in the sense that it's a small number of affected objects,
unlike a collation version change, which affects anything related to
text.

Regards,
    Jeff Davis


Attachment

Re: Update Unicode data to Unicode 16.0.0

From
Peter Eisentraut
Date:
On 17.02.25 20:46, Jeff Davis wrote:
> Note that the Unicode update has a few test diffs for NORMALIZE(),
> please check to see if the tests themselves need an update.

I think these test result changes are incorrect.  AFAICT, nothing has 
changed in the upstream data that would explain such a change.

I didn't get such test differences in my original patch.  Did you rebase 
the patch or regenerate it?  Maybe something went wrong there.




Re: Update Unicode data to Unicode 16.0.0

From
Jeremy Schneider
Date:
On Fri, 07 Mar 2025 13:11:18 -0800
Jeff Davis <pgsql@j-davis.com> wrote:

> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see.  Do we provide any suggested next steps for users to assess
> > the
> > potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> The change in Unicode that I'm focusing on is the addition of U+A7DC,
> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
> lowercases to U+019B. The examples assume that the user is using
> unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8
> collation.

It seems the consensus is to update unicode in core... FWIW, I'm still
in favor of leaving it alone because ICU is there for when I need
up-to-date unicode versions.

From my perspective, the whole point of the builtin collation was to
have one option that avoids these problems that come with updating both
ICU and glibc.

So I guess the main point of the builtin provider is just that it's
faster than ICU?

-Jeremy




Re: Update Unicode data to Unicode 16.0.0

From
Laurenz Albe
Date:
On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
> On Fri, 07 Mar 2025 13:11:18 -0800
> It seems the consensus is to update unicode in core... FWIW, I'm still
> in favor of leaving it alone because ICU is there for when I need
> up-to-date unicode versions.

Me too.

> From my perspective, the whole point of the builtin collation was to
> one option that avoids these problems that come with updating both ICU
> and glibc.

+1

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From
Joe Conway
Date:
On 3/15/25 03:26, Laurenz Albe wrote:
> On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
>> On Fri, 07 Mar 2025 13:11:18 -0800
>> It seems the consensus is to update unicode in core... FWIW, I'm still
>> in favor of leaving it alone because ICU is there for when I need
>> up-to-date unicode versions.
> 
> Me too.

+1

>> From my perspective, the whole point of the builtin collation was to
>> have one option that avoids these problems that come with updating both
>> ICU and glibc.
> 
> +1

+1

In the long term I think we should figure out how to support newer
versions of unicode for the builtin, but in my mind that might involve
the necessity of supporting multiple versions of unicode such that the
choice remains to stay on the older one.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Update Unicode data to Unicode 16.0.0

From
Tom Lane
Date:
Jeremy Schneider <schneider@ardentperf.com> writes:
> On Fri, 07 Mar 2025 13:11:18 -0800
> Jeff Davis <pgsql@j-davis.com> wrote:
>> The change in Unicode that I'm focusing on is the addition of U+A7DC,
>> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
>> lowercases to U+019B. The examples assume that the user is using
>> unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8
>> collation.

> It seems the consensus is to update unicode in core... FWIW, I'm still
> in favor of leaving it alone because ICU is there for when I need
> up-to-date unicode versions.

> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating both
> ICU and glibc.

I don't really buy this argument.  If we sit on Unicode 15 until that
becomes untenable, which it will, then people will still be faced
with a behavioral change whenever we bow to reality and invent a
"builtin-2.0" or whatever collation.  Moreover, by then they might
well have instances of the newly-assigned code points in their
database, making the changeover real and perhaps painful for them.

On the other hand, if we keep up with the Joneses by updating the
Unicode data, we can hopefully put those behavioral changes into
effect *before* they'd affect any real data.  So it seems to me
that freezing our Unicode data is avoiding hypothetical pain now
at the price of certain pain later.

I compare this to our routine timezone data updates, which certainly
have not been without occasional pain ... but does anyone seriously
want to argue that we should still be running tzdata from 20 years
back?  Or even 5 years back?

In fact, on the analogy of timezones, I think we should not only
adopt newly-published Unicode versions pretty quickly but push
them into released branches as well.  Otherwise the benefit of
staying ahead of real use of the new code points isn't there
for end users.

            regards, tom lane



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
> On the other hand, if we keep up with the Joneses by updating the
> Unicode data, we can hopefully put those behavioral changes into
> effect *before* they'd affect any real data.

That's a good point.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating both
> ICU and glibc.
>
> So I guess the main point of the builtin provider is just that it's
> faster than ICU?

It doesn't break primary keys.

Also, it's stable within a major version, we can document and test its
behavior, it solves 99% of the upgrade problem, and what problems
remain are much more manageable.

And yes, its collation is way, way faster than ICU's.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Jeremy Schneider
Date:
> On Mar 15, 2025, at 10:22 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
>> On the other hand, if we keep up with the Joneses by updating the
>> Unicode data, we can hopefully put those behavioral changes into
>> effect *before* they'd affect any real data.
>
> That's a good point.

Jeff - thanks for the reminder that this is just about character semantics and not ordering. Obviously C collation by
definition (code point ordering) doesn't change sort order… two weeks ago I was working on updating the torture test
GitHub page with glibc collation changes up through Ubuntu 24.10 so my mind was definitely over there. No detected
changes in en-US so that's great news. 🙂

Is the simple answer that functions & clauses related to both time zones and character semantics should just all be
considered STABLE instead of IMMUTABLE?

I think if that were the case then changes across a minor version would simply be allowed by definition right? No need
for warnings.

This would impact the ability to create case-insensitive indexes.

-Jeremy

Sent from my TI-83




Re: Update Unicode data to Unicode 16.0.0

From
vignesh C
Date:
On Sat, 8 Mar 2025 at 02:41, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see.  Do we provide any suggested next steps for users to assess
> > the
> > potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> [ detailed examples elided ]

We currently have two Commitfest entries for the same thread at [1]
and [2]. Are both still necessary, or can we consolidate tracking into
a single entry?
[1] - https://commitfest.postgresql.org/patch/5472/
[2] - https://commitfest.postgresql.org/patch/5563/

Regards,
Vignesh



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Sat, 2025-03-15 at 18:23 -0700, Jeremy Schneider wrote:
> Is the simple answer that functions & clauses related to both time
> zones and character semantics should just all be considered STABLE
> instead of IMMUTABLE?

That was discussed a few times, but:

(a) That doesn't exactly solve the problem, because people still need
indexes on LOWER() or CASEFOLD(); and

(b) If we change IMMUTABLE to mean "returns the same results on every
platform for all time", that would be too strict for many purposes,
like the planner doing constant folding.

I have been thinking about ways we can express the right dependencies,
and I may be making some proposals along those lines.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Sun, 2025-03-16 at 19:10 +0530, vignesh C wrote:
> We currently have two Commitfest entries for the same thread at [1]
> and [2]. Are both still necessary, or can we consolidate tracking
> into
> a single entry?

I'm fine removing my CF entry, but unfortunately there's no "withdrawn
-- duplicate", so it might send the wrong message.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> That was discussed a few times, but:

> (a) That doesn't exactly solve the problem, because people still need
> indexes on LOWER() or CASEFOLD(); and

> (b) If we change IMMUTABLE to mean "returns the same results on every
> platform for all time", that would be too strict for many purposes,
> like the planner doing constant folding.

Yeah.  Not only would the set of functions meeting such a standard be
vanishingly small, but so would the set of use-cases.  What we need is
some sort of understanding that "this is okay to use in indexes",
"this is okay to constant-fold when planning", etc.  Maybe it's less
about "is it okay to just assume this" and more about "can we devise
a method for figuring out when we have to reindex, replan, etc".
We've got bits of that in our collation versioning infrastructure,
but that doesn't cover every source of infrequently-mutating behavior.

> I have been thinking about ways we can express the right dependencies,
> and I may be making some proposals along those lines.

I await a proposal with interest.

            regards, tom lane



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Thu, 2025-03-13 at 14:49 +0100, Peter Eisentraut wrote:
> I think these test result changes are incorrect.  AFAICT, nothing has
> changed in the upstream data that would explain such a change.
>
> I didn't get such test differences in my original patch.  Did you
> rebase
> the patch or regenerate it?  Maybe something went wrong there.

Right. The previous version was improperly rebased.

I went back to your original version and rebased over master (including
the latest optimizations for case mapping), and then rebased my changes
on top of that. Attached.

One annoyance with the recent case mapping optimizations is that the
Unicode update diff is larger, similar to the diff for
unicode_norm_hashfunc.h. Previously, it was binary search, so the only
differences were logical differences.

Regards,
    Jeff Davis


Attachment

Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
> In fact, on the analogy of timezones, I think we should not only
> adopt newly-published Unicode versions pretty quickly but push
> them into released branches as well.

That approach suggests that we consider something like my previous
STRICT_UNICODE proposal[1]. If Postgres updates Unicode quickly enough,
there's not much reason that users would need to use unassigned code
points, so it would be practical to just reject them (as an option).
That would dramatically reduce the practical problems people would
encounter when we do update Unicode.

Note that assigned code points can still change behavior in later
versions, but not in ways that would typically cause a problem for
things like indexes. For instance, U+0363 changed from non-Alphabetic
to Alphabetic in Unicode 16, which changes the results of the
expression:

  U&'\0363' ~ '[[:alpha:]]' COLLATE PG_C_UTF8

from false to true, even though U+0363 is assigned in both Unicode
15.1.0 and 16.0.0. That might plausibly matter, but such cases would be
more obscure than case folding.

Regards,
    Jeff Davis

[1] https://commitfest.postgresql.org/patch/4876/




Re: Update Unicode data to Unicode 16.0.0

From
vignesh C
Date:
On Mon, 17 Mar 2025 at 23:03, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2025-03-16 at 19:10 +0530, vignesh C wrote:
> > We currently have two Commitfest entries for the same thread at [1]
> > and [2]. Are both still necessary, or can we consolidate tracking
> > into
> > a single entry?
>
> I'm fine removing my CF entry, but unfortunately there's no "withdrawn
> -- duplicate", so it might send the wrong message.

Yes, we don't have "withdrawn duplicate", so I have closed it as
withdrawn; we can get it reviewed and committed using the other
entry anyway.

Regards,
Vignesh



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Sat, 2025-03-15 at 10:14 -0400, Joe Conway wrote:
> In the long term I think we should figure out how to support newer
> versions of unicode for the builtin, but in my mind that might involve
> the necessity of supporting multiple versions of unicode such that the
> choice remains to stay on the older one.

What do you think of Tom's argument that waiting to update Unicode is
what creates the problem in the first place?

"by then they might well have instances of the newly-assigned code
points in their database"[1]

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/3481161.1742055336@sss.pgh.pa.us



Re: Update Unicode data to Unicode 16.0.0

From
Jeremy Schneider
Date:
On Tue, 18 Mar 2025 08:53:56 -0700
Jeff Davis <pgsql@j-davis.com> wrote:

> What do you think of Tom's argument that waiting to update Unicode is
> what creates the problem in the first place?
> 
> "by then they might well have instances of the newly-assigned code
> points in their database"[1]
> 
> [1]
> https://www.postgresql.org/message-id/3481161.1742055336@sss.pgh.pa.us

Waiting to update Unicode is not what creates the problem, as long as
we support in-place major upgrades without rebuilding indexes.


It seems that we think that it's really uncommon to make indexes on
timezone data. We think that breaking primary keys must definitely be
avoided. We think case-insensitive indexes are probably uncommon, so as
long as it's "rare" we can let them break.

I'm not asking for an extreme definition of "IMMUTABLE" but I'd be
very happy with a GUC "data_safety=radical_like_jeremy" where Postgres
simply won't start if the control file says it was from a different
operating system or architecture or ICU/glibc collation version. I can
disable the GUC (like a maintenance mode) to rebuild my indexes and
update my collation versions, and ideally this GUC would also mean that
indexes simply aren't allowed to be created on functions that might
change within the guarantees that are made. (And range-based partitions
can't use them, and FDWs can't rely on them for query planning, etc.)

I get that we don't want to break stuff people are currently doing -
but I think there are a bunch of people who would be happy with an
option to reduce functionality going forward in exchange for clearly
defined behavior. I just don't want to have to think about this on
behalf of all my Postgres users.

-Jeremy



Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Tue, Mar 18, 2025 at 11:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you think of Tom's argument that waiting to update Unicode is
> what creates the problem in the first place?
>
> "by then they might well have instances of the newly-assigned code
> points in their database"[1]

I know you weren't asking me, but FWIW, I think Tom has a point, but I
also don't completely buy it. There's no technological safeguard
against people using Unicode code points before they're assigned, so we
can't assume that no one ever does. And they probably do.

I think Joe has the right idea. The way to actually provide the
stability that people want here is to continue supporting old versions
while adding support for new versions. Anything else we do works
subject to assumptions: you can either assume that people don't use
code points before they're assigned (as Tom proposes) or you can
assume that not updating to newer Unicode specs will not cause any
problems for anyone. Joe's proposal is unique in requiring no
assumptions about what users will do or what they care about. It just
works, period. The only disadvantage is that it's more work for us,
but this problem has caused enough suffering over the years that I bet
we can find hackers to implement it and maintain it on an ongoing
basis without great difficulty.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Tue, 2025-03-18 at 09:28 -0700, Jeremy Schneider wrote:
> We think case-insensitive indexes are probably uncommon, so as
> long as its "rare" we can let them break.

Let's define "break" in this context to mean that the constraints are
not enforced, or that the query doesn't return the results that the
user is expecting.

Let's say a user has an index on LOWER(t) in PG17 (Unicode 15.1). Then
Unicode 16.0 comes out, introducing the newly-assigned U+A7DC, which
lowercases to U+019B. The rest of the world moves on and starts using
U+A7DC.

There are only two ways that Postgres can prevent breakage:

1. Update the database to Unicode 16.0 before U+A7DC is encountered, so
that it's properly lowercased to U+019B, and a query on LOWER(t) =
U&'\019B' will correctly return the record containing it.

2. Prevent U+A7DC from going into the database at all.

Continuing on with Unicode 15.1 and accepting the unassigned code point
*cannot* prevent breakage.
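
To make that concrete, a sketch reusing the 'test' table from the
earlier demonstration:

   -- Once the rest of the world starts storing U+A7DC, searches by its
   -- Unicode 16 lowercase form miss the row on Unicode 15.1:
   SELECT * FROM test WHERE LOWER(t) = U&'\019B';  -- finds nothing
   -- After updating to Unicode 16.0 and reindexing test_idx, the same
   -- query returns the row containing U+A7DC.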

A truly paranoid user would want a combination of both solutions:
regular Unicode updates; and something like STRICT_UNICODE
( https://commitfest.postgresql.org/patch/4876/ ) to protect the user
between the time Unicode assigns the code point and the time they can
deploy a version of Postgres that understands it.

You are rightfully concerned that updating Unicode can create its own
inconsistencies, and if nothing is done that can lead to breakage as
well. The upgrade-time check in this thread is one solution to that
problem, but we could do a lot more.

You are also right that we should be more skeptical of an internal
inconsistency (e.g. different results for seqscan vs indexscan) than a
wider definition of inconsistency. But the user created a unicode-based
case-folded index there for a reason, and we shouldn't lose sight of
that.

> I'm not asking for an extreme definition of "IMMUTABLE" but I'd be
> very happy with a GUC "data_safety=radical_like_jeremy" where
> Postgres
> simply won't start if the control file says it was from a different
> operating system or architecture or ICU/glibc collation version. I
> can
> disable the GUC (like a maintenance mode) to rebuild my indexes and
> update my collation versions, and ideally this GUC would also mean
> that
> indexes simply aren't allowed to be created on functions that might
> change within the guarantees that are made. (And range-based
> partitions
> can't use them, and FDWs can't rely on them for query planning, etc.)

Does the upgrade check patch in this thread accomplish that for you? If
not, what else does it need?

It's an upgrade-time check rather than a GUC, but it basically seems to
match what you want. See:

https://www.postgresql.org/message-id/16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Tue, Mar 18, 2025 at 2:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Continuing on with Unicode 15.1 and accepting the unassigned code point
> *cannot* prevent breakage.

Under your definition, this is true, but I think Jeremy would define
breakage differently. His primary concern, I expect, is *stability*.
Breakage means that the same supposedly-stable results return
different answers on the same data. Under that definition, continuing
under Unicode 15.1 does prevent breakage.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 18, 2025 at 2:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Continuing on with Unicode 15.1 and accepting the unassigned code point
>> *cannot* prevent breakage.

> Under your definition, this is true, but I think Jeremy would define
> breakage differently. His primary concern, I expect, is *stability*.
> Breakage means that the same supposedly-stable results return
> different answers on the same data. Under that definition, continuing
> under Unicode 15.1 does prevent breakage.

That approach works only if you sit on Unicode 15.1 *forever*.
The impracticality of that seems obvious to me.  Sooner or later
you will need to update, and then you are going to suffer pain.
(In the running example of this thread, a unique index on LOWER(t)
might not only be corrupt, but might fail reindex due to the
constraint being violated under the newer rules.)  The longer you
wait, the more probable it is that you are going to have problems,
and the more painful it'll be to clean things up.

Now, if you both sit on Unicode 15.1 forever and disallow the
introduction of unassigned-per-15.1 code points, you can escape
that fate, but that approach brings its own kind of pain.

The short answer is that "immutable" = "doesn't change till the heat
death of the universe" is a definition that is not useful when
dealing with this type of data.  Other people determine the reality
that you have to deal with.

            regards, tom lane



Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Tue, Mar 18, 2025 at 3:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That approach works only if you sit on Unicode 15.1 *forever*.
> The impracticality of that seems obvious to me.  Sooner or later
> you will need to update, and then you are going to suffer pain.

I completely agree.

> The short answer is that "immutable" = "doesn't change till the heat
> death of the universe" is a definition that is not useful when
> dealing with this type of data.  Other people determine the reality
> that you have to deal with.

I think that's mostly true because of lack of versioning capabilities,
or crappy versioning practices. glibc, AIUI, just disclaims collation
stability: if you're fool enough to sort anything with one of their
collations, that's on you. To me, that seems like an obviously
user-hostile position, as if it were reasonable to suppose that an
algorithm whose whole purpose is to implement a sort order would not
be used for, uh, sorting. Or at least not any sort of sorting where
you don't immediately throw away the results (and then why did you
bother?). ICU doesn't seem to be entirely stable, either. But none of
that means stability isn't a valuable property. It just means people
have done a bad job implementing it. If we give people the ability to
execute operation X using Unicode 15.1 or Unicode 16.0, they're still
*eventually* going to have to migrate forward to Unicode 16.0 or some
later version, because we're probably not going to keep Unicode 15.1
until the heat death of the universe. But we allow people to not have that
update forced upon them at the same time they're trying to change
other things, and that's pretty darn useful. That's why extensions
have separate versioning from the server, for instance.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Joe Conway
Date:
On 3/18/25 16:30, Robert Haas wrote:
> On Tue, Mar 18, 2025 at 3:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That approach works only if you sit on Unicode 15.1 *forever*.
>> The impracticality of that seems obvious to me.  Sooner or later
>> you will need to update, and then you are going to suffer pain.
> 
> I completely agree.
> 
>> The short answer is that "immutable" = "doesn't change till the heat
>> death of the universe" is a definition that is not useful when
>> dealing with this type of data.  Other people determine the reality
>> that you have to deal with.
> 
> I think that's mostly true because of lack of versioning capabilities,
> or crappy versioning practices. glibc, AIUI, just disclaims collation
> stability: if you're fool enough to sort anything with one of their
> collations, that's on you. To me, that seems like an obviously
> user-hostile position, as if it were reasonable to suppose that an
> algorithm whose whole purpose is to implement a sort order would not
> be used for, uh, sorting. Or at least not any sort of sorting where
> you don't immediately throw away the results (and then why did you
> bother?). ICU doesn't seem to be entirely stable, either.

Yep

> But none of that means stability isn't a valuable property. It just
> means people have done a bad job implementing it. If we give people
> the ability to execute operation X using Unicode 15.1 or Unicode 16.0,
> they're still *eventually* going to have to migrate forward to Unicode
> 16.0 or some later version, because we're probably not going to keep
> Unicode 15.1 until the heat death of the universe. But we allow people
> to not have that update forced upon them at the same time they're
> trying to change other things, and that's pretty darn useful. That's
> why extensions have separate versioning from the server, for
> instance.

+1 Robert articulates my thinking exactly, and much better than I did :-)

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Tue, 2025-03-18 at 14:45 -0400, Robert Haas wrote:
> I think Joe has the right idea. The way to actually provide the
> stability that people want here is to continue supporting old
> versions
> while adding support for new versions. Anything else we do works
> subject to assumptions: you can either assume that people don't use
> code points before they're assigned (as Tom proposes) or you can
> assume that not updating to newer Unicode specs will not cause any
> problems for anyone. Joe's proposal is unique in requiring no
> assumptions about what users will do or what they care about. It just
> works, period. The only disadvantage is that it's more work for us,
> but this problem has caused enough suffering over the years that I
> bet
> we can find hackers to implement it and maintain it on an ongoing
> basis without great difficulty.

I've already put a fair amount of work into this approach, but it is a
lot of work, and I could use some help. Here's a quick dump of some of
my notes on what we can do going forward:

* builtin provider with stable primary keys: done

* collation behavior as method tables: done

* have support for STRICT_UNICODE, or something like it, to allow users
to mitigate their upgrade risks by rejecting unassigned code points: I
submitted a proposal for a database-level option, which got no
discussion

* ctype behavior as method tables: patch written, discussion trailed
off. There was a really minor performance regression, so I held off
committing it, but I don't think it's an actual problem so if people
are in general agreement that we want it then I have no problem
committing it.

* separate "collation provider" from "ctype provider" internally. Have
pg_open_collation() and pg_open_ctype(), and deprecate
pg_newlocale_from_collation(). This is a fair amount of work, but it's
important for dependency and version tracking, as well as an
organizational improvement.

* turn providers into method tables: not too hard. We'd still need to
have the notion of a "provider kind" (builtin, ICU, libc) so that we
know how to interpret the syntax and store things in the other catalogs
(for instance, only ICU accepts ICU_RULES, only libc allows LC_COLLATE
and LC_CTYPE to be different, etc.).

* put providers into new shared catalogs pg_collation_provider and
pg_ctype_provider, which would each have handlers that know how to
instantiate a specific collation or ctype

* add new function markers COLLATE and CTYPE (or some other names),
meaning that the function is sensitive to the collation or ctype of its
arguments.
  - for example: LOWER() would be marked CTYPE, ">" would be marked
COLLATE, and "||" wouldn't need any mark (see the hypothetical sketch
after this list).
  - When creating some object that has an expression in it, let's say
an index, we already walk the expression and add dependencies on the
functions in the expression. If one of those functions has such a
marker, we would look at the inferred collation of the function, find
its provider, and add a dependency on the provider's shared catalog
entry.
  - must work even on "pinned" functions
  - queries with ORDER BY, say as part of an MV definition, would be
implicitly treated like functions marked with COLLATE

* (optional) have some kind of runtime check so that UDFs that are
missing the appropriate COLLATE or CTYPE markers figure out that a
collation or ctype is being opened, and throw a WARNING or ERROR

* throw away the idea of collation-specific versions, or make it more
of an additional check. Versions would be attached to the provider
entries in the shared catalogs. The only provider that differentiates
collation versions by locale is ICU, and people were highly skeptical
of that before we found bugs in it, and more skeptical afterward. They
will just be a source of confusion in the long term.

* Have some new functions and DDL commands that can find and fix
objects by following the dependency links.

* Allow extensions to be loaded at initdb time, and initialize their
own providers and their own lists of collations.

* Provide a contrib that implements the builtin provider with unicode
15.1.

* If we want multiple versions of a provider in the same running
server, that would take more work. I have my doubts about how many
people would really, actually use that, but it's possible.
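
To illustrate the marker idea, a purely hypothetical sketch (no such
syntax exists today; the marker name follows the list item above):

   -- hypothetical: CTYPE marks the function as sensitive to the ctype
   -- of its text arguments
   CREATE FUNCTION fold_key(text) RETURNS text
       LANGUAGE sql IMMUTABLE CTYPE
       RETURN LOWER($1);

   CREATE TABLE docs (body text);
   -- creating this index would then record a dependency on the ctype
   -- provider behind the collation inferred for fold_key's argument:
   CREATE INDEX docs_fold_idx ON docs (fold_key(body));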


I plan to submit some proposals in a few weeks as this CF settles down,
and then have an unconference session on this topic at pgconf.dev. If
anyone is motivated for these problems to be fixed, please jump into
those discussions on list or at the conference, and take on a task or
two.

I am not trying to be dismissive of the concerns raised in this thread,
but I'd like others to understand that what they are asking for is a
lot of work, and that the builtin collation provider solves 99% of it
already. All this effort is to solve that last 1%.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Tue, Mar 18, 2025 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I am not trying to be dismissive of the concerns raised in this thread,
> but I'd like others to understand that what they are asking for is a
> lot of work, and that the builtin collation provider solves 99% of it
> already. All this effort is to solve that last 1%.

I'm not asking you to do the work. It would be great if you did, of
course, but you don't work for me.

But I could not disagree more strongly with the idea that this problem
is 99% solved. That doesn't seem remotely true to me. I'm not sure the
problem is 1% solved.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Tue, 2025-03-18 at 21:34 -0400, Robert Haas wrote:
> But I could not disagree more strongly with the idea that this
> problem
> is 99% solved. That doesn't seem remotely true to me. I'm not sure
> the
> problem is 1% solved.

If we compare the following two problems:

  A. With glibc or ICU, every text index, including primary keys, is
highly vulnerable to inconsistencies after an OS upgrade, even if
there's no Postgres upgrade; vs.

  B. With the builtin provider, only expression indexes and a few other
things are vulnerable, only during a major version upgrade, and mostly
(but not entirely) when using recently-assigned Cased letters.

To me, problem A seems about 100 times worse than B almost any way I
can imagine measuring it: number of objects vulnerable, severity of the
problem when it does happen, likelihood of a vulnerable object having
an actual problem, etc. If you disagree, I'd like to hear more.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Jeremy Schneider
Date:
On Tue, 18 Mar 2025 19:33:00 -0700
Jeff Davis <pgsql@j-davis.com> wrote:

> If we compare the following two problems:
> 
>   A. With glibc or ICU, every text index, including primary keys, is
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
> 
>   B. With the builtin provider, only expression indexes and a few
> other things are vulnerable, only during a major version upgrade, and
> mostly (but not entirely) when using recently-assigned Cased letters.
> 
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of
> the problem when it does happen, likelihood of a vulnerable object
> having an actual problem, etc. If you disagree, I'd like to hear more.

Jeff - you and several others have literally put years into making this
better, and it's deeply appreciated. I agree that with the builtin
provider we're in a much better place.

I don't quite understand Tom's argument about why Unicode 15 must
eventually become untenable. Why are we assuming it will? In Oracle's
entire history, I think they have only ever supported four versions of
Unicode. [1] MySQL seems to have added their second only recently. [2]
And again - we have ICU if I need the latest emoji characters. Frankly,
Unicode 15 is pretty good. Most updates to unicode these days are fairly
minor.

Maybe Postgres can be the first database to always ship support for the
latest Unicode with each major version - but I think we should design
that right if we're going to do it. If we just stay on Unicode 15 for
now then there are no problems with case insensitive indexes or range
partitioned tables returning wrong query results after a major version
upgrades.

There's been a lot of discussion about indexes, but this SQL also seems
to work:

postgres=# create table test_events(customer_name text, ts timestamp,
message text) partition by range((lower(customer_name)));

I'm sure that people shouldn't do this ... but if anyone /did/ then it
wouldn't be as simple as an index rebuild after their major version
upgrade.

I had never really considered it before, but this SQL also seems to work

postgres=# create table test_events(id uuid, ts timestamp, message
text) partition by range((ts at time zone 'America/Sao_Paulo')); 

I'm sure that people shouldn't do that either ... but if anyone did then
would their rows be in the wrong partition after they upgraded
from 11.4 to 11.5?

The difficulty here is that I work at a company with thousands of
developers and lots of Postgres and I see people do things all the time
that we might think they "shouldn't" do.

Before we bump the unicode version, personally I'd just like to have
some tools to make it so people actually can't do the things they
shouldn't do.

-Jeremy


1:

https://docs.oracle.com/en/database/oracle/oracle-database/23/nlspg/appendix-A-locale-data.html#GUID-CC85A33C-81FC-4E93-BAAB-1B3DB9036060__CIABEDHB

2:
https://dev.mysql.com/blog-archive/mysql-character-sets-unicode-and-uca-compliant-collations/





Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Tue, Mar 18, 2025 at 10:33 PM Jeff Davis <pgsql@j-davis.com> wrote:
> If we compare the following two problems:
>
>   A. With glibc or ICU, every text index, including primary keys, is
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
>
>   B. With the builtin provider, only expression indexes and a few other
> things are vulnerable, only during a major version upgrade, and mostly
> (but not entirely) when using recently-assigned Cased letters.
>
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of the
> problem when it does happen, likelihood of a vulnerable object having
> an actual problem, etc. If you disagree, I'd like to hear more.

I see your point, but most people don't use the builtin collation
provider. Granted, we could change the default and then more people
would use it, but I'm not sure people would be happy with the
resulting behavior: a lot of people probably want "a" to sort near "á"
even if they don't have strong preferences about the exact details in
every corner case.

Also, and I think rather importantly, many people are less sensitive
to whether anything is actually broken than to whether anything
hypothetically could be broken. When an EDB customer asks "if I do X,
will anything break," it's often the case that answering "maybe" is
the same as answering "yes". The DBA doesn't necessarily know or care
what the application does or know or care what data is in the
database. They want a hard guarantee that the behavior will not
change. From that point of view, your statement that nothing will
change in minor releases when the builtin provider is used is quite
powerful (and a good argument against back-patching Unicode updates as
Tom proposes).

But people will still need to use other collation providers and they
will still need to do major release upgrades and they also want those
things to be guaranteed not to break. Again, I'm not trying to oblige
you to deliver that behavior and I confess to ignorance on how we
could realistically get there. But I do think it's what people want:
to be forced to endure collation updates infrequently, and to be able
to choose the timing of the update when they absolutely must happen,
and to be able to easily know exactly what they need to reindex.

And from that point of view -- and again, I'm not volunteering to
implement it and I'm not telling you to do it either -- Joe's proposal
of supporting multiple versions sounds fantastic. Because then, I can
do a major version upgrade using pg_upgrade and keep everything pinned
to the old Unicode version or, perhaps even the old ICU version if we
had multi-version libicu support. I may be able to go through several
major version upgrades without ever needing to survive a collation
change. Eventually my hand will be forced, because PostgreSQL will
remove support for the Unicode version I care about or that old
version of libicu won't compile any more or will have security
vulnerabilities or something, but I will have the option to deal with
that collation change before or after any PostgreSQL version changes
that I'm doing. I'll be able to change the collation version at a time
when I'm not changing anything else and deal with JUST that fallout on
its own.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Wed, Mar 19, 2025 at 1:25 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Maybe Postgres can be the first database to always ship support for the
> latest Unicode with each major version

Shipping the latest Unicode with our latest major version is existing
policy, as I understand it. I don't think we're breaking any new
ground here.

> Before we bump the unicode version, personally I'd just like to have
> some tools to make it so people actually can't do the things they
> shouldn't do.

I would, too, but I think that's a hole with no bottom. We're not
realistically going to sort this problem out in the next twelve
months, or even in the next five years. For that reason, I don't
understand how you can propose with a straight face that we pause the
Unicode updates that have been routine up until now until we have the
problem sorted out. That amounts to deciding that we're never going to
add support for any new Unicode versions. And I don't see that as
realistic. We have those Unicode tables because their contents matter.
If we don't update them then, for example, new cased letters won't
case-fold properly. Yeah, maybe if we didn't update for one year,
nobody would care. But five or ten or twenty years from now, somebody
will care, and I don't think we'll have all the stuff you want to
prevent people from hosing themselves even then.

Not only that, but freezing the Unicode tables doesn't even really
help that much, because the much bigger problem comes from the
constant whacking around of the glibc and to a lesser extent ICU
behavior.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Wed, 2025-03-19 at 08:46 -0400, Robert Haas wrote:
> I see your point, but most people don't use the builtin collation
> provider.

The other providers aren't affected by us updating Unicode, so I think
we got off track somehow. I suppose what I meant was:

"If you are concerned about inconsistencies, and you move to the
builtin provider, then 99% of the inconsistency problem is gone. We can
remove the last 1% of the problem if we do all the work listed above."

> When an EDB customer asks "if I do X,
> will anything break," it's often the case that answering "maybe" is
> the same as answering "yes".

That's a good point. However, note that "doesn't break primary keys" is
a nice guarantee, even if there's still some remaining doubts about
expression indexes, etc.

> They want a hard guarantee that the behavior will not
> change.

My understanding of this thread so far was that we were mostly
concerned about internal inconsistencies of stored structures; e.g.
indexes that could return different results than a seqscan.

Not changing query results at all between major versions is a valid
concern, but a fairly strict one that doesn't seem limited to immutable
functions or collation issues. Surely, at least the results of "SELECT
version()" should change from release to release ;-)

> Again, I'm not trying to oblige
> you to deliver that behavior and I confess to ignorance on how we
> could realistically get there.

FWIW I'm not complaining about doing the work. But I think the results
will be better if we can get a few people aligned on a general plan and
collaborating. I will try to kick that off.


> and to be able to easily know exactly what they need to reindex.

That's the main one, I think. The upgrade check offers that for the
builtin provider, though admittedly it's not a very user-friendly
solution, and we can do better.

> And from that point of view -- and again, I'm not volunteering to
> implement it and I'm not telling you to do it either -- Joe's
> proposal
> of supporting multiple versions sounds fantastic.

I certainly don't oppose giving users that choice. But I view it as a
burden we are placing on the users -- better than breakage, but not
really great, either. So if we do put in a ton of work, I'd like it if
we could arrive at a better destination.


If we actually want the BEST user experience possible, they'd not even
really know that their index was ever inconsistent. Autovacuum would
come along and just find the few entries in the index that need fixing,
and reindex just those few tuples. In theory, it should be possible:
there are a finite number of codepoints that change each Unicode
version, and we can just search for them in the data and fix up derived
structures.
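
In that spirit, a sketch of just the search step (assuming the set of
changed code points is known; here only U+A7DC):

   -- rows whose derived entries (e.g. in an index on LOWER(t)) could
   -- change under Unicode 16.0:
   SELECT ctid, t FROM test WHERE t LIKE '%' || U&'\A7DC' || '%';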

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Wed, Mar 19, 2025 at 1:39 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2025-03-19 at 08:46 -0400, Robert Haas wrote:
> > I see your point, but most people don't use the builtin collation
> > provider.
>
> The other providers aren't affected by us updating Unicode, so I think
> we got off track somehow. I suppose what I meant was:
>
> "If you are concerned about inconsistencies, and you move to the
> builtin provider, then 99% of the inconsistency problem is gone. We can
> remove the last 1% of the problem if we do all the work listed above."

All right. I'm not sure I totally buy the 99% number, but I take your point.

> > When an EDB customer asks "if I do X,
> > will anything break," it's often the case that answering "maybe" is
> > the same as answering "yes".
>
> That's a good point. However, note that "doesn't break primary keys" is
> a nice guarantee, even if there's still some remaining doubts about
> expression indexes, etc.

No argument.

> > They want a hard guarantee that the behavior will not
> > change.
>
> My understanding of this thread so far was that we were mostly
> concerned about internal inconsistencies of stored structures; e.g.
> indexes that could return different results than a seqscan.

I think that is true, but inconsistent indexes can be the worst
problem without being the only one.

> Not changing query results at all between major versions is a valid
> concern, but a fairly strict one that doesn't seem limited to immutable
> functions or collation issues. Surely, at least the results of "SELECT
> version()" should change from release to release ;-)

Maybe we should stop doing releases, and then users won't have to
worry about our releases breaking things!

Slightly more seriously, the use of UPPER() and LOWER() in expression
indexes is not that uncommon. Sometimes, the index exists specifically
to enforce a unique constraint. Yes, plain indexes on columns are more
common, and it makes sense to target that case first, but we shouldn't
too quickly hand-wave away the use of case-folding functions as a
thing that doesn't happen.
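
For example (names invented), an index like

    CREATE UNIQUE INDEX users_email_key ON users (LOWER(email));

exists precisely so that 'Alice@example.com' and 'alice@example.com'
collide, and a change in LOWER()'s behavior can quietly break that
guarantee.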

> I certainly don't oppose giving users that choice. But I view it as a
> burden we are placing on the users -- better than breakage, but not
> really great, either. So if we do put in a ton of work, I'd like it if
> we could arrive at a bettter destination.
>
> If we actually want the BEST user experience possible, they'd not even
> really know that their index was ever inconsistent. Autovacuum would
> come along and just find the few entries in the index that need fixing,
> and reindex just those few tuples. In theory, it should be possible:
> there are a finite number of codepoints that change each Unicode
> version, and we can just search for them in the data and fix up derived
> structures.

I have to disagree with this. I think this is a case where fixing
something automatically is clearly worse. First, it could never fix it
instantly, so you would be stuck with some window where queries might
return wrong results -- or if you prevent that by not using the
indexes any more until they're fixed, then it would instead cause huge
query performance regressions that could easily take down the whole
system. Second, one of the things people like least about autovacuum
is when it unexpectedly does a lot of work all at once. Today, that's
usually a vacuum for wrap-around, but suddenly trying to fix all my
indexes when I wasn't expecting that to happen could easily be just as
bad. I strongly believe users want to control what happens, not have
the system try to fix it for them automatically without their
knowledge.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Wed, 2025-03-19 at 14:33 -0400, Robert Haas wrote:

> I strongly believe users want to control what happens, not have
> the system try to fix it for them automatically without their
> knowledge.

Do you have a sketch of what the ideal Unicode version management
experience might look like? Very high level, like "this is what happens
by default during an upgrade" and "this is how a user discovers that
they might want to update Unicode", etc.

What ways can/should we nudge users to update more quickly, if at all,
so that they are less likely to have problems with newly-assigned code
points?

And, if possible, how we might extend this user experience to libc or
ICU updates?

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Wed, Mar 19, 2025 at 5:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Do you have a sketch of what the ideal Unicode version management
> experience might look like? Very high level, like "this is what happens
> by default during an upgrade" and "this is how a user discovers that
> they might want to update Unicode", etc.
>
> What ways can/should we nudge users to update more quickly, if at all,
> so that they are less likely to have problems with newly-assigned code
> points?
>
> And, if possible, how we might extend this user experience to libc or
> ICU updates?

As I think you know, I don't consider myself an expert in this area,
just somebody who has seen a decent amount of user pain (although I am
sure that even there some other people have seen more). That said, for
me the ideal would probably include the following things:

* When the collation/ctype/whatever definitions upon which you are
relying change, you can either decide to switch to the new ones
without rebuilding your indexes and risk wrong results until you
reindex, or you can decide to create new indexes using the new
definitions and drop the old ones (see the sketch after this list).

* You're never forced to adopt new definitions during a SPECIFIC major
or minor release upgrade or when making some other big change to the
system. It's fine, IMHO, if we eventually remove support for old
stuff, but there should be a multi-year window of overlap. For
example, if PostgreSQL 42 adds support for Unicode 95.0.0, we'd keep
that support for, I don't know, at least the next four or five major
versions. So upgrading PG can eventually force you to upgrade
collation defs, but you don't get into a situation where PG 41
supports only Unicode < 95 and PG 42 supports only Unicode >= 95.

* In an absolutely perfect world, we'd have strong versioning of every
type of collation from every provider. This is probably very difficult
to achieve in practice, so maybe the somewhat more realistic goal
might be to get to a point where most users, most of the time, are
relying on collations with strong versioning. For glibc, this seems
relatively hopeless unless upstream changes their policy in a big way.
For ICU, loading multiple library versions seems like a possible path
forward. Relying more on built-in collations seems like another
possible approach, but I think that would require us to have more than
just a code-point sort: we'd need to have built-in collations for
users of various languages. That sounds like it would be a lot of work
to develop, but even worse, it sounds like it would be a tremendous
amount of work to maintain. I expect Tom will opine that this is an
absolutely terrible idea that we should never do under any
circumstances, and I understand the sentiment, but I think it might be
worth considering if we're confident we will have people to do the
maintenance over the long term.

* I would imagine pg_upgrade either keeping the behavior unchanged for
any strongly-versioned collation, or failing. I don't see a strong
need to try to notify users about the availability of new versions
otherwise. People who want to stay current will probably figure out
how to do that, and people who don't will ignore any warnings we give
them. I'm not completely opposed to some other form of notification,
but I think it's OK if "we finally removed support for your extremely
old ICU version" is the driving force that makes people upgrade.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> * When the collation/ctype/whatever definitions upon which you are
> relying change, you can either decide to switch to the new ones
> without rebuilding your indexes and risk wrong results until you
> reindex, or you can decide to create new indexes using the new
> definitions and drop the old ones.

Would newly-created objects pick up the new Unicode version, or stick
with the old one?


> Relying more on built-in collations seems like another
> possible approach, but I think that would require us to have more
> than
> just a code-point sort: we'd need to have built-in collations for
> users of various languages. That sounds like it would be a lot of
> work
> to develop, but even worse, it sounds like it would be a tremendous
> amount of work to maintain. I expect Tom will opine that this is an
> absolutely terrible idea that we should never do under any
> circumstances, and I understand the sentiment, but I think it might
> be
> worth considering if we're confident we will have people to do the
> maintenance over the long term.

Supporting a built-in case-insensitive collation would be some work,
but it's not a huge leap now that we have CASEFOLD().
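
For instance (names invented), CASEFOLD() already gets you most of the
way there with an expression index:

    CREATE UNIQUE INDEX users_name_ci ON users (CASEFOLD(name));
    SELECT * FROM users WHERE CASEFOLD(name) = CASEFOLD('Straße');

A real case-insensitive collation would be nicer to use, but the
folding machinery is the hard part.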

Supporting built-in natural language sort orders would be a much larger
scope. And I don't think we need that, but that's a larger discussion.

> I don't see a strong
> need to try to notify users about the availability of new versions
> otherwise. People who want to stay current will probably figure out
> how to do that

What if we were able to tell, for instance, that your database has none
of the codepoints affected by the most recent update? Then updating
would be less risky than not updating: if you don't update Unicode,
then the code points could end up in the database treated as
unassigned, and then cause a problem for future updates.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Fri, Mar 21, 2025 at 2:45 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> > * When the collation/ctype/whatever definitions upon which you are
> > relying change, you can either decide to switch to the new ones
> > without rebuilding your indexes and risk wrong results until you
> > reindex, or you can decide to create new indexes using the new
> > definitions and drop the old ones.
>
> Would newly-created objects pick up the new Unicode version, or stick
> with the old one?

Hmm, I hadn't thought about that. I'm assuming that the Unicode
version would need, in this scheme, to be coupled to the object that
depends on it. For example, an index that uses a Unicode collation
would need to store a Unicode version. But for a new index, how would
that be set? Maybe the Unicode version would be treated as part of the
collation. I'm guessing that an index defaults to the column
collation, and I think the column collation defaults to the database
default collation. We might need a way for ALTER DATABASE to allow the
database default to be adjusted. I'm not quite sure here, but my
general feeling is that Unicode version feels like part of the
collation and that we should avoid introducing a separate mechanism if
possible. What are your thoughts?

> Supporting built-in natural language sort orders would be a much larger
> scope. And I don't think we need that, but that's a larger discussion.

I'm curious why you think this. My own feeling (as I think you
probably know, but just to be clear) is that relatively few people
need extremely precise control over their collation behavior, but
there are some who do. However, I think there are many people for whom
a code-point sort won't be good enough. If you want to leave this
discussion for another time, that's fine.

> What if we were able to tell, for instance, that your database has none
> of the codepoints affected by the most recent update? Then updating
> would be less risky than not updating: if you don't update Unicode,
> then the code points could end up in the database treated as
> unassigned, and then cause a problem for future updates.

The problem with this is that it requires scanning the whole database.
That's not to say it's useless. Some people can afford to scan the
whole database, and some people might even WANT to scan the whole
database just to give themselves peace of mind. But there are also
plenty of people for whom this is a major downside, even unusable. I'd
like to have a solution that is based on metadata.

Maybe we should actually move in the direction of having encodings
that are essentially specific versions of Unicode. Instead of just
having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
whatever, which would only accept code points known to that version of
Unicode. Or maybe this shouldn't be entirely new encodings but
something vaguely akin to a typmod, so that you could have columns of
type text[limited_to_unicode_v16_0_0] or whatever. If we actually
exclude unassigned code points, then we know they aren't there, and we
can make deductions about what is safe to do based on that
information. I'm not quite sure how useful that is, but I tend to
think that enforcing rules when the data goes in has a decent shot at
being better than letting anything go in and then having to scan it
later to see how it all turned out.
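
Part of that is within reach already. For instance (domain name
invented), unicode_assigned() from v17 can enforce the "no unassigned
code points" rule at write time:

    CREATE DOMAIN assigned_text AS text
        CHECK (unicode_assigned(VALUE));
    CREATE TABLE docs (body assigned_text);

though that checks against whatever Unicode version the server ships
with, not a pinned one -- which is exactly the gap a versioned encoding
or typmod would close.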

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Peter Eisentraut
Date:
On 15.03.25 07:54, Jeremy Schneider wrote:
> in favor of leaving it alone because ICU is there for when I need
> up-to-date unicode versions.
> 
>  From my perspective, the whole point of the builtin collation was to be
> one option that avoids these problems that come with updating both ICU
> and glibc.
> 
> So I guess the main point of the builtin provider is just that it's faster
> than ICU?

A mistake that some people apparently continue to make in this 
discussion is that they assume that the only thing the Unicode tables 
drive is the builtin collation provider.  This is not true: the Unicode 
tables were there long before the builtin collation provider, and they 
have other purposes.  And we knew at the time the builtin collation 
provider was added that it would have certain problems with the Unicode 
table updates, and we accepted it with the understanding that this would 
not change our procedures.  Otherwise, we would likely not have accepted 
it in its current form.

Those who are now trying to make the builtin collation provider have 
properties that it does not have and was not intended to have when it 
was added need to arrange the work to make it have those properties if 
they want them, rather than insisting that progress in other areas must 
stop because they are content with the current state.




Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Fri, 2025-03-21 at 17:15 +0100, Peter Eisentraut wrote:
> And we knew at the time the builtin collation
> provider was added that it would have certain problems with the
> Unicode
> table updates, and we accepted it with the understanding that this
> would
> not change our procedures.

Correct. That was called out by me in the initial proposal for the
builtin collation provider and documented explicitly.

> Those who are now trying to make the builtin collation provider have
> properties that it does not have and was not intended to have when it
> was added need to arrange the work to make it have those properties
> if they want them, rather than insisting that progress in other areas
> must stop because they are content with the current state.

It does feel like the goalposts are moving. That's not necessarily bad
by itself -- our expectations should go up. But the way it's happening
in this thread makes it feel like new obligations are being put on the
people already working on collation improvements, in particular Peter
and me.

Robert indicated that there might be some willing hackers, and perhaps
even appetite for larger-scope projects in this area, which is great
news. A lot of what's happening in this area is non-controversial, and
more attention would be an unqualified win. For instance, Peter put
some work into better support for non-deterministic collations, and I
had some ideas there:

https://www.postgresql.org/message-id/024c9b9aa834f668496ef95700b57e50bf3f4808.camel%40j-davis.com

but I didn't have time to work on that this cycle. (Maybe my idea would
be hard to implement or not work at all, or maybe Peter and Tom already
have better ideas, but that's different from being controversial.)

For the many people who think multi-lib is the way to go, the shortest
path involves someone taking a look at this prerequisite:

https://www.postgresql.org/message-id/cb580fec46ea4ca96dd4bbde9d2769360e097d01.camel%40j-davis.com

Some technical review would be nice, but really what I needed was
someone to say "this small regression in a worst case due to an
unavoidable indirect function call is not worth worrying about". It
might be a bit late now, though, as a big refactoring right before FF
seems like a bad idea. So it will probably slip until July, adding risk
that any other multi-lib work (which I am not promising to do) might
slip to PG20, which users will see at the end of 2027. Ugh.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Robert Haas
Date:
On Fri, Mar 21, 2025 at 2:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Those who are now trying to make the builtin collation provider have
> > properties that it does not have and was not intended to have when it
> > was added need to arrange the work to make it have those properties
> > if they want them, rather than insisting that progress in other areas
> > must stop because they are content with the current state.
>
> It does feel like the goalposts are moving. That's not necessarily bad
> by itself -- our expectations should go up. But the way it's happening
> in this thread makes it feel like new obligations are being put on the
> people already working on collation improvements, in particular Peter
> and me.

Honestly, at this point I'm confused as to why Peter hasn't already
committed the Unicode update. Nobody has alleged that the
stability guarantees provided by the builtin collation provider are
(a) worse than any of the other providers or (b) worse than what was
documented. And nobody has refuted the argument that refusing to
update the Unicode tables will cause other problems (such as not
knowing what to do with new code points that are added in the other
places where those tables are used). People who aren't doing the work
to improve the infrastructure don't get to hold the longstanding
process hostage.

None of the above means that I wouldn't like things to be better in
this area; this is a huge source of pain and I would obviously like
all kinds of amazing things, preferably done by other people with
no real thought or work required on my part. But I'd like that in lots
of areas of PostgreSQL and lots of areas of my life in general, and I
understand that it's an unreasonable expectation.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Update Unicode data to Unicode 16.0.0

From
Jeff Davis
Date:
On Fri, 2025-03-21 at 10:45 -0400, Robert Haas wrote:
> We might need a way for ALTER DATABASE to allow the
> database default to be adjusted. I'm not quite sure here, but my
> general feeling is that Unicode version feels like part of the
> collation and that we should avoid introducing a separate mechanism
> if
> possible. What are your thoughts?

My (early stage) plans are to have two new shared catalogs,
pg_ctype_provider and pg_collation_provider. Objects would depend on
records in those shared catalogs, which would each have a version. We'd
eventually allow multiple records with providerkind=icu, for instance,
and have some way to choose which one to use (perhaps new objects get
the default version, old objects keep the old version, or something).

The reason to have two shared catalogs is because some objects depend
on collation behavior and some on ctype behavior. If there's an index
on "t COLLATE PG_C_UTF8" then there would be no direct dependency from
the index to the builtin provider in either catalog, because collation
behavior in the builtin provider is unversioned memcmp. But if there's
an index on "LOWER(t COLLATE PG_C_UTF8)", then it would have a
dependency entry to the builtin provider's entry in pg_ctype_provider.
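
In SQL terms, with invented table/column names, the distinction looks
like this:

    -- memcmp order only: no versioned dependency entry
    CREATE INDEX t_s_idx ON t (s COLLATE PG_C_UTF8);

    -- would depend on the builtin provider's row in pg_ctype_provider
    CREATE INDEX t_lower_idx ON t (LOWER(s COLLATE PG_C_UTF8));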

>
> I'm curious why you think this. My own feeling (as I think you
> probably know, but just to be clear) is that relatively few people
> need extremely precise control over their collation behavior, but
> there are some who do. However, I think there are many people for
> whom
> a code-point sort won't be good enough.

You can use ICU for sorting without using it for the index comparators.
Using ICU in the index comparators is an implementation detail that's
only required for unique indexes over non-deterministic collations. And
if it's not used for the index comparators, then most of the problems
go away, and versioning is not nearly so important.
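
A sketch of what I mean, with invented names: keep the index in memcmp
order, and ask for linguistic order only where it's actually wanted:

    CREATE INDEX t_s_idx ON t (s COLLATE "C");
    SELECT s FROM t ORDER BY s COLLATE "en-x-icu";

The ORDER BY is then satisfied by an explicit sort rather than an index
scan -- which, as I argue below, is usually the right plan anyway.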

Sure, there are some cases where using ICU in the index comparator is
important, and I'm not suggesting that we remove functionality. But I
believe that using libc or ICU for index comparators is the wrong
default behavior -- high downsides and low upsides for most text
indexes that have ever been created.

Even if there is an ORDER BY, using an index is often the wrong thing
unless it's an index-only scan. Text indexes are rarely correlated with
the heap, so it would lead to a lot of random heap fetches, and it's
often better to just execute the query and do a final sort. The
situations where ICU in the comparator is a good idea are special cases
of special cases.

I've posted about this in the past, and got universal disagreement. But
I believe others will eventually come to the same conclusion that I
did.

>
> Maybe we should actually move in the direction of having encodings
> that are essentially specific versions of Unicode. Instead of just
> having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> whatever, which would only accept code points known to that version
> of
> Unicode. Or maybe this shouldn't be entirely new encodings but
> something vaguely akin to a typmod, so that you could have columns of
> type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> exclude unassigned code points, then we know they aren't there, and
> we
> can make deductions about what is safe to do based on that
> information.

I like this line of thinking, vaguely similar to my STRICT_UNICODE
database option proposal. Maybe these aren't exactly the right things
to do, but I think there are some possibilities here, and we shouldn't
give up and assume there's a problem when usually there is not.

It reminds me of fast-path locking: sure, there *might* be DDL
happening while I'm trying to do a simple SELECT query. But probably
not, so let's make it the responsibility of DDL to warn others that
it's doing something, rather than the responsibility of the SELECT
query.

Regards,
    Jeff Davis




Re: Update Unicode data to Unicode 16.0.0

From
Laurenz Albe
Date:
On Fri, 2025-03-21 at 14:54 -0400, Robert Haas wrote:
> And nobody has refuted the argument that refusing to
> update the Unicode tables will cause other problems (such as not
> knowing what to do with new code points that are added in the other
> places where those tables are used).

The argument is irrefutable.
I can only utter my (unproven) conviction that the complaints about
some outlandish character being converted to upper case in the wrong
way won't be as loud as the complaints about a corrupted index.

But perhaps both occurrences would be rare enough that there is
little point in arguing at all.

> People who aren't doing the work
> to improve the infrastructure don't get to hold the longstanding
> process hostage.

Actually, I am arguing for *not* doing the work...

Yours,
Laurenz Albe



Re: Update Unicode data to Unicode 16.0.0

From
Jeremy Schneider
Date:
On Fri, 21 Mar 2025 13:45:24 -0700
Jeff Davis <pgsql@j-davis.com> wrote:

> > Maybe we should actually move in the direction of having encodings
> > that are essentially specific versions of Unicode. Instead of just
> > having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> > whatever, which would only accept code points known to that version
> > of
> > Unicode. Or maybe this shouldn't be entirely new encodings but
> > something vaguely akin to a typmod, so that you could have columns
> > of type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> > exclude unassigned code points, then we know they aren't there, and
> > we
> > can make deductions about what is safe to do based on that
> > information.  
> 
> I like this line of thinking, vaguely similar to my STRICT_UNICODE
> database option proposal. Maybe these aren't exactly the right things
> to do, but I think there are some possibilities here, and we shouldn't
> give up and assume there's a problem when usually there is not.

There is "the iPhone paradox" here; if we reject unassigned code
points, then websites are going to start throwing database errors for
anyone with the latest iPhone who uses a new emoji.

(Unless the database is updated very quickly, which is atypical.) Apple
tends to get new emojis into consumers' hands a year or less after the
new Unicode release.

-Jeremy



Re: Update Unicode data to Unicode 16.0.0

From
Peter Eisentraut
Date:
On 17.03.25 19:54, Jeff Davis wrote:
> On Thu, 2025-03-13 at 14:49 +0100, Peter Eisentraut wrote:
>> I think these test result changes are incorrect.  AFAICT, nothing has
>> changed in the upstream data that would explain such a change.
>>
>> I didn't get such test differences in my original patch.  Did you
>> rebase
>> the patch or regenerate it?  Maybe something went wrong there.
> 
> Right. The previous version was improperly rebased.
> 
> I went back to your original version and rebased over master (including
> the latest optimizations for case mapping), and then rebased my changes
> on top of that. Attached.
> 
> One annoyance with the recent case mapping optimizations is that the
> Unicode update diff is larger, similar to the diff for
> unicode_norm_hashfunc.h. Previously, it was binary search, so the only
> differences were logical differences.

I have committed the update to the Unicode tables.  I suggest you commit 
your pg_upgrade patch on top of that now.