Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Built-in CTYPE provider
Msg-id 20240704212641.c4.nmisch@google.com
In response to Re: Built-in CTYPE provider  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers
On Wed, Jul 03, 2024 at 02:19:07PM -0700, Jeff Davis wrote:
> * Unless I made a mistake, the last three releases of Unicode (14.0,
> 15.0, and 15.1) all have the exact same behavior for UPPER() and
> LOWER() -- even for unassigned code points. It would be silly to
> promise to stay with 15.1 and then realize that moving to 16.0 doesn't
> create any actual problem.

I think you're saying that if some Unicode update changes the results of a
STABLE function but does not change the result of any IMMUTABLE function, we
may as well import that update.  Is that about right?  If so, I agree.

In addition to the options I listed earlier (error in pg_upgrade, or document
that IMMUTABLE stands), I would be okay with a third option: decide here that
we'll not adopt a Unicode update in a way that changes a v17 IMMUTABLE
function result of the new provider.  We don't need to write that in the
documentation, since it's implicit in IMMUTABLE.  Delete the "stable within a
<productname>Postgres</productname> major version" documentation text.

> * While someone can pin libc+ICU to particular versions, it's
> impossible when using the official packages, and additionally requires
> using something like [1], which just became available last year. I
> don't think it's reasonable to put it forth as a matter-of-fact
> solution.
> 
> * Let's keep some perspective: we've lived for a long time with ALL
> text indexes at serious risk of breakage. In contrast, the concerns you
> are raising now are about certain kinds of expression indexes over data
> containing certain unassigned code points. I am not dismissing that
> concern, but the builtin provider moves us in the right direction and
> let's not lose sight of that.

I see you're trying to help users get less breakage, and that's a good goal.
I agree $SUBJECT eliminates libc+ICU breakage, and libc+ICU breakage has hurt
plenty.  However, you proposed to update Unicode data and give REINDEX as the
solution to breakage this causes.  Unlike libc+ICU breakage, the packager has
no escape from that.  That's a different kind of breakage proposition, and no
new PostgreSQL feature should do that.  It's on a different axis from helping
users avoid libc+ICU breakage, and a feature doesn't get to credit helping on
one axis against a regression on the other axis.  What am I missing here?
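
To make the breakage concrete, here is a minimal sketch in Python rather than PostgreSQL code.  It assumes a sorted list standing in for an expression index on upper(col), and a hypothetical later Unicode release that gives a currently unassigned code point (U+0378) an uppercase mapping.  The names upper_old, upper_new, and lookup, and the mapping itself, are invented for illustration.

```python
from bisect import bisect_left

UNASSIGNED = "\u0378"          # unassigned as of Unicode 15.1
HYPOTHETICAL_UPPER = "\u0379"  # invented mapping target, illustration only

def upper_old(s: str) -> str:
    # Old tables: the unassigned code point has no case mapping.
    return s.upper()

def upper_new(s: str) -> str:
    # Hypothetical newer tables: the code point now maps to something else.
    return s.upper().replace(UNASSIGNED, HYPOTHETICAL_UPPER)

rows = ["abc", "x" + UNASSIGNED + "y", "zed"]

# The "index" is built once, using the old function's results as keys.
keys = sorted(upper_old(r) for r in rows)

def lookup(value: str, upper) -> bool:
    # Binary search the stored keys using whichever function is current.
    key = upper(value)
    i = bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

found_before = lookup(rows[1], upper_old)  # function matches stored keys
found_after = lookup(rows[1], upper_new)   # function changed, keys did not
```

The row is still in the table, but the search key computed by the new function no longer matches the stored key, so the lookup misses it until the index is rebuilt.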

> Given that no code changes for v17 are proposed, I suggest that we
> refrain from making any declarations until the next version of Unicode
> is released. If the pattern holds, that will be around September, which
> still leaves time to make reasonable decisions for v18.

Soon enough, a Unicode release will add one character to regexp [[:alpha:]].
PostgreSQL will then need to decide what IMMUTABLE is going to mean.  How does
that get easier in September?
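
As a rough illustration of that kind of change, the sketch below uses Python's unicodedata module as a stand-in for the provider's tables.  It compares the Unicode 3.2 snapshot that Python ships (unicodedata.ucd_3_2_0) against the current data for U+0221, which was added in Unicode 4.0.  Treating "general category starts with L" as alphabetic is a simplifying assumption, not the exact definition behind [[:alpha:]].

```python
import unicodedata

cp = "\u0221"  # LATIN SMALL LETTER D WITH CURL, added in Unicode 4.0

def looks_alpha(category: str) -> bool:
    # Simplification: treat any Letter category (Lu, Ll, Lt, Lm, Lo) as alpha.
    return category.startswith("L")

# General category under the frozen Unicode 3.2 data: unassigned (Cn).
old_category = unicodedata.ucd_3_2_0.category(cp)

# General category under the Unicode version bundled with this Python.
new_category = unicodedata.category(cp)

print(cp.encode("unicode_escape").decode(), old_category, "->", new_category)
print("alpha before:", looks_alpha(old_category))
print("alpha after: ", looks_alpha(new_category))
```

The same code point goes from unassigned to a letter between the two data sets, so a classification function over it returns a different answer for the same input.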

Thanks,
nm

> [1] https://github.com/awslabs/compat-collation-for-glibc


