Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Msg-id db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Noah Misch <noah@leadboat.com>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers
On Tue, 2024-07-02 at 16:03 -0700, Noah Misch wrote:
> Each packager can choose their dependencies so the v16 providers don't have
> the problem.  With the $SUBJECT provider, a packager won't have that option.

While nothing needs to be changed for 17, I agree that we may need to
be careful in future releases not to break things.

Broadly speaking, you are right that we may need to freeze Unicode
updates or be more precise about versioning. But there's a lot of
nuance to the problem, so I don't think we should pre-emptively promise
either of those things right now.

Consider:

* Unless I made a mistake, the last three releases of Unicode (14.0,
15.0, and 15.1) all have the exact same behavior for UPPER() and
LOWER() -- even for unassigned code points. It would be silly to
promise to stay with 15.1 and then realize that moving to 16.0 doesn't
create any actual problem.

* Unicode also offers "case folding", which has even stronger stability
guarantees, and I plan to propose that soon. When implemented, it would
be preferred over LOWER()/UPPER() in index expressions for most use
cases.

* While someone can pin libc+ICU to particular versions, doing so is
impossible with the official packages, and it additionally requires
something like [1], which only became available last year. I don't
think it's reasonable to put it forth as a matter-of-fact solution.

* Let's keep some perspective: we've lived for a long time with ALL
text indexes at serious risk of breakage. In contrast, the concerns you
are raising now are about certain kinds of expression indexes over data
containing certain unassigned code points. I am not dismissing that
concern, but the builtin provider moves us in the right direction, and
we should not lose sight of that.
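
As a rough illustration of the expression-index concern discussed in the points above, here is a minimal Python sketch. It is not PostgreSQL code: the dictionary merely stands in for an index on LOWER(value), and the "newer" mapping for U+0378 is invented purely for illustration (no Unicode version defines it). The last lines use Python's real str.casefold() to show how case folding differs from lowercasing; they make no claim about how a future PostgreSQL function would behave.

```python
# Hypothetical simulation: an "index" keyed on lower(value) goes stale when
# a later case-mapping table starts mapping a formerly unassigned code point.

# U+0378 is a reserved (unassigned) code point in the Greek block as of
# Unicode 15.1; it is used here only as a stand-in.
UNASSIGNED = "\u0378"


def make_lower(extra_mappings):
    """Return a lower()-like function that first applies extra per-character
    mappings (our invented 'newer Unicode' rules), then str.lower()."""
    def lower(s):
        return "".join(extra_mappings.get(ch, ch) for ch in s).lower()
    return lower


lower_old = make_lower({})                      # code point has no mapping
lower_new = make_lower({UNASSIGNED: "\u0379"})  # invented mapping, assumption

rows = ["Abc", "x" + UNASSIGNED]

# Build the stand-in expression index using the old tables.
index = {lower_old(value): value for value in rows}


def lookup(value, lower):
    """Probe the index using the given lowercasing function."""
    return index.get(lower(value))


# Data made only of assigned characters is still found after the "upgrade".
print(lookup("ABC", lower_new))

# The row containing the formerly unassigned code point is no longer found,
# because the stored key was computed under the old mapping.
print(lookup("x" + UNASSIGNED, lower_old))
print(lookup("x" + UNASSIGNED, lower_new))

# Case folding is a separate operation intended for caseless matching.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

In this sketch only the row containing the stand-in code point becomes unreachable; everything else continues to match, which is the narrow scope of breakage described above.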


Given that no code changes for v17 are proposed, I suggest that we
refrain from making any declarations until the next version of Unicode
is released. If the pattern holds, that will be around September, which
still leaves time to make reasonable decisions for v18.

Regards,
    Jeff Davis

[1] https://github.com/awslabs/compat-collation-for-glibc



