Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Built-in CTYPE provider
Date
Msg-id 1ecfeb4d2b1b7f119fa917d3052a3aecfaf4f425.camel@j-davis.com
In response to Re: Built-in CTYPE provider  (Noah Misch <noah@leadboat.com>)
Responses Re: Built-in CTYPE provider
List pgsql-hackers
On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are
> PROVOLATILE_IMMUTABLE.  Until now, we've delegated that
> responsibility to the user.  The user is supposed to somehow never
> update libc or ICU in a way that changes outcomes from these
> functions.

To me, "delegated" connotes a clear and organized transfer of
responsibility to the right person to solve it. In that sense, I
disagree that we've delegated it.

What's happened here is an evolution of various choices that seemed
reasonable at the time. Unfortunately, the consequences are hard for
us to manage and even harder for users to manage themselves.

> Now that postgresql.org is taking that responsibility for builtin
> C.UTF-8, how should we govern it?  I think the above text and [1]
> convey that we'll update the Unicode data between major versions,
> making functions like lower() effectively STABLE.  Is that right?

Marking them STABLE is not a viable option; that would break a lot of
valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy
that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never
change. In particular, the results can change if using unassigned code
points that are later assigned to Cased characters.

That's not terribly common though; for instance, there are zero changes
in uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1
(current) -- even for code points that were unassigned in 14.0 and
later assigned. I checked this by modifying case_test.c to look at
unassigned code points as well.
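
As a rough illustration only (this is not the case_test.c change, and
the function name is made up), the sketch below uses Python's bundled
unicodedata tables to confirm the baseline behind that concern: within
a single Unicode version, unassigned code points map to themselves, so
any later assignment to a Cased character is what would change results.
It inspects only the Unicode version the interpreter ships with and
does not compare two versions.

```python
import sys
import unicodedata


def unassigned_with_case_mapping():
    """Return unassigned code points whose upper/lower mapping is not identity."""
    changed = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # General category "Cn" means the code point is unassigned in the
        # Unicode version bundled with this interpreter.
        if unicodedata.category(ch) != "Cn":
            continue
        if ch.lower() != ch or ch.upper() != ch:
            changed.append(cp)
    return changed
```

Calling unassigned_with_case_mapping() should return an empty list;
unicodedata.unidata_version reports which Unicode version was checked.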

There's a greater chance that character properties can change (e.g.
whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some
cases the results of initcap (because it uses the "alphanumeric"
classification to determine word boundaries).
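
To make the dependency concrete, here is a minimal sketch of an
initcap-style function, assuming a word is a maximal run of
alphanumeric characters. It is not PostgreSQL's implementation; the
name initcap_sketch and the pluggable is_alnum classifier are
hypothetical, chosen only to show how a classification change moves
word boundaries.

```python
def initcap_sketch(text, is_alnum=str.isalnum):
    """Capitalize each word, where a word is a maximal run of alphanumerics."""
    out = []
    in_word = False
    for ch in text:
        if is_alnum(ch):
            # The first character of a word is uppercased, the rest lowercased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # A non-alphanumeric character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


# Default classification: "_" is a boundary.
print(initcap_sketch("foo_bar"))  # Foo_Bar

# Hypothetical reclassification that treats "_" as alphanumeric.
print(initcap_sketch("foo_bar", lambda c: c.isalnum() or c == "_"))  # Foo_bar
```

The same input yields different output purely because one character's
classification changed, with no change to the case mappings.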

I don't think we need code changes for 17. Some documentation changes
might be helpful, though. Should we have a note around LOWER()/UPPER()
that users should REINDEX any dependent indexes when the provider is
updated?
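
The reason a REINDEX would be needed can be sketched with a toy model
of an expression index, shown below. Everything here is an assumption
made for illustration: the dictionary stands in for an index on
LOWER(col), and the private-use character U+E000 stands in for a code
point that had no case mapping when the index was built and gains one
later. No real Unicode change is implied.

```python
def build_index(rows, fold):
    """Map fold(value) -> list of row ids, like an index on LOWER(col)."""
    index = {}
    for row_id, value in rows.items():
        index.setdefault(fold(value), []).append(row_id)
    return index


def lookup(index, key, fold):
    """Probe the index with the folded search key."""
    return index.get(fold(key), [])


def old_fold(s):
    # Case mapping in effect when the index was built.
    return s.lower()


def new_fold(s):
    # Assumed change: the newer tables lowercase U+E000 to U+E001.
    return s.lower().replace("\uE000", "\uE001")


rows = {1: "ABC", 2: "X\uE000"}
index = build_index(rows, old_fold)

print(lookup(index, "ABC", new_fold))       # [1]  unaffected entry
print(lookup(index, "X\uE000", new_fold))   # []   stale entry is missed

# Rebuilding with the new mapping (the REINDEX step) fixes the lookup.
print(lookup(build_index(rows, new_fold), "X\uE000", new_fold))  # [2]
```

Only entries containing the affected code point go stale, which is why
such breakage can go unnoticed until a specific lookup fails.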

> (This thread had some discussion[2] that datcollversion/collversion
> won't necessarily change when a major version changes lower()
> behavior.)

datcollversion/collversion track the version of the collation
specifically (text ordering only), not the ctype (character semantics).
When using the libc provider, get_collation_actual_version() completely
ignores the ctype.

It would be interesting to consider tracking the versions separately,
though.

Regards,
    Jeff Davis



