Re: [18] Policy on IMMUTABLE functions and Unicode updates - Mailing list pgsql-hackers

From Laurenz Albe
Subject Re: [18] Policy on IMMUTABLE functions and Unicode updates
Msg-id 5fcd0b4ae0d90c4df3e05266f0859f9751defe46.camel@cybertec.at
In response to [18] Policy on IMMUTABLE functions and Unicode updates  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote:
> The IMMUTABLE marker for functions is quite simple on the surface, but
> could be interpreted a few different ways, and there's some historical
> baggage that makes it complicated.
>
> There are a number of ways in which IMMUTABLE functions can change
> behavior:
>
> 1. Updating or moving to a different OS affects all collations that use
> the libc provider (other than "C" and "POSIX", which don't actually use
> libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
> affected.
>
> 2. Updating ICU affects the collations that use the ICU provider.
> ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
> are also affected.
>
> 3. Moving to a different database encoding may affect collations that
> use the "C" or "POSIX" locales in the libc provider (NB: those locales
> don't actually use libc).
>
> 4. A PG Unicode update may change the results of functions that depend
> on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
> UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
> locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
> (NB: collation itself is not affected -- always code point order).
>
> 5. If a well-defined IMMUTABLE function produces the wrong results, we
> may fix the bug in the next major release.
>
> 6. The GUC extra_float_digits can change the results of floating point
> text output.
>
> 7. A UDF may be improperly marked IMMUTABLE. A particularly common
> variant is a UDF without search_path specified, which is probably not
> truly IMMUTABLE.
>
> Noah seemed particularly concerned[1] about #4, so I'll start off by
> discussing that.
>
> Unicode updates do not affect collation itself; they
> affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
> If using the builtin "C.UTF-8" locale, they also affect LOWER(),
> INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
> provider hasn't yet gone through any Unicode update.)
>
> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the de facto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.
>
> [1] 
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com
>
> [2]
> https://www.unicode.org/policies/stability_policy.html
>
> [3] 
> https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com
>
> [4]
> https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us
>
> [5]
> https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at

Concerning #4, the new built-in locale: my hope (and, in my opinion, its only
value) is that it frees us from problems #1 and #2, which are not under our
control.

If changes in major PostgreSQL versions force users of the built-in
locale provider to rebuild indexes, that would defeat its purpose.  I think
that users care more about avoiding data corruption than about exactly
Unicode-compliant behavior.  Anybody who needs the latter can use ICU.

People routinely create indexes that involve upper() or lower(), so I'd
say changing their behavior would be a problem.
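
To illustrate why that is a problem, here is a minimal sketch in Python rather
than SQL. It is a toy model, not PostgreSQL code: a sorted list stands in for a
btree expression index, and lower_v1()/lower_v2() are hypothetical case
mappings "before" and "after" an update that starts folding U+00C4 to U+00E4.
All names are made up for illustration.

```python
import bisect

def lower_v1(s: str) -> str:
    """Hypothetical 'old' mapping: lowercases ASCII letters only."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def lower_v2(s: str) -> str:
    """Hypothetical 'new' mapping: additionally folds U+00C4 to U+00E4."""
    return lower_v1(s).replace("\u00c4", "\u00e4")

def build_index(rows, key_func):
    """Sorted (key, row) pairs, standing in for an index on key_func(col)."""
    return sorted((key_func(r), r) for r in rows)

def index_lookup(index, probe, key_func):
    """Binary-search the index for rows whose stored key equals key_func(probe)."""
    key = key_func(probe)
    pos = bisect.bisect_left(index, (key,))
    hits = []
    while pos < len(index) and index[pos][0] == key:
        hits.append(index[pos][1])
        pos += 1
    return hits

rows = ["\u00c4PFEL", "Birne"]
index = build_index(rows, lower_v1)  # index built before the "upgrade"

# Same index, same probe; only the function's behavior differs.
found_before = index_lookup(index, "\u00c4PFEL", lower_v1)
found_after = index_lookup(index, "\u00c4PFEL", lower_v2)

# Rebuilding the index with the new mapping makes the row findable again.
rebuilt = index_lookup(build_index(rows, lower_v2), "\u00c4PFEL", lower_v2)
```

In this model the row is still in the table, but the lookup through the stale
index no longer finds it until the index is rebuilt.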

Perhaps I should moderate my statement: if a change affects only a newly
introduced code point (which is unlikely to be used in a database), and we
think that the change is very important, we could consider applying it.
But that should be carefully considered; I am against blindly following the
changes in Unicode.
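
As a rough sketch of what "carefully considered" could involve, one could check
whether stored data contains any of the code points whose behavior would
change. The Python below is only an illustration under assumptions: the helper
names are hypothetical, and "unassigned" is judged by the Unicode tables of the
running Python (general category Cn), not by PostgreSQL's.

```python
import unicodedata

def unassigned_code_points(values):
    """Characters in `values` that this Python's Unicode tables report as
    general category Cn (unassigned or noncharacter)."""
    return {ch for v in values for ch in v if unicodedata.category(ch) == "Cn"}

def change_is_low_risk(stored_values, changed_code_points):
    """True if none of the code points whose mapping would change
    occur anywhere in the stored values."""
    used = {ch for v in stored_values for ch in v}
    return used.isdisjoint(changed_code_points)

# U+FFFF is a noncharacter, so it is reported as Cn by any Unicode version;
# it serves here as a stable stand-in for "not assigned at the time of storage".
sample = ["abc", "x\uffff"]
suspicious = unassigned_code_points(sample)
safe = change_is_low_risk(["abc"], {"\uffff"})
unsafe = change_is_low_risk(sample, {"\uffff"})
```

A check like this only covers the data scanned, so it supports a decision
rather than guaranteeing that a change is harmless.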

Yours,
Laurenz Albe


