Re: [18] Policy on IMMUTABLE functions and Unicode updates - Mailing list pgsql-hackers
From | Laurenz Albe |
---|---|
Subject | Re: [18] Policy on IMMUTABLE functions and Unicode updates |
Date | |
Msg-id | 5fcd0b4ae0d90c4df3e05266f0859f9751defe46.camel@cybertec.at |
In response to | [18] Policy on IMMUTABLE functions and Unicode updates (Jeff Davis <pgsql@j-davis.com>) |
Responses | Re: [18] Policy on IMMUTABLE functions and Unicode updates |
List | pgsql-hackers |
On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote:
> The IMMUTABLE marker for functions is quite simple on the surface, but
> could be interpreted a few different ways, and there's some historical
> baggage that makes it complicated.
>
> There are a number of ways in which IMMUTABLE functions can change
> behavior:
>
> 1. Updating or moving to a different OS affects all collations that use
> the libc provider (other than "C" and "POSIX", which don't actually use
> libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
> affected.
>
> 2. Updating ICU affects the collations that use the ICU provider.
> ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
> are also affected.
>
> 3. Moving to a different database encoding may affect collations that
> use the "C" or "POSIX" locales in the libc provider (NB: those locales
> don't actually use libc).
>
> 4. A PG Unicode update may change the results of functions that depend
> on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
> UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
> locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
> (NB: collation itself is not affected -- always code point order).
>
> 5. If a well-defined IMMUTABLE function produces the wrong results, we
> may fix the bug in the next major release.
>
> 6. The GUC extra_float_digits can change the results of floating point
> text output.
>
> 7. A UDF may be improperly marked IMMUTABLE. A particularly common
> variant is a UDF without search_path specified, which is probably not
> truly IMMUTABLE.
>
> Noah seemed particularly concerned[1] about #4, so I'll start off by
> discussing that.
>
> Unicode updates do not affect collation itself, they
> affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
> If using the builtin "C.UTF-8" locale, they also affect LOWER(),
> INITCAP(), UPPER(), and pattern matching.
> (NB: the builtin collation provider hasn't yet gone through any Unicode
> update.)
>
> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the de facto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.
>
> [1]
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com
>
> [2]
> https://www.unicode.org/policies/stability_policy.html
>
> [3]
> https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com
>
> [4]
> https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us
>
> [5]
> https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at

Concerning #4, the new built-in locale: my hope (and, in my opinion, its
only value) is that it gets us away from problems #1 and #2, which are
not under our control. If changes in major PostgreSQL versions force
users of the built-in locale provider to rebuild indexes, that would
defeat its purpose.

I think that users care more about avoiding data corruption than about
exact Unicode-compliant behavior. Anybody who needs the latter can use
ICU.

People routinely create indexes that involve upper() or lower(), so I'd
say changing the behavior of those functions would be a problem.
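
To make the index concern concrete, here is a minimal sketch of the kind of expression index meant above. The table, column, and index names are hypothetical and chosen only for illustration; the point is that the index stores the output of lower() as it was when each row was indexed.

```sql
-- Hypothetical table and index names, for illustration only.
CREATE TABLE customer (
    id    bigint PRIMARY KEY,
    email text NOT NULL
);

-- An expression index stores the result of lower(email) for each row.
-- PostgreSQL accepts only IMMUTABLE functions in index expressions.
CREATE UNIQUE INDEX customer_email_lower_idx
    ON customer (lower(email));

-- A lookup like this one can use the index, because the planner
-- assumes lower() returns the same result it did at indexing time.
SELECT id
  FROM customer
 WHERE lower(email) = lower('Someone@Example.com');

-- If lower() returned a different result for some stored string after
-- a major upgrade, the existing entries would no longer match, and the
-- index would have to be rebuilt:
REINDEX INDEX customer_email_lower_idx;
```

Until such a rebuild, lookups and the uniqueness check on the affected strings could silently give wrong answers, which is the kind of corruption referred to above.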

Perhaps I should moderate my statement: if a change affects only a newly
introduced code point (which is unlikely to be used in a database), and
we think that the change is very important, we could consider applying
it. But that should be carefully considered; I am against blindly
following the changes in Unicode.

Yours,
Laurenz Albe