Thread: [18] Policy on IMMUTABLE functions and Unicode updates
The IMMUTABLE marker for functions is quite simple on the surface, but could be interpreted a few different ways, and there's some historical baggage that makes it complicated. There are a number of ways in which IMMUTABLE functions can change behavior: 1. Updating or moving to a different OS affects all collations that use the libc provider (other than "C" and "POSIX", which don't actually use libc). LOWER(), INITCAP(), UPPER() and pattern matching are also affected. 2. Updating ICU affects the collations that use the ICU provider. ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching are also affected. 3. Moving to a different database encoding may affect collations that use the "C" or "POSIX" locales in the libc provider (NB: those locales don't actually use libc). 4. A PG Unicode update may change the results of functions that depend on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8" locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching (NB: collation itself is not affected -- always code point order). 5. If a well-defined IMMUTABLE function produces the wrong results, we may fix the bug in the next major release. 6. The GUC extra_float_digits can change the results of floating point text output. 7. A UDF may be improperly marked IMMUTABLE. A particularly common variant is a UDF without search_path specified, which is probably not truly IMMUTABLE. (more I'm sure, please add to list...) #1 and #2 have been discussed much more than the rest, but I think it's worthwhile to enumerate the other problems even if the impact is a lot lower. Noah seemed particularly concerned[1] about #4, so I'll start off by discussing that. Here's a brief history (slightly confusing because the PG and Unicode versions are similar numbers): PG13: Unicode 13.0 and NORMALIZE() is first exposed as a SQL function PG15: Unicode updated to 14.0 PG16: Unicode updated to 15.0 PG17: Unicode updated to 15.1, UNICODE_ASSIGNED(), UNICODE_VERSION() and builtin "C.UTF-8" locale are introduced To repeat, these Unicode updates do not affect collation itself, they affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED(). If using the builtin "C.UTF-8" locale, they also affect LOWER(), INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation provider hasn't yet gone through any Unicode update.) There are two alternative philosophies: A. By choosing to use a Unicode-based function, the user has opted in to the Unicode stability guarantees[2], and it's fine to update Unicode occasionally in new major versions as long as we are transparent with the user. B. IMMUTABLE implies some very strict definition of stability, and we should never again update Unicode because it changes the results of IMMUTABLE functions. We've been following (A), and that's the defacto policy today[3][4]. Noah and Laurenz argued[5] that the policy starting in version 18 should be (B). Given that it's a policy decision that affects more than just the builtin collation provider, I'd like to discuss it more broadly outside of that subthread. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com [2] https://www.unicode.org/policies/stability_policy.html [3] https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com [4] https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us [5] https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at
On 7/16/24 13:42, Jeff Davis wrote: > The IMMUTABLE marker for functions is quite simple on the surface, but > could be interpreted a few different ways, and there's some historical > baggage that makes it complicated. > > There are a number of ways in which IMMUTABLE functions can change > behavior: > > 1. Updating or moving to a different OS affects all collations that use > the libc provider (other than "C" and "POSIX", which don't actually use > libc). LOWER(), INITCAP(), UPPER() and pattern matching are also > affected. > > 2. Updating ICU affects the collations that use the ICU provider. > ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching > are also affected. > > 3. Moving to a different database encoding may affect collations that > use the "C" or "POSIX" locales in the libc provider (NB: those locales > don't actually use libc). > > 4. A PG Unicode update may change the results of functions that depend > on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and > UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8" > locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching > (NB: collation itself is not affected -- always code point order). > > 5. If a well-defined IMMUTABLE function produces the wrong results, we > may fix the bug in the next major release. > > 6. The GUC extra_float_digits can change the results of floating point > text output. > > 7. A UDF may be improperly marked IMMUTABLE. A particularly common > variant is a UDF without search_path specified, which is probably not > truly IMMUTABLE. > > (more I'm sure, please add to list...) > > > #1 and #2 have been discussed much more than the rest, but I think it's > worthwhile to enumerate the other problems even if the impact is a lot > lower. > > > Noah seemed particularly concerned[1] about #4, so I'll start off by > discussing that. Here's a brief history (slightly confusing because the > PG and Unicode versions are similar numbers): > > PG13: Unicode 13.0 and NORMALIZE() is first exposed as a SQL function > PG15: Unicode updated to 14.0 > PG16: Unicode updated to 15.0 > PG17: Unicode updated to 15.1, UNICODE_ASSIGNED(), UNICODE_VERSION() > and builtin "C.UTF-8" locale are introduced > > To repeat, these Unicode updates do not affect collation itself, they > affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED(). > If using the builtin "C.UTF-8" locale, they also affect LOWER(), > INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation > provider hasn't yet gone through any Unicode update.) > > There are two alternative philosophies: > > A. By choosing to use a Unicode-based function, the user has opted in > to the Unicode stability guarantees[2], and it's fine to update Unicode > occasionally in new major versions as long as we are transparent with > the user. > > B. IMMUTABLE implies some very strict definition of stability, and we > should never again update Unicode because it changes the results of > IMMUTABLE functions. > > We've been following (A), and that's the defacto policy today[3][4]. > Noah and Laurenz argued[5] that the policy starting in version 18 > should be (B). Given that it's a policy decision that affects more than > just the builtin collation provider, I'd like to discuss it more > broadly outside of that subthread. On the general topic, we have these definitions in the fine manual: 8<----------------- A VOLATILE function can do anything, ... A query using a volatile function will re-evaluate the function at every row where its value is needed. A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement... An IMMUTABLE function cannot modify the database and is guaranteed to return the same results given the same arguments forever. 8<----------------- As Jeff points out, the IMMUTABLE definition has never really been true. Even the STABLE is not quite right, as there are at least some STABLE functions that will return the same value for multiple statements if they are within a transaction block (e.g. "now()" -- TBH I don't remember offhand if that is true for all stable functions). In any case, there is quite a gap between "forever" and "single statement". Perhaps we need to have more volatility categories, with guarantees that lie somewhere between the two, and allow those to be used like we do IMMUTABLE except with appropriate warning labels. E.g. something ("STABLE_VERSION"?) to mean "forever within a major version lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't upgrade your OS". -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com> wrote:
> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the defacto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.
On the general topic, we have these definitions in the fine manual:
8<-----------------
A VOLATILE function can do anything, ... A query using a volatile
function will re-evaluate the function at every row where its value is
needed.
A STABLE function cannot modify the database and is guaranteed to return
the same results given the same arguments for all rows within a single
statement...
An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever.
8<-----------------
As Jeff points out, the IMMUTABLE definition has never really been true.
Even the STABLE is not quite right, as there are at least some STABLE
functions that will return the same value for multiple statements if
they are within a transaction block (e.g. "now()" -- TBH I don't
remember offhand if that is true for all stable functions).
Under-specification here doesn't make the meaning of stable incorrect. We don't have anything that guarantees stability at the transaction scope because I don't think it can be guaranteed there without considering whether said transaction is read-committed, repeatable read, or serializable. The function itself can promise more but the marker seems correctly scoped for how the system uses it in statement optimization.
and allow those to be
used like we do IMMUTABLE except with appropriate warning labels. E.g.
something ("STABLE_VERSION"?) to mean "forever within a major version
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't
upgrade your OS".
I'd be content cutting "forever" down to "within a given server configuration". Then just note that immutable functions can depend implicitly on external server characteristics and so when moving data between servers re-evaluation of immutable functions may be necessary. Not so bad for indexes. A bit more problematic for generated values.
I'm not against adding metadata options here but for internal functions comments and documentation can work. For user-defined functions I have my doubts on how trustworthy they would end up being.
For the original question, I suggest continuing behaving per "A" and work on making it more clear to users what that means in terms of server upgrades.
If we do add metadata to reflect our reality I'd settle on a generic "STATIC" marker that can be used on those functions the rely on real world state, whether we are directly calling into the system (e.g., hashing) or have chosen to provide the state access management ourselves (e.g., unicode).
When we do take control we should have a goal of allowing for a given external dependency version to exist in many PostgreSQL versions and give the DBA the choice of when to move individual databases from one version to the next. Possibly dropping the dependency version support alongside the dropping of support of the major version it first appeared in. Not keeping up with external dependency versions just punishes new users by forbidding them a tool permanently, as well as puts us out-of-step with those dependency development groups, to save existing users some short-term pain. Being able to deal with that pain at a time different than the middle of a major version upgrade, one database at a time, gives those existing users reasonable options.
David J.
On 7/16/24 15:33, David G. Johnston wrote: > On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > > > There are two alternative philosophies: > > > > A. By choosing to use a Unicode-based function, the user has opted in > > to the Unicode stability guarantees[2], and it's fine to update > Unicode > > occasionally in new major versions as long as we are transparent with > > the user. > > > > B. IMMUTABLE implies some very strict definition of stability, and we > > should never again update Unicode because it changes the results of > > IMMUTABLE functions. > > > > We've been following (A), and that's the defacto policy today[3][4]. > > Noah and Laurenz argued[5] that the policy starting in version 18 > > should be (B). Given that it's a policy decision that affects > more than > > just the builtin collation provider, I'd like to discuss it more > > broadly outside of that subthread. > > On the general topic, we have these definitions in the fine manual: > > 8<----------------- > A VOLATILE function can do anything, ... A query using a volatile > function will re-evaluate the function at every row where its value is > needed. > > A STABLE function cannot modify the database and is guaranteed to > return > the same results given the same arguments for all rows within a single > statement... > > An IMMUTABLE function cannot modify the database and is guaranteed to > return the same results given the same arguments forever. > 8<----------------- > > As Jeff points out, the IMMUTABLE definition has never really been > true. > > Even the STABLE is not quite right, as there are at least some STABLE > functions that will return the same value for multiple statements if > they are within a transaction block (e.g. "now()" -- TBH I don't > remember offhand if that is true for all stable functions). > > > Under-specification here doesn't make the meaning of stable incorrect. > We don't have anything that guarantees stability at the transaction > scope because I don't think it can be guaranteed there without > considering whether said transaction is read-committed, repeatable read, > or serializable. The function itself can promise more but the marker > seems correctly scoped for how the system uses it in statement optimization. The way it is described is still surprising and can bite you if you are not familiar with the nuances. In particular I have seen now() used in transaction blocks surprise more than one person over the years. > and allow those to be > used like we do IMMUTABLE except with appropriate warning labels. E.g. > something ("STABLE_VERSION"?) to mean "forever within a major version > lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you > don't > upgrade your OS". > > I'd be content cutting "forever" down to "within a given server > configuration". Then just note that immutable functions can depend > implicitly on external server characteristics and so when moving data > between servers re-evaluation of immutable functions may be necessary. > Not so bad for indexes. A bit more problematic for generated values. Yeah I forgot about the configuration controlled ones. > I'm not against adding metadata options here but for internal functions > comments and documentation can work. For user-defined functions I have > my doubts on how trustworthy they would end up being. People lie all the time for user-defined functions, usually specifically when they need IMMUTABLE semantics and are willing to live with the risk and/or apply their own controls to ensure no changes in output. > For the original question, I suggest continuing behaving per "A" and > work on making it more clear to users what that means in terms of server > upgrades. > > If we do add metadata to reflect our reality I'd settle on a generic > "STATIC" marker that can be used on those functions the rely on real > world state, whether we are directly calling into the system (e.g., > hashing) or have chosen to provide the state access management ourselves > (e.g., unicode). So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the third position before IMMUTABLE), give it IMMUTABLE semantics, mark builtin functions that deserve it, and document with suitable caution statements? I guess can live with just one additional level of granularity. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes: > So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the > third position before IMMUTABLE), give it IMMUTABLE semantics, mark > builtin functions that deserve it, and document with suitable caution > statements? What is the point of that, exactly? I'll agree that the user documentation could use some improvement in how it describes the volatility levels, but I do not see how it will reduce anybody's confusion to invent multiple aliases for what's effectively the same volatility level. Nor do I see a use-case for actually having multiple versions of "immutable". Once you've decided you can put something into an index, quibbling over just how immutable it is doesn't really change anything. To put this another way: the existing volatility levels were basically reverse-engineered from the ways that the planner could meaningfully treat a function: it's dangerous, it is safe enough to use in an index condition (which changes the number of times the query will evaluate it), or it's safe to constant-fold in advance of execution. Unless there's a fourth planner behavior that's worth having, we don't need a fourth level. Possibly you could argue that "safe to put in an index" is a different level from "safe to constant-fold", but I don't really agree with that. regards, tom lane
On 7/16/24 16:16, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the >> third position before IMMUTABLE), give it IMMUTABLE semantics, mark >> builtin functions that deserve it, and document with suitable caution >> statements? > > What is the point of that, exactly? > > I'll agree that the user documentation could use some improvement > in how it describes the volatility levels, but I do not see how > it will reduce anybody's confusion to invent multiple aliases for > what's effectively the same volatility level. Nor do I see a > use-case for actually having multiple versions of "immutable". > Once you've decided you can put something into an index, quibbling > over just how immutable it is doesn't really change anything. > > To put this another way: the existing volatility levels were > basically reverse-engineered from the ways that the planner could > meaningfully treat a function: it's dangerous, it is safe enough > to use in an index condition (which changes the number of times > the query will evaluate it), or it's safe to constant-fold in > advance of execution. Unless there's a fourth planner behavior that's > worth having, we don't need a fourth level. Possibly you could > argue that "safe to put in an index" is a different level from > "safe to constant-fold", but I don't really agree with that. Fair enough, but then I think we should change the documentation to not say "forever". -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes: > Fair enough, but then I think we should change the documentation to not > say "forever". No objection to that, it's clearly a misleading definition. regards, tom lane
On Tue, Jul 16, 2024 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Joe Conway <mail@joeconway.com> writes:
> So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
> third position before IMMUTABLE), give it IMMUTABLE semantics, mark
> builtin functions that deserve it, and document with suitable caution
> statements?
What is the point of that, exactly?
I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level. Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.
I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use dependencies and report on any of these it finds and remind the DBA that this latent issue may exist in their system.
I agree the core behaviors of the system would remain unchanged and both modes would be handled identically. Though requiring superuser or a predefined role membership to actually use a "static" mode function in an index or generated expression would be an interesting option to consider.
David J.
On Tue, Jul 16, 2024 at 3:28 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
I'd teach pg_upgrade to inspect the post-upgraded catalog of in-use dependencies and report on any of these it finds and remind the DBA that this latent issue may exist in their system.
Would this help? Collation-related dependency changes are a different thing from major version DB upgrades
Tom’s point about how the levels are directly tied to concrete differences in behavior (planner/executor) makes a lot of sense to me
-Jeremy
On Tue, 2024-07-16 at 13:27 -0700, David G. Johnston wrote: > I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use > dependencies and report on any of these it finds and remind the DBA > that this latent issue may exist in their system. That's impossible to do in a complete way, and hard to do with much accuracy. I don't oppose it though -- if someone finds a way to provide enough information to be useful, then that's fine with me. Regards, Jeff Davis
On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote: > The IMMUTABLE marker for functions is quite simple on the surface, but > could be interpreted a few different ways, and there's some historical > baggage that makes it complicated. > > There are a number of ways in which IMMUTABLE functions can change > behavior: > > 1. Updating or moving to a different OS affects all collations that use > the libc provider (other than "C" and "POSIX", which don't actually use > libc). LOWER(), INITCAP(), UPPER() and pattern matching are also > affected. > > 2. Updating ICU affects the collations that use the ICU provider. > ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching > are also affected. > > 3. Moving to a different database encoding may affect collations that > use the "C" or "POSIX" locales in the libc provider (NB: those locales > don't actually use libc). > > 4. A PG Unicode update may change the results of functions that depend > on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and > UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8" > locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching > (NB: collation itself is not affected -- always code point order). > > 5. If a well-defined IMMUTABLE function produces the wrong results, we > may fix the bug in the next major release. > > 6. The GUC extra_float_digits can change the results of floating point > text output. > > 7. A UDF may be improperly marked IMMUTABLE. A particularly common > variant is a UDF without search_path specified, which is probably not > truly IMMUTABLE. > > Noah seemed particularly concerned[1] about #4, so I'll start off by > discussing that. > > Unicode updates do not affect collation itself, they > affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED(). > If using the builtin "C.UTF-8" locale, they also affect LOWER(), > INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation > provider hasn't yet gone through any Unicode update.) > > There are two alternative philosophies: > > A. By choosing to use a Unicode-based function, the user has opted in > to the Unicode stability guarantees[2], and it's fine to update Unicode > occasionally in new major versions as long as we are transparent with > the user. > > B. IMMUTABLE implies some very strict definition of stability, and we > should never again update Unicode because it changes the results of > IMMUTABLE functions. > > We've been following (A), and that's the defacto policy today[3][4]. > Noah and Laurenz argued[5] that the policy starting in version 18 > should be (B). Given that it's a policy decision that affects more than > just the builtin collation provider, I'd like to discuss it more > broadly outside of that subthread. > > [1] > https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com > > [2] > https://www.unicode.org/policies/stability_policy.html > > [3] > https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com > > [4] > https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us > > [5] > https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at Concerning #4, the new built-in locale, my hope (and, in my opinion, its only value) is to get out of the problems #1 and #2 that are not under our control. If changes in major PostgreSQL versions force users of the built-in locale provider to rebuild indexes, that would invalidate it. I think that users care more about data corruption than about exact Unicode-compliant behavior. Anybody who does can use ICU. People routinely create indexes that involve upper() or lower(), so I'd say changing their behavior would be a problem. Perhaps I should moderate my statement: if a change affects only a newly introduced code point (which is unlikely to be used in a database), and we think that the change is very important, we could consider applying it. But that should be carefully considered; I am against blindly following the changes in Unicode. Yours, Laurenz Albe
On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote: > Perhaps I should moderate my statement: if a change affects only a > newly > introduced code point (which is unlikely to be used in a database), > and we > think that the change is very important, we could consider applying > it. > But that should be carefully considered; I am against blindly > following the > changes in Unicode. That sounds reasonable. I propose that, going forward, we take more care with Unicode updates: assess the impact, provide time for comments, and consider possible mitigations. In other words, it would be reviewed like any other change. Ideally, some new developments would make it less worrisome, and Unicode updates could become more routine. I have some ideas, which I can propose in separate threads. But for now, I don't see a reason to rush Unicode updates. Regards, Jeff Davis
On 19.07.24 21:41, Jeff Davis wrote: > On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote: >> Perhaps I should moderate my statement: if a change affects only a >> newly >> introduced code point (which is unlikely to be used in a database), >> and we >> think that the change is very important, we could consider applying >> it. >> But that should be carefully considered; I am against blindly >> following the >> changes in Unicode. > > That sounds reasonable. > > I propose that, going forward, we take more care with Unicode updates: > assess the impact, provide time for comments, and consider possible > mitigations. In other words, it would be reviewed like any other > change. I disagree with that. We should put ourselves into the position to adopt new Unicode versions without fear. Similar to updates to time zones, snowball, etc. We can't be discussing the merits of the Unicode update every year. That would be madness. How would we weigh each change against the others? Some new character is introduced because it's the new currency of some country; seems important. Some mobile phone platforms jumped the gun and already use the character for the same purpose before it was assigned; now the character is in databases but some function results will change with the upgrade. How do we proceed? Moreover, if we were to decide to not take a particular Unicode update, that would then stop that process forever, because whatever the issue was wouldn't go away with the next Unicode version. Unless I missed something here, all the problem examples involve unassigned code points that were later assigned. (Assigned code points already have compatibility mechanisms, such as collation versions.) So I would focus on that issue. We already have a mechanism to disallow unassigned code points. So there is a tradeoff that users can make: Disallow unassigned code points and avoid upgrade issues resulting from them. Maybe that just needs to be documented more prominently.
On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote: > I disagree with that. We should put ourselves into the position to > adopt new Unicode versions without fear. Similar to updates to time > zones, snowball, etc. > > We can't be discussing the merits of the Unicode update every year. > That would be madness. Yeah, I agree with that 100%. I can't imagine that we want to, in effect, develop our own version of Unicode that is not quite the same as upstream. We've got to figure out a way to fix this problem from the other end - coping with updates when they happen. I feel like we've already discussed the obvious approach at some length: have a way to mark indexes invalid when "immutable" things change. That doesn't fix everything because you could, for example, manufacture constraint violations, even if there are no relevant indexes, so maybe index invalidation wouldn't be the only thing we'd ever need to do, but it would help a lot. In view of Jeff's list at the start of the thread, maybe that mechanism needs to be more general than just collation-related stuff: maybe there should be a general way to say "oopsie, this index can't be relied upon until it's rebuit" and a user could manually do that if they change the definition of an immutable function. Or there could even be some flag to CREATE FUNCTION that triggers it for all dependent indexes. I'm not really sure. If I remember correctly, Thomas Munro put a good deal of work into developing specifically for collation definition changes a few releases ago and it was judged not good enough, but that means we just still have nothing, which is unfortunate considering how often things go wrong in this area. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote: > Unless I missed something here, all the problem examples involve > unassigned code points that were later assigned. For normalization and case mapping that's right. For regexes, a character property could change. But that's mostly a theoretical problem because, at least in my experience, I can't recall ever seeing an index that would be affected. Regards, Jeff Davis
On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote: > On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut > <peter@eisentraut.org> wrote: > > I disagree with that. We should put ourselves into the position to > > adopt new Unicode versions without fear. Similar to updates to > > time > > zones, snowball, etc. > > > > We can't be discussing the merits of the Unicode update every year. > > That would be madness. > > Yeah, I agree with that 100%. It's hard for me to argue; that was my reasoning during development. But Noah seems to have a very strong opinion on this matter: https://www.postgresql.org/message-id/20240629220857.fb.nmisch%40google.com and I thought this thread would be a better opportunity for him to express it. Noah? > In view of Jeff's list at the start of the thread, > maybe that mechanism needs to be more general than just > collation-related stuff: maybe there should be a general way to say > "oopsie, this index can't be relied upon until it's rebuit" ... > If I remember correctly, Thomas Munro put a good deal of work into > developing specifically for collation definition changes a few > releases ago and it was judged not good enough, Yeah, see ec48314708. The revert appears to be for a number of technical reasons, but even if we solve all of those, it's hard to have a perfect solution that accounts for plpgsql functions that create arbitrary query strings and EXECUTE them. Though perhaps not impossible if we use some kind of runtime detection. We could have some kind of global context that tracks, at runtime, when an expression is executing for the purposes of an index. If a function depends on a versioned collation, then mark the index or add a version somewhere. Regards, Jeff Davis
On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote: > I propose that, going forward, we take more care with Unicode updates: > > assess the impact, provide time for comments, and consider possible > > mitigations. In other words, it would be reviewed like any other > > change. > > I disagree with that. We should put ourselves into the position to > adopt new Unicode versions without fear. Similar to updates to time > zones, snowball, etc. > > We can't be discussing the merits of the Unicode update every year. > That would be madness. How would we weigh each change against the > others? Some new character is introduced because it's the new currency > of some country; seems important. Some mobile phone platforms jumped > the gun and already use the character for the same purpose before it was > assigned; now the character is in databases but some function results > will change with the upgrade. How do we proceed? > > Moreover, if we were to decide to not take a particular Unicode update, > that would then stop that process forever, because whatever the issue > was wouldn't go away with the next Unicode version. I understand the difficulty (madness) of discussing every Unicode change. If that's unworkable, my preference would be to stick with some Unicode version and never modify it, ever. The choice that users could make in that case is a) use the built-in provider, don't get proper support for new code points, but never again worry about corrupted indexes after an upgrade b) use ICU collations, be up to date with Unicode, but reindex whenever you upgrade to a new ICU version > Unless I missed something here, all the problem examples involve > unassigned code points that were later assigned. (Assigned code points > already have compatibility mechanisms, such as collation versions.) So > I would focus on that issue. We already have a mechanism to disallow > unassigned code points. So there is a tradeoff that users can make: > Disallow unassigned code points and avoid upgrade issues resulting from > them. Maybe that just needs to be documented more prominently. Are you proposing a switch that would make PostgreSQL error out if somebody wants to use an unassigned code point? That would be an option. If what you mean is just add some documentation that tells people not to use unassigned code points if they want to avoid a reindex, I'd say that is not enough. Yours, Laurenz Albe
On Mon, 2024-07-22 at 19:18 +0200, Laurenz Albe wrote: > I understand the difficulty (madness) of discussing every Unicode > change. If that's unworkable, my preference would be to stick with > some > Unicode version and never modify it, ever. Among all the ways that IMMUTABLE and indexes can go wrong, is there a reason why you think we should draw such a bright line in this one case? > > Are you proposing a switch that would make PostgreSQL error out if > somebody wants to use an unassigned code point? That would be an > option. You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version 18 I have a proposal here to make it a database-level option: https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com (Note: the proposal might have a few holes in it, I didn't look at it lately and nobody has commented yet.) Regards, Jeff Davis
On Mon, 22 Jul 2024 at 13:51, Jeff Davis <pgsql@j-davis.com> wrote:
> Are you proposing a switch that would make PostgreSQL error out if
> somebody wants to use an unassigned code point? That would be an
> option.
You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version
18 I have a proposal here to make it a database-level option:
And if you define a domain over text with this check, you would effectively have a type that works exactly like text except you can only store assigned code points in it. Then use that instead of text everywhere (easy to audit with a query over the system tables).
On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > I understand the difficulty (madness) of discussing every Unicode > change. If that's unworkable, my preference would be to stick with some > Unicode version and never modify it, ever. I think that's a completely non-viable way forward. Even if everyone here voted in favor of that, five years from now there will be someone who shows up to say "I can't use your crappy software because the Unicode tables haven't been updated in five years, here's a patch!". And, like, what are we going to do? Still keeping shipping the 2024 version of Unicode four hundred years from now, assuming humanity and civilization and PostgreSQL are still around then? Holding something still "forever" is just never going to work. Every other piece of software in the world has to deal with changes as a result of the addition of new code points, and probably less commonly, revisions to existing code points. Presumably, their stuff breaks too, from time to time. I mean, I find it a bit difficult to believe that web browsers or messaging applications on phones only ever display emoji, and never try to do any sort of string sorting. The idea that PostgreSQL is the only thing that ever sorts strings cannot be taken seriously. So other people are presumably hacking around this in some way appropriate to what their software does, and we're going to have to figure out how to do the same thing. We could of course sit here and talk about whether it's really a good of the Unicode folks to add a lime emoji and a bunch of new emojis of people proceeding in a rightward direction to complement the existing emojis of people proceeding in a leftward direction, but they are going to do that whether we like it or not, and people -- including me, I'm afraid -- are going to use those emojis once they show up, so software that wants to remain relevant is going to have to support them. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote: > On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > I understand the difficulty (madness) of discussing every Unicode > > change. If that's unworkable, my preference would be to stick with some > > Unicode version and never modify it, ever. > > I think that's a completely non-viable way forward. Even if everyone > here voted in favor of that, five years from now there will be someone > who shows up to say "I can't use your crappy software because the > Unicode tables haven't been updated in five years, here's a patch!". > And, like, what are we going to do? Still keeping shipping the 2024 > version of Unicode four hundred years from now, assuming humanity and > civilization and PostgreSQL are still around then? Holding something > still "forever" is just never going to work. I hear you. It would be interesting to know what other RDBMS do here. Yours, Laurenz Albe
On Tue, Jul 23, 2024 at 3:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > I hear you. It would be interesting to know what other RDBMS do here. Yeah, I agree. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jul 23, 2024 at 1:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I understand the difficulty (madness) of discussing every Unicode
> > change. If that's unworkable, my preference would be to stick with some
> > Unicode version and never modify it, ever.
>
> I think that's a completely non-viable way forward. Even if everyone
> here voted in favor of that, five years from now there will be someone
> who shows up to say "I can't use your crappy software because the
> Unicode tables haven't been updated in five years, here's a patch!".
> And, like, what are we going to do? Still keeping shipping the 2024
> version of Unicode four hundred years from now, assuming humanity and
> civilization and PostgreSQL are still around then? Holding something
> still "forever" is just never going to work.
I hear you. It would be interesting to know what other RDBMS do here.
Other RDBMS are very careful not to corrupt databases, afaik including function based indexes, by changing Unicode. I’m not aware of any other RDBMS that updates Unicode versions in place; instead they support multiple Unicode versions and do not drop the old ones.
See also:
I know Jeff mentioned that Unicode tables copied into Postgres for normalization have been updated a few times. Did anyone ever actually discuss the fact that things like function based indexes can be corrupted by this, and weigh the reasoning? Are there past mailing list threads touching on the corruption problem and making the argument why updating anyway is the right thing to do? I always assumed that nobody had really dug deeply into this before the last few years.
I do agree it isn’t as broad of a problem as linguistic collation itself, which causes a lot more widespread corruption when it changes (as we’ve seen from glibc 2.28 and also other older hacker mailing list threads about smaller changes in older glibc versions corrupting databases). For now, Postgres only has code-point collation and the other Unicode functions mentioned in this thread.
-Jeremy
On Tue, Jul 23, 2024 at 8:32 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > Other RDBMS are very careful not to corrupt databases, afaik including function based indexes, by changing Unicode. I’mnot aware of any other RDBMS that updates Unicode versions in place; instead they support multiple Unicode versions anddo not drop the old ones. > > See also: > https://www.postgresql.org/message-id/E8754F74-C65F-4A1A-826F-FD9F37599A2E%40ardentperf.com Hmm. I think we might have some unique problems due to the fact that we rely partly on the operating system behavior, partly on libicu, and partly on our own internal tables. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jul 22, 2024 at 09:34:42AM -0700, Jeff Davis wrote: > On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote: > > On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote: > > > I disagree with that. We should put ourselves into the position to > > > adopt new Unicode versions without fear. Similar to updates to > > > time > > > zones, snowball, etc. > > > > > > We can't be discussing the merits of the Unicode update every year. > > > That would be madness. > > > > Yeah, I agree with that 100%. > > It's hard for me to argue; that was my reasoning during development. > > But Noah seems to have a very strong opinion on this matter: > > https://www.postgresql.org/message-id/20240629220857.fb.nmisch%40google.com > > and I thought this thread would be a better opportunity for him to > express it. Noah? Long-term, we should handle this like Oracle, SQL Server, and DB2 do: https://postgr.es/m/CA+fnDAbmn2d5tzZsj-4wmD0jApHTsg_zGWUpteb=OMSsX5rdAg@mail.gmail.com Short-term, we should remedy the step backward that pg_c_utf8 has taken: https://postgr.es/m/20240718233908.52.nmisch@google.com https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at $SUBJECT has proposed remedy "take more care with Unicode updates". If one wanted to pursue that, it should get more specific, by giving one or both of: (a) principles for deciding whether a Unicode update is okay (b) examples of past Unicode release changes and whether PostgreSQL should adopt a future Unicode version making a similar change That said, I'm not aware of an (a) or (b) likely to create an attractive compromise between the "index scan agrees with seqscan after pg_upgrade" goal (https://postgr.es/m/20240706195129.fd@rfd.leadboat.com) and the "don't freeze Unicode data" goal (https://postgr.es/m/CA+TgmoZRpOFVmQWKEXHdcKj9AFLbXT5ouwtXa58J=3ydLP00ZQ@mail.gmail.com). The "long-term" above would satisfy both goals. If it were me, I would abandon the "more care" proposal.
On Tue, 2024-07-23 at 08:49 -0400, Robert Haas wrote: > Hmm. I think we might have some unique problems due to the fact that > we rely partly on the operating system behavior, partly on libicu, > and > partly on our own internal tables. The reliance on the OS is especially problematic for reasons that have already been discussed extensively. One of my strongest motivations for PG_C_UTF8 was that there was still a use case for libc in PG16: the "C.UTF-8" locale, which is not supported at all in ICU. Daniel Vérité made me aware of the importance of this locale, which offers code point order collation combined with Unicode ctype semantics. With PG17, between ICU and the builtin provider, there's little remaining reason to use libc (aside from legacy). Regards, Jeff Davis
On Tue, Jul 23, 2024 at 1:03 PM Jeff Davis <pgsql@j-davis.com> wrote: > One of my strongest motivations for PG_C_UTF8 was that there was still > a use case for libc in PG16: the "C.UTF-8" locale, which is not > supported at all in ICU. Daniel Vérité made me aware of the importance > of this locale, which offers code point order collation combined with > Unicode ctype semantics. > > With PG17, between ICU and the builtin provider, there's little > remaining reason to use libc (aside from legacy). I was really interested to read Jeremy Schneider's slide deck, to which he linked earlier, wherein he explained that other major databases default to something more like C.UTF-8. Maybe we need to relitigate the debate about what our default should be in light of those findings (but, if so, on another thread with a clear subject line). But even if we were to decide to change the default, there are lots and lots of existing databases out there that are using libc collations. I'm not in a good position to guess how many of those people actually truly care about language-specific collations. I'm positive it's not zero, but I can't really guess how much more than zero it is. Even if it were zero, though, the fact that so many upgrades are done using pg_upgrade means that this problem will still be around in a decade even if we changed the default tomorrow. (I do understand that you wrote "aside from legacy" so I'm not accusing you of ignoring the upgrade issues, just taking the opportunity to be more explicit about my own view.) Also, Noah has pointed out that C.UTF-8 introduces some forward-compatibility hazards of its own, at least with respect to ctype semantics. I don't have a clear view of what ought to be done about that, but if we just replace a dependency on an unstable set of libc definitions with a dependency on an equally unstable set of PostgreSQL definitions, we're not really winning. Do we need to version the new ctype provider? -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Also, Noah has pointed out that C.UTF-8 introduces some > forward-compatibility hazards of its own, at least with respect to > ctype semantics. I don't have a clear view of what ought to be done > about that, but if we just replace a dependency on an unstable set of > libc definitions with a dependency on an equally unstable set of > PostgreSQL definitions, we're not really winning. No, I think we *are* winning, because the updates are not "equally unstable": with pg_c_utf8, we control when changes happen. We can align them with major releases and release-note the differences. With libc-based collations, we have zero control and not much notification. > Do we need to version the new ctype provider? It would be a version for the underlying Unicode definitions, not the provider as such, but perhaps yes. I don't know to what extent doing so would satisfy Noah's concern; but if it would do so I'd be happy with that answer. regards, tom lane
On 7/23/24 15:26, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Also, Noah has pointed out that C.UTF-8 introduces some >> forward-compatibility hazards of its own, at least with respect to >> ctype semantics. I don't have a clear view of what ought to be done >> about that, but if we just replace a dependency on an unstable set of >> libc definitions with a dependency on an equally unstable set of >> PostgreSQL definitions, we're not really winning. > > No, I think we *are* winning, because the updates are not "equally > unstable": with pg_c_utf8, we control when changes happen. We can > align them with major releases and release-note the differences. > With libc-based collations, we have zero control and not much > notification. +1 >> Do we need to version the new ctype provider? > > It would be a version for the underlying Unicode definitions, > not the provider as such, but perhaps yes. I don't know to what > extent doing so would satisfy Noah's concern; but if it would do > so I'd be happy with that answer. I came to the same conclusion. I think someone mentioned somewhere on this thread that other databases support multiple Unicode versions. I think we need to figure out how to do that too. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote: > No, I think we *are* winning, because the updates are not "equally > unstable": with pg_c_utf8, we control when changes happen. We can > align them with major releases and release-note the differences. > With libc-based collations, we have zero control and not much > notification. Also, changes to libc collations are much more impactful, at least two orders of magnitude. All indexes on text are at risk, even primary keys. PG_C_UTF8 has stable code point ordering (memcmp()) that is unaffected by Unicode updates, so primary keys will never be affected. The risks we are talking about are for expression indexes, e.g. on LOWER(). Even if you do have such expression indexes, the types of changes Unicode makes to casing and character properties are typically much more mild. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote: >> No, I think we *are* winning, because the updates are not "equally >> unstable": with pg_c_utf8, we control when changes happen. We can >> align them with major releases and release-note the differences. >> With libc-based collations, we have zero control and not much >> notification. > Also, changes to libc collations are much more impactful, at least two > orders of magnitude. All indexes on text are at risk, even primary > keys. Well, it depends on which libc collation you have in mind. I was thinking of a libc-supplied C.UTF-8 collation, which I would expect to behave the same as pg_c_utf8, modulo which Unicode version it's based on. But even when comparing to that, pg_c_utf8 can win on stability for the reasons I stated. If you don't have a C.UTF-8 collation available, and are forced to use en_US.UTF-8 or $locale-of-choice, then the stability picture is far more dire, as Jeff says. Noah seems to be comparing the stability of pg_c_utf8 to the stability of a pure C/POSIX collation, but I do not think that is the relevant comparison to make. Besides, if someone is using C/POSIX, this feature doesn't stop them from continuing to do so. regards, tom lane
On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote: > we should remedy the step backward that pg_c_utf8 has taken: Obviously I disagree that we've taken a step backwards. Can you articulate the principle by which all of the other problems with IMMUTABLE are just fine, but updates to Unicode are intolerable, and only for PG_C_UTF8? Regards, Jeff Davis
On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > No, I think we *are* winning, because the updates are not "equally > unstable": with pg_c_utf8, we control when changes happen. We can > align them with major releases and release-note the differences. > With libc-based collations, we have zero control and not much > notification. OK, that's pretty fair. > > Do we need to version the new ctype provider? > > It would be a version for the underlying Unicode definitions, > not the provider as such, but perhaps yes. I don't know to what > extent doing so would satisfy Noah's concern; but if it would do > so I'd be happy with that answer. I don't see how we can get by without some kind of versioning here. It's probably too late to do that for v17, but if we bet either that (1) we'll never need to change anything for pg_c_utf8 or that (2) those changes will be so minor that nobody will have a problem, I think we will lose our bet. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Do we need to version the new ctype provider? >> It would be a version for the underlying Unicode definitions, >> not the provider as such, but perhaps yes. I don't know to what >> extent doing so would satisfy Noah's concern; but if it would do >> so I'd be happy with that answer. > I don't see how we can get by without some kind of versioning here. > It's probably too late to do that for v17, Why? If we agree that that's the way forward, we could certainly stick some collversion other than "1" into pg_c_utf8's pg_collation entry. There's already been one v17 catversion bump since beta2 (716bd12d2), so another one is basically free. regards, tom lane
Tom Lane wrote: > > I don't see how we can get by without some kind of versioning here. > > It's probably too late to do that for v17, > > Why? If we agree that that's the way forward, we could certainly > stick some collversion other than "1" into pg_c_utf8's pg_collation > entry. There's already been one v17 catversion bump since beta2 > (716bd12d2), so another one is basically free. pg_collation.collversion has been used so far for the sort part of the collations. For the ctype part: postgres=# select unicode_version(); unicode_version ----------------- 15.1 (1 row) postgres=# select icu_unicode_version (); icu_unicode_version --------------------- 14.0 (1 row) Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 22.07.24 19:55, Robert Haas wrote: > Every other piece of software in the world has to deal with changes as > a result of the addition of new code points, and probably less > commonly, revisions to existing code points. Presumably, their stuff > breaks too, from time to time. I mean, I find it a bit difficult to > believe that web browsers or messaging applications on phones only > ever display emoji, and never try to do any sort of string sorting. The sorting isn't the problem. We have a versioning mechanism for collations. What we do with the version information is clearly not perfect yet, but the mechanism exists and you can hack together queries that answer the question, did anything change here that would affect my indexes. And you could build more tooling around that and so on. The problem being considered here are updates to Unicode itself, as distinct from the collation tables. A Unicode update can impact at least two things: - Code points that were previously unassigned are now assigned. That's obviously a very common thing with every Unicode update. The new character will have new properties attached to it, so the result of various functions that use such properties (upper(), lower(), normalize(), etc.) could change, because previously the code point had no properties, and so those functions would not do anything interesting with the character. - Certain properties of an existing character can change. Like, a character used to be a letter and now it's a digit. (This is an example; I'm not sure if that particular change would be allowed.) In the extreme case, this could have the same impact as the above, but in practice the kinds of changes that are allowed wouldn't affect typical indexes. I don't think this has anything in particular to do with the new builtin collation provider. That is just one new consumer of this.
"Daniel Verite" <daniel@manitou-mail.org> writes: > Tom Lane wrote: >> Why? If we agree that that's the way forward, we could certainly >> stick some collversion other than "1" into pg_c_utf8's pg_collation >> entry. There's already been one v17 catversion bump since beta2 >> (716bd12d2), so another one is basically free. > pg_collation.collversion has been used so far for the sort part > of the collations. Hmm, we haven't particularly drawn a distinction between sort-related and not-sort-related aspects of collation versions AFAIK. Perhaps it'd be appropriate to do so, and I agree that there's not time to design such a thing for v17. But pg_c_utf8 might be the only case where we could do anything other than advance those versions in lockstep. I doubt we have enough insight into the behaviors of other providers to say confidently that an update affects only one side of their behavior. regards, tom lane
On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote: > Well, it depends on which libc collation you have in mind. I was > thinking of a libc-supplied C.UTF-8 collation, which I would expect > to behave the same as pg_c_utf8, modulo which Unicode version it's > based on. Daniel Vérité documented[1] cases where the libc C.UTF-8 locale changed the *sort* behavior, thereby affecting primary keys. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org
Jeff Davis <pgsql@j-davis.com> writes: > On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote: >> Well, it depends on which libc collation you have in mind. I was >> thinking of a libc-supplied C.UTF-8 collation, which I would expect >> to behave the same as pg_c_utf8, modulo which Unicode version it's >> based on. > Daniel Vérité documented[1] cases where the libc C.UTF-8 locale changed > the *sort* behavior, thereby affecting primary keys. Ouch. But we didn't establish whether that was an ancient bug, or something likely to happen again. (In any case, that surely reinforces the point that we can expect pg_c_utf8 to be more stable than any previously-available alternative.) regards, tom lane
On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote: > The sorting isn't the problem. We have a versioning mechanism for > collations. What we do with the version information is clearly not > perfect yet, but the mechanism exists and you can hack together queries > that answer the question, did anything change here that would affect my > indexes. And you could build more tooling around that and so on. In my experience, sorting is, overwhelmingly, the problem. What people complain about is that they do an upgrade - of PG or some OS package - and then their indexes are broken. Or their partition bounds are broken. That we have versioning information that someone could hypothetically know how to do something useful with is not really useful, because nobody actually knows how to do it, and there's nothing to trigger them to do it in the first place. People don't think "oh, I'm running dnf update, I better run undocumented queries against the PostgreSQL system catalogs to see whether my system is going to melt afterwards." What needs to happen is that when you do something that breaks something, something notices automatically and tells you and gives you a way to get it fixed again. Or better yet, when you do something that would break something as things stand today, some kind of versioning logic kicks in and you keep the old behavior and nothing actually breaks. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 2024-07-23 at 21:37 -0400, Robert Haas wrote: > In my experience, sorting is, overwhelmingly, the problem. I strongly agree. > That we have versioning information that someone could hypothetically > know how to do something useful with is not really useful, because > nobody actually knows how to do it Including me. I put significant effort into creating some views that could help users identify potentially-affected indexes based on collation changes, and I gave up. In theory it's just about impossible (consider some UDF that constructs queries and EXECUTEs them -- what collations does that depend on?). In practice, it's not much easier, and you might as well just reindex everything having to do with text. In contrast, if the problem is CTYPE-related, users are in a much better position. It won't affect their primary keys or most indexes. It's much more tractable to review your expression indexes and look for problems (not ideal, but better). Also, as Peter points out, CTYPE changes are typically more narrow, so there's a good chance that there's no problem at all. Regards, Jeff Davis
On 24.07.24 03:37, Robert Haas wrote: > On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote: >> The sorting isn't the problem. We have a versioning mechanism for >> collations. What we do with the version information is clearly not >> perfect yet, but the mechanism exists and you can hack together queries >> that answer the question, did anything change here that would affect my >> indexes. And you could build more tooling around that and so on. > > In my experience, sorting is, overwhelmingly, the problem. What people > complain about is that they do an upgrade - of PG or some OS package - > and then their indexes are broken. Or their partition bounds are > broken. Fair enough. My argument was, that topic is distinct from the topic of this thread.
On Tue, Jul 23, 2024 at 01:07:49PM -0700, Jeff Davis wrote: > On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote: > > Short-term, we should remedy the step backward that pg_c_utf8 has taken: > > https://postgr.es/m/20240718233908.52.nmisch@google.com > > https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at > > Obviously I disagree that we've taken a step backwards. Yes. > Can you articulate the principle by which all of the other problems > with IMMUTABLE are just fine, but updates to Unicode are intolerable, > and only for PG_C_UTF8? No, because I don't think all the other problems with IMMUTABLE are just fine. The two messages linked cover the comparisons I do consider important, especially the comparison between pg_c_utf8 and packager-frozen ICU.
On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote: > Fair enough. My argument was, that topic is distinct from the topic of > this thread. OK, that's fair. But I think the solutions are the same: we complain all the time about glibc and ICU shipping collations and not versioning them. We shouldn't make the same kinds of mistakes. Even if ctype is less likely to break things than collations, it still can, and we should move in the direction of letting people keep the v17 behavior for the foreseeable future while at the same time having a way that they can also get the new behavior if they want it (and the new behavior should be the default). I note in passing that the last time I saw a customer query with UPPER() in the join clause was... yesterday. The problems there had nothing to do with CTYPE, but there's no reason to suppose that it couldn't have had such a problem. I suspect the reason we don't hear about ctype problems now is that the collation problems are worse and happen in similar situations. But if all the collation problems went away, a subset of the same users would then be unhappy about ctype. So I don't want to see us sit on our hands and assert that we don't need to worry about ctype because it's minor in comparison with collation. It *is* minor in comparison with collation. But one problem can be small in comparison with another and still bad. If an aircraft is on fire whilst experiencing a dual engine failure, it's still in a lot of trouble even if the fire can be put out. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jul 24, 2024 at 6:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday. The problems there had
nothing to do with CTYPE, but there's no reason to suppose that it
couldn't have had such a problem. I suspect the reason we don't hear
about ctype problems now is that the collation problems are worse and
happen in similar situations. But if all the collation problems went
away, a subset of the same users would then be unhappy about ctype.
I have seen and created indexes on upper() functions a number of times too, and I think this is not an uncommon pattern for case insensitive searching
Before glibc 2.28, there was at least one mailing list thread where an unhappy person complained about collation problems; but for a number of years before 2.28 I guess the collation changes were uncommon so it didn’t get enough momentum to be considered a real problem until the problem became widespread a few years ago?
I myself would prefer an approach here that sets a higher bar for pg_upgrade not corrupting indexes, rather than saying it’s ok as long as it’s rare
-Jeremy
On Tue, 2024-07-23 at 06:31 -0600, Jeremy Schneider wrote: > Other RDBMS are very careful not to corrupt databases, afaik > including function based indexes, by changing Unicode. I’m not aware > of any other RDBMS that updates Unicode versions in place; instead > they support multiple Unicode versions and do not drop the old ones. I'm curious about the details of what other RDBMSs do. Let's simplify and say that there's one database-wide collation at version 1, and the application doesn't use any COLLATE clause or other specifications for queries or DDL. Then, version 2 of that collation becomes available. When a query comes into the database, which version of the collation does it use, 1 or 2? If it uses the latest available (version 2), then all the old indexes are effectively useless. So I suppose there's some kind of migration process where you rebuild/fix objects to use the new collation, and when that's done then you change the default so that queries use version 2. How does all that work? Regards, Jeff Davis
On Wed, 2024-07-24 at 08:20 -0400, Robert Haas wrote: > I note in passing that the last time I saw a customer query with > UPPER() in the join clause was... yesterday. Can you expand on that? This thread is mostly about durable state so I don't immediately see the connection. > So I don't want to see us sit on our hands and assert that we don't > need to worry about ctype because it's minor in comparison with > collation. It *is* minor in comparison with collation. ... > But one problem > can be small in comparison with another and still bad. If an aircraft > is on fire whilst experiencing a dual engine failure, it's still in a > lot of trouble even if the fire can be put out. There's a qualitative difference between a collation update which can break your PKs and FKs, and a ctype update which definitely will not. Your analogy doesn't quite capture this distinction. I don't mean to over-emphasize this point, but I do think we need to keep some perspective here. But I agree with your general point that we shouldn't dismiss the problem just because it's minor. We should expect the problem to surface at some point and be reasonably prepared. Regards, Jeff Davis
On 24.07.24 14:20, Robert Haas wrote: > On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> Fair enough. My argument was, that topic is distinct from the topic of >> this thread. > > OK, that's fair. But I think the solutions are the same: we complain > all the time about glibc and ICU shipping collations and not > versioning them. We shouldn't make the same kinds of mistakes. Even if > ctype is less likely to break things than collations, it still can, > and we should move in the direction of letting people keep the v17 > behavior for the foreseeable future while at the same time having a > way that they can also get the new behavior if they want it (and the > new behavior should be the default). Versioning is possibly part of the answer, but I think it would be different versioning from the collation version. The collation versions are in principle designed to change rarely. Some languages' rules might change once in twenty years, some never. Maybe you have a database mostly in English and a few tables in, I don't know, Swedish (unverified examples). Most of the time nothing happens during upgrades, but one time in many years you need to reindex the Swedish tables, and the system starts warning you about that as soon as you access the Swedish tables. (Conversely, if you never actually access the Swedish tables, then you don't get warned about.) If we wanted a similar versioning system for the Unicode updates, it would be separate. We'd write the Unicode version that was current when the system catalogs were initialized into, say, a pg_database column. And then at run-time, when someone runs say the normalize() function or some regular expression character classification, then we check what the version of the current compiled-in Unicode tables are, and then we'd issue a warning when they are different. A possible problem is that the Unicode version changes in practice with every major PostgreSQL release, so this approach would end up warning users after every upgrade. To avoid that, we'd probably need to keep support for multiple Unicode versions around, as has been suggested in this thread already.
On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote: > There's a qualitative difference between a collation update which can > break your PKs and FKs, and a ctype update which definitely will not. I don't think that's true. All you need is a unique index on UPPER(somecol). -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 2024-07-24 at 14:47 -0400, Robert Haas wrote: > On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote: > > There's a qualitative difference between a collation update which > > can > > break your PKs and FKs, and a ctype update which definitely will > > not. > > I don't think that's true. All you need is a unique index on > UPPER(somecol). Primary keys are on plain column references, not expressions; and don't support WHERE clauses, so I don't see how a ctype update would affect a PK. In any case, you are correct that Unicode updates could put some constraints at risk, including unique indexes, CHECK, and partition constraints. But someone has to actually use one of the affected functions somewhere, and that's the main distinction that I'm trying to draw. The reason why collation is qualitatively a much bigger problem is because there's no obvious indication that you are doing anything related to collation at all. A very plain "CREATE TABLE x(t text PRIMARY KEY)" is at risk. Regards, Jeff Davis
On Wed, Jul 24, 2024 at 3:12 PM Jeff Davis <pgsql@j-davis.com> wrote: > In any case, you are correct that Unicode updates could put some > constraints at risk, including unique indexes, CHECK, and partition > constraints. But someone has to actually use one of the affected > functions somewhere, and that's the main distinction that I'm trying to > draw. > > The reason why collation is qualitatively a much bigger problem is > because there's no obvious indication that you are doing anything > related to collation at all. A very plain "CREATE TABLE x(t text > PRIMARY KEY)" is at risk. Well, I don't know. I agree that collation is a much bigger problem, but not for that reason. I think a user who is familiar with the problems in this area will see the danger either way, and one who isn't, won't. For me, the only real difference is that a unique index on a text column is a lot more common than one that involves UPPER. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jul 24, 2024 at 12:47 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> There's a qualitative difference between a collation update which can
> break your PKs and FKs, and a ctype update which definitely will not.
I don't think that's true. All you need is a unique index on UPPER(somecol).
I doubt it’s common to have unique on upper()
But non-unique indexes for case insensitive searches will be more common. Historically this is the most common way people did case insensitive on oracle.
Changing ctype would mean these queries can return wrong results
The impact would be similar to the critical problem TripAdvisor hit in 2014 with their read replicas, in the Postgres email thread I linked above
-Jeremy
On Wed, Jul 24, 2024 at 3:43 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > But non-unique indexes for case insensitive searches will be more common. Historically this is the most common way peopledid case insensitive on oracle. > > Changing ctype would mean these queries can return wrong results Yeah. I mentioned earlier that I very recently saw a customer query with UPPER() in the join condition. If someone is doing foo JOIN bar ON upper(foo.x) = upper(bar.x), it is not unlikely that one or both of those expressions are indexed. Not guaranteed, of course, but very plausible. -- Robert Haas EDB: http://www.enterprisedb.com