Thread: [18] Policy on IMMUTABLE functions and Unicode updates

[18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
The IMMUTABLE marker for functions is quite simple on the surface, but
could be interpreted a few different ways, and there's some historical
baggage that makes it complicated.

There are a number of ways in which IMMUTABLE functions can change
behavior:

1. Updating or moving to a different OS affects all collations that use
the libc provider (other than "C" and "POSIX", which don't actually use
libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
affected.

2. Updating ICU affects the collations that use the ICU provider.
ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
are also affected.

3. Moving to a different database encoding may affect collations that
use the "C" or "POSIX" locales in the libc provider (NB: those locales
don't actually use libc).

4. A PG Unicode update may change the results of functions that depend
on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
(NB: collation itself is not affected -- always code point order).

5. If a well-defined IMMUTABLE function produces the wrong results, we
may fix the bug in the next major release.

6. The GUC extra_float_digits can change the results of floating point
text output.

7. A UDF may be improperly marked IMMUTABLE. A particularly common
variant is a UDF without search_path specified, which is probably not
truly IMMUTABLE.

(more I'm sure, please add to list...)


#1 and #2 have been discussed much more than the rest, but I think it's
worthwhile to enumerate the other problems even if the impact is a lot
lower.


Noah seemed particularly concerned[1] about #4, so I'll start off by
discussing that. Here's a brief history (slightly confusing because the
PG and Unicode versions are similar numbers):

  PG13: Unicode 13.0 and NORMALIZE() is first exposed as a SQL function
  PG15: Unicode updated to 14.0
  PG16: Unicode updated to 15.0
  PG17: Unicode updated to 15.1, UNICODE_ASSIGNED(), UNICODE_VERSION()
and builtin "C.UTF-8" locale are introduced

To repeat, these Unicode updates do not affect collation itself, they
affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
If using the builtin "C.UTF-8" locale, they also affect LOWER(),
INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
provider hasn't yet gone through any Unicode update.)

There are two alternative philosophies:

A. By choosing to use a Unicode-based function, the user has opted in
to the Unicode stability guarantees[2], and it's fine to update Unicode
occasionally in new major versions as long as we are transparent with
the user.

B. IMMUTABLE implies some very strict definition of stability, and we
should never again update Unicode because it changes the results of
IMMUTABLE functions.

We've been following (A), and that's the defacto policy today[3][4].
Noah and Laurenz argued[5] that the policy starting in version 18
should be (B). Given that it's a policy decision that affects more than
just the builtin collation provider, I'd like to discuss it more
broadly outside of that subthread.

Regards,
    Jeff Davis


[1] 
https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com

[2]
https://www.unicode.org/policies/stability_policy.html

[3] 
https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com

[4]
https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us

[5]
https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Joe Conway
Date:
On 7/16/24 13:42, Jeff Davis wrote:
> The IMMUTABLE marker for functions is quite simple on the surface, but
> could be interpreted a few different ways, and there's some historical
> baggage that makes it complicated.
> 
> There are a number of ways in which IMMUTABLE functions can change
> behavior:
> 
> 1. Updating or moving to a different OS affects all collations that use
> the libc provider (other than "C" and "POSIX", which don't actually use
> libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
> affected.
> 
> 2. Updating ICU affects the collations that use the ICU provider.
> ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
> are also affected.
> 
> 3. Moving to a different database encoding may affect collations that
> use the "C" or "POSIX" locales in the libc provider (NB: those locales
> don't actually use libc).
> 
> 4. A PG Unicode update may change the results of functions that depend
> on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
> UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
> locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
> (NB: collation itself is not affected -- always code point order).
> 
> 5. If a well-defined IMMUTABLE function produces the wrong results, we
> may fix the bug in the next major release.
> 
> 6. The GUC extra_float_digits can change the results of floating point
> text output.
> 
> 7. A UDF may be improperly marked IMMUTABLE. A particularly common
> variant is a UDF without search_path specified, which is probably not
> truly IMMUTABLE.
> 
> (more I'm sure, please add to list...)
> 
> 
> #1 and #2 have been discussed much more than the rest, but I think it's
> worthwhile to enumerate the other problems even if the impact is a lot
> lower.
> 
> 
> Noah seemed particularly concerned[1] about #4, so I'll start off by
> discussing that. Here's a brief history (slightly confusing because the
> PG and Unicode versions are similar numbers):
> 
>    PG13: Unicode 13.0 and NORMALIZE() is first exposed as a SQL function
>    PG15: Unicode updated to 14.0
>    PG16: Unicode updated to 15.0
>    PG17: Unicode updated to 15.1, UNICODE_ASSIGNED(), UNICODE_VERSION()
> and builtin "C.UTF-8" locale are introduced
> 
> To repeat, these Unicode updates do not affect collation itself, they
> affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
> If using the builtin "C.UTF-8" locale, they also affect LOWER(),
> INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
> provider hasn't yet gone through any Unicode update.)
> 
> There are two alternative philosophies:
> 
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
> 
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
> 
> We've been following (A), and that's the defacto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.

On the general topic, we have these definitions in the fine manual:

8<-----------------
A VOLATILE function can do anything, ... A query using a volatile 
function will re-evaluate the function at every row where its value is 
needed.

A STABLE function cannot modify the database and is guaranteed to return 
the same results given the same arguments for all rows within a single 
statement...

An IMMUTABLE function cannot modify the database and is guaranteed to 
return the same results given the same arguments forever.
8<-----------------

As Jeff points out, the IMMUTABLE definition has never really been true. 
Even the STABLE is not quite right, as there are at least some STABLE 
functions that will return the same value for multiple statements if 
they are within a transaction block (e.g. "now()" -- TBH I don't 
remember offhand if that is true for all stable functions).

In any case, there is quite a gap between "forever" and "single 
statement". Perhaps we need to have more volatility categories, with 
guarantees that lie somewhere between the two, and allow those to be 
used like we do IMMUTABLE except with appropriate warning labels. E.g. 
something ("STABLE_VERSION"?) to mean "forever within a major version 
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't 
upgrade your OS".

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
"David G. Johnston"
Date:
On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com> wrote:

> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the defacto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.

On the general topic, we have these definitions in the fine manual:

8<-----------------
A VOLATILE function can do anything, ... A query using a volatile
function will re-evaluate the function at every row where its value is
needed.

A STABLE function cannot modify the database and is guaranteed to return
the same results given the same arguments for all rows within a single
statement...

An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever.
8<-----------------

As Jeff points out, the IMMUTABLE definition has never really been true.
 
Even the STABLE is not quite right, as there are at least some STABLE
functions that will return the same value for multiple statements if
they are within a transaction block (e.g. "now()" -- TBH I don't
remember offhand if that is true for all stable functions).

Under-specification here doesn't make the meaning of stable incorrect.  We don't have anything that guarantees stability at the transaction scope because I don't think it can be guaranteed there without considering whether said transaction is read-committed, repeatable read, or serializable.  The function itself can promise more but the marker seems correctly scoped for how the system uses it in statement optimization.

 and allow those to be
used like we do IMMUTABLE except with appropriate warning labels. E.g.
something ("STABLE_VERSION"?) to mean "forever within a major version
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't
upgrade your OS".


I'd be content cutting "forever" down to "within a given server configuration".  Then just note that immutable functions can depend implicitly on external server characteristics and so when moving data between servers re-evaluation of immutable functions may be necessary.  Not so bad for indexes.  A bit more problematic for generated values.

I'm not against adding metadata options here but for internal functions comments and documentation can work.  For user-defined functions I have my doubts on how trustworthy they would end up being.

For the original question, I suggest continuing behaving per "A" and work on making it more clear to users what that means in terms of server upgrades.

If we do add metadata to reflect our reality I'd settle on a generic "STATIC" marker that can be used on those functions the rely on real world state, whether we are directly calling into the system (e.g., hashing) or have chosen to provide the state access management ourselves (e.g., unicode).

When we do take control we should have a goal of allowing for a given external dependency version to exist in many PostgreSQL versions and give the DBA the choice of when to move individual databases from one version to the next.  Possibly dropping the dependency version support alongside the dropping of support of the major version it first appeared in. Not keeping up with external dependency versions just punishes new users by forbidding them a tool permanently, as well as puts us out-of-step with those dependency development groups, to save existing users some short-term pain.  Being able to deal with that pain at a time different than the middle of a major version upgrade, one database at a time, gives those existing users reasonable options.

David J.

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Joe Conway
Date:
On 7/16/24 15:33, David G. Johnston wrote:
> On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> wrote:
> 
> 
>      > There are two alternative philosophies:
>      >
>      > A. By choosing to use a Unicode-based function, the user has opted in
>      > to the Unicode stability guarantees[2], and it's fine to update
>     Unicode
>      > occasionally in new major versions as long as we are transparent with
>      > the user.
>      >
>      > B. IMMUTABLE implies some very strict definition of stability, and we
>      > should never again update Unicode because it changes the results of
>      > IMMUTABLE functions.
>      >
>      > We've been following (A), and that's the defacto policy today[3][4].
>      > Noah and Laurenz argued[5] that the policy starting in version 18
>      > should be (B). Given that it's a policy decision that affects
>     more than
>      > just the builtin collation provider, I'd like to discuss it more
>      > broadly outside of that subthread.
> 
>     On the general topic, we have these definitions in the fine manual:
> 
>     8<-----------------
>     A VOLATILE function can do anything, ... A query using a volatile
>     function will re-evaluate the function at every row where its value is
>     needed.
> 
>     A STABLE function cannot modify the database and is guaranteed to
>     return
>     the same results given the same arguments for all rows within a single
>     statement...
> 
>     An IMMUTABLE function cannot modify the database and is guaranteed to
>     return the same results given the same arguments forever.
>     8<-----------------
> 
>     As Jeff points out, the IMMUTABLE definition has never really been
>     true.
> 
>     Even the STABLE is not quite right, as there are at least some STABLE
>     functions that will return the same value for multiple statements if
>     they are within a transaction block (e.g. "now()" -- TBH I don't
>     remember offhand if that is true for all stable functions).
> 
> 
> Under-specification here doesn't make the meaning of stable incorrect.  
> We don't have anything that guarantees stability at the transaction 
> scope because I don't think it can be guaranteed there without 
> considering whether said transaction is read-committed, repeatable read, 
> or serializable.  The function itself can promise more but the marker 
> seems correctly scoped for how the system uses it in statement optimization.


The way it is described is still surprising and can bite you if you are 
not familiar with the nuances. In particular I have seen now() used in 
transaction blocks surprise more than one person over the years.


>       and allow those to be
>     used like we do IMMUTABLE except with appropriate warning labels. E.g.
>     something ("STABLE_VERSION"?) to mean "forever within a major version
>     lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you
>     don't
>     upgrade your OS".
> 
> I'd be content cutting "forever" down to "within a given server 
> configuration".  Then just note that immutable functions can depend 
> implicitly on external server characteristics and so when moving data 
> between servers re-evaluation of immutable functions may be necessary.  
> Not so bad for indexes.  A bit more problematic for generated values.

Yeah I forgot about the configuration controlled ones.

> I'm not against adding metadata options here but for internal functions 
> comments and documentation can work.  For user-defined functions I have 
> my doubts on how trustworthy they would end up being.

People lie all the time for user-defined functions, usually specifically 
when they need IMMUTABLE semantics and are willing to live with the risk 
and/or apply their own controls to ensure no changes in output.

> For the original question, I suggest continuing behaving per "A" and 
> work on making it more clear to users what that means in terms of server 
> upgrades.
> 
> If we do add metadata to reflect our reality I'd settle on a generic 
> "STATIC" marker that can be used on those functions the rely on real 
> world state, whether we are directly calling into the system (e.g., 
> hashing) or have chosen to provide the state access management ourselves 
> (e.g., unicode).

So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the 
third position before IMMUTABLE), give it IMMUTABLE semantics, mark 
builtin functions that deserve it, and document with suitable caution 
statements?

I guess can live with just one additional level of granularity.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the 
> third position before IMMUTABLE), give it IMMUTABLE semantics, mark 
> builtin functions that deserve it, and document with suitable caution 
> statements?

What is the point of that, exactly?

I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level.  Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.

To put this another way: the existing volatility levels were
basically reverse-engineered from the ways that the planner could
meaningfully treat a function: it's dangerous, it is safe enough
to use in an index condition (which changes the number of times
the query will evaluate it), or it's safe to constant-fold in
advance of execution.  Unless there's a fourth planner behavior that's
worth having, we don't need a fourth level.  Possibly you could
argue that "safe to put in an index" is a different level from
"safe to constant-fold", but I don't really agree with that.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Joe Conway
Date:
On 7/16/24 16:16, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the 
>> third position before IMMUTABLE), give it IMMUTABLE semantics, mark 
>> builtin functions that deserve it, and document with suitable caution 
>> statements?
> 
> What is the point of that, exactly?
> 
> I'll agree that the user documentation could use some improvement
> in how it describes the volatility levels, but I do not see how
> it will reduce anybody's confusion to invent multiple aliases for
> what's effectively the same volatility level.  Nor do I see a
> use-case for actually having multiple versions of "immutable".
> Once you've decided you can put something into an index, quibbling
> over just how immutable it is doesn't really change anything.
> 
> To put this another way: the existing volatility levels were
> basically reverse-engineered from the ways that the planner could
> meaningfully treat a function: it's dangerous, it is safe enough
> to use in an index condition (which changes the number of times
> the query will evaluate it), or it's safe to constant-fold in
> advance of execution.  Unless there's a fourth planner behavior that's
> worth having, we don't need a fourth level.  Possibly you could
> argue that "safe to put in an index" is a different level from
> "safe to constant-fold", but I don't really agree with that.

Fair enough, but then I think we should change the documentation to not 
say "forever".

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> Fair enough, but then I think we should change the documentation to not 
> say "forever".

No objection to that, it's clearly a misleading definition.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
"David G. Johnston"
Date:
On Tue, Jul 16, 2024 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Joe Conway <mail@joeconway.com> writes:
> So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
> third position before IMMUTABLE), give it IMMUTABLE semantics, mark
> builtin functions that deserve it, and document with suitable caution
> statements?

What is the point of that, exactly?

I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level.  Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.


I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use dependencies and report on any of these it finds and remind the DBA that this latent issue may exist in their system.

I agree the core behaviors of the system would remain unchanged and both modes would be handled identically.  Though requiring superuser or a predefined role membership to actually use a "static" mode function in an index or generated expression would be an interesting option to consider.

David J.

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeremy Schneider
Date:

On Tue, Jul 16, 2024 at 3:28 PM David G. Johnston <david.g.johnston@gmail.com> wrote:

I'd teach pg_upgrade to inspect the post-upgraded catalog of in-use dependencies and report on any of these it finds and remind the DBA that this latent issue may exist in their system.

Would this help? Collation-related dependency changes are a different thing from major version DB upgrades

Tom’s point about how the levels are directly tied to concrete differences in behavior (planner/executor) makes a lot of sense to me

-Jeremy

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-16 at 13:27 -0700, David G. Johnston wrote:
> I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use
> dependencies and report on any of these it finds and remind the DBA
> that this latent issue may exist in their system.

That's impossible to do in a complete way, and hard to do with much
accuracy. I don't oppose it though -- if someone finds a way to provide
enough information to be useful, then that's fine with me.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Laurenz Albe
Date:
On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote:
> The IMMUTABLE marker for functions is quite simple on the surface, but
> could be interpreted a few different ways, and there's some historical
> baggage that makes it complicated.
>
> There are a number of ways in which IMMUTABLE functions can change
> behavior:
>
> 1. Updating or moving to a different OS affects all collations that use
> the libc provider (other than "C" and "POSIX", which don't actually use
> libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
> affected.
>
> 2. Updating ICU affects the collations that use the ICU provider.
> ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
> are also affected.
>
> 3. Moving to a different database encoding may affect collations that
> use the "C" or "POSIX" locales in the libc provider (NB: those locales
> don't actually use libc).
>
> 4. A PG Unicode update may change the results of functions that depend
> on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
> UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
> locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
> (NB: collation itself is not affected -- always code point order).
>
> 5. If a well-defined IMMUTABLE function produces the wrong results, we
> may fix the bug in the next major release.
>
> 6. The GUC extra_float_digits can change the results of floating point
> text output.
>
> 7. A UDF may be improperly marked IMMUTABLE. A particularly common
> variant is a UDF without search_path specified, which is probably not
> truly IMMUTABLE.
>
> Noah seemed particularly concerned[1] about #4, so I'll start off by
> discussing that.
>
> Unicode updates do not affect collation itself, they
> affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
> If using the builtin "C.UTF-8" locale, they also affect LOWER(),
> INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
> provider hasn't yet gone through any Unicode update.)
>
> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the defacto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.
>
> [1] 
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com
>
> [2]
> https://www.unicode.org/policies/stability_policy.html
>
> [3] 
> https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com
>
> [4]
> https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us
>
> [5]
> https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at

Concerning #4, the new built-in locale, my hope (and, in my opinion, its only
value) is to get out of the problems #1 and #2 that are not under our control.

If changes in major PostgreSQL versions force users of the built-in
locale provider to rebuild indexes, that would invalidate it.  I think that
users care more about data corruption than about exact Unicode-compliant
behavior.  Anybody who does can use ICU.

People routinely create indexes that involve upper() or lower(), so I'd
say changing their behavior would be a problem.

Perhaps I should moderate my statement: if a change affects only a newly
introduced code point (which is unlikely to be used in a database), and we
think that the change is very important, we could consider applying it.
But that should be carefully considered; I am against blindly following the
changes in Unicode.

Yours,
Laurenz Albe



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote:
> Perhaps I should moderate my statement: if a change affects only a
> newly
> introduced code point (which is unlikely to be used in a database),
> and we
> think that the change is very important, we could consider applying
> it.
> But that should be carefully considered; I am against blindly
> following the
> changes in Unicode.

That sounds reasonable.

I propose that, going forward, we take more care with Unicode updates:
assess the impact, provide time for comments, and consider possible
mitigations. In other words, it would be reviewed like any other
change.

Ideally, some new developments would make it less worrisome, and
Unicode updates could become more routine. I have some ideas, which I
can propose in separate threads. But for now, I don't see a reason to
rush Unicode updates.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Peter Eisentraut
Date:
On 19.07.24 21:41, Jeff Davis wrote:
> On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote:
>> Perhaps I should moderate my statement: if a change affects only a
>> newly
>> introduced code point (which is unlikely to be used in a database),
>> and we
>> think that the change is very important, we could consider applying
>> it.
>> But that should be carefully considered; I am against blindly
>> following the
>> changes in Unicode.
> 
> That sounds reasonable.
> 
> I propose that, going forward, we take more care with Unicode updates:
> assess the impact, provide time for comments, and consider possible
> mitigations. In other words, it would be reviewed like any other
> change.

I disagree with that.  We should put ourselves into the position to 
adopt new Unicode versions without fear.  Similar to updates to time 
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year. 
That would be madness.  How would we weigh each change against the 
others?  Some new character is introduced because it's the new currency 
of some country; seems important.  Some mobile phone platforms jumped 
the gun and already use the character for the same purpose before it was 
assigned; now the character is in databases but some function results 
will change with the upgrade.  How do we proceed?

Moreover, if we were to decide to not take a particular Unicode update, 
that would then stop that process forever, because whatever the issue 
was wouldn't go away with the next Unicode version.


Unless I missed something here, all the problem examples involve 
unassigned code points that were later assigned.  (Assigned code points 
already have compatibility mechanisms, such as collation versions.)  So 
I would focus on that issue.  We already have a mechanism to disallow 
unassigned code points.  So there is a tradeoff that users can make: 
Disallow unassigned code points and avoid upgrade issues resulting from 
them.  Maybe that just needs to be documented more prominently.




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> I disagree with that.  We should put ourselves into the position to
> adopt new Unicode versions without fear.  Similar to updates to time
> zones, snowball, etc.
>
> We can't be discussing the merits of the Unicode update every year.
> That would be madness.

Yeah, I agree with that 100%. I can't imagine that we want to, in
effect, develop our own version of Unicode that is not quite the same
as upstream.

We've got to figure out a way to fix this problem from the other end -
coping with updates when they happen. I feel like we've already
discussed the obvious approach at some length: have a way to mark
indexes invalid when "immutable" things change. That doesn't fix
everything because you could, for example, manufacture constraint
violations, even if there are no relevant indexes, so maybe index
invalidation wouldn't be the only thing we'd ever need to do, but it
would help a lot. In view of Jeff's list at the start of the thread,
maybe that mechanism needs to be more general than just
collation-related stuff: maybe there should be a general way to say
"oopsie, this index can't be relied upon until it's rebuit" and a user
could manually do that if they change the definition of an immutable
function. Or there could even be some flag to CREATE FUNCTION that
triggers it for all dependent indexes. I'm not really sure.

If I remember correctly, Thomas Munro put a good deal of work into
developing specifically for collation definition changes a few
releases ago and it was judged not good enough, but that means we just
still have nothing, which is unfortunate considering how often things
go wrong in this area.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote:
> Unless I missed something here, all the problem examples involve
> unassigned code points that were later assigned.

For normalization and case mapping that's right.

For regexes, a character property could change. But that's mostly a
theoretical problem because, at least in my experience, I can't recall
ever seeing an index that would be affected.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut
> <peter@eisentraut.org> wrote:
> > I disagree with that.  We should put ourselves into the position to
> > adopt new Unicode versions without fear.  Similar to updates to
> > time
> > zones, snowball, etc.
> >
> > We can't be discussing the merits of the Unicode update every year.
> > That would be madness.
>
> Yeah, I agree with that 100%.

It's hard for me to argue; that was my reasoning during development.

But Noah seems to have a very strong opinion on this matter:

https://www.postgresql.org/message-id/20240629220857.fb.nmisch%40google.com

and I thought this thread would be a better opportunity for him to
express it. Noah?

> In view of Jeff's list at the start of the thread,
> maybe that mechanism needs to be more general than just
> collation-related stuff: maybe there should be a general way to say
> "oopsie, this index can't be relied upon until it's rebuit"

...

> If I remember correctly, Thomas Munro put a good deal of work into
> developing specifically for collation definition changes a few
> releases ago and it was judged not good enough, 

Yeah, see ec48314708. The revert appears to be for a number of
technical reasons, but even if we solve all of those, it's hard to have
a perfect solution that accounts for plpgsql functions that create
arbitrary query strings and EXECUTE them.

Though perhaps not impossible if we use some kind of runtime detection.
We could have some kind of global context that tracks, at runtime, when
an expression is executing for the purposes of an index. If a function
depends on a versioned collation, then mark the index or add a version
somewhere.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Laurenz Albe
Date:
On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote:
> I propose that, going forward, we take more care with Unicode updates:
> > assess the impact, provide time for comments, and consider possible
> > mitigations. In other words, it would be reviewed like any other
> > change.
>
> I disagree with that.  We should put ourselves into the position to
> adopt new Unicode versions without fear.  Similar to updates to time
> zones, snowball, etc.
>
> We can't be discussing the merits of the Unicode update every year.
> That would be madness.  How would we weigh each change against the
> others?  Some new character is introduced because it's the new currency
> of some country; seems important.  Some mobile phone platforms jumped
> the gun and already use the character for the same purpose before it was
> assigned; now the character is in databases but some function results
> will change with the upgrade.  How do we proceed?
>
> Moreover, if we were to decide to not take a particular Unicode update,
> that would then stop that process forever, because whatever the issue
> was wouldn't go away with the next Unicode version.

I understand the difficulty (madness) of discussing every Unicode
change.  If that's unworkable, my preference would be to stick with some
Unicode version and never modify it, ever.

The choice that users could make in that case is

a) use the built-in provider, don't get proper support for new code
   points, but never again worry about corrupted indexes after an
   upgrade

b) use ICU collations, be up to date with Unicode, but reindex whenever
   you upgrade to a new ICU version

> Unless I missed something here, all the problem examples involve
> unassigned code points that were later assigned.  (Assigned code points
> already have compatibility mechanisms, such as collation versions.)  So
> I would focus on that issue.  We already have a mechanism to disallow
> unassigned code points.  So there is a tradeoff that users can make:
> Disallow unassigned code points and avoid upgrade issues resulting from
> them.  Maybe that just needs to be documented more prominently.

Are you proposing a switch that would make PostgreSQL error out if
somebody wants to use an unassigned code point?  That would be an option.
If what you mean is just add some documentation that tells people not
to use unassigned code points if they want to avoid a reindex, I'd say
that is not enough.

Yours,
Laurenz Albe



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Mon, 2024-07-22 at 19:18 +0200, Laurenz Albe wrote:
> I understand the difficulty (madness) of discussing every Unicode
> change.  If that's unworkable, my preference would be to stick with
> some
> Unicode version and never modify it, ever.

Among all the ways that IMMUTABLE and indexes can go wrong, is there a
reason why you think we should draw such a bright line in this one
case?


>
> Are you proposing a switch that would make PostgreSQL error out if
> somebody wants to use an unassigned code point?  That would be an
> option.

You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version
18 I have a proposal here to make it a database-level option:

https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com

(Note: the proposal might have a few holes in it, I didn't look at it
lately and nobody has commented yet.)

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Isaac Morland
Date:
On Mon, 22 Jul 2024 at 13:51, Jeff Davis <pgsql@j-davis.com> wrote:
 
> Are you proposing a switch that would make PostgreSQL error out if
> somebody wants to use an unassigned code point?  That would be an
> option.

You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version
18 I have a proposal here to make it a database-level option:

And if you define a domain over text with this check, you would effectively have a type that works exactly like text except you can only store assigned code points in it. Then use that instead of text everywhere (easy to audit with a query over the system tables).

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I understand the difficulty (madness) of discussing every Unicode
> change.  If that's unworkable, my preference would be to stick with some
> Unicode version and never modify it, ever.

I think that's a completely non-viable way forward. Even if everyone
here voted in favor of that, five years from now there will be someone
who shows up to say "I can't use your crappy software because the
Unicode tables haven't been updated in five years, here's a patch!".
And, like, what are we going to do? Still keeping shipping the 2024
version of Unicode four hundred years from now, assuming humanity and
civilization and PostgreSQL are still around then? Holding something
still "forever" is just never going to work.

Every other piece of software in the world has to deal with changes as
a result of the addition of new code points, and probably less
commonly, revisions to existing code points. Presumably, their stuff
breaks too, from time to time. I mean, I find it a bit difficult to
believe that web browsers or messaging applications on phones only
ever display emoji, and never try to do any sort of string sorting.
The idea that PostgreSQL is the only thing that ever sorts strings
cannot be taken seriously. So other people are presumably hacking
around this in some way appropriate to what their software does, and
we're going to have to figure out how to do the same thing. We could
of course sit here and talk about whether it's really a good of the
Unicode folks to add a lime emoji and a bunch of new emojis of people
proceeding in a rightward direction to complement the existing emojis
of people proceeding in a leftward direction, but they are going to do
that whether we like it or not, and people -- including me, I'm afraid
-- are going to use those emojis once they show up, so software that
wants to remain relevant is going to have to support them.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Laurenz Albe
Date:
On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I understand the difficulty (madness) of discussing every Unicode
> > change.  If that's unworkable, my preference would be to stick with some
> > Unicode version and never modify it, ever.
>
> I think that's a completely non-viable way forward. Even if everyone
> here voted in favor of that, five years from now there will be someone
> who shows up to say "I can't use your crappy software because the
> Unicode tables haven't been updated in five years, here's a patch!".
> And, like, what are we going to do? Still keeping shipping the 2024
> version of Unicode four hundred years from now, assuming humanity and
> civilization and PostgreSQL are still around then? Holding something
> still "forever" is just never going to work.

I hear you.  It would be interesting to know what other RDBMS do here.

Yours,
Laurenz Albe



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Tue, Jul 23, 2024 at 3:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> I hear you.  It would be interesting to know what other RDBMS do here.

Yeah, I agree.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeremy Schneider
Date:
On Tue, Jul 23, 2024 at 1:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > I understand the difficulty (madness) of discussing every Unicode
> > change.  If that's unworkable, my preference would be to stick with some
> > Unicode version and never modify it, ever.
>
> I think that's a completely non-viable way forward. Even if everyone
> here voted in favor of that, five years from now there will be someone
> who shows up to say "I can't use your crappy software because the
> Unicode tables haven't been updated in five years, here's a patch!".
> And, like, what are we going to do? Still keeping shipping the 2024
> version of Unicode four hundred years from now, assuming humanity and
> civilization and PostgreSQL are still around then? Holding something
> still "forever" is just never going to work.

I hear you.  It would be interesting to know what other RDBMS do here.

Other RDBMS are very careful not to corrupt databases, afaik including function based indexes, by changing Unicode. I’m not aware of any other RDBMS that updates Unicode versions in place; instead they support multiple Unicode versions and do not drop the old ones.

See also:

I know Jeff mentioned that Unicode tables copied into Postgres for normalization have been updated a few times. Did anyone ever actually discuss the fact that things like function based indexes can be corrupted by this, and weigh the reasoning? Are there past mailing list threads touching on the corruption problem and making the argument why updating anyway is the right thing to do? I always assumed that nobody had really dug deeply into this before the last few years.

I do agree it isn’t as broad of a problem as linguistic collation itself, which causes a lot more widespread corruption when it changes (as we’ve seen from glibc 2.28 and also other older hacker mailing list threads about smaller changes in older glibc versions corrupting databases). For now, Postgres only has code-point collation and the other Unicode functions mentioned in this thread.

-Jeremy

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Tue, Jul 23, 2024 at 8:32 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Other RDBMS are very careful not to corrupt databases, afaik including function based indexes, by changing Unicode.
I’mnot aware of any other RDBMS that updates Unicode versions in place; instead they support multiple Unicode versions
anddo not drop the old ones. 
>
> See also:
> https://www.postgresql.org/message-id/E8754F74-C65F-4A1A-826F-FD9F37599A2E%40ardentperf.com

Hmm. I think we might have some unique problems due to the fact that
we rely partly on the operating system behavior, partly on libicu, and
partly on our own internal tables.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Noah Misch
Date:
On Mon, Jul 22, 2024 at 09:34:42AM -0700, Jeff Davis wrote:
> On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote:
> > On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > > I disagree with that.  We should put ourselves into the position to
> > > adopt new Unicode versions without fear.  Similar to updates to
> > > time
> > > zones, snowball, etc.
> > > 
> > > We can't be discussing the merits of the Unicode update every year.
> > > That would be madness.
> > 
> > Yeah, I agree with that 100%.
> 
> It's hard for me to argue; that was my reasoning during development.
> 
> But Noah seems to have a very strong opinion on this matter:
> 
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch%40google.com
> 
> and I thought this thread would be a better opportunity for him to
> express it. Noah?

Long-term, we should handle this like Oracle, SQL Server, and DB2 do:
https://postgr.es/m/CA+fnDAbmn2d5tzZsj-4wmD0jApHTsg_zGWUpteb=OMSsX5rdAg@mail.gmail.com

Short-term, we should remedy the step backward that pg_c_utf8 has taken:
https://postgr.es/m/20240718233908.52.nmisch@google.com
https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at


$SUBJECT has proposed remedy "take more care with Unicode updates".  If one
wanted to pursue that, it should get more specific, by giving one or both of:

(a) principles for deciding whether a Unicode update is okay
(b) examples of past Unicode release changes and whether PostgreSQL should
    adopt a future Unicode version making a similar change

That said, I'm not aware of an (a) or (b) likely to create an attractive
compromise between the "index scan agrees with seqscan after pg_upgrade" goal
(https://postgr.es/m/20240706195129.fd@rfd.leadboat.com) and the "don't freeze
Unicode data" goal
(https://postgr.es/m/CA+TgmoZRpOFVmQWKEXHdcKj9AFLbXT5ouwtXa58J=3ydLP00ZQ@mail.gmail.com).
The "long-term" above would satisfy both goals.  If it were me, I would
abandon the "more care" proposal.



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 08:49 -0400, Robert Haas wrote:
> Hmm. I think we might have some unique problems due to the fact that
> we rely partly on the operating system behavior, partly on libicu,
> and
> partly on our own internal tables.

The reliance on the OS is especially problematic for reasons that have
already been discussed extensively.

One of my strongest motivations for PG_C_UTF8 was that there was still
a use case for libc in PG16: the "C.UTF-8" locale, which is not
supported at all in ICU. Daniel Vérité made me aware of the importance
of this locale, which offers code point order collation combined with
Unicode ctype semantics.

With PG17, between ICU and the builtin provider, there's little
remaining reason to use libc (aside from legacy).

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Tue, Jul 23, 2024 at 1:03 PM Jeff Davis <pgsql@j-davis.com> wrote:
> One of my strongest motivations for PG_C_UTF8 was that there was still
> a use case for libc in PG16: the "C.UTF-8" locale, which is not
> supported at all in ICU. Daniel Vérité made me aware of the importance
> of this locale, which offers code point order collation combined with
> Unicode ctype semantics.
>
> With PG17, between ICU and the builtin provider, there's little
> remaining reason to use libc (aside from legacy).

I was really interested to read Jeremy Schneider's slide deck, to
which he linked earlier, wherein he explained that other major
databases default to something more like C.UTF-8. Maybe we need to
relitigate the debate about what our default should be in light of
those findings (but, if so, on another thread with a clear subject
line). But even if we were to decide to change the default, there are
lots and lots of existing databases out there that are using libc
collations. I'm not in a good position to guess how many of those
people actually truly care about language-specific collations. I'm
positive it's not zero, but I can't really guess how much more than
zero it is. Even if it were zero, though, the fact that so many
upgrades are done using pg_upgrade means that this problem will still
be around in a decade even if we changed the default tomorrow.

(I do understand that you wrote "aside from legacy" so I'm not
accusing you of ignoring the upgrade issues, just taking the
opportunity to be more explicit about my own view.)

Also, Noah has pointed out that C.UTF-8 introduces some
forward-compatibility hazards of its own, at least with respect to
ctype semantics. I don't have a clear view of what ought to be done
about that, but if we just replace a dependency on an unstable set of
libc definitions with a dependency on an equally unstable set of
PostgreSQL definitions, we're not really winning. Do we need to
version the new ctype provider?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Also, Noah has pointed out that C.UTF-8 introduces some
> forward-compatibility hazards of its own, at least with respect to
> ctype semantics. I don't have a clear view of what ought to be done
> about that, but if we just replace a dependency on an unstable set of
> libc definitions with a dependency on an equally unstable set of
> PostgreSQL definitions, we're not really winning.

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen.  We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

> Do we need to version the new ctype provider?

It would be a version for the underlying Unicode definitions,
not the provider as such, but perhaps yes.  I don't know to what
extent doing so would satisfy Noah's concern; but if it would do
so I'd be happy with that answer.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Joe Conway
Date:
On 7/23/24 15:26, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Also, Noah has pointed out that C.UTF-8 introduces some
>> forward-compatibility hazards of its own, at least with respect to
>> ctype semantics. I don't have a clear view of what ought to be done
>> about that, but if we just replace a dependency on an unstable set of
>> libc definitions with a dependency on an equally unstable set of
>> PostgreSQL definitions, we're not really winning.
> 
> No, I think we *are* winning, because the updates are not "equally
> unstable": with pg_c_utf8, we control when changes happen.  We can
> align them with major releases and release-note the differences.
> With libc-based collations, we have zero control and not much
> notification.

+1

>> Do we need to version the new ctype provider?
> 
> It would be a version for the underlying Unicode definitions,
> not the provider as such, but perhaps yes.  I don't know to what
> extent doing so would satisfy Noah's concern; but if it would do
> so I'd be happy with that answer.

I came to the same conclusion. I think someone mentioned somewhere on 
this thread that other databases support multiple Unicode versions. I 
think we need to figure out how to do that too.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote:
> No, I think we *are* winning, because the updates are not "equally
> unstable": with pg_c_utf8, we control when changes happen.  We can
> align them with major releases and release-note the differences.
> With libc-based collations, we have zero control and not much
> notification.

Also, changes to libc collations are much more impactful, at least two
orders of magnitude. All indexes on text are at risk, even primary
keys.

PG_C_UTF8 has stable code point ordering (memcmp()) that is unaffected
by Unicode updates, so primary keys will never be affected. The risks
we are talking about are for expression indexes, e.g. on LOWER(). Even
if you do have such expression indexes, the types of changes Unicode
makes to casing and character properties are typically much more mild.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote:
>> No, I think we *are* winning, because the updates are not "equally
>> unstable": with pg_c_utf8, we control when changes happen.  We can
>> align them with major releases and release-note the differences.
>> With libc-based collations, we have zero control and not much
>> notification.

> Also, changes to libc collations are much more impactful, at least two
> orders of magnitude. All indexes on text are at risk, even primary
> keys.

Well, it depends on which libc collation you have in mind.  I was
thinking of a libc-supplied C.UTF-8 collation, which I would expect
to behave the same as pg_c_utf8, modulo which Unicode version it's
based on.  But even when comparing to that, pg_c_utf8 can win on
stability for the reasons I stated.  If you don't have a C.UTF-8
collation available, and are forced to use en_US.UTF-8 or
$locale-of-choice, then the stability picture is far more dire,
as Jeff says.

Noah seems to be comparing the stability of pg_c_utf8 to the stability
of a pure C/POSIX collation, but I do not think that is the relevant
comparison to make.  Besides, if someone is using C/POSIX, this
feature doesn't stop them from continuing to do so.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote:
> we should remedy the step backward that pg_c_utf8 has taken:

Obviously I disagree that we've taken a step backwards.

Can you articulate the principle by which all of the other problems
with IMMUTABLE are just fine, but updates to Unicode are intolerable,
and only for PG_C_UTF8?

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, I think we *are* winning, because the updates are not "equally
> unstable": with pg_c_utf8, we control when changes happen.  We can
> align them with major releases and release-note the differences.
> With libc-based collations, we have zero control and not much
> notification.

OK, that's pretty fair.

> > Do we need to version the new ctype provider?
>
> It would be a version for the underlying Unicode definitions,
> not the provider as such, but perhaps yes.  I don't know to what
> extent doing so would satisfy Noah's concern; but if it would do
> so I'd be happy with that answer.

I don't see how we can get by without some kind of versioning here.
It's probably too late to do that for v17, but if we bet either that
(1) we'll never need to change anything for pg_c_utf8 or that (2)
those changes will be so minor that nobody will have a problem, I
think we will lose our bet.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Do we need to version the new ctype provider?

>> It would be a version for the underlying Unicode definitions,
>> not the provider as such, but perhaps yes.  I don't know to what
>> extent doing so would satisfy Noah's concern; but if it would do
>> so I'd be happy with that answer.

> I don't see how we can get by without some kind of versioning here.
> It's probably too late to do that for v17,

Why?  If we agree that that's the way forward, we could certainly
stick some collversion other than "1" into pg_c_utf8's pg_collation
entry.  There's already been one v17 catversion bump since beta2
(716bd12d2), so another one is basically free.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
"Daniel Verite"
Date:
    Tom Lane wrote:

> > I don't see how we can get by without some kind of versioning here.
> > It's probably too late to do that for v17,
>
> Why?  If we agree that that's the way forward, we could certainly
> stick some collversion other than "1" into pg_c_utf8's pg_collation
> entry.  There's already been one v17 catversion bump since beta2
> (716bd12d2), so another one is basically free.

pg_collation.collversion has been used so far for the sort part
of the collations.

For the ctype part:

postgres=# select unicode_version();
 unicode_version
-----------------
 15.1
(1 row)


postgres=# select icu_unicode_version ();
 icu_unicode_version
---------------------
 14.0
(1 row)



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Peter Eisentraut
Date:
On 22.07.24 19:55, Robert Haas wrote:
> Every other piece of software in the world has to deal with changes as
> a result of the addition of new code points, and probably less
> commonly, revisions to existing code points. Presumably, their stuff
> breaks too, from time to time. I mean, I find it a bit difficult to
> believe that web browsers or messaging applications on phones only
> ever display emoji, and never try to do any sort of string sorting.

The sorting isn't the problem.  We have a versioning mechanism for 
collations.  What we do with the version information is clearly not 
perfect yet, but the mechanism exists and you can hack together queries 
that answer the question, did anything change here that would affect my 
indexes.  And you could build more tooling around that and so on.

The problem being considered here are updates to Unicode itself, as 
distinct from the collation tables.  A Unicode update can impact at 
least two things:

- Code points that were previously unassigned are now assigned.  That's 
obviously a very common thing with every Unicode update.  The new 
character will have new properties attached to it, so the result of 
various functions that use such properties (upper(), lower(), 
normalize(), etc.) could change, because previously the code point had 
no properties, and so those functions would not do anything interesting 
with the character.

- Certain properties of an existing character can change.  Like, a 
character used to be a letter and now it's a digit.  (This is an 
example; I'm not sure if that particular change would be allowed.)  In 
the extreme case, this could have the same impact as the above, but in 
practice the kinds of changes that are allowed wouldn't affect typical 
indexes.

I don't think this has anything in particular to do with the new builtin 
collation provider.  That is just one new consumer of this.



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     Tom Lane wrote:
>> Why?  If we agree that that's the way forward, we could certainly
>> stick some collversion other than "1" into pg_c_utf8's pg_collation
>> entry.  There's already been one v17 catversion bump since beta2
>> (716bd12d2), so another one is basically free.

> pg_collation.collversion has been used so far for the sort part
> of the collations.

Hmm, we haven't particularly drawn a distinction between sort-related
and not-sort-related aspects of collation versions AFAIK.  Perhaps
it'd be appropriate to do so, and I agree that there's not time to
design such a thing for v17.  But pg_c_utf8 might be the only case
where we could do anything other than advance those versions in
lockstep.  I doubt we have enough insight into the behaviors of
other providers to say confidently that an update affects only
one side of their behavior.

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote:
> Well, it depends on which libc collation you have in mind.  I was
> thinking of a libc-supplied C.UTF-8 collation, which I would expect
> to behave the same as pg_c_utf8, modulo which Unicode version it's
> based on.

Daniel Vérité documented[1] cases where the libc C.UTF-8 locale changed
the *sort* behavior, thereby affecting primary keys.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote:
>> Well, it depends on which libc collation you have in mind.  I was
>> thinking of a libc-supplied C.UTF-8 collation, which I would expect
>> to behave the same as pg_c_utf8, modulo which Unicode version it's
>> based on.

> Daniel Vérité documented[1] cases where the libc C.UTF-8 locale changed
> the *sort* behavior, thereby affecting primary keys.

Ouch.  But we didn't establish whether that was an ancient bug,
or something likely to happen again.  (In any case, that surely
reinforces the point that we can expect pg_c_utf8 to be more
stable than any previously-available alternative.)

            regards, tom lane



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> The sorting isn't the problem.  We have a versioning mechanism for
> collations.  What we do with the version information is clearly not
> perfect yet, but the mechanism exists and you can hack together queries
> that answer the question, did anything change here that would affect my
> indexes.  And you could build more tooling around that and so on.

In my experience, sorting is, overwhelmingly, the problem. What people
complain about is that they do an upgrade - of PG or some OS package -
and then their indexes are broken. Or their partition bounds are
broken.

That we have versioning information that someone could hypothetically
know how to do something useful with is not really useful, because
nobody actually knows how to do it, and there's nothing to trigger
them to do it in the first place. People don't think "oh, I'm running
dnf update, I better run undocumented queries against the PostgreSQL
system catalogs to see whether my system is going to melt afterwards."

What needs to happen is that when you do something that breaks
something, something notices automatically and tells you and gives you
a way to get it fixed again. Or better yet, when you do something that
would break something as things stand today, some kind of versioning
logic kicks in and you keep the old behavior and nothing actually
breaks.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 21:37 -0400, Robert Haas wrote:
> In my experience, sorting is, overwhelmingly, the problem.

I strongly agree.

> That we have versioning information that someone could hypothetically
> know how to do something useful with is not really useful, because
> nobody actually knows how to do it

Including me. I put significant effort into creating some views that
could help users identify potentially-affected indexes based on
collation changes, and I gave up. In theory it's just about impossible
(consider some UDF that constructs queries and EXECUTEs them -- what
collations does that depend on?). In practice, it's not much easier,
and you might as well just reindex everything having to do with text.

In contrast, if the problem is CTYPE-related, users are in a much
better position. It won't affect their primary keys or most indexes.
It's much more tractable to review your expression indexes and look for
problems (not ideal, but better). Also, as Peter points out, CTYPE
changes are typically more narrow, so there's a good chance that
there's no problem at all.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Peter Eisentraut
Date:
On 24.07.24 03:37, Robert Haas wrote:
> On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>> The sorting isn't the problem.  We have a versioning mechanism for
>> collations.  What we do with the version information is clearly not
>> perfect yet, but the mechanism exists and you can hack together queries
>> that answer the question, did anything change here that would affect my
>> indexes.  And you could build more tooling around that and so on.
> 
> In my experience, sorting is, overwhelmingly, the problem. What people
> complain about is that they do an upgrade - of PG or some OS package -
> and then their indexes are broken. Or their partition bounds are
> broken.

Fair enough.  My argument was, that topic is distinct from the topic of 
this thread.




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Noah Misch
Date:
On Tue, Jul 23, 2024 at 01:07:49PM -0700, Jeff Davis wrote:
> On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote:
> > Short-term, we should remedy the step backward that pg_c_utf8 has taken:
> > https://postgr.es/m/20240718233908.52.nmisch@google.com
> > https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at
> 
> Obviously I disagree that we've taken a step backwards.

Yes.

> Can you articulate the principle by which all of the other problems
> with IMMUTABLE are just fine, but updates to Unicode are intolerable,
> and only for PG_C_UTF8?

No, because I don't think all the other problems with IMMUTABLE are just fine.
The two messages linked cover the comparisons I do consider important,
especially the comparison between pg_c_utf8 and packager-frozen ICU.



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> Fair enough.  My argument was, that topic is distinct from the topic of
> this thread.

OK, that's fair. But I think the solutions are the same: we complain
all the time about glibc and ICU shipping collations and not
versioning them. We shouldn't make the same kinds of mistakes. Even if
ctype is less likely to break things than collations, it still can,
and we should move in the direction of letting people keep the v17
behavior for the foreseeable future while at the same time having a
way that they can also get the new behavior if they want it (and the
new behavior should be the default).

I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday. The problems there had
nothing to do with CTYPE, but there's no reason to suppose that it
couldn't have had such a problem. I suspect the reason we don't hear
about ctype problems now is that the collation problems are worse and
happen in similar situations. But if all the collation problems went
away, a subset of the same users would then be unhappy about ctype.

So I don't want to see us sit on our hands and assert that we don't
need to worry about ctype because it's minor in comparison with
collation. It *is* minor in comparison with collation. But one problem
can be small in comparison with another and still bad. If an aircraft
is on fire whilst experiencing a dual engine failure, it's still in a
lot of trouble even if the fire can be put out.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeremy Schneider
Date:
On Wed, Jul 24, 2024 at 6:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday. The problems there had
nothing to do with CTYPE, but there's no reason to suppose that it
couldn't have had such a problem. I suspect the reason we don't hear
about ctype problems now is that the collation problems are worse and
happen in similar situations. But if all the collation problems went
away, a subset of the same users would then be unhappy about ctype.

I have seen and created indexes on upper() functions a number of times too, and I think this is not an uncommon pattern for case insensitive searching

Before glibc 2.28, there was at least one mailing list thread where an unhappy person complained about collation problems; but for a number of years before 2.28 I guess the collation changes were uncommon so it didn’t get enough momentum to be considered a real problem until the problem became widespread a few years ago?


I myself would prefer an approach here that sets a higher bar for pg_upgrade not corrupting indexes, rather than saying it’s ok as long as it’s rare

-Jeremy

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 06:31 -0600, Jeremy Schneider wrote:
> Other RDBMS are very careful not to corrupt databases, afaik
> including function based indexes, by changing Unicode. I’m not aware
> of any other RDBMS that updates Unicode versions in place; instead
> they support multiple Unicode versions and do not drop the old ones.

I'm curious about the details of what other RDBMSs do.

Let's simplify and say that there's one database-wide collation at
version 1, and the application doesn't use any COLLATE clause or other
specifications for queries or DDL.

Then, version 2 of that collation becomes available. When a query comes
into the database, which version of the collation does it use, 1 or 2?
If it uses the latest available (version 2), then all the old indexes
are effectively useless.

So I suppose there's some kind of migration process where you
rebuild/fix objects to use the new collation, and when that's done then
you change the default so that queries use version 2. How does all that
work?

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Wed, 2024-07-24 at 08:20 -0400, Robert Haas wrote:
> I note in passing that the last time I saw a customer query with
> UPPER() in the join clause was... yesterday.

Can you expand on that? This thread is mostly about durable state so I
don't immediately see the connection.

> So I don't want to see us sit on our hands and assert that we don't
> need to worry about ctype because it's minor in comparison with
> collation. It *is* minor in comparison with collation. 

...

> But one problem
> can be small in comparison with another and still bad. If an aircraft
> is on fire whilst experiencing a dual engine failure, it's still in a
> lot of trouble even if the fire can be put out.

There's a qualitative difference between a collation update which can
break your PKs and FKs, and a ctype update which definitely will not.
Your analogy doesn't quite capture this distinction. I don't mean to
over-emphasize this point, but I do think we need to keep some
perspective here.

But I agree with your general point that we shouldn't dismiss the
problem just because it's minor. We should expect the problem to
surface at some point and be reasonably prepared.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Peter Eisentraut
Date:
On 24.07.24 14:20, Robert Haas wrote:
> On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> Fair enough.  My argument was, that topic is distinct from the topic of
>> this thread.
> 
> OK, that's fair. But I think the solutions are the same: we complain
> all the time about glibc and ICU shipping collations and not
> versioning them. We shouldn't make the same kinds of mistakes. Even if
> ctype is less likely to break things than collations, it still can,
> and we should move in the direction of letting people keep the v17
> behavior for the foreseeable future while at the same time having a
> way that they can also get the new behavior if they want it (and the
> new behavior should be the default).

Versioning is possibly part of the answer, but I think it would be 
different versioning from the collation version.

The collation versions are in principle designed to change rarely.  Some 
languages' rules might change once in twenty years, some never.  Maybe 
you have a database mostly in English and a few tables in, I don't know, 
Swedish (unverified examples).  Most of the time nothing happens during 
upgrades, but one time in many years you need to reindex the Swedish 
tables, and the system starts warning you about that as soon as you 
access the Swedish tables.  (Conversely, if you never actually access 
the Swedish tables, then you don't get warned about.)

If we wanted a similar versioning system for the Unicode updates, it 
would be separate.  We'd write the Unicode version that was current when 
the system catalogs were initialized into, say, a pg_database column. 
And then at run-time, when someone runs say the normalize() function or 
some regular expression character classification, then we check what the 
version of the current compiled-in Unicode tables are, and then we'd 
issue a warning when they are different.

A possible problem is that the Unicode version changes in practice with 
every major PostgreSQL release, so this approach would end up warning 
users after every upgrade.  To avoid that, we'd probably need to keep 
support for multiple Unicode versions around, as has been suggested in 
this thread already.




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> There's a qualitative difference between a collation update which can
> break your PKs and FKs, and a ctype update which definitely will not.

I don't think that's true. All you need is a unique index on UPPER(somecol).

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeff Davis
Date:
On Wed, 2024-07-24 at 14:47 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > There's a qualitative difference between a collation update which
> > can
> > break your PKs and FKs, and a ctype update which definitely will
> > not.
>
> I don't think that's true. All you need is a unique index on
> UPPER(somecol).

Primary keys are on plain column references, not expressions; and don't
support WHERE clauses, so I don't see how a ctype update would affect a
PK.

In any case, you are correct that Unicode updates could put some
constraints at risk, including unique indexes, CHECK, and partition
constraints. But someone has to actually use one of the affected
functions somewhere, and that's the main distinction that I'm trying to
draw.

The reason why collation is qualitatively a much bigger problem is
because there's no obvious indication that you are doing anything
related to collation at all. A very plain "CREATE TABLE x(t text
PRIMARY KEY)" is at risk.

Regards,
    Jeff Davis




Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Wed, Jul 24, 2024 at 3:12 PM Jeff Davis <pgsql@j-davis.com> wrote:
> In any case, you are correct that Unicode updates could put some
> constraints at risk, including unique indexes, CHECK, and partition
> constraints. But someone has to actually use one of the affected
> functions somewhere, and that's the main distinction that I'm trying to
> draw.
>
> The reason why collation is qualitatively a much bigger problem is
> because there's no obvious indication that you are doing anything
> related to collation at all. A very plain "CREATE TABLE x(t text
> PRIMARY KEY)" is at risk.

Well, I don't know. I agree that collation is a much bigger problem,
but not for that reason. I think a user who is familiar with the
problems in this area will see the danger either way, and one who
isn't, won't. For me, the only real difference is that a unique index
on a text column is a lot more common than one that involves UPPER.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Jeremy Schneider
Date:
On Wed, Jul 24, 2024 at 12:47 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
> There's a qualitative difference between a collation update which can
> break your PKs and FKs, and a ctype update which definitely will not.

I don't think that's true. All you need is a unique index on UPPER(somecol).

I doubt it’s common to have unique on upper()

But non-unique indexes for case insensitive searches will be more common. Historically this is the most common way people did case insensitive on oracle.

Changing ctype would mean these queries can return wrong results

The impact would be similar to the critical problem TripAdvisor hit in 2014 with their read replicas, in the Postgres email thread I linked above

-Jeremy

Re: [18] Policy on IMMUTABLE functions and Unicode updates

From
Robert Haas
Date:
On Wed, Jul 24, 2024 at 3:43 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> But non-unique indexes for case insensitive searches will be more common. Historically this is the most common way
peopledid case insensitive on oracle. 
>
> Changing ctype would mean these queries can return wrong results

Yeah. I mentioned earlier that I very recently saw a customer query
with UPPER() in the join condition. If someone is doing foo JOIN bar
ON upper(foo.x) = upper(bar.x), it is not unlikely that one or both of
those expressions are indexed. Not guaranteed, of course, but very
plausible.

--
Robert Haas
EDB: http://www.enterprisedb.com