Thread: Re: Update Unicode data to Unicode 16.0.0
On 05.02.25 22:47, Jeff Davis wrote:
> (b) we should make reasonable attempts to mitigate potential problems.
>
> One idea for (b) resurfaced, which was to make a best-effort check at
> pg_upgrade time for affected indexes. The check would not be
> bulletproof, because we can't catch dependencies that are hidden inside
> SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
> most potential problems.
>
> Patch attached. A few notes:
>
> * The dependency entries don't exist because LOWER(), etc., are
> system objects (pinned); so it queries the indexprs, indpreds,
> partexprs, and conbin.
> * The query is large and perhaps too clever, but it seems to work. I
> tried to add inline comments to the SQL, and pgindent had its own ideas
> about how to format them -- suggestions welcome.
> * We haven't actually done the Unicode update yet, so it will notice
> that the PG17 and PG18 Unicode versions are the same, and return early.
> Either apply on top of the Unicode update patch, or comment out the
> early return for testing.
> * It emits a warning rather than an error, so you need to specify
> pg_upgrade with "-r" to see the output file.
> * I didn't adapt the query to run on pre-17 versions, even though it
> could find some potential problem cases (like an index on NORMALIZE()).
> I can add that if someone thinks it's worthwhile.

This all looks quite reasonable to me. The code could obviously use a
bit closer inspection, but the way you've structured it looked quite
okay to me.
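The patch's actual query is not reproduced in the thread, but a
much-simplified sketch of the approach -- scanning pg_index for
expression indexes whose definitions mention case-mapping functions --
could look like the following; the real check also covers index
predicates (indpred), partition key expressions (partexprs), and
constraint expressions (conbin), and filters by collation provider:

    -- Hypothetical, simplified version of the pg_upgrade check:
    -- find expression indexes that call LOWER(), UPPER(), or
    -- CASEFOLD() anywhere in their definition.
    SELECT n.nspname, c.relname
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE i.indexprs IS NOT NULL
      AND pg_get_expr(i.indexprs, i.indrelid)
          ~* '\m(lower|upper|casefold)\M';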
On Mon, Feb 17, 2025 at 11:46:43AM -0800, Jeff Davis wrote:
> Attached a version that rebases both patches. In my patch, I added a
> report_status().

I briefly looked at v2-0002, and the UpgradeTask usage looks correct to
me. Did you find it easy enough to use?

+	/*
+	 * The builtin provider did not exist prior to version 17. While there are
+	 * still problems that could potentially be caught from earlier versions,
+	 * such as an index on NORMALIZE(), we don't check for that here.
+	 */
+	if (GET_MAJOR_VERSION(cluster->major_version) < 1700)
+		return;

nitpick: In most cases, I think this check is done in
check_and_dump_old_cluster() before actually calling the checking
function. I don't think there's any big problem here, except you might
strand the memory allocated for the task.

+	if (!unicode_version_changed(cluster))
+	{
+		check_ok();
+		return;
+	}

Same nitpick here about stranding the task memory.

+		report_status(PG_WARNING, "warning");
+		pg_log(PG_WARNING, "Your installation contains relations that may be affected by a new version of Unicode.\n"
+			   "A list of potentially-affected relations is in the file:\n"
+			   "    %s", report.path);

This may have been discussed upthread, but is a warning enough? That
seems like something that could very easily be missed.

-- 
nathan
On Wed, 2025-03-05 at 14:33 -0600, Nathan Bossart wrote:
> +		report_status(PG_WARNING, "warning");
> +		pg_log(PG_WARNING, "Your installation contains relations that may be affected by a new version of Unicode.\n"
> +			   "A list of potentially-affected relations is in the file:\n"
> +			   "    %s", report.path);
>
> This may have been discussed upthread, but is a warning enough? That
> seems like something that could very easily be missed.

There can be false positives, because even if such an expression index
exists, it's often not an actual problem. Do we want to stop an upgrade
from happening in that case? I doubt it, but if so, we'd need some kind
of option to bypass it.

Regards,
	Jeff Davis
On Wed, Mar 05, 2025 at 03:34:06PM -0800, Jeff Davis wrote:
> On Wed, 2025-03-05 at 14:33 -0600, Nathan Bossart wrote:
>> +		report_status(PG_WARNING, "warning");
>> +		pg_log(PG_WARNING, "Your installation contains relations that may be affected by a new version of Unicode.\n"
>> +			   "A list of potentially-affected relations is in the file:\n"
>> +			   "    %s", report.path);
>>
>> This may have been discussed upthread, but is a warning enough? That
>> seems like something that could very easily be missed.
>
> There can be false positives, because even if such an expression index
> exists, it's often not an actual problem. Do we want to stop an upgrade
> from happening in that case? I doubt it, but if so, we'd need some kind
> of option to bypass it.

I see. Do we provide any suggested next steps for users to assess the
potentially-affected relations?

-- 
nathan
On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> I see. Do we provide any suggested next steps for users to assess the
> potentially-affected relations?

I don't know exactly where we should document it, but I've attached a
SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
assuming that we've updated Unicode to 16.0.0 in PG18.

The change in Unicode that I'm focusing on is the addition of U+A7DC,
which is unassigned in Unicode 15.1 and assigned in Unicode 16, and
which lowercases to U+019B. The examples assume that the user is using
unassigned code points in PG17/Unicode 15.1 and the PG_C_UTF8 collation.

The example table 'test' has a single row with U+A7DC and an index
test_idx on LOWER(t). Because the code point is unassigned in PG17,
LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'.

The first thing the user should do if they see the warnings during the
--check phase is to rerun with "--check --retain" so they can keep the
file. Maybe that's not great, and we should keep the file regardless
for these particular kinds of warnings.

Then go ahead and upgrade. After upgrade, the user should run:

    REINDEX INDEX CONCURRENTLY test_idx;

for each potentially-affected index. If they have amcheck installed,
they can do:

    select bt_index_check(index=>'test_idx', heapallindexed=>true);

first, and only REINDEX if there's an error to fix. Do the same for
other indexes listed in the warning file.

The next example is tuniq, which is similar except that the index is a
unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In
PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former
because it's already lowercase, and the latter because it's unassigned
and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'.
So, a REINDEX will encounter a uniqueness violation, and the user will
need to decide how to handle the duplicate values.

The next example is tcheck, which does CHECK(LOWER(t) = t), which is
the most plausible check constraint I can think of that would be
affected. In theory, the user might have to reason through what should
be done about a tuple that violates it, but in practice it might be as
simple as:

    UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t);

The last example is partitioning, where the table is range partitioned
on LOWER(t), and the tuple ends up in the wrong partition on PG18.
After fixing the indexes with:

    -- tpart0_lower_idx is fine, no REINDEX needed
    select bt_index_check(index=>'tpart0_lower_idx', heapallindexed=>true);
    select bt_index_check(index=>'tpart1_lower_idx', heapallindexed=>true);
    REINDEX INDEX CONCURRENTLY tpart1_lower_idx;

See the problem with:

    SET enable_partition_pruning=off;
    -- correctly finds one tuple
    SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

    SET enable_partition_pruning TO DEFAULT;
    -- incorrectly finds zero tuples
    SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

Now move the tuples to the right partitions:

    BEGIN;
    WITH d AS (
      DELETE FROM tpart0
        WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000')
        RETURNING *
    ) INSERT INTO tpart SELECT * FROM d;
    WITH d AS (
      DELETE FROM tpart1
        WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF')
        RETURNING *
    ) INSERT INTO tpart SELECT * FROM d;
    COMMIT;

The order of operations should be to fix indexes, unique constraints,
and check constraints first; and then to fix partitioned tables. That
way the above partitioned table queries get correct results using the
indexes.
Granted, this is not a great user experience, and we could do better in
the future (e.g. add some new utility commands). But this is an
improvement in the sense that it's a small number of affected objects,
unlike a collation version change, which affects anything related to
text.

Regards,
	Jeff Davis
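For concreteness, here is a minimal sketch of the kind of setup
described in the walkthrough above, using its table and index names
(the attached SQL file is the authoritative version; this assumes a
UTF-8 database on PG17, and the unique index name tuniq_idx is
assumed):

    -- 'test': one row with the code point that Unicode 16 newly assigns
    CREATE TABLE test (t text COLLATE pg_c_utf8);
    INSERT INTO test VALUES (U&'\A7DC');   -- unassigned in Unicode 15.1
    CREATE INDEX test_idx ON test (LOWER(t));

    -- 'tuniq': the unique-index variant
    CREATE TABLE tuniq (t text COLLATE pg_c_utf8);
    INSERT INTO tuniq VALUES (U&'\019B'), (U&'\A7DC');
    CREATE UNIQUE INDEX tuniq_idx ON tuniq (LOWER(t));
    -- On PG17, LOWER() maps these to two distinct values; on PG18
    -- (Unicode 16), both map to U&'\019B', so REINDEX will fail with
    -- a uniqueness violation.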
On 17.02.25 20:46, Jeff Davis wrote:
> Note that the Unicode update has a few test diffs for NORMALIZE(),
> please check to see if the tests themselves need an update.

I think these test result changes are incorrect. AFAICT, nothing has
changed in the upstream data that would explain such a change.

I didn't get such test differences in my original patch. Did you rebase
the patch or regenerate it? Maybe something went wrong there.
On Fri, 07 Mar 2025 13:11:18 -0800
Jeff Davis <pgsql@j-davis.com> wrote:

> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see. Do we provide any suggested next steps for users to assess
> > the potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> The change in Unicode that I'm focusing on is the addition of U+A7DC,
> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
> lowercases to U+019B. The examples assume that the user is using
> unassigned code points in PG17/Unicode 15.1 and the PG_C_UTF8
> collation.

It seems the consensus is to update Unicode in core... FWIW, I'm still
in favor of leaving it alone, because ICU is there for when I need
up-to-date Unicode versions.

From my perspective, the whole point of the builtin collation was to
have one option that avoids these problems that come with updating both
ICU and glibc.

So I guess the main point of the builtin provider is just that it's
faster than ICU?

-Jeremy
On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
> It seems the consensus is to update Unicode in core... FWIW, I'm
> still in favor of leaving it alone, because ICU is there for when I
> need up-to-date Unicode versions.

Me too.

> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating
> both ICU and glibc.

+1

Yours,
Laurenz Albe
On 3/15/25 03:26, Laurenz Albe wrote:
> On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
>> It seems the consensus is to update Unicode in core... FWIW, I'm
>> still in favor of leaving it alone, because ICU is there for when I
>> need up-to-date Unicode versions.
>
> Me too.

+1

>> From my perspective, the whole point of the builtin collation was to
>> have one option that avoids these problems that come with updating
>> both ICU and glibc.
>
> +1

+1

In the long term I think we should figure out how to support newer
versions of Unicode for the builtin provider, but in my mind that might
involve the necessity of supporting multiple versions of Unicode, such
that the choice remains to stay on the older one.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Jeremy Schneider <schneider@ardentperf.com> writes:
> On Fri, 07 Mar 2025 13:11:18 -0800
> Jeff Davis <pgsql@j-davis.com> wrote:
>> The change in Unicode that I'm focusing on is the addition of U+A7DC,
>> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
>> lowercases to U+019B. The examples assume that the user is using
>> unassigned code points in PG17/Unicode 15.1 and the PG_C_UTF8
>> collation.

> It seems the consensus is to update Unicode in core... FWIW, I'm still
> in favor of leaving it alone, because ICU is there for when I need
> up-to-date Unicode versions.

> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating
> both ICU and glibc.

I don't really buy this argument. If we sit on Unicode 15 until that
becomes untenable, which it will, then people will still be faced with
a behavioral change whenever we bow to reality and invent a
"builtin-2.0" or whatever collation. Moreover, by then they might well
have instances of the newly-assigned code points in their database,
making the changeover real and perhaps painful for them. On the other
hand, if we keep up with the Joneses by updating the Unicode data, we
can hopefully put those behavioral changes into effect *before* they'd
affect any real data.

So it seems to me that freezing our Unicode data is avoiding
hypothetical pain now at the price of certain pain later. I compare
this to our routine timezone data updates, which certainly have not
been without occasional pain ... but does anyone seriously want to
argue that we should still be running tzdata from 20 years back? Or
even 5 years back?

In fact, on the analogy of timezones, I think we should not only adopt
newly-published Unicode versions pretty quickly but push them into
released branches as well. Otherwise the benefit of staying ahead of
real use of the new code points isn't there for end users.

			regards, tom lane
On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
> On the other hand, if we keep up with the Joneses by updating the
> Unicode data, we can hopefully put those behavioral changes into
> effect *before* they'd affect any real data.

That's a good point.

Regards,
	Jeff Davis
On Fri, 2025-03-14 at 23:54 -0700, Jeremy Schneider wrote:
> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating
> both ICU and glibc.
>
> So I guess the main point of the builtin provider is just that it's
> faster than ICU?

It doesn't break primary keys. Also, it's stable within a major
version, we can document and test its behavior, it solves 99% of the
upgrade problem, and what problems remain are much more manageable.

And yes, collation is way, way faster than ICU.

Regards,
	Jeff Davis
> On Mar 15, 2025, at 10:22 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
>> On the other hand, if we keep up with the Joneses by updating the
>> Unicode data, we can hopefully put those behavioral changes into
>> effect *before* they'd affect any real data.
>
> That's a good point.

Jeff - thanks for the reminder that this is just about character
semantics and not ordering. Obviously the C collation by definition
(code point ordering) doesn't change sort order... two weeks ago I was
working on updating the torture test GitHub page with glibc collation
changes up through Ubuntu 24.10, so my mind was definitely over there.
No detected changes in en-US, so that's great news. 🙂

Is the simple answer that functions & clauses related to both time
zones and character semantics should just all be considered STABLE
instead of IMMUTABLE?

I think if that were the case then changes across a minor version would
simply be allowed by definition, right? No need for warnings.

This would impact the ability to create case-insensitive indexes.

-Jeremy

Sent from my TI-83
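For reference, the impact on index creation would be mechanical: index
expressions must use IMMUTABLE functions. A minimal sketch, using a
hypothetical STABLE wrapper named stable_lower:

    -- A STABLE wrapper around lower() cannot be used in an index:
    CREATE FUNCTION stable_lower(text) RETURNS text
        LANGUAGE sql STABLE
        RETURN lower($1);

    CREATE TABLE t (s text);
    CREATE INDEX ON t (stable_lower(s));
    -- ERROR:  functions in index expression must be marked IMMUTABLE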
On Sat, 8 Mar 2025 at 02:41, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see. Do we provide any suggested next steps for users to assess
> > the potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> [...]
>
> Granted, this is not a great user experience, and we could do better
> in the future (e.g. add some new utility commands). But this is an
> improvement in the sense that it's a small number of affected
> objects, unlike a collation version change, which affects anything
> related to text.

We currently have two Commitfest entries for the same thread at [1]
and [2]. Are both still necessary, or can we consolidate tracking into
a single entry?

[1] - https://commitfest.postgresql.org/patch/5472/
[2] - https://commitfest.postgresql.org/patch/5563/

Regards,
Vignesh
On Sat, 2025-03-15 at 18:23 -0700, Jeremy Schneider wrote:
> Is the simple answer that functions & clauses related to both time
> zones and character semantics should just all be considered STABLE
> instead of IMMUTABLE?

That was discussed a few times, but:

(a) That doesn't exactly solve the problem, because people still need
indexes on LOWER() or CASEFOLD(); and

(b) If we change IMMUTABLE to mean "returns the same results on every
platform for all time", that would be too strict for many purposes,
like the planner doing constant folding.

I have been thinking about ways we can express the right dependencies,
and I may be making some proposals along those lines.

Regards,
	Jeff Davis
On Sun, 2025-03-16 at 19:10 +0530, vignesh C wrote:
> We currently have two Commitfest entries for the same thread at [1]
> and [2]. Are both still necessary, or can we consolidate tracking
> into a single entry?

I'm fine with removing my CF entry, but unfortunately there's no
"withdrawn -- duplicate" status, so it might send the wrong message.

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> That was discussed a few times, but:
> (a) That doesn't exactly solve the problem, because people still need
> indexes on LOWER() or CASEFOLD(); and
> (b) If we change IMMUTABLE to mean "returns the same results on every
> platform for all time", that would be too strict for many purposes,
> like the planner doing constant folding.

Yeah. Not only would the set of functions meeting such a standard be
vanishingly small, but so would the set of use-cases. What we need is
some sort of understanding that "this is okay to use in indexes", "this
is okay to constant-fold when planning", etc. Maybe it's less about "is
it okay to just assume this" and more about "can we devise a method for
figuring out when we have to reindex, replan, etc". We've got bits of
that in our collation versioning infrastructure, but that doesn't cover
every source of infrequently-mutating behavior.

> I have been thinking about ways we can express the right dependencies,
> and I may be making some proposals along those lines.

I await a proposal with interest.

			regards, tom lane
On Thu, 2025-03-13 at 14:49 +0100, Peter Eisentraut wrote:
> I think these test result changes are incorrect. AFAICT, nothing has
> changed in the upstream data that would explain such a change.
>
> I didn't get such test differences in my original patch. Did you
> rebase the patch or regenerate it? Maybe something went wrong there.

Right. The previous version was improperly rebased. I went back to your
original version and rebased it over master (including the latest
optimizations for case mapping), and then rebased my changes on top of
that. Attached.

One annoyance with the recent case mapping optimizations is that the
Unicode update diff is larger, similar to the diff for
unicode_norm_hashfunc.h. Previously, it was binary search, so the only
differences were logical differences.

Regards,
	Jeff Davis
On Sat, 2025-03-15 at 12:15 -0400, Tom Lane wrote:
> In fact, on the analogy of timezones, I think we should not only
> adopt newly-published Unicode versions pretty quickly but push them
> into released branches as well.

That approach suggests that we consider something like my previous
STRICT_UNICODE proposal[1]. If Postgres updates Unicode quickly enough,
there's not much reason that users would need to use unassigned code
points, so it would be practical to just reject them (as an option).
That would dramatically reduce the practical problems people would
encounter when we do update Unicode.

Note that assigned code points can still change behavior in later
versions, but not in ways that would typically cause a problem for
things like indexes. For instance, U+0363 changed from non-Alphabetic
to Alphabetic in Unicode 16, which changes the result of the
expression:

    U&'\0363' ~ '[[:alpha:]]' COLLATE PG_C_UTF8

from false to true, even though U+0363 is assigned in both Unicode
15.1.0 and 16.0.0. That might plausibly matter, but such cases would be
more obscure than case folding.

Regards,
	Jeff Davis

[1] https://commitfest.postgresql.org/patch/4876/
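As an aside, something close to this is already possible at the user
level: PG17 added the unicode_assigned() function, which a CHECK
constraint can use to reject unassigned code points. A sketch (table
and constraint names here are just examples):

    -- Reject strings containing code points that are unassigned in
    -- the Unicode version this server was built with.
    CREATE TABLE msgs (
        body text,
        CONSTRAINT body_assigned CHECK (unicode_assigned(body))
    );

    INSERT INTO msgs VALUES (U&'\A7DC');
    -- fails on PG17 (Unicode 15.1), where U+A7DC is unassigned;
    -- would succeed on a server built with Unicode 16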
On Mon, 17 Mar 2025 at 23:03, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2025-03-16 at 19:10 +0530, vignesh C wrote:
> > We currently have two Commitfest entries for the same thread at [1]
> > and [2]. Are both still necessary, or can we consolidate tracking
> > into a single entry?
>
> I'm fine with removing my CF entry, but unfortunately there's no
> "withdrawn -- duplicate" status, so it might send the wrong message.

Yes, we don't have a "withdrawn duplicate" status. I have closed it as
withdrawn; anyway, we can get it reviewed and committed using the other
entry.

Regards,
Vignesh
On Sat, 2025-03-15 at 10:14 -0400, Joe Conway wrote:
> In the long term I think we should figure out how to support newer
> versions of Unicode for the builtin provider, but in my mind that
> might involve the necessity of supporting multiple versions of
> Unicode, such that the choice remains to stay on the older one.

What do you think of Tom's argument that waiting to update Unicode is
what creates the problem in the first place?

"by then they might well have instances of the newly-assigned code
points in their database"[1]

Regards,
	Jeff Davis

[1] https://www.postgresql.org/message-id/3481161.1742055336@sss.pgh.pa.us
On Tue, 18 Mar 2025 08:53:56 -0700
Jeff Davis <pgsql@j-davis.com> wrote:

> What do you think of Tom's argument that waiting to update Unicode is
> what creates the problem in the first place?
>
> "by then they might well have instances of the newly-assigned code
> points in their database"[1]
>
> [1]
> https://www.postgresql.org/message-id/3481161.1742055336@sss.pgh.pa.us

Waiting to update Unicode is not what creates the problem, as long as
we support in-place major upgrades without rebuilding indexes.

It seems that we think it's really uncommon to make indexes on timezone
data. We think that breaking primary keys must definitely be avoided.
We think case-insensitive indexes are probably uncommon, so as long as
it's "rare" we can let them break.

I'm not asking for an extreme definition of "IMMUTABLE", but I'd be
very happy with a GUC "data_safety=radical_like_jeremy" where Postgres
simply won't start if the control file says it was from a different
operating system or architecture or ICU/glibc collation version. I can
disable the GUC (like a maintenance mode) to rebuild my indexes and
update my collation versions, and ideally this GUC would also mean that
indexes simply aren't allowed to be created on functions that might
change within the guarantees that are made. (And range-based partitions
can't use them, and FDWs can't rely on them for query planning, etc.)

I get that we don't want to break stuff people are currently doing -
but I think there are a bunch of people who would be happy with an
option to reduce functionality going forward in exchange for clearly
defined behavior. I just don't want to have to think about this on
behalf of all my Postgres users.

-Jeremy
On Tue, Mar 18, 2025 at 11:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you think of Tom's argument that waiting to update Unicode is
> what creates the problem in the first place?
>
> "by then they might well have instances of the newly-assigned code
> points in their database"[1]

I know you weren't asking me, but FWIW, I think Tom has a point, but I
also don't completely buy it. There's no technological safeguard
against people using Unicode code points before they're assigned, so we
can't assume that no one ever does. And they probably do.

I think Joe has the right idea. The way to actually provide the
stability that people want here is to continue supporting old versions
while adding support for new versions. Anything else we do works
subject to assumptions: you can either assume that people don't use
code points before they're assigned (as Tom proposes) or you can assume
that not updating to newer Unicode specs will not cause any problems
for anyone. Joe's proposal is unique in requiring no assumptions about
what users will do or what they care about. It just works, period. The
only disadvantage is that it's more work for us, but this problem has
caused enough suffering over the years that I bet we can find hackers
to implement it and maintain it on an ongoing basis without great
difficulty.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 2025-03-18 at 09:28 -0700, Jeremy Schneider wrote:
> We think case-insensitive indexes are probably uncommon, so as long
> as it's "rare" we can let them break.

Let's define "break" in this context to mean that the constraints are
not enforced, or that the query doesn't return the results that the
user is expecting.

Let's say a user has an index on LOWER(t) in PG17 (Unicode 15.1). Then
Unicode 16.0 comes out, introducing the newly-assigned U+A7DC, which
lowercases to U+019B. The rest of the world moves on and starts using
U+A7DC. There are only two ways that Postgres can prevent breakage:

1. Update the database to Unicode 16.0 before U+A7DC is encountered, so
that it's properly lowercased to U+019B, and a query on
LOWER(t) = U&'\019B' will correctly return the record containing it.

2. Prevent U+A7DC from going into the database at all.

Continuing on with Unicode 15.1 and accepting the unassigned code point
*cannot* prevent breakage. A truly paranoid user would want a
combination of both solutions: regular Unicode updates; and something
like STRICT_UNICODE ( https://commitfest.postgresql.org/patch/4876/ )
to protect the user between the time Unicode assigns the code point and
the time they can deploy a version of Postgres that understands it.

You are rightfully concerned that updating Unicode can create its own
inconsistencies, and if nothing is done that can lead to breakage as
well. The upgrade-time check in this thread is one solution to that
problem, but we could do a lot more.

You are also right that we should be more skeptical of an internal
inconsistency (e.g. different results for seqscan vs indexscan) than a
wider definition of inconsistency. But the user created a Unicode-based
case-folded index there for a reason, and we shouldn't lose sight of
that.

> I'm not asking for an extreme definition of "IMMUTABLE", but I'd be
> very happy with a GUC "data_safety=radical_like_jeremy" where
> Postgres simply won't start if the control file says it was from a
> different operating system or architecture or ICU/glibc collation
> version. I can disable the GUC (like a maintenance mode) to rebuild
> my indexes and update my collation versions, and ideally this GUC
> would also mean that indexes simply aren't allowed to be created on
> functions that might change within the guarantees that are made. (And
> range-based partitions can't use them, and FDWs can't rely on them
> for query planning, etc.)

Does the upgrade check patch in this thread accomplish that for you? If
not, what else does it need? It's an upgrade-time check rather than a
GUC, but it basically seems to match what you want. See:

https://www.postgresql.org/message-id/16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com

Regards,
	Jeff Davis
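To make the internal-inconsistency case concrete, here is a sketch of
how it manifests on the 'test' table from upthread after a PG17->PG18
upgrade without a REINDEX:

    -- test_idx still contains the PG17 value of LOWER(U&'\A7DC')
    -- (the code point itself), but PG18 computes U&'\019B', so the
    -- two access paths disagree:
    SET enable_seqscan = off;
    SELECT * FROM test WHERE LOWER(t) = U&'\019B';  -- index scan: 0 rows

    SET enable_seqscan = on;
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;
    SELECT * FROM test WHERE LOWER(t) = U&'\019B';  -- seq scan: 1 row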
On Tue, Mar 18, 2025 at 2:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Continuing on with Unicode 15.1 and accepting the unassigned code
> point *cannot* prevent breakage.

Under your definition, this is true, but I think Jeremy would define
breakage differently. His primary concern, I expect, is *stability*.
Breakage means that the same supposedly-stable operations return
different answers on the same data. Under that definition, continuing
under Unicode 15.1 does prevent breakage.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 18, 2025 at 2:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Continuing on with Unicode 15.1 and accepting the unassigned code
>> point *cannot* prevent breakage.

> Under your definition, this is true, but I think Jeremy would define
> breakage differently. His primary concern, I expect, is *stability*.
> Breakage means that the same supposedly-stable operations return
> different answers on the same data. Under that definition, continuing
> under Unicode 15.1 does prevent breakage.

That approach works only if you sit on Unicode 15.1 *forever*. The
impracticality of that seems obvious to me. Sooner or later you will
need to update, and then you are going to suffer pain. (In the running
example of this thread, a unique index on LOWER(t) might not only be
corrupt, but might fail reindex due to the constraint being violated
under the newer rules.) The longer you wait, the more probable it is
that you are going to have problems, and the more painful it'll be to
clean things up.

Now, if you both sit on Unicode 15.1 forever and disallow the
introduction of unassigned-per-15.1 code points, you can escape that
fate, but that approach brings its own kind of pain.

The short answer is that "immutable" = "doesn't change till the heat
death of the universe" is a definition that is not useful when dealing
with this type of data. Other people determine the reality that you
have to deal with.

			regards, tom lane
On Tue, Mar 18, 2025 at 3:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That approach works only if you sit on Unicode 15.1 *forever*.
> The impracticality of that seems obvious to me. Sooner or later
> you will need to update, and then you are going to suffer pain.

I completely agree.

> The short answer is that "immutable" = "doesn't change till the heat
> death of the universe" is a definition that is not useful when
> dealing with this type of data. Other people determine the reality
> that you have to deal with.

I think that's mostly true because of lack of versioning capabilities,
or crappy versioning practices. glibc, AIUI, just disclaims collation
stability: if you're fool enough to sort anything with one of their
collations, that's on you. To me, that seems like an obviously
user-hostile position, as if it were reasonable to suppose that an
algorithm whose whole purpose is to implement a sort order would not be
used for, uh, sorting. Or at least not any sort of sorting where you
don't immediately throw away the results (and then why did you
bother?). ICU doesn't seem to be entirely stable, either.

But none of that means stability isn't a valuable property. It just
means people have done a bad job implementing it. If we give people the
ability to execute operation X using Unicode 15.1 or Unicode 16.0,
they're still *eventually* going to have to migrate forward to Unicode
16.0 or some later version, because we're probably not going to keep
Unicode 15.1 until the heat death of the universe. But we allow people
to not have that update forced upon them at the same time they're
trying to change other things, and that's pretty darn useful. That's
why extensions have separate versioning from the server, for instance.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 3/18/25 16:30, Robert Haas wrote:
> On Tue, Mar 18, 2025 at 3:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That approach works only if you sit on Unicode 15.1 *forever*.
>> The impracticality of that seems obvious to me. Sooner or later
>> you will need to update, and then you are going to suffer pain.
>
> I completely agree.
>
>> The short answer is that "immutable" = "doesn't change till the heat
>> death of the universe" is a definition that is not useful when
>> dealing with this type of data. Other people determine the reality
>> that you have to deal with.
>
> I think that's mostly true because of lack of versioning capabilities,
> or crappy versioning practices. glibc, AIUI, just disclaims collation
> stability: if you're fool enough to sort anything with one of their
> collations, that's on you. To me, that seems like an obviously
> user-hostile position, as if it were reasonable to suppose that an
> algorithm whose whole purpose is to implement a sort order would not
> be used for, uh, sorting. Or at least not any sort of sorting where
> you don't immediately throw away the results (and then why did you
> bother?). ICU doesn't seem to be entirely stable, either.

Yep

> But none of that means stability isn't a valuable property. It just
> means people have done a bad job implementing it. If we give people
> the ability to execute operation X using Unicode 15.1 or Unicode 16.0,
> they're still *eventually* going to have to migrate forward to
> Unicode 16.0 or some later version, because we're probably not going
> to keep Unicode 15.1 until the heat death of the universe. But we
> allow people to not have that update forced upon them at the same
> time they're trying to change other things, and that's pretty darn
> useful. That's why extensions have separate versioning from the
> server, for instance.

+1

Robert articulates my thinking exactly, and much better than I did :-)

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, 2025-03-18 at 14:45 -0400, Robert Haas wrote:
> I think Joe has the right idea. The way to actually provide the
> stability that people want here is to continue supporting old
> versions while adding support for new versions. Anything else we do
> works subject to assumptions: you can either assume that people don't
> use code points before they're assigned (as Tom proposes) or you can
> assume that not updating to newer Unicode specs will not cause any
> problems for anyone. Joe's proposal is unique in requiring no
> assumptions about what users will do or what they care about. It just
> works, period. The only disadvantage is that it's more work for us,
> but this problem has caused enough suffering over the years that I
> bet we can find hackers to implement it and maintain it on an ongoing
> basis without great difficulty.

I've already put a fair amount of work into this approach, but it is a
lot of work, and I could use some help. Here's a quick dump of some of
my notes on what we can do going forward:

* builtin provider with stable primary keys: done

* collation behavior as method tables: done

* have support for STRICT_UNICODE, or something like it, to allow users
to mitigate their upgrade risks by rejecting unassigned code points: I
submitted a proposal for a database-level option, which got no
discussion

* ctype behavior as method tables: patch written, discussion trailed
off. There was a really minor performance regression, so I held off
committing it, but I don't think it's an actual problem, so if people
are in general agreement that we want it then I have no problem
committing it.

* separate "collation provider" from "ctype provider" internally. Have
pg_open_collation() and pg_open_ctype(), and deprecate
pg_newlocale_from_collation(). This is a fair amount of work, but it's
important for dependency and version tracking, as well as an
organizational improvement.

* turn providers into method tables: not too hard. We'd still need to
have the notion of a "provider kind" (builtin, ICU, libc) so that we
know how to interpret the syntax and store things in the other catalogs
(for instance, only ICU accepts ICU_RULES, only libc allows LC_COLLATE
and LC_CTYPE to be different, etc.).

* put providers into new shared catalogs pg_collation_provider and
pg_ctype_provider, which would each have handlers that know how to
instantiate a specific collation or ctype

* add new function markers COLLATE and CTYPE (or some other names),
meaning that the function is sensitive to the collation or ctype of its
arguments.
  - for example: LOWER() would be marked CTYPE, ">" would be marked
    COLLATE, and "||" wouldn't need any mark.
  - When creating some object that has an expression in it, let's say
    an index, we already walk the expression and add dependencies on
    the functions in the expression. If one of those functions has
    such a marker, we would look at the inferred collation of the
    function, find its provider, and add a dependency on the
    provider's shared catalog entry.
  - must work even on "pinned" functions
  - queries with ORDER BY, say as part of an MV definition, would be
    implicitly treated like functions marked with COLLATE

* (optional) have some kind of runtime check so that UDFs that are
missing the appropriate COLLATE or CTYPE markers figure out that a
collation or ctype is being opened, and throw a WARNING or ERROR

* throw away the idea of collation-specific versions, or make it more
of an additional check. Versions would be attached to the provider
entries in the shared catalogs. The only provider that differentiates
collation versions by locale is ICU, and people were highly skeptical
of that before we found bugs in it, and more skeptical afterward. They
will just be a source of confusion in the long term.

* Have some new functions and DDL commands that can find and fix
objects by following the dependency links.

* Allow extensions to be loaded at initdb time, and initialize their
own providers and their own lists of collations.

* Provide a contrib that implements the builtin provider with Unicode
15.1.

* If we want multiple versions of a provider in the same running
server, that would take more work. I have my doubts about how many
people would really, actually use that, but it's possible.

I plan to submit some proposals in a few weeks as this CF settles down,
and then have an unconference session on this topic at pgconf.dev. If
anyone is motivated for these problems to be fixed, please jump into
those discussions on list or at the conference, and take on a task or
two.

I am not trying to be dismissive of the concerns raised in this thread,
but I'd like others to understand that what they are asking for is a
lot of work, and that the builtin collation provider solves 99% of it
already. All this effort is to solve that last 1%.

Regards,
	Jeff Davis
On Tue, Mar 18, 2025 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I am not trying to be dismissive of the concerns raised in this
> thread, but I'd like others to understand that what they are asking
> for is a lot of work, and that the builtin collation provider solves
> 99% of it already. All this effort is to solve that last 1%.

I'm not asking you to do the work. It would be great if you did, of
course, but you don't work for me. But I could not disagree more
strongly with the idea that this problem is 99% solved. That doesn't
seem remotely true to me. I'm not sure the problem is 1% solved.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 2025-03-18 at 21:34 -0400, Robert Haas wrote:
> But I could not disagree more strongly with the idea that this
> problem is 99% solved. That doesn't seem remotely true to me. I'm
> not sure the problem is 1% solved.

If we compare the following two problems:

A. With glibc or ICU, every text index, including primary keys, is
highly vulnerable to inconsistencies after an OS upgrade, even if
there's no Postgres upgrade; vs.

B. With the builtin provider, only expression indexes and a few other
things are vulnerable, only during a major version upgrade, and mostly
(but not entirely) when using recently-assigned Cased letters.

To me, problem A seems about 100 times worse than B almost any way I
can imagine measuring it: number of objects vulnerable, severity of the
problem when it does happen, likelihood of a vulnerable object having
an actual problem, etc. If you disagree, I'd like to hear more.

Regards,
	Jeff Davis
On Tue, 18 Mar 2025 19:33:00 -0700
Jeff Davis <pgsql@j-davis.com> wrote:

> If we compare the following two problems:
>
> A. With glibc or ICU, every text index, including primary keys, is
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
>
> B. With the builtin provider, only expression indexes and a few
> other things are vulnerable, only during a major version upgrade, and
> mostly (but not entirely) when using recently-assigned Cased letters.
>
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of
> the problem when it does happen, likelihood of a vulnerable object
> having an actual problem, etc. If you disagree, I'd like to hear more.

Jeff - you and several others have literally put years into making this
better, and it's deeply appreciated. I agree that with the builtin
provider we're in a much better place.

I don't quite understand Tom's argument about why Unicode 15 must
eventually become untenable. Why are we assuming it will? In Oracle's
entire history, I think they have only ever supported four versions of
Unicode. [1] MySQL seems to have added their second only recently. [2]
And again - we have ICU if I need the latest emoji characters.

Frankly, Unicode 15 is pretty good. Most updates to Unicode these days
are fairly minor. Maybe Postgres can be the first database to always
ship support for the latest Unicode with each major version - but I
think we should design that right if we're going to do it. If we just
stay on Unicode 15 for now, then there are no problems with
case-insensitive indexes or range-partitioned tables returning wrong
query results after a major version upgrade.

There's been a lot of discussion about indexes, but this SQL also seems
to work:

    postgres=# create table test_events(customer_name text,
               ts timestamp, message text)
               partition by range((lower(customer_name)));

I'm sure that people shouldn't do this ... but if anyone /did/ then it
wouldn't be as simple as an index rebuild after their major version
upgrade.

I had never really considered it before, but this SQL also seems to
work:

    postgres=# create table test_events(id uuid, ts timestamp,
               message text)
               partition by range((ts at time zone 'America/Sao_Paulo'));

I'm sure that people shouldn't do that either ... but if anyone did,
then would their rows be in the wrong partition after they upgraded
from 11.4 to 11.5?

The difficulty here is that I work at a company with thousands of
developers and lots of Postgres, and I see people do things all the
time that we might think they "shouldn't" do. Before we bump the
Unicode version, personally I'd just like to have some tools to make it
so people actually can't do the things they shouldn't do.

-Jeremy

1: https://docs.oracle.com/en/database/oracle/oracle-database/23/nlspg/appendix-A-locale-data.html#GUID-CC85A33C-81FC-4E93-BAAB-1B3DB9036060__CIABEDHB
2: https://dev.mysql.com/blog-archive/mysql-character-sets-unicode-and-uca-compliant-collations/
On Tue, Mar 18, 2025 at 10:33 PM Jeff Davis <pgsql@j-davis.com> wrote:
> If we compare the following two problems:
>
> A. With glibc or ICU, every text index, including primary keys, is
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
>
> B. With the builtin provider, only expression indexes and a few other
> things are vulnerable, only during a major version upgrade, and
> mostly (but not entirely) when using recently-assigned Cased letters.
>
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of
> the problem when it does happen, likelihood of a vulnerable object
> having an actual problem, etc. If you disagree, I'd like to hear more.

I see your point, but most people don't use the builtin collation
provider. Granted, we could change the default and then more people
would use it, but I'm not sure people would be happy with the resulting
behavior: a lot of people probably want "a" to sort near "á" even if
they don't have strong preferences about the exact details in every
corner case.

Also, and I think rather importantly, many people are less sensitive to
whether anything is actually broken than to whether anything
hypothetically could be broken. When an EDB customer asks "if I do X,
will anything break," it's often the case that answering "maybe" is the
same as answering "yes". The DBA doesn't necessarily know or care what
the application does or know or care what data is in the database. They
want a hard guarantee that the behavior will not change. From that
point of view, your statement that nothing will change in minor
releases when the builtin provider is used is quite powerful (and a
good argument against back-patching Unicode updates as Tom proposes).
But people will still need to use other collation providers and they
will still need to do major release upgrades, and they also want those
things to be guaranteed not to break.

Again, I'm not trying to oblige you to deliver that behavior and I
confess to ignorance on how we could realistically get there. But I do
think it's what people want: to be forced to endure collation updates
infrequently, and to be able to choose the timing of the update when
they absolutely must happen, and to be able to easily know exactly what
they need to reindex.

And from that point of view -- and again, I'm not volunteering to
implement it and I'm not telling you to do it either -- Joe's proposal
of supporting multiple versions sounds fantastic. Because then, I can
do a major version upgrade using pg_upgrade and keep everything pinned
to the old Unicode version or, perhaps, even the old ICU version if we
had multi-version libicu support. I may be able to go through several
major version upgrades without ever needing to survive a collation
change. Eventually my hand will be forced, because PostgreSQL will
remove support for the Unicode version I care about, or that old
version of libicu won't compile any more or will have security
vulnerabilities or something, but I will have the option to deal with
that collation change before or after any PostgreSQL version changes
that I'm doing. I'll be able to change the collation version at a time
when I'm not changing anything else and deal with JUST that fallout on
its own.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Mar 19, 2025 at 1:25 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Maybe Postgres can be the first database to always ship support for
> the latest Unicode with each major version

Shipping the latest Unicode with our latest major version is existing
policy, as I understand it. I don't think we're breaking any new ground
here.

> Before we bump the Unicode version, personally I'd just like to have
> some tools to make it so people actually can't do the things they
> shouldn't do.

I would, too, but I think that's a hole with no bottom. We're not
realistically going to sort this problem out in the next twelve months,
or even in the next five years. For that reason, I don't understand how
you can propose with a straight face that we pause the Unicode updates
that have been routine up until now until we have the problem sorted
out. That amounts to deciding that we're never going to add support for
any new Unicode versions. And I don't see that as realistic. We have
those Unicode tables because their contents matter. If we don't update
them then, for example, new cased letters won't case-fold properly.
Yeah, maybe if we didn't update for one year, nobody would care. But
five or ten or twenty years from now, somebody will care, and I don't
think we'll have all the stuff you want to prevent people from hosing
themselves even then.

Not only that, but freezing the Unicode tables doesn't even really help
that much, because the much bigger problem comes from the constant
whacking around of the glibc and, to a lesser extent, ICU behavior.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2025-03-19 at 08:46 -0400, Robert Haas wrote:
> I see your point, but most people don't use the builtin collation
> provider.

The other providers aren't affected by us updating Unicode, so I think
we got off track somehow. I suppose what I meant was:

"If you are concerned about inconsistencies, and you move to the
builtin provider, then 99% of the inconsistency problem is gone. We can
remove the last 1% of the problem if we do all the work listed above."

> When an EDB customer asks "if I do X, will anything break," it's
> often the case that answering "maybe" is the same as answering "yes".

That's a good point. However, note that "doesn't break primary keys" is
a nice guarantee, even if there are still some remaining doubts about
expression indexes, etc.

> They want a hard guarantee that the behavior will not change.

My understanding of this thread so far was that we were mostly
concerned about internal inconsistencies of stored structures; e.g.
indexes that could return different results than a seqscan.

Not changing query results at all between major versions is a valid
concern, but a fairly strict one that doesn't seem limited to immutable
functions or collation issues. Surely, at least the results of "SELECT
version()" should change from release to release ;-)

> Again, I'm not trying to oblige you to deliver that behavior and I
> confess to ignorance on how we could realistically get there.

FWIW I'm not complaining about doing the work. But I think the results
will be better if we can get a few people aligned on a general plan and
collaborating. I will try to kick that off.

> and to be able to easily know exactly what they need to reindex.

That's the main one, I think. The upgrade check offers that for the
builtin provider, though admittedly it's not a very user-friendly
solution, and we can do better.

> And from that point of view -- and again, I'm not volunteering to
> implement it and I'm not telling you to do it either -- Joe's
> proposal of supporting multiple versions sounds fantastic.

I certainly don't oppose giving users that choice. But I view it as a
burden we are placing on the users -- better than breakage, but not
really great, either. So if we do put in a ton of work, I'd like it if
we could arrive at a better destination.

If we actually want the BEST user experience possible, they'd not even
really know that their index was ever inconsistent. Autovacuum would
come along and just find the few entries in the index that need fixing,
and reindex just those few tuples. In theory, it should be possible:
there are a finite number of code points that change each Unicode
version, and we can just search for them in the data and fix up derived
structures.

Regards,
	Jeff Davis
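As a rough sketch of that search: U+A7DC is the code point discussed
upthread, and a real tool would use the full list of code points whose
properties changed between the two Unicode versions; the table name is
just the example from earlier in the thread:

    -- Find rows that contain a code point affected by the update;
    -- these are the rows whose LOWER(t) index entries (and other
    -- derived structures) may need fixing.
    SELECT ctid, t
    FROM test
    WHERE position(U&'\A7DC' IN t) > 0;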
On Wed, Mar 19, 2025 at 1:39 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2025-03-19 at 08:46 -0400, Robert Haas wrote: > > I see your point, but most people don't use the builtin collation > > provider. > > The other providers aren't affected by us updating Unicode, so I think > we got off track somehow. I suppose what I meant was: > > "If you are concerned about inconsistencies, and you move to the > builtin provider, then 99% of the inconsistency problem is gone. We can > remove the last 1% of the problem if we do all the work listed above." All right. I'm not sure I totally buy the 99% number, but I take your point. > > When an EDB customer asks "if I do X, > > will anything break," it's often the case that answering "maybe" is > > the same as answering "yes". > > That's a good point. However, note that "doesn't break primary keys" is > a nice guarantee, even if there's still some remaining doubts about > expression indexes, etc. No argument. > > They want a hard guarantee that the behavior will not > > change. > > My understanding of this thread so far was that we were mostly > concerned about internal inconsistencies of stored structures; e.g. > indexes that could return different results than a seqscan. I think that is true, but inconsistent indexes can be the worst problem without being the only one. > Not changing query results at all between major versions is a valid > concern, but a fairly strict one that doesn't seem limited to immutable > functions or collation issues. Surely, at least the results of "SELECT > version()" should change from release to release ;-) Maybe we should stop doing releases, and then users won't have to worry about our releases breaking things! Slightly more seriously, the use of UPPER() and LOWER() in expression indexes is not that uncommon. Sometimes, the index exists specifically to enforce a unique constraint. Yes, plain indexes on columns are more common, and it makes sense to target that case first, but we shouldn't be too quickly hand-wave away the use of case-folding functions as a thing that doesn't happen. > I certainly don't oppose giving users that choice. But I view it as a > burden we are placing on the users -- better than breakage, but not > really great, either. So if we do put in a ton of work, I'd like it if > we could arrive at a bettter destination. > > If we actually want the BEST user experience possible, they'd not even > really know that their index was ever inconsistent. Autovacuum would > come along and just find the few entries in the index that need fixing, > and reindex just those few tuples. In theory, it should be possible: > there are a finite number of codepoints that change each Unicode > version, and we can just search for them in the data and fix up derived > structures. I have to disagree with this. I think this is a case where fixing something automatically is clearly worse. First, it could never fix it instantly, so you would be stuck with some window where queries might return wrong results -- or if you prevent that by not using the indexes any more until they're fixed, then it would instead cause huge query performance regressions that could easily take down the whole system. Second, one of the things people like least about autovacuum is when it unexpectedly does a lot of work all at once. Today, that's usually a vacuum for wrap-around, but suddenly trying to fix all my indexes when I wasn't expecting that to happen could easily be just as bad. 
I strongly believe users want to control what happens, not have the
system try to fix it for them automatically without their knowledge.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
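A concrete instance of the case-folding pattern under discussion -- a
sketch with illustrative table and column names:

    -- Case-insensitive uniqueness enforced through a case-folding
    -- expression index.  The stored index keys depend on ctype
    -- behavior, so a Unicode update can silently invalidate them.
    CREATE UNIQUE INDEX users_email_lower_idx ON users (LOWER(email));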
On Wed, 2025-03-19 at 14:33 -0400, Robert Haas wrote:
> I strongly believe users want to control what happens, not have
> the system try to fix it for them automatically without their
> knowledge.

Do you have a sketch of what the ideal Unicode version management
experience might look like? Very high level, like "this is what happens
by default during an upgrade" and "this is how a user discovers that
they might want to update Unicode", etc.

What ways can/should we nudge users to update more quickly, if at all,
so that they are less likely to have problems with newly-assigned code
points?

And, if possible, how we might extend this user experience to libc or
ICU updates?

Regards,
	Jeff Davis
On Wed, Mar 19, 2025 at 5:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Do you have a sketch of what the ideal Unicode version management
> experience might look like? Very high level, like "this is what happens
> by default during an upgrade" and "this is how a user discovers that
> they might want to update Unicode", etc.
>
> What ways can/should we nudge users to update more quickly, if at all,
> so that they are less likely to have problems with newly-assigned code
> points?
>
> And, if possible, how we might extend this user experience to libc or
> ICU updates?

As I think you know, I don't consider myself an expert in this area,
just somebody who has seen a decent amount of user pain (although I am
sure that even there some other people have seen more). That said, for
me the ideal would probably include the following things:

* When the collation/ctype/whatever definitions upon which you are
relying change, you can either decide to switch to the new ones without
rebuilding your indexes and risk wrong results until you reindex, or
you can decide to create new indexes using the new definitions and drop
the old ones.

* You're never forced to adopt new definitions during a SPECIFIC major
or minor release upgrade or when making some other big change to the
system. It's fine, IMHO, if we eventually remove support for old stuff,
but there should be a multi-year window of overlap. For example, if
PostgreSQL 42 adds support for Unicode 95.0.0, we'd keep that support
for, I don't know, at least the next four or five major versions. So
upgrading PG can eventually force you to upgrade collation defs, but
you don't get into a situation where PG 41 supports only Unicode < 95
and PG 42 supports only Unicode >= 95.

* In an absolutely perfect world, we'd have strong versioning of every
type of collation from every provider. This is probably very difficult
to achieve in practice, so maybe the somewhat more realistic goal might
be to get to a point where most users, most of the time, are relying on
collations with strong versioning. For glibc, this seems relatively
hopeless unless upstream changes their policy in a big way. For ICU,
loading multiple library versions seems like a possible path forward.
Relying more on built-in collations seems like another possible
approach, but I think that would require us to have more than just a
code-point sort: we'd need to have built-in collations for users of
various languages. That sounds like it would be a lot of work to
develop, but even worse, it sounds like it would be a tremendous amount
of work to maintain. I expect Tom will opine that this is an absolutely
terrible idea that we should never do under any circumstances, and I
understand the sentiment, but I think it might be worth considering if
we're confident we will have people to do the maintenance over the long
term.

* I would imagine pg_upgrade either keeping the behavior unchanged for
any strongly-versioned collation, or failing. I don't see a strong need
to try to notify users about the availability of new versions
otherwise. People who want to stay current will probably figure out how
to do that, and people who don't will ignore any warnings we give them.
I'm not completely opposed to some other form of notification, but I
think it's OK if "we finally removed support for your extremely old ICU
version" is the driving force that makes people upgrade.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
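For reference, PostgreSQL already records a collation version when a
collation is created and warns when the provider's current version no
longer matches (ALTER COLLATION ... REFRESH VERSION updates the
recorded value), but it cannot pin the library itself. A sketch,
assuming an ICU build; the second half is purely invented syntax:

    -- Status quo: the provider's version is recorded, not chosen.
    CREATE COLLATION de_at (provider = icu, locale = 'de-AT');
    SELECT collname, collversion
    FROM pg_collation WHERE collname = 'de_at';

    -- Hypothetical (not valid syntax today): pinning a specific ICU
    -- library, roughly what the multi-version idea might look like.
    -- CREATE COLLATION de_at_63 (provider = icu, locale = 'de-AT',
    --                            library_version = '63.1');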
On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> * When the collation/ctype/whatever definitions upon which you are
> relying change, you can either decide to switch to the new ones
> without rebuilding your indexes and risk wrong results until you
> reindex, or you can decide to create new indexes using the new
> definitions and drop the old ones.

Would newly-created objects pick up the new Unicode version, or stick
with the old one?

> Relying more on built-in collations seems like another possible
> approach, but I think that would require us to have more than just a
> code-point sort: we'd need to have built-in collations for users of
> various languages. That sounds like it would be a lot of work to
> develop, but even worse, it sounds like it would be a tremendous
> amount of work to maintain. I expect Tom will opine that this is an
> absolutely terrible idea that we should never do under any
> circumstances, and I understand the sentiment, but I think it might
> be worth considering if we're confident we will have people to do the
> maintenance over the long term.

Supporting a built-in case-insensitive collation would be some work,
but it's not a huge leap now that we have CASEFOLD().

Supporting built-in natural language sort orders would be a much larger
scope. And I don't think we need that, but that's a larger discussion.

> I don't see a strong need to try to notify users about the
> availability of new versions otherwise. People who want to stay
> current will probably figure out how to do that

What if we were able to tell, for instance, that your database has none
of the codepoints affected by the most recent update? Then updating
would be less risky than not updating: if you don't update Unicode,
then the code points could end up in the database treated as
unassigned, and then cause a problem for future updates.

Regards,
	Jeff Davis
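A sketch of how far CASEFOLD() already gets you with the builtin
provider, short of a true built-in case-insensitive collation. This
assumes the PG 18 builtin PG_UNICODE_FAST locale (full case mapping);
the table and column names are illustrative:

    -- Full Unicode case folding; unlike LOWER(), it handles characters
    -- such as U+00DF (ß), which is expected to fold to "ss".
    SELECT casefold('Straße' COLLATE PG_UNICODE_FAST);

    -- Approximating case-insensitive uniqueness via an expression index.
    CREATE UNIQUE INDEX users_email_ci
        ON users (casefold(email COLLATE PG_UNICODE_FAST));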
On Fri, Mar 21, 2025 at 2:45 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> > * When the collation/ctype/whatever definitions upon which you are
> > relying change, you can either decide to switch to the new ones
> > without rebuilding your indexes and risk wrong results until you
> > reindex, or you can decide to create new indexes using the new
> > definitions and drop the old ones.
>
> Would newly-created objects pick up the new Unicode version, or stick
> with the old one?

Hmm, I hadn't thought about that. I'm assuming that the Unicode version
would need, in this scheme, to be coupled to the object that depends on
it. For example, an index that uses a Unicode collation would need to
store a Unicode version. But for a new index, how would that be set?
Maybe the Unicode version would be treated as part of the collation.
I'm guessing that an index defaults to the column collation, and I
think the column collation defaults to the database default collation.
We might need a way for ALTER DATABASE to allow the database default to
be adjusted. I'm not quite sure here, but my general feeling is that
the Unicode version feels like part of the collation and that we should
avoid introducing a separate mechanism if possible. What are your
thoughts?

> Supporting built-in natural language sort orders would be a much larger
> scope. And I don't think we need that, but that's a larger discussion.

I'm curious why you think this. My own feeling (as I think you probably
know, but just to be clear) is that relatively few people need
extremely precise control over their collation behavior, but there are
some who do. However, I think there are many people for whom a
code-point sort won't be good enough. If you want to leave this
discussion for another time, that's fine.

> What if we were able to tell, for instance, that your database has none
> of the codepoints affected by the most recent update? Then updating
> would be less risky than not updating: if you don't update Unicode,
> then the code points could end up in the database treated as
> unassigned, and then cause a problem for future updates.

The problem with this is that it requires scanning the whole database.
That's not to say it's useless. Some people can afford to scan the
whole database, and some people might even WANT to scan the whole
database just to give themselves peace of mind. But there are also
plenty of people for whom this is a major downside, even unusable. I'd
like to have a solution that is based on metadata.

Maybe we should actually move in the direction of having encodings that
are essentially specific versions of Unicode. Instead of just having
UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or whatever,
which would only accept code points known to that version of Unicode.
Or maybe this shouldn't be entirely new encodings but something vaguely
akin to a typmod, so that you could have columns of type
text[limited_to_unicode_v16_0_0] or whatever. If we actually exclude
unassigned code points, then we know they aren't there, and we can make
deductions about what is safe to do based on that information. I'm not
quite sure how useful that is, but I tend to think that enforcing rules
when the data goes in has a decent shot at being better than letting
anything go in and then having to scan it later to see how it all
turned out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
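A rough approximation of the "assigned code points only" idea is
already expressible with unicode_assigned() (PG 17+), though it checks
against whatever Unicode version the server was built with rather than
a user-chosen one. A sketch, with an illustrative domain name:

    -- Reject strings containing code points unassigned in the server's
    -- compiled-in Unicode version.
    CREATE DOMAIN assigned_text AS text
        CHECK (unicode_assigned(VALUE));

    -- U+0378 has never been assigned, so it would fail the check:
    SELECT unicode_assigned(U&'\0378');  -- false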
On 15.03.25 07:54, Jeremy Schneider wrote:
> in favor of leaving it alone because ICU is there for when I need
> up-to-date unicode versions.
>
> From my perspective, the whole point of the builtin collation was to
> have one option that avoids these problems that come with updating
> both ICU and glibc.
>
> So I guess the main point of the builtin provider is just that it's
> faster than ICU?

A mistake that some people apparently continue to make in this
discussion is to assume that the only thing the Unicode tables drive is
the builtin collation provider. This is not true: the Unicode tables
were there long before the builtin collation provider, and they have
other purposes. And we knew at the time the builtin collation provider
was added that it would have certain problems with the Unicode table
updates, and we accepted it with the understanding that this would not
change our procedures. Otherwise, we would likely not have accepted it
in its current form.

Those who are now trying to make the builtin collation provider have
properties that it does not have and was not intended to have when it
was added would need to arrange the work to make it have those
properties if they want them, rather than insist that progress in other
areas must stop because they are content with the current state.
On Fri, 2025-03-21 at 17:15 +0100, Peter Eisentraut wrote:
> And we knew at the time the builtin collation provider was added that
> it would have certain problems with the Unicode table updates, and we
> accepted it with the understanding that this would not change our
> procedures.

Correct. That was called out by me in the initial proposal for the
builtin collation provider and documented explicitly.

> Those who are now trying to make the builtin collation provider have
> properties that it does not have and was not intended to have when it
> was added would need to arrange the work to make it have those
> properties if they want them, rather than insist that progress in
> other areas must stop because they are content with the current state.

It does feel like the goalposts are moving. That's not necessarily bad
by itself -- our expectations should go up. But the way it's happening
in this thread makes it feel like new obligations are being put on the
people already working on collation improvements, in particular Peter
and me.

Robert indicated that there might be some willing hackers, and perhaps
even appetite for larger-scope projects in this area, which is great
news. A lot of what's happening in this area is non-controversial, and
more attention would be an unqualified win. For instance, Peter put
some work into better support for non-deterministic collations, and I
had some ideas there:

https://www.postgresql.org/message-id/024c9b9aa834f668496ef95700b57e50bf3f4808.camel%40j-davis.com

but I didn't have time to work on that this cycle. (Maybe my idea would
be hard to implement or not work at all, or maybe Peter and Tom already
have better ideas, but that's different from being controversial.)

For the many people who think multi-lib is the way to go, the shortest
path involves someone taking a look at this prerequisite:

https://www.postgresql.org/message-id/cb580fec46ea4ca96dd4bbde9d2769360e097d01.camel%40j-davis.com

Some technical review would be nice, but really what I needed was
someone to say "this small regression in a worst case due to an
unavoidable indirect function call is not worth worrying about". It
might be a bit late now, though, as a big refactoring right before FF
seems like a bad idea. So it will probably slip until July, adding risk
that any other multi-lib work (which I am not promising to do) might
slip to PG20, which users will see at the end of 2027. Ugh.

Regards,
	Jeff Davis
On Fri, Mar 21, 2025 at 2:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Those who are now trying to make the builtin collation provider have
> > properties that it does not have and was not intended to have when it
> > was added would need to arrange the work to make it have those
> > properties if they want them, rather than insist that progress in
> > other areas must stop because they are content with the current state.
>
> It does feel like the goalposts are moving. That's not necessarily bad
> by itself -- our expectations should go up. But the way it's happening
> in this thread makes it feel like new obligations are being put on the
> people already working on collation improvements, in particular Peter
> and me.

Honestly, I'm confused as to why Peter hasn't committed the Unicode
update a long time ago at this point. Nobody has alleged that the
stability guarantees provided by the builtin collation provider are (a)
worse than any of the other providers or (b) worse than what was
documented. And nobody has refuted the argument that refusing to update
the Unicode tables will cause other problems (such as not knowing what
to do with new code points that are added in the other places where
those tables are used). People who aren't doing the work to improve the
infrastructure don't get to hold the longstanding process hostage.

None of the above means that I wouldn't like things to be better in
this area; this is a huge source of pain and I would obviously like all
kinds of amazing things, preferably done by other people with no real
thought or work required on my part. But I'd like that in lots of areas
of PostgreSQL and lots of areas of my life in general, and I understand
that it's an unreasonable expectation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2025-03-21 at 10:45 -0400, Robert Haas wrote:
> We might need a way for ALTER DATABASE to allow the database default
> to be adjusted. I'm not quite sure here, but my general feeling is
> that the Unicode version feels like part of the collation and that we
> should avoid introducing a separate mechanism if possible. What are
> your thoughts?

My (early stage) plans are to have two new shared catalogs,
pg_ctype_provider and pg_collation_provider. Objects would depend on
records in those shared catalogs, which would each have a version. We'd
eventually allow multiple records with providerkind=icu, for instance,
and have some way to choose which one to use (perhaps new objects get
the default version, old objects keep the old version, or something).

The reason to have two shared catalogs is that some objects depend on
collation behavior and some on ctype behavior. If there's an index on
"t COLLATE PG_C_UTF8" then there would be no direct dependency from the
index to the builtin provider in either catalog, because collation
behavior in the builtin provider is unversioned memcmp. But if there's
an index on "LOWER(t COLLATE PG_C_UTF8)", then it would have a
dependency entry to the builtin provider's entry in pg_ctype_provider.

> I'm curious why you think this. My own feeling (as I think you
> probably know, but just to be clear) is that relatively few people
> need extremely precise control over their collation behavior, but
> there are some who do. However, I think there are many people for
> whom a code-point sort won't be good enough.

You can use ICU for sorting without using it for the index comparators.
Using ICU in the index comparators is an implementation detail that's
only required for unique indexes over non-deterministic collations. And
if it's not used for the index comparators, then most of the problems
go away, and versioning is not nearly so important.

Sure, there are some cases where using ICU in the index comparator is
important, and I'm not suggesting that we remove functionality. But I
believe that using libc or ICU for index comparators is the wrong
default behavior -- high downsides and low upsides for most text
indexes that have ever been created. Even if there is an ORDER BY,
using an index is often the wrong thing unless it's an index-only scan.
Text indexes are rarely correlated with the heap, so it would lead to a
lot of random heap fetches, and it's often better to just execute the
query and do a final sort. The situations where ICU in the comparator
is a good idea are special cases of special cases.

I've posted about this in the past, and got universal disagreement. But
I believe others will eventually come to the same conclusion that I
did.

> Maybe we should actually move in the direction of having encodings
> that are essentially specific versions of Unicode. Instead of just
> having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> whatever, which would only accept code points known to that version
> of Unicode. Or maybe this shouldn't be entirely new encodings but
> something vaguely akin to a typmod, so that you could have columns of
> type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> exclude unassigned code points, then we know they aren't there, and
> we can make deductions about what is safe to do based on that
> information.

I like this line of thinking, vaguely similar to my STRICT_UNICODE
database option proposal.
Maybe these aren't exactly the right things to do, but I think there
are some possibilities here, and we shouldn't give up and assume
there's a problem when usually there is not.

It reminds me of fast-path locking: sure, there *might* be DDL
happening while I'm trying to do a simple SELECT query. But probably
not, so let's make it the responsibility of DDL to warn others that
it's doing something, rather than the responsibility of the SELECT
query.

Regards,
	Jeff Davis
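To make the proposed dependency split concrete, a sketch of the two
index cases described above (table and index names are illustrative;
pg_ctype_provider is the proposed, not-yet-existing catalog):

    -- Depends only on collation (sort) behavior; for PG_C_UTF8 that is
    -- plain memcmp, so no versioned dependency row would be needed.
    CREATE INDEX t_idx ON tbl (t COLLATE PG_C_UTF8);

    -- Also depends on ctype behavior through LOWER(); this one would
    -- get a dependency on the builtin provider's row in the proposed
    -- pg_ctype_provider catalog.
    CREATE INDEX t_lower_idx ON tbl (LOWER(t COLLATE PG_C_UTF8));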
On Fri, 2025-03-21 at 14:54 -0400, Robert Haas wrote:
> And nobody has refuted the argument that refusing to update the
> Unicode tables will cause other problems (such as not knowing what to
> do with new code points that are added in the other places where
> those tables are used).

The argument is irrefutable. I can only utter my (unproven) conviction
that the complaints about some outlandish character being converted to
upper case in the wrong way won't be as loud as the complaints about a
corrupted index. But perhaps both occurrences would be rare enough that
there is little point in arguing at all.

> People who aren't doing the work to improve the infrastructure don't
> get to hold the longstanding process hostage.

Actually, I am arguing for *not* doing the work...

Yours,
Laurenz Albe
On Fri, 21 Mar 2025 13:45:24 -0700 Jeff Davis <pgsql@j-davis.com> wrote:
> > Maybe we should actually move in the direction of having encodings
> > that are essentially specific versions of Unicode. Instead of just
> > having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
> > whatever, which would only accept code points known to that version
> > of Unicode. Or maybe this shouldn't be entirely new encodings but
> > something vaguely akin to a typmod, so that you could have columns
> > of type text[limited_to_unicode_v16_0_0] or whatever. If we actually
> > exclude unassigned code points, then we know they aren't there, and
> > we can make deductions about what is safe to do based on that
> > information.
>
> I like this line of thinking, vaguely similar to my STRICT_UNICODE
> database option proposal. Maybe these aren't exactly the right things
> to do, but I think there are some possibilities here, and we shouldn't
> give up and assume there's a problem when usually there is not.

There is "the iPhone paradox" here: if we reject unassigned code
points, then websites are going to start throwing database errors for
anyone with the latest iPhone who uses a new emoji. (Unless the
database is updated very quickly, which is atypical.) Apple tends to
get new emojis into consumers' hands a year or less after the new
Unicode release.

-Jeremy
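To illustrate that failure mode with the earlier domain sketch: under a
strict "assigned only" check, an INSERT carrying a code point the
server's Unicode tables don't know yet simply errors out. U+0378 (never
assigned) stands in for a brand-new emoji, and "profiles" is an
illustrative table using the assigned_text domain sketched above:

    INSERT INTO profiles (bio) VALUES (U&'\0378');
    -- ERROR:  value for domain assigned_text violates check constraint ...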
On 17.03.25 19:54, Jeff Davis wrote:
> On Thu, 2025-03-13 at 14:49 +0100, Peter Eisentraut wrote:
> > I think these test result changes are incorrect. AFAICT, nothing has
> > changed in the upstream data that would explain such a change.
> >
> > I didn't get such test differences in my original patch. Did you
> > rebase the patch or regenerate it? Maybe something went wrong there.
>
> Right. The previous version was improperly rebased.
>
> I went back to your original version and rebased over master (including
> the latest optimizations for case mapping), and then rebased my changes
> on top of that. Attached.
>
> One annoyance with the recent case mapping optimizations is that the
> Unicode update diff is larger, similar to the diff for
> unicode_norm_hashfunc.h. Previously, it was binary search, so the only
> differences were logical differences.

I have committed the update to the Unicode tables. I suggest you commit
your pg_upgrade patch on top of that now.