Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Collation version tracking for macOS
Date
Msg-id CAH2-Wz=i3HXC8R4dP7zPQV-9bt=qCuh=BcpeFiJr23jcHW_OGQ@mail.gmail.com
In response to Re: Collation version tracking for macOS  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Suppose you pg_upgrade to something that is linked against 71.
> Perhaps you'd need to tell it how to dlopen 67 before you can open any
> collations with that library, but once you've done that your
> collation-dependent partition constraints etc should all hold.  I
> dunno, lots of problems to figure out here, including quite broad ones
> about various migration problems.  I haven't understood what Peter G
> is suggesting about how upgrades might work, so I'll go and try to do
> that...

I'm mostly just arguing for the idea that we should treat ICU versions
as essentially interchangeable in terms of their high-level
capabilities around collations and languages/scripts/whatever provided
for by the underlying CLDR version -- tools like pg_dump shouldn't
need to care about ICU versions per se. *ICU itself* should be
versioned, rather than having multiple independent ICU collation
providers. This should work as well as anything like this can ever be
expected to work -- because internationalization is just hard.

These remarks need to be interpreted in the context of how
internationalization is *supposed* to work under standards like BCP47
(again, this is a broad RFC about internationalization, not really an
ICU thing). Natural languages are inherently squishy, messy things.
The "default ICU collations" that initdb puts in pg_collation are not
really special to ICU -- we generate them through a quasi-arbitrary
process that iterates through top-level locales, which results in a
list that is a bit like what you get with libc collations. If you
pg_upgrade, you might have leftover "default ICU collations" that
wouldn't have been the default on a new initdb. It's inherently pretty
chaotic (because humans aren't as predictable as computers), which is
why BCP47 itself is so forgiving -- it literally has to be. Plus there
really isn't much downside to being so lax; as Jeremy pretty much said
already, the important thing is generally to have roughly the right
idea -- which this fuzzy approach mostly manages to do.

Let's not fight that. Let's leave the natural language stuff to the
experts, by versioning a single collation provider (like ICU), and
generalizing the definition of a collation along the same lines --
something that can be implemented using any available version of ICU
(with a preference for the latest on REINDEX, perhaps). It might turn
out that an older version does a slightly better job than a newer
version (regressions cannot be ruled out), but ultimately that's not
our problem. It can't be -- we're not the Unicode Consortium.
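
A minimal sketch of the bookkeeping this implies, in Python. Every name in it (`AVAILABLE_ICU_VERSIONS`, `Index`, `version_for_scan`, `reindex`) is hypothetical and is not a PostgreSQL or ICU API; it only illustrates one way the idea might work, assuming each index remembers the ICU major version it was built with and REINDEX moves it to the newest version available.

```python
from dataclasses import dataclass

# Hypothetical: ICU major versions this server is able to load.
AVAILABLE_ICU_VERSIONS = {67, 71}


@dataclass
class Index:
    name: str
    collation: str    # version-independent locale name, e.g. "en-US"
    icu_version: int  # ICU major version the index was built with


def version_for_scan(index: Index) -> int:
    """Scans use the version the index was built with, or fail loudly."""
    if index.icu_version not in AVAILABLE_ICU_VERSIONS:
        raise LookupError(
            f"index {index.name!r} needs ICU {index.icu_version}, "
            "which is not available"
        )
    return index.icu_version


def reindex(index: Index) -> Index:
    """A rebuild is the one point where the version may change: prefer the latest."""
    return Index(index.name, index.collation, max(AVAILABLE_ICU_VERSIONS))


idx = Index("orders_name_idx", "en-US", 67)
print(version_for_scan(idx))     # the existing index keeps using the version it was built with
print(reindex(idx).icu_version)  # after a rebuild, the newest available version
```

Note that the collation name itself carries no version in this sketch; only the index does, which is the sense in which the versions are treated as interchangeable.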

It's theoretically up to the user to make sure they're happy with any
behavioral changes under this scheme, perhaps by testing. They won't
actually test very often, of course, but that shouldn't matter in
practice. This is already what we advise for users that use advanced
tailorings of custom ICU collations, such as a custom collation for
"natural sorting", often used for things like alphanumeric invoice
numbers. That might break if you downgrade the ICU version, and maybe
even if you upgrade it.
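
For readers unfamiliar with "natural sorting", here is a rough illustration in plain Python. It does not use ICU and is only an approximation of what a numeric-ordering ICU collation (for example, one created with a locale such as `en-u-kn-true`) is meant to do; the `natural_key` helper and the sample invoice numbers are made up for this example.

```python
import re


def natural_key(s: str):
    """Split a string into text and digit runs so digit runs compare by value."""
    key = []
    # With a capturing group, re.split puts the digit runs at odd positions.
    for i, part in enumerate(re.split(r"([0-9]+)", s)):
        if part == "":
            continue
        if i % 2 == 1:
            key.append((0, int(part), ""))          # digit run: compare numerically
        else:
            key.append((1, 0, part.casefold()))     # text run: compare case-insensitively
    return key


invoices = ["INV-10", "INV-9", "INV-100", "INV-2"]

print(sorted(invoices))                   # ['INV-10', 'INV-100', 'INV-2', 'INV-9']
print(sorted(invoices, key=natural_key))  # ['INV-2', 'INV-9', 'INV-10', 'INV-100']
```

An index built on such an ordering depends on every detail of the comparison staying the same, which is why a change in ICU version can matter for it.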

--
Peter Geoghegan


