Re: Collation version tracking for macOS - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Collation version tracking for macOS |
Date | |
Msg-id | CAH2-Wz=i3HXC8R4dP7zPQV-9bt=qCuh=BcpeFiJr23jcHW_OGQ@mail.gmail.com |
In response to | Re: Collation version tracking for macOS (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Suppose you pg_upgrade to something that is linked against 71.
> Perhaps you'd need to tell it how to dlopen 67 before you can open any
> collations with that library, but once you've done that your
> collation-dependent partition constraints etc should all hold. I
> dunno, lots of problems to figure out here, including quite broad ones
> about various migration problems. I haven't understood what Peter G
> is suggesting about how upgrades might work, so I'll go and try to do
> that...

I'm mostly just arguing for the idea that we should treat ICU versions as essentially interchangeable in terms of their high-level capabilities around collations and languages/scripts/whatever provided for by the underlying CLDR version -- tools like pg_dump shouldn't need to care about ICU versions per se.

*ICU itself* should be versioned, rather than having multiple independent ICU collation providers. This should work as well as anything like this can ever be expected to work -- because internationalization is just hard. These remarks need to be interpreted in the context of how internationalization is *supposed* to work under standards like BCP47 (again, this is a broad RFC about internationalization, not really an ICU thing). Natural languages are inherently squishy, messy things.

The "default ICU collations" that initdb puts in pg_collation are not really special to ICU -- we generate them through a quasi-arbitrary process that iterates through top-level locales, which results in a list that is a bit like what you get with libc collations. If you pg_upgrade, you might have leftover "default ICU collations" that wouldn't have been the default on a new initdb. It's inherently pretty chaotic (because humans aren't as predictable as computers), which is why BCP47 itself is so forgiving -- it literally has to be. Plus there really isn't much downside to being so lax; as Jeremy pretty much said already, the important thing is generally to have roughly the right idea -- which this fuzzy approach mostly manages to do. Let's not fight that.

Let's leave the natural language stuff to the experts, by versioning a single collation provider (like ICU), and generalizing the definition of a collation along the same lines -- something that can be implemented using any available version of ICU (with a preference for the latest on REINDEX, perhaps). It might turn out that an older version does a slightly better job than a newer version (regressions cannot be ruled out), but ultimately that's not our problem. It can't be -- we're not the Unicode Consortium.

It's theoretically up to the user to make sure they're happy with any behavioral changes under this scheme, perhaps by testing. They won't actually test very often, of course, but that shouldn't matter in practice. This is already what we advise for users that use advanced tailorings of custom ICU collations, such as a custom collation for "natural sorting", often used for things like alphanumeric invoice numbers. That might break if you downgrade ICU version, and maybe even if you upgrade ICU version.

--
Peter Geoghegan
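To make the "single versioned provider" idea a bit more concrete, here is a minimal Python sketch of one way to read it. Everything in it (`VersionedProvider`, `Index`, `version_for_scan`, `reindex`) is hypothetical and exists only for illustration; none of it is PostgreSQL or ICU code. The version numbers 67 and 71 are borrowed from the quoted message, and the tag `en-u-kn-true` is just an example BCP 47 tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """Hypothetical stand-in for a collation-dependent index."""
    collation: str   # BCP 47 language tag; says nothing about the ICU version
    built_with: int  # ICU major version whose ordering the index reflects


class VersionedProvider:
    """Hypothetical model: one ICU provider with several library versions loaded."""

    def __init__(self, loaded_versions):
        if not loaded_versions:
            raise ValueError("at least one ICU version must be loaded")
        self.loaded = set(loaded_versions)

    def version_for_scan(self, index):
        # An existing index must keep using the version it was built with,
        # otherwise its on-disk ordering may disagree with comparisons.
        if index.built_with not in self.loaded:
            raise LookupError(f"ICU {index.built_with} is not loaded")
        return index.built_with

    def reindex(self, index):
        # Rebuilding is the point where the collation can move to the
        # latest loaded version; the collation definition itself is unchanged.
        return Index(index.collation, max(self.loaded))


provider = VersionedProvider({67, 71})
old_index = Index("en-u-kn-true", built_with=67)
new_index = provider.reindex(old_index)
```

In this reading the collation is identified only by its language tag, and the ICU version is a property of each dependent object, which is why a dump/restore tool would not need to know about ICU versions at all.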