Re: Collation version tracking for macOS - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Collation version tracking for macOS
Date
Msg-id CA+hUKGLgzQU=zJneV4ASAmVDXyFPB6CY5MKMHx6trQJaFGGREw@mail.gmail.com
Whole thread Raw
In response to Re: Collation version tracking for macOS  (Jeremy Schneider <schneider@ardentperf.com>)
List pgsql-hackers
Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:
> Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of
material;I haven’t read the whole thread yet. If you have some others that would also be particularly good background,
letme know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I
waspulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple
years,so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in
theconversation here. 

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

> Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just
warnon version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is
prettydamn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things
tohappen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades
oflarge & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I
havea lot of reading to do, to really understand how that happened and where the dialogue is today. 

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

> Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer),
butI still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good
enough”for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change.
Ithink glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If
MySQLwasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now
though,it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But
maybeI’m wrong?) 

Hmm.  Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag.  We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU).  ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that).  Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for.  If there's an API change in ICU we'll have to make some
changes, but that's already true.

Review/testing/flames/rants/better ideas welcome.

> Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for
thenon-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>). 

+1



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Improve TAP tests of pg_upgrade for cross-version tests
Next
From: Julien Rouhaud
Date:
Subject: Re: Add header support to text format and matching feature