Re: Collation versioning - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Collation versioning
Date
Msg-id CA+hUKG++n=pdWD+9QZZvCDzUVy-BWXOWZjqhiGFxoTU_2_YLyw@mail.gmail.com
In response to Re: Collation versioning  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> >  We could create a static table with the conversion based on what was
> > discussed for commit a169155, please find attached a spreadsheet with the
> > comparison. This would require maintenance as new LCIDs are released [1].
> >
> > [1]
> > https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f
>
> I am honestly not a fan of something like that as it has good chances
> to rot.

No opinion on that, other than that we'd surely want a machine
readable version.  As for *when* we use that information, I'm
wondering if it would make sense to convert datcollate to a language
tag in initdb, and also change pg_upgrade's equivalent_locale()
function to consider "English_United States.*" and "en-US" to be
equivalent when upgrading to 14 (which would then be the only point
you'd ever have to have faith that we can convert the old style names
to the new names correctly).  I'm unlikely to work on this myself as I
have other operating systems to fix, but I'll certainly be happy if
somehow we can get versioning for the default collation on Windows in
PG14 and not have to come up with weasel words in the manual.
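
To make the equivalence check concrete, here is a rough sketch of the
kind of lookup it would need.  This is not pg_upgrade's actual code:
the table, old_name_to_tag() and locales_equivalent() are hypothetical
names, and the single table entry is just the example pair from this
thread.  A real table would have to be generated from a machine
readable source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from old-style Windows locale names to language tags. */
typedef struct
{
    const char *old_name;   /* old-style name, without ".codepage" suffix */
    const char *tag;        /* language tag */
} locale_map_entry;

/* Illustrative only: the one pair mentioned in this thread. */
const locale_map_entry locale_map[] = {
    {"English_United States", "en-US"},
};

/* Return the language tag for an old-style name, or NULL if unknown. */
const char *
old_name_to_tag(const char *name)
{
    /* Ignore any ".codepage" suffix, e.g. ".1252". */
    size_t      len = strcspn(name, ".");
    size_t      i;

    for (i = 0; i < sizeof(locale_map) / sizeof(locale_map[0]); i++)
    {
        if (strlen(locale_map[i].old_name) == len &&
            strncmp(name, locale_map[i].old_name, len) == 0)
            return locale_map[i].tag;
    }
    return NULL;
}

/* True if the names are identical or the old name maps to the new tag. */
bool
locales_equivalent(const char *old_loc, const char *new_loc)
{
    const char *tag;

    if (strcmp(old_loc, new_loc) == 0)
        return true;
    tag = old_name_to_tag(old_loc);
    return tag != NULL && strcmp(tag, new_loc) == 0;
}
```

With this, locales_equivalent("English_United States.1252", "en-US")
is true, while a name missing from the table is never treated as
equivalent to anything but itself.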

Just by the way, I think Windows does one thing pretty nicely here: it
has versions with a major and a minor part.  If the minor part goes
up, it means that they only added new code points, but didn't change
the ordering of any existing code points, so in some circumstances you
don't have to rebuild (which I think is the case for many Unicode
updates, adding new Chinese characters or emojis or whatever).  I
thought about whether we should replace the strcmp() comparison with a
call into provider-specific code, and in the case of Win32 locales it
could maybe understand that.  But there are two problems of limited
surmountability: (1) You have an index built with version 42.1, and now
version 42.3 is present; OK, we can read this index, but if we write
any new data, then a streaming replica that has 42.2 will think it's
OK to read the data, but it's not OK; so as soon as you write, you'd
need to update the catalogue, which is quite complicated (cf enum
types); (2) The whole theory only holds together if you didn't actually
use any of the new code points introduced by 42.3 while the index said
42.1, yet PostgreSQL isn't validating the code points you use against
the collation provider's internal map of valid code points.  So I gave
up on that line of thinking for now.
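
For illustration, the provider-specific check that would replace
strcmp() might look roughly like the sketch below.  It assumes version
strings of the simplified "major.minor" form used in the example above
(the strings actually recorded for Windows may look different), and
parse_collversion() and collversion_readable() are hypothetical names.
It only covers the read-only case and does nothing about problems (1)
and (2).

```c
#include <stdbool.h>
#include <stdio.h>

/* Parse "major.minor"; reject anything with trailing characters. */
bool
parse_collversion(const char *s, unsigned int *major, unsigned int *minor)
{
    char        trailing;

    return sscanf(s, "%u.%u%c", major, minor, &trailing) == 2;
}

/*
 * Hypothetical replacement for the strcmp() test: data recorded under
 * "recorded" is considered readable under "actual" if the major parts
 * match and the minor part has not gone backwards.
 */
bool
collversion_readable(const char *recorded, const char *actual)
{
    unsigned int rmajor, rminor, amajor, aminor;

    /* Unparseable versions are treated as a mismatch. */
    if (!parse_collversion(recorded, &rmajor, &rminor) ||
        !parse_collversion(actual, &amajor, &aminor))
        return false;

    return rmajor == amajor && aminor >= rminor;
}
```

So an index recorded as 42.1 is readable when 42.3 is present, but an
index recorded as 42.3 is not readable on a replica that has 42.2,
which is why the recorded version would have to be bumped on first
write.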
