Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Msg-id 2595905d7c70333a4510d9a3301f2165d121c183.camel@j-davis.com
In response to Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:

> I don't think the special handling of IsBinaryUpgrade is needed or
> wanted.  I would hope that with this feature, all old-style locale IDs
> would go away, but this way we would keep them forever.  If we believe
> that canonicalization is safe, then I don't see why we cannot apply it
> during binary upgrade.

There are two issues:

1. Canonicalization can fail. For instance, if an invalid attribute is
used, like '@collStrength=primary', then we can't canonicalize it (or
if we do, the result might not be what the user intended).

2. Versions 15 and earlier have a subtle bug: they pass the raw locale
straight to ucol_open(), and if the locale is "fr_CA.UTF-8", ucol_open()
mis-parses it as language "fr" with no region. If you canonicalize
first, the locale is parsed properly and produces "fr-CA", which
results in a different collator. The 15 behavior is wrong, and this
canonicalization patch will fix it, but the patch doesn't apply the fix
during pg_upgrade because that could change the collator and corrupt an
index.
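
To make the two cases concrete, here is a rough string-level screen for
old-style locale strings. It is only a sketch and not ICU's parser or
anything from the patch: screen_locale() and KNOWN_KEYWORDS are
hypothetical names, the keyword list is partial, and comparing keyword
names case-insensitively is an assumption.

```python
# Illustrative only: a string-level screen, not ICU's parser.
# KNOWN_KEYWORDS is a partial, assumed list of old-style ICU collation
# keywords, stored in lower case (case-insensitive matching is assumed).
KNOWN_KEYWORDS = {
    "collation", "colstrength", "colalternate", "colbackwards",
    "colcasefirst", "colcaselevel", "colnumeric", "colnormalization",
    "colreorder", "colhiraganaquaternary",
}


def screen_locale(raw):
    """Return a list of warnings for an old-style locale string."""
    warnings = []
    base, _, attrs = raw.partition("@")
    # Issue #2 candidate: a POSIX-style codeset suffix such as ".UTF-8".
    if "." in base:
        warnings.append("codeset suffix")
    # Issue #1 candidate: attribute names outside the known list.
    for item in filter(None, attrs.split(";")):
        name = item.partition("=")[0].strip().lower()
        if name not in KNOWN_KEYWORDS:
            warnings.append("unknown attribute: " + name)
    return warnings
```

With this screen, "fr_CA.UTF-8" is flagged for its codeset suffix and
'@collStrength=primary' is flagged for the misspelled attribute, while
'@colStrength=primary' passes.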

The current patch deals with these problems by simply preserving the
locale (valid or not) during pg_upgrade, and only canonicalizing new
collations and databases (so #2 is only fixed for new
collations/databases). I think that's a good trade-off because a lot
more users will be on ICU now that it's the default, so let's avoid
creating more of the problem cases for those new users.
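
In pseudocode terms, the rule amounts to the following sketch. The
names locale_to_store() and toy_canonicalize() are hypothetical, and
the toy canonicalizer is a stand-in that only handles simple strings
like the example above; the real patch is C code and is not shown here.

```python
def toy_canonicalize(raw):
    # Stand-in for real canonicalization (an assumption, not ICU):
    # drop a ".codeset" suffix and turn "_" into "-".
    return raw.split(".")[0].replace("_", "-")


def locale_to_store(raw, is_binary_upgrade, canonicalize=toy_canonicalize):
    """Pick the locale string to record in the catalog."""
    if is_binary_upgrade:
        # pg_upgrade: keep the old string, valid or not, so an existing
        # collator (and any index built with it) is not changed.
        return raw
    # New collation or database: store the canonical form.
    return canonicalize(raw)
```

So locale_to_store("fr_CA.UTF-8", True) keeps the string unchanged,
while locale_to_store("fr_CA.UTF-8", False) stores "fr-CA".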

To get to perfectly-canonicalized catalogs for upgrades from earlier
versions:

* We need a way to detect #2, which I posted some code for in an
uncommitted revision[1] of this patch series.

* We need a way to detect #1 and #2 during the pg_upgrade --check
phase.

* We need actions that the user can take to correct the problems. I
have some ideas but they could use some discussion.
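
A check along these lines might look roughly like the sketch below. It
is not the code from [1]: check_locales() and the two callbacks are
hypothetical, and the lookup tables are made-up toy data, with only the
"fr_CA.UTF-8" entries taken from the example above.

```python
# Toy data (assumed): what canonicalization returns, and which collator
# locale each string ends up resolving to when opened.
TOY_CANONICAL = {"fr_CA.UTF-8": "fr-CA", "de_DE": "de-DE"}
TOY_RESOLVED = {
    "fr_CA.UTF-8": "fr",   # region lost, as described in issue #2
    "fr-CA": "fr-CA",
    "de_DE": "de-DE",
    "de-DE": "de-DE",
}


def toy_canonicalize(raw):
    if raw not in TOY_CANONICAL:
        raise ValueError("cannot canonicalize: " + raw)
    return TOY_CANONICAL[raw]


def toy_resolve(locale):
    return TOY_RESOLVED[locale]


def check_locales(locales, canonicalize=toy_canonicalize, resolve=toy_resolve):
    """Classify each locale string for a --check style report."""
    report = {}
    for raw in locales:
        try:
            canonical = canonicalize(raw)
        except ValueError:
            report[raw] = "cannot canonicalize"   # issue #1
            continue
        if resolve(raw) != resolve(canonical):
            report[raw] = "collator changes"      # issue #2
        else:
            report[raw] = "ok"
    return report
```

The point of passing the callbacks in is that the same loop works
whether the comparison is done on resolved locale names, as here, or on
some stronger notion of collator equivalence.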

I'm not sure all of those will be ready for v16, though.

Regards,
    Jeff Davis

[1] See check_equivalent_icu_locales() and calling code here:
https://www.postgresql.org/message-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com




