Thread: ICU locale validation / canonicalization

ICU locale validation / canonicalization

From
Jeff Davis
Date:
Right now, ICU locales are not validated:

  initdb ... --locale-provider=icu --icu-locale=anything
  CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything');
  CREATE DATABASE anythingdb ICU_LOCALE 'anything';

all succeed.

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

At first, I thought this was a bug. The ICU documentation[1] suggests
that the fallback logic can result in using the ICU default locale in
some cases. The default locale is problematic because it's affected by
the environment (LANG, LC_ALL, and strangely LC_MESSAGES; but strangely
not LC_COLLATE).

Fortunately, I didn't find any cases where it actually does fall back
to the default locale, so I think we're safe, but validation seems wise
regrardless. In different contexts we may want to fail (e.g. initdb
with a bogus locale), or warn, issue a notice that we changed the
string, or just silently change what the user entered to be in a
consistent form. BCP47 [2] seems to be the standard here, and we're
already using it when importing the icu collations.

ICU locale validation is not exactly straightforward, though, and I
suppose that's why it isn't already done. There's a document[3] that
explains canonicalization in terms of "level 1" and "level 2", and says
that ucol_canonicalize() provides level 2 canonicalization, but I am
not seeing all of the documented behavior in my tests. For instance,
the document says that "de__PHONEBOOK" should canonicalize to
"de@collation=phonebook", but instead I see that it remains
"de__PHONEBOOK". It also says that "C" should canonicalize to
"en_US_POSIX", but in my test, it just goes to "c".

The right entry point appears to get uloc_getLanguageTag(), which
internally calls uloc_canonicalize, but also converts to BCP47 format,
and gives the option for strictness. Non-strict mode seems problematic
because for "de__PHONEBOOK", it returns a langtag of plain "de", which
is a different actual locale than "de__PHONEBOOK". If uloc_canonicalize
worked as documented, it would have changed it to
"de@collation=phonebook" and the correct language tag "de-u-co-phonebk"
would be returned, which would find the right collator. I suppose that
means we would need to use strict mode.

And then we need to check whether it actually exists; i.e. reject well-
formed but bogus locales, like "wx-YZ". To do that, probably the most
straightforward way would be to initialize a UCollator and then query
it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that results
in the root locale, we could say that it doesn't exist because it
failed to find a more suitable match (unless the user explicitly
requested the root locale). If it resolves to something else, we could
either just assume it's fine, or we could try to validate that it
matches what we expect in more detail. To be safe, we could double-
check that the resulting BCP 47 locale string loads the same actual
collator as what would have been loaded with the original string (also
check attributes?).

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

Thoughts?


[1] https://unicode-org.github.io/icu/userguide/locale/#fallback
[2] https://en.wikipedia.org/wiki/IETF_language_tag
[3]
https://unicode-org.github.io/icu/userguide/locale/#canonicalization


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 08.02.23 08:59, Jeff Davis wrote:
> The overall benefit here is that we keep our catalogs consistently
> using an independent standard format for ICU locale strings, rather
> than whatever the user specifies. That makes it less likely that ICU
> needs to use any fallback logic when trying to open a collator, which
> could only be bad news.

One use case is that if a user specifies a locale, say, of 'de-AT', this 
might canonicalize to 'de' today, but we should still store what the 
user specified because 1) that documents what the user wanted, and 2) it 
might not canonicalize to the same thing tomorrow.




Re: ICU locale validation / canonicalization

From
Robert Haas
Date:
On Wed, Feb 8, 2023 at 2:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
> We do check that the value is accepted by ICU, but ICU seems to accept
> anything and use some fallback logic. Bogus strings will typically end
> up as the "root" locale (spelled "root" or "").

I've noticed this, and I think it's really frustrating. There's barely
any documentation of what strings you're allowed to specify, and the
documentation that does exist is extremely difficult to understand.
Normally, you could work around that problem to some degree by making
a guess at what you're supposed to be doing and then seeing whether
the program accepts it, but here that doesn't work either. It just
accepts anything you give it and then you have to try to figure out
whether the behavior is what you wanted. But there's also no real
documentation of what the behavior of any collation is, so you're
apparently just supposed to magically know what collations exist and
how they behave and then you can test whether the string you put in
gave you the behavior you wanted.

Adding validation and canonicalization wouldn't cure the documentation
problems, but it would be a big help. You still wouldn't know what
string you were supposed to be passing to ICU, but if you did pass it
a string, you'd find out what it thought that string meant. I think
that would be a huge step forward.

Unfortunately, I have no idea whether your specific ideas about how to
make that happen are any good or not. But I hope they are, because the
current situation is pessimal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:
> One use case is that if a user specifies a locale, say, of 'de-AT',
> this
> might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

> but we should still store what the
> user specified because 1) that documents what the user wanted, and 2)
> it
> might not canonicalize to the same thing tomorrow.

We don't want to store things with ambiguous interpretations that could
change tomorrow; that's a recipe for trouble. That's why most people
store timestamps as the offset from some epoch in UTC rather than as
"2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where
you would want to store something like that, but I don't see why they'd
apply in this case, where reinterpretation probably means a corrupted
index.

If the user wants to know how their ad-hoc string was interpreted, they
can look at the resulting BCP 47 language tag, and see if it's what
they meant. We can try to make this user-friendly by offering a NOTICE,
WARNING, or helper functions that allow them to explore. We can also
double check that the canonicalized form resolves to the same actual
collator to be safe, and maybe even fall back to whatever the user
specified if not. I'm open to discuss how strict we want to be and what
kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or different collator version tomorrow.
That's why we need to be careful about versioning (library versions or
collator versions or both), and we've had long discussions about that.

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how
> to
> make that happen are any good or not. But I hope they are, because
> the
> current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail, warn and continue with the user-
entered representation, or turn off the strictness checking and come up
with some BCP 47 tag and see if it resolves to the same collator.

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag). And the format is specified[1],
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1]
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Andreas Karlsson
Date:
On 2/9/23 23:09, Jeff Davis wrote:
> I do like the ICU format locale IDs from a readability standpoint.
> "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
> level1" (the equivalent language tag). And the format is specified[1],
> even though it's not an independent standard. But I think the benefits
> of better validation, an independent standard, and the fact that we're
> already favoring BCP47 outweigh my subjective opinion.

I have the same feeling one is readable and the other unreadable but the 
unreadable one is standardized. Hard call.

And in general I agree, if we are going to make ICU default it needs to 
be more user friendly than it is now. Currently there is no nice way to 
understand if you entered the right locale or made a typo in the BCP47 
syntax.

Andreas




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Fri, 2023-02-10 at 01:04 +0100, Andreas Karlsson wrote:
> I have the same feeling one is readable and the other unreadable but
> the
> unreadable one is standardized. Hard call.
>
> And in general I agree, if we are going to make ICU default it needs
> to
> be more user friendly than it is now. Currently there is no nice way
> to
> understand if you entered the right locale or made a typo in the
> BCP47
> syntax.

We will still allow the ICU format locale IDs for input; we would just
convert them to BCP47 before storing them in the catalog. And there's
an inverse function, so it's easy enough to offer a view that shows the
ICU format locale IDs in addition to the BCP 47 tags.

I don't think it's hugely important that we use BCP47; ICU format
locale IDs would also make sense. But I do think we should be
consistent to simplify things where we can -- collator versioning is
hard enough without wondering how a user-entered string will be
interpreted. And if we're going to be consistent, BCP 47 seems like the
most obvious choice.

--
Jeff Davis
PostgreSQL Contributor Team - AWS




Re: ICU locale validation / canonicalization

From
Andreas Karlsson
Date:
On 2/10/23 02:22, Jeff Davis wrote:
> We will still allow the ICU format locale IDs for input; we would just
> convert them to BCP47 before storing them in the catalog. And there's
> an inverse function, so it's easy enough to offer a view that shows the
> ICU format locale IDs in addition to the BCP 47 tags.

Aha, then I misread your mail. Sorry! BCP 47 sounds perfect for storage.

Andreas




Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 09.02.23 22:15, Jeff Davis wrote:
> On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:
>> One use case is that if a user specifies a locale, say, of 'de-AT',
>> this
>> might canonicalize to 'de' today,
> Canonicalization should not lose useful information, it should just
> rearrange it, so I don't see a risk here based on what I read and the
> behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
> the language tag "de-AT".

It turns out that 'de_AT' is actually a distinct collation from 'de' in 
CLDR, so that was not the best example.  What behavior do you see for 
'de_CH'?




Re: ICU locale validation / canonicalization

From
Robert Haas
Date:
On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I do like the ICU format locale IDs from a readability standpoint.
> "en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
> level1" (the equivalent language tag).

Sadly, neither of those means a whole lot to me? :-(

How did you find out that those are equivalent?

> And the format is specified[1],
> even though it's not an independent standard. But I think the benefits
> of better validation, an independent standard, and the fact that we're
> already favoring BCP47 outweigh my subjective opinion.

See, I'm confused, because that link says "If a keyword list is
present it must be preceded by an at-sign" which makes it sound like
it is talking about stuff like en_US@colstrength=primary rather than
stuff like en-US-u-ks-level1. The examples are all that way too, like
it gives examples like en_IE@currency=IEP and
fr@collation=phonebook;calendar=islamic-civil.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Fri, 2023-02-10 at 07:42 +0100, Peter Eisentraut wrote:
> It turns out that 'de_AT' is actually a distinct collation from 'de'
> in
> CLDR, so that was not the best example.  What behavior do you see for
> 'de_CH'?

The canonicalized form is de_CH and the bcp47 tag is de-CH.

uloc_canonicalize() and uloc_getLanguageTag() are declared in uloc.h,
and they aren't (as far as I can tell) tied to which collations are
actually defined.

--
Jeff Davis
PostgreSQL Contributor Team - AWS




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > I do like the ICU format locale IDs from a readability standpoint.
> > "en_US@colstrength=primary" is more meaningful to me than "en-US-u-
> > ks-
> > level1" (the equivalent language tag).
>
> Sadly, neither of those means a whole lot to me? :-(
>
> How did you find out that those are equivalent?

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

Here's the API reference, which tells you that you can set the strength
of a collator (using the API, not the keyword):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a

The more precise definitions of the strengths are here:
https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

Regarding the equivalence of the two forms, uloc_toLanguageTag() and
uloc_toLanguageTag() are inverses. As far as I can tell (a lower degree
of assurance than you are looking for), if one succeeds, then the other
will also succeed and produce the original result.

There are another couple documents here (TR35):
http://www.unicode.org/reports/tr35/
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
that seems to cover the "ks-level1" and how it maps to the collation
strength.

My examination of these standards is very superficial -- I'm basically
just checking that they seem to be there. If I search for a string like
"en-US-u-ks-level1", I only find Postgres-related results, so you could
also question whether these standards are actually used.

Using BCP 47 tags for icu locale strings, and moving to ICU (as
discussed in the other thread) is basically a leap of faith in ICU. The
docs aren't perfect, the source is hard to read, and we've found bugs.
But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

> > And the format is specified[1],
> > even though it's not an independent standard. But I think the
> > benefits
> > of better validation, an independent standard, and the fact that
> > we're
> > already favoring BCP47 outweigh my subjective opinion.
>
> See, I'm confused, because that link says "If a keyword list is
> present it must be preceded by an at-sign" which makes it sound like
> it is talking about stuff like en_US@colstrength=primary rather than
> stuff like en-US-u-ks-level1. The examples are all that way too, like
> it gives examples like en_IE@currency=IEP and
> fr@collation=phonebook;calendar=islamic-civil.

My paragraph was unclear, let me restate the point:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: ICU locale validation / canonicalization

From
Robert Haas
Date:
On Fri, Feb 10, 2023 at 12:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
> In our tests you can see colstrength=primary is used to mean "case
> insensitive". That's where I picked up the "colstrength" keyword, which
> is also present in the ICU sources, but now that you ask I'm embarassed
> that I don't see the keyword itself documented very well.
>
> This document
> https://unicode-org.github.io/icu/userguide/locale/#keywords
> lists keywords, but colstrength is not there. It's easy enough to find
> in the ICU source; I'm probably just missing the document.

The fact that you're figuring out how it all works from reading the
source code does not give me a warm feeling.

> But it seems like a better place for us than libc for the reasons I
> mentioned in the other thread.

It may be. But sometimes I feel that's not setting our sights very high. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Fri, 2023-02-10 at 22:50 -0500, Robert Haas wrote:
> The fact that you're figuring out how it all works from reading the
> source code does not give me a warm feeling.

Right. On the other hand, the behavior is quite well documented, it was
just the keyword that was undocumented (or I didn't find it).

> > But it seems like a better place for us than libc for the reasons I
> > mentioned in the other thread.
>
> It may be. But sometimes I feel that's not setting our sights very
> high. :-(

How much higher could we set our sights? What would the ideal collation
provider look like?

Those are good questions, but please let's take those questions to the
thread about ICU as a default.

The topic of this thread is:

Given that we are already offering ICU support, should we canonicalize
the locale string stored in the catalog? If so, should we use the ICU
format locale IDs, or BCP 47 language tags?

Do you have an opinion on that topic? If not, do you need additional
information?

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote:
> It feels like BCP 47 is the right catalog representation. We are
> already using it for the import of initial collations, and it's a
> standard, and there seems to be good support in ICU.

Patch attached.

We should have been canonicalizing all along -- either with
uloc_toLanguageTag(), as this patch does, or at least with
uloc_canonicalize() -- before passing to ucol_open().

ucol_open() is documented[1] to work on either language tags or ICU
format locale IDs. Anything else is invalid and ends up going through
some fallback logic, probably after being mis-parsed. For instance, in
ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid
language tag, and is resolved by ucol_open() to the actual locale
"root"; but if you canonicalize it first (to the ICU format locale ID
"fr_CA" or the language tag "fr-CA"), it correctly resolves to the
actual locale "fr_CA".

The correct thing to do is canonicalize first and then pass to
ucol_open().

But because we didn't canonicalize in the past, there could be raw
locale strings stored in the catalog that resolve to the wrong actual
collator, and there could be indexes depending on the wrong collator,
so we have to be careful during pg_upgrade.

Say someone created two ICU collations, one with locale "en_US.UTF-8"
and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16,
this patch will check the language tag "en-US" and see that it resolves
to the same locale as "en_US.UTF-8", and change to the language tag
during upgrade (so "en-US" will be in the new catalog). But when it
checks the language tag "fr-CA", it will notice that it resolves to a
different locale than "fr_CA.UTF-8", and keep the latter string even
though it's wrong, because some indexes might be dependent on that
wrong collator.


[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 10.02.23 18:53, Jeff Davis wrote:
> To represent ICU locale strings in the catalog consistently, we have
> two choices, which as far as I can tell are equivalent:
> 
> 1. ICU format Locale IDs. These are more readable, and still specified
> (albeit non-standard).
> 
> 2. BCP47 language tags. These are standardized, there's better
> validation with "strict" mode, and we are already using them.
> 
> Honestly I don't think it's hugely important which one we pick. But
> being consistent is important, so we need to pick one, and BCP 47 seems
> like the better option to me.

I found some discussion about this from when ICU support was first 
added.  See this message as a starting point: 

https://www.postgresql.org/message-id/flat/5291804b-169e-3ba9-fdaf-fae8e7d2d959%402ndquadrant.com#96acb7eb9299c2ca64dbabcf58e11a90

There isn't much detail there, but the discussion and the current code 
seem pretty convinced that

a) BCP47 tags are preferred, and
b) They don't work with ICU versions before 54.

I can't locate the source for claim b) anymore.  However, it seems 
pretty clear that there is some cutoff, even if it isn't exactly 54.

I would support transitioning this forward somehow, but we would need to 
know exactly what the impact would be.



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote:
> a) BCP47 tags are preferred, and

Agreed.

> b) They don't work with ICU versions before 54.

I tried in versions 50 through 53, and the language tags are supported,
but I think I know why we don't use them:

Prior to version 54, ICU would not set the collator attributes based on
the locale name. That is the same for either language tags or ICU
format locale IDs. However, for ICU format locale IDs, we added special
code to parse the locale string and set the attributes ourselves. We
didn't bother to add the same parsing logic for language tags, so if a
language tag is found in the catalog, the parts of it that specify
collation strength (for example) would be ignored. I don't know if
that's an actual problem when importing the system collations, because
I don't think we use any collator attributes, but it makes sense that
we'd not favor language tags in ICU prior to v54.

> I would support transitioning this forward somehow, but we would need
> to
> know exactly what the impact would be.

I've done quite a bit of investigation, which I've described upthread.

We need to transition somehow, because the prior behavior is incorrect
for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable
thing to do, but if we pass that straight to ucol_open(), then it gets
misinterpreted as plain "fr" because it doesn't understand the "." as a
valid separator. We must turn it into a language tag (or at least
canonicalize it) before passing the string to ucol_open().

This misbehavior only affects a small number of locales, which resolve
to a different actual collator than they should. The most problematic
case is during pg_upgrade, where a slight behavior change would result
in corrupt indexes. So during binary upgrade, my patch falls back to
the original raw string (not the language tag) when it resolves to a
different actual collator. If we want to be more paranoid, we could
also provide a compatibility GUC to preserve the old misbehavior for
newly-created collations, too, but I don't think that's necessary.

There is also some interaction with pg_upgrade's ability to check
whether the old and new cluster are compatible. If the catalog
representation of the locale changes, then it could falsely believe the
icu locales aren't compatible, because it's doing a simple string
comparison. But as we are discussing in the other thread[1], the whole
idea of checking for compatibility of the initialized cluster is
strange: pg_upgrade should be in charge of making a compatible cluster
to upgrade into (assuming the binaries are at least compatible). I
don't see this as a major problem; we'll sort out the other thread
first to allow ICU as the default, and then adapt this patch if
necessary.

[1]
https://www.postgresql.org/message-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:
>
> New patch attached. The new patch also includes a GUC that (when
> enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 28.02.23 06:57, Jeff Davis wrote:
> On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:
>>
>> New patch attached. The new patch also includes a GUC that (when
>> enabled) validates that the collator is actually found.
> 
> New patch attached.
> 
> Now it always preserves the exact locale string during pg_upgrade, and
> does not attempt to canonicalize it. Before it was trying to be clever
> by determining if the language tag was finding the same collator as the
> original string -- I didn't find a problem with that, but it just
> seemed a bit too clever. So, only newly-created locales and databases
> have the ICU locale string canonicalized to a language tag.
> 
> Also, I added a SQL function pg_icu_language_tag() that can convert
> locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear 
exactly where the boundaries are between them and which ones we might 
actually want.  And I think the terminology also gets mixed up a bit, 
which makes following this harder.

1. Canonicalizing the locale string.  This is presumably what 
uloc_canonicalize() does, which the patch doesn't actually use.  What 
are examples of what this does?  Does the patch actually do this?

2. Converting the locale string to BCP 47 format.  This converts 
'de@collation=phonebook' to 'de-u-co-phonebk'.  This is what 
uloc_getLanguageTag() does.

3. Validating the locale string, to reject faulty input.

What are the relationships between these?

I don't understand how the validation actually happens in your patch. 
Does uloc_getLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale 
strings are well-formed".  What is "well-formed"?  How does that relate 
to the other concepts?

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale = 
'de@collation=phonebook');
NOTICE:  00000: using language tag "de-u-co-phonebk" for locale 
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate 
people to the "new" spellings, but the old ones aren't wrong.  So this 
should be a separate consideration, with an option, and it would require 
various updates in the documentation.  It also doesn't appear to address 
how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote:
> This patch appears to do about three things at once, and it's not
> clear
> exactly where the boundaries are between them and which ones we might
> actually want.  And I think the terminology also gets mixed up a bit,
> which makes following this harder.
>
> 1. Canonicalizing the locale string.  This is presumably what
> uloc_canonicalize() does, which the patch doesn't actually use.  What
> are examples of what this does?  Does the patch actually do this?

Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2
Canonicalization, which is described here:

https://unicode-org.github.io/icu/userguide/locale/#canonicalization

> 2. Converting the locale string to BCP 47 format.  This converts
> 'de@collation=phonebook' to 'de-u-co-phonebk'.  This is what
> uloc_getLanguageTag() does.

Yes, though uloc_getLanguageTag() also canonicalizes. I consider
converting to the language tag a part of "canonicalization", because
it's the canonical form we agreed on in this thread.

> 3. Validating the locale string, to reject faulty input.

Canonicalization doesn't make sure the locale actually exists in ICU,
so it's easy to make a typo like "jp_JP" instead of "ja_JP". After
canonicalizing to a language tag, the former is "jp-JP" (resolving to
the collator with valid locale "root") and the latter is "ja-JP"
(resolving to the collator with valid locale "ja"). The former is
clearly a mistake, and I call catching that mistake "validation".

If the user specifies something other than the root locale (i.e. not
"root", "und", or ""), and the locale resolves to a collator with a
valid locale of "root", then this patch considers that to be a mistake
and issues a WARNING (upgraded to ERROR if the GUC
icu_locale_validation is true).

> What are the relationships between these?

1 & 2 are closely related. If we canonicalize, we need to pick one
canonical form: either BCP 47 or ICU format locale IDs.

3 is related, but can be seen as an independent change.

> I don't understand how the validation actually happens in your patch.
> Does uloc_getLanguageTag() do the validation also?

Using the above definition of "validation" it happens inside
icu_collator_exists().

> Can you do canonicalization without converting to language tag?

If we used uloc_canonicalize(), it would give us ICU format locale IDs,
and that would be a valid thing to do; and we could switch the
canonical form from ICU format locale IDs to BCP 47 in a separate
patch. I don't have a strong opinion, but if we're going to
canonicalize, I think it makes sense to go straight to language tags.

> Can you do validation of un-canonicalized locale names?

Yes, though I feel like an un-canonicalized name is less stable in
meaning, and so validation on that name may also be less stable.

For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to
plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA".
Will the uncanonicalized name always resolve to "fr"? I'm not sure,
because the documentation says that ucol_open() expects either an ICU
format locale ID or, preferably, a language tag.

So they are technically independently useful changes, but I would
recommend that canonicalization goes in first.

> What is the guidance for the use of the icu_locale_validation GUC?

If an error when creating a new collation or database due to a bad
locale name would be highly disruptive, leave it false. If such an
error would be helpful to make sure you get the locale you expect, then
turn it on. In practice, existing important production systems would
leave it off; new systems could turn it on to help avoid
misconfigurations/mistakes.

> The description throws in yet another term: "validates that ICU
> locale
> strings are well-formed".  What is "well-formed"?  How does that
> relate
> to the other concepts?

Good point, I don't think I need to redefine "validation". Maybe I
should just describe it as elevating canonicalization or validation
problems from WARNING to ERROR.

> Personally, I'm not on board with this behavior:
>
> => CREATE COLLATION test (provider = icu, locale =
> 'de@collation=phonebook');
> NOTICE:  00000: using language tag "de-u-co-phonebk" for locale
> "de@collation=phonebook"
>
> I mean, maybe that is a thing we want to do somehow sometime, to
> migrate
> people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

But at the same time I can see it being annoying or confusing. If it's
confusing, perhaps a wording change and documentation would improve it?
If it's annoying, we might need to have an option and/or a different
log level?

> It also doesn't appear to address
> how to handle ICU before version 54.

Do you have a specific concern here?

Regards,
    Jeff Davis




Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 09.03.23 21:17, Jeff Davis wrote:
>> Personally, I'm not on board with this behavior:
>>
>> => CREATE COLLATION test (provider = icu, locale =
>> 'de@collation=phonebook');
>> NOTICE:  00000: using language tag "de-u-co-phonebk" for locale
>> "de@collation=phonebook"
>>
>> I mean, maybe that is a thing we want to do somehow sometime, to
>> migrate
>> people to the "new" spellings, but the old ones aren't wrong.
> 
> I see what you mean; I'm not sure the best thing to do here. We are
> adjusting the string passed by the user, and it feels like some users
> might want to know that. It's a NOTICE, not a WARNING, so it's not
> meant to imply that it's wrong.

For clarification, I wasn't complaining about the notice, but about the 
automatic conversion from old-style ICU locale ID to language tag.

>> It also doesn't appear to address
>> how to handle ICU before version 54.
> 
> Do you have a specific concern here?

What we had discussed a while ago in one of these threads is that ICU 
before version 54 do not support keyword lists, and we have custom code 
to do that parsing ourselves, but we don't have code to do the same for 
language tags.  Therefore, if I understand this right, if we 
automatically convert ICU locale IDs to language tags, as shown above, 
then we break support for such locales in those older ICU versions.



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Mon, 2023-03-13 at 08:25 +0100, Peter Eisentraut wrote:
> For clarification, I wasn't complaining about the notice, but about
> the
> automatic conversion from old-style ICU locale ID to language tag.

Canonicalization means that we pick one format, and automatically
convert to it, right?

> What we had discussed a while ago in one of these threads is that ICU
> before version 54 do not support keyword lists, and we have custom
> code
> to do that parsing ourselves, but we don't have code to do the same
> for
> language tags.  Therefore, if I understand this right, if we
> automatically convert ICU locale IDs to language tags, as shown
> above,
> then we break support for such locales in those older ICU versions.

Right. In versions 53 and earlier, and during pg_upgrade, we would just
preserve the locale string as entered.

Alternatively, we could canonicalize to the ICU format locale IDs. Or
add something to parse out the attributes from a language tag.

Regards,
    Jeff Davis




Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 13.03.23 16:31, Jeff Davis wrote:
>> What we had discussed a while ago in one of these threads is that ICU
>> before version 54 do not support keyword lists, and we have custom
>> code
>> to do that parsing ourselves, but we don't have code to do the same
>> for
>> language tags.  Therefore, if I understand this right, if we
>> automatically convert ICU locale IDs to language tags, as shown
>> above,
>> then we break support for such locales in those older ICU versions.
> 
> Right. In versions 53 and earlier, and during pg_upgrade, we would just
> preserve the locale string as entered.

Another issue that came to mind:  Right now, you can, say, develop SQL 
schemas on a newer ICU version, say, your laptop, and then deploy them 
on a server running an older ICU version.  If we have a cutoff beyond 
which we convert ICU locale IDs to language tags, then this won't work 
anymore for certain combinations.  And RHEL/CentOS 7 is still pretty 
popular.




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-03-14 at 08:08 +0100, Peter Eisentraut wrote:
> Another issue that came to mind:  Right now, you can, say, develop
> SQL
> schemas on a newer ICU version, say, your laptop, and then deploy
> them
> on a server running an older ICU version.  If we have a cutoff beyond
> which we convert ICU locale IDs to language tags, then this won't
> work
> anymore for certain combinations.  And RHEL/CentOS 7 is still pretty
> popular.

If we just uloc_canonicalize() in icu_set_collation_attributes() then
versions 50-53 can support language tags. Patch attached.

One loose end is that we really should support language tags like "und"
in those older versions (54 and earlier). Your commit d72900bded
avoided the problem, but perhaps we should fix it by looking for "und"
and replacing it with "root" while opening, or something.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote:
> One loose end is that we really should support language tags like
> "und"
> in those older versions (54 and earlier). Your commit d72900bded
> avoided the problem, but perhaps we should fix it by looking for
> "und"
> and replacing it with "root" while opening, or something.

Attached are a few patches to implement this idea.

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-03-14 at 23:47 -0700, Jeff Davis wrote:
> On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote:
> > One loose end is that we really should support language tags like
> > "und"
> > in those older versions (54 and earlier). Your commit d72900bded
> > avoided the problem, but perhaps we should fix it by looking for
> > "und"
> > and replacing it with "root" while opening, or something.
>
> Attached are a few patches to implement this idea.

Here is an updated patch series that includes these earlier fixes for
older ICU versions, with the canonicalization patch last (0005).

I left out the validation patch for now, and I'm evaluating a different
approach that will attempt to match to the locales retrieved with
uloc_countAvailable()/uloc_getAvailable().


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote:
> I left out the validation patch for now, and I'm evaluating a
> different
> approach that will attempt to match to the locales retrieved with
> uloc_countAvailable()/uloc_getAvailable().

I like this approach, attached new patch series with that included as
0006.

The first 3 patches are essentially bugfixes -- should they be
backported?


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 17.03.23 18:55, Jeff Davis wrote:
> On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote:
>> I left out the validation patch for now, and I'm evaluating a
>> different
>> approach that will attempt to match to the locales retrieved with
>> uloc_countAvailable()/uloc_getAvailable().
> 
> I like this approach, attached new patch series with that included as
> 0006.

I have looked at the first three patches.  I think we want what those 
patches do.


[PATCH v6 1/6] Support language tags in older ICU versions (53 and
  earlier).

In pg_import_system_collations(), this is now redundant and can be 
simplified:

-        if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+        if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))

icu_set_collation_attributes() needs more commenting about what is going 
on.  My guess is that uloc_canonicalize() converts from language tag to 
ICU locale ID, and then the existing logic to parse that apart would 
apply.  Is that how it works?


[PATCH v6 2/6] Wrap ICU ucol_open().

It makes sense to try to unify some of this.  But I find the naming 
confusing.  If I see pg_ucol_open(), then I would expect that all calls 
to ucol_open() would be replaced by this.  But here it's only a few, 
without explanation.  (pg_ucol_open() has no explanation at all AFAICT.)

I have in my notes that check_icu_locale() and make_icu_collator() 
should be combined into a single function.  I think that would be a 
better way to slice it.

Btw., I had intentionally not written code like this

+#if U_ICU_VERSION_MAJOR_NUM < 54
+    icu_set_collation_attributes(collator, loc_str);
+#endif

The disadvantage of doing it that way is that you then need to dig out 
an old version of ICU in order to check whether the code compiles at 
all.  With the current code, you can be sure that that code compiles if 
you make changes elsewhere.


[PATCH v6 3/6] Handle the "und" locale in ICU versions 54 and older.

This makes sense, but the same comment about not #if'ing out code for 
old ICU versions applies here.

The

+#ifdef USE_ICU
+

before pg_ucol_open() probably belongs in patch 2.



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
> [PATCH v6 1/6] Support language tags in older ICU versions (53 and
>   earlier).
>
> In pg_import_system_collations(), this is now redundant and can be
> simplified:
>
> -               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
> +               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
>
> icu_set_collation_attributes() needs more commenting about what is
> going
> on.  My guess is that uloc_canonicalize() converts from language tag
> to
> ICU locale ID, and then the existing logic to parse that apart would
> apply.  Is that how it works?

Fixed the redundancy, added some comments, and committed 0001.

> [PATCH v6 2/6] Wrap ICU ucol_open().
>
> It makes sense to try to unify some of this.  But I find the naming
> confusing.  If I see pg_ucol_open(), then I would expect that all
> calls
> to ucol_open() would be replaced by this.  But here it's only a few,
> without explanation.  (pg_ucol_open() has no explanation at all
> AFAICT.)

The remaining callsite which doesn't use the wrapper is in initdb.c,
which can't call into pg_locale.c, and has different intentions. initdb
uses ucol_open to get the default locale if icu_locale is not
specified; and it also uses ucol open to verify that the locale can be
opened (whether specified or the default). (Aside: I created a tiny
0004 patch which makes this difference more clear and adds a nice
comment.)

There's no reason to use a wrapper when getting the default locale,
because it's just passing NULL anyway.

When verifying that the locale can be opened, ucol_open() doesn't catch
many problems anyway, so I'm not sure it's worth a lot of effort to
copy these extra checks that the wrapper does into initdb.c. For
instance, what's the value in replacing "und" with "root" if opening
either will succeed? Parsing the attributes can potentially catch
problems, but the later patch 0006 will check the attributes when
converting to a language tag at initdb time.

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

> I have in my notes that check_icu_locale() and make_icu_collator()
> should be combined into a single function.  I think that would be a
> better way to slice it.

That would leave out get_collation_actual_version(), which should
handle the same fixups for attributes and the "und" locale.

> Btw., I had intentionally not written code like this
>
> +#if U_ICU_VERSION_MAJOR_NUM < 54
> +       icu_set_collation_attributes(collator, loc_str);
> +#endif
>
> The disadvantage of doing it that way is that you then need to dig
> out
> an old version of ICU in order to check whether the code compiles at
> all.  With the current code, you can be sure that that code compiles
> if
> you make changes elsewhere.

I was wondering about that -- thank you, I changed it back to use "if"
rather than "#ifdef".


New series attached (starting at 0002 to better correspond to the
previous series).


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 22.03.23 19:05, Jeff Davis wrote:
> On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
>> [PATCH v6 1/6] Support language tags in older ICU versions (53 and
>>    earlier).
>>
>> In pg_import_system_collations(), this is now redundant and can be
>> simplified:
>>
>> -               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
>> +               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
>>
>> icu_set_collation_attributes() needs more commenting about what is
>> going
>> on.  My guess is that uloc_canonicalize() converts from language tag
>> to
>> ICU locale ID, and then the existing logic to parse that apart would
>> apply.  Is that how it works?
> 
> Fixed the redundancy, added some comments, and committed 0001.

So, does uloc_canonicalize() always convert to ICU locale IDs?  What if 
you pass a language tag, does it convert it to ICU locale ID as well?

>> [PATCH v6 2/6] Wrap ICU ucol_open().

> So I'm inclined to just leave initdb alone in patches 0002 and 0003.

0002 and 0003 look ok to me now.

In 0002, the error "opening default collator is not supported", should 
that be an assert or an elog?  Is it reachable by the user?

You might want to check the declarations at the top of pg_ucol_open(). 
0003 reformats them after they were just added in 0002.  Maybe check 
that they are pgindent'ed in 0002 properly.

I don't understand patch 0004.  It seems to do two things, handle 
C/POSIX locale specifications and add an SQL-callable function.  Are 
those connected?




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-03-23 at 07:27 +0100, Peter Eisentraut wrote:
> So, does uloc_canonicalize() always convert to ICU locale IDs?  What
> if
> you pass a language tag, does it convert it to ICU locale ID as well?

Yes.

The documentation is not clear on that point, but my testing shows that
it does. And this is only for old versions of the code, so we don't
need to worry about later versions of ICU changing that.

I thought about using uloc_forLanguageTag(), but the documentation for
that is not clear what formats it accepts as an input, so it doesn't
seem like a win. If wanted to be paranoid we could use
uloc_toLanguageTag() followed by uloc_forLanguageTag(), but that seemed
excessive.

> >
> 0002 and 0003 look ok to me now.

Thank you, committed 0002 and 0003.

> In 0002, the error "opening default collator is not supported",
> should
> that be an assert or an elog?  Is it reachable by the user?

It's not reachable by the user, but could catch a bug if we
accidentally read a NULL field from the catalog or something like that.
It seemed a worthwhile check to leave in production builds.

> You might want to check the declarations at the top of
> pg_ucol_open().
> 0003 reformats them after they were just added in 0002.  Maybe check
> that they are pgindent'ed in 0002 properly.

They seem to be pgindented fine in 0002, it was unnecessarily
reindented in 0003 and I fixed that.

I use emacs "align-current" and generally that does the right thing,
but I'll rely more on pgindent in the future.

> I don't understand patch 0004.  It seems to do two things, handle
> C/POSIX locale specifications and add an SQL-callable function.  Are
> those connected?

It's hard to test (or even exercise) the former without the latter.

I could get rid of the SQL-callable function and move the rest of the
changes into 0006. I'll see if that arrangement works better, and that
way we can add the SQL-callable function later (or perhaps not at all
if it's not desired).

Regards,
    Jeff Davis




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote:
> I could get rid of the SQL-callable function and move the rest of the
> changes into 0006. I'll see if that arrangement works better, and
> that
> way we can add the SQL-callable function later (or perhaps not at all
> if it's not desired).

Attached a new series that doesn't include the SQL-callable function.
It's probably better to just wait and see what functions seem actually
useful to users.

I included a new small patch to fix a potential UCollator leak and make
the errors more consistent.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 23.03.23 18:16, Jeff Davis wrote:
>> In 0002, the error "opening default collator is not supported",
>> should
>> that be an assert or an elog?  Is it reachable by the user?
> It's not reachable by the user, but could catch a bug if we
> accidentally read a NULL field from the catalog or something like that.
> It seemed a worthwhile check to leave in production builds.

Then it ought to be an elog().



Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 24.03.23 07:39, Jeff Davis wrote:
> On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote:
>> I could get rid of the SQL-callable function and move the rest of the
>> changes into 0006. I'll see if that arrangement works better, and
>> that
>> way we can add the SQL-callable function later (or perhaps not at all
>> if it's not desired).
> 
> Attached a new series that doesn't include the SQL-callable function.
> It's probably better to just wait and see what functions seem actually
> useful to users.
> 
> I included a new small patch to fix a potential UCollator leak and make
> the errors more consistent.

[PATCH v8 1/4] Avoid potential UCollator leak for older ICU versions.

Couldn't we do this in a simpler way by just freeing the collator before 
the ereport() calls.  Or wrap a PG_TRY/PG_FINALLY around the whole thing?

It would be nicer to not make the callers of 
icu_set_collation_attributes() responsible for catching and reporting 
the errors.


[PATCH v8 2/4] initdb: emit message when using default ICU locale.

I'm not able to make initdb print this message.  Under what 
circumstances am I supposed to see this?  Do you have some examples?

The function check_icu_locale() has now gotten a lot more functionality 
than its name suggests.  Maybe the part that assigns the default ICU 
locale should be moved up one level to setlocales(), which has a better 
name and does something similar for the libc locale realm.


[PATCH v8 3/4] Canonicalize ICU locale names to language tags.

I'm still on the fence about whether we actually want to do this, but 
I'm warming up to it, now that the issues with pre-54 versions are fixed.

But if we do this, the documentation needs to be updated.  There is a 
bunch of text there that says, like, you can do this format or that 
format, whatever you like.  At least the guidance should be changed there.


[PATCH v8 4/4] Validate ICU locales.

I would make icu_locale_validation true by default.

Or maybe it should be a log-level type option, so you can set it to 
error, warning, and also completely off?



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote:
> Couldn't we do this in a simpler way by just freeing the collator
> before
> the ereport() calls.

I committed a tiny patch to do this.

We still need to address the error inconsistency though. The problem is
that, in older ICU versions, if the fixup for "und@colNumeric=lower" ->
"root@colNumeric=lower" is applied, then icu_set_collation_attributes()
will throw an error reporting "root@colNumeric=lower", which is not
what the user typed.

We could fix that directly by passing the original string to
icu_set_collation_attributes() instead, or perhaps as an extra
parameter used only for the ereport().

I like the minor refactoring I did better, though. It puts the
ereports() close to each other, so any differences are more obvious.
And it seems cleaner to me for pg_ucol_open to close the UCollator
because it's the one that opened it. I don't have a strong opinion, but
that's my reasoning.

>   Or wrap a PG_TRY/PG_FINALLY around the whole thing?

I generally avoid PG_TRY/FINALLY unless it avoids some major
awkwardness or other problem.

> It would be nicer to not make the callers of
> icu_set_collation_attributes() responsible for catching and reporting
> the errors.

There's only one caller now: pg_ucol_open().

> [PATCH v8 2/4] initdb: emit message when using default ICU locale.
>
> I'm not able to make initdb print this message.  Under what
> circumstances am I supposed to see this?  Do you have some examples?

It happens when you don't specify --icu-locale. It is slightly
redundant with "ICU locale", but it lets you see that it came from the
environment rather than the command line:

-------------
$ initdb -D data
The files belonging to this database system will be owned by user
"someone".
This user must also own the server process.

Using default ICU locale "en_US_POSIX".
The database cluster will be initialized with this locale
configuration:
  provider:    icu
  ICU locale:  en_US_POSIX
...
-------------

That seems fairly useful for testing, etc., where initdb.log doesn't
show the command line options.

> The function check_icu_locale() has now gotten a lot more
> functionality
> than its name suggests.  Maybe the part that assigns the default ICU
> locale should be moved up one level to setlocales(), which has a
> better
> name and does something similar for the libc locale realm.

Agreed, done.

In fact, initdb.c:check_icu_locale() is completely unnecessary in that
patch, because as the comment points out, the backend will try to open
it during post-bootstrap initialization. I think it was simply a
mistake to try to do this validation in commit 27b62377b4.

The later validation patch does do some better validation at initdb
time to make sure the language can be found.

> [PATCH v8 3/4] Canonicalize ICU locale names to language tags.
>
> I'm still on the fence about whether we actually want to do this, but
> I'm warming up to it, now that the issues with pre-54 versions are
> fixed.
>
> But if we do this, the documentation needs to be updated.  There is a
> bunch of text there that says, like, you can do this format or that
> format, whatever you like.  At least the guidance should be changed
> there.
>
>
> [PATCH v8 4/4] Validate ICU locales.
>
> I would make icu_locale_validation true by default.

Agreed. I considered also not having a GUC, but it seems like some kind
of escape hatch is wise, at least for now.

> Or maybe it should be a log-level type option, so you can set it to
> error, warning, and also completely off?

As the validation patch seems closer to acceptance, I changed it to be
before the canonicalization patch. New series attached.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
[PATCH v9 1/5] Fix error inconsistency in older ICU versions.

ok


[PATCH v9 2/5] initdb: replace check_icu_locale() with
  default_icu_locale().

I would keep the #ifdef USE_ICU inside the lower-level function 
default_icu_locale(), like it was before, so that the higher-level 
setlocales() doesn't need to know about it.

Otherwise ok.


[PATCH v9 3/5] initdb: emit message when using default ICU locale.

ok


[PATCH v9 4/5] Validate ICU locales.

Also here, let's keep the #ifdef USE_ICU in the lower-level function and 
move more logic in there.  Otherwise you have to repeat various things 
in DefineCollation() and createdb().

I'm not sure we need the IsBinaryUpgrade checks.  Users can set 
icu_validation_level on the target instance if they don't want that.




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-03-28 at 08:41 +0200, Peter Eisentraut wrote:
> [PATCH v9 1/5] Fix error inconsistency in older ICU versions.
>
> ok

Committed 0001.

> [PATCH v9 2/5] initdb: replace check_icu_locale() with
>   default_icu_locale().
>
> I would keep the #ifdef USE_ICU inside the lower-level function
> default_icu_locale(), like it was before, so that the higher-level
> setlocales() doesn't need to know about it.
>
> Otherwise ok.

Done and committed 0002.

>
> [PATCH v9 3/5] initdb: emit message when using default ICU locale.

Done and committed 0003.

> [PATCH v9 4/5] Validate ICU locales.
>
> Also here, let's keep the #ifdef USE_ICU in the lower-level function
> and
> move more logic in there.  Otherwise you have to repeat various
> things
> in DefineCollation() and createdb().

Done.

> I'm not sure we need the IsBinaryUpgrade checks.  Users can set
> icu_validation_level on the target instance if they don't want that.

I committed a version that still performs the checks during binary
upgrade, but degrades the message to a WARNING if it's set higher than
that. I tried some upgrades with invalid locales, and getting an error
deep in the logs after the upgrade actually starts is not very user-
friendly. We could add something during the --check phase, which would
be more helpful, but I didn't do that for this patch.


Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it still
seems like a good idea to get the locales into a standard format in the
catalogs, and if a lot more people start using ICU in v16 (because it's
the default), then it would be a good time to do it. But perhaps there
are risks?


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 30.03.23 04:33, Jeff Davis wrote:
> Attached is a new version of the final patch, which performs
> canonicalization. I'm not 100% sure that it's wanted, but it still
> seems like a good idea to get the locales into a standard format in the
> catalogs, and if a lot more people start using ICU in v16 (because it's
> the default), then it would be a good time to do it. But perhaps there
> are risks?

I say, let's do it.


I don't think we should show the notice when the canonicalization 
doesn't change anything.  This is not useful:

+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"

Also, the message should be phrased more from the perspective of the 
user instead of using ICU jargon, like

NOTICE:  using canonicalized form "%s" for locale specification "%s"

(Still too many big words?)


I don't think the special handling of IsBinaryUpgrade is needed or 
wanted.  I would hope that with this feature, all old-style locale IDs 
would go away, but this way we would keep them forever.  If we believe 
that canonicalization is safe, then I don't see why we cannot apply it 
during binary upgrade.


Needs documentation updates in doc/src/sgml/charset.sgml.




Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:

> I don't think the special handling of IsBinaryUpgrade is needed or
> wanted.  I would hope that with this feature, all old-style locale
> IDs
> would go away, but this way we would keep them forever.  If we
> believe
> that canonicalization is safe, then I don't see why we cannot apply
> it
> during binary upgrade.

There are two issues:

1. Failures can occur. For instance, if an invalid attribute is used,
like '@collStrength=primary', then we can't canonicalize it (or if we
do, it could end up being not what the user intended).

2. Version 15 and earlier have a subtle bug: it passes the raw locale
straight to ucol_open(), and if the locale is "fr_CA.UTF-8" ucol_open()
mis-parses it to have language "fr" with no region. If you canonicalize
first, it properly parses the locale and produces "fr-CA", which
results in a different collator. The 15 behavior is wrong, and this
canonicalization patch will fix it, but it doesn't do so during
pg_upgrade because that could change the collator and corrupt an index.

The current patch deals with these problems by simply preserving the
locale (valid or not) during pg_upgrade, and only canonicalizing new
collations and databases (so #2 is only fixed for new
collations/databases). I think that's a good trade-off because a lot
more users will be on ICU now that it's the default, so let's avoid
creating more of the problem cases for those new users.

To get to perfectly-canonicalized catalogs for upgrades from earlier
versions:

* We need a way to detect #2, which I posted some code for in an
uncommitted revision[1] of this patch series.

* We need a way to detect #1 and #2 during the pg_upgrade --check
phase.

* We need actions that the user can take to correct the problems. I
have some ideas but they could use some dicsussion.

I'm not sure all of those will be ready for v16, though.

Regards,
    Jeff Davis

[1] See check_equivalent_icu_locales() and calling code here:
https://www.postgresql.org/message-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com





Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:
> I don't think we should show the notice when the canonicalization
> doesn't change anything.  This is not useful:
>
> +NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-
> upper"

Done.

> Also, the message should be phrased more from the perspective of the
> user instead of using ICU jargon, like
>
> NOTICE:  using canonicalized form "%s" for locale specification "%s"
>
> (Still too many big words?)

Changed to:

   NOTICE:  using standard form "%s" for locale "%s"

> Needs documentation updates in doc/src/sgml/charset.sgml.

I made a very minor update. Do you have something more specific in
mind?

Regards,
    Jeff Davis



Attachment

Re: ICU locale validation / canonicalization

From
Peter Eisentraut
Date:
On 31.03.23 12:11, Jeff Davis wrote:
> On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:
>> I don't think we should show the notice when the canonicalization
>> doesn't change anything.  This is not useful:
>>
>> +NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-
>> upper"
> 
> Done.
> 
>> Also, the message should be phrased more from the perspective of the
>> user instead of using ICU jargon, like
>>
>> NOTICE:  using canonicalized form "%s" for locale specification "%s"
>>
>> (Still too many big words?)
> 
> Changed to:
> 
>     NOTICE:  using standard form "%s" for locale "%s"
> 
>> Needs documentation updates in doc/src/sgml/charset.sgml.
> 
> I made a very minor update. Do you have something more specific in
> mind?

This all looks good to me.




Re: ICU locale validation / canonicalization

From
Thomas Munro
Date:
MSVC now says this on master:

[17:48:12.446] c:\cirrus\src\backend\utils\adt\pg_locale.c(2912) :
warning C4715: 'icu_language_tag': not all control paths return a
value

CI doesn't currently fail for MSVC warnings, so it's a bit hidden.
FWIW cfbot does show this with a ⚠ sign with its new system for
grovelling through logs, which will now show up on every entry now
that this warning is in master.



Re: ICU locale validation / canonicalization

From
Noah Misch
Date:
On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:
> On 30.03.23 04:33, Jeff Davis wrote:
> >Attached is a new version of the final patch, which performs
> >canonicalization. I'm not 100% sure that it's wanted, but it still
> >seems like a good idea to get the locales into a standard format in the
> >catalogs, and if a lot more people start using ICU in v16 (because it's
> >the default), then it would be a good time to do it. But perhaps there
> >are risks?
> 
> I say, let's do it.

The following is not cause for postgresql.git changes at this time, but I'm
sharing it in case it saves someone else the study effort.  Commit ea1db8a
("Canonicalize ICU locale names to language tags.") slowed buildfarm member
hoverfly, but that disappears if I drop debug_parallel_query from its config.
Typical end-to-end duration rose from 2h5m to 2h55m.  Most-affected were
installcheck runs, which rose from 11m to 19m.  (The "check" stage uses
NO_LOCALE=1, so it changed less.)  From profiles, my theory is that each of
the many parallel workers burns notable CPU and I/O opening its ICU collator
for the first time.  debug_parallel_query, by design, pursues parallelism
independent of cost, so this is working as intended.  If it ever matters in
non-debug configurations, we might raise the default parallel_setup_cost or
pre-load ICU collators in the postmaster.



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote:
> On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:
> > On 30.03.23 04:33, Jeff Davis wrote:
> > > Attached is a new version of the final patch, which performs
> > > canonicalization. I'm not 100% sure that it's wanted, but it
> > > still
> > > seems like a good idea to get the locales into a standard format
> > > in the
> > > catalogs, and if a lot more people start using ICU in v16
> > > (because it's
> > > the default), then it would be a good time to do it. But perhaps
> > > there
> > > are risks?
> >
> > I say, let's do it.
>
> The following is not cause for postgresql.git changes at this time,
> but I'm
> sharing it in case it saves someone else the study effort.  Commit
> ea1db8a
> ("Canonicalize ICU locale names to language tags.") slowed buildfarm
> member
> hoverfly, but that disappears if I drop debug_parallel_query from its
> config.
> Typical end-to-end duration rose from 2h5m to 2h55m.  Most-affected
> were
> installcheck runs, which rose from 11m to 19m.  (The "check" stage
> uses
> NO_LOCALE=1, so it changed less.)  From profiles, my theory is that
> each of
> the many parallel workers burns notable CPU and I/O opening its ICU
> collator
> for the first time.

I didn't repro the overall test timings (mine is ~1m40s compared to
~11-19m on hoverfly) but I think a microbenchmark on the ICU calls
showed a possible cause.

I ran open in a loop 10M times on the requested locale. The root locale
("und"[1], "root" and "") take about 1.3s to open 10M times; simple
locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at
3.3s.

Unrecognized locales like "xyz" take about 10 times as long: 13s to
open 10M times, presumably to perform the fallback logic that
ultimately opens the root locale. Not sure if 10X slower in the open
path is enough to explain the overall test slowdown.

My guess is that the ICU locale for these tests is not recognized, or
is some other locale that opens slowly. Can you tell me the actual
daticulocale?

Regards,
    Jeff Davis

[1] It appears that "und" is also slow to open in ICU < 64. Hoverfly is
on v58, so it's possible that's the problem if daticulocale=und.




Re: ICU locale validation / canonicalization

From
Noah Misch
Date:
On Sat, May 20, 2023 at 10:19:30AM -0700, Jeff Davis wrote:
> On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote:
> > On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:
> > > On 30.03.23 04:33, Jeff Davis wrote:
> > > > Attached is a new version of the final patch, which performs
> > > > canonicalization. I'm not 100% sure that it's wanted, but it
> > > > still
> > > > seems like a good idea to get the locales into a standard format
> > > > in the
> > > > catalogs, and if a lot more people start using ICU in v16
> > > > (because it's
> > > > the default), then it would be a good time to do it. But perhaps
> > > > there
> > > > are risks?
> > > 
> > > I say, let's do it.
> > 
> > The following is not cause for postgresql.git changes at this time,
> > but I'm
> > sharing it in case it saves someone else the study effort.  Commit
> > ea1db8a
> > ("Canonicalize ICU locale names to language tags.") slowed buildfarm
> > member
> > hoverfly, but that disappears if I drop debug_parallel_query from its
> > config.
> > Typical end-to-end duration rose from 2h5m to 2h55m.  Most-affected
> > were
> > installcheck runs, which rose from 11m to 19m.  (The "check" stage
> > uses
> > NO_LOCALE=1, so it changed less.)  From profiles, my theory is that
> > each of
> > the many parallel workers burns notable CPU and I/O opening its ICU
> > collator
> > for the first time.
> 
> I didn't repro the overall test timings (mine is ~1m40s compared to
> ~11-19m on hoverfly) but I think a microbenchmark on the ICU calls
> showed a possible cause.
> 
> I ran open in a loop 10M times on the requested locale. The root locale
> ("und"[1], "root" and "") take about 1.3s to open 10M times; simple
> locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at
> 3.3s.
> 
> Unrecognized locales like "xyz" take about 10 times as long: 13s to
> open 10M times, presumably to perform the fallback logic that
> ultimately opens the root locale. Not sure if 10X slower in the open
> path is enough to explain the overall test slowdown.
> 
> My guess is that the ICU locale for these tests is not recognized, or
> is some other locale that opens slowly. Can you tell me the actual
> daticulocale?

As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-posix.  Check
got daticulocale=NULL.

(The machine in question was unusable for PostgreSQL from 2023-05-12 to
2023-06-30, due to https://stackoverflow.com/q/76369660/16371536.  That
delayed my response.)



Re: ICU locale validation / canonicalization

From
Jeff Davis
Date:
On Sat, 2023-07-01 at 10:31 -0700, Noah Misch wrote:
> As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-
> posix.  Check
> got daticulocale=NULL.

With the same test setup, that locale takes about 8.6 seconds (opening
it 10M times), about 2.5X slower than "en-US" and about 7X slower than
"und". I think that explains it.

The locale "en-US-u-va-posix" normally happens when passing a locale
beginning with "C" to ICU. After 2535c74b1a we don't get ICU locales
from the environment anywhere, so that should be rare (and probably
indicates a user mistake). I don't think this is a practical problem
any more.

Regards,
    Jeff Davis