Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Date
Msg-id c15981f01fba3c0cbcdb49a6d43cdf0a90cd77ff.camel@j-davis.com
Whole thread Raw
In response to Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote:
> This patch appears to do about three things at once, and it's not
> clear
> exactly where the boundaries are between them and which ones we might
> actually want.  And I think the terminology also gets mixed up a bit,
> which makes following this harder.
>
> 1. Canonicalizing the locale string.  This is presumably what
> uloc_canonicalize() does, which the patch doesn't actually use.  What
> are examples of what this does?  Does the patch actually do this?

Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2
Canonicalization, which is described here:

https://unicode-org.github.io/icu/userguide/locale/#canonicalization

> 2. Converting the locale string to BCP 47 format.  This converts
> 'de@collation=phonebook' to 'de-u-co-phonebk'.  This is what
> uloc_getLanguageTag() does.

Yes, though uloc_getLanguageTag() also canonicalizes. I consider
converting to the language tag a part of "canonicalization", because
it's the canonical form we agreed on in this thread.

> 3. Validating the locale string, to reject faulty input.

Canonicalization doesn't make sure the locale actually exists in ICU,
so it's easy to make a typo like "jp_JP" instead of "ja_JP". After
canonicalizing to a language tag, the former is "jp-JP" (resolving to
the collator with valid locale "root") and the latter is "ja-JP"
(resolving to the collator with valid locale "ja"). The former is
clearly a mistake, and I call catching that mistake "validation".

If the user specifies something other than the root locale (i.e. not
"root", "und", or ""), and the locale resolves to a collator with a
valid locale of "root", then this patch considers that to be a mistake
and issues a WARNING (upgraded to ERROR if the GUC
icu_locale_validation is true).

> What are the relationships between these?

1 & 2 are closely related. If we canonicalize, we need to pick one
canonical form: either BCP 47 or ICU format locale IDs.

3 is related, but can be seen as an independent change.

> I don't understand how the validation actually happens in your patch.
> Does uloc_getLanguageTag() do the validation also?

Using the above definition of "validation" it happens inside
icu_collator_exists().

> Can you do canonicalization without converting to language tag?

If we used uloc_canonicalize(), it would give us ICU format locale IDs,
and that would be a valid thing to do; and we could switch the
canonical form from ICU format locale IDs to BCP 47 in a separate
patch. I don't have a strong opinion, but if we're going to
canonicalize, I think it makes sense to go straight to language tags.

> Can you do validation of un-canonicalized locale names?

Yes, though I feel like an un-canonicalized name is less stable in
meaning, and so validation on that name may also be less stable.

For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to
plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA".
Will the uncanonicalized name always resolve to "fr"? I'm not sure,
because the documentation says that ucol_open() expects either an ICU
format locale ID or, preferably, a language tag.

So they are technically independently useful changes, but I would
recommend that canonicalization goes in first.

> What is the guidance for the use of the icu_locale_validation GUC?

If an error when creating a new collation or database due to a bad
locale name would be highly disruptive, leave it false. If such an
error would be helpful to make sure you get the locale you expect, then
turn it on. In practice, existing important production systems would
leave it off; new systems could turn it on to help avoid
misconfigurations/mistakes.

> The description throws in yet another term: "validates that ICU
> locale
> strings are well-formed".  What is "well-formed"?  How does that
> relate
> to the other concepts?

Good point, I don't think I need to redefine "validation". Maybe I
should just describe it as elevating canonicalization or validation
problems from WARNING to ERROR.

> Personally, I'm not on board with this behavior:
>
> => CREATE COLLATION test (provider = icu, locale =
> 'de@collation=phonebook');
> NOTICE:  00000: using language tag "de-u-co-phonebk" for locale
> "de@collation=phonebook"
>
> I mean, maybe that is a thing we want to do somehow sometime, to
> migrate
> people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

But at the same time I can see it being annoying or confusing. If it's
confusing, perhaps a wording change and documentation would improve it?
If it's annoying, we might need to have an option and/or a different
log level?

> It also doesn't appear to address
> how to handle ICU before version 54.

Do you have a specific concern here?

Regards,
    Jeff Davis




pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Improve logging when using Huge Pages
Next
From: Tom Lane
Date:
Subject: Re: [PATCH] Add pretty-printed XML output option