Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: ICU locale validation / canonicalization |
Date | |
Msg-id | c15981f01fba3c0cbcdb49a6d43cdf0a90cd77ff.camel@j-davis.com |
In response to | Re: ICU locale validation / canonicalization (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses | Re: ICU locale validation / canonicalization |
List | pgsql-hackers |
On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote:
> This patch appears to do about three things at once, and it's not
> clear exactly where the boundaries are between them and which ones
> we might actually want. And I think the terminology also gets mixed
> up a bit, which makes following this harder.
>
> 1. Canonicalizing the locale string. This is presumably what
> uloc_canonicalize() does, which the patch doesn't actually use. What
> are examples of what this does? Does the patch actually do this?

Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2
Canonicalization, which is described here:

https://unicode-org.github.io/icu/userguide/locale/#canonicalization

> 2. Converting the locale string to BCP 47 format. This converts
> 'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
> uloc_getLanguageTag() does.

Yes, though uloc_getLanguageTag() also canonicalizes. I consider
converting to the language tag a part of "canonicalization", because
it's the canonical form we agreed on in this thread.

> 3. Validating the locale string, to reject faulty input.

Canonicalization doesn't make sure the locale actually exists in ICU,
so it's easy to make a typo like "jp_JP" instead of "ja_JP". After
canonicalizing to a language tag, the former is "jp-JP" (resolving to
the collator with valid locale "root") and the latter is "ja-JP"
(resolving to the collator with valid locale "ja"). The former is
clearly a mistake, and I call catching that mistake "validation".

If the user specifies something other than the root locale (i.e. not
"root", "und", or ""), and the locale resolves to a collator with a
valid locale of "root", then this patch considers that to be a mistake
and issues a WARNING (upgraded to ERROR if the GUC
icu_locale_validation is true).

> What are the relationships between these?

1 & 2 are closely related. If we canonicalize, we need to pick one
canonical form: either BCP 47 or ICU format locale IDs.

3 is related, but can be seen as an independent change.

> I don't understand how the validation actually happens in your patch.
> Does uloc_getLanguageTag() do the validation also?

Using the above definition of "validation" it happens inside
icu_collator_exists().

> Can you do canonicalization without converting to language tag?

If we used uloc_canonicalize(), it would give us ICU format locale IDs,
and that would be a valid thing to do; and we could switch the
canonical form from ICU format locale IDs to BCP 47 in a separate
patch. I don't have a strong opinion, but if we're going to
canonicalize, I think it makes sense to go straight to language tags.

> Can you do validation of un-canonicalized locale names?

Yes, though I feel like an un-canonicalized name is less stable in
meaning, and so validation on that name may also be less stable.

For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to
plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA".
Will the uncanonicalized name always resolve to "fr"? I'm not sure,
because the documentation says that ucol_open() expects either an ICU
format locale ID or, preferably, a language tag.

So they are technically independently useful changes, but I would
recommend that canonicalization goes in first.

> What is the guidance for the use of the icu_locale_validation GUC?

If an error when creating a new collation or database due to a bad
locale name would be highly disruptive, leave it false. If such an
error would be helpful to make sure you get the locale you expect, then
turn it on. In practice, existing important production systems would
leave it off; new systems could turn it on to help avoid
misconfigurations/mistakes.

> The description throws in yet another term: "validates that ICU
> locale strings are well-formed". What is "well-formed"? How does
> that relate to the other concepts?

Good point, I don't think I need to redefine "validation". Maybe I
should just describe it as elevating canonicalization or validation
problems from WARNING to ERROR.

> Personally, I'm not on board with this behavior:
>
> => CREATE COLLATION test (provider = icu, locale =
> 'de@collation=phonebook');
> NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
> "de@collation=phonebook"
>
> I mean, maybe that is a thing we want to do somehow sometime, to
> migrate people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

But at the same time I can see it being annoying or confusing. If it's
confusing, perhaps a wording change and documentation would improve it?
If it's annoying, we might need to have an option and/or a different
log level?

> It also doesn't appear to address
> how to handle ICU before version 54.

Do you have a specific concern here?

Regards,
	Jeff Davis
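
A minimal sketch of the validation rule described in the message above,
not the patch's actual code. The names locale_check_result,
is_root_locale() and check_resolved_locale() are hypothetical. The
sketch assumes the caller has already opened the collator and obtained
its valid locale (in ICU this would typically come from
ucol_getLocaleByType() with ULOC_VALID_LOCALE), so no ICU calls appear
here.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result codes; the patch itself reports via WARNING/ERROR. */
typedef enum { LOCALE_OK, LOCALE_WARNING, LOCALE_ERROR } locale_check_result;

/* The root locale may be spelled "root", "und", or the empty string. */
static bool
is_root_locale(const char *name)
{
    return name[0] == '\0' ||
           strcmp(name, "root") == 0 ||
           strcmp(name, "und") == 0;
}

/*
 * requested:    the locale string the user asked for
 * valid_locale: the valid locale the collator resolved to (assumed to be
 *               supplied by the caller)
 * strict:       stands in for the icu_locale_validation GUC
 *
 * A non-root request that resolves to the root collator is treated as a
 * probable typo: WARNING by default, ERROR when strict is true.
 */
static locale_check_result
check_resolved_locale(const char *requested, const char *valid_locale,
                      bool strict)
{
    if (!is_root_locale(requested) && is_root_locale(valid_locale))
        return strict ? LOCALE_ERROR : LOCALE_WARNING;
    return LOCALE_OK;
}
```

With these assumptions, "jp-JP" resolving to "root" is flagged, while
"ja-JP" resolving to "ja" and an explicit "und" request are accepted.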