Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: ICU locale validation / canonicalization
Date
Msg-id 899ab44a-4307-064f-0945-412723d57c02@enterprisedb.com
In response to Re: ICU locale validation / canonicalization  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: ICU locale validation / canonicalization  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 28.02.23 06:57, Jeff Davis wrote:
> On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:
>>
>> New patch attached. The new patch also includes a GUC that (when
>> enabled) validates that the collator is actually found.
> 
> New patch attached.
> 
> Now it always preserves the exact locale string during pg_upgrade, and
> does not attempt to canonicalize it. Before it was trying to be clever
> by determining if the language tag was finding the same collator as the
> original string -- I didn't find a problem with that, but it just
> seemed a bit too clever. So, only newly-created locales and databases
> have the ICU locale string canonicalized to a language tag.
> 
> Also, I added a SQL function pg_icu_language_tag() that can convert
> locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear 
exactly where the boundaries are between them and which ones we might 
actually want.  And I think the terminology also gets mixed up a bit, 
which makes following this harder.

1. Canonicalizing the locale string.  This is presumably what 
uloc_canonicalize() does, which the patch doesn't actually use.  What 
are examples of what this does?  Does the patch actually do this?

2. Converting the locale string to BCP 47 format.  This converts 
'de@collation=phonebook' to 'de-u-co-phonebk'.  This is what 
uloc_toLanguageTag() does.

3. Validating the locale string, to reject faulty input.
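
To make the conversion in item 2 concrete, here is a toy sketch in Python. It is not ICU's algorithm and not what the patch does; it handles only the "collation" keyword, and the mapping table is a small hand-copied subset of the CLDR collation types. The name to_language_tag() is made up for illustration.

```python
# Toy illustration only: ICU's real conversion covers far more cases.
# Small subset of the legacy collation keyword values and their
# BCP 47 "co" equivalents.
LEGACY_TO_BCP47_COLLATION = {
    "phonebook": "phonebk",
    "traditional": "trad",
    "dictionary": "dict",
    "standard": "standard",
    "search": "search",
}


def to_language_tag(locale_id: str) -> str:
    """Convert a legacy ICU-style locale ID to a BCP 47-style tag (sketch)."""
    base, _, keywords = locale_id.partition("@")
    # Legacy IDs separate subtags with "_"; BCP 47 uses "-".
    tag = base.replace("_", "-")
    extension = []
    for item in filter(None, keywords.split(";")):
        key, _, value = item.partition("=")
        if key != "collation":
            raise ValueError(f"keyword not handled by this sketch: {key!r}")
        if value not in LEGACY_TO_BCP47_COLLATION:
            raise ValueError(f"unknown collation type: {value!r}")
        extension.append("co-" + LEGACY_TO_BCP47_COLLATION[value])
    if extension:
        tag += "-u-" + "-".join(extension)
    return tag


print(to_language_tag("de@collation=phonebook"))  # de-u-co-phonebk
```

Note that this sketch rejects unknown collation types, which already mixes a bit of validation (item 3) into the conversion (item 2).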

What are the relationships between these?

I don't understand how the validation actually happens in your patch. 
Does uloc_toLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale 
strings are well-formed".  What is "well-formed"?  How does that relate 
to the other concepts?
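
For comparison, RFC 5646 (BCP 47) distinguishes "well-formed" tags (syntactically correct) from "valid" ones (subtags actually registered).  Whether the patch means it in that sense is an assumption on my part.  A purely syntactic check could look roughly like the Python sketch below; the grammar is simplified (no extlang or grandfathered tags) and is_well_formed() is a made-up name.

```python
import re

# Simplified BCP 47 syntax; extlang and grandfathered tags are omitted.
_LANGTAG = re.compile(
    r"""
    ^[a-z]{2,8}                                # primary language
    (?:-[a-z]{4})?                             # optional script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                # optional region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*       # extensions such as -u-co-...
    (?:-x(?:-[a-z0-9]{1,8})+)?                 # private use
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Syntax check only; says nothing about whether a collator exists."""
    return _LANGTAG.match(tag) is not None


print(is_well_formed("de-u-co-phonebk"))         # True
print(is_well_formed("de@collation=phonebook"))  # False: not BCP 47 syntax
print(is_well_formed("de-u-co-nosuch"))          # True, yet no such collation
```

The last case is the interesting one: a tag can pass a syntax check and still not name any collator, so "well-formed" in this sense would be weaker than the validation in item 3.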

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale = 'de@collation=phonebook');
NOTICE:  00000: using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate 
people to the "new" spellings, but the old ones aren't wrong.  So this 
should be a separate consideration, with an option, and it would require 
various updates in the documentation.  It also doesn't appear to address 
how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?



