Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Date
Msg-id 2a9013df9a197aa5eb2d85b5f7ec0c5a726b6a2e.camel@j-davis.com
Whole thread Raw
In response to Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses Re: ICU locale validation / canonicalization  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote:
> Couldn't we do this in a simpler way by just freeing the collator
> before
> the ereport() calls.

I committed a tiny patch to do this.

We still need to address the error inconsistency though. The problem is
that, in older ICU versions, if the fixup for "und@colNumeric=lower" ->
"root@colNumeric=lower" is applied, then icu_set_collation_attributes()
will throw an error reporting "root@colNumeric=lower", which is not
what the user typed.

We could fix that directly by passing the original string to
icu_set_collation_attributes() instead, or perhaps as an extra
parameter used only for the ereport().

I like the minor refactoring I did better, though. It puts the
ereports() close to each other, so any differences are more obvious.
And it seems cleaner to me for pg_ucol_open to close the UCollator
because it's the one that opened it. I don't have a strong opinion, but
that's my reasoning.

>   Or wrap a PG_TRY/PG_FINALLY around the whole thing?

I generally avoid PG_TRY/FINALLY unless it avoids some major
awkwardness or other problem.

> It would be nicer to not make the callers of
> icu_set_collation_attributes() responsible for catching and reporting
> the errors.

There's only one caller now: pg_ucol_open().

> [PATCH v8 2/4] initdb: emit message when using default ICU locale.
>
> I'm not able to make initdb print this message.  Under what
> circumstances am I supposed to see this?  Do you have some examples?

It happens when you don't specify --icu-locale. It is slightly
redundant with "ICU locale", but it lets you see that it came from the
environment rather than the command line:

-------------
$ initdb -D data
The files belonging to this database system will be owned by user
"someone".
This user must also own the server process.

Using default ICU locale "en_US_POSIX".
The database cluster will be initialized with this locale
configuration:
  provider:    icu
  ICU locale:  en_US_POSIX
...
-------------

That seems fairly useful for testing, etc., where initdb.log doesn't
show the command line options.

> The function check_icu_locale() has now gotten a lot more
> functionality
> than its name suggests.  Maybe the part that assigns the default ICU
> locale should be moved up one level to setlocales(), which has a
> better
> name and does something similar for the libc locale realm.

Agreed, done.

In fact, initdb.c:check_icu_locale() is completely unnecessary in that
patch, because as the comment points out, the backend will try to open
it during post-bootstrap initialization. I think it was simply a
mistake to try to do this validation in commit 27b62377b4.

The later validation patch does do some better validation at initdb
time to make sure the language can be found.

> [PATCH v8 3/4] Canonicalize ICU locale names to language tags.
>
> I'm still on the fence about whether we actually want to do this, but
> I'm warming up to it, now that the issues with pre-54 versions are
> fixed.
>
> But if we do this, the documentation needs to be updated.  There is a
> bunch of text there that says, like, you can do this format or that
> format, whatever you like.  At least the guidance should be changed
> there.
>
>
> [PATCH v8 4/4] Validate ICU locales.
>
> I would make icu_locale_validation true by default.

Agreed. I considered also not having a GUC, but it seems like some kind
of escape hatch is wise, at least for now.

> Or maybe it should be a log-level type option, so you can set it to
> error, warning, and also completely off?

As the validation patch seems closer to acceptance, I changed it to be
before the canonicalization patch. New series attached.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

pgsql-hackers by date:

Previous
From: Anton Kirilov
Date:
Subject: Add PQsendSyncMessage() to libpq
Next
From: Jeff Davis
Date:
Subject: Re: running logical replication as the subscription owner