Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Date
Msg-id fccd3064fa38285d2b71c2cc46a3eae8e5c6d4fb.camel@j-davis.com
In response to Re: ICU locale validation / canonicalization  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: ICU locale validation / canonicalization
List pgsql-hackers
On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how to
> make that happen are any good or not. But I hope they are, because the
> current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple of cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail; warn and continue with the
user-entered representation; or turn off the strictness checking, come
up with some BCP 47 tag, and see if it resolves to the same collator.
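
To make the three options concrete, here is a rough sketch in Python
(not PostgreSQL code, and not the attached test program). Everything in
it is hypothetical: the `OnFailure` names, `resolve_locale`, and the two
callbacks `to_tag` and `same_collator`, which stand in for whatever
ICU-backed helpers a real implementation would use. The toy stand-ins at
the bottom only exist so the sketch runs; they make no claim about what
ICU actually returns.

```python
import warnings
from enum import Enum


class OnFailure(Enum):
    ERROR = "error"      # reject the locale outright
    WARN = "warn"        # warn and keep the user-entered string
    LENIENT = "lenient"  # retry without strictness, then compare collators


def resolve_locale(user_locale, policy, to_tag, same_collator):
    """Return the string to store in the catalog for user_locale.

    to_tag(locale, strict) -> str or None and same_collator(a, b) -> bool
    are hypothetical callbacks, not real ICU or PostgreSQL functions.
    """
    tag = to_tag(user_locale, True)
    if tag is not None:
        return tag
    if policy is OnFailure.ERROR:
        raise ValueError(f"cannot convert {user_locale!r} to a BCP 47 tag")
    if policy is OnFailure.WARN:
        warnings.warn(f"keeping {user_locale!r} as entered")
        return user_locale
    # LENIENT: accept a non-strict tag only if it opens the same collator.
    tag = to_tag(user_locale, False)
    if tag is not None and same_collator(user_locale, tag):
        return tag
    raise ValueError(f"no equivalent BCP 47 tag for {user_locale!r}")


# Toy stand-ins, for demonstration only: strict conversion rejects any
# "@..." attributes, non-strict conversion silently drops them.
def toy_to_tag(locale, strict):
    base, _, attrs = locale.partition("@")
    if attrs and strict:
        return None
    return base.replace("_", "-")


def toy_same_collator(a, b):
    return True  # pretend the dropped attribute did not affect collation
```

With these stand-ins, `resolve_locale("en_US@foo=bar", OnFailure.WARN,
toy_to_tag, toy_same_collator)` emits a warning and returns the input
unchanged, while `OnFailure.ERROR` raises `ValueError`.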

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than
"en-US-u-ks-level1" (the equivalent language tag). And the format is
specified[1], even though it's not an independent standard. But I think
the benefits of better validation, an independent standard, and the fact
that we're already favoring BCP 47 outweigh my subjective opinion.
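
For readers less familiar with the two spellings, the following toy
Python converter shows how the ICU-format keyword maps onto the "-u-"
extension of a language tag. It is only an illustration under narrow
assumptions: the function name and the tables are made up for this
sketch, the tables cover just two collation keywords, and real
conversion (which handles scripts, variants, and much more) should be
left to ICU.

```python
# Partial, hand-written tables: ICU keyword -> BCP 47 key, and ICU value
# -> BCP 47 value. Only two collation settings are covered here.
KEYWORDS = {"colstrength": "ks", "colalternate": "ka"}
VALUES = {
    "ks": {"primary": "level1", "secondary": "level2",
           "tertiary": "level3", "quaternary": "level4",
           "identical": "identic"},
    "ka": {"non-ignorable": "noignore", "shifted": "shifted"},
}


def to_language_tag(locale_id):
    """Convert a simple ICU-format locale ID to a language tag (toy)."""
    base, _, attrs = locale_id.partition("@")
    tag = base.replace("_", "-")
    if not attrs:
        return tag
    subtags = []
    for item in attrs.split(";"):
        name, _, value = item.partition("=")
        key = KEYWORDS.get(name.lower())
        # Unknown keywords or values are rejected rather than guessed at.
        if key is None or value.lower() not in VALUES[key]:
            raise ValueError(f"unsupported attribute: {item!r}")
        subtags.append((key, VALUES[key][value.lower()]))
    # Keys are emitted in sorted order within the "-u-" extension.
    ext = "-".join(f"{k}-{v}" for k, v in sorted(subtags))
    return f"{tag}-u-{ext}"


print(to_language_tag("en_US@colstrength=primary"))  # en-US-u-ks-level1
```

An input such as "en_US@foo=bar" raises `ValueError` in this sketch,
which mirrors the unsupported-attribute case mentioned earlier.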

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1] https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS



