
Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Re: ICU locale validation / canonicalization
Date	February 9, 2023 22:09:39
Msg-id	fccd3064fa38285d2b71c2cc46a3eae8e5c6d4fb.camel@j-davis.com
In response to	Re: ICU locale validation / canonicalization (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: ICU locale validation / canonicalization Re: ICU locale validation / canonicalization Re: ICU locale validation / canonicalization
List	pgsql-hackers


On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how to
> make that happen are any good or not. But I hope they are, because the
> current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple of cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail; warn and continue with the
user-entered representation; or turn off the strictness checking, come
up with some BCP 47 tag, and see if it resolves to the same collator.
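
To make those three options concrete, here is a minimal sketch in C. It
is not taken from icutool.c and does not link against ICU: the
conversion and the collator comparison are passed in as callbacks, and
every name in it (resolve_locale, FailPolicy, the toy_* stand-ins) is
hypothetical. In a real build the conversion callback would presumably
wrap ICU's uloc_toLanguageTag(), which takes a strict flag; the toy
stand-ins only imitate the two locales mentioned in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* What to do when strict conversion to a language tag fails. */
typedef enum { ON_FAIL_ERROR, ON_FAIL_WARN_KEEP, ON_FAIL_LENIENT } FailPolicy;

/* What ended up in the output buffer. */
typedef enum { RES_LANGTAG, RES_USER_INPUT, RES_ERROR } ResultKind;

/* Hypothetical callbacks; a real implementation would wrap ICU here. */
typedef bool (*ToLangTagFn)(const char *locale, bool strict,
                            char *out, size_t outlen);
typedef bool (*SameCollatorFn)(const char *locale, const char *langtag);

ResultKind
resolve_locale(const char *input, FailPolicy policy,
               ToLangTagFn to_tag, SameCollatorFn same_collator,
               char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    /* Always try the strict conversion first. */
    if (to_tag(input, true, out, outlen))
        return RES_LANGTAG;

    /* Option 2: warn and keep what the user typed. */
    if (policy == ON_FAIL_WARN_KEEP) {
        fprintf(stderr, "warning: could not convert \"%s\" to a language tag\n",
                input);
        snprintf(out, outlen, "%s", input);
        return RES_USER_INPUT;
    }

    /* Option 3: retry without strictness, accept only if the collator matches. */
    if (policy == ON_FAIL_LENIENT &&
        to_tag(input, false, out, outlen) &&
        same_collator(input, out))
        return RES_LANGTAG;

    /* Option 1 (and any failed retry): report an error. */
    out[0] = '\0';
    return RES_ERROR;
}

/* Toy stand-in for the converter: knows only the two examples from this mail. */
bool
toy_to_tag(const char *locale, bool strict, char *out, size_t outlen)
{
    if (strcmp(locale, "en_US@colstrength=primary") == 0) {
        snprintf(out, outlen, "en-US-u-ks-level1");
        return true;
    }
    /* Assumption for the demo: lenient mode drops the unsupported keyword. */
    if (strcmp(locale, "en_US@foo=bar") == 0 && !strict) {
        snprintf(out, outlen, "en-US");
        return true;
    }
    return false;
}

/* Toy stand-in: pretend both strings always open the same collator. */
bool
toy_same_collator(const char *locale, const char *langtag)
{
    (void) locale;
    (void) langtag;
    return true;
}
```

With these stand-ins, "en_US@foo=bar" yields an error under
ON_FAIL_ERROR, the unchanged input under ON_FAIL_WARN_KEEP, and "en-US"
under ON_FAIL_LENIENT. The third case is the one that worries me: the
result is only trustworthy if the collator comparison is real, which
the toy version does not attempt.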

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than
"en-US-u-ks-level1" (the equivalent language tag). And the format is
specified[1], even though it's not an independent standard. But I think
the benefits of better validation, an independent standard, and the fact
that we're already favoring BCP 47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1]
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment

icutool.c
