Re: ICU 54 and earlier are too dangerous - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU 54 and earlier are too dangerous
Date
Msg-id df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
In response to Re: ICU 54 and earlier are too dangerous  (Andres Freund <andres@anarazel.de>)
Responses Re: ICU 54 and earlier are too dangerous  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Mon, 2023-03-13 at 18:13 -0700, Andres Freund wrote:
> What non-error code is returned in the above example?

When the collator for locale "asdf" is opened, the status is set to
U_USING_DEFAULT_WARNING.

That seemed very promising at first, but it's the same thing returned
after opening most valid locales, including "en" and "en-US". It seems
to only return U_ZERO_ERROR on an exact hit, like "fr-CA" or "root".

There's also U_USING_FALLBACK_WARNING, which also seemed promising, but
it's returned when opening "fr-FR" or "ja-JP" (falls back to "fr" and
"ja" respectively).

> Can we query the returned collator and see if it matches what we were
> looking for?

I tried a few variations of that, under the name "validation", in my
canonicalization / validation patch. The closest thing I found is:

   ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status)

We could strip away the attributes and compare to the result of that,
and it mostly works. There are a few complications, like I think we
need to preserve the "collation" attribute for things like
"de@collation=phonebook".
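
A minimal sketch of that strip-and-compare idea, in plain C so it builds
without ICU. The helper names are hypothetical and not taken from the patch.
It assumes the string passed as valid_locale came from
ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status), and that both
strings are already in the same canonical form ("fr_CA" rather than "fr-CA").
Whether ICU reports the "collation" attribute in that result is also an
assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the base locale name into out and drop every
 * "@key=value" attribute except "collation". Returns false if out is too
 * small to hold the result.
 */
static bool
strip_attributes_keep_collation(const char *locale, char *out, size_t outlen)
{
    const char *at = strchr(locale, '@');
    size_t      baselen = at ? (size_t) (at - locale) : strlen(locale);
    const char *p;

    if (baselen + 1 > outlen)
        return false;
    memcpy(out, locale, baselen);
    out[baselen] = '\0';

    if (at == NULL)
        return true;            /* no attributes to strip */

    /* Walk the ';'-separated attribute list that follows '@'. */
    p = at + 1;
    while (*p != '\0')
    {
        size_t len = strcspn(p, ";");

        if (len > 10 && strncmp(p, "collation=", 10) == 0)
        {
            /* Re-attach only the collation attribute. */
            int n = snprintf(out + baselen, outlen - baselen,
                             "@%.*s", (int) len, p);

            return n >= 0 && (size_t) n < outlen - baselen;
        }
        p += len;
        if (*p == ';')
            p++;
    }
    return true;
}

/*
 * Hypothetical check: does the requested locale, minus its non-collation
 * attributes, equal the locale that the collator reports as valid?
 */
static bool
opened_locale_matches(const char *requested, const char *valid_locale)
{
    char stripped[256];

    if (!strip_attributes_keep_collation(requested, stripped, sizeof(stripped)))
        return false;
    return strcmp(stripped, valid_locale) == 0;
}
```

With this, "de@colNumeric=yes;collation=phonebook" is reduced to
"de@collation=phonebook" before the comparison, and "fr_CA@colNumeric=yes" is
reduced to "fr_CA".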

Another thing to consider is that the environment might happen to open
the collation you intend at the time the collation is created, but the
environment can change later, so we'd have to check every time it's
opened. Getting an error when the collation is opened is not great, so
it might need to be a WARNING or something, at which point the check
starts to get less useful.

What would be *really* nice is if there was some kind of way to tell if
there was no real match to a known locale, either during open or via
some other API. I wasn't able to find one, though.

Actually, now that I think about it, we could just search all known
locales using either ucol_getAvailable() or uloc_getAvailable(), and
see if there's a match. Not very clean, but it should catch most
problems. I'll look into whether there's a reasonable way to match or
not.
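
A rough sketch of what that search could look like. To keep it buildable
without ICU, the two enumeration functions are passed in as pointers; the
typedefs, helper names, and the demo list are all hypothetical. The matching
rule (exact match, treating '-' and '_' as equivalent) is also just an
assumption, since what counts as a reasonable match is still an open
question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same shapes as uloc_countAvailable() and uloc_getAvailable(). */
typedef int32_t (*count_fn) (void);
typedef const char *(*get_fn) (int32_t);

/* Compare two locale names, treating '-' and '_' as the same separator. */
static bool
same_locale_name(const char *a, const char *b)
{
    for (; *a != '\0' && *b != '\0'; a++, b++)
    {
        char ca = (*a == '-') ? '_' : *a;
        char cb = (*b == '-') ? '_' : *b;

        if (ca != cb)
            return false;
    }
    return *a == '\0' && *b == '\0';
}

/* Linear search over every locale the enumeration functions report. */
static bool
locale_is_available(const char *name, count_fn count, get_fn get)
{
    int32_t n = count();

    for (int32_t i = 0; i < n; i++)
    {
        const char *candidate = get(i);

        if (candidate != NULL && same_locale_name(name, candidate))
            return true;
    }
    return false;
}

/* Stand-in data for illustration only; not ICU's real locale list. */
static const char *const demo_locales[] = {"en", "en_US", "fr", "fr_CA", "ja"};

static int32_t
demo_count(void)
{
    return (int32_t) (sizeof(demo_locales) / sizeof(demo_locales[0]));
}

static const char *
demo_get(int32_t i)
{
    return demo_locales[i];
}
```

Against ICU itself, the call would presumably be
locale_is_available(name, uloc_countAvailable, uloc_getAvailable), or the
ucol_countAvailable() / ucol_getAvailable() pair to restrict the search to
locales with collation data. With exact matching, a name like "fr-FR" is
rejected whenever only "fr" is listed, so some fallback-aware rule would be
needed on top of this.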

>
> I'm a bit confused by the dates.
> https://icu.unicode.org/download/55m1 says that version was released
> 2014-12-17, but the linked issue around root locales is from 2018:
> https://unicode-org.atlassian.net/browse/ICU-10823 - I guess the issue
> tracker was migrated at some point or such...

The dates are misleading in both git (migrated from SVN circa 2016) and
JIRA (migrated circa 2018, see
https://unicode-org.atlassian.net/browse/ICU-1 ). It seems 55.1 was
released in either 2014 or 2015.


Regards,
    Jeff Davis



