Thread: ICU 54 and earlier are too dangerous

ICU 54 and earlier are too dangerous

From
Jeff Davis
Date:
In ICU 54 and earlier, if ucol_open() is unable to find a matching
locale, it will fall back to the *environment*.

Using ICU 54:

  initdb -D data -N --locale="en_US.UTF-8"
  pg_ctl -D data -l logfile start
  psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
  # returns true
  psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
  psql postgres -c "alter system set lc_messages='C'"
  pg_ctl -D data -l logfile restart
  # returns false and warns about collation version mismatch
  psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"

This was fixed in ICU 55 to fall back to the root locale instead[1],
which is stable, has a collator version, and is not dependent on the
environment. As far as I can tell, 55 and later never fall back to the
environment when opening a collator (unless you explicitly pass NULL to
ucol_open(), which is documented).

It would be nice if we could detect when this fallback-to-environment
happens, so that we could just refuse to create the bogus collation.
But I didn't find a good way. There are non-error return codes from
ucol_open() that seem promising[2], but they aren't actually very
useful to distinguish the fallback-to-environment case as far as I can
tell.

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

Regards,
    Jeff Davis


[1] https://icu.unicode.org/download/55m1
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c



Re: ICU 54 and earlier are too dangerous

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> In ICU 54 and earlier, if ucol_open() is unable to find a matching
> locale, it will fall back to the *environment*.

That's not great, but ...

> Unless someone has a better idea, I think we need to bump the minimum
> required ICU version to 55. That would solve the issue in v16 and
> later, but those using old versions of ICU and old versions of postgres
> would still be vulnerable to these kinds of typos.

... that seems like an overreaction.  We know from the buildfarm
that there's still a lot of old ICU out there.  Is it really improving
anybody's life to try to forbid them from using such a version?

            regards, tom lane



Re: ICU 54 and earlier are too dangerous

From
Andres Freund
Date:
Hi,

On 2023-03-13 16:39:04 -0700, Jeff Davis wrote:
> In ICU 54 and earlier, if ucol_open() is unable to find a matching
> locale, it will fall back to the *environment*.
> 
> Using ICU 54:
> 
>   initdb -D data -N --locale="en_US.UTF-8"
>   pg_ctl -D data -l logfile start
>   psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
>   # returns true
>   psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
>   psql postgres -c "alter system set lc_messages='C'"
>   pg_ctl -D data -l logfile restart
>   # returns false and warns about collation version mismatch
>   psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
> 
> This was fixed in ICU 55 to fall back to the root locale instead[1],
> which is stable, has a collator version, and is not dependent on the
> environment. As far as I can tell, 55 and later never fall back to the
> environment when opening a collator (unless you explicitly pass NULL to
> ucol_open(), which is documented).

> It would be nice if we could detect when this fallback-to-environment
> happens, so that we could just refuse to create the bogus collation.
> But I didn't find a good way. There are non-error return codes from
> ucol_open() that seem promising[2], but they aren't actually very
> useful to distinguish the fallback-to-environment case as far as I can
> tell.

What non-error code is returned in the above example?

Can we query the returned collator and see if it matches what we were looking
for?


> Unless someone has a better idea, I think we need to bump the minimum
> required ICU version to 55. That would solve the issue in v16 and
> later, but those using old versions of ICU and old versions of postgres
> would still be vulnerable to these kinds of typos.

I'm a bit confused by the dates. https://icu.unicode.org/download/55m1 says
that version was released 2014-12-17, but the linked issue around root locales
is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823  - I guess
the issue tracker was migrated at some point or such...

If indeed 2014 is the correct year of release, then it might be ok to increase
the minimum version...

Greetings,

Andres Freund



Re: ICU 54 and earlier are too dangerous

From
Peter Eisentraut
Date:
On 14.03.23 01:26, Tom Lane wrote:
>> Unless someone has a better idea, I think we need to bump the minimum
>> required ICU version to 55. That would solve the issue in v16 and
>> later, but those using old versions of ICU and old versions of postgres
>> would still be vulnerable to these kinds of typos.
> ... that seems like an overreaction.  We know from the buildfarm
> that there's still a lot of old ICU out there.  Is it really improving
> anybody's life to try to forbid them from using such a version?

If I'm getting the dates right, the 10-year support of RHEL 7 will 
expire in June 2024.  So if we follow past practices, we could drop 
support for RHEL 7 in PG17.  This would allow us to drop support for old 
libicu, and also old openssl, zlib, maybe more.

So if we don't feel like we need to do an emergency change here, there 
is a path to do this in a principled way in the near future.




Re: ICU 54 and earlier are too dangerous

From
Jeff Davis
Date:
On Mon, 2023-03-13 at 18:13 -0700, Andres Freund wrote:
> What non-error code is returned in the above example?

When the collator for locale "asdf" is opened, the status is set to
U_USING_DEFAULT_WARNING.

That seemed very promising at first, but it's the same thing returned
after opening most valid locales, including "en" and "en-US". It seems
to only return U_ZERO_ERROR on an exact hit, like "fr-CA" or "root".

There's also U_USING_FALLBACK_WARNING, which also seemed promising, but
it's returned when opening "fr-FR" or "ja-JP" (falls back to "fr" and
"ja" respectively).

> Can we query the returned collator and see if it matches what we were
> looking
> for?

I tried a few variations of that in my canonicalization / validation
patch, which I called "validation". The closest thing I found is:

   ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status)

We could strip away the attributes and compare to the result of that,
and it mostly works. There are a few complications, like I think we
need to preserve the "collation" attribute for things like
"de@collation=phonebook".

Another thing to consider is that the environment might happen to open
the collation you intend at the time the collation is created, but then
later of course the environment can change, so we'd have to check every
time it's opened. And getting an error when the collation is opened is
not great, so it might need to be a WARNING or something, and it starts
to get less useful.

What would be *really* nice is if there was some kind of way to tell if
there was no real match to a known locale, either during open or via
some other API. I wasn't able to find one, though.

Actually, now that I think about it, we could just search all known
locales using either ucol_getAvailable() or uloc_getAvailable(), and
see if there's a match. Not very clean, but it should catch most
problems. I'll look into whether there's a reasonable way to match or
not.

>
> I'm a bit confused by the dates.
> https://icu.unicode.org/download/55m1 says
> that version was released 2014-12-17, but the linked issue around
> root locales
> is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823  - I
> guess
> the issue tracker was migrated at some point or such...

The dates are misleading in both git (migrated from SVN circa 2016) and
JIRA (migrated circa 2018, see
https://unicode-org.atlassian.net/browse/ICU-1 ). It seems 55.1 was
released in either 2014 or 2015.


Regards,
    Jeff Davis




Re: ICU 54 and earlier are too dangerous

From
Jeff Davis
Date:
On Tue, 2023-03-14 at 08:48 -0700, Jeff Davis wrote:
> Actually, now that I think about it, we could just search all known
> locales using either ucol_getAvailable() or uloc_getAvailable(), and
> see if there's a match. Not very clean, but it should catch most
> problems. I'll look into whether there's a reasonable way to match or
> not.

I posted a patch to do this as 0006 in the series here:

https://www.postgresql.org/message-id/9afa6dbe0d31053ad265aeba488fde784fd5b7ab.camel@j-davis.com

Regards,
    Jeff Davis