Thread: ICU 54 and earlier are too dangerous
In ICU 54 and earlier, if ucol_open() is unable to find a matching locale, it will fall back to the *environment*. Using ICU 54: initdb -D data -N --locale="en_US.UTF-8" pg_ctl -D data -l logfile start psql postgres -c "create collation asdf(provider=icu, locale='asdf')" # returns true psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf" psql postgres -c "alter system set lc_messages='C'" pg_ctl -D data -l logfile restart # returns false and warns about collation version mismatch psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf" This was fixed in ICU 55 to fall back to the root locale instead[1], which is stable, has a collator version, and is not dependent on the environment. As far as I can tell, 55 and later never fall back to the environment when opening a collator (unless you explicitly pass NULL to ucol_open(), which is documented). It would be nice if we could detect when this fallback-to-environment happens, so that we could just refuse to create the bogus collation. But I didn't find a good way. There are non-error return codes from ucol_open() that seem promising[2], but they aren't actually very useful to distinguish the fallback-to-environment case as far as I can tell. Unless someone has a better idea, I think we need to bump the minimum required ICU version to 55. That would solve the issue in v16 and later, but those using old versions of ICU and old versions of postgres would still be vulnerable to these kinds of typos. Regards, Jeff Davis [1] https://icu.unicode.org/download/55m1 [2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c
Jeff Davis <pgsql@j-davis.com> writes: > In ICU 54 and earlier, if ucol_open() is unable to find a matching > locale, it will fall back to the *environment*. That's not great, but ... > Unless someone has a better idea, I think we need to bump the minimum > required ICU version to 55. That would solve the issue in v16 and > later, but those using old versions of ICU and old versions of postgres > would still be vulnerable to these kinds of typos. ... that seems like an overreaction. We know from the buildfarm that there's still a lot of old ICU out there. Is it really improving anybody's life to try to forbid them from using such a version? regards, tom lane
Hi, On 2023-03-13 16:39:04 -0700, Jeff Davis wrote: > In ICU 54 and earlier, if ucol_open() is unable to find a matching > locale, it will fall back to the *environment*. > > Using ICU 54: > > initdb -D data -N --locale="en_US.UTF-8" > pg_ctl -D data -l logfile start > psql postgres -c "create collation asdf(provider=icu, locale='asdf')" > # returns true > psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf" > psql postgres -c "alter system set lc_messages='C'" > pg_ctl -D data -l logfile restart > # returns false and warns about collation version mismatch > psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf" > > This was fixed in ICU 55 to fall back to the root locale instead[1], > which is stable, has a collator version, and is not dependent on the > environment. As far as I can tell, 55 and later never fall back to the > environment when opening a collator (unless you explicitly pass NULL to > ucol_open(), which is documented). > It would be nice if we could detect when this fallback-to-environment > happens, so that we could just refuse to create the bogus collation. > But I didn't find a good way. There are non-error return codes from > ucol_open() that seem promising[2], but they aren't actually very > useful to distinguish the fallback-to-environment case as far as I can > tell. What non-error code is returned in the above example? Can we query the returned collator and see if it matches what we were looking for? > Unless someone has a better idea, I think we need to bump the minimum > required ICU version to 55. That would solve the issue in v16 and > later, but those using old versions of ICU and old versions of postgres > would still be vulnerable to these kinds of typos. I'm a bit confused by the dates. https://icu.unicode.org/download/55m1 says that version was released 2014-12-17, but the linked issue around root locales is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823 - I guess the issue tracker was migrated at some point or such... If indeed 2014 is the correct year of release, then it might be ok to increase the minimum version... Greetings, Andres Freund
On 14.03.23 01:26, Tom Lane wrote: >> Unless someone has a better idea, I think we need to bump the minimum >> required ICU version to 55. That would solve the issue in v16 and >> later, but those using old versions of ICU and old versions of postgres >> would still be vulnerable to these kinds of typos. > ... that seems like an overreaction. We know from the buildfarm > that there's still a lot of old ICU out there. Is it really improving > anybody's life to try to forbid them from using such a version? If I'm getting the dates right, the 10-year support of RHEL 7 will expire in June 2024. So if we follow past practices, we could drop support for RHEL 7 in PG17. This would allow us to drop support for old libicu, and also old openssl, zlib, maybe more. So if we don't feel like we need to do an emergency change here, there is a path to do this in a principled way in the near future.
On Mon, 2023-03-13 at 18:13 -0700, Andres Freund wrote: > What non-error code is returned in the above example? When the collator for locale "asdf" is opened, the status is set to U_USING_DEFAULT_WARNING. That seemed very promising at first, but it's the same thing returned after opening most valid locales, including "en" and "en-US". It seems to only return U_ZERO_ERROR on an exact hit, like "fr-CA" or "root". There's also U_USING_FALLBACK_WARNING, which also seemed promising, but it's returned when opening "fr-FR" or "ja-JP" (falls back to "fr" and "ja" respectively). > Can we query the returned collator and see if it matches what we were > looking > for? I tried a few variations of that in my canonicalization / validation patch, which I called "validation". The closest thing I found is: ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status) We could strip away the attributes and compare to the result of that, and it mostly works. There are a few complications, like I think we need to preserve the "collation" attribute for things like "de@collation=phonebook". Another thing to consider is that the environment might happen to open the collation you intend at the time the collation is created, but then later of course the environment can change, so we'd have to check every time it's opened. And getting an error when the collation is opened is not great, so it might need to be a WARNING or something, and it starts to get less useful. What would be *really* nice is if there was some kind of way to tell if there was no real match to a known locale, either during open or via some other API. I wasn't able to find one, though. Actually, now that I think about it, we could just search all known locales using either ucol_getAvailable() or uloc_getAvailable(), and see if there's a match. Not very clean, but it should catch most problems. I'll look into whether there's a reasonable way to match or not. > > I'm a bit confused by the dates. > https://icu.unicode.org/download/55m1 says > that version was released 2014-12-17, but the linked issue around > root locales > is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823 - I > guess > the issue tracker was migrated at some point or such... The dates are misleading in both git (migrated from SVN circa 2016) and JIRA (migrated circa 2018, see https://unicode-org.atlassian.net/browse/ICU-1 ). It seems 55.1 was released in either 2014 or 2015. Regards, Jeff Davis
On Tue, 2023-03-14 at 08:48 -0700, Jeff Davis wrote: > Actually, now that I think about it, we could just search all known > locales using either ucol_getAvailable() or uloc_getAvailable(), and > see if there's a match. Not very clean, but it should catch most > problems. I'll look into whether there's a reasonable way to match or > not. I posted a patch to do this as 0006 in the series here: https://www.postgresql.org/message-id/9afa6dbe0d31053ad265aeba488fde784fd5b7ab.camel@j-davis.com Regards, Jeff Davis