Thread: pgsql: Add option to use ICU as global locale provider
Add option to use ICU as global locale provider This adds the option to use ICU as the default locale provider for either the whole cluster or a database. New options for initdb, createdb, and CREATE DATABASE are used to select this. Since some (legacy) code still uses the libc locale facilities directly, we still need to set the libc global locale settings even if ICU is otherwise selected. So pg_database now has three locale-related fields: the existing datcollate and datctype, which are always set, and a new daticulocale, which is only set if ICU is selected. A similar change is made in pg_collation for consistency, but in that case, only the libc-related fields or the ICU-related field is set, never both. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/5e756dd6-0e91-d778-96fd-b1bcb06c161a%402ndquadrant.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/f2553d43060edb210b36c63187d52a632448e1d2 Modified Files -------------- doc/src/sgml/catalogs.sgml | 9 ++ doc/src/sgml/charset.sgml | 102 ++++++++++++++++ doc/src/sgml/ref/create_database.sgml | 32 +++++ doc/src/sgml/ref/createdb.sgml | 19 +++ doc/src/sgml/ref/initdb.sgml | 72 +++++++++--- src/backend/catalog/pg_collation.c | 18 ++- src/backend/commands/collationcmds.c | 97 +++++++++------ src/backend/commands/dbcommands.c | 157 +++++++++++++++++++++---- src/backend/utils/adt/pg_locale.c | 144 ++++++++++++++--------- src/backend/utils/init/postinit.c | 21 +++- src/bin/initdb/Makefile | 4 +- src/bin/initdb/initdb.c | 97 +++++++++++++-- src/bin/initdb/t/001_initdb.pl | 27 +++++ src/bin/pg_dump/pg_dump.c | 30 ++++- src/bin/pg_upgrade/check.c | 13 ++ src/bin/pg_upgrade/info.c | 18 ++- src/bin/pg_upgrade/pg_upgrade.h | 2 + src/bin/psql/describe.c | 23 +++- src/bin/psql/tab-complete.c | 3 +- src/bin/scripts/Makefile | 2 + src/bin/scripts/createdb.c | 20 ++++ src/bin/scripts/t/020_createdb.pl | 28 +++++ src/include/catalog/catversion.h | 2 +- src/include/catalog/pg_collation.dat | 3 +- src/include/catalog/pg_collation.h | 20 +++- src/include/catalog/pg_database.dat | 4 +- src/include/catalog/pg_database.h | 6 + src/include/utils/pg_locale.h | 5 + src/test/Makefile | 6 +- src/test/icu/.gitignore | 2 + src/test/icu/Makefile | 25 ++++ src/test/icu/README | 27 +++++ src/test/icu/t/010_database.pl | 58 +++++++++ src/test/regress/expected/collate.icu.utf8.out | 10 +- src/test/regress/sql/collate.icu.utf8.sql | 8 +- 35 files changed, 947 insertions(+), 167 deletions(-)
Hi Peter, On Thu, Mar 17, 2022 at 10:22:32AM +0000, Peter Eisentraut wrote: > Add option to use ICU as global locale provider > > This adds the option to use ICU as the default locale provider for > either the whole cluster or a database. New options for initdb, > createdb, and CREATE DATABASE are used to select this. > > Since some (legacy) code still uses the libc locale facilities > directly, we still need to set the libc global locale settings even if > ICU is otherwise selected. So pg_database now has three > locale-related fields: the existing datcollate and datctype, which are > always set, and a new daticulocale, which is only set if ICU is > selected. A similar change is made in pg_collation for consistency, > but in that case, only the libc-related fields or the ICU-related > field is set, never both. FYI, prion is complaining here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-18%2001%3A43%3A13 Some details: # Failed test 'fails for invalid ICU locale: matches' # at t/001_initdb.pl line 107. # '2022-03-18 01:54:58.563 UTC [504] FATAL: could # not open collator for locale "@colNumeric=lower": # U_ILLEGAL_ARGUMENT_ERROR -- Michael
Attachment
Hi, On Fri, Mar 18, 2022 at 11:01:11AM +0900, Michael Paquier wrote: > > FYI, prion is complaining here: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-18%2001%3A43%3A13 > > Some details: > # Failed test 'fails for invalid ICU locale: matches' > # at t/001_initdb.pl line 107. > # '2022-03-18 01:54:58.563 UTC [504] FATAL: could > # not open collator for locale "@colNumeric=lower": > # U_ILLEGAL_ARGUMENT_ERROR That's very strange, apparently initdb doesn't detect any problem when checking ucol_open() for initial checks, since it's expecting: # doesn't match '(?^:initdb: error: could not open collator for locale)' but then postgres in single-backend mode does detect the problem, with the exact same check, so it's not like --icu-locale=@colNumeric=lower wasn't correctly interpreted. Unfortunately we don't have the full initdb output, so we can't check what setup_locale_encoding reported. The only difference is that in initdb's setlocale(), the result of ucol_open is discarded, maybe the compiler is optimizing it away for some reason, even though it seems really unlikely. That being said, we could save the result and explicitly close the collator. That wouldn't make much difference in initdb (but may be a bit cleaner), but I see that there's a similar coding in createdb(), which seems like it could leak some memory according to ucol_close man page.
On Fri, Mar 18, 2022 at 4:12 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > On Fri, Mar 18, 2022 at 11:01:11AM +0900, Michael Paquier wrote: > > FYI, prion is complaining here: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-18%2001%3A43%3A13 > > > > Some details: > > # Failed test 'fails for invalid ICU locale: matches' > > # at t/001_initdb.pl line 107. > > # '2022-03-18 01:54:58.563 UTC [504] FATAL: could > > # not open collator for locale "@colNumeric=lower": > > # U_ILLEGAL_ARGUMENT_ERROR > > That's very strange, apparently initdb doesn't detect any problem when checking > ucol_open() for initial checks, since it's expecting: > > # doesn't match '(?^:initdb: error: could not open collator for locale)' > > but then postgres in single-backend mode does detect the problem, with the > exact same check, so it's not like --icu-locale=@colNumeric=lower wasn't > correctly interpreted. Unfortunately we don't have the full initdb output, so > we can't check what setup_locale_encoding reported. The only difference is > that in initdb's setlocale(), the result of ucol_open is discarded, maybe the > compiler is optimizing it away for some reason, even though it seems really > unlikely. > > That being said, we could save the result and explicitly close the collator. > That wouldn't make much difference in initdb (but may be a bit cleaner), but I > see that there's a similar coding in createdb(), which seems like it could leak > some memory according to ucol_close man page. No idea what's happening here but one observation is that that animal is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6 may hold a clue...
On Fri, Mar 18, 2022 at 06:15:47PM +1300, Thomas Munro wrote: > > No idea what's happening here but one observation is that that animal > is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6 > may hold a clue... Right. I'm setting up a similar podman environment, hopefully more info soon.
On Fri, Mar 18, 2022 at 02:36:48PM +0800, Julien Rouhaud wrote: > On Fri, Mar 18, 2022 at 06:15:47PM +1300, Thomas Munro wrote: > > > > No idea what's happening here but one observation is that that animal > > is running an older distro that shipped with ICU 5.0. Commit b8f9a2a6 > > may hold a clue... > > Right. I'm setting up a similar podman environment, hopefully more info soon. And indeed b8f9a2a6 is the problem. We would need some form of icu_set_collation_attributes() on the frontend side if we want to detect such a problem on older ICU version at the expected moment rather than when bootstrapping the info. A similar check is also needed in createdb(). I was thinking that this could be the cause of problem reported by Andres on centos 7 (which seems to ship ICU 50), but postinit.c calls make_icu_collator(), which sets the attribute as expected. Maybe it's because old ICU version simply don't understand locale ID like "en-u-kf-upper" and silently falls back to the root collator?
Julien Rouhaud <rjuju123@gmail.com> writes: > That being said, we could save the result and explicitly close the collator. > That wouldn't make much difference in initdb (but may be a bit cleaner), but I > see that there's a similar coding in createdb(), which seems like it could leak > some memory according to ucol_close man page. FYI, I verified using valgrind that (as of HEAD) there is a leak when creating a database with ICU collation that doesn't appear when creating one with libc collation. It's not a lot, a few hundred bytes per iteration, but it's there. regards, tom lane
I found a different problem with src/test/icu/: it fails altogether if the prevailing locale is "C", because then the database encoding defaults to SQL_ASCII which our ICU code won't cope with. I'm not sure if that explains any of the buildfarm failures, but it broke my local build (yeah, I'm that guy). I un-broke it for the moment by forcing the test to use UTF8 encoding, but do we want to do anything smarter than that? regards, tom lane
On 18.03.22 18:29, Tom Lane wrote: > I found a different problem with src/test/icu/: it fails altogether > if the prevailing locale is "C", because then the database encoding > defaults to SQL_ASCII which our ICU code won't cope with. I'm not > sure if that explains any of the buildfarm failures, but it broke > my local build (yeah, I'm that guy). I un-broke it for the moment > by forcing the test to use UTF8 encoding, but do we want to do > anything smarter than that? This is an appropriate solution, I think.