[COMMITTERS] pgsql: Further hacking on ICU collation creation and usage. - Mailing list pgsql-committers

From Tom Lane
Subject [COMMITTERS] pgsql: Further hacking on ICU collation creation and usage.
Msg-id E1dOpGR-0000cU-Vj@gemulon.postgresql.org
List pgsql-committers
Further hacking on ICU collation creation and usage.

pg_import_system_collations() refused to create any ICU collations if
the current database's encoding didn't support ICU.  This is wrongheaded:
initdb must initialize pg_collation in an encoding-independent way
since it might be used in other databases with different encodings.
The reason for the restriction seems to be that get_icu_locale_comment()
used icu_from_uchar() to convert the UChar-format display name, and that
unsurprisingly doesn't know what to do in unsupported encodings.
But by the same token that the initial catalog contents must be
encoding-independent, we can't allow non-ASCII characters in the comment
strings.  So we don't really need icu_from_uchar() here: just check for
Unicode codes outside the ASCII range, and if there are none, the format
conversion is trivial.  If there are some, we can simply not install the
comment.  (In my testing, this affects only Norwegian Bokmål, which has
given us trouble before.)
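
As a rough illustration of the ASCII-only conversion described above, here is
a minimal sketch.  It is not the committed code: the type UCharSketch is a
stand-in for ICU's UChar (a 16-bit UTF-16 code unit), and the function name
uchar_to_ascii is hypothetical.  The caller is assumed to skip installing the
comment when the function returns false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for ICU's UChar, a single UTF-16 code unit. */
typedef uint16_t UCharSketch;

/*
 * Copy a UTF-16 display name into dst as a NUL-terminated ASCII string.
 *
 * Returns false, leaving dst unspecified, if any code unit lies outside
 * the ASCII range or if dst is too small.  When every code unit is ASCII,
 * the conversion is a plain narrowing copy, so no encoding-aware routine
 * is needed.
 */
bool
uchar_to_ascii(const UCharSketch *src, size_t len, char *dst, size_t dstsize)
{
    size_t      i;

    /* Need room for len characters plus the terminating NUL. */
    if (len + 1 > dstsize)
        return false;

    for (i = 0; i < len; i++)
    {
        if (src[i] > 127)
            return false;       /* non-ASCII: caller skips the comment */
        dst[i] = (char) src[i];
    }
    dst[len] = '\0';
    return true;
}
```

With this sketch, a name such as "German" converts as-is, while a name
containing U+00E5 (as in "Bokmål") makes the function return false.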

For paranoia's sake, also check for non-ASCII characters in ICU locale
names, and skip such locales, as we do for libc locales.  I don't
currently have a reason to believe that this will ever reject anything,
but then again the libc maintainers should have known better too.

With just the import changes, ICU collations can be found in pg_collation
in databases with unsupported encodings.  This resulted in more or less
clean failures at runtime, but that's not how things act for unsupported
encodings with libc collations.  Make it work the same as our traditional
behavior for libc collations by having collation lookup take into account
whether the database encoding passes is_encoding_supported_by_icu().
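
The lookup rule described above might be sketched roughly as follows.  This
is a toy model, not the code in namespace.c: the CollationEntry struct, the
in-memory table, the function names, and toy_encoding_supported_by_icu()
(a stand-in for is_encoding_supported_by_icu() that arbitrarily accepts only
encoding id 6) are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ANY_ENCODING (-1)       /* entry usable with any encoding */

/* Hypothetical, simplified picture of a collation catalog row. */
typedef struct
{
    const char *name;
    int         encoding;       /* ANY_ENCODING or a specific encoding id */
    char        provider;       /* 'c' = libc, 'i' = ICU */
    int         id;             /* nonzero; 0 is reserved for "not found" */
} CollationEntry;

/* Stand-in predicate: in this toy model only encoding id 6 works with ICU. */
bool
toy_encoding_supported_by_icu(int encoding)
{
    return encoding == 6;
}

/* Find the entry with exactly this name and encoding field, or NULL. */
const CollationEntry *
find_entry(const CollationEntry *table, size_t n,
           const char *name, int encoding)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (table[i].encoding == encoding &&
            strcmp(table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}

/* Return the collation id visible in a database, or 0 if none is. */
int
lookup_collation_sketch(const CollationEntry *table, size_t n,
                        const char *name, int db_encoding)
{
    const CollationEntry *e;

    /* An encoding-specific entry that matches exactly always wins. */
    e = find_entry(table, n, name, db_encoding);
    if (e != NULL)
        return e->id;

    /* Otherwise fall back to an any-encoding entry, if there is one. */
    e = find_entry(table, n, name, ANY_ENCODING);
    if (e == NULL)
        return 0;

    /* Hide ICU entries when the database encoding cannot use ICU. */
    if (e->provider == 'i' && !toy_encoding_supported_by_icu(db_encoding))
        return 0;

    return e->id;
}
```

In this model an ICU collation stored with ANY_ENCODING is simply invisible
in a database whose encoding the predicate rejects, so the lookup reports
"not found" instead of failing later at runtime.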

Adjust documentation to match.  Also, expand Table 23.1 to show which
encodings are supported by ICU.

catversion bump because of likely change in pg_collation/pg_description
initial contents in ICU-enabled builds.

Discussion: https://postgr.es/m/20c74bc3-d6ca-243d-1bbc-12f17fa4fe9a@gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ddb5fdc068635d003a0d1c303cb109d1cb3ebeb1

Modified Files
--------------
doc/src/sgml/charset.sgml            |  92 +++++++++++++++++++++++--------
src/backend/catalog/namespace.c      | 101 ++++++++++++++++++++++-------------
src/backend/commands/collationcmds.c |  86 +++++++++++++++++++----------
src/include/catalog/catversion.h     |   2 +-
4 files changed, 194 insertions(+), 87 deletions(-)

