Re: ICU integration - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: ICU integration
Msg-id CAM3SWZQVv3s70tJ6WCmbcO8cVQjnj8ZruVMBNOqc1YpGmq7hFQ@mail.gmail.com
In response to Re: ICU integration  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can
probably be used in a way that clearly decouples these two issues,
which is very important. The first is that the rules of collations
change. The second is that the binary key that collators create (i.e.
the equivalent of strxfrm()) can change for various reasons that have
nothing to do with culture or natural languages -- purely technical
reasons. For example, they can add new optimizations to make
generating new binary keys faster. If there are bugs in how that
works, they can fix the bugs and increment the identifier [1], which
could allow Postgres to insist on a REINDEX (if abbreviated keys for
collated text were reenabled, although I don't think that problems
like that are limited to binary key generation).
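To make the distinction concrete, here is a minimal Python sketch of
tracking the two kinds of change separately. Everything in it is an
assumption made for illustration: CollatorVersion, its two fields, and
classify_change() are hypothetical names, not ICU or Postgres APIs, and
the version strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollatorVersion:
    """Hypothetical record of what was in effect when an index was built."""
    rules_version: str       # version of the collation rules (sort order)
    key_format_version: str  # version of the binary sort key format


def classify_change(stored: CollatorVersion, current: CollatorVersion) -> set:
    """Return which of the two independent aspects differ, if any."""
    changes = set()
    if stored.rules_version != current.rules_version:
        changes.add("rules")      # the ordering itself may have changed
    if stored.key_format_version != current.key_format_version:
        changes.add("sort-keys")  # only the strxfrm()-style blobs changed
    return changes


built_with = CollatorVersion(rules_version="07-00-00-00", key_format_version="1")
running = CollatorVersion(rules_version="07-00-00-00", key_format_version="2")
print(classify_change(built_with, running))
```

Keeping the two fields apart means a purely technical change to key
generation can be told apart from a change in the sort order itself.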

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator                                          | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                                         | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)                               | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)                          | 99-38-00-00 | 07-00-00-00
Aghem                                             | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)                                  | 99-38-00-00 | 07-00-00-00
Akan                                              | 99-38-00-00 | 07-00-00-00
Akan (Ghana)                                      | 99-38-00-00 | 07-00-00-00
Amharic                                           | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in,
although this is undocumented (it also has any "custom tailorings"
that might be used, say if we had user defined customizations to
collations, as Firebird apparently does [2] [3]). I'm pretty sure that
UCA version relates to a version of the Unicode collation algorithm,
and its associated DUCET table (this is all subject to ISO
standardization). I gather that a particular collation is actually
composed of a base UCA version (and DUCET table -- I think that ICU
sometimes calls this the "root"), with custom tailorings that a locale
provides for a given culture or country. These collators may in turn
be further "tailored" to get that fancy user defined customization
stuff.

In principle, and assuming I haven't gotten something wrong, it ought
to be possible to unambiguously identify a collation based on a
matching UCA version (i.e. DUCET table), plus the collation tailorings
matching exactly, even across ICU versions that happen to be based on
the same UCA version (they only seem to put out a new UCA version
about once a year [4]).  It *might* be fine, practically speaking, to
assume that a collation with a matching iso-code and UCA version is
compatible forever and always across any ICU version. If not, it might
instead be feasible to write a custom fingerprinter for collation
tailorings that ran at initdb time. Maybe the tailorings, which are
abstract rules, could even be stored in system catalogs. Then the only
thing that would need to match is ICU's UCA version (the "root"
collators must still match), because replicas could reconstruct the
serialized tailorings that comprise a collation as needed [5][6]. The
tailoring that a default collator for a locale uses isn't special,
technically speaking.
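
The fingerprinting idea above could be sketched roughly as follows.
This is a minimal Python illustration under stated assumptions, not
something ICU or Postgres provides: collation_fingerprint() and
is_compatible() are hypothetical names, the rule string is only a
sample, and a real implementation would have to obtain the UCA version
and the tailoring rules from ICU itself.

```python
import hashlib


def collation_fingerprint(uca_version: str, tailoring_rules: str) -> str:
    """Hash the UCA version together with the tailoring rules text.

    Hypothetical helper: both inputs are plain strings here.
    """
    h = hashlib.sha256()
    h.update(uca_version.encode("utf-8"))
    h.update(b"\x00")  # separator so the two fields cannot run together
    h.update(tailoring_rules.encode("utf-8"))
    return h.hexdigest()


def is_compatible(stored_fingerprint: str, uca_version: str,
                  tailoring_rules: str) -> bool:
    """True if the running collation matches the one recorded earlier."""
    return stored_fingerprint == collation_fingerprint(uca_version,
                                                       tailoring_rules)


# Recorded once (say, at initdb time), checked again after an ICU upgrade.
stored = collation_fingerprint("07-00-00-00", "&c < ch")
print(is_compatible(stored, "07-00-00-00", "&c < ch"))
print(is_compatible(stored, "08-00-00-00", "&c < ch"))
```

Under this scheme the ICU library version does not enter the
fingerprint at all; only the UCA version and the tailoring text do,
which is the decoupling the paragraph above is hoping for.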

Of course, this is all pretty hand-wavey right now, and much more
research is needed. I am very intrigued about the idea of storing the
collators in the system catalogs wholesale, since ICU provides
facilities that make that seem possible. If a "serialized Unicode set"
built from a collator's tailoring rules or, alternatively, a collator
saved as a binary representation [7] were stored in the system
catalogs, perhaps it wouldn't matter as much that the stuff
distributed with different ICU versions didn't match, at least in
theory. It's unclear that the system catalog representation could be
usable with a fair cross section of ICU versions, but if it could then
that would be perfect. This also seems to be how Firebird style
user-defined tailorings might be implemented anyway, and it seems very
appealing to add that as a light layer on top of how the base system
works, if at all possible.

[1] https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6] https://ssl.icu-project.org/apiref/icu4c/structUSerializedSet.html
[7] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a2719995a75ebed7aacc1419bb2b781db
-- 
Peter Geoghegan


