Re: ICU integration - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: ICU integration |
Date | |
Msg-id | CAM3SWZQVv3s70tJ6WCmbcO8cVQjnj8ZruVMBNOqc1YpGmq7hFQ@mail.gmail.com |
In response to | Re: ICU integration (Craig Ringer <craig@2ndquadrant.com>) |
Responses | Re: ICU integration |
List | pgsql-hackers |
On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can probably be used in a way that clearly decouples these two issues, which is very important. The first is that the rules of collations change. The second is that the binary key that collators create (i.e. the equivalent of strxfrm()) can change for various reasons that have nothing to do with culture or natural languages -- purely technical reasons. For example, they can add new optimizations to make generating new binary keys faster. If there are bugs in how that works, they can fix the bugs and increment the identifier [1], which could allow Postgres to insist on a REINDEX (if abbreviated keys for collated text were reenabled, although I don't think that problems like that are limited to binary key generation).

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator                   | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                  | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)        | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)   | 99-38-00-00 | 07-00-00-00
Aghem                      | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)           | 99-38-00-00 | 07-00-00-00
Akan                       | 99-38-00-00 | 07-00-00-00
Akan (Ghana)               | 99-38-00-00 | 07-00-00-00
Amharic                    | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in, although this is undocumented (it also has any "custom tailorings" that might be used, say if we had user defined customizations to collations, as Firebird apparently does [2] [3]). I'm pretty sure that UCA version relates to a version of the Unicode collation algorithm, and its associated DUCET table (this is all subject to ISO standardization).
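To make the two-issue split concrete, here is a rough sketch of how the two version columns could drive a "does this index need a REINDEX?" check. It is only a model of the decision logic, not ICU or Postgres code: the names CollatorVersions, parse_row and reindex_reasons are hypothetical, the version strings are treated as opaque fields, and the "08-00-00-00" value is a made-up upgrade. ICU4C does expose ucol_getVersion() and ucol_getUCAVersion(), and I'm assuming those are roughly what the two columns correspond to.

```python
from typing import List, NamedTuple, Tuple


class CollatorVersions(NamedTuple):
    """The two version columns from the listing, kept as opaque strings."""
    collator: str  # "ICU Version" column, e.g. "99-38-00-00"
    uca: str       # "UCA Version" column, e.g. "07-00-00-00"


def parse_row(row: str) -> Tuple[str, CollatorVersions]:
    """Split one 'name | collator version | UCA version' row."""
    name, collator, uca = (field.strip() for field in row.split("|"))
    return name, CollatorVersions(collator, uca)


def reindex_reasons(stored: CollatorVersions,
                    current: CollatorVersions) -> List[str]:
    """Return the reasons (possibly none) to distrust an index built under `stored`."""
    reasons = []
    if stored.uca != current.uca:
        # Issue one: the collation rules themselves may have changed.
        reasons.append("UCA version changed: ordering rules may differ")
    if stored.collator != current.collator:
        # Issue two: technical changes (key generation) or different tailorings.
        reasons.append("collator version changed: key format or tailorings may differ")
    return reasons


# Hypothetical usage: the version recorded at index build time vs. what is installed now.
name, stored = parse_row("Afrikaans                  | 99-38-00-00 | 07-00-00-00")
current = CollatorVersions("99-38-00-00", "08-00-00-00")  # made-up upgrade
for reason in reindex_reasons(stored, current):
    print(name, "->", reason)
```

Keeping the two comparisons separate is the point: a caller could choose to react differently to a rules change than to a purely technical change in key generation.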
I gather that a particular collation is actually composed of a base UCA version (and DUCET table -- I think that ICU sometimes calls this the "root"), with custom tailorings that a locale provides for a given culture or country. These collators may in turn be further "tailored" to get that fancy user defined customization stuff.

In principle, and assuming I haven't gotten something wrong, it ought to be possible to unambiguously identify a collation based on a matching UCA version (i.e. DUCET table), plus the collation tailorings matching exactly, even across ICU versions that happen to be based on the same UCA version (they only seem to put out a new UCA version about once a year [4]).

It *might* be fine, practically speaking, to assume that a collation with a matching iso-code and UCA version is compatible forever and always across any ICU version. If not, it might instead be feasible to write a custom fingerprinter for collation tailorings that ran at initdb time. Maybe the tailorings, which are abstract rules, could even be stored in system catalogs, so that the only thing that needs to match is ICU's UCA version (the "root" collators must still match). Replicas could then reconstruct the serialized tailorings that comprise a collation as needed [5][6], because the tailoring that a default collator for a locale uses isn't special, technically speaking.

Of course, this is all pretty hand-wavy right now, and much more research is needed. I am very intrigued about the idea of storing the collators in the system catalogs wholesale, since ICU provides facilities that make that seem possible. If a "serialized Unicode set" built from a collator's tailoring rules, or, alternatively, a collator saved as a binary representation [7] were stored in the system catalogs, perhaps it wouldn't matter as much that the stuff distributed with different ICU versions didn't match, at least in theory.
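For what it's worth, the "custom fingerprinter" idea could look something like the sketch below: hash the UCA version together with the exact tailoring rule text, and call two collations compatible only when the fingerprints match. Everything here is an assumption for illustration. The function names are hypothetical, the rule strings are made-up examples, and in ICU4C the rule text would presumably come from something like ucol_getRules(). Whether equal fingerprints really imply equal sort order across ICU versions is exactly the open question.

```python
import hashlib


def collation_fingerprint(uca_version: str, tailoring_rules: str) -> str:
    """Hash the UCA version plus the exact tailoring rule text (hypothetical scheme)."""
    digest = hashlib.sha256()
    for part in (uca_version, tailoring_rules):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def same_collation(fp_at_initdb: str, fp_now: str) -> bool:
    """Treat collations as compatible only if their fingerprints match exactly."""
    return fp_at_initdb == fp_now


# Hypothetical usage: fingerprint recorded at initdb time vs. recomputed later.
recorded = collation_fingerprint("07-00-00-00", "&c < ch")   # made-up rule text
unchanged = collation_fingerprint("07-00-00-00", "&c < ch")
new_root = collation_fingerprint("08-00-00-00", "&c < ch")   # made-up UCA bump
print(same_collation(recorded, unchanged), same_collation(recorded, new_root))
```

The ICU collator version is deliberately left out of the hash, since the hope expressed above is that a matching UCA version plus matching tailorings would be enough.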
It's unclear whether the system catalog representation could be usable with a fair cross section of ICU versions, but if it could then that would be perfect. This also seems to be how Firebird-style user-defined tailorings might be implemented anyway, and it seems very appealing to add that as a light layer on top of how the base system works, if at all possible.

[1] https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6] https://ssl.icu-project.org/apiref/icu4c/structUSerializedSet.html
[7] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a2719995a75ebed7aacc1419bb2b781db

--
Peter Geoghegan