Re: ICU integration - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: ICU integration |
Date | |
Msg-id | CAM3SWZQM9cx0JqiY=5S=OHrermc-rn0xMFLN7YWkmxong_xJQQ@mail.gmail.com |
In response to | Re: ICU integration (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: ICU integration |
List | pgsql-hackers |
On Thu, Sep 8, 2016 at 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I understand that in principle, but I don't see operating system
>> providers shipping a bunch of ICU versions to facilitate that. They
>> will usually ship one.
>
> I agree with that estimate, and I would further venture that even if we
> wanted to bundle ICU into our tarballs, distributors would rip it out
> again on security grounds.

I agree that we're not going to bundle our own ICU, and that packagers have to be more or less on board with whatever plan we come up with for any of this to be of much practical value. The plan itself is at least as important as the patch.

> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

Not exactly. Peter E. didn't seem to be aware that there is an ICU collator versioning concept (perhaps I misunderstood, though). It might be that in practice, the locales are very stable, so it almost doesn't matter that it's annoying when they change. Note that "collators" are versioned in a sophisticated way, not locales.

You can build the attached simple C program to see the versions of available collators from each locale, as follows:

$ gcc icu-test.c -licui18n -licuuc -o icu-coll-versions
$ ./icu-coll-versions | head -n 20
Collator                        | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                       | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)             | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)        | 99-38-00-00 | 07-00-00-00
Aghem                           | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)                | 99-38-00-00 | 07-00-00-00
Akan                            | 99-38-00-00 | 07-00-00-00
Akan (Ghana)                    | 99-38-00-00 | 07-00-00-00
Amharic                         | 99-38-00-00 | 07-00-00-00
Amharic (Ethiopia)              | 99-38-00-00 | 07-00-00-00
Arabic                          | 99-38-1B-01 | 07-00-00-00
Arabic (World)                  | 99-38-1B-01 | 07-00-00-00
Arabic (United Arab Emirates)   | 99-38-1B-01 | 07-00-00-00
Arabic (Bahrain)                | 99-38-1B-01 | 07-00-00-00
Arabic (Djibouti)               | 99-38-1B-01 | 07-00-00-00
Arabic (Algeria)                | 99-38-1B-01 | 07-00-00-00
Arabic (Egypt)                  | 99-38-1B-01 | 07-00-00-00
Arabic (Western Sahara)         | 99-38-1B-01 | 07-00-00-00
Arabic (Eritrea)                | 99-38-1B-01 | 07-00-00-00

I also attach a full list from my Ubuntu 16.04 laptop. I'll try to find some other system to generate output from, to see how closely it matches what I happen to have here. "ICU version" here is an opaque 32-bit integer [1]. I'd be interested to see how much the output of this program differs from one major version of ICU to the next.

Collations will change, of course, but not that often. It's not the end of the world if somebody has to REINDEX when they change major OS version. It would be nice if everything just continued to work with no further input from the user, but it's not essential, assuming that collations are pretty stable in practice, which I think they are. It is a total disaster if a mismatch in collations is initially undetected, though.

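To make that last point concrete, here is a minimal sketch (not taken from the attached program or from any patch) of how a collator version recorded when an index was built could be compared against the version currently reported, so that a mismatch is caught rather than silently ignored. It assumes versions are stored in the four-hex-byte text form shown in the table above, e.g. 99-38-1B-01. The type coll_version and the functions coll_version_parse() and coll_version_mismatch() are hypothetical names invented for illustration. Because the version is opaque, the sketch only tests for equality and makes no attempt to order versions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: four opaque version bytes, as printed in the table. */
typedef struct
{
    uint8_t b[4];
} coll_version;

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hexval(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Parse text of the exact form "HH-HH-HH-HH" into four bytes.
 * Returns 0 on success, -1 if the input is malformed.
 */
int
coll_version_parse(const char *s, coll_version *out)
{
    assert(s != NULL && out != NULL);

    if (strlen(s) != 11)
        return -1;

    for (int i = 0; i < 4; i++)
    {
        int hi = hexval(s[3 * i]);
        int lo = hexval(s[3 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        if (i < 3 && s[3 * i + 2] != '-')
            return -1;
        out->b[i] = (uint8_t) (hi * 16 + lo);
    }
    return 0;
}

/*
 * Returns 1 if an index built under "recorded" should be treated as
 * suspect under "current", 0 if the versions are identical.  Input
 * that cannot be parsed is treated as a mismatch, to fail safe.
 */
int
coll_version_mismatch(const char *recorded, const char *current)
{
    coll_version a;
    coll_version b;

    if (coll_version_parse(recorded, &a) != 0 ||
        coll_version_parse(current, &b) != 0)
        return 1;

    /* The version is opaque: only equality is meaningful. */
    return memcmp(a.b, b.b, sizeof a.b) != 0;
}
```

With the sample values from the table, coll_version_mismatch("99-38-1B-01", "99-38-1B-01") returns 0, while comparing "99-38-1B-01" against "99-38-00-00" returns 1, which would be the cue to REINDEX.
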
Another issue that nobody has mentioned here, I think, is that the glibc people just don't seem to care about our use-case (Carlos O'Donell basically said as much, during the strxfrm() debacle earlier this year, but it wasn't limited to how we were relying on strxfrm() at that time). Since it's almost certainly true that other major database systems are critically reliant on ICU's strxfrm() agreeing with strcoll() (substitute ICU equivalent spellings), and issues beyond that, it stands to reason that the ICU developers take that stuff very seriously.

It would be really nice to get back abbreviated keys for collated text, IMV. I think ICU gets us that. Even if we used ICU in exactly the same way as we use the C standard library today, the general sense ICU has that stability is critical would still be a big advantage. If ICU drops the ball on collation stability, or strxfrm() disagreeing with strcoll(), it's a huge problem for lots of groups of people, not just us.

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#af756972781ac556a62e48cbd509ea4a6

--
Peter Geoghegan