Re: ICU integration - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: ICU integration
Date
Msg-id CAM3SWZQM9cx0JqiY=5S=OHrermc-rn0xMFLN7YWkmxong_xJQQ@mail.gmail.com
In response to Re: ICU integration  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: ICU integration
List pgsql-hackers
On Thu, Sep 8, 2016 at 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I understand that in principle, but I don't see operating system
>> providers shipping a bunch of ICU versions to facilitate that.  They
>> will usually ship one.
>
> I agree with that estimate, and I would further venture that even if we
> wanted to bundle ICU into our tarballs, distributors would rip it out
> again on security grounds.

I agree that we're not going to bundle our own ICU, and that
packagers have to be more or less on board with whatever plan we come
up with for any of this to be of much practical value. The plan itself
is at least as important as the patch.

> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

Not exactly. Peter E. didn't seem to be aware that there is an ICU
collator versioning concept (perhaps I misunderstood, though). It
might be that in practice, the locales are very stable, so it almost
doesn't matter that it's annoying when they change. Note that
"collators" are versioned in a sophisticated way, not locales.

You can build the attached simple C program to see the versions of
available collators from each locale, as follows:

$ gcc icu-test.c -licui18n -licuuc -o icu-coll-versions
$ ./icu-coll-versions | head -n 20
Collator                                          | ICU Version | UCA Version
-----------------------------------------------------------------------------
Afrikaans                                         | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)                               | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)                          | 99-38-00-00 | 07-00-00-00
Aghem                                             | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)                                  | 99-38-00-00 | 07-00-00-00
Akan                                              | 99-38-00-00 | 07-00-00-00
Akan (Ghana)                                      | 99-38-00-00 | 07-00-00-00
Amharic                                           | 99-38-00-00 | 07-00-00-00
Amharic (Ethiopia)                                | 99-38-00-00 | 07-00-00-00
Arabic                                            | 99-38-1B-01 | 07-00-00-00
Arabic (World)                                    | 99-38-1B-01 | 07-00-00-00
Arabic (United Arab Emirates)                     | 99-38-1B-01 | 07-00-00-00
Arabic (Bahrain)                                  | 99-38-1B-01 | 07-00-00-00
Arabic (Djibouti)                                 | 99-38-1B-01 | 07-00-00-00
Arabic (Algeria)                                  | 99-38-1B-01 | 07-00-00-00
Arabic (Egypt)                                    | 99-38-1B-01 | 07-00-00-00
Arabic (Western Sahara)                           | 99-38-1B-01 | 07-00-00-00
Arabic (Eritrea)                                  | 99-38-1B-01 | 07-00-00-00
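
The attached program is not reproduced here. As a rough sketch of how
the two version columns above could be rendered, assuming each one is
the four bytes that ucol_getVersion() or ucol_getUCAVersion() fills in,
printed as uppercase hex pairs. The typedef and the format_version()
helper are made up for illustration and are not taken from the
attachment:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Stand-in for ICU's UVersionInfo (an array of four uint8_t). This is a
 * hypothetical local typedef so the sketch compiles without ICU headers.
 */
typedef uint8_t version_info[4];

/*
 * Render a version as "XX-XX-XX-XX" (11 characters plus the terminator).
 * Returns 0 on success, or -1 if buf is too small to hold the result.
 */
int
format_version(const version_info v, char *buf, size_t buflen)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, buflen, "%02X-%02X-%02X-%02X",
                 (unsigned) v[0], (unsigned) v[1],
                 (unsigned) v[2], (unsigned) v[3]);
    return (n == 11 && (size_t) n < buflen) ? 0 : -1;
}
```

In a real program the bytes would come from an open UCollator; here the
caller supplies them directly.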

I also attach a full list from my Ubuntu 16.04 laptop. I'll try to
find some other system to generate output from, to see how closely it
matches what I happen to have here.

"ICU version" here is an opaque 32-bit integer [1]. I'd be interested
to see how much the output of this program differs from one major
version of ICU to the next. Collations will change, of course, but not
that often. It's not the end of the world if somebody has to REINDEX
when they change major OS version. It would be nice if everything just
continued to work with no further input from the user, but it's not
essential, assuming that collations are pretty stable in practice,
which I think they are. It is a total disaster if a mismatch in
collations is initially undetected, though.
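
As a rough illustration of the detection step, one could record the
collator version when an index is built and compare it with what the
library reports later. This is only a sketch under assumptions: the
version is taken to be the 4-byte value that ucol_getVersion() returns,
and the struct and function names are hypothetical, not part of any
patch in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical record of the collator version an index was built with.
 * The four bytes mirror the layout of ICU's UVersionInfo.
 */
typedef struct index_collation_stamp
{
    uint8_t built_with[4];
} index_collation_stamp;

/*
 * Return true if the collator version reported now differs from the one
 * recorded at build time. The version is opaque, so any difference at
 * all is treated as "ordering may have changed, REINDEX needed".
 */
bool
index_needs_reindex(const index_collation_stamp *stamp,
                    const uint8_t current[4])
{
    assert(stamp != NULL && current != NULL);
    return memcmp(stamp->built_with, current, 4) != 0;
}
```

Because the value is opaque, the check is equality only; there is no
attempt to decide whether a newer version is "compatible".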

Another issue that nobody has mentioned here, I think, is that the
glibc people just don't seem to care about our use-case (Carlos
O'Donnell basically said as much, during the strxfrm() debacle earlier
this year, but it wasn't limited to how we were relying on strxfrm()
at that time). Since it's almost certainly true that other major
database systems are critically reliant on ICU's strxfrm() agreeing
with strcoll() (substitute ICU equivalent spellings), and issues beyond
that, it stands to reason that they take that stuff very seriously. It
would be really nice to get back abbreviated keys for collated text,
IMV. I think ICU gets us that. Even if we used ICU in exactly the same
way as we use the C standard library today, that general sense of
stability being critical that ICU has would still be a big advantage.
If ICU drops the ball on collation stability, or strxfrm() disagreeing
with strcoll(), it's a huge problem for lots of groups of people, not
just us.
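
For reference, the property that abbreviated keys depend on can be
spot-checked with the C standard library alone: the byte-wise ordering
of two strxfrm() results must match what strcoll() says about the
original strings. A minimal sketch follows; the function name is made
up for illustration, and the ICU equivalents would be
ucol_getSortKey() and ucol_strcoll():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Collapse a comparison result to -1, 0, or 1. */
static int
sign(int x)
{
    return (x > 0) - (x < 0);
}

/*
 * Return true if strcmp() on the strxfrm() transformations of a and b
 * orders them the same way strcoll(a, b) does under the current
 * LC_COLLATE setting.
 */
bool
xfrm_agrees_with_coll(const char *a, const char *b)
{
    size_t  alen;
    size_t  blen;
    char   *ax;
    char   *bx;
    bool    ok;

    assert(a != NULL && b != NULL);

    /* With a size of 0, strxfrm() only reports the length required. */
    alen = strxfrm(NULL, a, 0) + 1;
    blen = strxfrm(NULL, b, 0) + 1;

    ax = malloc(alen);
    bx = malloc(blen);
    if (ax == NULL || bx == NULL)
        abort();

    strxfrm(ax, a, alen);
    strxfrm(bx, b, blen);
    ok = sign(strcmp(ax, bx)) == sign(strcoll(a, b));

    free(ax);
    free(bx);
    return ok;
}
```

A caller would normally run setlocale(LC_COLLATE, "") first and feed in
many string pairs; a single false result means transformed keys cannot
be trusted for sorting in that locale.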

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#af756972781ac556a62e48cbd509ea4a6
--
Peter Geoghegan
