Re: Dealing with collation and strcoll/strxfrm/etc - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Dealing with collation and strcoll/strxfrm/etc
Date
Msg-id CAM3SWZRscJ45_Cwbmkn1Vh4Xue=7andxdRFHiGYQf6Fn20uBZA@mail.gmail.com
In response to Re: Dealing with collation and strcoll/strxfrm/etc  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On Mon, Mar 28, 2016 at 12:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Having to figure out how each and every stdlib does versioning doesn't
> sound fun, I certainly agree with you there, but it hardly seems
> impossible.  What we need, even if we look to move to ICU, is a place to
> remember that version information and a way to do something when we
> discover that we're now using a different version.

I think that the versioning situation is all over the place. Collation
versioning isn't part of the C standard, and there are many different
versions of many different stdlibs to support. Most importantly, even
where versioning support nominally exists, there may not be a strong
incentive to get it exactly right. We've seen that already.

> I'm not quite sure what the best way to do that is, but I imagine it
> involves changes to existing catalogs or perhaps even a new one.  I
> don't have any particularly great ideas for existing releases (maybe
> stash information in the index somewhere when it's rebuilt and then
> check it and throw an ERROR if they don't match?)

I think we'd need to introduce an abstraction like a "collation
provider", of which ICU would theoretically be just one. The OS would
be a baked-in collation provider. Everything that works today would
continue to work. We'd then largely just be grandfathering out systems
that rely on OS locales across major version upgrades, since the vast
majority of users are happy with Unicode, and have no cultural or
technical reason to prefer the OS locales that I can think of.
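
To make the two ideas above concrete (remembering which collation
version an index was built with, and hiding the source of the collation
behind a "provider"), here is a minimal Python sketch. It is only an
illustration, not a proposed design: CollationProvider, IndexCatalog,
CollationVersionMismatch, and the version strings are all hypothetical
names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CollationProvider:
    """Hypothetical abstraction: the OS libc would be one provider, ICU another."""
    name: str
    version: str                       # whatever version string the provider reports
    sort_key: Callable[[str], object]  # stand-in for strxfrm()-like behavior


class CollationVersionMismatch(Exception):
    """Raised when an index was built under a different collation version."""


class IndexCatalog:
    """Hypothetical place to remember the collation version per index."""

    def __init__(self) -> None:
        self._recorded: Dict[str, Tuple[str, str]] = {}

    def record(self, index_name: str, provider: CollationProvider) -> None:
        # Called when the index is built or rebuilt.
        self._recorded[index_name] = (provider.name, provider.version)

    def check(self, index_name: str, provider: CollationProvider) -> None:
        # Called before the index is trusted for ordered access.
        recorded = self._recorded[index_name]
        current = (provider.name, provider.version)
        if recorded != current:
            raise CollationVersionMismatch(
                f"index {index_name!r} built with {recorded}, now running {current}"
            )


# Made-up version strings, purely for illustration.
os_v1 = CollationProvider("os", "v1", sort_key=lambda s: s)
os_v2 = CollationProvider("os", "v2", sort_key=lambda s: s)

catalog = IndexCatalog()
catalog.record("users_name_idx", os_v1)
catalog.check("users_name_idx", os_v1)      # same version: fine
try:
    catalog.check("users_name_idx", os_v2)  # library was upgraded underneath us
except CollationVersionMismatch as exc:
    print("would need REINDEX:", exc)
```

The sketch only works if the provider reports a version that can be
trusted, which is exactly the weak point with OS locales.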

I am unconvinced by the idea that it especially matters that sort(1)
might not be in agreement with Postgres. Neither is any Java app, or
any .Net app, or the user's web browser in the case of Safari or
Google Chrome (maybe others). I want Postgres to be consistent with
Postgres, across different nodes on the network, in environments where
I may have little knowledge of the underlying OS. Think "sort pushdown
in postgres_fdw".
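
A small Python sketch of why cross-node consistency matters for
something like sort pushdown: if remote nodes return rows that are
already sorted, the local node can merge them cheaply, but only if
every node sorted with the same collation. The two key functions below
are stand-ins chosen for this example (they are not real OS or ICU
collations), and the helper names are hypothetical.

```python
import heapq


def codepoint_key(s):
    # Stand-in for a "C"-like collation: plain code point order.
    return s


def caseless_key(s):
    # Stand-in for a different collation: case-insensitive first,
    # with the original string as a tie-breaker.
    return (s.lower(), s)


def remote_sorted(rows, key):
    # What a remote node does when the sort is pushed down to it.
    return sorted(rows, key=key)


def merge_pushed_down(streams, key):
    # The local node merges streams that it *assumes* are sorted by `key`.
    return list(heapq.merge(*streams, key=key))


def is_sorted(rows, key):
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


node1 = ["apple", "Banana", "cherry"]
node2 = ["Apple", "banana", "Cherry"]

# Every node agrees on the collation: the merged result is correctly ordered.
consistent = merge_pushed_down(
    [remote_sorted(node1, caseless_key), remote_sorted(node2, caseless_key)],
    caseless_key,
)

# node2 silently uses a different collation: the merge produces a wrong order.
inconsistent = merge_pushed_down(
    [remote_sorted(node1, caseless_key), remote_sorted(node2, codepoint_key)],
    caseless_key,
)

print(is_sorted(consistent, caseless_key), consistent)
print(is_sorted(inconsistent, caseless_key), inconsistent)
```

Nothing in the second merge raises an error; the result is simply in
the wrong order, which is what makes this kind of disagreement hard to
notice.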

Users in certain East Asian communities might prefer to stick
with regional encodings, perhaps due to specific concerns about the
Han Unification controversy. But I'm pretty sure that these users have
very low expectations about collations in Postgres today. I was
recently told that collating Japanese is starting to get a bit better,
due to various new initiatives, but that most experienced Japanese
Postgres DBAs tend to use the "C" collation.

I don't want to impose a Unicode monoculture on anyone. But I do think
there are clear benefits for the large majority of users that always
use Unicode. Nothing needs to break that works today to make this
happen. Abbreviated keys provide an immediate incentive for users to
adopt ICU, including users that might otherwise be on the fence about it.
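
For readers unfamiliar with abbreviated keys, here is a rough Python
sketch of the general idea: compare a short prefix of the strxfrm()
output first, and fall back to a full strcoll() comparison only when
the prefixes tie. This is not the Postgres implementation. It assumes
the "C" locale so that it runs anywhere, ABBREV_LEN and the function
names are hypothetical, and it relies on the strxfrm() contract that
transformed strings compare the same way strcoll() compares the
originals, which is precisely the property that has to be trustworthy.

```python
import locale
from functools import cmp_to_key

# Assumption: the "C" locale, which is always available.
locale.setlocale(locale.LC_COLLATE, "C")

ABBREV_LEN = 8  # hypothetical prefix length for the abbreviated key


def abbreviate(s):
    # Keep only a fixed-size prefix of the transformed sort key.
    return locale.strxfrm(s)[:ABBREV_LEN]


def compare_abbreviated(a, b):
    (abbrev_a, full_a), (abbrev_b, full_b) = a, b
    if abbrev_a != abbrev_b:
        # Cheap path: the prefixes already decide the order.
        return -1 if abbrev_a < abbrev_b else 1
    # Prefixes tie: fall back to the authoritative full comparison.
    return locale.strcoll(full_a, full_b)


def sort_with_abbreviated_keys(strings):
    decorated = [(abbreviate(s), s) for s in strings]
    decorated.sort(key=cmp_to_key(compare_abbreviated))
    return [s for _, s in decorated]


words = ["collation-b", "collation-a", "icu", "glibc", "collation"]
print(sort_with_abbreviated_keys(words))
```

If strxfrm() and strcoll() ever disagree, the cheap path gives a
different answer than the fallback, and the sort order is no longer
well defined.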

>> The question is only how we deal with this when it happens. One thing
>> that's attractive about ICU is that it makes this explicit, both for
>> the logical behavior of a collation, as well as the stability of
>> binary sort keys (Glibc's versioning seemingly just does the former).
>> So the equivalent of strxfrm() output has license to change for
>> technical reasons that are orthogonal to the practical concerns of
>> end-users about how text sorts in their locale. ICU is clear on what
>> it takes to make binary sort keys in indexes work. And various major
>> database systems rely on this being right.
>
> There seems to be some disagreement about if ICU provides the
> information we'd need to make a decision or not.  It seems like it
> would, given its usage in other database systems, but if so, we need to
> very clearly understand exactly how it works and how we can depend on
> it.

It seems likely that it exposes the information required to make what
we need to do practical.

Certainly, adopting ICU is a big project that we should proceed
cautiously with, but there is a reason why every other major database
system uses either ICU, or a library based on UCA [1] that allows the
system to centrally control versioned collations (SQLite just makes
this optional).

I think that ICU *could* still tie us to the available collations on
an OS (those collations that are available with their ICU packages).
What I haven't figured out yet is if it's practical to install
versions that are available from some central location, like the CLDR
[2]. I don't think we'd want to have Postgres ship "supported
collations" in each major version, in roughly the style of the IANA
timezone stuff, but it's far too early to rule that out. It would have
upsides.

[1] https://en.wikipedia.org/wiki/Unicode_collation_algorithm
[2] http://cldr.unicode.org/
-- 
Peter Geoghegan


