Re: Dealing with collation and strcoll/strxfrm/etc - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Dealing with collation and strcoll/strxfrm/etc |
Date | |
Msg-id | CAM3SWZRscJ45_Cwbmkn1Vh4Xue=7andxdRFHiGYQf6Fn20uBZA@mail.gmail.com |
In response to | Re: Dealing with collation and strcoll/strxfrm/etc (Stephen Frost <sfrost@snowman.net>) |
List | pgsql-hackers |
On Mon, Mar 28, 2016 at 12:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Having to figure out how each and every stdlib does versioning doesn't
> sound fun, I certainly agree with you there, but it hardly seems
> impossible. What we need, even if we look to move to ICU, is a place to
> remember that version information and a way to do something when we
> discover that we're now using a different version.

I think that the versioning situation is all over the place. Collation versioning isn't in the C standard. And there are many different versions of many different stdlibs to support. Most importantly, where support nominally exists, a strong incentive to get it exactly right may not. We've seen that already.

> I'm not quite sure what the best way to do that is, but I imagine it
> involves changes to existing catalogs or perhaps even a new one. I
> don't have any particularly great ideas for existing releases (maybe
> stash information in the index somewhere when it's rebuilt and then
> check it and throw an ERROR if they don't match?)

I think we'd need to introduce an abstraction like a "collation provider", of which ICU would theoretically be just one. The OS would be a baked-in collation provider. Everything that works today would continue to work. We'd then largely just be grandfathering out systems that rely on OS locales across major version upgrades, since the vast majority of users are happy with Unicode, and have no cultural or technical reason that I can think of to prefer the OS locales.

I am unconvinced by the idea that it especially matters that sort(1) might not be in agreement with Postgres. Neither is any Java app, or any .Net app, or the user's web browser in the case of Safari or Google Chrome (maybe others). I want Postgres to be consistent with Postgres, across different nodes on the network, in environments where I may have little knowledge of the underlying OS. Think "sort pushdown in postgres_fdw".
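Editor's illustration: the "collation provider" idea, combined with the quoted suggestion to stash version information when an index is built and raise an error on mismatch, might be sketched roughly as follows. This is only a toy model, not anything Postgres implements. Every name in it (`CollationProvider`, `record_index_build`, `check_index_usable`, the in-memory `recorded_versions` dictionary standing in for a catalog) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CollationProvider:
    """Hypothetical provider: a name, a version string, and a sort-key function."""
    name: str                        # e.g. "os" or "icu" (illustrative only)
    version: str                     # whatever version string the provider reports
    sort_key: Callable[[str], bytes]


class CollationVersionMismatch(Exception):
    """Raised when an index was built under a different provider version."""


# Hypothetical stand-in for a catalog: index name -> (provider name, version).
recorded_versions: Dict[str, Tuple[str, str]] = {}


def record_index_build(index: str, provider: CollationProvider) -> None:
    """Remember which provider and version the index was built with."""
    recorded_versions[index] = (provider.name, provider.version)


def check_index_usable(index: str, provider: CollationProvider) -> None:
    """Refuse to use the index if the provider or its version has changed."""
    built_with = recorded_versions[index]
    current = (provider.name, provider.version)
    if built_with != current:
        raise CollationVersionMismatch(
            f"index {index!r} built with {built_with}, now running {current}"
        )
```

The design point is only that the version is recorded next to the thing that depends on it, so a change is detected instead of silently producing a mis-sorted index.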
Users in certain East Asian communities might prefer to stick with regional encodings, perhaps due to specific concerns about the Han Unification controversy. But I'm pretty sure that these users have very low expectations about collations in Postgres today. I was recently told that collating Japanese is starting to get a bit better, due to various new initiatives, but that most experienced Japanese Postgres DBAs tend to use the "C" collation.

I don't want to impose a Unicode monoculture on anyone. But I do think there are clear benefits for the large majority of users that always use Unicode. Nothing that works today needs to break to make this happen. Abbreviated keys provide an immediate incentive for users to adopt ICU; users that might otherwise be on the fence about it.

>> The question is only how we deal with this when it happens. One thing
>> that's attractive about ICU is that it makes this explicit, both for
>> the logical behavior of a collation, as well as the stability of
>> binary sort keys (Glibc's versioning seemingly just does the former).
>> So the equivalent of strxfrm() output has license to change for
>> technical reasons that are orthogonal to the practical concerns of
>> end-users about how text sorts in their locale. ICU is clear on what
>> it takes to make binary sort keys in indexes work. And various major
>> database systems rely on this being right.
>
> There seems to be some disagreement about if ICU provides the
> information we'd need to make a decision or not. It seems like it
> would, given its usage in other database systems, but if so, we need to
> very clearly understand exactly how it works and how we can depend on
> it.

It seems likely that it exposes the information required to make what we need to do practical.
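Editor's illustration: the property that binary sort keys are supposed to provide can be shown with Python's `locale` module, which wraps the C library's strcoll() and strxfrm(). The sketch below pins the "C" locale so the result does not depend on which OS locales are installed; the helper name `sort_keys_agree_with_strcoll` is made up for this example. It demonstrates only agreement within one process. It says nothing about whether keys stay stable across library versions, which is the separate question raised in the quoted text.

```python
import functools
import locale

# Pin LC_COLLATE to the "C" locale so the example behaves the same everywhere.
locale.setlocale(locale.LC_COLLATE, "C")


def sort_keys_agree_with_strcoll(words):
    """Return True if sorting by strxfrm() keys matches sorting with strcoll()."""
    by_key = sorted(words, key=locale.strxfrm)
    by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))
    return by_key == by_cmp


words = ["b", "a", "B", "ab"]
agree = sort_keys_agree_with_strcoll(words)
```

A key that is stored (for example in an index) is only useful for as long as the library keeps producing the same key for the same string, which is why explicit versioning of sort keys matters.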
Certainly, adopting ICU is a big project that we should proceed with cautiously, but there is a reason why every other major database system uses either ICU, or a library based on UCA [1] that allows the system to centrally control versioned collations (SQLite just makes this optional).

I think that ICU *could* still tie us to the available collations on an OS (those collations that are available with their ICU packages). What I haven't figured out yet is whether it's practical to install versions that are available from some central location, like the CLDR [2]. I don't think we'd want to have Postgres ship "supported collations" in each major version, in roughly the style of the IANA timezone stuff, but it's far too early to rule that out. It would have upsides.

[1] https://en.wikipedia.org/wiki/Unicode_collation_algorithm
[2] http://cldr.unicode.org/

--
Peter Geoghegan