Re: ICU integration - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: ICU integration
Date
Msg-id CAM3SWZSH-MdXQN2NsOAvhENAA9vtD91WbS8tFGnKN2TiN2+WCw@mail.gmail.com
Whole thread Raw
In response to ICU integration  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: ICU integration  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

I'm really happy that you're working on this. This is more important
than is widely appreciated, and very important overall.

In a world where ICU becomes the defacto standard (i.e. is used on
major platforms by default), what remaining barriers are there to
documenting and enforcing the binary compatibility of replicas? I may
be mistaken, but offhand I can't think of any. Being able to describe
exactly what works and what doesn't is very important. After all,
failure to adhere to "the rules" today, such as they are, can leave
you with a subtly broken replica. I'd like to make that scenario
mechanically impossible, by locking everything down.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.

My guess is that there are cultural reasons why it wants to operate on
a whole string, at least in some cases.

> Also note that ICU locales are encoding-independent and don't support a
> separate collcollate and collctype, so the existing catalog structure is
> not optimal.

That makes more sense to me, personally. ICU very explicitly decouples
technical issues (like the representation of strxfrm() keys, and, I
gather, encoding) from cultural issues (the actual user-visible
behaviors). This allows us to use strxfrm()-style binary keys in
indexes directly, since they're versioned independently from their
underlying collation; they can add a new optimization to strxfrm()-key
generation to the next ICU version, and we can detect that and require
a REINDEX, even when the collation version itself (the user-visible
behaviors) are unchanged. I'm getting ahead of myself here, but that
does seem very useful.

The Unicode collation algorithm [1] that ICU is directly based on
knows plenty about the requirements of indexing. It contains guidance
about equivalence vs. equality that we learned the hard way in commit
656beff5, for example.

> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

Not sure about that.

Whatever we come up with here needs to mesh well with the existing
conventions around collation versioning that ICU has, in the context
of various operating system packages in particular. We can arrange it
so that in practice, an ICU upgrade doesn't often break your indexes
due to a collation rule change; ICU is happy to have multiple versions
of a collation at a time, and you'll probably retain the old collation
version in ICU.

Even if your old collation version isn't available in a new ICU
release (which I think is unlikely in practice), or you downgrade ICU,
it might be possible to give guidance on how to download a "Collation
Resource Bundle" [2][3] that *does* have the right collation version,
which presumably satisfies the requirement immediately.

Firebird already uses ICU. Maybe we have something to learn from them
here. In particular, where do they (by which I mean the ICU version
that Firebird links to) get its collations from in practice? I think
that the CLDR Data collations were at one time not even distributed
with ICU source. It might be a matter of individual OS packagers of
ICU deciding what exact CLDR data to use, which may or may not be of
any significant consequence in practice.

[1] http://unicode.org/reports/tr10
[2] http://site.icu-project.org/design/size/collation
[3] http://userguide.icu-project.org/icudata
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [COMMITTERS] pgsql: Use static inline functions for float <-> Datum conversions.
Next
From: Heikki Linnakangas
Date:
Subject: Re: [COMMITTERS] pgsql: Use static inline functions for float <-> Datum conversions.