Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47language tags. Should it?
Date
Msg-id CAH2-Wz=VbytqGTYfuT_qsN9U+e=rMw4UirOdMGDYt_jtS7kbpg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.  (Maybe we are looking at the wrong
> information to populate collversion?)

Just to be clear, I was never arguing that canonicalization would
avoid reindexing, and I didn't think anyone else was. I believe that
the version string will change when the behavior of its collator
changes for any reason and in any way. This includes changes to how
binary sort keys are generated.

(We don't currently store binary keys on disk, so a change to that
representation doesn't really necessitate a REINDEX for us. The UCA
spec explicitly decouples compatibility for indexing with binary keys
from changes needed due to human requirements. ICU binary sort keys
are compressed, and this is sometimes improved, independently of the
evolution of how natural language experts say text should be sorted.)

> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Again, to be clear, that's what I was arguing all along. I think that
users will get very good results with this approach. When the sorting
behavior of a locale is refined by natural language experts, it's
almost certainly a very small change that most users that are affected
won't actually notice. For example, when en-us-x-icu is changed,
that's actually due to a general change in the inherited root collator
that doesn't really affect English speakers. There is no practical
question about how you're supposed to sort English text.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Next
From: Andrew Dunstan
Date:
Subject: Re: [HACKERS] alter server for foreign table