Re: [HACKERS] What users can do with custom ICU collations in Postgres 10 - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] What users can do with custom ICU collations in Postgres 10
Date
Msg-id CAH2-WznaO+jA+rNmpHw9c3vXyKiiPSSpktSfOccChRp_98r1Tw@mail.gmail.com
In response to Re: [HACKERS] What users can do with custom ICU collations in Postgres 10  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Mon, Aug 14, 2017 at 9:15 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I'm having trouble finding some concrete documentation for this.  The TR
> 35 link you showed documents the key words and values, BCP 47 documents
> the syntax, but nothing puts it all together in a form consumable by
> users.  The ICU documentation still mainly focuses on the "old"
> @keyword=value syntax.  I guess we'll have to write our own for now.

The standards that apply here are written in an unusual style. They
are incredibly detailed, and the options are very powerful, but the
language is unfamiliar. ICU just considers itself a consumer of the
CLDR locale stuff, which is a broad standard.

We don't have to write comprehensive documentation of these
kn/kb/ka/kh options that I pointed out exist. I think it would be nice
to cover a few interesting cases, and link to the BCP 47 Unicode
extension (TR 35) stuff.
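
To make the kn/kb/ka/kh options concrete, here is a minimal sketch of
how such a locale string is put together and handed to CREATE
COLLATION. The helper names (icu_locale_tag, create_collation_sql) and
the collation names are made up for illustration, and the sketch does
no identifier quoting or validation of the keyword values:

```python
# Hypothetical helpers, for illustration only: build a BCP 47 locale tag
# with a Unicode ("-u-") extension and wrap it in a CREATE COLLATION
# statement using the ICU provider.

def icu_locale_tag(language, **keywords):
    """Return e.g. 'en-u-kn-true' for language='en', kn='true'."""
    if not keywords:
        return language
    parts = [language, "u"]
    # Emit keys in alphabetical order so the same options always
    # produce the same tag.
    for key in sorted(keywords):
        parts.extend([key, keywords[key]])
    return "-".join(parts)


def create_collation_sql(name, language, **keywords):
    """Return a CREATE COLLATION statement (no identifier quoting)."""
    tag = icu_locale_tag(language, **keywords)
    return f"CREATE COLLATION {name} (provider = icu, locale = '{tag}');"


# kn = numeric ordering, ka = alternate handling, co = collation type.
print(create_collation_sql("numeric_en", "en", kn="true"))
print(create_collation_sql("shifted_numeric_en", "en", kn="true", ka="shifted"))
print(create_collation_sql("german_phonebook", "de", co="phonebk"))
```

Whether a given keyword is actually honored depends on the ICU version
the server is built against, so the generated statements are best tried
out on the target installation.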

Here is a list of scripts that are all reorderable with this TR 35
stuff (the exact set varies somewhat based on CLDR/ICU version):

http://unicode.org/iso15924/iso15924-codes.html

Here is a CLDR-specific XML specification of the variant keywords (it
can be mapped to a specific ICU version easily):

http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml

> Given that we cannot reasonably preload all these new variants that you
> demonstrated, I think it would make sense to drop all the keyword
> variants from the preloaded set.

Cool. While I am of course in favor of this, I actually understand
very well why you had initdb add them. I think that removing them
creates a discoverability problem that cannot easily be fixed through
documentation. ISTM that we ought to also add an SQL-callable function
that lists the most common keyword variants. Some of those are
specific to one or two locales, such as traditional Spanish, or the
alternative sort orders for Han characters.

What do you think of that idea?

I guess an alternative idea is to just link to that XML document
(collation.xml), which exactly specifies the variants. Users can get
the "co" variants there. It should be for the most part obvious which
one is interesting to which locale, since there are not that many "co"
variants to choose from, and users will probably know what to look for
if they look at all.
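
As a rough illustration of how the "co" variants could be pulled out of
that document, here is a minimal sketch. The SAMPLE string is a
hand-written, heavily abbreviated stand-in that assumes the file's
key/type layout (it is not a copy of collation.xml, and its
descriptions are only illustrative), and list_types is a hypothetical
helper name:

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for collation.xml. The element and
# attribute names (key/type, name/description) are an assumption about
# the real file's layout; check them against the actual document.
SAMPLE = """\
<ldmlBCP47>
  <keyword>
    <key name="co" description="Collation type key">
      <type name="phonebk" description="Phonebook style ordering"/>
      <type name="trad" description="Traditional style ordering"/>
    </key>
    <key name="kn" description="Numeric ordering key">
      <type name="true" description="Sort digit sequences by numeric value"/>
    </key>
  </keyword>
</ldmlBCP47>
"""


def list_types(xml_text, key_name):
    """Return (name, description) pairs for every type under one key."""
    root = ET.fromstring(xml_text)
    return [
        (t.get("name"), t.get("description", ""))
        for key in root.iter("key")
        if key.get("name") == key_name
        for t in key.iter("type")
    ]


for name, description in list_types(SAMPLE, "co"):
    print(f"{name}: {description}")
```

Pointing the same function at the contents of the real file would give
the full list of "co" values for that CLDR release, provided the layout
assumption holds.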

-- 
Peter Geoghegan


