Thread: [HACKERS] What users can do with custom ICU collations in Postgres 10

[HACKERS] What users can do with custom ICU collations in Postgres 10

From

Peter Geoghegan

Date:

10 August 2017, 04:49:41

There are actually very many customizations to collations that are
possible beyond what the "stock" ICU collations provide (whatever
"stock" means). Some of these are really cool, and I can imagine use
cases where they are very compelling that have nothing to do with
internationalization (such customizations are how we should eventually
implement case-insensitive collations, once the infrastructure for
doing that without breaking hashing is in place).

I'd like to give a demo on what is already possible, but not currently
documented. I didn't see anyone else comment on this, including Peter
E (maybe I missed that?). We should improve the documentation in this
area, to get this into the hands of users.

Say we're unhappy that numbers come first, which we see here:

postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "en-x-icu";i
────1a1b1ca1b2c3
(6 rows)

We may do this to get our desired sort order:

postgres=# create collation digitlast (provider=icu,
locale='en-u-kr-latn-digit');
CREATE COLLATION
postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "digitlast";i
────a1b2c31a1b1c
(6 rows)

Note that 'kr' is a specific BCP47 Key [1]. Many different options can
be set in this manner.

Let's say we are unhappy with the fact that capital letters sort
higher than lowercase:

postgres=# select * from (select 'B' i union select 'b' union select
'A' union select 'a') j order by i collate "en-x-icu";i
───aAbB
(4 rows)

ICU provides a solution here, too:

postgres=# create collation capitalfirst (provider=icu, locale='en-u-kf-upper');
CREATE COLLATION
postgres=#
select * from (select 'B' i union select 'b' union select 'A' union
select 'a') j order by i collate "capitalfirst";i
───AaBb
(4 rows)

And, yes -- you can even *combine* these two options by creating a
third custom collation. That can be spelled
'en-u-kf-upper-kr-latn-digit', in case you were wondering.

Users have certainly complained about not liking this or that aspect
of how glibc sorts text many times over the years, particularly around
things like how whitespace and punctuation are handled, which they can
now do something about [2]. Users can also have numbers sort like
numbers should when compared against other numbers, by using the
numericOrdering option (not shown). numericOrdering would be great for
things like alphanumeric invoice numbers, or the alphanumeric car
registration plate numbers that are used in certain countries [3],
with fixed number/letter fields. These options are very powerful.

[1] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[2] http://unicode.org/reports/tr35/tr35-collation.html#Common_Settings
[3] https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_the_Republic_of_Ireland
--
Peter Geoghegan

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Eisentraut

Date:

14 August 2017, 22:15:27

On 8/9/17 18:49, Peter Geoghegan wrote:
> There are actually very many customizations to collations that are
> possible beyond what the "stock" ICU collations provide (whatever
> "stock" means).

This is very nice indeed, and I was not aware that this was possible
with what we already have in place.

I'm having trouble finding some concrete documentation for this.  The TR
35 link you showed documents the key words and values, BCP 47 documents
the syntax, but nothing puts it all together in a form consumable by
users.  The ICU documentation still mainly focuses on the "old"
@keyword=value syntax.  I guess we'll have to write our own for now.

Given that we cannot reasonably preload all these new variants that you
demonstrated, I think it would make sense to drop all the keyword
variants from the preloaded set.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Geoghegan

Date:

14 August 2017, 22:36:43

On Mon, Aug 14, 2017 at 9:15 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I'm having trouble finding some concrete documentation for this.  The TR
> 35 link you showed documents the key words and values, BCP 47 documents
> the syntax, but nothing puts it all together in a form consumable by
> users.  The ICU documentation still mainly focuses on the "old"
> @keyword=value syntax.  I guess we'll have to write our own for now.

There is an unusual style to the standards that apply here. It's
incredibly detailed, and the options are very powerful, but it's in an
unfamiliar language. ICU just considers itself a consumer of the CLDR
locale stuff, which is a broad standard.

We don't have to write comprehensive documentation of these
kn/kb/ka/kh options that I pointed out exist. I think it would be nice
to cover a few interesting cases, and link to the BCP 47 Unicode
extension (TR 35) stuff.

Here is a list of scripts, that are all reorderable with this TR 35
stuff (varies somewhat based on CLDR/ICU version):

http://unicode.org/iso15924/iso15924-codes.html

Here is a CLDR specific XML specification of the variant keywords (can
be mapped to specific ICU version easily):

http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml

> Given that we cannot reasonably preload all these new variants that you
> demonstrated, I think it would make sense to drop all the keyword
> variants from the preloaded set.

Cool. While I am of course in favor of this, I actually understand
very well why you had initdb add them. I think that removing them
creates a discoverability problem that cannot easily be fixed through
documentation. ISTM that we ought to also add an SQL-callable function
that lists the most common keyword variants. Some of those are
specific to one or two locales, such as traditional Spanish, or the
alternative sort orders for Han characters.

What do you think of that idea?

I guess an alternative idea is to just link to that XML document
(collation.xml), which exactly specifies the variants. Users can get
the "co" variants there. Should be for the most part obvious which one
is interesting to which locale, since there is not that many "co"
variants to choose from, and users will probably know what to look for
if they look at all.

-- 
Peter Geoghegan

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Craig Ringer

Date:

15 August 2017, 08:10:23

On 10 August 2017 at 06:49, Peter Geoghegan <pg@bowt.ie> wrote:

There are actually very many customizations to collations that are
possible beyond what the "stock" ICU collations provide (whatever
"stock" means). Some of these are really cool, and I can imagine use
cases where they are very compelling that have nothing to do with
internationalization (such customizations are how we should eventually
implement case-insensitive collations, once the infrastructure for
doing that without breaking hashing is in place).

I'd like to give a demo on what is already possible, but not currently
documented. I didn't see anyone else comment on this, including Peter
E (maybe I missed that?). We should improve the documentation in this
area, to get this into the hands of users.

Say we're unhappy that numbers come first, which we see here:

Ooh, this finally gives us a path toward case-insensitive default database collation via CLDR caseLevel.

http://userguide.icu-project.org/collation

http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_Case

That *definitely* should be documented and exposed by initdb.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Michael Paquier

Date:

15 August 2017, 08:16:14

On Tue, Aug 15, 2017 at 11:10 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Ooh, this finally gives us a path toward case-insensitive default database
> collation via CLDR caseLevel.
>
> http://userguide.icu-project.org/collation
> http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_Case
>
> That *definitely* should be documented and exposed by initdb.

The addition of an interface to initdb smells like an item for v11~,
but the documentation for 10 could be improved in this sense?
-- 
Michael

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Craig Ringer

Date:

15 August 2017, 08:27:32

On 15 August 2017 at 10:16, Michael Paquier <michael.paquier@gmail.com> wrote:

On Tue, Aug 15, 2017 at 11:10 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Ooh, this finally gives us a path toward case-insensitive default database
> collation via CLDR caseLevel.
>
> http://userguide.icu-project.org/collation
> http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_Case
>
> That *definitely* should be documented and exposed by initdb.

The addition of an interface to initdb smells like an item for v11~,
but the documentation for 10 could be improved in this sense?

Yeah, not suggesting changing it for Pg10, way too late.

It's also not enough for case-insensitive DB by its self, since we still do binary compares for equality. There'd need to be deeper surgery to make it work. So I'm getting prematurely excited here.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Eisentraut

Date:

16 August 2017, 00:19:43

On 8/14/17 12:15, Peter Eisentraut wrote:
> Given that we cannot reasonably preload all these new variants that you
> demonstrated, I think it would make sense to drop all the keyword
> variants from the preloaded set.

After playing with this a bit, I'm having some doubts.  While the "k"
keys from TR 35 are algorithmic parameters that apply to all locales and
can be looked up in the respective documents, I don't find any way a
user can discover what collation types ("co") are available for a
locale.  Any ideas?  If there isn't one, I think we need to provide one.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Eisentraut

Date:

16 August 2017, 00:33:56

On 8/9/17 18:49, Peter Geoghegan wrote:
> I'd like to give a demo on what is already possible, but not currently
> documented. I didn't see anyone else comment on this, including Peter
> E (maybe I missed that?). We should improve the documentation in this
> area, to get this into the hands of users.

Here is a small piece of documentation.  Thoughts?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-doc-Document-TR-35-collation-options-for-ICU.patch

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Geoghegan

Date:

16 August 2017, 00:36:22

On Tue, Aug 15, 2017 at 11:19 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/14/17 12:15, Peter Eisentraut wrote:
>> Given that we cannot reasonably preload all these new variants that you
>> demonstrated, I think it would make sense to drop all the keyword
>> variants from the preloaded set.
>
> After playing with this a bit, I'm having some doubts.  While the "k"
> keys from TR 35 are algorithmic parameters that apply to all locales and
> can be looked up in the respective documents, I don't find any way a
> user can discover what collation types ("co") are available for a
> locale.  Any ideas?  If there isn't one, I think we need to provide one.

I wanted to do that too, but Tom didn't seem sold on it yesterday. He
called it v11 material over on the ICU bug thread.

All of the unicode "u" extensions are documented per-CLDR version as
an XML file. For example:

http://www.unicode.org/repos/cldr/tags/release-31/common/bcp47/collation.xml

This isn't ideal, because only some of the "co" variants change things
for all possible base collations. But, there isn't that many "co"
options to choose from, and I think that for the most part it's
reasonably obvious which one is desirable. For example, Chinese people
are probably well aware of what Pinyin is, and what stroke is. Things
like EOR and search are much more esoteric, but also much less useful.
So, I wouldn't hate it if this was the only way that users could
discover the variants in v10.

-- 
Peter Geoghegan

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Geoghegan

Date:

16 August 2017, 01:04:37

On Tue, Aug 15, 2017 at 11:33 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 8/9/17 18:49, Peter Geoghegan wrote:
>> I'd like to give a demo on what is already possible, but not currently
>> documented. I didn't see anyone else comment on this, including Peter
>> E (maybe I missed that?). We should improve the documentation in this
>> area, to get this into the hands of users.
>
> Here is a small piece of documentation.  Thoughts?

This looks pretty good, but I do have some feedback:

* "23.2.2.3. Copying Collations" suggests that the only use of CREATE
COLLATION is copying collations, which is far from true with ICU. We
should change that at the same time as this change is made. I think
that just changing the title would improve the overall flow of the
page.

* Maybe add an example of numeric ordering -- the "alphanumeric
invoice" case, where you want text containing numbers to have the
numbers sort as numbers iff the comparison is to be resolved when
comparing numbers. I think that that's really useful, and worth
specifically calling out. I definitely would have used that had it
been available ten years ago.

* Let's use "en-u-kr-others-digit" instead of "en-u-kr-latn-digit' in
the example. It makes no real difference to us English speakers, but
means that the example works the same for those that use a different
alphabet. It's more culturally neutral.

* If we end up having initdb put all locales rather than all
collations in pg_collation, which I think is very likely, then we can
put in a link to ICU's locale explorer web resource:

https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK

This lets the user see exactly what they'll get from a base locale
using an intuitive interface (assuming it matches their CLDR version).

-- 
Peter Geoghegan

Re: [HACKERS] What users can do with custom ICU collations inPostgres 10

From

Peter Eisentraut

Date:

21 August 2017, 21:25:25

On 8/15/17 15:04, Peter Geoghegan wrote:
> * "23.2.2.3. Copying Collations" suggests that the only use of CREATE
> COLLATION is copying collations, which is far from true with ICU. We
> should change that at the same time as this change is made. I think
> that just changing the title would improve the overall flow of the
> page.

I don't understand why that has to be changed and how.

> * Maybe add an example of numeric ordering -- the "alphanumeric
> invoice" case, where you want text containing numbers to have the
> numbers sort as numbers iff the comparison is to be resolved when
> comparing numbers. I think that that's really useful, and worth
> specifically calling out. I definitely would have used that had it
> been available ten years ago.

done, quite useful

> * Let's use "en-u-kr-others-digit" instead of "en-u-kr-latn-digit' in
> the example. It makes no real difference to us English speakers, but
> means that the example works the same for those that use a different
> alphabet. It's more culturally neutral.

I follow what you are saying, but that locale string is not accepted by
CREATE COLLATION.

> * If we end up having initdb put all locales rather than all
> collations in pg_collation, which I think is very likely, then we can
> put in a link to ICU's locale explorer web resource:
> 
> https://ssl.icu-project.org/icu-bin/locexp?d_=en&_=en_HK
> 
> This lets the user see exactly what they'll get from a base locale
> using an intuitive interface (assuming it matches their CLDR version).

done

Patch has been posted to another thread.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services