Thread: Allow tailoring of ICU locales with custom rules

Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
This patch exposes the ICU facility to add custom collation rules to a 
standard collation.  This would allow users to customize any ICU 
collation to whatever they want.  A very simple example from the 
documentation/tests:

CREATE COLLATION en_custom
     (provider = icu, locale = 'en', rules = '&a < g');

This places "g" after "a" before "b".  Details about the syntax can be 
found at 
<https://unicode-org.github.io/icu/userguide/collation/customization/>.

The code is pretty straightforward.  It mainly just records these rules 
in the catalog and feeds them to ICU when creating the collator object.
Attachment

Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
Patch needed a rebase; no functionality changes.

On 14.12.22 10:26, Peter Eisentraut wrote:
> This patch exposes the ICU facility to add custom collation rules to a 
> standard collation.  This would allow users to customize any ICU 
> collation to whatever they want.  A very simple example from the 
> documentation/tests:
> 
> CREATE COLLATION en_custom
>      (provider = icu, locale = 'en', rules = '&a < g');
> 
> This places "g" after "a" before "b".  Details about the syntax can be 
> found at 
> <https://unicode-org.github.io/icu/userguide/collation/customization/>.
> 
> The code is pretty straightforward.  It mainly just records these rules 
> in the catalog and feeds them to ICU when creating the collator object.

Attachment

Re: Allow tailoring of ICU locales with custom rules

From
vignesh C
Date:
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> Patch needed a rebase; no functionality changes.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
d952373a987bad331c0e499463159dd142ced1ef ===
=== applying patch
./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch
patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/ref/create_collation.sgml
patching file doc/src/sgml/ref/create_database.sgml
Hunk #1 FAILED at 192.
1 out of 1 hunk FAILED -- saving rejects to file
doc/src/sgml/ref/create_database.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_4075.log

Regards,
Vignesh



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 11.01.23 03:50, vignesh C wrote:
> On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>>
>> Patch needed a rebase; no functionality changes.
> 
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Updated patch attached.
Attachment

Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Mon, 2023-01-16 at 12:18 +0100, Peter Eisentraut wrote:
> Updated patch attached.

I like that patch.  It applies and passes regression tests.

I played with it:

  CREATE COLLATION german_phone (LOCALE = 'de-AT', PROVIDER = icu, RULES = '&oe < ö');

  SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
  ORDER BY c COLLATE german_phone;

   c
  ════
   od
   oe
   ö
   of
   p
  (5 rows)

Cool so far.  Now I created a database with that locale:

  CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
     LOCALE "de_AT.utf8" TEMPLATE template0;

Now the rules are not in "pg_database":

  SELECT datcollate, daticulocale, daticurules FROM pg_database WHERE datname = 'teutsch';

   datcollate │ daticulocale │ daticurules
  ════════════╪══════════════╪═════════════
   de_AT.utf8 │ german_phone │ ∅
  (1 row)

I connect to the database and try:

  SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
  ORDER BY c COLLATE german_phone;

  ERROR:  collation "german_phone" for encoding "UTF8" does not exist
  LINE 1: ... ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE ge...
                                                               ^

Indeed, the collation isn't there...

I guess that it is not the fault of this patch that the collation isn't there,
but I think it is surprising.  What good is a database collation that does not
exist in the database?

What might be the fault of this patch, however, is that "daticurules" is not
set in "pg_database".  Looking at the code, that column seems to be copied
from the template database, but cannot be overridden.

Perhaps this only needs more documentation, but I am confused.

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
"Daniel Verite"
Date:
    Laurenz Albe wrote:

> Cool so far.  Now I created a database with that locale:
>
>  CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
>     LOCALE "de_AT.utf8" TEMPLATE template0;
>
> Now the rules are not in "pg_database":

The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to refering a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.

I think the way to express what you want should be:

CREATE DATABASE teutsch
 LOCALE_PROVIDER icu
 ICU_LOCALE 'de_AT'
 LOCALE 'de_AT.utf8'
 ICU_RULES '&a < g';

However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.

Looking at createdb() in commands.c, it creates this variable:

@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
    char       *dbcollate = NULL;
    char       *dbctype = NULL;
    char       *dbiculocale = NULL;
+    char       *dbicurules = NULL;
    char        dblocprovider = '\0';
    char       *canonname;
    int            encoding = -1;

and then reads it later

@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
        dblocprovider = src_locprovider;
    if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
        dbiculocale = src_iculocale;
+    if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
+        dbicurules = src_icurules;

    /* Some encodings are client only */
    if (!PG_VALID_BE_ENCODING(encoding))

but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.

> I guess that it is not the fault of this patch that the collation
> isn't there, but I think it is surprising.  What good is a database
> collation that does not exist in the database?

Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work to
refer
to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Sat, 2023-02-04 at 14:41 +0100, Daniel Verite wrote:
>         Laurenz Albe wrote:
>
> > Cool so far.  Now I created a database with that locale:
> >
> >  CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
> >     LOCALE "de_AT.utf8" TEMPLATE template0;
> >
> > Now the rules are not in "pg_database":
>
> The parameter after ICU_LOCALE is passed directly to ICU as a locale
> ID, as opposed to refering a collation name in the current database.
> This CREATE DATABASE doesn't fail because ICU accepts pretty much
> anything as a locale ID, ignoring what it can't parse instead of
> erroring out.
>
> I think the way to express what you want should be:
>
> CREATE DATABASE teutsch
>  LOCALE_PROVIDER icu
>  ICU_LOCALE 'de_AT'
>  LOCALE 'de_AT.utf8'
>  ICU_RULES '&a < g';
>
> However it still leaves "daticurules" empty in the destination db,
> because of an actual bug in the current patch.

I see.  Thanks for the explanation.

> > I guess that it is not the fault of this patch that the collation
> > isn't there, but I think it is surprising.  What good is a database
> > collation that does not exist in the database?
>
> Even if the above invocation of CREATE DATABASE worked as you
> intuitively expected, by getting the characteristics from the
> user-defined collation for the destination db, it still wouldn't work to
> refer
> to COLLATE "german_phone" in the destination database.
> That's because there would be no "german_phone" entry in the
> pg_collation of the destination db, as it's cloned from the template
> db, which has no reason to have this collation.

That makes sense.  Then I think that the current behavior is buggy:
You should not be allowed to specify a collation that does not exist in
the template database.  Otherwise you end up with something weird.

This is not the fault of this patch though.

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 04.02.23 14:41, Daniel Verite wrote:
> However it still leaves "daticurules" empty in the destination db,
> because of an actual bug in the current patch.
> 
> Looking at createdb() in commands.c, it creates this variable:
> 
> @@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>     char       *dbcollate = NULL;
>     char       *dbctype = NULL;
>     char       *dbiculocale = NULL;
> +    char       *dbicurules = NULL;
>     char        dblocprovider = '\0';
>     char       *canonname;
>     int            encoding = -1;
> 
> and then reads it later
> 
> @@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>         dblocprovider = src_locprovider;
>     if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
>         dbiculocale = src_iculocale;
> +    if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
> +        dbicurules = src_icurules;
>   
>     /* Some encodings are client only */
>     if (!PG_VALID_BE_ENCODING(encoding))
> 
> but it forgets to assign it in between, so it stays NULL and src_icurules
> is taken instead.

Right.  Here is a new patch with this fixed.

Attachment

Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:
> Right.  Here is a new patch with this fixed.

Thanks.  I played some more with it, and still are still some missing
odds and ends:

- There is a new option ICU_RULES to CREATE DATABASE, but it is not
  reflected in \h CREATE DATABASE.  sql_help_CREATE_DATABASE() needs to
  be amended.

- There is no way to show the rules except by querying "pg_collation" or
  "pg_database".  I think it would be good to show the rules with
  \dO+ and \l+.

- If I create a collation "x" with RULES and then create a database
  with "ICU_LOCALE x", the rules are not copied over.

  I don't know if that is intended or not, but it surprises me.
  Should that be a WARNING?  Or, since creating a database with a collation
  that does not exist in "template0" doesn't make much sense (or does it?),
  is there a way to forbid that?

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 14.02.23 17:53, Laurenz Albe wrote:
> On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:
>> Right.  Here is a new patch with this fixed.
> 
> Thanks.  I played some more with it, and still are still some missing
> odds and ends:
> 
> - There is a new option ICU_RULES to CREATE DATABASE, but it is not
>    reflected in \h CREATE DATABASE.  sql_help_CREATE_DATABASE() needs to
>    be amended.

Fixed.

> - There is no way to show the rules except by querying "pg_collation" or
>    "pg_database".  I think it would be good to show the rules with
>    \dO+ and \l+.

Fixed.  I adjusted the order of the columns a bit, to make the overall 
picture more sensible.  The locale provider column is now earlier, since 
it indicates which of the subsequent columns are applicable.

> - If I create a collation "x" with RULES and then create a database
>    with "ICU_LOCALE x", the rules are not copied over.
> 
>    I don't know if that is intended or not, but it surprises me.
>    Should that be a WARNING?  Or, since creating a database with a collation
>    that does not exist in "template0" doesn't make much sense (or does it?),
>    is there a way to forbid that?

This is a misunderstanding of how things work.  The value of the 
database ICU_LOCALE attribute is passed to the ICU library.  It does not 
refer to a PostgreSQL collation object.
Attachment

Re: Allow tailoring of ICU locales with custom rules

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

[patch v5]

Two quick comments:

- pg_dump support need to be added for CREATE COLLATION / DATABASE

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 20.02.23 17:30, Daniel Verite wrote:
> - pg_dump support need to be added for CREATE COLLATION / DATABASE

I have added that.

> 
> - there doesn't seem to be a way to add rules to template1.
> If someone wants to have icu rules and initial contents to their new
> databases, I think they need to create a custom template database
> (from template0) for that purpose, in addition to template1.
>  From a usability standpoint, this is a bit cumbersome, as it's
> normally the role of template1.
> To improve on that, shouldn't initdb be able to create template0 with
> rules too?

Right, that would be an initdb option.  Is that too many initdb options 
then?  It would be easy to add, if we think it's worth it.

Attachment

Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:
> > - there doesn't seem to be a way to add rules to template1.
> > If someone wants to have icu rules and initial contents to their new
> > databases, I think they need to create a custom template database
> > (from template0) for that purpose, in addition to template1.
> >   From a usability standpoint, this is a bit cumbersome, as it's
> > normally the role of template1.
> > To improve on that, shouldn't initdb be able to create template0 with
> > rules too?
>
> Right, that would be an initdb option.  Is that too many initdb options
> then?  It would be easy to add, if we think it's worth it.

An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.

But I'd prefer an "initdb" option.

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 02.03.23 16:39, Laurenz Albe wrote:
> On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:
>>> - there doesn't seem to be a way to add rules to template1.
>>> If someone wants to have icu rules and initial contents to their new
>>> databases, I think they need to create a custom template database
>>> (from template0) for that purpose, in addition to template1.
>>>    From a usability standpoint, this is a bit cumbersome, as it's
>>> normally the role of template1.
>>> To improve on that, shouldn't initdb be able to create template0 with
>>> rules too?
>>
>> Right, that would be an initdb option.  Is that too many initdb options
>> then?  It would be easy to add, if we think it's worth it.
> 
> An alternative would be to document that you can drop "template1" and
> create it again using the ICU collation rules you need.
> 
> But I'd prefer an "initdb" option.

Ok, here is a version with an initdb option and also a createdb option 
(to expose the CREATE DATABASE option).

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

;-)

Attachment

Re: Allow tailoring of ICU locales with custom rules

From
Jeff Davis
Date:
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
> You can mess with people by setting up your databases like this:
>
> initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
>
> ;-)

Would we be the first major database to support custom collation rules?
This sounds useful for testing, experimentation, hacking, etc.

What are some of the use cases? Is it helpful to comply with unusual or
outdated standards or formats? Maybe there are people using special
delimiters/terminators and they need them to be treated a certain way
during comparisons?

Regards,
    Jeff Davis




Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Tue, 2023-03-07 at 22:06 -0800, Jeff Davis wrote:
> On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
> > You can mess with people by setting up your databases like this:
> >
> > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
> >
> > ;-)
>
> Would we be the first major database to support custom collation rules?
> This sounds useful for testing, experimentation, hacking, etc.
>
> What are some of the use cases? Is it helpful to comply with unusual or
> outdated standards or formats? Maybe there are people using special
> delimiters/terminators and they need them to be treated a certain way
> during comparisons?

I regularly see complaints about the sort order; recently this one:
https://postgr.es/m/CAFCRh--xt-J8awOavhB216kom6TQnaP35TTVEQQS5bHH7gMemQ@mail.gmail.com

So being able to influence the sort order is useful.

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
Laurenz Albe
Date:
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
> Ok, here is a version with an initdb option and also a createdb option
> (to expose the CREATE DATABASE option).
>
> You can mess with people by setting up your databases like this:
>
> initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

Looks good.  I cannot get it to misbehave, "make check-world" is successful
(the regression tests misbehave in interesting ways when running
"make installcheck" on a cluster created with non-standard ICU rules, but
that can be expected).

I checked the documentation, tested "pg_dump" support, everything fine.

I'll mark it as "ready for committer".

Yours,
Laurenz Albe



Re: Allow tailoring of ICU locales with custom rules

From
Peter Eisentraut
Date:
On 08.03.23 15:18, Laurenz Albe wrote:
> On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
>> Ok, here is a version with an initdb option and also a createdb option
>> (to expose the CREATE DATABASE option).
>>
>> You can mess with people by setting up your databases like this:
>>
>> initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
> 
> Looks good.  I cannot get it to misbehave, "make check-world" is successful
> (the regression tests misbehave in interesting ways when running
> "make installcheck" on a cluster created with non-standard ICU rules, but
> that can be expected).
> 
> I checked the documentation, tested "pg_dump" support, everything fine.
> 
> I'll mark it as "ready for committer".

committed