Thread: Allow tailoring of ICU locales with custom rules
This patch exposes the ICU facility to add custom collation rules to a standard collation. This would allow users to customize any ICU collation to whatever they want. A very simple example from the documentation/tests: CREATE COLLATION en_custom (provider = icu, locale = 'en', rules = '&a < g'); This places "g" after "a" before "b". Details about the syntax can be found at <https://unicode-org.github.io/icu/userguide/collation/customization/>. The code is pretty straightforward. It mainly just records these rules in the catalog and feeds them to ICU when creating the collator object.
Attachment
Patch needed a rebase; no functionality changes. On 14.12.22 10:26, Peter Eisentraut wrote: > This patch exposes the ICU facility to add custom collation rules to a > standard collation. This would allow users to customize any ICU > collation to whatever they want. A very simple example from the > documentation/tests: > > CREATE COLLATION en_custom > (provider = icu, locale = 'en', rules = '&a < g'); > > This places "g" after "a" before "b". Details about the syntax can be > found at > <https://unicode-org.github.io/icu/userguide/collation/customization/>. > > The code is pretty straightforward. It mainly just records these rules > in the catalog and feeds them to ICU when creating the collator object.
Attachment
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > Patch needed a rebase; no functionality changes. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID d952373a987bad331c0e499463159dd142ced1ef === === applying patch ./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch patching file doc/src/sgml/catalogs.sgml patching file doc/src/sgml/ref/create_collation.sgml patching file doc/src/sgml/ref/create_database.sgml Hunk #1 FAILED at 192. 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/ref/create_database.sgml.rej [1] - http://cfbot.cputube.org/patch_41_4075.log Regards, Vignesh
On 11.01.23 03:50, vignesh C wrote: > On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> >> Patch needed a rebase; no functionality changes. > > The patch does not apply on top of HEAD as in [1], please post a rebased patch: Updated patch attached.
Attachment
On Mon, 2023-01-16 at 12:18 +0100, Peter Eisentraut wrote: > Updated patch attached. I like that patch. It applies and passes regression tests. I played with it: CREATE COLLATION german_phone (LOCALE = 'de-AT', PROVIDER = icu, RULES = '&oe < ö'); SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE german_phone; c ════ od oe ö of p (5 rows) Cool so far. Now I created a database with that locale: CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone LOCALE "de_AT.utf8" TEMPLATE template0; Now the rules are not in "pg_database": SELECT datcollate, daticulocale, daticurules FROM pg_database WHERE datname = 'teutsch'; datcollate │ daticulocale │ daticurules ════════════╪══════════════╪═════════════ de_AT.utf8 │ german_phone │ ∅ (1 row) I connect to the database and try: SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE german_phone; ERROR: collation "german_phone" for encoding "UTF8" does not exist LINE 1: ... ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE ge... ^ Indeed, the collation isn't there... I guess that it is not the fault of this patch that the collation isn't there, but I think it is surprising. What good is a database collation that does not exist in the database? What might be the fault of this patch, however, is that "daticurules" is not set in "pg_database". Looking at the code, that column seems to be copied from the template database, but cannot be overridden. Perhaps this only needs more documentation, but I am confused. Yours, Laurenz Albe
Laurenz Albe wrote: > Cool so far. Now I created a database with that locale: > > CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone > LOCALE "de_AT.utf8" TEMPLATE template0; > > Now the rules are not in "pg_database": The parameter after ICU_LOCALE is passed directly to ICU as a locale ID, as opposed to refering a collation name in the current database. This CREATE DATABASE doesn't fail because ICU accepts pretty much anything as a locale ID, ignoring what it can't parse instead of erroring out. I think the way to express what you want should be: CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE 'de_AT' LOCALE 'de_AT.utf8' ICU_RULES '&a < g'; However it still leaves "daticurules" empty in the destination db, because of an actual bug in the current patch. Looking at createdb() in commands.c, it creates this variable: @@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) char *dbcollate = NULL; char *dbctype = NULL; char *dbiculocale = NULL; + char *dbicurules = NULL; char dblocprovider = '\0'; char *canonname; int encoding = -1; and then reads it later @@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) dblocprovider = src_locprovider; if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU) dbiculocale = src_iculocale; + if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU) + dbicurules = src_icurules; /* Some encodings are client only */ if (!PG_VALID_BE_ENCODING(encoding)) but it forgets to assign it in between, so it stays NULL and src_icurules is taken instead. > I guess that it is not the fault of this patch that the collation > isn't there, but I think it is surprising. What good is a database > collation that does not exist in the database? Even if the above invocation of CREATE DATABASE worked as you intuitively expected, by getting the characteristics from the user-defined collation for the destination db, it still wouldn't work to refer to COLLATE "german_phone" in the destination database. That's because there would be no "german_phone" entry in the pg_collation of the destination db, as it's cloned from the template db, which has no reason to have this collation. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Sat, 2023-02-04 at 14:41 +0100, Daniel Verite wrote: > Laurenz Albe wrote: > > > Cool so far. Now I created a database with that locale: > > > > CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone > > LOCALE "de_AT.utf8" TEMPLATE template0; > > > > Now the rules are not in "pg_database": > > The parameter after ICU_LOCALE is passed directly to ICU as a locale > ID, as opposed to refering a collation name in the current database. > This CREATE DATABASE doesn't fail because ICU accepts pretty much > anything as a locale ID, ignoring what it can't parse instead of > erroring out. > > I think the way to express what you want should be: > > CREATE DATABASE teutsch > LOCALE_PROVIDER icu > ICU_LOCALE 'de_AT' > LOCALE 'de_AT.utf8' > ICU_RULES '&a < g'; > > However it still leaves "daticurules" empty in the destination db, > because of an actual bug in the current patch. I see. Thanks for the explanation. > > I guess that it is not the fault of this patch that the collation > > isn't there, but I think it is surprising. What good is a database > > collation that does not exist in the database? > > Even if the above invocation of CREATE DATABASE worked as you > intuitively expected, by getting the characteristics from the > user-defined collation for the destination db, it still wouldn't work to > refer > to COLLATE "german_phone" in the destination database. > That's because there would be no "german_phone" entry in the > pg_collation of the destination db, as it's cloned from the template > db, which has no reason to have this collation. That makes sense. Then I think that the current behavior is buggy: You should not be allowed to specify a collation that does not exist in the template database. Otherwise you end up with something weird. This is not the fault of this patch though. Yours, Laurenz Albe
On 04.02.23 14:41, Daniel Verite wrote: > However it still leaves "daticurules" empty in the destination db, > because of an actual bug in the current patch. > > Looking at createdb() in commands.c, it creates this variable: > > @@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > char *dbcollate = NULL; > char *dbctype = NULL; > char *dbiculocale = NULL; > + char *dbicurules = NULL; > char dblocprovider = '\0'; > char *canonname; > int encoding = -1; > > and then reads it later > > @@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > dblocprovider = src_locprovider; > if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU) > dbiculocale = src_iculocale; > + if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU) > + dbicurules = src_icurules; > > /* Some encodings are client only */ > if (!PG_VALID_BE_ENCODING(encoding)) > > but it forgets to assign it in between, so it stays NULL and src_icurules > is taken instead. Right. Here is a new patch with this fixed.
Attachment
On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote: > Right. Here is a new patch with this fixed. Thanks. I played some more with it, and still are still some missing odds and ends: - There is a new option ICU_RULES to CREATE DATABASE, but it is not reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to be amended. - There is no way to show the rules except by querying "pg_collation" or "pg_database". I think it would be good to show the rules with \dO+ and \l+. - If I create a collation "x" with RULES and then create a database with "ICU_LOCALE x", the rules are not copied over. I don't know if that is intended or not, but it surprises me. Should that be a WARNING? Or, since creating a database with a collation that does not exist in "template0" doesn't make much sense (or does it?), is there a way to forbid that? Yours, Laurenz Albe
On 14.02.23 17:53, Laurenz Albe wrote: > On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote: >> Right. Here is a new patch with this fixed. > > Thanks. I played some more with it, and still are still some missing > odds and ends: > > - There is a new option ICU_RULES to CREATE DATABASE, but it is not > reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to > be amended. Fixed. > - There is no way to show the rules except by querying "pg_collation" or > "pg_database". I think it would be good to show the rules with > \dO+ and \l+. Fixed. I adjusted the order of the columns a bit, to make the overall picture more sensible. The locale provider column is now earlier, since it indicates which of the subsequent columns are applicable. > - If I create a collation "x" with RULES and then create a database > with "ICU_LOCALE x", the rules are not copied over. > > I don't know if that is intended or not, but it surprises me. > Should that be a WARNING? Or, since creating a database with a collation > that does not exist in "template0" doesn't make much sense (or does it?), > is there a way to forbid that? This is a misunderstanding of how things work. The value of the database ICU_LOCALE attribute is passed to the ICU library. It does not refer to a PostgreSQL collation object.
Attachment
Peter Eisentraut wrote: [patch v5] Two quick comments: - pg_dump support need to be added for CREATE COLLATION / DATABASE - there doesn't seem to be a way to add rules to template1. If someone wants to have icu rules and initial contents to their new databases, I think they need to create a custom template database (from template0) for that purpose, in addition to template1. From a usability standpoint, this is a bit cumbersome, as it's normally the role of template1. To improve on that, shouldn't initdb be able to create template0 with rules too? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 20.02.23 17:30, Daniel Verite wrote: > - pg_dump support need to be added for CREATE COLLATION / DATABASE I have added that. > > - there doesn't seem to be a way to add rules to template1. > If someone wants to have icu rules and initial contents to their new > databases, I think they need to create a custom template database > (from template0) for that purpose, in addition to template1. > From a usability standpoint, this is a bit cumbersome, as it's > normally the role of template1. > To improve on that, shouldn't initdb be able to create template0 with > rules too? Right, that would be an initdb option. Is that too many initdb options then? It would be easy to add, if we think it's worth it.
Attachment
On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote: > > - there doesn't seem to be a way to add rules to template1. > > If someone wants to have icu rules and initial contents to their new > > databases, I think they need to create a custom template database > > (from template0) for that purpose, in addition to template1. > > From a usability standpoint, this is a bit cumbersome, as it's > > normally the role of template1. > > To improve on that, shouldn't initdb be able to create template0 with > > rules too? > > Right, that would be an initdb option. Is that too many initdb options > then? It would be easy to add, if we think it's worth it. An alternative would be to document that you can drop "template1" and create it again using the ICU collation rules you need. But I'd prefer an "initdb" option. Yours, Laurenz Albe
On 02.03.23 16:39, Laurenz Albe wrote: > On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote: >>> - there doesn't seem to be a way to add rules to template1. >>> If someone wants to have icu rules and initial contents to their new >>> databases, I think they need to create a custom template database >>> (from template0) for that purpose, in addition to template1. >>> From a usability standpoint, this is a bit cumbersome, as it's >>> normally the role of template1. >>> To improve on that, shouldn't initdb be able to create template0 with >>> rules too? >> >> Right, that would be an initdb option. Is that too many initdb options >> then? It would be easy to add, if we think it's worth it. > > An alternative would be to document that you can drop "template1" and > create it again using the ICU collation rules you need. > > But I'd prefer an "initdb" option. Ok, here is a version with an initdb option and also a createdb option (to expose the CREATE DATABASE option). You can mess with people by setting up your databases like this: initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' ;-)
Attachment
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: > You can mess with people by setting up your databases like this: > > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' > > ;-) Would we be the first major database to support custom collation rules? This sounds useful for testing, experimentation, hacking, etc. What are some of the use cases? Is it helpful to comply with unusual or outdated standards or formats? Maybe there are people using special delimiters/terminators and they need them to be treated a certain way during comparisons? Regards, Jeff Davis
On Tue, 2023-03-07 at 22:06 -0800, Jeff Davis wrote: > On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: > > You can mess with people by setting up your databases like this: > > > > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' > > > > ;-) > > Would we be the first major database to support custom collation rules? > This sounds useful for testing, experimentation, hacking, etc. > > What are some of the use cases? Is it helpful to comply with unusual or > outdated standards or formats? Maybe there are people using special > delimiters/terminators and they need them to be treated a certain way > during comparisons? I regularly see complaints about the sort order; recently this one: https://postgr.es/m/CAFCRh--xt-J8awOavhB216kom6TQnaP35TTVEQQS5bHH7gMemQ@mail.gmail.com So being able to influence the sort order is useful. Yours, Laurenz Albe
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: > Ok, here is a version with an initdb option and also a createdb option > (to expose the CREATE DATABASE option). > > You can mess with people by setting up your databases like this: > > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' Looks good. I cannot get it to misbehave, "make check-world" is successful (the regression tests misbehave in interesting ways when running "make installcheck" on a cluster created with non-standard ICU rules, but that can be expected). I checked the documentation, tested "pg_dump" support, everything fine. I'll mark it as "ready for committer". Yours, Laurenz Albe
On 08.03.23 15:18, Laurenz Albe wrote: > On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: >> Ok, here is a version with an initdb option and also a createdb option >> (to expose the CREATE DATABASE option). >> >> You can mess with people by setting up your databases like this: >> >> initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' > > Looks good. I cannot get it to misbehave, "make check-world" is successful > (the regression tests misbehave in interesting ways when running > "make installcheck" on a cluster created with non-standard ICU rules, but > that can be expected). > > I checked the documentation, tested "pg_dump" support, everything fine. > > I'll mark it as "ready for committer". committed