Thread: ICU for global collation
Here is an initial patch to add the option to use ICU as the global collation provider, a long-requested feature. To activate, use something like initdb --collation-provider=icu --locale=... A trick here is that since we need to also still set the normal POSIX locales, the --locale value needs to be valid as both a POSIX locale and a ICU locale. If that doesn't work out, there is also a way to specify it separately, e.g., initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en This complexity is unfortunate, but I don't see a way around it right now. There are also options for createdb and CREATE DATABASE to do this for a particular database only. Besides this, the implementation is quite small: When starting up a database, we create an ICU collator object, store it in a global variable, and then use it when appropriate. All the ICU code for creating and invoking those collators already exists of course. For the version tracking, I use the pg_collation row for the "default" collation. Again, this mostly reuses existing code and concepts. Nondeterministic collations are not supported for the global collation, because then LIKE and regular expressions don't work and that breaks some system views. This needs some separate research. To test, run the existing regression tests against a database initialized with ICU. Perhaps some options for pg_regress could facilitate that. I fear that the Localization chapter in the documentation will need a bit of a rewrite after this, because the hitherto separately treated concepts of locale and collation are fusing together. I haven't done that here yet, but that would be the plan for later. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi! > 20 авг. 2019 г., в 19:21, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а): > > Here is an initial patch to add the option to use ICU as the global > collation provider, a long-requested feature. > > To activate, use something like > > initdb --collation-provider=icu --locale=... > > A trick here is that since we need to also still set the normal POSIX > locales, the --locale value needs to be valid as both a POSIX locale and > a ICU locale. If that doesn't work out, there is also a way to specify > it separately, e.g., > > initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en Thanks! This is very awaited feature. Seems like user cannot change locale for database if icu is already chosen? postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Provider | Access privileges -----------+-------+----------+---------+-------+----------+------------------- postgres | x4mmm | UTF8 | ru_RU | ru_RU | icu | template0 | x4mmm | UTF8 | ru_RU | ru_RU | icu | =c/x4mmm + | | | | | | x4mmm=CTc/x4mmm template1 | x4mmm | UTF8 | ru_RU | ru_RU | icu | =c/x4mmm + | | | | | | x4mmm=CTc/x4mmm (3 rows) postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8'; CREATE DATABASE postgres=# \c a 2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU FATAL: collations with different collate and ctype values are not supported by ICU Previous connection kept Am I missing something? BTW, psql does not know about collation_provider. Best regards, Andrey Borodin.
On 2019-08-21 08:56, Andrey Borodin wrote: > postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8'; > CREATE DATABASE > postgres=# \c a > 2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU > FATAL: collations with different collate and ctype values are not supported by ICU Try create database a template template0 collation_provider icu locale 'en_US.utf8'; which sets both lc_collate and lc_ctype. But 'en_US.utf8' is not a valid ICU locale name. Perhaps use 'en' or 'en-US'. I'm making a note that we should prevent creating a database with a faulty locale configuration in the first place instead of failing when we're connecting. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 21 авг. 2019 г., в 12:23, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а): > > On 2019-08-21 08:56, Andrey Borodin wrote: >> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8'; >> CREATE DATABASE >> postgres=# \c a >> 2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU >> FATAL: collations with different collate and ctype values are not supported by ICU > > Try > > create database a template template0 collation_provider icu locale > 'en_US.utf8'; > > which sets both lc_collate and lc_ctype. But 'en_US.utf8' is not a > valid ICU locale name. Perhaps use 'en' or 'en-US'. > > I'm making a note that we should prevent creating a database with a > faulty locale configuration in the first place instead of failing when > we're connecting. Yes, the problem is input with lc_collate is accepted postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8'; CREATE DATABASE postgres=# \c a 2019-09-11 10:01:00.373 +05 [56878] FATAL: collations with different collate and ctype values are not supported by ICU FATAL: collations with different collate and ctype values are not supported by ICU Previous connection kept postgres=# create database b template template0 collation_provider icu locale 'en_US.utf8'; CREATE DATABASE postgres=# \c b You are now connected to database "b" as user "x4mmm". I get same output with 'en' or 'en-US'. Also, cluster initialized --with-icu started on binaries without icu just fine. And only after some time, I've got that messages "ERROR: ICU is not supported in this build". Is it expected behavior? Maybe we should refuse to start without icu? Best regards, Andrey Borodin.
Hi, When trying databases defined with ICU locales, I see that backends that serve such databases seem to have their LC_CTYPE inherited from the environment (as opposed to a per-database fixed value). That's a problem for the backend code that depends on libc functions that themselves depend on LC_CTYPE, such as the full text search parser and dictionaries. For instance, if you start the instance with a C locale (LC_ALL=C pg_ctl...) , and tries to use FTS in an ICU UTF-8 database, it doesn't work: template1=# create database "fr-utf8" template 'template0' encoding UTF8 locale 'fr' collation_provider 'icu'; template1=# \c fr-utf8 You are now connected to database "fr-utf8" as user "daniel". fr-utf8=# show lc_ctype; lc_ctype ---------- fr (1 row) fr-utf8=# select to_tsvector('été'); ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. If I peek into the "real" LC_CTYPE when connected to this database, I can see it's "C": fr-utf8=# create extension plperl; CREATE EXTENSION fr-utf8=# create function lc_ctype() returns text as '$ENV{LC_CTYPE};' language plperl; CREATE FUNCTION fr-utf8=# select lc_ctype(); lc_ctype ---------- C Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Hi everyone, like the others before me we (the university of Münster) are happy to see this feature as well. Thank you this. When I applied the patch two weeks ago I run into the issue that initdb did not recognize the new parameters (collation-provider and icu-locale) but I guess it was caused by my own stupidity. > When trying databases defined with ICU locales, I see that backends > that serve such databases seem to have their LC_CTYPE inherited from > the environment (as opposed to a per-database fixed value). I am able to recreate the issue described by Daniel on my machine. Now it works as expected. I just had to update the patch since commit 3f6b3be3 had modified two lines which resulted in conflicts. You find the updated patch as attachement to this mail. Best regards, Marius Timmer -- Westfälische Wilhelms-Universität Münster (WWU) Zentrum für Informationsverarbeitung (ZIV) Röntgenstraße 7-13 Besucheradresse: Einsteinstraße 60 - Raum 107 48149 Münster +49 251 83 31158 marius.timmer@uni-muenster.de https://www.uni-muenster.de/ZIV
Attachment
On Wed, Oct 9, 2019 at 12:16 AM Marius Timmer <marius.timmer@uni-muenster.de> wrote: > like the others before me we (the university of Münster) are happy to > see this feature as well. Thank you this. > > When I applied the patch two weeks ago I run into the issue that initdb > did not recognize the new parameters (collation-provider and icu-locale) > but I guess it was caused by my own stupidity. > > > When trying databases defined with ICU locales, I see that backends > > that serve such databases seem to have their LC_CTYPE inherited from > > the environment (as opposed to a per-database fixed value). > I am able to recreate the issue described by Daniel on my machine. > > Now it works as expected. I just had to update the patch since commit > 3f6b3be3 had modified two lines which resulted in conflicts. You find > the updated patch as attachement to this mail. I rebased this patch, and tweaked get_collation_action_version() very slightly so that you get collation version change detection (of the ersatz kind provided by commit d5ac14f9) for the default collation even when not using ICU. Please see attached. +struct pg_locale_struct global_locale; Why not "default_locale"? Where is the terminology "global" coming from? + Specifies the collation provider for the database. "for the database's default collation"?
Attachment
On Thu, Oct 17, 2019 at 3:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I rebased this patch, and tweaked get_collation_action_version() very > slightly so that you get collation version change detection (of the > ersatz kind provided by commit d5ac14f9) for the default collation > even when not using ICU. Please see attached. It should also remove the sentence I recently added to alter_collation.sgml to say that the default collation doesn't have version tracking. Rereading that section, it's also clear that the query introduced with: "The following query can be used to identify all collations in the current database that need to be refreshed and the objects that depend on them:" … is wrong with this patch applied. The right query is quite hard to come up with, since we don't explicitly track dependencies on the default collation. That is, there is no pg_depend entry pointing from the index to the collation when you write CREATE INDEX ON t(x) for a text column using the default collation, but there is one when you write CREATE INDEX ON t(x COLLATE "fr_FR"), or when you write CREATE INDEX ON t(x) for a text column that was explicitly defined to use COLLATE "fr_FR". One solution is that we could start tracking those dependencies explicitly too. A preexisting problem with that query is that it doesn't report transitive dependencies. An index on t(x) of a user defined type defined with CREATE TYPE my_type AS (x text COLLATE "fr_FR") doesn't result in a pg_depend row from index to collation, so the query fails to report that as an index needing to be rebuilt. You could fix that with a sprinkle of recursive magic, but you'd need a different kind of magic to deal with transitive dependencies on the default collation unless we start listing such dependencies explicitly. In that example, my_type would need to depend on collation "default". You can't just do some kind of search for transitive dependencies on type "text", because they aren't tracked either. In my longer term proposal to track per-dependency versions, either by adding refobjversion to pg_depend or by creating another pg_depend-like catalog, you'd almost certainly need to add an explicit record for dependencies on the default collation.
On 2019-09-17 15:08, Daniel Verite wrote: > When trying databases defined with ICU locales, I see that backends > that serve such databases seem to have their LC_CTYPE inherited from > the environment (as opposed to a per-database fixed value). > fr-utf8=# select to_tsvector('été'); > ERROR: invalid multibyte character for locale > HINT: The server's LC_CTYPE locale is probably incompatible with the > database encoding. I looked into this problem. The way to address this would be adding proper collation support to the text search subsystem. See the TODO markers in src/backend/tsearch/ts_locale.c for starting points. These APIs spread out to a lot of places, so it will take some time to finish. In the meantime, I'm pausing this thread and will set the CF entry as RwF. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > I looked into this problem. The way to address this would be adding > proper collation support to the text search subsystem. See the TODO > markers in src/backend/tsearch/ts_locale.c for starting points. These > APIs spread out to a lot of places, so it will take some time to finish. > In the meantime, I'm pausing this thread and will set the CF entry as RwF. Even if the FTS code is improved in that matter, any extension code with libc functions depending on LC_CTYPE is still going to be potentially problematic. In particular when it happens to be set to a different encoding than the database. Couldn't we simply invent per-database GUC options, as in ALTER DATABASE myicudb SET libc_lc_ctype TO 'value'; ALTER DATABASE myicudb SET libc_lc_collate TO 'value'; where libc_lc_ctype/libc_lc_collate would specifically set the values in the LC_CTYPE and LC_COLLATE environment vars of any backend serving the corresponding database"? Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 2019-11-01 19:18, Daniel Verite wrote: > Even if the FTS code is improved in that matter, any extension code > with libc functions depending on LC_CTYPE is still going to be > potentially problematic. In particular when it happens to be set > to a different encoding than the database. I think the answer here is that extension code must not do that, at least in ways that potentially interact with other parts of the (collation-aware) database system. For example, libc and ICU might have different opinions about what is a letter, because of different versions of Unicode data in use. That would then affect tokenization etc. in text search and elsewhere. That's why things like isalpha have to go though ICU instead, if that is the collation provider in a particular context. > Couldn't we simply invent per-database GUC options, as in > ALTER DATABASE myicudb SET libc_lc_ctype TO 'value'; > ALTER DATABASE myicudb SET libc_lc_collate TO 'value'; > > where libc_lc_ctype/libc_lc_collate would specifically set > the values in the LC_CTYPE and LC_COLLATE environment vars > of any backend serving the corresponding database"? We could do that as a transition measure to support extensions like you mention above. But our own internal code should not have to rely on that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
There were a few inquiries about this topic recently, so I dug up the old thread and patch. What we got stuck on last time was that we can't just swap out all locale support in a database for ICU. We still need to set the usual locale environment, otherwise some things that are not ICU aware will break or degrade. I had initially anticipated fixing that by converting everything that uses libc locales to ICU. But that turned out to be tedious and ultimately not very useful as far as the user-facing result is concerned, so I gave up. So this is a different approach: If you choose ICU as the default locale for a database, you still need to specify lc_ctype and lc_collate settings, as before. Unlike in the previous patch, where the ICU collation name was written in datcollate, there is now a third column (daticucoll), so we can store all three values. This fixes the described problem. Other than that, once you get all the initial settings right, it basically just works: The places that have ICU support now will use a database-wide ICU collation if appropriate, the places that don't have ICU support continue to use the global libc locale settings. I changed the datcollate, datctype, and the new daticucoll fields to type text (from name). That way, the daticucoll field can be set to null if it's not applicable. Also, the limit of 63 characters can actually be a problem if you want to use some combination of the options that ICU locales offer. And for less extreme uses, having variable-length fields will save some storage, since typical locale names are much shorter. For the same reasons and to keep things consistent, I also changed the analogous pg_collation fields like that. This also removes some weird code that has to check that colcollate and colctype have to be the same for ICU, so it's overall cleaner.
Attachment
Hi, On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote: > > So this is a different approach: If you choose ICU as the default locale for > a database, you still need to specify lc_ctype and lc_collate settings, as > before. Unlike in the previous patch, where the ICU collation name was > written in datcollate, there is now a third column (daticucoll), so we can > store all three values. This fixes the described problem. Other than that, > once you get all the initial settings right, it basically just works: The > places that have ICU support now will use a database-wide ICU collation if > appropriate, the places that don't have ICU support continue to use the > global libc locale settings. That looks sensible to me. > @@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout) > appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, " > "(%s datdba) AS dba, " > "pg_encoding_to_char(encoding) AS encoding, " > + "datcollprovider, " This needs to be in a new pg 15+ branch, not in the pg 9.3+. > - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) > - mylocale = pg_newlocale_from_collation(collid); > + if (!lc_collate_is_c(collid)) > + { > + if (collid != DEFAULT_COLLATION_OID) > + mylocale = pg_newlocale_from_collation(collid); > + else if (default_locale.provider == COLLPROVIDER_ICU) > + mylocale = &default_locale; > + } There are really a lot of places with this new code. Maybe it could be some new function/macro to wrap that for the normal case (e.g. not formatting.c)?
On 04.01.22 03:21, Julien Rouhaud wrote: >> @@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout) >> appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, " >> "(%s datdba) AS dba, " >> "pg_encoding_to_char(encoding) AS encoding, " >> + "datcollprovider, " > > This needs to be in a new pg 15+ branch, not in the pg 9.3+. ok >> - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) >> - mylocale = pg_newlocale_from_collation(collid); >> + if (!lc_collate_is_c(collid)) >> + { >> + if (collid != DEFAULT_COLLATION_OID) >> + mylocale = pg_newlocale_from_collation(collid); >> + else if (default_locale.provider == COLLPROVIDER_ICU) >> + mylocale = &default_locale; >> + } > > There are really a lot of places with this new code. Maybe it could be some > new function/macro to wrap that for the normal case (e.g. not formatting.c)? Right, we could just put this into pg_newlocale_from_collation(), but the comment there says * In fact, they shouldn't call this function at all when they are dealing * with the default locale. That can save quite a bit in hotspots. I don't know how to assess that. We could pack this into a macro or inline function if we are concerned about this.
On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote: > On 04.01.22 03:21, Julien Rouhaud wrote: > > > > - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) > > > - mylocale = pg_newlocale_from_collation(collid); > > > + if (!lc_collate_is_c(collid)) > > > + { > > > + if (collid != DEFAULT_COLLATION_OID) > > > + mylocale = pg_newlocale_from_collation(collid); > > > + else if (default_locale.provider == COLLPROVIDER_ICU) > > > + mylocale = &default_locale; > > > + } > > > > There are really a lot of places with this new code. Maybe it could be some > > new function/macro to wrap that for the normal case (e.g. not formatting.c)? > > Right, we could just put this into pg_newlocale_from_collation(), but the > comment there says > > * In fact, they shouldn't call this function at all when they are dealing > * with the default locale. That can save quite a bit in hotspots. > > I don't know how to assess that. > > We could pack this into a macro or inline function if we are concerned about > this. Yes that was my idea, just have a new function (inline function or a macro then since pg_newlocale_from_collation() clearly warns about performance concerns) that have the whole is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls pg_newlocale_from_collation() only when needed.
I didn't notice anything version-specific about the patch. Would any modifications be needed to backport it to pg13 andpg14? After this patch goes in, the big next thing would be to support nondeterministic collations for LIKE, ILIKE and patternmatching operators in general. Is anyone interested in working on that? On 1/5/22, 10:36 PM, "Julien Rouhaud" <rjuju123@gmail.com> wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you canconfirm the sender and know the content is safe. On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote: > On 04.01.22 03:21, Julien Rouhaud wrote: > > > > - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) > > > - mylocale = pg_newlocale_from_collation(collid); > > > + if (!lc_collate_is_c(collid)) > > > + { > > > + if (collid != DEFAULT_COLLATION_OID) > > > + mylocale = pg_newlocale_from_collation(collid); > > > + else if (default_locale.provider == COLLPROVIDER_ICU) > > > + mylocale = &default_locale; > > > + } > > > > There are really a lot of places with this new code. Maybe it could be some > > new function/macro to wrap that for the normal case (e.g. not formatting.c)? > > Right, we could just put this into pg_newlocale_from_collation(), but the > comment there says > > * In fact, they shouldn't call this function at all when they are dealing > * with the default locale. That can save quite a bit in hotspots. > > I don't know how to assess that. > > We could pack this into a macro or inline function if we are concerned about > this. Yes that was my idea, just have a new function (inline function or a macro then since pg_newlocale_from_collation() clearly warns about performance concerns) that have the whole is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls pg_newlocale_from_collation() only when needed.
On Thu, Jan 06, 2022 at 01:55:55PM +0000, Finnerty, Jim wrote: > > I didn't notice anything version-specific about the patch. Would any > modifications be needed to backport it to pg13 and pg14? This is a new feature so it can't be backported. The changes aren't big and mostly touches places that didn't change in a long time so I don't think that it would take much effort if you wanted to backport it on your own forks. > After this patch goes in, the big next thing would be to support > nondeterministic collations for LIKE, ILIKE and pattern matching operators in > general. Is anyone interested in working on that? As far as I know you're the last person that seemed to be working on that topic back in March :)
Hi, I looked a bit more in this patch and I have some additional remarks. On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote: > > So this is a different approach: If you choose ICU as the default locale for > a database, you still need to specify lc_ctype and lc_collate settings, as > before. Unlike in the previous patch, where the ICU collation name was > written in datcollate, there is now a third column (daticucoll), so we can > store all three values. This fixes the described problem. Other than that, > once you get all the initial settings right, it basically just works: The > places that have ICU support now will use a database-wide ICU collation if > appropriate, the places that don't have ICU support continue to use the > global libc locale settings. So just to confirm a database can now have 2 different *default* collations: a libc-based one for everything that doesn't work with ICU and a ICU-based (if specified) for everything else, and the ICU-based is optional, so if not provided everything works as before, using the libc based default collation. As I mentioned I think this approach is sensible. However, should we document what are the things that are not ICU-aware? > I changed the datcollate, datctype, and the new daticucoll fields to type > text (from name). That way, the daticucoll field can be set to null if it's > not applicable. Also, the limit of 63 characters can actually be a problem > if you want to use some combination of the options that ICU locales offer. > And for less extreme uses, having variable-length fields will save some > storage, since typical locale names are much shorter. I understand the need to have daticucoll as text, however it's not clear to me why this has to be changed for datcollate and datctype? IIUC those will only ever contain libc-based collation and are still mandatory? > > For the same reasons and to keep things consistent, I also changed the > analogous pg_collation fields like that. The respective fields in pg_collation are now nullable, so the changes there sounds ok. Digging a bit more in the patch here are some things that looks problematic. - pg_upgrade It checks (in check_locale_and_encoding()) the compatibility for each database, and it looks like the daticucoll field should also be verified. Other than that I don't think there is anything else needed for the pg_upgrade part as everything else should be handled by pg_dump (I didn't look at the changes yet given the below problems). - CREATE DATABASE There's a new COLLATION_PROVIDER option, but the overall behavior seems quite unintuitive. As far as I can see the idea is to use LOCALE for the ICU default collation, but as it's providing a default for the other values it's quite annoying. For instance: =# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'fr-x-icu' LC_COLLATE 'en_GB.UTF-8';; ERROR: 42809: invalid locale name: "fr-x-icu" LOCATION: createdb, dbcommands.c:397 Looking at the code it's actually complaining about LC_CTYPE. If you want a database with an ICU default collation the lc_collate and lc_ctype should inherit what's in the template database and not what was provided in the LOCALE I think. You could still probably overload them in some scenario, but without a list of what isn't ICU-aware I can't really be sure of how often one might have to do it. Now, if I specify everything as needed it looks like it's missing some checks on the ICU default collation when not using template0: =# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';; CREATE DATABASE =# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu (4 rows) Unless I'm missing something the same concerns about collation incompatibility with objects in the source database should also apply for the ICU collation? While at it, I'm not exactly sure of what the COLLATION_PROVIDER is supposed to mean, as the same commands but with a libc provider is accepted and has the exact same result: =# CREATE DATABASE db2 COLLATION_PROVIDER libc LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';; CREATE DATABASE =# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db2 | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu (5 rows) Shouldn't db2 have a NULL daticucoll, and if so also complain about incompatibility for it? - initdb I don't think that initdb --collation-provider icu should be allowed without --icu-locale, same for --collation-provider libc *with* --icu-locale. When trying that, I can also see that the NULL handling for daticucoll is broken in the BKI: $ initdb -k --collation-provider icu [...] Success. You can now start the database server using: =# select datname, datcollate, datctype, daticucoll from pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | en_GB template1 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB template0 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB (3 rows) There's a fallback on my LANG/LC_* env settings, but I don't think it can ever be correct given the different naming convention in ICU (at least the s/_/-/). And $ initdb -k --collation-provider libc --icu-locale test [...] Success. You can now start the database server using: =# select datname, datcollate, datctype, daticucoll from pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | _null_ template1 | en_GB.UTF-8 | en_GB.UTF-8 | _null_ template0 | en_GB.UTF-8 | en_GB.UTF-8 | _null_ (3 rows)
On 04.01.22 17:03, Peter Eisentraut wrote: >> There are really a lot of places with this new code. Maybe it could >> be some >> new function/macro to wrap that for the normal case (e.g. not >> formatting.c)? > > Right, we could just put this into pg_newlocale_from_collation(), but > the comment there says > > * In fact, they shouldn't call this function at all when they are dealing > * with the default locale. That can save quite a bit in hotspots. > > I don't know how to assess that. I tested this a bit. I used the following setup: create table t1 (a text); insert into t1 select md5(generate_series(1, 10000000)::text); select count(*) from t1 where a > ''; And then I changed in varstr_cmp(): if (collid != DEFAULT_COLLATION_OID) mylocale = pg_newlocale_from_collation(collid); to just mylocale = pg_newlocale_from_collation(collid); I find that the \timing results are indistinguishable. (I used locale "en_US.UTF-8" and made sure that that code path is actually hit.) Does anyone have other insights?
Julien Rouhaud wrote: > If you want a database with an ICU default collation the lc_collate > and lc_ctype should inherit what's in the template database and not > what was provided in the LOCALE I think. You could still probably > overload them in some scenario, but without a list of what isn't > ICU-aware I can't really be sure of how often one might have to do > it. I guess we'd need that when creating a database with a different encoding than the template databases, at least. About what's not ICU-aware, I believe the most significant part in core is the Full Text Search parser. It doesn't care about sorting strings, but it relies on the functions from POSIX <ctype.h>, which depend on LC_CTYPE (it looks however that this could be improved by following what has been done in backend/regex/regc_pg_locale.c, which has comparable needs and calls ICU functions when applicable). Also, any extension is potentially concerned. Surely many extensions call functions from ctype.h assuming that the current value of LC_CTYPE works with the data they handle. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Fri, Jan 07, 2022 at 03:25:28PM +0100, Peter Eisentraut wrote: > > I tested this a bit. I used the following setup: > > create table t1 (a text); > insert into t1 select md5(generate_series(1, 10000000)::text); > select count(*) from t1 where a > ''; > > And then I changed in varstr_cmp(): > > if (collid != DEFAULT_COLLATION_OID) > mylocale = pg_newlocale_from_collation(collid); > > to just > > mylocale = pg_newlocale_from_collation(collid); > > I find that the \timing results are indistinguishable. (I used locale > "en_US.UTF-8" and made sure that that code path is actually hit.) > > Does anyone have other insights? Looking at the git history, you added this comment in 414c5a2ea65. After a bit a digging in the lists, I found that you introduced it to fix a reported 13% slowdown in varstr_cmp(): https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net
On Mon, Jan 10, 2022 at 11:25:08AM +0800, Julien Rouhaud wrote: > On Fri, Jan 07, 2022 at 03:25:28PM +0100, Peter Eisentraut wrote: > > > > I tested this a bit. I used the following setup: > > > > create table t1 (a text); > > insert into t1 select md5(generate_series(1, 10000000)::text); > > select count(*) from t1 where a > ''; > > > > And then I changed in varstr_cmp(): > > > > if (collid != DEFAULT_COLLATION_OID) > > mylocale = pg_newlocale_from_collation(collid); > > > > to just > > > > mylocale = pg_newlocale_from_collation(collid); > > > > I find that the \timing results are indistinguishable. (I used locale > > "en_US.UTF-8" and made sure that that code path is actually hit.) > > > > Does anyone have other insights? > > Looking at the git history, you added this comment in 414c5a2ea65. > > After a bit a digging in the lists, I found that you introduced it to fix a > reported 13% slowdown in varstr_cmp(): > https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com > https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net So I tried to run Noah's benchmark to see if I could reproduce the slowdown. Unfortunately the results I'm getting don't really make sense as removing the optimisation brings a 15% speedup, and with a few more runs I can see that I have about 25% noise, so there isn't much I can do to help.
Peter Eisentraut wrote: > Unlike in the previous patch, where the ICU > collation name was written in datcollate, there is now a third column > (daticucoll), so we can store all three values. I think some users would want their db-wide ICU collation to be case/accent-insensitive. Postgres users are trained to expect case-sensitive comparisons, but some apps initially made for e.g. MySQL or MS-SQL that use such collations natively would be easier to port to Postgres. IIRC, that was the context for some questions where people were enquiring about db-wide ICU collations. With the current patch, it's not possible, AFAICS, because the user can't tell that the collation is non-deterministic. Presumably this would require another option to CREATE DATABASE and another column to store that bit of information. The "daticucol" column also suggests that we don't expect to add other collation providers in the future. Maybe a pair of columns like (datcollprovider, datcolllocale) would be more future-proof, or a (datcollprovider, datcolllocale, datcollisdeterministic) triplet if non-deterministic collations are allowed. Also, pg_collation has "collversion" to detect a mismatch between the ICU runtime and existing indexes. I don't see that field for the db-wide ICU collation, so maybe we currently miss the capability to detect the mismatch for the db-wide collation? The lack of these fields overall suggest the idea that when CREATE DATABASE is called with a global ICU collation, what if it somehow inserted the collation into pg_collation in the new database? Then pg_database would just store the collation oid and no other collation-related field would need to be added into it, now or in the future. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Mon, Jan 10, 2022 at 12:49:07PM +0100, Daniel Verite wrote: > > The "daticucol" column also suggests that we don't expect to add > other collation providers in the future. Maybe a pair of columns like > (datcollprovider, datcolllocale) would be more future-proof, > or a (datcollprovider, datcolllocale, datcollisdeterministic) > triplet if non-deterministic collations are allowed. I'm not sure about the non-deterministic default collation given the current restrictions with it, but the extra column seems like a good idea. It would require a bit more thinking, as we would need a second collation column in pg_database for any default provider that's not libc. > Also, pg_collation has "collversion" to detect a mismatch between > the ICU runtime and existing indexes. I don't see that field > for the db-wide ICU collation, so maybe we currently miss the capability > to detect the mismatch for the db-wide collation? I don't think that storing a version there will really help. There's no guarantee that any object has been created with the version of the collation that was installed when the database was created. And we would still need to store a version with each underlying object anyway, as rebuilding all broken dependencies can last for a long time, including a server restart. > The lack of these fields overall suggest the idea that when CREATE > DATABASE is called with a global ICU collation, what if it somehow > inserted the collation into pg_collation in the new database? > Then pg_database would just store the collation oid and no other > collation-related field would need to be added into it, now > or in the future. I don't think it would be doable given the single-database-per-backend restriction.
Julien Rouhaud wrote: > > The lack of these fields overall suggest the idea that when CREATE > > DATABASE is called with a global ICU collation, what if it somehow > > inserted the collation into pg_collation in the new database? > > Then pg_database would just store the collation oid and no other > > collation-related field would need to be added into it, now > > or in the future. > > I don't think it would be doable given the single-database-per-backend > restriction. By that I understand that CREATE DATABASE is limited to copying a template database and then not write anything into it beyond that, as it's not even connected to it. I guess there's still the possibility of requiring that the ICU db-wide collation of the new database does exist in the template database, and then the CREATE DATABASE would refer to that collation instead of an independent locale string. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Mon, Jan 10, 2022 at 03:45:47PM +0100, Daniel Verite wrote: > > By that I understand that CREATE DATABASE is limited to copying a template > database and then not write anything into it beyond that, as it's > not even connected to it. Yes. > I guess there's still the possibility of requiring that the ICU db-wide > collation of the new database does exist in the template database, > and then the CREATE DATABASE would refer to that collation instead of > an independent locale string. That could work. However if having the collation in the template database a strict requirement the something should also be done for initdb, and it will probably be a bigger problem.
On 10.01.22 07:00, Julien Rouhaud wrote: >>> And then I changed in varstr_cmp(): >>> >>> if (collid != DEFAULT_COLLATION_OID) >>> mylocale = pg_newlocale_from_collation(collid); >>> >>> to just >>> >>> mylocale = pg_newlocale_from_collation(collid); >>> >>> I find that the \timing results are indistinguishable. (I used locale >>> "en_US.UTF-8" and made sure that that code path is actually hit.) >>> >>> Does anyone have other insights? >> >> Looking at the git history, you added this comment in 414c5a2ea65. >> >> After a bit a digging in the lists, I found that you introduced it to fix a >> reported 13% slowdown in varstr_cmp(): >> https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com >> https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net > > So I tried to run Noah's benchmark to see if I could reproduce the slowdown. > Unfortunately the results I'm getting don't really make sense as removing the > optimisation brings a 15% speedup, and with a few more runs I can see that I > have about 25% noise, so there isn't much I can do to help. Heh, I had that same experience, it actually got faster without the optimization, but then got lost in the noise on further testing. Looking back at those discussions, I don't think those old test results are relevant anymore. In the patch that was being tested there, pg_newlocale_from_collation(), did not contain if (collid == DEFAULT_COLLATION_OID) return (pg_locale_t) 0; so the default collation actually went through most or all of the function and did a lot of work. That would understandably be quite slow. But just calling a function and returning immediately should not be a problem. Otherwise, the call to check_collation_set() in varstr_cmp() and elsewhere would be just as bad. So, unless there are concerns, I'm going to see about making a patch to call pg_newlocale_from_collation() even with the default collation. That would make the actual feature patch quite a bit smaller, since we won't have to patch every call site of pg_newlocale_from_collation().
On Tue, Jan 11, 2022 at 10:10:25AM +0100, Peter Eisentraut wrote: > > On 10.01.22 07:00, Julien Rouhaud wrote: > > > > So I tried to run Noah's benchmark to see if I could reproduce the slowdown. > > Unfortunately the results I'm getting don't really make sense as removing the > > optimisation brings a 15% speedup, and with a few more runs I can see that I > > have about 25% noise, so there isn't much I can do to help. > > Heh, I had that same experience, it actually got faster without the > optimization, but then got lost in the noise on further testing. Ah, so it's not just my machine :) > Looking back at those discussions, I don't think those old test results are > relevant anymore. In the patch that was being tested there, > pg_newlocale_from_collation(), did not contain > > if (collid == DEFAULT_COLLATION_OID) > return (pg_locale_t) 0; > > so the default collation actually went through most or all of the function > and did a lot of work. That would understandably be quite slow. But just > calling a function and returning immediately should not be a problem. > Otherwise, the call to check_collation_set() in varstr_cmp() and elsewhere > would be just as bad. I didn't noticed that. That definitely explain why the performance concern isn't valid anymore. > So, unless there are concerns, I'm going to see about making a patch to call > pg_newlocale_from_collation() even with the default collation. That would > make the actual feature patch quite a bit smaller, since we won't have to > patch every call site of pg_newlocale_from_collation(). +1 for me!
On 07.01.22 10:03, Julien Rouhaud wrote: >> I changed the datcollate, datctype, and the new daticucoll fields to type >> text (from name). That way, the daticucoll field can be set to null if it's >> not applicable. Also, the limit of 63 characters can actually be a problem >> if you want to use some combination of the options that ICU locales offer. >> And for less extreme uses, having variable-length fields will save some >> storage, since typical locale names are much shorter. > > I understand the need to have daticucoll as text, however it's not clear to me > why this has to be changed for datcollate and datctype? IIUC those will only > ever contain libc-based collation and are still mandatory? Right. I just did this for consistency. It would be strange otherwise to have some fields as name and some as text. Arguably, using "name" here was wrong to begin with, since they are not really object names. Maybe there used to be a reason to avoid variable-length fields in pg_database, but there isn't one now AFAICT. > - pg_upgrade > > It checks (in check_locale_and_encoding()) the compatibility for each database, > and it looks like the daticucoll field should also be verified. Other than > that I don't think there is anything else needed for the pg_upgrade part as > everything else should be handled by pg_dump (I didn't look at the changes yet > given the below problems). Ok, I have added this and will include it in my next patch submission. > - CREATE DATABASE > - initdb Ok, some work is needed to make these interfaces behave sensibly in various combinations. I will look into that.
On 10.01.22 12:49, Daniel Verite wrote: > With the current patch, it's not possible, AFAICS, because the user > can't tell that the collation is non-deterministic. Presumably this > would require another option to CREATE DATABASE and another > column to store that bit of information. Adding this would be easy, but since pattern matching currently does not support nondeterministic collations, if you make a global collation nondeterministic, a lot of system views, psql, pg_dump queries etc. would break, so it's not practical. I view this is an orthogonal project. Once we can support this without breaking system views etc., then it's easy to enable with a new column in pg_database. > The "daticucol" column also suggests that we don't expect to add > other collation providers in the future. Maybe a pair of columns like > (datcollprovider, datcolllocale) would be more future-proof, > or a (datcollprovider, datcolllocale, datcollisdeterministic) > triplet if non-deterministic collations are allowed. I don't expect many new collation providers. So I don't think an EAV-like storage would be helpful. The other problem is that we don't know what we need. For example, the libc provider needs both a collate and a ctype value, so that wouldn't fit into that scheme nicely. > Also, pg_collation has "collversion" to detect a mismatch between > the ICU runtime and existing indexes. I don't see that field > for the db-wide ICU collation, so maybe we currently miss the capability > to detect the mismatch for the db-wide collation? Yeah, I think I need to add a datcollversion field and the associated checks.
Julien Rouhaud wrote: > > I guess there's still the possibility of requiring that the ICU db-wide > > collation of the new database does exist in the template database, > > and then the CREATE DATABASE would refer to that collation instead of > > an independent locale string. > > That could work. However if having the collation in the template database a > strict requirement the something should also be done for initdb, and it will > probably be a bigger problem. If CREATE DATABASE referred to a collation in the template db, either that collation already exists, or the user would have to add it to the template db with CREATE COLLATION. initdb already populates the template databases with a full set of ICU collations through pg_import_system_collations(). I don't quite see what change you're seeing that would be needed in initdb. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
On Tue, Jan 11, 2022 at 12:36:46PM +0100, Daniel Verite wrote: > > If CREATE DATABASE referred to a collation in the template db, > either that collation already exists, or the user would have to add it > to the template db with CREATE COLLATION. > initdb already populates the template databases with a full set of > ICU collations through pg_import_system_collations(). > I don't quite see what change you're seeing that would be needed in > initdb. Yes, there are already the system collation imported. But you still need to make sure that that collation does exist in the template database, and it's still impossible to connect to that database to check when processing the CREATE DATABASE. Also, if the wanted collation wasn't imported by pg_import_system_collations() and isn't the server's default collation, then the user would have to allow connection on the template0 database and create the wanted collation there. It doesn't seem like something that should be recommended for a reasonably standard use case.
On 11.01.22 12:08, Julien Rouhaud wrote: >> So, unless there are concerns, I'm going to see about making a patch to call >> pg_newlocale_from_collation() even with the default collation. That would >> make the actual feature patch quite a bit smaller, since we won't have to >> patch every call site of pg_newlocale_from_collation(). > +1 for me! Here is that patch. If this is applied, then in my estimation all these hunks will completely disappear from the global ICU patch. So this would be a significant win.
Attachment
Re: >> After this patch goes in, the big next thing would be to support >> nondeterministic collations for LIKE, ILIKE and pattern matching operators in >> general. Is anyone interested in working on that? > As far as I know you're the last person that seemed to be working on that topic > back in March :) I have a solution for LIKE and ILIKE for case-insensitive, accent-sensitive ICU collations and the UTF8 server encoding,but didn't attempt to address the general case until ICU collations were supported at the database and cluster levels. I may have another look at that after this patch goes in.
On 10.01.22 12:49, Daniel Verite wrote: > I think some users would want their db-wide ICU collation to be > case/accent-insensitive. ... > IIRC, that was the context for some questions where people were > enquiring about db-wide ICU collations. +1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default collation, a.k.a. the "global collation", asdistinct from the "db-wide" database-level default collation, which controls the default type of the collatable types withinthat database. > With the current patch, it's not possible, AFAICS, because the user > can't tell that the collation is non-deterministic. Presumably this > would require another option to CREATE DATABASE and another > column to store that bit of information. On 1/11/22, 6:24 AM, "Peter Eisentraut" <peter.eisentraut@enterprisedb.com> wrote: > Adding this would be easy, but since pattern matching currently does not > support nondeterministic collations, if you make a global collation > nondeterministic, a lot of system views, psql, pg_dump queries etc. > would break, so it's not practical. I view this is an orthogonal > project. Once we can support this without breaking system views etc., > then it's easy to enable with a new column in pg_database. So this patch only enables the default cluster collation (DEFAULT_COLLATION_OID) to be a deterministic ICU collation, butdoesn't extend the metadata in a way that would enable database collations to be ICU collations? Waiting for the pattern matching problem to be solved in general before creating the metadata to support ICU collations everywherewill make it more difficult for members of the community to help solve the pattern matching problem. What additional metadata changes would be required to enable an ICU collation to be specified at either the cluster-levelor the database-level, even if new checks need to be added to disallow a nondeterministic collation to be specifiedat the cluster level for now?
Hi, On Mon, Jan 17, 2022 at 07:07:38PM +0000, Finnerty, Jim wrote: > On 10.01.22 12:49, Daniel Verite wrote: > > > I think some users would want their db-wide ICU collation to be > > case/accent-insensitive. > ... > > IIRC, that was the context for some questions where people were > > enquiring about db-wide ICU collations. > > +1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default > collation, a.k.a. the "global collation", as distinct from the "db-wide" > database-level default collation, which controls the default type of the > collatable types within that database. There's no cluster-level default collation, and DEFAULT_COLLATION_OID is always database-level (and template1 having a specific default collation). The template0 database is there to be able to support different databases with different default collations, among other things. So this patchset would allow per-database default ICU collation, although not non-deterministic ones.
Hi, On Thu, Jan 13, 2022 at 09:39:42AM +0100, Peter Eisentraut wrote: > On 11.01.22 12:08, Julien Rouhaud wrote: > > > So, unless there are concerns, I'm going to see about making a patch to call > > > pg_newlocale_from_collation() even with the default collation. That would > > > make the actual feature patch quite a bit smaller, since we won't have to > > > patch every call site of pg_newlocale_from_collation(). > > +1 for me! > > Here is that patch. The patch is quite straightforward so I don't have much to say, it all looks good and passes all regression tests. > If this is applied, then in my estimation all these hunks will completely > disappear from the global ICU patch. So this would be a significant win. Agreed, the patch will be quite smaller and also easier to review. Are you planning to apply it on its own or add it to the default ICU patchset? Given the possible backpatch conflicts it can bring I'm not sure it's worthy enough to apply on its own.
On 18.01.22 05:02, Julien Rouhaud wrote: >> If this is applied, then in my estimation all these hunks will completely >> disappear from the global ICU patch. So this would be a significant win. > Agreed, the patch will be quite smaller and also easier to review. Are you > planning to apply it on its own or add it to the default ICU patchset? My plan is to apply this patch in the next few days.
On 18.01.22 13:54, Peter Eisentraut wrote: > On 18.01.22 05:02, Julien Rouhaud wrote: >>> If this is applied, then in my estimation all these hunks will >>> completely >>> disappear from the global ICU patch. So this would be a significant >>> win. >> Agreed, the patch will be quite smaller and also easier to review. >> Are you >> planning to apply it on its own or add it to the default ICU patchset? > > My plan is to apply this patch in the next few days. This patch has been committed. Here is a second preparation patch: Change collate and ctype fields to type text. I think this is useful by itself because it allows longer locale names. ICU locale names with several options appended can be longer than 63 bytes. This case is probably broken right now. Also, it saves space in the typical case, since most locale names are not anywhere near 63 bytes.
Attachment
Hi, On Fri, Jan 21, 2022 at 10:44:24AM +0100, Peter Eisentraut wrote: > > Here is a second preparation patch: Change collate and ctype fields to type > text. > > I think this is useful by itself because it allows longer locale names. ICU > locale names with several options appended can be longer than 63 bytes. > This case is probably broken right now. Also, it saves space in the typical > case, since most locale names are not anywhere near 63 bytes. I totally agree. > From 1c46bf3138ad42074971aa3130142236de7e63f7 Mon Sep 17 00:00:00 2001 > From: Peter Eisentraut <peter@eisentraut.org> > Date: Fri, 21 Jan 2022 10:01:25 +0100 > Subject: [PATCH] Change collate and ctype fields to type text + collversionstr = TextDatumGetCString(datum); + actual_versionstr = get_collation_actual_version(collform->collprovider, collcollate); if (!actual_versionstr) { @@ -1606,7 +1616,6 @@ pg_newlocale_from_collation(Oid collid) (errmsg("collation \"%s\" has no actual version, but a version was specified", NameStr(collform->collname)))); } - collversionstr = TextDatumGetCString(collversion); Is that change intended? There isn't any usage of the collversionstr before the possible error when actual_versionstr is missing. Apart from that the patch looks good to me, all tests pass and no issue with building the doc either.
On 21.01.22 14:51, Julien Rouhaud wrote: >> From 1c46bf3138ad42074971aa3130142236de7e63f7 Mon Sep 17 00:00:00 2001 >> From: Peter Eisentraut <peter@eisentraut.org> >> Date: Fri, 21 Jan 2022 10:01:25 +0100 >> Subject: [PATCH] Change collate and ctype fields to type text > > + collversionstr = TextDatumGetCString(datum); > + > actual_versionstr = get_collation_actual_version(collform->collprovider, collcollate); > if (!actual_versionstr) > { > @@ -1606,7 +1616,6 @@ pg_newlocale_from_collation(Oid collid) > (errmsg("collation \"%s\" has no actual version, but a version was specified", > NameStr(collform->collname)))); > } > - collversionstr = TextDatumGetCString(collversion); > > > Is that change intended? There isn't any usage of the collversionstr before > the possible error when actual_versionstr is missing. I wanted to move it closer to the SysCacheGetAttr() where the "datum" value is obtained. It seemed weird to get the datum, then do other things, then decode the datum.
On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote: > On 21.01.22 14:51, Julien Rouhaud wrote: > > Is that change intended? There isn't any usage of the collversionstr before > > the possible error when actual_versionstr is missing. > > I wanted to move it closer to the SysCacheGetAttr() where the "datum" value > is obtained. It seemed weird to get the datum, then do other things, then > decode the datum. Oh ok. It won't make much difference performance-wise, so no objection.
On 21.01.22 17:13, Julien Rouhaud wrote: > On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote: >> On 21.01.22 14:51, Julien Rouhaud wrote: >>> Is that change intended? There isn't any usage of the collversionstr before >>> the possible error when actual_versionstr is missing. >> >> I wanted to move it closer to the SysCacheGetAttr() where the "datum" value >> is obtained. It seemed weird to get the datum, then do other things, then >> decode the datum. > > Oh ok. It won't make much difference performance-wise, so no objection. I have committed this and will provide follow-up patches in the next few days.
On 27.01.22 09:10, Peter Eisentraut wrote: > On 21.01.22 17:13, Julien Rouhaud wrote: >> On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote: >>> On 21.01.22 14:51, Julien Rouhaud wrote: >>>> Is that change intended? There isn't any usage of the >>>> collversionstr before >>>> the possible error when actual_versionstr is missing. >>> >>> I wanted to move it closer to the SysCacheGetAttr() where the "datum" >>> value >>> is obtained. It seemed weird to get the datum, then do other things, >>> then >>> decode the datum. >> >> Oh ok. It won't make much difference performance-wise, so no objection. > > I have committed this and will provide follow-up patches in the next few > days. Here is the main patch rebased on the various changes that have been committed in the meantime. There is still some work to be done on the user interfaces of initdb, createdb, etc. I have split out the database-level collation version tracking into a separate patch [0]. I think we should get that one in first and then refresh this one. [0]: https://www.postgresql.org/message-id/flat/f0ff3190-29a3-5b39-a179-fa32eee57db6%40enterprisedb.com
Attachment
On 02.02.22 14:01, Peter Eisentraut wrote: > Here is the main patch rebased on the various changes that have been > committed in the meantime. There is still some work to be done on the > user interfaces of initdb, createdb, etc. > > I have split out the database-level collation version tracking into a > separate patch [0]. I think we should get that one in first and then > refresh this one. All that preliminary work has been completed, so here is a new patch. There isn't actually much left here now except all the new DDL and command-line options to set this up and documentation for those. I have given all that another review and I hope it's more intuitive now, but I guess there will be other opinions. I have changed the terminology a bit to match ICU better. It's now called "ICU locale ID" and "locale provider" (instead of "collation"). It might actually cover things that are not strictly collations (such as the isalpha stuff in text search, maybe, in the future). One thing that is left that bothers me is that the invocations of get_collation_actual_version() have all gotten quite complicated. I'm thinking about ways to refactor that, but I haven't got a great idea yet.
Attachment
Hi, On Wed, Feb 16, 2022 at 03:25:40PM +0100, Peter Eisentraut wrote: > > All that preliminary work has been completed, so here is a new patch. > > There isn't actually much left here now except all the new DDL and > command-line options to set this up and documentation for those. I have > given all that another review and I hope it's more intuitive now, but I > guess there will be other opinions. Sorry it took me a bit of time to come back to this patch. TL;DR it works as expected. I just have a few comments, mostly on the doc. I say it works because I did manually check, as far as I can see there isn't any test that ensures it. I'm using this naive scenario: DROP DATABASE IF EXISTS dbicu; CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0'; \c dbicu CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst); INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B'); SELECT def AS def FROM icu ORDER BY def; SELECT def AS en FROM icu ORDER BY en; SELECT def AS upfirst FROM icu ORDER BY upfirst; SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst; SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu"; Maybe there should be some test along those lines included in the patch? > I have changed the terminology a bit to match ICU better. It's now called > "ICU locale ID" and "locale provider" (instead of "collation"). It might > actually cover things that are not strictly collations (such as the isalpha > stuff in text search, maybe, in the future). I'm not sure that's actually such an improvement as-is. Simply saying "ICU locale ID" is, at least for me, somewhat ambiguous as I don't see any place in our docs where it's clearly defined. I'm afraid that users might confuse it with the OID of a pg_collation line, or even a collation name (like en-x-icu), especially since there's no simple way to know if what you used means what you thought it meant. Also, it's not even used consistently in the patch. I can see e.g. "ICU locale" or "ICU locale setting" being used: + <varlistentry> + <term><replaceable class="parameter">icu_locale</replaceable></term> + <listitem> + <para> + Specifies the ICU locale if the ICU locale provider is used. If + this is not specified, the value from the <literal>LOCALE</literal> + option is used. + </para> + </listitem> + </varlistentry> + <term><option>--icu-locale=<replaceable class="parameter">locale</replaceable></option></term> + <listitem> + <para> + Specifies the ICU locale setting to be used in this database, if the + ICU locale provider is selected. + </para> Maybe we could point to the ICU documentation for a clear notion of what a locale ID is, and/or use their own short definition [1]: > The locale object in ICU is an identifier that specifies a particular locale > and has fields for language, country, and an optional code to specify further > variants or subdivisions. These fields also can be represented as a string > with the fields separated by an underscore. It seems critical to be crystal clear about what should be specified there. I spent some time looking at the ICU api trying to figure out if using a posix locale name (e.g. en_US) was actually compatible with an ICU locale name. It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the same locale, but I might be wrong. I also didn't find a way to figure out how to ask ICU if the locale identifier passed is complete garbage or not. One sure thing is that the system collation we import are of the form 'en-us', so it seems weird to have this form in pg_collation and by default another form in pg_database. All in all I'm a bit worried of having the icu_locale optional. Note that this is inconsistent with createdb(), as there's at least one code path where the icu_locale is not optional: + if (dblocprovider == COLLPROVIDER_ICU) + { + /* + * This would happen if template0 uses the libc provider but the new + * database uses icu. + */ + if (!dbiculocale) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("ICU locale must be specified"))); + } > > One thing that is left that bothers me is that the invocations of > get_collation_actual_version() have all gotten quite complicated. I'm > thinking about ways to refactor that, but I haven't got a great idea yet. Indeed, and I don't have a great idea either. Maybe get_collation_actual_version_extended that would accept both strings? In CREATE DATABASE manual: + Specifies the provider to use for the default collation in this + database. Possible values are: + <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm> + <literal>libc</literal>. <literal>libc</literal> is the default. The + available choices depend on the operating system and build options. That's actually not true as pg_strcasecmp is used in createdb(): + if (pg_strcasecmp(locproviderstr, "icu") == 0) + dblocprovider = COLLPROVIDER_ICU; + else if (pg_strcasecmp(locproviderstr, "libc") == 0) + dblocprovider = COLLPROVIDER_LIBC; + else + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("unrecognized locale provider: %s", + locproviderstr))); By extension that's the same for createdb, but createdb manual doesn't explicitly mention that case is important so I guess that's ok. - To alter the default collation order or character set classes, use the - <option>--lc-collate</option> and <option>--lc-ctype</option> options. - Collation orders other than <literal>C</literal> or <literal>POSIX</literal> also have - a performance penalty. For these reasons it is important to choose the - right locale when running <command>initdb</command>. + To choose a different locale for the cluster, use the option + <option>--locale</option>. There are also individual options + <option>--lc-*</option> (see below) to set values for the individual locale + categories. Note that inconsistent settings for different locale + categories can give nonsensical results, so this should be used with care. Unless I'm missing something you entirely removed the warninng about the performance penalty when using non C/POSIX locale? + To chose the specific ICU locale ID to apply, use the option + <option>--icu-locale</option>. The ICU locale ID defaults to + <option>--locale</option> or the environment, as above (with some name + mangling applied to make the locale naming appropriate for ICU). Note that + for implementation reasons and to support legacy code, + <command>initdb</command> will still select and initialize libc locale + settings when the ICU locale provider is used. initdb has some specific processing to transform the default libc locale to something more appropriate, but as far as I can see creatdb / CREATE DATABASE aren't doing that. It seems inconsistent, and IMHO another reason why defaulting to the libc locale looks like a bad idea. While on that topic, the doc should probably mention that default ICU collations can only be deterministic. @@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e errmsg("collation \"default\" cannot be copied"))); } - if (localeEl) - { - collcollate = defGetString(localeEl); - collctype = defGetString(localeEl); - } [...] I tried to read the function and quickly got confused about whether possible problematic conditions could be reached or not and had protection or not. I think that DefineCollation is becoming more and more unreadable, with a mix of fromEl, !fromEl and either. Maybe it would be a good time to refactor the code a bit and have have something like if (fromEl) { // initialize all specific things } else { // initialize all specific things } // handle defaults and sanity checks What do you think? Also unrelated, but how about raising a warning about possibly hiding corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ... COLLATION_VERSION if !IsBinaryUpgrade()? in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl() and pg_database_collation_actual_version(): - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull); - Assert(!isnull); - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale :Anum_pg_collation_collcollate, &isnull); + if (!isnull) + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); + else + newversion = NULL; The columns are now nullable, but can you actually end up with a null locale name in the expected field without manual DML on the catalog, corruption or similar? I think it should be a plain error explaining the inconsistency rather then silently assuming there's no version. Note that at least pg_newlocale_from_collation() asserts that the specific libc/icu field it's interested in isn't null. + else if (strcmp(defel->defname, "icu_locale") == 0) + { + if (diculocale) + errorConflictingDefElem(defel, pstate); + diculocale = defel; + } + else if (strcmp(defel->defname, "locale_provider") == 0) + { + if (dlocprovider) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"), + parser_errposition(pstate, defel->location))); + dlocprovider = defel; + } Why not using errorConflictingDefElem for locale_provider? in createdb(): + if (dblocprovider == COLLPROVIDER_ICU) + { + /* + * This would happen if template0 uses the libc provider but the new + * database uses icu. + */ + if (!dbiculocale) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("ICU locale must be specified"))); + } + + if (dblocprovider == COLLPROVIDER_ICU) + { +#ifdef USE_ICU [...] Seems like a refactoring leftover, no need for two blocks. Also, I think it would be better to first error out if there's no support for ICU rather than complaining about a possibly missing locale that wouldn't be accepted anyway. +void +make_icu_collator(const char *iculocstr, + struct pg_locale_struct *resultp) +{ +#ifdef USE_ICU +[...] + /* We will leak this string if we get an error below :-( */ + resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr); + resultp->info.icu.ucol = collator; +#else /* not USE_ICU */ + /* could get here if a collation was created by a build with ICU */ + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("ICU is not supported in this build"), \ + errhint("You need to rebuild PostgreSQL using %s.", "--with-icu"))); +#endif /* not USE_ICU */ +} The comment about the leak possibility needs some tweaking since it's been extracted from a larger function. Not material for this patch, but: @@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid) Assert(OidIsValid(collid)); if (collid == DEFAULT_COLLATION_OID) - return (pg_locale_t) 0; + { + if (default_locale.provider == COLLPROVIDER_ICU) + return &default_locale; + else + return (pg_locale_t) 0; + } I'm wondering if we could now always return &default_locale and avoid having to check if the function returned something in all callers, since CheckMyDatabase now initialize it. @@ -2184,6 +2195,50 @@ setlocales(void) check_locale_name(LC_CTYPE, lc_messages, &canonname); lc_messages = canonname; #endif + + if (locale_provider == COLLPROVIDER_ICU) + { + if (!icu_locale && locale) + icu_locale = locale; + + /* + * If ICU is selected but no ICU locale has been given, take the [...] + /* + * Check ICU locale name + */ +#ifdef USE_ICU + { + UErrorCode status; + + status = U_ZERO_ERROR; + ucol_open(icu_locale, &status); + if (U_FAILURE(status)) + { + pg_log_error("could not open collator for locale \"%s\": %s", + icu_locale, u_errorName(status)); + exit(1); + } + } +#else + pg_log_error("ICU is not supported in this build"); + fprintf(stderr, _("You need to rebuild PostgreSQL using %s.\n"), "--with-icu"); + exit(1); +#endif + } Shouldn't all the code before checking the locale name also be in the #ifdef USE_ICU? Also, the comment should be consistent with the doc and mention ICU locale ID. @@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout) appendPQExpBufferStr(creaQry, " ENCODING = "); appendStringLiteralAH(creaQry, encoding, fout); } + if (strlen(datlocprovider) > 0) + { + appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = "); + if (datlocprovider[0] == 'c') + appendPQExpBufferStr(creaQry, "libc"); + else if (datlocprovider[0] == 'i') + appendPQExpBufferStr(creaQry, "icu"); + else + fatal("unrecognized locale provider: %s", + datlocprovider); + } if (strlen(collate) > 0 && strcmp(collate, ctype) == 0) { AFAICS datlocprovider shouldn't be empty. Maybe raise an error or an assert if that's the case? # CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu"; ERROR: 22023: new locale provider (i) does not match locale provider of the template database (c) HINT: Use the same locale provider as in the template database, or use template0 as template. It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the error message. I think it would be better to emit the user-facing value, not internal one. Finally, there are some changes in pg_collation, I'm assuming it's done for consistency since pg_database may need two different locale information, but it should probably be at least mentioned in the commit message. [1]: https://unicode-org.github.io/icu/userguide/locale/#:~:text=The%20locale%20object%20in%20ICU,fields%20separated%20by%20an%20underscore.
On 05.03.22 09:38, Julien Rouhaud wrote: > @@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e > errmsg("collation \"default\" cannot be copied"))); > } > > - if (localeEl) > - { > - collcollate = defGetString(localeEl); > - collctype = defGetString(localeEl); > - } > [...] > > I tried to read the function and quickly got confused about whether possible > problematic conditions could be reached or not and had protection or not. I > think that DefineCollation is becoming more and more unreadable, with a mix of > fromEl, !fromEl and either. Maybe it would be a good time to refactor the code > a bit and have have something like > > if (fromEl) > { > // initialize all specific things > } > else > { > // initialize all specific things > } > > // handle defaults and sanity checks > > What do you think? How about the attached?
Attachment
On Thu, Mar 10, 2022 at 10:52:41AM +0100, Peter Eisentraut wrote: > On 05.03.22 09:38, Julien Rouhaud wrote: > > @@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e > > errmsg("collation \"default\" cannot be copied"))); > > } > > > > - if (localeEl) > > - { > > - collcollate = defGetString(localeEl); > > - collctype = defGetString(localeEl); > > - } > > [...] > > > > I tried to read the function and quickly got confused about whether possible > > problematic conditions could be reached or not and had protection or not. I > > think that DefineCollation is becoming more and more unreadable, with a mix of > > fromEl, !fromEl and either. Maybe it would be a good time to refactor the code > > a bit and have have something like > > > > if (fromEl) > > { > > // initialize all specific things > > } > > else > > { > > // initialize all specific things > > } > > > > // handle defaults and sanity checks > > > > What do you think? > > How about the attached? Thanks! That's exactly what I had in mind. I think it's easier to follow and make sure of what it's doing exactly, so +1 for it.
On 05.03.22 09:38, Julien Rouhaud wrote: > I say it works because I did manually check, as far as I can see there isn't > any test that ensures it. > > I'm using this naive scenario: > > DROP DATABASE IF EXISTS dbicu; > CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0'; > \c dbicu > CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); > CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst); > INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B'); > SELECT def AS def FROM icu ORDER BY def; > SELECT def AS en FROM icu ORDER BY en; > SELECT def AS upfirst FROM icu ORDER BY upfirst; > SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst; > SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu"; > > Maybe there should be some test along those lines included in the patch? I added something like this to a new test suite under src/test/icu/. (src/test/locale/ was already used for something else.) > Also, it's not even used consistently in the patch. I can see e.g. "ICU > locale" or "ICU locale setting" being used: I have fixed those. > Maybe we could point to the ICU documentation for a clear notion of what a > locale ID is, and/or use their own short definition [1]: > >> The locale object in ICU is an identifier that specifies a particular locale >> and has fields for language, country, and an optional code to specify further >> variants or subdivisions. These fields also can be represented as a string >> with the fields separated by an underscore. I think the Localization chapter needs to be reorganized a bit, but I'll leave that for a separate patch. > I spent some time looking at the ICU api trying to figure out if using a > posix locale name (e.g. en_US) was actually compatible with an ICU locale name. > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the > same locale, but I might be wrong. I also didn't find a way to figure out how > to ask ICU if the locale identifier passed is complete garbage or not. One > sure thing is that the system collation we import are of the form 'en-us', so > it seems weird to have this form in pg_collation and by default another form in > pg_database. Yeah it seems to be inconsistent about that. The locale ID documentation appears to indicate that "en_US" is the canonical form, but when you ask it to list all the locales it knows about it returns "en-US". > In CREATE DATABASE manual: > > + Specifies the provider to use for the default collation in this > + database. Possible values are: > + <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm> > + <literal>libc</literal>. <literal>libc</literal> is the default. The > + available choices depend on the operating system and build options. > > That's actually not true as pg_strcasecmp is used in createdb(): > > + if (pg_strcasecmp(locproviderstr, "icu") == 0) > + dblocprovider = COLLPROVIDER_ICU; > + else if (pg_strcasecmp(locproviderstr, "libc") == 0) > + dblocprovider = COLLPROVIDER_LIBC; > + else > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), > + errmsg("unrecognized locale provider: %s", > + locproviderstr))); I don't understand what the concern is here. > Unless I'm missing something you entirely removed the warninng about the > performance penalty when using non C/POSIX locale? Yeah, I think it is not the job of the initdb man page to lecture people about the merits of their locale choice. The same performance warning can also be found in the localization chapter; we don't need to repeat it everywhere a locale choice is mentioned. > initdb has some specific processing to transform the default libc locale to > something more appropriate, but as far as I can see creatdb / CREATE DATABASE > aren't doing that. It seems inconsistent, and IMHO another reason why > defaulting to the libc locale looks like a bad idea. This has all been removed. The separate ICU locale option should now be required everywhere (initdb, createdb, CREATE DATABASE). > While on that topic, the doc should probably mention that default ICU > collations can only be deterministic. Well, there is no option to do otherwise, so I'm not sure where/how to mention that. We usually don't document options that don't exist. ;-) > Also unrelated, but how about raising a warning about possibly hiding > corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ... > COLLATION_VERSION if !IsBinaryUpgrade()? Hmm, there is a point to that. But then we should just make that an error. Otherwise, we raise the warning but then there is no way to "fix" what was warned about. Seems confusing. > in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl() > and pg_database_collation_actual_version(): > > - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull); > - Assert(!isnull); > - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); > + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull); > + if (!isnull) > + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); > + else > + newversion = NULL; > > The columns are now nullable, but can you actually end up with a null locale > name in the expected field without manual DML on the catalog, corruption or > similar? I think it should be a plain error explaining the inconsistency > rather then silently assuming there's no version. Note that at least > pg_newlocale_from_collation() asserts that the specific libc/icu field it's > interested in isn't null. This is required because the default collations's fields are null now. > Why not using errorConflictingDefElem for locale_provider? fixed > in createdb(): > > + if (dblocprovider == COLLPROVIDER_ICU) > + { > + /* > + * This would happen if template0 uses the libc provider but the new > + * database uses icu. > + */ > + if (!dbiculocale) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("ICU locale must be specified"))); > + } > + > + if (dblocprovider == COLLPROVIDER_ICU) > + { > +#ifdef USE_ICU > [...] > > Seems like a refactoring leftover, no need for two blocks. These are two independent blocks in my mind. It's possible that someone might want to insert something in the middle at some point. > Also, I think it > would be better to first error out if there's no support for ICU rather than > complaining about a possibly missing locale that wouldn't be accepted anyway. Seems better to do general syntax checks first and then deeper checks of the passed values. > +void > +make_icu_collator(const char *iculocstr, > + struct pg_locale_struct *resultp) > +{ > +#ifdef USE_ICU > +[...] > + /* We will leak this string if we get an error below :-( */ > + resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr); > + resultp->info.icu.ucol = collator; > +#else /* not USE_ICU */ > + /* could get here if a collation was created by a build with ICU */ > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("ICU is not supported in this build"), \ > + errhint("You need to rebuild PostgreSQL using %s.", "--with-icu"))); > +#endif /* not USE_ICU */ > +} > > The comment about the leak possibility needs some tweaking since it's been > extracted from a larger function. fixed > Not material for this patch, but: > @@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid) > Assert(OidIsValid(collid)); > > if (collid == DEFAULT_COLLATION_OID) > - return (pg_locale_t) 0; > + { > + if (default_locale.provider == COLLPROVIDER_ICU) > + return &default_locale; > + else > + return (pg_locale_t) 0; > + } > > I'm wondering if we could now always return &default_locale and avoid having to > check if the function returned something in all callers, since CheckMyDatabase > now initialize it. Maybe that's something to look into, but that would probably require updating a call callers to handle the return values differently, which would require quite a bit of work. > Shouldn't all the code before checking the locale name also be in the #ifdef > USE_ICU? I like to keep the #ifdef sections as small as possible and limited to the code that really uses the respective library. > @@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout) > appendPQExpBufferStr(creaQry, " ENCODING = "); > appendStringLiteralAH(creaQry, encoding, fout); > } > + if (strlen(datlocprovider) > 0) > + { > + appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = "); > + if (datlocprovider[0] == 'c') > + appendPQExpBufferStr(creaQry, "libc"); > + else if (datlocprovider[0] == 'i') > + appendPQExpBufferStr(creaQry, "icu"); > + else > + fatal("unrecognized locale provider: %s", > + datlocprovider); > + } > if (strlen(collate) > 0 && strcmp(collate, ctype) == 0) > { > > AFAICS datlocprovider shouldn't be empty. Maybe raise an error or an assert if > that's the case? Yeah that was bogus, copied from the earlier encoding handling, which is probably also bogus, but I'm not touching that here. > # CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu"; > ERROR: 22023: new locale provider (i) does not match locale provider of the template database (c) > HINT: Use the same locale provider as in the template database, or use template0 as template. > > It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the > error message. I think it would be better to emit the user-facing value, not > internal one. Fixed. I had added a collprovider_name() function but didn't use it here. > Finally, there are some changes in pg_collation, I'm assuming it's done for > consistency since pg_database may need two different locale information, but it > should probably be at least mentioned in the commit message. done
Attachment
On Mon, Mar 14, 2022 at 8:51 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > [ new patches ] I'm not very knowledgeable about this topic, but to me, it seems confusing to think of having both a libc collation and an ICU collation associated with a database. I have two main questions: 1. What will happen if I set the ICU collation to something that doesn't match the libc collation? How bad are the consequences? 2. If I want to avoid a mismatch between the two, then I will need a way to figure out which libc collation corresponds to a given ICU collation. How do I do that? I have a sneaking suspicion that I'm not going to like the answer to question #2 very much, but maybe that's life. I feel like this whole area is far too full of magical things. True practitioners know that if you utter some mysterious incantation to the Lords of ICU, you can get exactly the behavior you want. And ... everyone else has no idea what to do. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Mar 14, 2022 at 01:50:50PM +0100, Peter Eisentraut wrote: > On 05.03.22 09:38, Julien Rouhaud wrote: > > I say it works because I did manually check, as far as I can see there isn't > > any test that ensures it. > > > > I'm using this naive scenario: > > > > DROP DATABASE IF EXISTS dbicu; > > CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0'; > > \c dbicu > > CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); > > CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst); > > INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B'); > > SELECT def AS def FROM icu ORDER BY def; > > SELECT def AS en FROM icu ORDER BY en; > > SELECT def AS upfirst FROM icu ORDER BY upfirst; > > SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst; > > SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu"; > > > > Maybe there should be some test along those lines included in the patch? > > I added something like this to a new test suite under src/test/icu/. > (src/test/locale/ was already used for something else.) Great, thanks! > > > The locale object in ICU is an identifier that specifies a particular locale > > > and has fields for language, country, and an optional code to specify further > > > variants or subdivisions. These fields also can be represented as a string > > > with the fields separated by an underscore. > > I think the Localization chapter needs to be reorganized a bit, but I'll > leave that for a separate patch. WFM. > > I spent some time looking at the ICU api trying to figure out if using a > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name. > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the > > same locale, but I might be wrong. I also didn't find a way to figure out how > > to ask ICU if the locale identifier passed is complete garbage or not. One > > sure thing is that the system collation we import are of the form 'en-us', so > > it seems weird to have this form in pg_collation and by default another form in > > pg_database. > > Yeah it seems to be inconsistent about that. The locale ID documentation > appears to indicate that "en_US" is the canonical form, but when you ask it > to list all the locales it knows about it returns "en-US". Yeah I saw that too when checking is POSIX locale names were valid, and that's not great. > > > In CREATE DATABASE manual: > > > > + Specifies the provider to use for the default collation in this > > + database. Possible values are: > > + <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm> > > + <literal>libc</literal>. <literal>libc</literal> is the default. The > > + available choices depend on the operating system and build options. > > > > That's actually not true as pg_strcasecmp is used in createdb(): > > I don't understand what the concern is here. Ah sorry I missed the <indexterm>, I thought the possible values listed were icu, ICU and libc. Shouldn't the ICU <indexterm> be before the icu <literal> rather than the libc, or at least before the comma? > > Yeah, I think it is not the job of the initdb man page to lecture people > about the merits of their locale choice. The same performance warning can > also be found in the localization chapter; we don't need to repeat it > everywhere a locale choice is mentioned. Ah, I tried to look for another place where the same warning could be found but missed it. I'm fine with it then! > > While on that topic, the doc should probably mention that default ICU > > collations can only be deterministic. > > Well, there is no option to do otherwise, so I'm not sure where/how to > mention that. We usually don't document options that don't exist. ;-) Sure, but I'm afraid that users may still be tempted to use ICU locales like und-u-ks-level2 from the case_insensitive example in the doc and hope that it will work accordingly. Or maybe it's just me that still sees ICU as dark magic and want to be extra cautious. Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and lc_collate. Should we add one for icu_locale too? > > Also unrelated, but how about raising a warning about possibly hiding > > corruption if you use CREATE COLLATION ... VERSION or CREATE DATABASE ... > > COLLATION_VERSION if !IsBinaryUpgrade()? > > Hmm, there is a point to that. But then we should just make that an error. > Otherwise, we raise the warning but then there is no way to "fix" what was > warned about. Seems confusing. I was afraid that an error was a bit too much, but if that's acceptable I agree that it's better. > > in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl() > > and pg_database_collation_actual_version(): > > > > - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull); > > - Assert(!isnull); > > - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); > > + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull); > > + if (!isnull) > > + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); > > + else > > + newversion = NULL; > > > > The columns are now nullable, but can you actually end up with a null locale > > name in the expected field without manual DML on the catalog, corruption or > > similar? I think it should be a plain error explaining the inconsistency > > rather then silently assuming there's no version. Note that at least > > pg_newlocale_from_collation() asserts that the specific libc/icu field it's > > interested in isn't null. > > This is required because the default collations's fields are null now. Yes I saw that, but that's a specific exception. Detecting whether it's the DEFAULT_COLLATION_OID or not and raise an error when a null value isn't expected seems like it could be worthwhile. > > I'm wondering if we could now always return &default_locale and avoid having to > > check if the function returned something in all callers, since CheckMyDatabase > > now initialize it. > > Maybe that's something to look into, but that would probably require > updating a call callers to handle the return values differently, which would > require quite a bit of work. Agreed, I just wanted to mention it before forgetting about it. So apart from the few details mentioned above I'm happy with this patch! There might still be some more adjustments needed based on Robert's comment. At the very least it doesn't seem unreasonable to forbid a default ICU collation with a C/POSIX lc_ctype for instance.
On 14.03.22 19:57, Robert Haas wrote: > 1. What will happen if I set the ICU collation to something that > doesn't match the libc collation? How bad are the consequences? These are unrelated, so there are no consequences. > 2. If I want to avoid a mismatch between the two, then I will need a > way to figure out which libc collation corresponds to a given ICU > collation. How do I do that? You can specify the same name for both.
On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 14.03.22 19:57, Robert Haas wrote: > > 1. What will happen if I set the ICU collation to something that > > doesn't match the libc collation? How bad are the consequences? > > These are unrelated, so there are no consequences. Can you please elaborate on this? > > 2. If I want to avoid a mismatch between the two, then I will need a > > way to figure out which libc collation corresponds to a given ICU > > collation. How do I do that? > > You can specify the same name for both. Hmm. If every name were valid in both systems, I don't think you'd be proposing two fields. -- Robert Haas EDB: http://www.enterprisedb.com
Can we get some more consistent terminology around the term "locale"? In ICU, the "locale" is just the first part of what we can pass to the "locale" parameter in CREATE COLLATION - the partbefore the optional '@' delimiter. The ICU locale does not include the secondary or tertiary properties, so it is usuallyjust the country and the language, e.g. en_US (or en-US), but it can also be something like es_TRADITIONAL for traditionalSpanish. I think it would be an improvement in clarity if we consistently use the term 'locale' to mean the same thing that ICU meansby that term, and not to have the thing that we call the "locale" also include collation modifiers, or to imply thata locale is the same thing as a collation. /Jim
Julien Rouhaud wrote: > > > While on that topic, the doc should probably mention that default ICU > > > collations can only be deterministic. > > > > Well, there is no option to do otherwise, so I'm not sure where/how to > > mention that. We usually don't document options that don't exist. ;-) > > Sure, but I'm afraid that users may still be tempted to use ICU locales like > und-u-ks-level2 from the case_insensitive example in the doc and hope that > it will work accordingly. +1. The CREATE DATABASE doc says this currently: icu_locale Specifies the ICU locale ID if the ICU locale provider is used. ISTM that we need to say explicitly that this locale will be used by default to compare all collatable strings, except that it's overruled by a bytewise comparison to break ties in case of equality. The idea is to describe what the backend will do with the setting rather than saying that we don't have a nondeterministic option. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Finnerty, Jim wrote: > In ICU, the "locale" is just the first part of what we can pass to the > "locale" parameter in CREATE COLLATION - the part before the optional '@' > delimiter. The ICU locale does not include the secondary or tertiary > properties, Why not? Please see https://unicode-org.github.io/icu/userguide/locale It says that "a locale consists of one or more pieces of ordered information", the pieces being a language code, a script code, a country code, a variant code, and keywords. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 15.03.22 18:28, Robert Haas wrote: > On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: >> On 14.03.22 19:57, Robert Haas wrote: >>> 1. What will happen if I set the ICU collation to something that >>> doesn't match the libc collation? How bad are the consequences? >> >> These are unrelated, so there are no consequences. > > Can you please elaborate on this? The code that is aware of ICU generally works like this: if (locale_provider == ICU) result = call ICU code else result = call libc code return result However, there is code out there, both within PostgreSQL itself and in extensions, that does not do that yet. Ideally, we would eventually change all that over, but it's not happening now. So we ought to preserve the ability to set the libc to keep that legacy code working for now. This legacy code by definition doesn't know about ICU, so it doesn't care whether the ICU setting "matches" the libc setting or anything like that. It will just do its thing depending on its own setting. The only consequence of settings that don't match is that the different pieces of code behave semantically inconsistently (e.g., some routine thinks the data is Greek and other code thinks the data is French). But that's up to the user to set correctly. And the actual scenarios where you can actually do anything semantically relevant this way are very limited. A second point is that the LC_CTYPE setting tells other parts of libc what the current encoding is. This affects gettext for example. So you need to set this to something sensible even if you don't use libc locale routines otherwise. >>> 2. If I want to avoid a mismatch between the two, then I will need a >>> way to figure out which libc collation corresponds to a given ICU >>> collation. How do I do that? >> >> You can specify the same name for both. > > Hmm. If every name were valid in both systems, I don't think you'd be > proposing two fields. Earlier versions of this patch and predecessor patches indeed had common fields. But in fact the two systems accept different values if you want to delve into the advanced features. But for basic usage something like "en_US" will work for both.
On 15.03.22 08:41, Julien Rouhaud wrote: >>>> The locale object in ICU is an identifier that specifies a particular locale >>>> and has fields for language, country, and an optional code to specify further >>>> variants or subdivisions. These fields also can be represented as a string >>>> with the fields separated by an underscore. >> >> I think the Localization chapter needs to be reorganized a bit, but I'll >> leave that for a separate patch. > > WFM. I ended up writing a bit of content for that chapter. >>> While on that topic, the doc should probably mention that default ICU >>> collations can only be deterministic. >> >> Well, there is no option to do otherwise, so I'm not sure where/how to >> mention that. We usually don't document options that don't exist. ;-) > > Sure, but I'm afraid that users may still be tempted to use ICU locales like > und-u-ks-level2 from the case_insensitive example in the doc and hope that it > will work accordingly. Or maybe it's just me that still sees ICU as dark magic > and want to be extra cautious. I have added this to the CREATE DATABASE ref page. > Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and > lc_collate. Should we add one for icu_locale too? I'm not sure. I think the existing ones are more for backward compatibility with the time before it was settable per database. >>> in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl() >>> and pg_database_collation_actual_version(): >>> >>> - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull); >>> - Assert(!isnull); >>> - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); >>> + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull); >>> + if (!isnull) >>> + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); >>> + else >>> + newversion = NULL; >>> >>> The columns are now nullable, but can you actually end up with a null locale >>> name in the expected field without manual DML on the catalog, corruption or >>> similar? I think it should be a plain error explaining the inconsistency >>> rather then silently assuming there's no version. Note that at least >>> pg_newlocale_from_collation() asserts that the specific libc/icu field it's >>> interested in isn't null. >> >> This is required because the default collations's fields are null now. > > Yes I saw that, but that's a specific exception. Detecting whether it's the > DEFAULT_COLLATION_OID or not and raise an error when a null value isn't > expected seems like it could be worthwhile. I have fixed that as you suggest. > So apart from the few details mentioned above I'm happy with this patch! committed!
Hi, Thank you to all the developers. I found that the description of the pg_database.daticulocale column was not written in the documentation. The attached small patch adds a description of the daticulocale column to catalogs.sgml. Regards, Noriyoshi Shinoda -----Original Message----- From: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Sent: Thursday, March 17, 2022 7:29 PM To: Julien Rouhaud <rjuju123@gmail.com> Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; Daniel Verite <daniel@manitou-mail.org> Subject: Re: ICU for global collation On 15.03.22 08:41, Julien Rouhaud wrote: >>>> The locale object in ICU is an identifier that specifies a >>>> particular locale and has fields for language, country, and an >>>> optional code to specify further variants or subdivisions. These >>>> fields also can be represented as a string with the fields separated by an underscore. >> >> I think the Localization chapter needs to be reorganized a bit, but >> I'll leave that for a separate patch. > > WFM. I ended up writing a bit of content for that chapter. >>> While on that topic, the doc should probably mention that default >>> ICU collations can only be deterministic. >> >> Well, there is no option to do otherwise, so I'm not sure where/how >> to mention that. We usually don't document options that don't exist. >> ;-) > > Sure, but I'm afraid that users may still be tempted to use ICU > locales like > und-u-ks-level2 from the case_insensitive example in the doc and hope > that it will work accordingly. Or maybe it's just me that still sees > ICU as dark magic and want to be extra cautious. I have added this to the CREATE DATABASE ref page. > Unrelated, but I just realized that we have PGC_INTERNAL gucs for > lc_ctype and lc_collate. Should we add one for icu_locale too? I'm not sure. I think the existing ones are more for backward compatibility with the time before it was settable per database. >>> in AlterCollation(), pg_collation_actual_version(), >>> AlterDatabaseRefreshColl() and pg_database_collation_actual_version(): >>> >>> - datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull); >>> - Assert(!isnull); >>> - newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); >>> + datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull); >>> + if (!isnull) >>> + newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum)); >>> + else >>> + newversion = NULL; >>> >>> The columns are now nullable, but can you actually end up with a >>> null locale name in the expected field without manual DML on the >>> catalog, corruption or similar? I think it should be a plain error >>> explaining the inconsistency rather then silently assuming there's >>> no version. Note that at least >>> pg_newlocale_from_collation() asserts that the specific libc/icu >>> field it's interested in isn't null. >> >> This is required because the default collations's fields are null now. > > Yes I saw that, but that's a specific exception. Detecting whether > it's the DEFAULT_COLLATION_OID or not and raise an error when a null > value isn't expected seems like it could be worthwhile. I have fixed that as you suggest. > So apart from the few details mentioned above I'm happy with this patch! committed!
Attachment
On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > Thank you to all the developers. > I found that the description of the pg_database.daticulocale column was not written in the documentation. > The attached small patch adds a description of the daticulocale column to catalogs.sgml. committed, thanks
On Thu, Mar 17, 2022 at 6:15 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > committed, thanks Glad that this finally happened. Thanks to everybody involved! -- Peter Geoghegan
Hi, On 2022-03-17 14:14:52 +0100, Peter Eisentraut wrote: > On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > Thank you to all the developers. > > I found that the description of the pg_database.daticulocale column was not written in the documentation. > > The attached small patch adds a description of the daticulocale column to catalogs.sgml. > > committed, thanks Wee! That's a long time weakness addressed... Just saw a weird failure after rebasing my meson branch ontop of this. Tests passed on debian, suse, centos 8 stream, fedora rawhide (failed due to an independent reason), but not on centos 7. all runs: https://cirrus-ci.com/build/5190538184884224 centos 7: https://cirrus-ci.com/task/4786632883699712?logs=tests_world#L204 centos 7 failure: https://api.cirrus-ci.com/v1/artifact/task/4786632883699712/log/build/testrun/icu/t/010_database/log/regress_log_010_database not ok 1 - sort by database default locale # Failed test 'sort by database default locale' # at /tmp/cirrus-ci-build/src/test/icu/t/010_database.pl line 28. # got: 'a # A # b # B' # expected: 'A # a # B # b' ok 2 - sort by explicit collation standard not ok 3 - sort by explicit collation upper first # Failed test 'sort by explicit collation upper first' # at /tmp/cirrus-ci-build/src/test/icu/t/010_database.pl line 42. # got: 'a # A # b # B' # expected: 'A # a # B # b' ok 4 - ICU locale must be specified for ICU provider: exit code not 0 ok 5 - ICU locale must be specified for ICU provider: error message 1..5 This is a run building with meson. But I've now triggered builds with autoconf on centos 7 as well and that also failed. See https://cirrus-ci.com/task/6194007767252992?logs=test_world#L378 So it looks like older ICU versions don't work? Greetings, Andres Freund PS: I had not yet passed with_icu in the initdb tests for meson, that's why there's two failures with autoconf but only one with meson.
On Thu, Mar 17, 2022 at 02:14:52PM +0100, Peter Eisentraut wrote: > On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > Thank you to all the developers. > > I found that the description of the pg_database.daticulocale column was not written in the documentation. > > The attached small patch adds a description of the daticulocale column to catalogs.sgml. > > committed, thanks Thanks a lot both! Glad to finally have that feature, as soon as we'll fix the few reported problems.
Hi, On 2022-03-17 14:14:52 +0100, Peter Eisentraut wrote: > committed, thanks Just noticed that this adds a new warning when building with -O3: In file included from /home/andres/src/postgresql/src/include/catalog/pg_collation.h:22, from /home/andres/src/postgresql/src/backend/commands/dbcommands.c:39: In function ‘collprovider_name’, inlined from ‘createdb’ at /home/andres/src/postgresql/src/backend/commands/dbcommands.c:514:4: ../../../src/include/catalog/pg_collation_d.h:47:9: warning: ‘src_locprovider’ may be used uninitialized [-Wmaybe-uninitialized] 47 | switch (c) | ^~~~~~ /home/andres/src/postgresql/src/backend/commands/dbcommands.c: In function ‘createdb’: /home/andres/src/postgresql/src/backend/commands/dbcommands.c:112:25: note: ‘src_locprovider’ was declared here 112 | char src_locprovider; | ^~~~~~~~~~~~~~~ I'd fixed that for nearby variables in 3f6b3be39ca9... Gonna just NULL initialize it as well. Greetings, Andres Freund
commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places, but in pg_upgrade says <=1500, which looks wrong for upgrades from v15. I think it should say <= 1400. On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote: > --- a/src/bin/pg_dump/pg_dump.c > +++ b/src/bin/pg_dump/pg_dump.c > @@ -2792,6 +2794,10 @@ dumpDatabase(Archive *fout) > appendPQExpBuffer(dbQry, "datminmxid, "); > else > appendPQExpBuffer(dbQry, "0 AS datminmxid, "); > + if (fout->remoteVersion >= 150000) > + appendPQExpBuffer(dbQry, "datcollprovider, "); > + else > + appendPQExpBuffer(dbQry, "'c' AS datcollprovider, "); > appendPQExpBuffer(dbQry, > "(SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, " > "shobj_description(oid, 'pg_database') AS description " > diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c > index 69ef23119f..2a9ca0e389 100644 > --- a/src/bin/pg_upgrade/info.c > +++ b/src/bin/pg_upgrade/info.c > @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster) > i_spclocation; > char query[QUERY_ALLOC]; > > snprintf(query, sizeof(query), > - "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, " > + "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "); > + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > + "'c' AS datcollprovider, NULL AS daticucoll, "); > + else > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > + "datcollprovider, daticucoll, "); > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > "pg_catalog.pg_tablespace_location(t.oid) AS spclocation " > "FROM pg_catalog.pg_database d " > " LEFT OUTER JOIN pg_catalog.pg_tablespace t " > --- a/src/bin/psql/describe.c > +++ b/src/bin/psql/describe.c > @@ -896,6 +896,18 @@ listAllDbs(const char *pattern, bool verbose) > gettext_noop("Encoding"), > gettext_noop("Collate"), > gettext_noop("Ctype")); > + if (pset.sversion >= 150000) > + appendPQExpBuffer(&buf, > + " d.daticucoll as \"%s\",\n" > + " CASE d.datcollprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\n", > + gettext_noop("ICU Collation"), > + gettext_noop("Coll. Provider")); > + else > + appendPQExpBuffer(&buf, > + " d.datcollate as \"%s\",\n" > + " 'libc' AS \"%s\",\n", > + gettext_noop("ICU Collation"), > + gettext_noop("Coll. Provider")); > appendPQExpBufferStr(&buf, " "); > printACLColumn(&buf, "d.datacl"); > if (verbose) > @@ -4617,6 +4629,15 @@ listCollations(const char *pattern, bool verbose, bool showSystem) > gettext_noop("Collate"), > gettext_noop("Ctype")); > > + if (pset.sversion >= 150000) > + appendPQExpBuffer(&buf, > + ",\n c.collicucoll AS \"%s\"", > + gettext_noop("ICU Collation")); > + else > + appendPQExpBuffer(&buf, > + ",\n c.collcollate AS \"%s\"", > + gettext_noop("ICU Collation")); > + > if (pset.sversion >= 100000) > appendPQExpBuffer(&buf, > ",\n CASE c.collprovider WHEN 'd' THEN 'default' WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu'END AS \"%s\"",
Hi, On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: > commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places, > but in pg_upgrade says <=1500, which looks wrong for upgrades from v15. > I think it should say <= 1400. > > On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote: > > diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c > > index 69ef23119f..2a9ca0e389 100644 > > --- a/src/bin/pg_upgrade/info.c > > +++ b/src/bin/pg_upgrade/info.c > > @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster) > > i_spclocation; > > char query[QUERY_ALLOC]; > > > > snprintf(query, sizeof(query), > > - "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, " > > + "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "); > > + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) > > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > > + "'c' AS datcollprovider, NULL AS daticucoll, "); > > + else > > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > > + "datcollprovider, daticucoll, "); > > + snprintf(query + strlen(query), sizeof(query) - strlen(query), > > "pg_catalog.pg_tablespace_location(t.oid) AS spclocation " > > "FROM pg_catalog.pg_database d " > > " LEFT OUTER JOIN pg_catalog.pg_tablespace t " Indeed!
On Sun, Jun 26, 2022 at 11:51:24AM +0800, Julien Rouhaud wrote: > On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: >>> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> + "'c' AS datcollprovider, NULL AS daticucoll, "); >>> + else >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> + "datcollprovider, daticucoll, "); >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> "pg_catalog.pg_tablespace_location(t.oid) AS spclocation " >>> "FROM pg_catalog.pg_database d " >>> " LEFT OUTER JOIN pg_catalog.pg_tablespace t " > > Indeed! Oops. Beta2 tagging is very close by, so I think that it would be better to not take a risk on that now, and this is an issue only when upgrading from v15 where datcollprovider is ICU for a database. As things stand, someone using beta1 with this new feature, running pg_upgrade to beta2 would lose any non-libc locale provider set. -- Michael
Attachment
On 26.06.22 05:51, Julien Rouhaud wrote: > Hi, > > On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: >> commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places, >> but in pg_upgrade says <=1500, which looks wrong for upgrades from v15. >> I think it should say <= 1400. >> >> On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote: >>> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c >>> index 69ef23119f..2a9ca0e389 100644 >>> --- a/src/bin/pg_upgrade/info.c >>> +++ b/src/bin/pg_upgrade/info.c >>> @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster) >>> i_spclocation; >>> char query[QUERY_ALLOC]; >>> >>> snprintf(query, sizeof(query), >>> - "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, " >>> + "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "); >>> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) I think the fix here is to change <= to < ? >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> + "'c' AS datcollprovider, NULL AS daticucoll, "); >>> + else >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> + "datcollprovider, daticucoll, "); >>> + snprintf(query + strlen(query), sizeof(query) - strlen(query), >>> "pg_catalog.pg_tablespace_location(t.oid) AS spclocation " >>> "FROM pg_catalog.pg_database d " >>> " LEFT OUTER JOIN pg_catalog.pg_tablespace t " > > Indeed! > >
On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote: > On 26.06.22 05:51, Julien Rouhaud wrote: >> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: >>>> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) > > I think the fix here is to change <= to < ? Yes. -- Michael
Attachment
On 27.06.22 08:42, Michael Paquier wrote: > On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote: >> On 26.06.22 05:51, Julien Rouhaud wrote: >>> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: >>>>> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) >> >> I think the fix here is to change <= to < ? > > Yes. Ok, committed. (I see now that in the context of pg_upgrade, writing <= 1400 is equivalent, but I find that confusing, so I did < 1500.)
On Mon, Jun 27, 2022 at 09:10:29AM +0200, Peter Eisentraut wrote: > On 27.06.22 08:42, Michael Paquier wrote: > > On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote: > > > On 26.06.22 05:51, Julien Rouhaud wrote: > > > > On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote: > > > > > > + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) > > > > > > I think the fix here is to change <= to < ? > > > > Yes. > > Ok, committed. > > (I see now that in the context of pg_upgrade, writing <= 1400 is equivalent, > but I find that confusing, so I did < 1500.) I suggested using <= 1400 for consistency with the other code, and per bc1fbc960. But YMMV. -- Justin
Hello everyone in this thread! While reading and testing the patch that adds ICU for global collations [1] I noticed on master (1c5818b9c68e5c2ac8f19d372f24cce409de1a26) and REL_15_STABLE (63b64d8270691894a9a8f2d4e929e7780020edb8) that: 1) pg_upgrade from REL_14_STABLE (63b64d8270691894a9a8f2d4e929e7780020edb8) does not always work: For REL_14_STABLE: $ initdb -D data_old For REL_15_STABLE or master: $ initdb -D data_new --locale-provider icu --icu-locale ru-RU $ pg_upgrade -d .../data_old -D data_new -b ... -B ... ... Restoring database schemas in the new cluster template1 *failure* Consult the last few lines of "data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_dump_1.log" for the probable cause of the failure. Failure, exiting In data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_dump_1.log: pg_restore: error: could not execute query: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Command was: CREATE DATABASE "template1" WITH TEMPLATE = template0 OID = 1 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8'; In data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_server.log: TRAP: FailedAssertion("(dblocprovider == COLLPROVIDER_ICU && dbiculocale) || (dblocprovider != COLLPROVIDER_ICU && !dbiculocale)", File: "dbcommands.c", Line: 1292, PID: 69247) postgres: marina postgres [local] CREATE DATABASE(ExceptionalCondition+0xb9)[0xb4d8ec] postgres: marina postgres [local] CREATE DATABASE(createdb+0x1abc)[0x68ca99] postgres: marina postgres [local] CREATE DATABASE(standard_ProcessUtility+0x651)[0x9b1d82] postgres: marina postgres [local] CREATE DATABASE(ProcessUtility+0x122)[0x9b172a] postgres: marina postgres [local] CREATE DATABASE[0x9b01cf] postgres: marina postgres [local] CREATE DATABASE[0x9b0433] postgres: marina postgres [local] CREATE DATABASE(PortalRun+0x2fe)[0x9af95d] postgres: marina postgres [local] CREATE DATABASE[0x9a953b] postgres: marina postgres [local] CREATE DATABASE(PostgresMain+0x733)[0x9ada6b] postgres: marina postgres [local] CREATE DATABASE[0x8ec632] postgres: marina postgres [local] CREATE DATABASE[0x8ebfbb] postgres: marina postgres [local] CREATE DATABASE[0x8e8653] postgres: marina postgres [local] CREATE DATABASE(PostmasterMain+0x1226)[0x8e7f26] postgres: marina postgres [local] CREATE DATABASE[0x7bbccb] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7eff082f90b3] postgres: marina postgres [local] CREATE DATABASE(_start+0x2e)[0x49c29e] 2022-08-15 14:24:56.124 MSK [69231] LOG: server process (PID 69247) was terminated by signal 6: Aborted 2022-08-15 14:24:56.124 MSK [69231] DETAIL: Failed process was running: CREATE DATABASE "template1" WITH TEMPLATE = template0 OID = 1 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8'; 1.1) It looks like there's a bug in the function get_db_infos (src/bin/pg_upgrade/info.c), where the version of the old cluster is always checked: if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) snprintf(query + strlen(query), sizeof(query) - strlen(query), "'c' AS datlocprovider, NULL AS daticulocale, "); else snprintf(query + strlen(query), sizeof(query) - strlen(query), "datlocprovider, daticulocale, "); With the simple patch diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c index df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 100644 --- a/src/bin/pg_upgrade/info.c +++ b/src/bin/pg_upgrade/info.c @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster) snprintf(query, sizeof(query), "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "); - if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) + if (GET_MAJOR_VERSION(cluster->major_version) < 1500) snprintf(query + strlen(query), sizeof(query) - strlen(query), "'c' AS datlocprovider, NULL AS daticulocale, "); else I got the expected error during the upgrade: locale providers for database "template1" do not match: old "libc", new "icu" Failure, exiting 1.2) It looks like the mentioned asserion in dbcommands.c conflicts with the following lines earlier: if (dbiculocale == NULL) dbiculocale = src_iculocale; The following patch works for me: diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("ICU locale must be specified"))); } + else + dbiculocale = NULL; if (dblocprovider == COLLPROVIDER_ICU) check_icu_locale(dbiculocale); 2) CREATE DATABASE does not always require the icu locale unlike initdb and createdb: $ initdb -D data --locale en_US.UTF-8 --locale-provider icu ... initdb: error: ICU locale must be specified $ initdb -D data --locale en_US.UTF-8 $ pg_ctl -D data -l logfile start $ createdb mydb --locale en_US.UTF-8 --template template0 --locale-provider icu createdb: error: database creation failed: ERROR: ICU locale must be specified $ psql -c "CREATE DATABASE mydb LOCALE \"en_US.UTF-8\" TEMPLATE template0 LOCALE_PROVIDER icu" postgres CREATE DATABASE $ psql -c "CREATE DATABASE mydb TEMPLATE template0 LOCALE_PROVIDER icu" postgres ERROR: ICU locale must be specified I'm wondering if this is not a fully-supported feature (because createdb creates an SQL command with LC_COLLATE and LC_CTYPE options instead of LOCALE option) or is it a bug in CREATE DATABASE?.. From src/backend/commands/dbcommands.c: if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale) { if (dlocale && dlocale->arg) dbiculocale = defGetString(dlocale); } [1] https://github.com/postgres/postgres/commit/f2553d43060edb210b36c63187d52a632448e1d2 -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hi, On Mon, Aug 15, 2022 at 03:06:32PM +0300, Marina Polyakova wrote: > > 1.1) It looks like there's a bug in the function get_db_infos > (src/bin/pg_upgrade/info.c), where the version of the old cluster is always > checked: > > if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "'c' AS datlocprovider, NULL AS daticulocale, "); > else > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "datlocprovider, daticulocale, "); > > With the simple patch > > diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c > index df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 > 100644 > --- a/src/bin/pg_upgrade/info.c > +++ b/src/bin/pg_upgrade/info.c > @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster) > > snprintf(query, sizeof(query), > "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "); > - if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) > + if (GET_MAJOR_VERSION(cluster->major_version) < 1500) > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "'c' AS datlocprovider, NULL AS daticulocale, "); > else > > I got the expected error during the upgrade: > > locale providers for database "template1" do not match: old "libc", new > "icu" > Failure, exiting Good catch. There's unfortunately not a lot of regression tests for ICU-initialized clusters. I'm wondering if the build-farm client could be taught about the locale provider rather than assuming libc. Clearly that wouldn't have caught this issue, but it should still increase the coverage a bit (I'm thinking of the recent problem with the abbreviated keys). > 1.2) It looks like the mentioned asserion in dbcommands.c conflicts with the > following lines earlier: > > if (dbiculocale == NULL) > dbiculocale = src_iculocale; > > The following patch works for me: > > diff --git a/src/backend/commands/dbcommands.c > b/src/backend/commands/dbcommands.c > index b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9 > 100644 > --- a/src/backend/commands/dbcommands.c > +++ b/src/backend/commands/dbcommands.c > @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("ICU locale must be specified"))); > } > + else > + dbiculocale = NULL; > > if (dblocprovider == COLLPROVIDER_ICU) > check_icu_locale(dbiculocale); I think it would be better to do that in the various variable initialization. Maybe switch the dblocprovider and dbiculocale initialization, and only initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU. > 2) CREATE DATABASE does not always require the icu locale unlike initdb and > createdb: > > $ createdb mydb --locale en_US.UTF-8 --template template0 --locale-provider > icu > createdb: error: database creation failed: ERROR: ICU locale must be > specified > > $ psql -c "CREATE DATABASE mydb LOCALE \"en_US.UTF-8\" TEMPLATE template0 > LOCALE_PROVIDER icu" postgres > CREATE DATABASE > > I'm wondering if this is not a fully-supported feature (because createdb > creates an SQL command with LC_COLLATE and LC_CTYPE options instead of > LOCALE option) or is it a bug in CREATE DATABASE?.. From > src/backend/commands/dbcommands.c: > > if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale) > { > if (dlocale && dlocale->arg) > dbiculocale = defGetString(dlocale); > } This discrepancy between createdb and CREATE DATABASE looks like an oversight, as createdb indeed interprets --locale as: if (locale) { if (lc_ctype) pg_fatal("only one of --locale and --lc-ctype can be specified"); if (lc_collate) pg_fatal("only one of --locale and --lc-collate can be specified"); lc_ctype = locale; lc_collate = locale; } AFAIK the fallback in the CREATE DATABASE case is expected as POSIX locale names should be accepted by icu, so this should work for createdb too.
On 2022-08-17 19:53, Julien Rouhaud wrote: > Good catch. There's unfortunately not a lot of regression tests for > ICU-initialized clusters. I'm wondering if the build-farm client could > be > taught about the locale provider rather than assuming libc. Clearly > that > wouldn't have caught this issue, but it should still increase the > coverage a > bit (I'm thinking of the recent problem with the abbreviated keys). Looking at installchecks with different locales (e.g. see [1] with sv_SE.UTF-8) - why not?.. >> diff --git a/src/backend/commands/dbcommands.c >> b/src/backend/commands/dbcommands.c >> index >> b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9 >> 100644 >> --- a/src/backend/commands/dbcommands.c >> +++ b/src/backend/commands/dbcommands.c >> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt >> *stmt) >> (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> errmsg("ICU locale must be specified"))); >> } >> + else >> + dbiculocale = NULL; >> >> if (dblocprovider == COLLPROVIDER_ICU) >> check_icu_locale(dbiculocale); > > I think it would be better to do that in the various variable > initialization. > Maybe switch the dblocprovider and dbiculocale initialization, and only > initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU. diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index b31a30550b025d48ba3cc250dc4c15f41f9a80be..883f381f3453142790f728a3725586cebe2e2f20 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1012,10 +1012,10 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) dbcollate = src_collate; if (dbctype == NULL) dbctype = src_ctype; - if (dbiculocale == NULL) - dbiculocale = src_iculocale; if (dblocprovider == '\0') dblocprovider = src_locprovider; + if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU) + dbiculocale = src_iculocale; /* Some encodings are client only */ if (!PG_VALID_BE_ENCODING(encoding)) Then it seemed to me that it was easier to first get all the parameters from the template database as usual and then process them as needed. But with your suggestion the failed assertion will check the code above more accurately... > This discrepancy between createdb and CREATE DATABASE looks like an > oversight, > as createdb indeed interprets --locale as: > > if (locale) > { > if (lc_ctype) > pg_fatal("only one of --locale and --lc-ctype can be specified"); > if (lc_collate) > pg_fatal("only one of --locale and --lc-collate can be specified"); > lc_ctype = locale; > lc_collate = locale; > } > > AFAIK the fallback in the CREATE DATABASE case is expected as POSIX > locale > names should be accepted by icu, so this should work for createdb too. Oh, great, thanks! > > > I spent some time looking at the ICU api trying to figure out if using a > > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name. > > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the > > > same locale, but I might be wrong. I also didn't find a way to figure out how > > > to ask ICU if the locale identifier passed is complete garbage or not. One > > > sure thing is that the system collation we import are of the form 'en-us', so > > > it seems weird to have this form in pg_collation and by default another form in > > > pg_database. > > > > Yeah it seems to be inconsistent about that. The locale ID documentation > > appears to indicate that "en_US" is the canonical form, but when you ask it > > to list all the locales it knows about it returns "en-US". > > Yeah I saw that too when checking is POSIX locale names were valid, and > that's > not great. I'm sorry but IIUC pg_import_system_collations uses uloc_getAvailable to get the locale ID and then specifically calls uloc_toLanguageTag?.. > I don't think that initdb --collation-provider icu should be allowed > without > --icu-locale, same for --collation-provider libc *with* --icu-locale. > > initdb has some specific processing to transform the default libc locale to > > something more appropriate, but as far as I can see creatdb / CREATE DATABASE > > aren't doing that. It seems inconsistent, and IMHO another reason why > > defaulting to the libc locale looks like a bad idea. > > This has all been removed. The separate ICU locale option should now > be > required everywhere (initdb, createdb, CREATE DATABASE). If it's a feature and not a bug in CREATE DATABASE, why should not it work in initdb too? Here we define locale/lc_collate/lc_ctype for the first 3 databases in the cluster in much the same way... P.S. FYI there seems to be a bug for very old ICU versions: in master (92fce4e1eda9b24d73f583fbe9b58f4e03f097a4): $ initdb -D data && pg_ctl -D data -l logfile start && psql -c "CREATE DATABASE mydb LOCALE \"C.UTF-8\" LOCALE_PROVIDER icu TEMPLATE template0" postgres && psql -c "SELECT 1" mydb WARNING: database "mydb" has a collation version mismatch DETAIL: The database was created using collation version 49.192.0.42, but the operating system provides version 49.192.5.42. HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version. See the additional output (diff_log_icu_collator_locale.patch) in the logfile: 2022-08-20 11:38:30.162 MSK [136546] LOG: check_icu_locale uloc_getDefault() en_US 2022-08-20 11:38:30.162 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: check_icu_locale icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version uloc_getDefault() en_US 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.224 MSK [136548] LOG: make_icu_collator uloc_getDefault() c 2022-08-20 11:38:30.225 MSK [136548] LOG: make_icu_collator icu_locale C.UTF-8 valid_locale root version 49.192.5.42 2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version uloc_getDefault() c 2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version icu_locale C.UTF-8 valid_locale root version 49.192.5.42 2022-08-20 11:38:30.225 MSK [136548] WARNING: database "mydb" has a collation version mismatch 2022-08-20 11:38:30.225 MSK [136548] DETAIL: The database was created using collation version 49.192.0.42, but the operating system provides version 49.192.5.42. 2022-08-20 11:38:30.225 MSK [136548] HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2022-08-18%2006%3A25%3A17 -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 15.08.22 14:06, Marina Polyakova wrote: > 1.1) It looks like there's a bug in the function get_db_infos > (src/bin/pg_upgrade/info.c), where the version of the old cluster is > always checked: > > if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "'c' AS datlocprovider, NULL AS daticulocale, "); > else > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "datlocprovider, daticulocale, "); > > With the simple patch > > diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c > index > df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 > 100644 > --- a/src/bin/pg_upgrade/info.c > +++ b/src/bin/pg_upgrade/info.c > @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster) > > snprintf(query, sizeof(query), > "SELECT d.oid, d.datname, d.encoding, d.datcollate, > d.datctype, "); > - if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) > + if (GET_MAJOR_VERSION(cluster->major_version) < 1500) > snprintf(query + strlen(query), sizeof(query) - strlen(query), > "'c' AS datlocprovider, NULL AS daticulocale, "); > else fixed > 1.2) It looks like the mentioned asserion in dbcommands.c conflicts with > the following lines earlier: > > if (dbiculocale == NULL) > dbiculocale = src_iculocale; fixed > I'm wondering if this is not a fully-supported feature (because createdb > creates an SQL command with LC_COLLATE and LC_CTYPE options instead of > LOCALE option) or is it a bug in CREATE DATABASE?.. From > src/backend/commands/dbcommands.c: > > if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale) > { > if (dlocale && dlocale->arg) > dbiculocale = defGetString(dlocale); > } I think this piece of code was left over from some earlier attempts to specify the libc locale and the icu locale with one option, which never really worked well. The CREATE DATABASE man page does not mention that LOCALE provides the default for ICU_LOCALE. Hence, I think we should delete this.
On 2022-08-22 17:10, Peter Eisentraut wrote: > On 15.08.22 14:06, Marina Polyakova wrote: >> 1.1) It looks like there's a bug in the function get_db_infos >> (src/bin/pg_upgrade/info.c), where the version of the old cluster is >> always checked: >> >> if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) >> snprintf(query + strlen(query), sizeof(query) - strlen(query), >> "'c' AS datlocprovider, NULL AS daticulocale, "); >> else >> snprintf(query + strlen(query), sizeof(query) - strlen(query), >> "datlocprovider, daticulocale, "); >> >> With the simple patch >> >> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c >> index >> df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 >> 100644 >> --- a/src/bin/pg_upgrade/info.c >> +++ b/src/bin/pg_upgrade/info.c >> @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster) >> >> snprintf(query, sizeof(query), >> "SELECT d.oid, d.datname, d.encoding, d.datcollate, >> d.datctype, "); >> - if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500) >> + if (GET_MAJOR_VERSION(cluster->major_version) < 1500) >> snprintf(query + strlen(query), sizeof(query) - >> strlen(query), >> "'c' AS datlocprovider, NULL AS daticulocale, "); >> else > > fixed > >> 1.2) It looks like the mentioned asserion in dbcommands.c conflicts >> with the following lines earlier: >> >> if (dbiculocale == NULL) >> dbiculocale = src_iculocale; > > fixed > >> I'm wondering if this is not a fully-supported feature (because >> createdb creates an SQL command with LC_COLLATE and LC_CTYPE options >> instead of LOCALE option) or is it a bug in CREATE DATABASE?.. From >> src/backend/commands/dbcommands.c: >> >> if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale) >> { >> if (dlocale && dlocale->arg) >> dbiculocale = defGetString(dlocale); >> } > > I think this piece of code was left over from some earlier attempts to > specify the libc locale and the icu locale with one option, which > never really worked well. The CREATE DATABASE man page does not > mention that LOCALE provides the default for ICU_LOCALE. Hence, I > think we should delete this. Thank you! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Aug 22, 2022 at 04:10:59PM +0200, Peter Eisentraut wrote: > I think this piece of code was left over from some earlier attempts to > specify the libc locale and the icu locale with one option, which never > really worked well. The CREATE DATABASE man page does not mention that > LOCALE provides the default for ICU_LOCALE. Hence, I think we should delete > this. As of 36f729e, is there anything left to address on this thread or should this open item be closed? -- Michael
Attachment
My colleague Andrew Bille found another bug in master (b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE (2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale is not dumped. See check_icu_locale.sh: In the old cluster: SELECT collname, colliculocale FROM pg_collation WHERE collname = 'testcoll_backwards' collname | colliculocale --------------------+------------------- testcoll_backwards | @colBackwards=yes (1 row) In the new cluster: SELECT collname, colliculocale FROM pg_collation WHERE collname = 'testcoll_backwards' collname | colliculocale --------------------+--------------- testcoll_backwards | (1 row) diff_dump_colliculocale.patch works for me. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Aug 23, 2022 at 08:59:02PM +0300, Marina Polyakova wrote: > My colleague Andrew Bille found another bug in master > (b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE > (2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale is > not dumped. See check_icu_locale.sh: > > In the old cluster: > SELECT collname, colliculocale FROM pg_collation WHERE collname = > 'testcoll_backwards' > collname | colliculocale > --------------------+------------------- > testcoll_backwards | @colBackwards=yes > (1 row) > > In the new cluster: > SELECT collname, colliculocale FROM pg_collation WHERE collname = > 'testcoll_backwards' > collname | colliculocale > --------------------+--------------- > testcoll_backwards | > (1 row) > > diff_dump_colliculocale.patch works for me. Ugh. Good catch, again! I have not tested the patch in details but this looks rather sane to me on a quick read. Peter? -- Michael
Attachment
On Wed, Aug 24, 2022 at 01:38:44PM +0900, Michael Paquier wrote: > On Tue, Aug 23, 2022 at 08:59:02PM +0300, Marina Polyakova wrote: > > My colleague Andrew Bille found another bug in master > > (b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE > > (2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale is > > not dumped. See check_icu_locale.sh: > > > > In the old cluster: > > SELECT collname, colliculocale FROM pg_collation WHERE collname = > > 'testcoll_backwards' > > collname | colliculocale > > --------------------+------------------- > > testcoll_backwards | @colBackwards=yes > > (1 row) > > > > In the new cluster: > > SELECT collname, colliculocale FROM pg_collation WHERE collname = > > 'testcoll_backwards' > > collname | colliculocale > > --------------------+--------------- > > testcoll_backwards | > > (1 row) > > > > diff_dump_colliculocale.patch works for me. > > Ugh. Good catch, again! +1 > I have not tested the patch in details but > this looks rather sane to me on a quick read. Peter? Patch looks good to me too.
On 24.08.22 10:59, Julien Rouhaud wrote: >> I have not tested the patch in details but >> this looks rather sane to me on a quick read. Peter? > Patch looks good to me too. Committed, thanks. (This should conclude all the issues discussed in this thread recently.)
On Wed, Aug 24, 2022 at 08:28:24PM +0200, Peter Eisentraut wrote: > Committed, thanks. > > (This should conclude all the issues discussed in this thread recently.) Please note that this open item was still listed as open. I have closed it now. -- Michael
Attachment
Hello! IMO after adding ICU for global collations [1] the behaviour of createdb and CREATE DATABASE is a bit inconsistent when both locale and lc_collate (or locale and lc_ctype) options are used: $ createdb mydb --locale C --lc-collate C --template template0 createdb: error: only one of --locale and --lc-collate can be specified $ psql -c "create database mydb locale = 'C' lc_collate = 'C' template = 'template0'" postgres CREATE DATABASE From the CREATE DATABASE documentation [2]: locale This is a shortcut for setting LC_COLLATE and LC_CTYPE at once. If you specify this, you cannot specify either of those parameters. The patch diff_return_back_create_database_error.patch returns back the removed code for CREATE DATABASE so it behaves like createdb as before... [1] https://github.com/postgres/postgres/commit/f2553d43060edb210b36c63187d52a632448e1d2 [2] https://www.postgresql.org/docs/devel/sql-createdatabase.html -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
In pg14: |postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C; |ERROR: conflicting or redundant options |DETAIL: LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE. In pg15: |postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE "en_US.UTF-8" LOCALE "en_US.UTF-8" ; |CREATE DATABASE f2553d430 actually relaxed the restriction by removing this check: - if (dlocale && (dcollate || dctype)) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("conflicting or redundant options"), - errdetail("LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE."))); But isn't the right fix to do the corresponding thing in createdb (relaxing the frontend restriction rather than reverting its relaxation in the backend). diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c index e523e58b218..5b80e56dfd9 100644 --- a/src/bin/scripts/createdb.c +++ b/src/bin/scripts/createdb.c @@ -159,15 +159,10 @@ main(int argc, char *argv[]) exit(1); } - if (locale) - { - if (lc_ctype) - pg_fatal("only one of --locale and --lc-ctype can be specified"); - if (lc_collate) - pg_fatal("only one of --locale and --lc-collate can be specified"); + if (locale && !lc_ctype) lc_ctype = locale; + if (locale && !lc_collate) lc_collate = locale; - } if (encoding) { BTW it's somewhat crummy that it uses a string comparison, so if you write "UTF8" without a dash, it says this; it took me a few minutes to see the difference... postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE "en_US.UTF8" LOCALE "en_US.UTF8"; ERROR: new collation (en_US.UTF8) is incompatible with the collation of the template database (en_US.UTF-8)
On 2022-09-09 19:46, Justin Pryzby wrote: > In pg14: > |postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C; > |ERROR: conflicting or redundant options > |DETAIL: LOCALE cannot be specified together with LC_COLLATE or > LC_CTYPE. > > In pg15: > |postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE > "en_US.UTF-8" LOCALE "en_US.UTF-8" ; > |CREATE DATABASE > > f2553d430 actually relaxed the restriction by removing this check: > > - if (dlocale && (dcollate || dctype)) > - ereport(ERROR, > - (errcode(ERRCODE_SYNTAX_ERROR), > - errmsg("conflicting or redundant > options"), > - errdetail("LOCALE cannot be specified > together with LC_COLLATE or LC_CTYPE."))); > > But isn't the right fix to do the corresponding thing in createdb > (relaxing the frontend restriction rather than reverting its relaxation > in the backend). > > diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c > index e523e58b218..5b80e56dfd9 100644 > --- a/src/bin/scripts/createdb.c > +++ b/src/bin/scripts/createdb.c > @@ -159,15 +159,10 @@ main(int argc, char *argv[]) > exit(1); > } > > - if (locale) > - { > - if (lc_ctype) > - pg_fatal("only one of --locale and --lc-ctype can be specified"); > - if (lc_collate) > - pg_fatal("only one of --locale and --lc-collate can be specified"); > + if (locale && !lc_ctype) > lc_ctype = locale; > + if (locale && !lc_collate) > lc_collate = locale; > - } > > if (encoding) > { I agree with you that it is more comfortable and more similar to what has already been done in initdb. IMO it would be easier to do it like this: diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c index e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 100644 --- a/src/bin/scripts/createdb.c +++ b/src/bin/scripts/createdb.c @@ -161,12 +161,10 @@ main(int argc, char *argv[]) if (locale) { - if (lc_ctype) - pg_fatal("only one of --locale and --lc-ctype can be specified"); - if (lc_collate) - pg_fatal("only one of --locale and --lc-collate can be specified"); - lc_ctype = locale; - lc_collate = locale; + if (!lc_ctype) + lc_ctype = locale; + if (!lc_collate) + lc_collate = locale; } if (encoding) Should we change the behaviour of createdb and CREATE DATABASE in previous major versions?.. > BTW it's somewhat crummy that it uses a string comparison, so if you > write "UTF8" without a dash, it says this; it took me a few minutes to > see the difference... > > postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE > "en_US.UTF8" LOCALE "en_US.UTF8"; > ERROR: new collation (en_US.UTF8) is incompatible with the collation > of the template database (en_US.UTF-8) Perhaps we could check the locale itself with the function normalize_libc_locale_name (collationcmds.c). But ISTM that the current check is a safety net in case the function pg_get_encoding_from_locale (chklocale.c) returns -1 or PG_SQL_ASCII... -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 13.09.22 07:34, Marina Polyakova wrote: > I agree with you that it is more comfortable and more similar to what > has already been done in initdb. IMO it would be easier to do it like this: > > diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c > index > e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 100644 > --- a/src/bin/scripts/createdb.c > +++ b/src/bin/scripts/createdb.c > @@ -161,12 +161,10 @@ main(int argc, char *argv[]) > > if (locale) > { > - if (lc_ctype) > - pg_fatal("only one of --locale and --lc-ctype can be > specified"); > - if (lc_collate) > - pg_fatal("only one of --locale and --lc-collate can be > specified"); > - lc_ctype = locale; > - lc_collate = locale; > + if (!lc_ctype) > + lc_ctype = locale; > + if (!lc_collate) > + lc_collate = locale; > } > > if (encoding) done that way > Should we change the behaviour of createdb and CREATE DATABASE in > previous major versions?.. I don't see a need for that. >> BTW it's somewhat crummy that it uses a string comparison, so if you >> write "UTF8" without a dash, it says this; it took me a few minutes to >> see the difference... >> >> postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE >> "en_US.UTF8" LOCALE "en_US.UTF8"; >> ERROR: new collation (en_US.UTF8) is incompatible with the collation >> of the template database (en_US.UTF-8) > > Perhaps we could check the locale itself with the function > normalize_libc_locale_name (collationcmds.c). But ISTM that the current > check is a safety net in case the function pg_get_encoding_from_locale > (chklocale.c) returns -1 or PG_SQL_ASCII... This is not new behavior in PG15, is it?
On 2022-09-13 15:41, Peter Eisentraut wrote: > On 13.09.22 07:34, Marina Polyakova wrote: >> I agree with you that it is more comfortable and more similar to what >> has already been done in initdb. IMO it would be easier to do it like >> this: >> >> diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c >> index >> e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 >> 100644 >> --- a/src/bin/scripts/createdb.c >> +++ b/src/bin/scripts/createdb.c >> @@ -161,12 +161,10 @@ main(int argc, char *argv[]) >> >> if (locale) >> { >> - if (lc_ctype) >> - pg_fatal("only one of --locale and --lc-ctype can be >> specified"); >> - if (lc_collate) >> - pg_fatal("only one of --locale and --lc-collate can be >> specified"); >> - lc_ctype = locale; >> - lc_collate = locale; >> + if (!lc_ctype) >> + lc_ctype = locale; >> + if (!lc_collate) >> + lc_collate = locale; >> } >> >> if (encoding) > > done that way Thank you! >>> BTW it's somewhat crummy that it uses a string comparison, so if you >>> write "UTF8" without a dash, it says this; it took me a few minutes >>> to >>> see the difference... >>> >>> postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE >>> "en_US.UTF8" LOCALE "en_US.UTF8"; >>> ERROR: new collation (en_US.UTF8) is incompatible with the collation >>> of the template database (en_US.UTF-8) >> >> Perhaps we could check the locale itself with the function >> normalize_libc_locale_name (collationcmds.c). But ISTM that the >> current check is a safety net in case the function >> pg_get_encoding_from_locale (chklocale.c) returns -1 or >> PG_SQL_ASCII... > > This is not new behavior in PG15, is it? No, it has always existed [1] AFAICS.. [1] https://github.com/postgres/postgres/commit/61d967498802ab86d8897cb3c61740d7e9d712f6 -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello! I was surprised that it is allowed to create clusters/databases where the default ICU collations do not actually work due to unsupported encodings: $ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && psql -c "SELECT 'a' < 'b'" template1 ... waiting for server to start.... done server started ERROR: encoding "SQL_ASCII" not supported by ICU $ createdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US --template template0 mydb && psql -c "SELECT 'a' < 'b'" mydb ERROR: encoding "SQL_ASCII" not supported by ICU The patch diff_check_icu_encoding.patch prohibits the creation of such objects... -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
At Wed, 14 Sep 2022 17:19:34 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in > Hello! > > I was surprised that it is allowed to create clusters/databases where > the default ICU collations do not actually work due to unsupported > encodings: > > $ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US > -D data && > pg_ctl -D data -l logfile start && > psql -c "SELECT 'a' < 'b'" template1 > ... > waiting for server to start.... done > server started > ERROR: encoding "SQL_ASCII" not supported by ICU Indeed. If I did the following, the direction of the patch looks fine to me. If I executed initdb as follows, I would be told to specify --icu-locale option. > $ initdb --encoding sql-ascii --locale-provider icu hoge > ... > initdb: error: ICU locale must be specified However, when I reran the command, it complains about incompatible encoding this time. I think it's more user-friendly to check for the encoding compatibility before the check for missing --icu-locale option. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2022-09-15 09:52, Kyotaro Horiguchi wrote: > If I executed initdb as follows, I would be told to specify > --icu-locale option. > >> $ initdb --encoding sql-ascii --locale-provider icu hoge >> ... >> initdb: error: ICU locale must be specified > > However, when I reran the command, it complains about incompatible > encoding this time. I think it's more user-friendly to check for the > encoding compatibility before the check for missing --icu-locale > option. > > regards. I agree with you. Here's another version of the patch. The locale/encoding checks and reports in initdb have been reordered, because now the encoding is set first and only then the ICU locale is checked. P.S. While working on the patch, I discovered that UTF8 encoding is always used for the ICU provider in initdb unless it is explicitly specified by the user: if (!encoding && locale_provider == COLLPROVIDER_ICU) encodingid = PG_UTF8; IMO this creates additional errors for locales with other encodings: $ initdb --locale de_DE.iso885915@euro --locale-provider icu --icu-locale de-DE ... initdb: error: encoding mismatch initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN9) do not match. This would lead to misbehavior in various character string processing functions. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. And ICU supports many encodings, see the contents of pg_enc2icu_tbl in encnames.c... -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Sep 14, 2022 at 05:19:34PM +0300, Marina Polyakova wrote: > I was surprised that it is allowed to create clusters/databases where the > default ICU collations do not actually work due to unsupported encodings: > > $ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US -D > data && > pg_ctl -D data -l logfile start && > psql -c "SELECT 'a' < 'b'" template1 > ... > waiting for server to start.... done > server started > ERROR: encoding "SQL_ASCII" not supported by ICU > > $ createdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US > --template template0 mydb && > psql -c "SELECT 'a' < 'b'" mydb > ERROR: encoding "SQL_ASCII" not supported by ICU > > The patch diff_check_icu_encoding.patch prohibits the creation of such > objects... Agreed that it is a bit confusing to get this type of error after the database has been created when querying it due to a mix of unsupported options. Peter? -- Michael
Attachment
At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in > P.S. While working on the patch, I discovered that UTF8 encoding is > always used for the ICU provider in initdb unless it is explicitly > specified by the user: > > if (!encoding && locale_provider == COLLPROVIDER_ICU) > encodingid = PG_UTF8; > > IMO this creates additional errors for locales with other encodings: > > $ initdb --locale de_DE.iso885915@euro --locale-provider icu > --icu-locale de-DE > ... > initdb: error: encoding mismatch > initdb: detail: The encoding you selected (UTF8) and the encoding that > the selected locale uses (LATIN9) do not match. This would lead to > misbehavior in various character string processing functions. > initdb: hint: Rerun initdb and either do not specify an encoding > explicitly, or choose a matching combination. > > And ICU supports many encodings, see the contents of pg_enc2icu_tbl in > encnames.c... It seems to me the best default that fits almost all cases using icu locales. So, we need to specify encoding explicitly in that case. $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro --locale-provider icu --icu-locale de-DE However, I think it is hardly understantable from the documentation. (I checked this using euc-jp [1] so it might be wrong..) [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider icu --icu-locale ja-x-icu regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2022-09-15 09:52, Kyotaro Horiguchi wrote: > If I executed initdb as follows, I would be told to specify > --icu-locale option. > >> $ initdb --encoding sql-ascii --locale-provider icu hoge >> ... >> initdb: error: ICU locale must be specified > > However, when I reran the command, it complains about incompatible > encoding this time. I think it's more user-friendly to check for the > encoding compatibility before the check for missing --icu-locale > option. > > regards. In continuation of options check: AFAICS the following checks in initdb if (locale_provider == COLLPROVIDER_ICU) { if (!icu_locale) pg_fatal("ICU locale must be specified"); /* * In supported builds, the ICU locale ID will be checked by the * backend during post-bootstrap initialization. */ #ifndef USE_ICU pg_fatal("ICU is not supported in this build"); #endif } are executed approximately when they are executed in create database after getting all the necessary data from the template database: if (dblocprovider == COLLPROVIDER_ICU) { /* * This would happen if template0 uses the libc provider but the new * database uses icu. */ if (!dbiculocale) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("ICU locale must be specified"))); } if (dblocprovider == COLLPROVIDER_ICU) check_icu_locale(dbiculocale); But perhaps the check that --icu-locale cannot be specified unless locale provider icu is chosen should also be moved here? So all these checks will be in one place and it will use the provider from the template database (which could be icu): $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && createdb --icu-locale ru-RU --template template0 mydb ... createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2022-09-16 07:55, Kyotaro Horiguchi wrote: > At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova > <m.polyakova@postgrespro.ru> wrote in >> P.S. While working on the patch, I discovered that UTF8 encoding is >> always used for the ICU provider in initdb unless it is explicitly >> specified by the user: >> >> if (!encoding && locale_provider == COLLPROVIDER_ICU) >> encodingid = PG_UTF8; >> >> IMO this creates additional errors for locales with other encodings: >> >> $ initdb --locale de_DE.iso885915@euro --locale-provider icu >> --icu-locale de-DE >> ... >> initdb: error: encoding mismatch >> initdb: detail: The encoding you selected (UTF8) and the encoding that >> the selected locale uses (LATIN9) do not match. This would lead to >> misbehavior in various character string processing functions. >> initdb: hint: Rerun initdb and either do not specify an encoding >> explicitly, or choose a matching combination. >> >> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in >> encnames.c... > > It seems to me the best default that fits almost all cases using icu > locales. > > So, we need to specify encoding explicitly in that case. > > $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro > --locale-provider icu --icu-locale de-DE > > However, I think it is hardly understantable from the documentation. > > (I checked this using euc-jp [1] so it might be wrong..) > > [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider > icu --icu-locale ja-x-icu > > regards. Thank you! IMO it is hardly understantable from the program output either - it looks like I manually chose the encoding UTF8. Maybe first inform about selected encoding?.. diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2310,7 +2310,11 @@ setup_locale_encoding(void) } if (!encoding && locale_provider == COLLPROVIDER_ICU) + { encodingid = PG_UTF8; + printf(_("The default database encoding has been set to \"%s\" for a better experience with the ICU provider.\n"), + pg_encoding_to_char(encodingid)); + } else if (!encoding) { int ctype_enc; ISTM that such choices (e.g. UTF8 for Windows in some cases) are described in the documentation [1] as By default, initdb uses the locale provider libc, takes the locale settings from the environment, and determines the encoding from the locale settings. This is almost always sufficient, unless there are special requirements. [1] https://www.postgresql.org/docs/devel/app-initdb.html -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 15.09.22 17:41, Marina Polyakova wrote: > I agree with you. Here's another version of the patch. The > locale/encoding checks and reports in initdb have been reordered, > because now the encoding is set first and only then the ICU locale is > checked. I committed something based on the first version of your patch. This reordering of the messages here was a little too much surgery for me at this point. For instance, there are also messages in #ifdef WIN32 code that would need to be reordered as well. I kept the overall structure of the code the same and just inserted the additional proposed checks. If you want to pursue the reordering of the checks and messages overall, a patch for the master branch could be considered.
On 16.09.22 09:31, Marina Polyakova wrote: > IMO it is hardly understantable from the program output either - it > looks like I manually chose the encoding UTF8. Maybe first inform about > selected encoding?.. Yes, I included something like that in the patch just committed. > diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c > index > 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644 > --- a/src/bin/initdb/initdb.c > +++ b/src/bin/initdb/initdb.c > @@ -2310,7 +2310,11 @@ setup_locale_encoding(void) > } > > if (!encoding && locale_provider == COLLPROVIDER_ICU) > + { > encodingid = PG_UTF8; > + printf(_("The default database encoding has been set to \"%s\" > for a better experience with the ICU provider.\n"), > + pg_encoding_to_char(encodingid)); > + } > else if (!encoding) > { > int ctype_enc;
At Fri, 16 Sep 2022 09:49:28 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in > On 2022-09-15 09:52, Kyotaro Horiguchi wrote: > > However, when I reran the command, it complains about incompatible > > encoding this time. I think it's more user-friendly to check for the > > encoding compatibility before the check for missing --icu-locale > > option. > > regards. > > In continuation of options check: AFAICS the following checks in > initdb > > if (locale_provider == COLLPROVIDER_ICU) > { > if (!icu_locale) > pg_fatal("ICU locale must be specified"); > > /* > * In supported builds, the ICU locale ID will be checked by the > * backend during post-bootstrap initialization. > */ > #ifndef USE_ICU > pg_fatal("ICU is not supported in this build"); > #endif > } > > are executed approximately when they are executed in create database > after getting all the necessary data from the template database: initdb doesn't work that way, but anyway, I realized that I am proposing to move that code in setlocales() to the caller function as the result. I don't think setlocales() is the place for the code because icu locale has no business with what the function does. That being said there's no obvious reason we *need* to move the code out to its caller. > if (dblocprovider == COLLPROVIDER_ICU) > { > /* > * This would happen if template0 uses the libc provider but the new > * database uses icu. > */ > if (!dbiculocale) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("ICU locale must be specified"))); > } > > if (dblocprovider == COLLPROVIDER_ICU) > check_icu_locale(dbiculocale); > > But perhaps the check that --icu-locale cannot be specified unless > locale provider icu is chosen should also be moved here? So all these > checks will be in one place and it will use the provider from the > template database (which could be icu): > > $ initdb --locale-provider icu --icu-locale en-US -D data && > pg_ctl -D data -l logfile start && > createdb --icu-locale ru-RU --template template0 mydb > ... > createdb: error: database creation failed: ERROR: ICU locale cannot be > specified unless locale provider is ICU And, I realized that this causes bigger churn than I thought. So, I'm sorry but I withdraw the comment. Thus the first proposed patch will be more or less the direction we would go. And the patch looks good to me as a whole. + errmsg("encoding \"%s\" is not supported with ICU provider", + pg_log_error("encoding \"%s\" is not supported with ICU provider", + pg_encoding_to_char(encodingid)); I might be wrong, but the messages look wrong to me. The alternatives below might work. "encoding \"%s\" is not supported by ICU" "encoding \"%s\" cannot be used for/with ICU locales" + pg_log_error_hint("Rerun %s and choose a matching combination.", + progname); This doesn't seem to provide users with useful information. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 16 Sep 2022 09:56:31 +0200, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote in > On 15.09.22 17:41, Marina Polyakova wrote: > > I agree with you. Here's another version of the patch. The > > locale/encoding checks and reports in initdb have been reordered, > > because now the encoding is set first and only then the ICU locale is > > checked. > > I committed something based on the first version of your patch. This > reordering of the messages here was a little too much surgery for me > at this point. For instance, there are also messages in #ifdef WIN32 > code that would need to be reordered as well. I kept the overall > structure of the code the same and just inserted the additional > proposed checks. Yeah, as I sent just before, I reached the same conclusion. > If you want to pursue the reordering of the checks and messages > overall, a patch for the master branch could be considered. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 16.09.22 08:49, Marina Polyakova wrote: > But perhaps the check that --icu-locale cannot be specified unless > locale provider icu is chosen should also be moved here? So all these > checks will be in one place and it will use the provider from the > template database (which could be icu): > > $ initdb --locale-provider icu --icu-locale en-US -D data && > pg_ctl -D data -l logfile start && > createdb --icu-locale ru-RU --template template0 mydb > ... > createdb: error: database creation failed: ERROR: ICU locale cannot be > specified unless locale provider is ICU Can you be more specific about what you are proposing here? I'm not following.
On 2022-09-16 10:57, Peter Eisentraut wrote: > On 16.09.22 09:31, Marina Polyakova wrote: >> IMO it is hardly understantable from the program output either - it >> looks like I manually chose the encoding UTF8. Maybe first inform >> about selected encoding?.. > > Yes, I included something like that in the patch just committed. > >> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c >> index >> 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c >> 100644 >> --- a/src/bin/initdb/initdb.c >> +++ b/src/bin/initdb/initdb.c >> @@ -2310,7 +2310,11 @@ setup_locale_encoding(void) >> } >> >> if (!encoding && locale_provider == COLLPROVIDER_ICU) >> + { >> encodingid = PG_UTF8; >> + printf(_("The default database encoding has been set to >> \"%s\" for a better experience with the ICU provider.\n"), >> + pg_encoding_to_char(encodingid)); >> + } >> else if (!encoding) >> { >> int ctype_enc; Thank you! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2022-09-16 11:11, Kyotaro Horiguchi wrote: > At Fri, 16 Sep 2022 09:49:28 +0300, Marina Polyakova > <m.polyakova@postgrespro.ru> wrote in >> In continuation of options check: AFAICS the following checks in >> initdb >> >> if (locale_provider == COLLPROVIDER_ICU) >> { >> if (!icu_locale) >> pg_fatal("ICU locale must be specified"); >> >> /* >> * In supported builds, the ICU locale ID will be checked by the >> * backend during post-bootstrap initialization. >> */ >> #ifndef USE_ICU >> pg_fatal("ICU is not supported in this build"); >> #endif >> } >> >> are executed approximately when they are executed in create database >> after getting all the necessary data from the template database: > > initdb doesn't work that way, but anyway, I realized that I am > proposing to move that code in setlocales() to the caller function as > the result. I don't think setlocales() is the place for the code > because icu locale has no business with what the function does. That > being said there's no obvious reason we *need* to move the code out to > its caller. Excuse me, but could you explain your last sentence in more detail? I read that this code is not for setlocales and then - that it should not moved from here, so I'm confused... > + errmsg("encoding \"%s\" is not supported with ICU provider", > > + pg_log_error("encoding \"%s\" is not supported with ICU provider", > + pg_encoding_to_char(encodingid)); > > I might be wrong, but the messages look wrong to me. The alternatives > below might work. > > "encoding \"%s\" is not supported by ICU" > "encoding \"%s\" cannot be used for/with ICU locales" The message indicates that the selected encoding cannot be used with the ICU provider because it does not support it. But if the text of the message becomes better and clearer, I will only be glad. > + pg_log_error_hint("Rerun %s and choose a matching combination.", > + progname); > > This doesn't seem to provide users with useful information. It was commited in more verbose form: pg_log_error_hint("Rerun %s and either do not specify an encoding explicitly, " "or choose a matching combination.", -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Thanks to Kyotaro Horiguchi review we found out that there're interesting cases due to the order of some ICU checks: 1. ICU locale vs supported encoding: 1.1. On 2022-09-15 09:52, Kyotaro Horiguchi wrote: > If I executed initdb as follows, I would be told to specify > --icu-locale option. > >> $ initdb --encoding sql-ascii --locale-provider icu hoge >> ... >> initdb: error: ICU locale must be specified > > However, when I reran the command, it complains about incompatible > encoding this time. I think it's more user-friendly to check for the > encoding compatibility before the check for missing --icu-locale > option. 1.2. (ok?) $ initdb --encoding sql-ascii --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge ... initdb: error: encoding mismatch initdb: detail: The encoding you selected (SQL_ASCII) is not supported with the ICU provider. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. $ createdb --encoding sql-ascii --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU $ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge createdb: error: database creation failed: ERROR: encoding "SQL_ASCII" is not supported with ICU provider 2. For builds without ICU: 2.1. $ initdb --locale-provider icu hoge ... initdb: error: ICU locale must be specified $ initdb --locale-provider icu --icu-locale en-US hoge ... initdb: error: ICU is not supported in this build $ createdb --locale-provider icu hoge createdb: error: database creation failed: ERROR: ICU locale must be specified $ createdb --locale-provider icu --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build IMO, it would be more user-friendly to inform an unsupported build in the first runs too.. 2.2. (ok?) $ initdb --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ initdb --icu-locale en-US --locale-provider icu hoge ... initdb: error: ICU is not supported in this build $ createdb --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU $ createdb --icu-locale en-US --locale-provider icu hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build 2.3. $ createdb --locale-provider icu --icu-locale en-US --encoding sql-ascii hoge createdb: error: database creation failed: ERROR: encoding "SQL_ASCII" is not supported with ICU provider $ createdb --locale-provider icu --icu-locale en-US --encoding utf8 hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build IMO, it would be more user-friendly to inform an unsupported build in the first run too.. 3. The locale provider is ICU, but it has not yet been set from the template database: > $ initdb --locale-provider icu --icu-locale en-US -D data && > pg_ctl -D data -l logfile start && > createdb --icu-locale ru-RU --template template0 mydb > ... > createdb: error: database creation failed: ERROR: ICU locale cannot be > specified unless locale provider is ICU -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2022-09-16 10:56, Peter Eisentraut wrote: > On 15.09.22 17:41, Marina Polyakova wrote: >> I agree with you. Here's another version of the patch. The >> locale/encoding checks and reports in initdb have been reordered, >> because now the encoding is set first and only then the ICU locale is >> checked. > > I committed something based on the first version of your patch. This > reordering of the messages here was a little too much surgery for me > at this point. For instance, there are also messages in #ifdef WIN32 > code that would need to be reordered as well. I kept the overall > structure of the code the same and just inserted the additional > proposed checks. > > If you want to pursue the reordering of the checks and messages > overall, a patch for the master branch could be considered. Thank you! I already wrote about the order of the ICU checks in initdb/create database, they were the only reason to propose such changes... -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 17.09.22 10:33, Marina Polyakova wrote: > Thanks to Kyotaro Horiguchi review we found out that there're > interesting cases due to the order of some ICU checks: > > 1. ICU locale vs supported encoding: > > 1.1. > > On 2022-09-15 09:52, Kyotaro Horiguchi wrote: >> If I executed initdb as follows, I would be told to specify >> --icu-locale option. >> >>> $ initdb --encoding sql-ascii --locale-provider icu hoge >>> ... >>> initdb: error: ICU locale must be specified >> >> However, when I reran the command, it complains about incompatible >> encoding this time. I think it's more user-friendly to check for the >> encoding compatibility before the check for missing --icu-locale >> option. This a valid point, but it would require quite a bit of work to move all those checks around and re-verify the result, so I don't want to do it in PG15. > 1.2. (ok?) > > $ initdb --encoding sql-ascii --icu-locale en-US hoge > initdb: error: --icu-locale cannot be specified unless locale provider > "icu" is chosen > > $ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge > ... > initdb: error: encoding mismatch > initdb: detail: The encoding you selected (SQL_ASCII) is not supported > with the ICU provider. > initdb: hint: Rerun initdb and either do not specify an encoding > explicitly, or choose a matching combination. > > $ createdb --encoding sql-ascii --icu-locale en-US hoge > createdb: error: database creation failed: ERROR: ICU locale cannot be > specified unless locale provider is ICU > $ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu > hoge > createdb: error: database creation failed: ERROR: encoding "SQL_ASCII" > is not supported with ICU provider I don't see a problem here. > 2. For builds without ICU: > > 2.1. > > $ initdb --locale-provider icu hoge > ... > initdb: error: ICU locale must be specified > > $ initdb --locale-provider icu --icu-locale en-US hoge > ... > initdb: error: ICU is not supported in this build > > $ createdb --locale-provider icu hoge > createdb: error: database creation failed: ERROR: ICU locale must be > specified > > $ createdb --locale-provider icu --icu-locale en-US hoge > createdb: error: database creation failed: ERROR: ICU is not supported > in this build > > IMO, it would be more user-friendly to inform an unsupported build in > the first runs too.. Again, this would require reorganizing a bunch of code to get some cosmetic benefit, which isn't a good idea now for PG15. > 2.2. (ok?) > 2.3. same here > 3. > > The locale provider is ICU, but it has not yet been set from the > template database: > >> $ initdb --locale-provider icu --icu-locale en-US -D data && >> pg_ctl -D data -l logfile start && >> createdb --icu-locale ru-RU --template template0 mydb >> ... >> createdb: error: database creation failed: ERROR: ICU locale cannot be >> specified unless locale provider is ICU Please see attached patch for a fix. Does that work for you?
Attachment
On 2022-09-20 12:59, Peter Eisentraut wrote: > On 17.09.22 10:33, Marina Polyakova wrote: >> 3. >> >> The locale provider is ICU, but it has not yet been set from the >> template database: >> >>> $ initdb --locale-provider icu --icu-locale en-US -D data && >>> pg_ctl -D data -l logfile start && >>> createdb --icu-locale ru-RU --template template0 mydb >>> ... >>> createdb: error: database creation failed: ERROR: ICU locale cannot >>> be >>> specified unless locale provider is ICU > > Please see attached patch for a fix. Does that work for you? Yes, it works. The following test checks this fix: diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl index b87d8fc63b5246b02bcd4499aae815269b60df7c..c2464a99618cd7ca5616cc21121e1e4379b52baf 100644 --- a/src/bin/scripts/t/020_createdb.pl +++ b/src/bin/scripts/t/020_createdb.pl @@ -71,6 +71,14 @@ if ($ENV{with_icu} eq 'yes') $node2->command_ok( [ 'createdb', '-T', 'template0', '--locale-provider=libc', 'foobar55' ], 'create database with libc provider from template database with icu provider'); + + $node2->command_ok( + [ + 'createdb', '-T', 'template0', '--icu-locale', + 'en-US', 'foobar56' + ], + 'create database with icu locale from template database with icu provider' + ); } else { -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 21.09.22 08:50, Marina Polyakova wrote: > On 2022-09-20 12:59, Peter Eisentraut wrote: >> On 17.09.22 10:33, Marina Polyakova wrote: >>> 3. >>> >>> The locale provider is ICU, but it has not yet been set from the >>> template database: >>> >>>> $ initdb --locale-provider icu --icu-locale en-US -D data && >>>> pg_ctl -D data -l logfile start && >>>> createdb --icu-locale ru-RU --template template0 mydb >>>> ... >>>> createdb: error: database creation failed: ERROR: ICU locale cannot be >>>> specified unless locale provider is ICU >> >> Please see attached patch for a fix. Does that work for you? > > Yes, it works. The following test checks this fix: Committed with that test, thanks. I think that covers all the ICU issues you reported for PG15 for now?
On 2022-09-21 17:53, Peter Eisentraut wrote: > Committed with that test, thanks. I think that covers all the ICU > issues you reported for PG15 for now? I thought about the order of the ICU checks - if it is ok to check that the selected encoding is supported by ICU after printing all the locale & encoding information, why not to move almost all the ICU checks here?.. Examples of the work of the attached patch: 1. ICU locale vs supported encoding: 1.1. $ initdb --encoding sql-ascii --locale-provider icu hoge ... initdb: error: encoding mismatch initdb: detail: The encoding you selected (SQL_ASCII) is not supported with the ICU provider. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. 1.2. (like before) $ initdb --encoding sql-ascii --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ createdb --encoding sql-ascii --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU 2. For builds without ICU: 2.1. $ initdb --locale-provider icu hoge ... initdb: error: ICU is not supported in this build $ createdb --locale-provider icu hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build 2.2. (like before) $ initdb --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ createdb --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU 2.3. $ createdb --locale-provider icu --icu-locale en-US --encoding sql-ascii hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build 4. About errors in initdb: 4.1. If icu_locale is not specified, but it is required, then we get this: $ initdb --locale-provider icu hoge The files belonging to this database system will be owned by user "marina". This user must also own the server process. The database cluster will be initialized with this locale configuration: provider: icu LC_COLLATE: en_US.UTF-8 LC_CTYPE: en_US.UTF-8 LC_MESSAGES: en_US.UTF-8 LC_MONETARY: ru_RU.UTF-8 LC_NUMERIC: ru_RU.UTF-8 LC_TIME: ru_RU.UTF-8 The default database encoding has been set to "UTF8". initdb: error: ICU locale must be specified Almost the same if ICU is not supported in this build: $ initdb --locale-provider icu hoge The files belonging to this database system will be owned by user "marina". This user must also own the server process. The database cluster will be initialized with this locale configuration: provider: icu LC_COLLATE: en_US.UTF-8 LC_CTYPE: en_US.UTF-8 LC_MESSAGES: en_US.UTF-8 LC_MONETARY: ru_RU.UTF-8 LC_NUMERIC: ru_RU.UTF-8 LC_TIME: ru_RU.UTF-8 The default database encoding has been set to "UTF8". initdb: error: ICU is not supported in this build 4.2. If icu_locale is specified for the wrong provider, the error will be at the beginning of the program start as before: $ initdb --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 22.09.22 20:06, Marina Polyakova wrote: > On 2022-09-21 17:53, Peter Eisentraut wrote: >> Committed with that test, thanks. I think that covers all the ICU >> issues you reported for PG15 for now? > > I thought about the order of the ICU checks - if it is ok to check that > the selected encoding is supported by ICU after printing all the locale > & encoding information, why not to move almost all the ICU checks here?.. It's possible that we can do better, but I'm not going to add things like that to PG 15 at this point unless it fixes a faulty behavior.
On 2022-10-01 15:07, Peter Eisentraut wrote: > On 22.09.22 20:06, Marina Polyakova wrote: >> On 2022-09-21 17:53, Peter Eisentraut wrote: >>> Committed with that test, thanks. I think that covers all the ICU >>> issues you reported for PG15 for now? >> >> I thought about the order of the ICU checks - if it is ok to check >> that the selected encoding is supported by ICU after printing all the >> locale & encoding information, why not to move almost all the ICU >> checks here?.. > > It's possible that we can do better, but I'm not going to add things > like that to PG 15 at this point unless it fixes a faulty behavior. Will PG 15 always have this order of ICU checks, is the current behaviour correct enough? On the other hand, there may be a better fix for PG 16+ and not all changes can be backported... On 2022-09-16 10:56, Peter Eisentraut wrote: > On 15.09.22 17:41, Marina Polyakova wrote: >> I agree with you. Here's another version of the patch. The >> locale/encoding checks and reports in initdb have been reordered, >> because now the encoding is set first and only then the ICU locale is >> checked. > > I committed something based on the first version of your patch. This > reordering of the messages here was a little too much surgery for me > at this point. For instance, there are also messages in #ifdef WIN32 > code that would need to be reordered as well. I kept the overall > structure of the code the same and just inserted the additional > proposed checks. > > If you want to pursue the reordering of the checks and messages > overall, a patch for the master branch could be considered. I've worked on this again (see attached patch) but I'm not sure if the messages of encoding mismatches are clear enough without the full locale information. For $ initdb -D data --icu-locale en --locale-provider icu compare the outputs: The database cluster will be initialized with this locale configuration: provider: icu ICU locale: en LC_COLLATE: de_DE.iso885915@euro LC_CTYPE: de_DE.iso885915@euro LC_MESSAGES: en_US.utf8 LC_MONETARY: de_DE.iso885915@euro LC_NUMERIC: de_DE.iso885915@euro LC_TIME: de_DE.iso885915@euro The default database encoding has been set to "UTF8". initdb: error: encoding mismatch initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN9) do not match. This would lead to misbehavior in various character string processing functions. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. and Encoding "UTF8" implied by locale will be set as the default database encoding. initdb: error: encoding mismatch initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN9) do not match. This would lead to misbehavior in various character string processing functions. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. The same without ICU, e.g. for $ initdb -D data the output with locale information: The database cluster will be initialized with this locale configuration: provider: libc LC_COLLATE: en_US.utf8 LC_CTYPE: de_DE.iso885915@euro LC_MESSAGES: en_US.utf8 LC_MONETARY: de_DE.iso885915@euro LC_NUMERIC: de_DE.iso885915@euro LC_TIME: de_DE.iso885915@euro The default database encoding has accordingly been set to "LATIN9". initdb: error: encoding mismatch initdb: detail: The encoding you selected (LATIN9) and the encoding that the selected locale uses (UTF8) do not match. This would lead to misbehavior in various character string processing functions. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. and the "shorter" version: Encoding "LATIN9" implied by locale will be set as the default database encoding. initdb: error: encoding mismatch initdb: detail: The encoding you selected (LATIN9) and the encoding that the selected locale uses (UTF8) do not match. This would lead to misbehavior in various character string processing functions. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. BTW, what did you mean that "there are also messages in #ifdef WIN32 code that would need to be reordered as well"?.. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hello! I discovered an interesting behaviour during installcheck runs when the cluster was initialized with ICU locale provider: $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start 1) The ECPG tests fail because they use the SQL_ASCII encoding [1], the database template0 uses the ICU locale provider and SQL_ASCII is not supported by ICU: $ make -C src/interfaces/ecpg/ installcheck ... ============== creating database "ecpg1_regression" ============== ERROR: encoding "SQL_ASCII" is not supported with ICU provider ERROR: database "ecpg1_regression" does not exist command failed: "/home/marina/postgresql/master/my/inst/bin/psql" -X -c "CREATE DATABASE \"ecpg1_regression\" TEMPLATE=template0 ENCODING='SQL_ASCII'" -c "ALTER DATABASE \"ecpg1_regression\" SET lc_messages TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_monetary TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_numeric TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_time TO 'C';ALTER DATABASE \"ecpg1_regression\" SET bytea_output TO 'hex';ALTER DATABASE \"ecpg1_regression\" SET timezone_abbreviations TO 'Default';" "postgres" 2) The option --no-locale in pg_regress is described as "use C locale" [2]. But in this case the created databases actually use the ICU locale provider with the ICU cluster locale from template0 (see diff_check_backend_used_provider.patch): $ make NO_LOCALE=1 installcheck In regression.diffs: diff -U3 /home/marina/postgresql/master/src/test/regress/expected/test_setup.out /home/marina/postgresql/master/src/test/regress/results/test_setup.out --- /home/marina/postgresql/master/src/test/regress/expected/test_setup.out 2022-09-27 05:31:27.674628815 +0300 +++ /home/marina/postgresql/master/src/test/regress/results/test_setup.out 2022-10-21 15:09:31.232992885 +0300 @@ -143,6 +143,798 @@ \set filename :abs_srcdir '/data/person.data' COPY person FROM :'filename'; VACUUM ANALYZE person; +NOTICE: varstrfastcmp_locale sss->collate_c 0 sss->locale 0xefacd0 +NOTICE: varstrfastcmp_locale sss->locale->provider i +NOTICE: varstrfastcmp_locale sss->locale->info.icu.locale en-US ... The patch diff_fix_pg_regress_create_database.patch fixes both issues for me. [1] https://github.com/postgres/postgres/blob/ce20f8b9f4354b46b40fd6ebf7ce5c37d08747e0/src/interfaces/ecpg/test/Makefile#L18 [2] https://github.com/postgres/postgres/blob/ce20f8b9f4354b46b40fd6ebf7ce5c37d08747e0/src/test/regress/pg_regress.c#L1992 -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Fri, Oct 21, 2022 at 05:32:38PM +0300, Marina Polyakova wrote: > Hello! > > I discovered an interesting behaviour during installcheck runs when the > cluster was initialized with ICU locale provider: > > $ initdb --locale-provider icu --icu-locale en-US -D data && > 2) The option --no-locale in pg_regress is described as "use C locale" [2]. > But in this case the created databases actually use the ICU locale provider > with the ICU cluster locale from template0 (see > diff_check_backend_used_provider.patch): > > $ make NO_LOCALE=1 installcheck Yes, this looks wrong on the ground on what -no-locale is expected to do, aka use a C locale. Peter? -- Michael