Thread: ICU for global collation

ICU for global collation

From

Peter Eisentraut

Date:

20 August 2019, 14:21:21

Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

    initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
a ICU locale.  If that doesn't work out, there is also a way to specify
it separately, e.g.,

    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

This complexity is unfortunate, but I don't see a way around it right now.

There are also options for createdb and CREATE DATABASE to do this for a
particular database only.

Besides this, the implementation is quite small: When starting up a
database, we create an ICU collator object, store it in a global
variable, and then use it when appropriate.  All the ICU code for
creating and invoking those collators already exists of course.

For the version tracking, I use the pg_collation row for the "default"
collation.  Again, this mostly reuses existing code and concepts.

Nondeterministic collations are not supported for the global collation,
because then LIKE and regular expressions don't work and that breaks
some system views.  This needs some separate research.

To test, run the existing regression tests against a database
initialized with ICU.  Perhaps some options for pg_regress could
facilitate that.

I fear that the Localization chapter in the documentation will need a
bit of a rewrite after this, because the hitherto separately treated
concepts of locale and collation are fusing together.  I haven't done
that here yet, but that would be the plan for later.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v1-0001-Add-option-to-use-ICU-as-global-collation-provide.patch

Re: ICU for global collation

From

Andrey Borodin

Date:

21 August 2019, 06:56:03

Hi!


> 20 авг. 2019 г., в 19:21, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
>
> Here is an initial patch to add the option to use ICU as the global
> collation provider, a long-requested feature.
>
> To activate, use something like
>
>    initdb --collation-provider=icu --locale=...
>
> A trick here is that since we need to also still set the normal POSIX
> locales, the --locale value needs to be valid as both a POSIX locale and
> a ICU locale.  If that doesn't work out, there is also a way to specify
> it separately, e.g.,
>
>    initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

Thanks! This is very awaited feature.

Seems like user cannot change locale for database if icu is already chosen?

postgres=# \l
                               List of databases
   Name    | Owner | Encoding | Collate | Ctype | Provider | Access privileges
-----------+-------+----------+---------+-------+----------+-------------------
 postgres  | x4mmm | UTF8     | ru_RU   | ru_RU | icu      |
 template0 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
 template1 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
(3 rows)

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept

Am I missing something?

BTW, psql does not know about collation_provider.

Best regards, Andrey Borodin.

Re: ICU for global collation

From

Peter Eisentraut

Date:

21 August 2019, 07:23:50

On 2019-08-21 08:56, Andrey Borodin wrote:
> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
> CREATE DATABASE
> postgres=# \c a
> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by
ICU
> FATAL:  collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
valid ICU locale name.  Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ICU for global collation

From

Andrey Borodin

Date:

11 September 2019, 05:24:37

> 21 авг. 2019 г., в 12:23, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> написал(а):
>
> On 2019-08-21 08:56, Andrey Borodin wrote:
>> postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
>> CREATE DATABASE
>> postgres=# \c a
>> 2019-08-21 11:43:40.379 +05 [41509] FATAL:  collations with different collate and ctype values are not supported by
ICU
>> FATAL:  collations with different collate and ctype values are not supported by ICU
>
> Try
>
> create database a template template0 collation_provider icu locale
> 'en_US.utf8';
>
> which sets both lc_collate and lc_ctype.  But 'en_US.utf8' is not a
> valid ICU locale name.  Perhaps use 'en' or 'en-US'.
>
> I'm making a note that we should prevent creating a database with a
> faulty locale configuration in the first place instead of failing when
> we're connecting.

Yes, the problem is input with lc_collate is accepted
postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-09-11 10:01:00.373 +05 [56878] FATAL:  collations with different collate and ctype values are not supported by ICU
FATAL:  collations with different collate and ctype values are not supported by ICU
Previous connection kept
postgres=# create database b template template0 collation_provider icu locale 'en_US.utf8';
CREATE DATABASE
postgres=# \c b
You are now connected to database "b" as user "x4mmm".

I get same output with 'en' or 'en-US'.

Also, cluster initialized --with-icu started on binaries without icu just fine.
And only after some time, I've got that messages "ERROR:  ICU is not supported in this build".
Is it expected behavior? Maybe we should refuse to start without icu?

Best regards, Andrey Borodin.

Re: ICU for global collation

From

"Daniel Verite"

Date:

17 September 2019, 13:08:36

 Hi,

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

That's a problem for the backend code that depends on libc functions
that themselves depend on LC_CTYPE, such as the full text search parser
and dictionaries.

For instance, if you start the instance with a C locale
(LC_ALL=C pg_ctl...) , and tries to use FTS in an ICU UTF-8 database,
it doesn't work:

template1=# create database "fr-utf8"
  template 'template0' encoding UTF8
  locale 'fr'
  collation_provider 'icu';

template1=# \c fr-utf8
You are now connected to database "fr-utf8" as user "daniel".

fr-utf8=# show lc_ctype;
 lc_ctype
----------
 fr
(1 row)

fr-utf8=# select to_tsvector('été');
ERROR:    invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

If I peek into the "real" LC_CTYPE when connected to this database,
I can see it's "C":

fr-utf8=# create extension plperl;
CREATE EXTENSION

fr-utf8=# create function lc_ctype() returns text as '$ENV{LC_CTYPE};'
  language plperl;
CREATE FUNCTION

fr-utf8=# select lc_ctype();
 lc_ctype
----------
 C


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Marius Timmer

Date:

08 October 2019, 11:15:57

Hi everyone,

like the others before me we (the university of Münster) are happy to
see this feature as well. Thank you this.

When I applied the patch two weeks ago I run into the issue that initdb
did not recognize the new parameters (collation-provider and icu-locale)
but I guess it was caused by my own stupidity.

> When trying databases defined with ICU locales, I see that backends
> that serve such databases seem to have their LC_CTYPE inherited from
> the environment (as opposed to a per-database fixed value).
I am able to recreate the issue described by Daniel on my machine.

Now it works as expected. I just had to update the patch since commit
3f6b3be3 had modified two lines which resulted in conflicts. You find
the updated patch as attachement to this mail.


Best regards,

Marius Timmer




--
Westfälische Wilhelms-Universität Münster (WWU)
Zentrum für Informationsverarbeitung (ZIV)
Röntgenstraße 7-13
Besucheradresse: Einsteinstraße 60 - Raum 107
48149 Münster
+49 251 83 31158
marius.timmer@uni-muenster.de
https://www.uni-muenster.de/ZIV

Attachment

Re: ICU for global collation

From

Thomas Munro

Date:

17 October 2019, 02:52:00

On Wed, Oct 9, 2019 at 12:16 AM Marius Timmer
<marius.timmer@uni-muenster.de> wrote:
> like the others before me we (the university of Münster) are happy to
> see this feature as well. Thank you this.
>
> When I applied the patch two weeks ago I run into the issue that initdb
> did not recognize the new parameters (collation-provider and icu-locale)
> but I guess it was caused by my own stupidity.
>
> > When trying databases defined with ICU locales, I see that backends
> > that serve such databases seem to have their LC_CTYPE inherited from
> > the environment (as opposed to a per-database fixed value).
> I am able to recreate the issue described by Daniel on my machine.
>
> Now it works as expected. I just had to update the patch since commit
> 3f6b3be3 had modified two lines which resulted in conflicts. You find
> the updated patch as attachement to this mail.

I rebased this patch, and tweaked get_collation_action_version() very
slightly so that you get collation version change detection (of the
ersatz kind provided by commit d5ac14f9) for the default collation
even when not using ICU.  Please see attached.

+struct pg_locale_struct global_locale;

Why not "default_locale"?  Where is the terminology "global" coming from?

+        Specifies the collation provider for the database.

"for the database's default collation"?

Attachment

v2-0001-Add-option-to-use-ICU-as-global-collation-provide.patch

Re: ICU for global collation

From

Thomas Munro

Date:

17 October 2019, 21:43:06

On Thu, Oct 17, 2019 at 3:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I rebased this patch, and tweaked get_collation_action_version() very
> slightly so that you get collation version change detection (of the
> ersatz kind provided by commit d5ac14f9) for the default collation
> even when not using ICU.  Please see attached.

It should also remove the sentence I recently added to
alter_collation.sgml to say that the default collation doesn't have
version tracking.  Rereading that section, it's also clear that the
query introduced with:

   "The following query can be used to identify all collations in the current
   database that need to be refreshed and the objects that depend on them:"

… is wrong with this patch applied.  The right query is quite hard to
come up with, since we don't explicitly track dependencies on the
default collation.  That is, there is no pg_depend entry pointing from
the index to the collation when you write CREATE INDEX ON t(x) for a
text column using the default collation, but there is one when you
write CREATE INDEX ON t(x COLLATE "fr_FR"), or when you write CREATE
INDEX ON t(x) for a text column that was explicitly defined to use
COLLATE "fr_FR".  One solution is that we could start tracking those
dependencies explicitly too.

A preexisting problem with that query is that it doesn't report
transitive dependencies.  An index on t(x) of a user defined type
defined with CREATE TYPE my_type AS (x text COLLATE "fr_FR") doesn't
result in a pg_depend row from index to collation, so the query fails
to report that as an index needing to be rebuilt.  You could fix that
with a sprinkle of recursive magic, but you'd need a different kind of
magic to deal with transitive dependencies on the default collation
unless we start listing such dependencies explicitly.  In that
example, my_type would need to depend on collation "default".  You
can't just do some kind of search for transitive dependencies on type
"text", because they aren't tracked either.

In my longer term proposal to track per-dependency versions, either by
adding refobjversion to pg_depend or by creating another
pg_depend-like catalog, you'd almost certainly need to add an explicit
record for dependencies on the default collation.

Re: ICU for global collation

From

Peter Eisentraut

Date:

31 October 2019, 07:22:44

On 2019-09-17 15:08, Daniel Verite wrote:
> When trying databases defined with ICU locales, I see that backends
> that serve such databases seem to have their LC_CTYPE inherited from
> the environment (as opposed to a per-database fixed value).

> fr-utf8=# select to_tsvector('été');
> ERROR:    invalid multibyte character for locale
> HINT:  The server's LC_CTYPE locale is probably incompatible with the
> database encoding.

I looked into this problem.  The way to address this would be adding 
proper collation support to the text search subsystem.  See the TODO 
markers in src/backend/tsearch/ts_locale.c for starting points.  These 
APIs spread out to a lot of places, so it will take some time to finish. 
  In the meantime, I'm pausing this thread and will set the CF entry as RwF.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ICU for global collation

From

"Daniel Verite"

Date:

01 November 2019, 18:18:13

    Peter Eisentraut wrote:

> I looked into this problem.  The way to address this would be adding
> proper collation support to the text search subsystem.  See the TODO
> markers in src/backend/tsearch/ts_locale.c for starting points.  These
> APIs spread out to a lot of places, so it will take some time to finish.
>  In the meantime, I'm pausing this thread and will set the CF entry as RwF.

Even if the FTS code is improved in that matter, any extension code
with libc functions depending on LC_CTYPE is still going to be
potentially problematic. In particular when it happens to be set
to a different encoding than the database.

Couldn't we simply invent per-database GUC options, as in
ALTER DATABASE myicudb SET libc_lc_ctype TO 'value';
ALTER DATABASE myicudb SET libc_lc_collate TO 'value';

where libc_lc_ctype/libc_lc_collate would specifically set
the values in the LC_CTYPE and LC_COLLATE environment vars
of any backend serving the corresponding database"?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Peter Eisentraut

Date:

06 November 2019, 10:09:57

On 2019-11-01 19:18, Daniel Verite wrote:
> Even if the FTS code is improved in that matter, any extension code
> with libc functions depending on LC_CTYPE is still going to be
> potentially problematic. In particular when it happens to be set
> to a different encoding than the database.

I think the answer here is that extension code must not do that, at 
least in ways that potentially interact with other parts of the 
(collation-aware) database system.  For example, libc and ICU might have 
different opinions about what is a letter, because of different versions 
of Unicode data in use.  That would then affect tokenization etc. in 
text search and elsewhere.  That's why things like isalpha have to go 
though ICU instead, if that is the collation provider in a particular 
context.

> Couldn't we simply invent per-database GUC options, as in
> ALTER DATABASE myicudb SET libc_lc_ctype TO 'value';
> ALTER DATABASE myicudb SET libc_lc_collate TO 'value';
> 
> where libc_lc_ctype/libc_lc_collate would specifically set
> the values in the LC_CTYPE and LC_COLLATE environment vars
> of any backend serving the corresponding database"?

We could do that as a transition measure to support extensions like you 
mention above.  But our own internal code should not have to rely on that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ICU for global collation

From

Peter Eisentraut

Date:

30 December 2021, 12:07:21

There were a few inquiries about this topic recently, so I dug up the 
old thread and patch.  What we got stuck on last time was that we can't 
just swap out all locale support in a database for ICU.  We still need 
to set the usual locale environment, otherwise some things that are not 
ICU aware will break or degrade.  I had initially anticipated fixing 
that by converting everything that uses libc locales to ICU.  But that 
turned out to be tedious and ultimately not very useful as far as the 
user-facing result is concerned, so I gave up.

So this is a different approach: If you choose ICU as the default locale 
for a database, you still need to specify lc_ctype and lc_collate 
settings, as before.  Unlike in the previous patch, where the ICU 
collation name was written in datcollate, there is now a third column 
(daticucoll), so we can store all three values.  This fixes the 
described problem.  Other than that, once you get all the initial 
settings right, it basically just works: The places that have ICU 
support now will use a database-wide ICU collation if appropriate, the 
places that don't have ICU support continue to use the global libc 
locale settings.

I changed the datcollate, datctype, and the new daticucoll fields to 
type text (from name).  That way, the daticucoll field can be set to 
null if it's not applicable.  Also, the limit of 63 characters can 
actually be a problem if you want to use some combination of the options 
that ICU locales offer.  And for less extreme uses, having 
variable-length fields will save some storage, since typical locale 
names are much shorter.

For the same reasons and to keep things consistent, I also changed the 
analogous pg_collation fields like that.  This also removes some weird 
code that has to check that colcollate and colctype have to be the same 
for ICU, so it's overall cleaner.

Attachment

v3-0001-Add-option-to-use-ICU-as-global-collation-provide.patch

Re: ICU for global collation

From

Julien Rouhaud

Date:

04 January 2022, 02:21:04

Hi,

On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote:
> 
> So this is a different approach: If you choose ICU as the default locale for
> a database, you still need to specify lc_ctype and lc_collate settings, as
> before.  Unlike in the previous patch, where the ICU collation name was
> written in datcollate, there is now a third column (daticucoll), so we can
> store all three values.  This fixes the described problem.  Other than that,
> once you get all the initial settings right, it basically just works: The
> places that have ICU support now will use a database-wide ICU collation if
> appropriate, the places that don't have ICU support continue to use the
> global libc locale settings.

That looks sensible to me.

> @@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout)
>          appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
>                            "(%s datdba) AS dba, "
>                            "pg_encoding_to_char(encoding) AS encoding, "
> +                          "datcollprovider, "

This needs to be in a new pg 15+ branch, not in the pg 9.3+.

> -    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
> -        mylocale = pg_newlocale_from_collation(collid);
> +    if (!lc_collate_is_c(collid))
> +    {
> +        if (collid != DEFAULT_COLLATION_OID)
> +            mylocale = pg_newlocale_from_collation(collid);
> +        else if (default_locale.provider == COLLPROVIDER_ICU)
> +            mylocale = &default_locale;
> +    }

There are really a lot of places with this new code.  Maybe it could be some
new function/macro to wrap that for the normal case (e.g. not formatting.c)?

Re: ICU for global collation

From

Peter Eisentraut

Date:

04 January 2022, 16:03:10

On 04.01.22 03:21, Julien Rouhaud wrote:
>> @@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout)
>>           appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
>>                             "(%s datdba) AS dba, "
>>                             "pg_encoding_to_char(encoding) AS encoding, "
>> +                          "datcollprovider, "
> 
> This needs to be in a new pg 15+ branch, not in the pg 9.3+.

ok

>> -    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
>> -        mylocale = pg_newlocale_from_collation(collid);
>> +    if (!lc_collate_is_c(collid))
>> +    {
>> +        if (collid != DEFAULT_COLLATION_OID)
>> +            mylocale = pg_newlocale_from_collation(collid);
>> +        else if (default_locale.provider == COLLPROVIDER_ICU)
>> +            mylocale = &default_locale;
>> +    }
> 
> There are really a lot of places with this new code.  Maybe it could be some
> new function/macro to wrap that for the normal case (e.g. not formatting.c)?

Right, we could just put this into pg_newlocale_from_collation(), but 
the comment there says

  * In fact, they shouldn't call this function at all when they are dealing
  * with the default locale.  That can save quite a bit in hotspots.

I don't know how to assess that.

We could pack this into a macro or inline function if we are concerned 
about this.

Re: ICU for global collation

From

Julien Rouhaud

Date:

06 January 2022, 03:35:14

On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote:
> On 04.01.22 03:21, Julien Rouhaud wrote:
> 
> > > -    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
> > > -        mylocale = pg_newlocale_from_collation(collid);
> > > +    if (!lc_collate_is_c(collid))
> > > +    {
> > > +        if (collid != DEFAULT_COLLATION_OID)
> > > +            mylocale = pg_newlocale_from_collation(collid);
> > > +        else if (default_locale.provider == COLLPROVIDER_ICU)
> > > +            mylocale = &default_locale;
> > > +    }
> > 
> > There are really a lot of places with this new code.  Maybe it could be some
> > new function/macro to wrap that for the normal case (e.g. not formatting.c)?
> 
> Right, we could just put this into pg_newlocale_from_collation(), but the
> comment there says
> 
>  * In fact, they shouldn't call this function at all when they are dealing
>  * with the default locale.  That can save quite a bit in hotspots.
> 
> I don't know how to assess that.
> 
> We could pack this into a macro or inline function if we are concerned about
> this.

Yes that was my idea, just have a new function (inline function or a macro
then since pg_newlocale_from_collation() clearly warns about performance
concerns) that have the whole
is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls
pg_newlocale_from_collation() only when needed.

Re: ICU for global collation

From

"Finnerty, Jim"

Date:

06 January 2022, 13:55:55

I didn't notice anything version-specific about the patch.  Would any modifications be needed to backport it to pg13
andpg14?
 

After this patch goes in, the big next thing would be to support nondeterministic collations for LIKE, ILIKE and
patternmatching operators in general.  Is anyone interested in working on that?
 

On 1/5/22, 10:36 PM, "Julien Rouhaud" <rjuju123@gmail.com> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you
canconfirm the sender and know the content is safe.
 



    On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote:
    > On 04.01.22 03:21, Julien Rouhaud wrote:
    >
    > > > - if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
    > > > -         mylocale = pg_newlocale_from_collation(collid);
    > > > + if (!lc_collate_is_c(collid))
    > > > + {
    > > > +         if (collid != DEFAULT_COLLATION_OID)
    > > > +                 mylocale = pg_newlocale_from_collation(collid);
    > > > +         else if (default_locale.provider == COLLPROVIDER_ICU)
    > > > +                 mylocale = &default_locale;
    > > > + }
    > >
    > > There are really a lot of places with this new code.  Maybe it could be some
    > > new function/macro to wrap that for the normal case (e.g. not formatting.c)?
    >
    > Right, we could just put this into pg_newlocale_from_collation(), but the
    > comment there says
    >
    >  * In fact, they shouldn't call this function at all when they are dealing
    >  * with the default locale.  That can save quite a bit in hotspots.
    >
    > I don't know how to assess that.
    >
    > We could pack this into a macro or inline function if we are concerned about
    > this.

    Yes that was my idea, just have a new function (inline function or a macro
    then since pg_newlocale_from_collation() clearly warns about performance
    concerns) that have the whole
    is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls
    pg_newlocale_from_collation() only when needed.

Re: ICU for global collation

From

Julien Rouhaud

Date:

06 January 2022, 14:45:43

On Thu, Jan 06, 2022 at 01:55:55PM +0000, Finnerty, Jim wrote:
> 
> I didn't notice anything version-specific about the patch.  Would any
> modifications be needed to backport it to pg13 and pg14?

This is a new feature so it can't be backported.  The changes aren't big and
mostly touches places that didn't change in a long time so I don't think that
it would take much effort if you wanted to backport it on your own forks.

> After this patch goes in, the big next thing would be to support
> nondeterministic collations for LIKE, ILIKE and pattern matching operators in
> general.  Is anyone interested in working on that?

As far as I know you're the last person that seemed to be working on that topic
back in March :)

Re: ICU for global collation

From

Julien Rouhaud

Date:

07 January 2022, 09:03:29

Hi,

I looked a bit more in this patch and I have some additional remarks.

On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote:
> 
> So this is a different approach: If you choose ICU as the default locale for
> a database, you still need to specify lc_ctype and lc_collate settings, as
> before.  Unlike in the previous patch, where the ICU collation name was
> written in datcollate, there is now a third column (daticucoll), so we can
> store all three values.  This fixes the described problem.  Other than that,
> once you get all the initial settings right, it basically just works: The
> places that have ICU support now will use a database-wide ICU collation if
> appropriate, the places that don't have ICU support continue to use the
> global libc locale settings.

So just to confirm a database can now have 2 different *default* collations: a
libc-based one for everything that doesn't work with ICU and a ICU-based (if
specified) for everything else, and the ICU-based is optional, so if not
provided everything works as before, using the libc based default collation.

As I mentioned I think this approach is sensible.  However, should we document
what are the things that are not ICU-aware?

> I changed the datcollate, datctype, and the new daticucoll fields to type
> text (from name).  That way, the daticucoll field can be set to null if it's
> not applicable.  Also, the limit of 63 characters can actually be a problem
> if you want to use some combination of the options that ICU locales offer.
> And for less extreme uses, having variable-length fields will save some
> storage, since typical locale names are much shorter.

I understand the need to have daticucoll as text, however it's not clear to me
why this has to be changed for datcollate and datctype?  IIUC those will only
ever contain libc-based collation and are still mandatory?
> 
> For the same reasons and to keep things consistent, I also changed the
> analogous pg_collation fields like that.

The respective fields in pg_collation are now nullable, so the changes there
sounds ok.

Digging a bit more in the patch here are some things that looks problematic.

- pg_upgrade

It checks (in check_locale_and_encoding()) the compatibility for each database,
and it looks like the daticucoll field should also be verified.  Other than
that I don't think there is anything else needed for the pg_upgrade part as
everything else should be handled by pg_dump (I didn't look at the changes yet
given the below problems).

- CREATE DATABASE

There's a new COLLATION_PROVIDER option, but the overall behavior seems quite
unintuitive.  As far as I can see the idea is to use LOCALE for the ICU default
collation, but as it's providing a default for the other values it's quite
annoying.  For instance:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'fr-x-icu' LC_COLLATE 'en_GB.UTF-8';;
ERROR:  42809: invalid locale name: "fr-x-icu"
LOCATION:  createdb, dbcommands.c:397

Looking at the code it's actually complaining about LC_CTYPE.  If you want a
database with an ICU default collation the lc_collate and lc_ctype should
inherit what's in the template database and not what was provided in  the
LOCALE I think.  You could still probably overload them in some scenario, but
without a list of what isn't ICU-aware I can't really be sure of how often one
might have to do it.

Now, if I specify everything as needed it looks like it's missing some checks
on the ICU default collation when not using template0:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db        | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
(4 rows)

Unless I'm missing something the same concerns about collation incompatibility
with objects in the source database should also apply for the ICU collation?

While at it, I'm not exactly sure of what the COLLATION_PROVIDER is supposed to
mean, as the same commands but with a libc provider is accepted and has
the exact same result:

=# CREATE DATABASE db2 COLLATION_PROVIDER libc LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db        | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db2       | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
(5 rows)

Shouldn't db2 have a NULL daticucoll, and if so also complain about
incompatibility for it?

- initdb

I don't think that initdb --collation-provider icu should be allowed without
--icu-locale, same for --collation-provider libc *with* --icu-locale.

When trying that, I can also see that the NULL handling for daticucoll is
broken in the BKI:

$ initdb -k --collation-provider icu
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
(3 rows)

There's a fallback on my LANG/LC_* env settings, but I don't think it can ever
be correct given the different naming convention in ICU (at least the s/_/-/).

And

$ initdb -k --collation-provider libc --icu-locale test
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | _null_
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
(3 rows)

Re: ICU for global collation

From

Peter Eisentraut

Date:

07 January 2022, 14:25:28

On 04.01.22 17:03, Peter Eisentraut wrote:
>> There are really a lot of places with this new code.  Maybe it could 
>> be some
>> new function/macro to wrap that for the normal case (e.g. not 
>> formatting.c)?
> 
> Right, we could just put this into pg_newlocale_from_collation(), but 
> the comment there says
> 
>   * In fact, they shouldn't call this function at all when they are dealing
>   * with the default locale.  That can save quite a bit in hotspots.
> 
> I don't know how to assess that.

I tested this a bit.  I used the following setup:

create table t1 (a text);
insert into t1 select md5(generate_series(1, 10000000)::text);
select count(*) from t1 where a > '';

And then I changed in varstr_cmp():

         if (collid != DEFAULT_COLLATION_OID)
             mylocale = pg_newlocale_from_collation(collid);

to just

         mylocale = pg_newlocale_from_collation(collid);

I find that the \timing results are indistinguishable.  (I used locale 
"en_US.UTF-8" and made sure that that code path is actually hit.)

Does anyone have other insights?

Re: ICU for global collation

From

"Daniel Verite"

Date:

07 January 2022, 17:19:09

    Julien Rouhaud wrote:

> If you want a database with an ICU default collation the lc_collate
> and lc_ctype should inherit what's in the template database and not
> what was provided in the LOCALE I think.  You could still probably
> overload them in some scenario, but without a list of what isn't
> ICU-aware I can't really be sure of how often one might have to do
> it.

I guess we'd need that when creating a database with a different
encoding than the template databases, at least.

About what's not ICU-aware, I believe the most significant part in
core is the Full Text Search parser.
It doesn't care about sorting strings, but it relies on the functions
from POSIX <ctype.h>, which depend on LC_CTYPE
(it looks however that this could be improved by following
what has been done in backend/regex/regc_pg_locale.c, which has
comparable needs and calls ICU functions when applicable).

Also, any extension is potentially concerned. Surely many
extensions call functions from ctype.h assuming that
the current value of LC_CTYPE works with the data they handle.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Julien Rouhaud

Date:

10 January 2022, 03:25:08

On Fri, Jan 07, 2022 at 03:25:28PM +0100, Peter Eisentraut wrote:
> 
> I tested this a bit.  I used the following setup:
> 
> create table t1 (a text);
> insert into t1 select md5(generate_series(1, 10000000)::text);
> select count(*) from t1 where a > '';
> 
> And then I changed in varstr_cmp():
> 
>         if (collid != DEFAULT_COLLATION_OID)
>             mylocale = pg_newlocale_from_collation(collid);
> 
> to just
> 
>         mylocale = pg_newlocale_from_collation(collid);
> 
> I find that the \timing results are indistinguishable.  (I used locale
> "en_US.UTF-8" and made sure that that code path is actually hit.)
> 
> Does anyone have other insights?

Looking at the git history, you added this comment in 414c5a2ea65.

After a bit a digging in the lists, I found that you introduced it to fix a
reported 13% slowdown in varstr_cmp():
https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com
https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net

Re: ICU for global collation

From

Julien Rouhaud

Date:

10 January 2022, 06:00:27

On Mon, Jan 10, 2022 at 11:25:08AM +0800, Julien Rouhaud wrote:
> On Fri, Jan 07, 2022 at 03:25:28PM +0100, Peter Eisentraut wrote:
> > 
> > I tested this a bit.  I used the following setup:
> > 
> > create table t1 (a text);
> > insert into t1 select md5(generate_series(1, 10000000)::text);
> > select count(*) from t1 where a > '';
> > 
> > And then I changed in varstr_cmp():
> > 
> >         if (collid != DEFAULT_COLLATION_OID)
> >             mylocale = pg_newlocale_from_collation(collid);
> > 
> > to just
> > 
> >         mylocale = pg_newlocale_from_collation(collid);
> > 
> > I find that the \timing results are indistinguishable.  (I used locale
> > "en_US.UTF-8" and made sure that that code path is actually hit.)
> > 
> > Does anyone have other insights?
> 
> Looking at the git history, you added this comment in 414c5a2ea65.
> 
> After a bit a digging in the lists, I found that you introduced it to fix a
> reported 13% slowdown in varstr_cmp():
> https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com
> https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net

So I tried to run Noah's benchmark to see if I could reproduce the slowdown.
Unfortunately the results I'm getting don't really make sense as removing the
optimisation brings a 15% speedup, and with a few more runs I can see that I
have about 25% noise, so there isn't much I can do to help.

Re: ICU for global collation

From

"Daniel Verite"

Date:

10 January 2022, 11:49:07

    Peter Eisentraut wrote:

> Unlike in the previous patch, where the ICU
> collation name was written in datcollate, there is now a third column
> (daticucoll), so we can store all three values.

I think some users would want their db-wide ICU collation to be
case/accent-insensitive. Postgres users are trained to expect
case-sensitive comparisons, but some apps initially made for
e.g. MySQL or MS-SQL that use such collations natively would be easier
to port to Postgres.
IIRC, that was the context for some questions where people were
enquiring about db-wide ICU collations.

With the current patch, it's not possible, AFAICS, because the user
can't tell that the collation is non-deterministic. Presumably this
would require another option to CREATE DATABASE and another
column to store that bit of information.

The "daticucol" column also suggests that we don't expect to add
other collation providers in the future. Maybe a pair of columns like
(datcollprovider, datcolllocale) would be more future-proof,
or a (datcollprovider, datcolllocale, datcollisdeterministic)
triplet if non-deterministic collations are allowed.

Also, pg_collation has "collversion" to detect a mismatch between
the ICU runtime and existing indexes. I don't see that field
for the db-wide ICU collation, so maybe we currently miss the capability
to detect the mismatch for the db-wide collation?

The lack of these fields overall suggest the idea that when CREATE
DATABASE is called with a global ICU collation, what if it somehow
inserted the collation into pg_collation in the new database?
Then pg_database would just store the collation oid and no other
collation-related field would need to be added into it, now
or in the future.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Julien Rouhaud

Date:

10 January 2022, 12:56:56

On Mon, Jan 10, 2022 at 12:49:07PM +0100, Daniel Verite wrote:
> 
> The "daticucol" column also suggests that we don't expect to add
> other collation providers in the future. Maybe a pair of columns like
> (datcollprovider, datcolllocale) would be more future-proof,
> or a (datcollprovider, datcolllocale, datcollisdeterministic)
> triplet if non-deterministic collations are allowed.

I'm not sure about the non-deterministic default collation given the current
restrictions with it, but the extra column seems like a good idea.  It would
require a bit more thinking, as we would need a second collation column in
pg_database for any default provider that's not libc.

> Also, pg_collation has "collversion" to detect a mismatch between
> the ICU runtime and existing indexes. I don't see that field
> for the db-wide ICU collation, so maybe we currently miss the capability
> to detect the mismatch for the db-wide collation?

I don't think that storing a version there will really help.  There's no
guarantee that any object has been created with the version of the collation
that was installed when the database was created.  And we would still need
to store a version with each underlying object anyway, as rebuilding all broken
dependencies can last for a long time, including a server restart.

> The lack of these fields overall suggest the idea that when CREATE
> DATABASE is called with a global ICU collation, what if it somehow
> inserted the collation into pg_collation in the new database?
> Then pg_database would just store the collation oid and no other
> collation-related field would need to be added into it, now
> or in the future.

I don't think it would be doable given the single-database-per-backend
restriction.

Re: ICU for global collation

From

"Daniel Verite"

Date:

10 January 2022, 14:45:47

    Julien Rouhaud wrote:

> > The lack of these fields overall suggest the idea that when CREATE
> > DATABASE is called with a global ICU collation, what if it somehow
> > inserted the collation into pg_collation in the new database?
> > Then pg_database would just store the collation oid and no other
> > collation-related field would need to be added into it, now
> > or in the future.
>
> I don't think it would be doable given the single-database-per-backend
> restriction.

By that I understand that CREATE DATABASE is limited to copying a template
database and then not write anything into it beyond that, as it's
not even connected to it.
I guess there's still the possibility of requiring that the ICU db-wide
collation of the new database does exist in the template database,
and then the CREATE DATABASE would refer to that collation instead of
an independent locale string.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Julien Rouhaud

Date:

10 January 2022, 15:14:31

On Mon, Jan 10, 2022 at 03:45:47PM +0100, Daniel Verite wrote:
> 
> By that I understand that CREATE DATABASE is limited to copying a template
> database and then not write anything into it beyond that, as it's
> not even connected to it.

Yes.

> I guess there's still the possibility of requiring that the ICU db-wide
> collation of the new database does exist in the template database,
> and then the CREATE DATABASE would refer to that collation instead of
> an independent locale string.

That could work.  However if having the collation in the template database a
strict requirement the something should also be done for initdb, and it will
probably be a bigger problem.

Re: ICU for global collation

From

Peter Eisentraut

Date:

11 January 2022, 09:10:25

On 10.01.22 07:00, Julien Rouhaud wrote:
>>> And then I changed in varstr_cmp():
>>>
>>>          if (collid != DEFAULT_COLLATION_OID)
>>>              mylocale = pg_newlocale_from_collation(collid);
>>>
>>> to just
>>>
>>>          mylocale = pg_newlocale_from_collation(collid);
>>>
>>> I find that the \timing results are indistinguishable.  (I used locale
>>> "en_US.UTF-8" and made sure that that code path is actually hit.)
>>>
>>> Does anyone have other insights?
>>
>> Looking at the git history, you added this comment in 414c5a2ea65.
>>
>> After a bit a digging in the lists, I found that you introduced it to fix a
>> reported 13% slowdown in varstr_cmp():
>> https://www.postgresql.org/message-id/20110129075253.GA18784%40tornado.leadboat.com
>> https://www.postgresql.org/message-id/1296748408.6442.1.camel%40vanquo.pezone.net
> 
> So I tried to run Noah's benchmark to see if I could reproduce the slowdown.
> Unfortunately the results I'm getting don't really make sense as removing the
> optimisation brings a 15% speedup, and with a few more runs I can see that I
> have about 25% noise, so there isn't much I can do to help.

Heh, I had that same experience, it actually got faster without the 
optimization, but then got lost in the noise on further testing.

Looking back at those discussions, I don't think those old test results 
are relevant anymore.  In the patch that was being tested there, 
pg_newlocale_from_collation(), did not contain

     if (collid == DEFAULT_COLLATION_OID)
         return (pg_locale_t) 0;

so the default collation actually went through most or all of the 
function and did a lot of work.  That would understandably be quite 
slow.  But just calling a function and returning immediately should not 
be a problem.  Otherwise, the call to check_collation_set() in 
varstr_cmp() and elsewhere would be just as bad.

So, unless there are concerns, I'm going to see about making a patch to 
call pg_newlocale_from_collation() even with the default collation. 
That would make the actual feature patch quite a bit smaller, since we 
won't have to patch every call site of pg_newlocale_from_collation().

Re: ICU for global collation

From

Julien Rouhaud

Date:

11 January 2022, 11:08:52

On Tue, Jan 11, 2022 at 10:10:25AM +0100, Peter Eisentraut wrote:
> 
> On 10.01.22 07:00, Julien Rouhaud wrote:
> > 
> > So I tried to run Noah's benchmark to see if I could reproduce the slowdown.
> > Unfortunately the results I'm getting don't really make sense as removing the
> > optimisation brings a 15% speedup, and with a few more runs I can see that I
> > have about 25% noise, so there isn't much I can do to help.
> 
> Heh, I had that same experience, it actually got faster without the
> optimization, but then got lost in the noise on further testing.

Ah, so it's not just my machine :)

> Looking back at those discussions, I don't think those old test results are
> relevant anymore.  In the patch that was being tested there,
> pg_newlocale_from_collation(), did not contain
> 
>     if (collid == DEFAULT_COLLATION_OID)
>         return (pg_locale_t) 0;
> 
> so the default collation actually went through most or all of the function
> and did a lot of work.  That would understandably be quite slow.  But just
> calling a function and returning immediately should not be a problem.
> Otherwise, the call to check_collation_set() in varstr_cmp() and elsewhere
> would be just as bad.

I didn't noticed that.  That definitely explain why the performance concern
isn't valid anymore.

> So, unless there are concerns, I'm going to see about making a patch to call
> pg_newlocale_from_collation() even with the default collation. That would
> make the actual feature patch quite a bit smaller, since we won't have to
> patch every call site of pg_newlocale_from_collation().

+1 for me!

Re: ICU for global collation

From

Peter Eisentraut

Date:

11 January 2022, 11:10:14

On 07.01.22 10:03, Julien Rouhaud wrote:
>> I changed the datcollate, datctype, and the new daticucoll fields to type
>> text (from name).  That way, the daticucoll field can be set to null if it's
>> not applicable.  Also, the limit of 63 characters can actually be a problem
>> if you want to use some combination of the options that ICU locales offer.
>> And for less extreme uses, having variable-length fields will save some
>> storage, since typical locale names are much shorter.
> 
> I understand the need to have daticucoll as text, however it's not clear to me
> why this has to be changed for datcollate and datctype?  IIUC those will only
> ever contain libc-based collation and are still mandatory?

Right.  I just did this for consistency.  It would be strange otherwise 
to have some fields as name and some as text.  Arguably, using "name" 
here was wrong to begin with, since they are not really object names. 
Maybe there used to be a reason to avoid variable-length fields in 
pg_database, but there isn't one now AFAICT.

> - pg_upgrade
> 
> It checks (in check_locale_and_encoding()) the compatibility for each database,
> and it looks like the daticucoll field should also be verified.  Other than
> that I don't think there is anything else needed for the pg_upgrade part as
> everything else should be handled by pg_dump (I didn't look at the changes yet
> given the below problems).

Ok, I have added this and will include it in my next patch submission.

> - CREATE DATABASE

> - initdb

Ok, some work is needed to make these interfaces behave sensibly in 
various combinations.  I will look into that.

Re: ICU for global collation

From

Peter Eisentraut

Date:

11 January 2022, 11:18:58

On 10.01.22 12:49, Daniel Verite wrote:
> With the current patch, it's not possible, AFAICS, because the user
> can't tell that the collation is non-deterministic. Presumably this
> would require another option to CREATE DATABASE and another
> column to store that bit of information.

Adding this would be easy, but since pattern matching currently does not 
support nondeterministic collations, if you make a global collation 
nondeterministic, a lot of system views, psql, pg_dump queries etc. 
would break, so it's not practical.  I view this is an orthogonal 
project.  Once we can support this without breaking system views etc., 
then it's easy to enable with a new column in pg_database.

> The "daticucol" column also suggests that we don't expect to add
> other collation providers in the future. Maybe a pair of columns like
> (datcollprovider, datcolllocale) would be more future-proof,
> or a (datcollprovider, datcolllocale, datcollisdeterministic)
> triplet if non-deterministic collations are allowed.

I don't expect many new collation providers.  So I don't think an 
EAV-like storage would be helpful.  The other problem is that we don't 
know what we need.  For example, the libc provider needs both a collate 
and a ctype value, so that wouldn't fit into that scheme nicely.

> Also, pg_collation has "collversion" to detect a mismatch between
> the ICU runtime and existing indexes. I don't see that field
> for the db-wide ICU collation, so maybe we currently miss the capability
> to detect the mismatch for the db-wide collation?

Yeah, I think I need to add a datcollversion field and the associated 
checks.

Re: ICU for global collation

From

"Daniel Verite"

Date:

11 January 2022, 11:36:46

    Julien Rouhaud wrote:

> > I guess there's still the possibility of requiring that the ICU db-wide
> > collation of the new database does exist in the template database,
> > and then the CREATE DATABASE would refer to that collation instead of
> > an independent locale string.
>
> That could work.  However if having the collation in the template database a
> strict requirement the something should also be done for initdb, and it will
> probably be a bigger problem.

If CREATE DATABASE referred to a collation in the template db,
either that collation already exists, or the user would have to add it
to the template db with CREATE COLLATION.
initdb already populates the template databases with a full set of
ICU collations through pg_import_system_collations().
I don't quite see what change you're seeing that would be needed in
initdb.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: ICU for global collation

From

Julien Rouhaud

Date:

11 January 2022, 11:47:00

On Tue, Jan 11, 2022 at 12:36:46PM +0100, Daniel Verite wrote:
> 
> If CREATE DATABASE referred to a collation in the template db,
> either that collation already exists, or the user would have to add it
> to the template db with CREATE COLLATION.
> initdb already populates the template databases with a full set of
> ICU collations through pg_import_system_collations().
> I don't quite see what change you're seeing that would be needed in
> initdb.

Yes, there are already the system collation imported.  But you still need to
make sure that that collation does exist in the template database, and it's
still impossible to connect to that database to check when processing the
CREATE DATABASE.  Also, if the wanted collation wasn't imported by
pg_import_system_collations() and isn't the server's default collation, then
the user would have to allow connection on the template0 database and create
the wanted collation there.  It doesn't seem like something that should be
recommended for a reasonably standard use case.

Re: ICU for global collation

From

Peter Eisentraut

Date:

13 January 2022, 08:39:42

On 11.01.22 12:08, Julien Rouhaud wrote:
>> So, unless there are concerns, I'm going to see about making a patch to call
>> pg_newlocale_from_collation() even with the default collation. That would
>> make the actual feature patch quite a bit smaller, since we won't have to
>> patch every call site of pg_newlocale_from_collation().
> +1 for me!

Here is that patch.

If this is applied, then in my estimation all these hunks will 
completely disappear from the global ICU patch.  So this would be a 
significant win.

Attachment

0001-Call-pg_newlocale_from_collation-also-with-default-c.patch

Re: ICU for global collation

From

"Finnerty, Jim"

Date:

17 January 2022, 17:30:45

Re: 
>> After this patch goes in, the big next thing would be to support
>> nondeterministic collations for LIKE, ILIKE and pattern matching operators in
>> general.  Is anyone interested in working on that?

> As far as I know you're the last person that seemed to be working on that topic
> back in March :)

I have a solution for LIKE and ILIKE for case-insensitive, accent-sensitive ICU collations and the UTF8 server
encoding,but didn't attempt to address the general case until ICU collations were supported at the database and cluster
levels. I may have another look at that after this patch goes in.

Re: ICU for global collation

From

"Finnerty, Jim"

Date:

17 January 2022, 19:07:38

On 10.01.22 12:49, Daniel Verite wrote:

> I think some users would want their db-wide ICU collation to be
> case/accent-insensitive. 
...
> IIRC, that was the context for some questions where people were
> enquiring about db-wide ICU collations.

+1.  There is the DEFAULT_COLLATION_OID, which is the cluster-level default collation, a.k.a. the "global collation",
asdistinct from the "db-wide" database-level default collation, which controls the default type of the collatable types
withinthat database.

> With the current patch, it's not possible, AFAICS, because the user
> can't tell that the collation is non-deterministic. Presumably this
> would require another option to CREATE DATABASE and another
> column to store that bit of information.

On 1/11/22, 6:24 AM, "Peter Eisentraut" <peter.eisentraut@enterprisedb.com> wrote:

>  Adding this would be easy, but since pattern matching currently does not
>  support nondeterministic collations, if you make a global collation
>  nondeterministic, a lot of system views, psql, pg_dump queries etc.
>  would break, so it's not practical.  I view this is an orthogonal
>  project.  Once we can support this without breaking system views etc.,
>  then it's easy to enable with a new column in pg_database.

So this patch only enables the default cluster collation (DEFAULT_COLLATION_OID) to be a deterministic ICU collation,
butdoesn't extend the metadata in a way that would enable database collations to be ICU collations?

Waiting for the pattern matching problem to be solved in general before creating the metadata to support ICU collations
everywherewill make it more difficult for members of the community to help solve the pattern matching problem.  

What additional metadata changes would be required to enable an ICU collation to be specified at either the
cluster-levelor the database-level, even if new checks need to be added to disallow a nondeterministic collation to be
specifiedat the cluster level for now?

Re: ICU for global collation

From

Julien Rouhaud

Date:

18 January 2022, 03:47:13

Hi,

On Mon, Jan 17, 2022 at 07:07:38PM +0000, Finnerty, Jim wrote:
>     On 10.01.22 12:49, Daniel Verite wrote:
> 
> > I think some users would want their db-wide ICU collation to be
> > case/accent-insensitive. 
> ...
> > IIRC, that was the context for some questions where people were
> > enquiring about db-wide ICU collations.
> 
> +1.  There is the DEFAULT_COLLATION_OID, which is the cluster-level default
> collation, a.k.a. the "global collation", as distinct from the "db-wide"
> database-level default collation, which controls the default type of the
> collatable types within that database.

There's no cluster-level default collation, and DEFAULT_COLLATION_OID is always
database-level (and template1 having a specific default collation).  The
template0 database is there to be able to support different databases with
different default collations, among other things.

So this patchset would allow per-database default ICU collation, although not
non-deterministic ones.

Re: ICU for global collation

From

Julien Rouhaud

Date:

18 January 2022, 04:02:46

Hi,

On Thu, Jan 13, 2022 at 09:39:42AM +0100, Peter Eisentraut wrote:
> On 11.01.22 12:08, Julien Rouhaud wrote:
> > > So, unless there are concerns, I'm going to see about making a patch to call
> > > pg_newlocale_from_collation() even with the default collation. That would
> > > make the actual feature patch quite a bit smaller, since we won't have to
> > > patch every call site of pg_newlocale_from_collation().
> > +1 for me!
> 
> Here is that patch.

The patch is quite straightforward so I don't have much to say, it all looks
good and passes all regression tests.

> If this is applied, then in my estimation all these hunks will completely
> disappear from the global ICU patch.  So this would be a significant win.

Agreed, the patch will be quite smaller and also easier to review.  Are you
planning to apply it on its own or add it to the default ICU patchset?  Given
the possible backpatch conflicts it can bring I'm not sure it's worthy enough
to apply on its own.

Re: ICU for global collation

From

Peter Eisentraut

Date:

18 January 2022, 12:54:50

On 18.01.22 05:02, Julien Rouhaud wrote:
>> If this is applied, then in my estimation all these hunks will completely
>> disappear from the global ICU patch.  So this would be a significant win.
> Agreed, the patch will be quite smaller and also easier to review.  Are you
> planning to apply it on its own or add it to the default ICU patchset?

My plan is to apply this patch in the next few days.

Re: ICU for global collation

From

Peter Eisentraut

Date:

21 January 2022, 09:44:24

On 18.01.22 13:54, Peter Eisentraut wrote:
> On 18.01.22 05:02, Julien Rouhaud wrote:
>>> If this is applied, then in my estimation all these hunks will 
>>> completely
>>> disappear from the global ICU patch.  So this would be a significant 
>>> win.
>> Agreed, the patch will be quite smaller and also easier to review.  
>> Are you
>> planning to apply it on its own or add it to the default ICU patchset?
> 
> My plan is to apply this patch in the next few days.

This patch has been committed.

Here is a second preparation patch: Change collate and ctype fields to 
type text.

I think this is useful by itself because it allows longer locale names. 
ICU locale names with several options appended can be longer than 63 
bytes.  This case is probably broken right now.  Also, it saves space in 
the typical case, since most locale names are not anywhere near 63 bytes.

Attachment

0001-Change-collate-and-ctype-fields-to-type-text.patch

Re: ICU for global collation

From

Julien Rouhaud

Date:

21 January 2022, 13:51:19

Hi,

On Fri, Jan 21, 2022 at 10:44:24AM +0100, Peter Eisentraut wrote:
> 
> Here is a second preparation patch: Change collate and ctype fields to type
> text.
> 
> I think this is useful by itself because it allows longer locale names. ICU
> locale names with several options appended can be longer than 63 bytes.
> This case is probably broken right now.  Also, it saves space in the typical
> case, since most locale names are not anywhere near 63 bytes.

I totally agree.

> From 1c46bf3138ad42074971aa3130142236de7e63f7 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut <peter@eisentraut.org>
> Date: Fri, 21 Jan 2022 10:01:25 +0100
> Subject: [PATCH] Change collate and ctype fields to type text

+            collversionstr = TextDatumGetCString(datum);
+
             actual_versionstr = get_collation_actual_version(collform->collprovider, collcollate);
             if (!actual_versionstr)
             {
@@ -1606,7 +1616,6 @@ pg_newlocale_from_collation(Oid collid)
                         (errmsg("collation \"%s\" has no actual version, but a version was specified",
                                 NameStr(collform->collname))));
             }
-            collversionstr = TextDatumGetCString(collversion);


Is that change intended?  There isn't any usage of the collversionstr before
the possible error when actual_versionstr is missing.

Apart from that the patch looks good to me, all tests pass and no issue with
building the doc either.

Re: ICU for global collation

From

Peter Eisentraut

Date:

21 January 2022, 14:24:02

On 21.01.22 14:51, Julien Rouhaud wrote:
>>  From 1c46bf3138ad42074971aa3130142236de7e63f7 Mon Sep 17 00:00:00 2001
>> From: Peter Eisentraut <peter@eisentraut.org>
>> Date: Fri, 21 Jan 2022 10:01:25 +0100
>> Subject: [PATCH] Change collate and ctype fields to type text
> 
> +            collversionstr = TextDatumGetCString(datum);
> +
>               actual_versionstr = get_collation_actual_version(collform->collprovider, collcollate);
>               if (!actual_versionstr)
>               {
> @@ -1606,7 +1616,6 @@ pg_newlocale_from_collation(Oid collid)
>                           (errmsg("collation \"%s\" has no actual version, but a version was specified",
>                                   NameStr(collform->collname))));
>               }
> -            collversionstr = TextDatumGetCString(collversion);
> 
> 
> Is that change intended?  There isn't any usage of the collversionstr before
> the possible error when actual_versionstr is missing.

I wanted to move it closer to the SysCacheGetAttr() where the "datum" 
value is obtained.  It seemed weird to get the datum, then do other 
things, then decode the datum.

Re: ICU for global collation

From

Julien Rouhaud

Date:

21 January 2022, 16:13:31

On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote:
> On 21.01.22 14:51, Julien Rouhaud wrote:
> > Is that change intended?  There isn't any usage of the collversionstr before
> > the possible error when actual_versionstr is missing.
> 
> I wanted to move it closer to the SysCacheGetAttr() where the "datum" value
> is obtained.  It seemed weird to get the datum, then do other things, then
> decode the datum.

Oh ok.  It won't make much difference performance-wise, so no objection.

Re: ICU for global collation

From

Peter Eisentraut

Date:

27 January 2022, 08:10:46

On 21.01.22 17:13, Julien Rouhaud wrote:
> On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote:
>> On 21.01.22 14:51, Julien Rouhaud wrote:
>>> Is that change intended?  There isn't any usage of the collversionstr before
>>> the possible error when actual_versionstr is missing.
>>
>> I wanted to move it closer to the SysCacheGetAttr() where the "datum" value
>> is obtained.  It seemed weird to get the datum, then do other things, then
>> decode the datum.
> 
> Oh ok.  It won't make much difference performance-wise, so no objection.

I have committed this and will provide follow-up patches in the next few 
days.

Re: ICU for global collation

From

Peter Eisentraut

Date:

02 February 2022, 13:01:23

On 27.01.22 09:10, Peter Eisentraut wrote:
> On 21.01.22 17:13, Julien Rouhaud wrote:
>> On Fri, Jan 21, 2022 at 03:24:02PM +0100, Peter Eisentraut wrote:
>>> On 21.01.22 14:51, Julien Rouhaud wrote:
>>>> Is that change intended?  There isn't any usage of the 
>>>> collversionstr before
>>>> the possible error when actual_versionstr is missing.
>>>
>>> I wanted to move it closer to the SysCacheGetAttr() where the "datum" 
>>> value
>>> is obtained.  It seemed weird to get the datum, then do other things, 
>>> then
>>> decode the datum.
>>
>> Oh ok.  It won't make much difference performance-wise, so no objection.
> 
> I have committed this and will provide follow-up patches in the next few 
> days.

Here is the main patch rebased on the various changes that have been 
committed in the meantime.  There is still some work to be done on the 
user interfaces of initdb, createdb, etc.

I have split out the database-level collation version tracking into a 
separate patch [0].  I think we should get that one in first and then 
refresh this one.

[0]: 
https://www.postgresql.org/message-id/flat/f0ff3190-29a3-5b39-a179-fa32eee57db6%40enterprisedb.com

Attachment

v4-0001-Add-option-to-use-ICU-as-global-collation-provide.patch

Re: ICU for global collation

From

Peter Eisentraut

Date:

16 February 2022, 14:25:40

On 02.02.22 14:01, Peter Eisentraut wrote:
> Here is the main patch rebased on the various changes that have been 
> committed in the meantime.  There is still some work to be done on the 
> user interfaces of initdb, createdb, etc.
> 
> I have split out the database-level collation version tracking into a 
> separate patch [0].  I think we should get that one in first and then 
> refresh this one.

All that preliminary work has been completed, so here is a new patch.

There isn't actually much left here now except all the new DDL and 
command-line options to set this up and documentation for those.  I have 
given all that another review and I hope it's more intuitive now, but I 
guess there will be other opinions.

I have changed the terminology a bit to match ICU better.  It's now 
called "ICU locale ID" and "locale provider" (instead of "collation"). 
It might actually cover things that are not strictly collations (such as 
the isalpha stuff in text search, maybe, in the future).

One thing that is left that bothers me is that the invocations of 
get_collation_actual_version() have all gotten quite complicated.  I'm 
thinking about ways to refactor that, but I haven't got a great idea yet.

Attachment

v5-0001-Add-option-to-use-ICU-as-global-locale-provider.patch

Re: ICU for global collation

From

Julien Rouhaud

Date:

05 March 2022, 08:38:30

Hi,

On Wed, Feb 16, 2022 at 03:25:40PM +0100, Peter Eisentraut wrote:
>
> All that preliminary work has been completed, so here is a new patch.
>
> There isn't actually much left here now except all the new DDL and
> command-line options to set this up and documentation for those.  I have
> given all that another review and I hope it's more intuitive now, but I
> guess there will be other opinions.

Sorry it took me a bit of time to come back to this patch.

TL;DR it works as expected.  I just have a few comments, mostly on the doc.

I say it works because I did manually check, as far as I can see there isn't
any test that ensures it.

I'm using this naive scenario:

DROP DATABASE IF EXISTS dbicu;
CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
\c dbicu
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
SELECT def AS def FROM icu ORDER BY def;
SELECT def AS en FROM icu ORDER BY en;
SELECT def AS upfirst FROM icu ORDER BY upfirst;
SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";

Maybe there should be some test along those lines included in the patch?

> I have changed the terminology a bit to match ICU better.  It's now called
> "ICU locale ID" and "locale provider" (instead of "collation"). It might
> actually cover things that are not strictly collations (such as the isalpha
> stuff in text search, maybe, in the future).

I'm not sure that's actually such an improvement as-is.  Simply saying "ICU
locale ID" is, at least for me, somewhat ambiguous as I don't see any place in
our docs where it's clearly defined.  I'm afraid that users might confuse it
with the OID of a pg_collation line, or even a collation name (like en-x-icu),
especially since there's no simple way to know if what you used means what you
thought it meant.

Also, it's not even used consistently in the patch.  I can see e.g. "ICU
locale" or "ICU locale setting" being used:

+     <varlistentry>
+      <term><replaceable class="parameter">icu_locale</replaceable></term>
+      <listitem>
+       <para>
+        Specifies the ICU locale if the ICU locale provider is used.  If
+        this is not specified, the value from the <literal>LOCALE</literal>
+        option is used.
+       </para>
+      </listitem>
+     </varlistentry>

+      <term><option>--icu-locale=<replaceable class="parameter">locale</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the ICU locale setting to be used in this database, if the
+        ICU locale provider is selected.
+       </para>

Maybe we could point to the ICU documentation for a clear notion of what a
locale ID is, and/or use their own short definition [1]:

> The locale object in ICU is an identifier that specifies a particular locale
> and has fields for language, country, and an optional code to specify further
> variants or subdivisions. These fields also can be represented as a string
> with the fields separated by an underscore.

It seems critical to be crystal clear about what should be specified there.

I spent some time looking at the ICU api trying to figure out if using a
posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
same locale, but I might be wrong.  I also didn't find a way to figure out how
to ask ICU if the locale identifier passed is complete garbage or not.  One
sure thing is that the system collation we import are of the form 'en-us', so
it seems weird to have this form in pg_collation and by default another form in
pg_database.

All in all I'm a bit worried of having the icu_locale optional.  Note that this
is inconsistent with createdb(), as there's at least one code path where the
icu_locale is not optional:

+   if (dblocprovider == COLLPROVIDER_ICU)
+   {
+       /*
+        * This would happen if template0 uses the libc provider but the new
+        * database uses icu.
+        */
+       if (!dbiculocale)
+           ereport(ERROR,
+                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                    errmsg("ICU locale must be specified")));
+   }

>
> One thing that is left that bothers me is that the invocations of
> get_collation_actual_version() have all gotten quite complicated.  I'm
> thinking about ways to refactor that, but I haven't got a great idea yet.

Indeed, and I don't have a great idea either.  Maybe
get_collation_actual_version_extended that would accept both strings?

In CREATE DATABASE manual:

+        Specifies the provider to use for the default collation in this
+        database.  Possible values are:
+        <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
+        <literal>libc</literal>.  <literal>libc</literal> is the default.  The
+        available choices depend on the operating system and build options.

That's actually not true as pg_strcasecmp is used in createdb():

+       if (pg_strcasecmp(locproviderstr, "icu") == 0)
+           dblocprovider = COLLPROVIDER_ICU;
+       else if (pg_strcasecmp(locproviderstr, "libc") == 0)
+           dblocprovider = COLLPROVIDER_LIBC;
+       else
+           ereport(ERROR,
+                   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+                    errmsg("unrecognized locale provider: %s",
+                           locproviderstr)));

By extension that's the same for createdb, but createdb manual doesn't
explicitly mention that case is important so I guess that's ok.

-   To alter the default collation order or character set classes, use the
-   <option>--lc-collate</option> and <option>--lc-ctype</option> options.
-   Collation orders other than <literal>C</literal> or <literal>POSIX</literal> also have
-   a performance penalty.  For these reasons it is important to choose the
-   right locale when running <command>initdb</command>.
+   To choose a different locale for the cluster, use the option
+   <option>--locale</option>.  There are also individual options
+   <option>--lc-*</option> (see below) to set values for the individual locale
+   categories.  Note that inconsistent settings for different locale
+   categories can give nonsensical results, so this should be used with care.

Unless I'm missing something you entirely removed the warninng about the
performance penalty when using non C/POSIX locale?

+   To chose the specific ICU locale ID to apply, use the option
+   <option>--icu-locale</option>.  The ICU locale ID defaults to
+   <option>--locale</option> or the environment, as above (with some name
+   mangling applied to make the locale naming appropriate for ICU).  Note that
+   for implementation reasons and to support legacy code,
+   <command>initdb</command> will still select and initialize libc locale
+   settings when the ICU locale provider is used.

initdb has some specific processing to transform the default libc locale to
something more appropriate, but as far as I can see creatdb / CREATE DATABASE
aren't doing that.  It seems inconsistent, and IMHO another reason why
defaulting to the libc locale looks like a bad idea.

While on that topic, the doc should probably mention that default ICU
collations can only be deterministic.

@@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
                     errmsg("collation \"default\" cannot be copied")));
    }

-   if (localeEl)
-   {
-       collcollate = defGetString(localeEl);
-       collctype = defGetString(localeEl);
-   }
[...]

I tried to read the function and quickly got confused about whether possible
problematic conditions could be reached or not and had protection or not.  I
think that DefineCollation is becoming more and more unreadable, with a mix of
fromEl, !fromEl and either.  Maybe it would be a good time to refactor the code
a bit and have have something like

if (fromEl)
{
    // initialize all specific things
}
else
{
    // initialize all specific things
}

// handle defaults and sanity checks

What do you think?

Also unrelated, but how about raising a warning about possibly hiding
corruption if you use CREATE COLLATION ...  VERSION or CREATE DATABASE ...
COLLATION_VERSION if !IsBinaryUpgrade()?

in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
and pg_database_collation_actual_version():

-   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
-   Assert(!isnull);
-   newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+   datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale
:Anum_pg_collation_collcollate, &isnull);

+   if (!isnull)
+       newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+   else
+       newversion = NULL;

The columns are now nullable, but can you actually end up with a null locale
name in the expected field without manual DML on the catalog, corruption or
similar?  I think it should be a plain error explaining the inconsistency
rather then silently assuming there's no version.  Note that at least
pg_newlocale_from_collation() asserts that the specific libc/icu field it's
interested in isn't null.

+       else if (strcmp(defel->defname, "icu_locale") == 0)
+       {
+           if (diculocale)
+               errorConflictingDefElem(defel, pstate);
+           diculocale = defel;
+       }
+       else if (strcmp(defel->defname, "locale_provider") == 0)
+       {
+           if (dlocprovider)
+               ereport(ERROR,
+                       (errcode(ERRCODE_SYNTAX_ERROR),
+                        errmsg("conflicting or redundant options"),
+                        parser_errposition(pstate, defel->location)));
+           dlocprovider = defel;
+       }

Why not using errorConflictingDefElem for locale_provider?

in createdb():

+   if (dblocprovider == COLLPROVIDER_ICU)
+   {
+       /*
+        * This would happen if template0 uses the libc provider but the new
+        * database uses icu.
+        */
+       if (!dbiculocale)
+           ereport(ERROR,
+                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                    errmsg("ICU locale must be specified")));
+   }
+
+   if (dblocprovider == COLLPROVIDER_ICU)
+   {
+#ifdef USE_ICU
[...]

Seems like a refactoring leftover, no need for two blocks.  Also, I think it
would be better to first error out if there's no support for ICU rather than
complaining about a possibly missing locale that wouldn't be accepted anyway.

+void
+make_icu_collator(const char *iculocstr,
+                 struct pg_locale_struct *resultp)
+{
+#ifdef USE_ICU
+[...]
+   /* We will leak this string if we get an error below :-( */
+   resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
+   resultp->info.icu.ucol = collator;
+#else                          /* not USE_ICU */
+   /* could get here if a collation was created by a build with ICU */
+   ereport(ERROR,
+           (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+            errmsg("ICU is not supported in this build"), \
+            errhint("You need to rebuild PostgreSQL using %s.", "--with-icu")));
+#endif                         /* not USE_ICU */
+}

The comment about the leak possibility needs some tweaking since it's been
extracted from a larger function.

Not material for this patch, but:
@@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid)
    Assert(OidIsValid(collid));

    if (collid == DEFAULT_COLLATION_OID)
-       return (pg_locale_t) 0;
+   {
+       if (default_locale.provider == COLLPROVIDER_ICU)
+           return &default_locale;
+       else
+           return (pg_locale_t) 0;
+   }

I'm wondering if we could now always return &default_locale and avoid having to
check if the function returned something in all callers, since CheckMyDatabase
now initialize it.

@@ -2184,6 +2195,50 @@ setlocales(void)
    check_locale_name(LC_CTYPE, lc_messages, &canonname);
    lc_messages = canonname;
 #endif
+
+   if (locale_provider == COLLPROVIDER_ICU)
+   {
+       if (!icu_locale && locale)
+           icu_locale = locale;
+
+       /*
+        * If ICU is selected but no ICU locale has been given, take the
[...]
+       /*
+        * Check ICU locale name
+        */
+#ifdef USE_ICU
+       {
+           UErrorCode  status;
+
+           status = U_ZERO_ERROR;
+           ucol_open(icu_locale, &status);
+           if (U_FAILURE(status))
+           {
+               pg_log_error("could not open collator for locale \"%s\": %s",
+                            icu_locale, u_errorName(status));
+               exit(1);
+           }
+       }
+#else
+       pg_log_error("ICU is not supported in this build");
+       fprintf(stderr, _("You need to rebuild PostgreSQL using %s.\n"), "--with-icu");
+       exit(1);
+#endif
+   }

Shouldn't all the code before checking the locale name also be in the #ifdef
USE_ICU?  Also, the comment should be consistent with the doc and mention
ICU locale ID.

@@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout)
        appendPQExpBufferStr(creaQry, " ENCODING = ");
        appendStringLiteralAH(creaQry, encoding, fout);
    }
+   if (strlen(datlocprovider) > 0)
+   {
+       appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
+       if (datlocprovider[0] == 'c')
+           appendPQExpBufferStr(creaQry, "libc");
+       else if (datlocprovider[0] == 'i')
+           appendPQExpBufferStr(creaQry, "icu");
+       else
+           fatal("unrecognized locale provider: %s",
+                 datlocprovider);
+   }
    if (strlen(collate) > 0 && strcmp(collate, ctype) == 0)
    {

AFAICS datlocprovider shouldn't be empty.  Maybe raise an error or an assert if
that's the case?

# CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu";
ERROR:  22023: new locale provider (i) does not match locale provider of the template database (c)
HINT:  Use the same locale provider as in the template database, or use template0 as template.

It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the
error message.  I think it would be better to emit the user-facing value, not
internal one.

Finally, there are some changes in pg_collation, I'm assuming it's done for
consistency since pg_database may need two different locale information, but it
should probably be at least mentioned in the commit message.

[1]:
https://unicode-org.github.io/icu/userguide/locale/#:~:text=The%20locale%20object%20in%20ICU,fields%20separated%20by%20an%20underscore.

Re: ICU for global collation

From

Peter Eisentraut

Date:

10 March 2022, 09:52:41

On 05.03.22 09:38, Julien Rouhaud wrote:
> @@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
>                       errmsg("collation \"default\" cannot be copied")));
>      }
> 
> -   if (localeEl)
> -   {
> -       collcollate = defGetString(localeEl);
> -       collctype = defGetString(localeEl);
> -   }
> [...]
> 
> I tried to read the function and quickly got confused about whether possible
> problematic conditions could be reached or not and had protection or not.  I
> think that DefineCollation is becoming more and more unreadable, with a mix of
> fromEl, !fromEl and either.  Maybe it would be a good time to refactor the code
> a bit and have have something like
> 
> if (fromEl)
> {
>      // initialize all specific things
> }
> else
> {
>      // initialize all specific things
> }
> 
> // handle defaults and sanity checks
> 
> What do you think?

How about the attached?

Attachment

0001-DefineCollation-code-cleanup.patch

Re: ICU for global collation

From

Julien Rouhaud

Date:

10 March 2022, 10:27:12

On Thu, Mar 10, 2022 at 10:52:41AM +0100, Peter Eisentraut wrote:
> On 05.03.22 09:38, Julien Rouhaud wrote:
> > @@ -168,18 +175,6 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
> >                       errmsg("collation \"default\" cannot be copied")));
> >      }
> > 
> > -   if (localeEl)
> > -   {
> > -       collcollate = defGetString(localeEl);
> > -       collctype = defGetString(localeEl);
> > -   }
> > [...]
> > 
> > I tried to read the function and quickly got confused about whether possible
> > problematic conditions could be reached or not and had protection or not.  I
> > think that DefineCollation is becoming more and more unreadable, with a mix of
> > fromEl, !fromEl and either.  Maybe it would be a good time to refactor the code
> > a bit and have have something like
> > 
> > if (fromEl)
> > {
> >      // initialize all specific things
> > }
> > else
> > {
> >      // initialize all specific things
> > }
> > 
> > // handle defaults and sanity checks
> > 
> > What do you think?
> 
> How about the attached?

Thanks!  That's exactly what I had in mind.  I think it's easier to follow and
make sure of what it's doing exactly, so +1 for it.

Re: ICU for global collation

From

Peter Eisentraut

Date:

14 March 2022, 12:50:50

On 05.03.22 09:38, Julien Rouhaud wrote:
> I say it works because I did manually check, as far as I can see there isn't
> any test that ensures it.
> 
> I'm using this naive scenario:
> 
> DROP DATABASE IF EXISTS dbicu;
> CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
> \c dbicu
> CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
> CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
> INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
> SELECT def AS def FROM icu ORDER BY def;
> SELECT def AS en FROM icu ORDER BY en;
> SELECT def AS upfirst FROM icu ORDER BY upfirst;
> SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
> SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";
> 
> Maybe there should be some test along those lines included in the patch?

I added something like this to a new test suite under src/test/icu/. 
(src/test/locale/ was already used for something else.)

> Also, it's not even used consistently in the patch.  I can see e.g. "ICU
> locale" or "ICU locale setting" being used:

I have fixed those.

> Maybe we could point to the ICU documentation for a clear notion of what a
> locale ID is, and/or use their own short definition [1]:
> 
>> The locale object in ICU is an identifier that specifies a particular locale
>> and has fields for language, country, and an optional code to specify further
>> variants or subdivisions. These fields also can be represented as a string
>> with the fields separated by an underscore.

I think the Localization chapter needs to be reorganized a bit, but I'll 
leave that for a separate patch.

> I spent some time looking at the ICU api trying to figure out if using a
> posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> same locale, but I might be wrong.  I also didn't find a way to figure out how
> to ask ICU if the locale identifier passed is complete garbage or not.  One
> sure thing is that the system collation we import are of the form 'en-us', so
> it seems weird to have this form in pg_collation and by default another form in
> pg_database.

Yeah it seems to be inconsistent about that.  The locale ID 
documentation appears to indicate that "en_US" is the canonical form, 
but when you ask it to list all the locales it knows about it returns 
"en-US".

> In CREATE DATABASE manual:
> 
> +        Specifies the provider to use for the default collation in this
> +        database.  Possible values are:
> +        <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
> +        <literal>libc</literal>.  <literal>libc</literal> is the default.  The
> +        available choices depend on the operating system and build options.
> 
> That's actually not true as pg_strcasecmp is used in createdb():
> 
> +       if (pg_strcasecmp(locproviderstr, "icu") == 0)
> +           dblocprovider = COLLPROVIDER_ICU;
> +       else if (pg_strcasecmp(locproviderstr, "libc") == 0)
> +           dblocprovider = COLLPROVIDER_LIBC;
> +       else
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> +                    errmsg("unrecognized locale provider: %s",
> +                           locproviderstr)));

I don't understand what the concern is here.

> Unless I'm missing something you entirely removed the warninng about the
> performance penalty when using non C/POSIX locale?

Yeah, I think it is not the job of the initdb man page to lecture people 
about the merits of their locale choice.  The same performance warning 
can also be found in the localization chapter; we don't need to repeat 
it everywhere a locale choice is mentioned.

> initdb has some specific processing to transform the default libc locale to
> something more appropriate, but as far as I can see creatdb / CREATE DATABASE
> aren't doing that.  It seems inconsistent, and IMHO another reason why
> defaulting to the libc locale looks like a bad idea.

This has all been removed.  The separate ICU locale option should now be 
required everywhere (initdb, createdb, CREATE DATABASE).

> While on that topic, the doc should probably mention that default ICU
> collations can only be deterministic.

Well, there is no option to do otherwise, so I'm not sure where/how to 
mention that.  We usually don't document options that don't exist. ;-)

> Also unrelated, but how about raising a warning about possibly hiding
> corruption if you use CREATE COLLATION ...  VERSION or CREATE DATABASE ...
> COLLATION_VERSION if !IsBinaryUpgrade()?

Hmm, there is a point to that.  But then we should just make that an 
error.  Otherwise, we raise the warning but then there is no way to 
"fix" what was warned about.  Seems confusing.

> in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
> and pg_database_collation_actual_version():
> 
> -   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
> -   Assert(!isnull);
> -   newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> +   datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ?
Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull);
 
> +   if (!isnull)
> +       newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> +   else
> +       newversion = NULL;
> 
> The columns are now nullable, but can you actually end up with a null locale
> name in the expected field without manual DML on the catalog, corruption or
> similar?  I think it should be a plain error explaining the inconsistency
> rather then silently assuming there's no version.  Note that at least
> pg_newlocale_from_collation() asserts that the specific libc/icu field it's
> interested in isn't null.

This is required because the default collations's fields are null now.

> Why not using errorConflictingDefElem for locale_provider?

fixed

> in createdb():
> 
> +   if (dblocprovider == COLLPROVIDER_ICU)
> +   {
> +       /*
> +        * This would happen if template0 uses the libc provider but the new
> +        * database uses icu.
> +        */
> +       if (!dbiculocale)
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                    errmsg("ICU locale must be specified")));
> +   }
> +
> +   if (dblocprovider == COLLPROVIDER_ICU)
> +   {
> +#ifdef USE_ICU
> [...]
> 
> Seems like a refactoring leftover, no need for two blocks.

These are two independent blocks in my mind.  It's possible that someone 
might want to insert something in the middle at some point.

> Also, I think it
> would be better to first error out if there's no support for ICU rather than
> complaining about a possibly missing locale that wouldn't be accepted anyway.

Seems better to do general syntax checks first and then deeper checks of 
the passed values.

> +void
> +make_icu_collator(const char *iculocstr,
> +                 struct pg_locale_struct *resultp)
> +{
> +#ifdef USE_ICU
> +[...]
> +   /* We will leak this string if we get an error below :-( */
> +   resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
> +   resultp->info.icu.ucol = collator;
> +#else                          /* not USE_ICU */
> +   /* could get here if a collation was created by a build with ICU */
> +   ereport(ERROR,
> +           (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +            errmsg("ICU is not supported in this build"), \
> +            errhint("You need to rebuild PostgreSQL using %s.", "--with-icu")));
> +#endif                         /* not USE_ICU */
> +}
> 
> The comment about the leak possibility needs some tweaking since it's been
> extracted from a larger function.

fixed

> Not material for this patch, but:
> @@ -1475,7 +1524,12 @@ pg_newlocale_from_collation(Oid collid)
>      Assert(OidIsValid(collid));
> 
>      if (collid == DEFAULT_COLLATION_OID)
> -       return (pg_locale_t) 0;
> +   {
> +       if (default_locale.provider == COLLPROVIDER_ICU)
> +           return &default_locale;
> +       else
> +           return (pg_locale_t) 0;
> +   }
> 
> I'm wondering if we could now always return &default_locale and avoid having to
> check if the function returned something in all callers, since CheckMyDatabase
> now initialize it.

Maybe that's something to look into, but that would probably require 
updating a call callers to handle the return values differently, which 
would require quite a bit of work.

> Shouldn't all the code before checking the locale name also be in the #ifdef
> USE_ICU?

I like to keep the #ifdef sections as small as possible and limited to 
the code that really uses the respective library.

> @@ -2859,6 +2870,17 @@ dumpDatabase(Archive *fout)
>          appendPQExpBufferStr(creaQry, " ENCODING = ");
>          appendStringLiteralAH(creaQry, encoding, fout);
>      }
> +   if (strlen(datlocprovider) > 0)
> +   {
> +       appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
> +       if (datlocprovider[0] == 'c')
> +           appendPQExpBufferStr(creaQry, "libc");
> +       else if (datlocprovider[0] == 'i')
> +           appendPQExpBufferStr(creaQry, "icu");
> +       else
> +           fatal("unrecognized locale provider: %s",
> +                 datlocprovider);
> +   }
>      if (strlen(collate) > 0 && strcmp(collate, ctype) == 0)
>      {
> 
> AFAICS datlocprovider shouldn't be empty.  Maybe raise an error or an assert if
> that's the case?

Yeah that was bogus, copied from the earlier encoding handling, which is 
probably also bogus, but I'm not touching that here.

> # CREATE DATABASE db1 LOCALE_PROVIDER icu ICU_LOCALE "fr-x-icu";
> ERROR:  22023: new locale provider (i) does not match locale provider of the template database (c)
> HINT:  Use the same locale provider as in the template database, or use template0 as template.
> 
> It feels strange to write "LOCALE_PROVIDER icu" and get "provider (i)" in the
> error message.  I think it would be better to emit the user-facing value, not
> internal one.

Fixed.  I had added a collprovider_name() function but didn't use it here.

> Finally, there are some changes in pg_collation, I'm assuming it's done for
> consistency since pg_database may need two different locale information, but it
> should probably be at least mentioned in the commit message.

done

Attachment

v6-0001-Add-option-to-use-ICU-as-global-locale-provider.patch

Re: ICU for global collation

From

Robert Haas

Date:

14 March 2022, 18:57:48

On Mon, Mar 14, 2022 at 8:51 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> [ new patches ]

I'm not very knowledgeable about this topic, but to me, it seems
confusing to think of having both a libc collation and an ICU
collation associated with a database. I have two main questions:

1. What will happen if I set the ICU collation to something that
doesn't match the libc collation? How bad are the consequences?

2. If I want to avoid a mismatch between the two, then I will need a
way to figure out which libc collation corresponds to a given ICU
collation. How do I do that?

I have a sneaking suspicion that I'm not going to like the answer to
question #2 very much, but maybe that's life. I feel like this whole
area is far too full of magical things. True practitioners know that
if you utter some mysterious incantation to the Lords of ICU, you can
get exactly the behavior you want. And ... everyone else has no idea
what to do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: ICU for global collation

From

Julien Rouhaud

Date:

15 March 2022, 07:41:03

On Mon, Mar 14, 2022 at 01:50:50PM +0100, Peter Eisentraut wrote:
> On 05.03.22 09:38, Julien Rouhaud wrote:
> > I say it works because I did manually check, as far as I can see there isn't
> > any test that ensures it.
> >
> > I'm using this naive scenario:
> >
> > DROP DATABASE IF EXISTS dbicu;
> > CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'en_US' ICU_LOCALE 'en-u-kf-upper' template 'template0';
> > \c dbicu
> > CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
> > CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst);
> > INSERT INTO icu VALUES ('a', 'a', 'a'), ('b','b','b'), ('A','A','A'), ('B','B','B');
> > SELECT def AS def FROM icu ORDER BY def;
> > SELECT def AS en FROM icu ORDER BY en;
> > SELECT def AS upfirst FROM icu ORDER BY upfirst;
> > SELECT def AS upfirst_explicit FROM icu ORDER BY en COLLATE upperfirst;
> > SELECT def AS en_x_explicit FROM icu ORDER BY def COLLATE "en-x-icu";
> >
> > Maybe there should be some test along those lines included in the patch?
>
> I added something like this to a new test suite under src/test/icu/.
> (src/test/locale/ was already used for something else.)

Great, thanks!

> > > The locale object in ICU is an identifier that specifies a particular locale
> > > and has fields for language, country, and an optional code to specify further
> > > variants or subdivisions. These fields also can be represented as a string
> > > with the fields separated by an underscore.
>
> I think the Localization chapter needs to be reorganized a bit, but I'll
> leave that for a separate patch.

WFM.

> > I spent some time looking at the ICU api trying to figure out if using a
> > posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > same locale, but I might be wrong.  I also didn't find a way to figure out how
> > to ask ICU if the locale identifier passed is complete garbage or not.  One
> > sure thing is that the system collation we import are of the form 'en-us', so
> > it seems weird to have this form in pg_collation and by default another form in
> > pg_database.
>
> Yeah it seems to be inconsistent about that.  The locale ID documentation
> appears to indicate that "en_US" is the canonical form, but when you ask it
> to list all the locales it knows about it returns "en-US".

Yeah I saw that too when checking is POSIX locale names were valid, and that's
not great.
>
> > In CREATE DATABASE manual:
> >
> > +        Specifies the provider to use for the default collation in this
> > +        database.  Possible values are:
> > +        <literal>icu</literal>,<indexterm><primary>ICU</primary></indexterm>
> > +        <literal>libc</literal>.  <literal>libc</literal> is the default.  The
> > +        available choices depend on the operating system and build options.
> >
> > That's actually not true as pg_strcasecmp is used in createdb():
>
> I don't understand what the concern is here.

Ah sorry I missed the <indexterm>, I thought the possible values listed were
icu, ICU and libc.  Shouldn't the ICU <indexterm> be before the icu <literal>
rather than the libc, or at least before the comma?
>
> Yeah, I think it is not the job of the initdb man page to lecture people
> about the merits of their locale choice.  The same performance warning can
> also be found in the localization chapter; we don't need to repeat it
> everywhere a locale choice is mentioned.

Ah, I tried to look for another place where the same warning could be found but
missed it.  I'm fine with it then!

> > While on that topic, the doc should probably mention that default ICU
> > collations can only be deterministic.
>
> Well, there is no option to do otherwise, so I'm not sure where/how to
> mention that.  We usually don't document options that don't exist. ;-)

Sure, but I'm afraid that users may still be tempted to use ICU locales like
und-u-ks-level2 from the case_insensitive example in the doc and hope that it
will work accordingly.  Or maybe it's just me that still sees ICU as dark magic
and want to be extra cautious.

Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and
lc_collate.  Should we add one for icu_locale too?

> > Also unrelated, but how about raising a warning about possibly hiding
> > corruption if you use CREATE COLLATION ...  VERSION or CREATE DATABASE ...
> > COLLATION_VERSION if !IsBinaryUpgrade()?
>
> Hmm, there is a point to that.  But then we should just make that an error.
> Otherwise, we raise the warning but then there is no way to "fix" what was
> warned about.  Seems confusing.

I was afraid that an error was a bit too much, but if that's acceptable I agree
that it's better.

> > in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
> > and pg_database_collation_actual_version():
> >
> > -   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
> > -   Assert(!isnull);
> > -   newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> > +   datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ?
Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull);
 
> > +   if (!isnull)
> > +       newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
> > +   else
> > +       newversion = NULL;
> >
> > The columns are now nullable, but can you actually end up with a null locale
> > name in the expected field without manual DML on the catalog, corruption or
> > similar?  I think it should be a plain error explaining the inconsistency
> > rather then silently assuming there's no version.  Note that at least
> > pg_newlocale_from_collation() asserts that the specific libc/icu field it's
> > interested in isn't null.
>
> This is required because the default collations's fields are null now.

Yes I saw that, but that's a specific exception.  Detecting whether it's the
DEFAULT_COLLATION_OID or not and raise an error when a null value isn't
expected seems like it could be worthwhile.

> > I'm wondering if we could now always return &default_locale and avoid having to
> > check if the function returned something in all callers, since CheckMyDatabase
> > now initialize it.
>
> Maybe that's something to look into, but that would probably require
> updating a call callers to handle the return values differently, which would
> require quite a bit of work.

Agreed, I just wanted to mention it before forgetting about it.

So apart from the few details mentioned above I'm happy with this patch!

There might still be some more adjustments needed  based on Robert's comment.
At the very least it doesn't seem unreasonable to forbid a default ICU
collation with a C/POSIX lc_ctype for instance.

Re: ICU for global collation

From

Peter Eisentraut

Date:

15 March 2022, 16:58:17

On 14.03.22 19:57, Robert Haas wrote:
> 1. What will happen if I set the ICU collation to something that
> doesn't match the libc collation? How bad are the consequences?

These are unrelated, so there are no consequences.

> 2. If I want to avoid a mismatch between the two, then I will need a
> way to figure out which libc collation corresponds to a given ICU
> collation. How do I do that?

You can specify the same name for both.

Re: ICU for global collation

From

Robert Haas

Date:

15 March 2022, 17:28:01

On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 14.03.22 19:57, Robert Haas wrote:
> > 1. What will happen if I set the ICU collation to something that
> > doesn't match the libc collation? How bad are the consequences?
>
> These are unrelated, so there are no consequences.

Can you please elaborate on this?

> > 2. If I want to avoid a mismatch between the two, then I will need a
> > way to figure out which libc collation corresponds to a given ICU
> > collation. How do I do that?
>
> You can specify the same name for both.

Hmm. If every name were valid in both systems, I don't think you'd be
proposing two fields.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: ICU for global collation

From

"Finnerty, Jim"

Date:

15 March 2022, 19:47:09

Can we get some more consistent terminology around the term "locale"?

In ICU, the "locale" is just the first part of what we can pass to the "locale" parameter in CREATE COLLATION - the
partbefore the optional '@' delimiter.  The ICU locale does not include the secondary or tertiary properties, so it is
usuallyjust the country and the language, e.g. en_US (or en-US), but it can also be something like es_TRADITIONAL for
traditionalSpanish.  
 

I think it would be an improvement in clarity if we consistently use the term 'locale' to mean the same thing that ICU
meansby that term, and not to have the thing that we call the "locale" also include collation modifiers, or to imply
thata locale is the same thing as a collation.
 

    /Jim

Re: ICU for global collation

From

"Daniel Verite"

Date:

15 March 2022, 19:48:00

    Julien Rouhaud wrote:

> > > While on that topic, the doc should probably mention that default ICU
> > > collations can only be deterministic.
> >
> > Well, there is no option to do otherwise, so I'm not sure where/how to
> > mention that.  We usually don't document options that don't exist. ;-)
>
> Sure, but I'm afraid that users may still be tempted to use ICU locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope that
> it will work accordingly.

+1.

The CREATE DATABASE doc says this currently:

icu_locale

    Specifies the ICU locale ID if the ICU locale provider is used.

ISTM that we need to say explicitly that this locale will be used by
default to compare all collatable strings, except that it's overruled
by a bytewise comparison to break ties in case of equality.

The idea is to describe what the backend will do with the setting
rather than saying that we don't have a nondeterministic option.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: ICU for global collation

From

"Daniel Verite"

Date:

15 March 2022, 20:19:41

    Finnerty, Jim wrote:

> In ICU, the "locale" is just the first part of what we can pass to the
> "locale" parameter in CREATE COLLATION - the part before the optional '@'
> delimiter.  The ICU locale does not include the secondary or tertiary
> properties,

Why not? Please see
https://unicode-org.github.io/icu/userguide/locale

It says that "a locale consists of one or more pieces of ordered
information", the pieces being a language code, a script code, a
country code, a variant code, and keywords.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: ICU for global collation

From

Peter Eisentraut

Date:

16 March 2022, 14:25:09

On 15.03.22 18:28, Robert Haas wrote:
> On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> On 14.03.22 19:57, Robert Haas wrote:
>>> 1. What will happen if I set the ICU collation to something that
>>> doesn't match the libc collation? How bad are the consequences?
>>
>> These are unrelated, so there are no consequences.
> 
> Can you please elaborate on this?

The code that is aware of ICU generally works like this:

if (locale_provider == ICU)
   result = call ICU code
else
   result = call libc code
return result

However, there is code out there, both within PostgreSQL itself and in 
extensions, that does not do that yet.  Ideally, we would eventually 
change all that over, but it's not happening now.  So we ought to 
preserve the ability to set the libc to keep that legacy code working 
for now.

This legacy code by definition doesn't know about ICU, so it doesn't 
care whether the ICU setting "matches" the libc setting or anything like 
that.  It will just do its thing depending on its own setting.

The only consequence of settings that don't match is that the different 
pieces of code behave semantically inconsistently (e.g., some routine 
thinks the data is Greek and other code thinks the data is French).  But 
that's up to the user to set correctly.  And the actual scenarios where 
you can actually do anything semantically relevant this way are very 
limited.

A second point is that the LC_CTYPE setting tells other parts of libc 
what the current encoding is.  This affects gettext for example.  So you 
need to set this to something sensible even if you don't use libc locale 
routines otherwise.

>>> 2. If I want to avoid a mismatch between the two, then I will need a
>>> way to figure out which libc collation corresponds to a given ICU
>>> collation. How do I do that?
>>
>> You can specify the same name for both.
> 
> Hmm. If every name were valid in both systems, I don't think you'd be
> proposing two fields.

Earlier versions of this patch and predecessor patches indeed had common 
fields.  But in fact the two systems accept different values if you want 
to delve into the advanced features.  But for basic usage something like 
"en_US" will work for both.

Re: ICU for global collation

From

Peter Eisentraut

Date:

17 March 2022, 10:28:39

On 15.03.22 08:41, Julien Rouhaud wrote:
>>>> The locale object in ICU is an identifier that specifies a particular locale
>>>> and has fields for language, country, and an optional code to specify further
>>>> variants or subdivisions. These fields also can be represented as a string
>>>> with the fields separated by an underscore.
>>
>> I think the Localization chapter needs to be reorganized a bit, but I'll
>> leave that for a separate patch.
> 
> WFM.

I ended up writing a bit of content for that chapter.

>>> While on that topic, the doc should probably mention that default ICU
>>> collations can only be deterministic.
>>
>> Well, there is no option to do otherwise, so I'm not sure where/how to
>> mention that.  We usually don't document options that don't exist. ;-)
> 
> Sure, but I'm afraid that users may still be tempted to use ICU locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope that it
> will work accordingly.  Or maybe it's just me that still sees ICU as dark magic
> and want to be extra cautious.

I have added this to the CREATE DATABASE ref page.

> Unrelated, but I just realized that we have PGC_INTERNAL gucs for lc_ctype and
> lc_collate.  Should we add one for icu_locale too?

I'm not sure.  I think the existing ones are more for backward 
compatibility with the time before it was settable per database.

>>> in AlterCollation(), pg_collation_actual_version(), AlterDatabaseRefreshColl()
>>> and pg_database_collation_actual_version():
>>>
>>> -   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
>>> -   Assert(!isnull);
>>> -   newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> +   datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ?
Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull);
 
>>> +   if (!isnull)
>>> +       newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> +   else
>>> +       newversion = NULL;
>>>
>>> The columns are now nullable, but can you actually end up with a null locale
>>> name in the expected field without manual DML on the catalog, corruption or
>>> similar?  I think it should be a plain error explaining the inconsistency
>>> rather then silently assuming there's no version.  Note that at least
>>> pg_newlocale_from_collation() asserts that the specific libc/icu field it's
>>> interested in isn't null.
>>
>> This is required because the default collations's fields are null now.
> 
> Yes I saw that, but that's a specific exception.  Detecting whether it's the
> DEFAULT_COLLATION_OID or not and raise an error when a null value isn't
> expected seems like it could be worthwhile.

I have fixed that as you suggest.

> So apart from the few details mentioned above I'm happy with this patch!

committed!

RE: ICU for global collation

From

"Shinoda, Noriyoshi (PN Japan FSIP)"

Date:

17 March 2022, 12:01:14

Hi, 
Thank you to all the developers.
I found that the description of the pg_database.daticulocale column was not written in the documentation. 
The attached small patch adds a description of the daticulocale column to catalogs.sgml.

Regards,
Noriyoshi Shinoda
-----Original Message-----
From: Peter Eisentraut <peter.eisentraut@enterprisedb.com> 
Sent: Thursday, March 17, 2022 7:29 PM
To: Julien Rouhaud <rjuju123@gmail.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; Daniel Verite <daniel@manitou-mail.org>
Subject: Re: ICU for global collation

On 15.03.22 08:41, Julien Rouhaud wrote:
>>>> The locale object in ICU is an identifier that specifies a 
>>>> particular locale and has fields for language, country, and an 
>>>> optional code to specify further variants or subdivisions. These 
>>>> fields also can be represented as a string with the fields separated by an underscore.
>>
>> I think the Localization chapter needs to be reorganized a bit, but 
>> I'll leave that for a separate patch.
> 
> WFM.

I ended up writing a bit of content for that chapter.

>>> While on that topic, the doc should probably mention that default 
>>> ICU collations can only be deterministic.
>>
>> Well, there is no option to do otherwise, so I'm not sure where/how 
>> to mention that.  We usually don't document options that don't exist. 
>> ;-)
> 
> Sure, but I'm afraid that users may still be tempted to use ICU 
> locales like
> und-u-ks-level2 from the case_insensitive example in the doc and hope 
> that it will work accordingly.  Or maybe it's just me that still sees 
> ICU as dark magic and want to be extra cautious.

I have added this to the CREATE DATABASE ref page.

> Unrelated, but I just realized that we have PGC_INTERNAL gucs for 
> lc_ctype and lc_collate.  Should we add one for icu_locale too?

I'm not sure.  I think the existing ones are more for backward compatibility with the time before it was settable per
database.

>>> in AlterCollation(), pg_collation_actual_version(), 
>>> AlterDatabaseRefreshColl() and pg_database_collation_actual_version():
>>>
>>> -   datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collcollate, &isnull);
>>> -   Assert(!isnull);
>>> -   newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> +   datum = SysCacheGetAttr(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ?
Anum_pg_collation_colliculocale: Anum_pg_collation_collcollate, &isnull);
 
>>> +   if (!isnull)
>>> +       newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
>>> +   else
>>> +       newversion = NULL;
>>>
>>> The columns are now nullable, but can you actually end up with a 
>>> null locale name in the expected field without manual DML on the 
>>> catalog, corruption or similar?  I think it should be a plain error 
>>> explaining the inconsistency rather then silently assuming there's 
>>> no version.  Note that at least
>>> pg_newlocale_from_collation() asserts that the specific libc/icu 
>>> field it's interested in isn't null.
>>
>> This is required because the default collations's fields are null now.
> 
> Yes I saw that, but that's a specific exception.  Detecting whether 
> it's the DEFAULT_COLLATION_OID or not and raise an error when a null 
> value isn't expected seems like it could be worthwhile.

I have fixed that as you suggest.

> So apart from the few details mentioned above I'm happy with this patch!

committed!

Attachment

pg_database_iculocale_v1.diff

Re: ICU for global collation

From

Peter Eisentraut

Date:

17 March 2022, 13:14:52

On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> Thank you to all the developers.
> I found that the description of the pg_database.daticulocale column was not written in the documentation.
> The attached small patch adds a description of the daticulocale column to catalogs.sgml.

committed, thanks

Re: ICU for global collation

From

Peter Geoghegan

Date:

17 March 2022, 19:34:50

On Thu, Mar 17, 2022 at 6:15 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> committed, thanks

Glad that this finally happened. Thanks to everybody involved!

-- 
Peter Geoghegan

Re: ICU for global collation

From

Andres Freund

Date:

18 March 2022, 00:01:40

Hi,

On 2022-03-17 14:14:52 +0100, Peter Eisentraut wrote:
> On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > Thank you to all the developers.
> > I found that the description of the pg_database.daticulocale column was not written in the documentation.
> > The attached small patch adds a description of the daticulocale column to catalogs.sgml.
> 
> committed, thanks

Wee! That's a long time weakness addressed...

Just saw a weird failure after rebasing my meson branch ontop of this. Tests
passed on debian, suse, centos 8 stream, fedora rawhide (failed due to an
independent reason), but not on centos 7.

all runs: https://cirrus-ci.com/build/5190538184884224
centos 7: https://cirrus-ci.com/task/4786632883699712?logs=tests_world#L204
centos 7 failure:
https://api.cirrus-ci.com/v1/artifact/task/4786632883699712/log/build/testrun/icu/t/010_database/log/regress_log_010_database

not ok 1 - sort by database default locale
#   Failed test 'sort by database default locale'
#   at /tmp/cirrus-ci-build/src/test/icu/t/010_database.pl line 28.
#          got: 'a
# A
# b
# B'
#     expected: 'A
# a
# B
# b'
ok 2 - sort by explicit collation standard
not ok 3 - sort by explicit collation upper first
#   Failed test 'sort by explicit collation upper first'
#   at /tmp/cirrus-ci-build/src/test/icu/t/010_database.pl line 42.
#          got: 'a
# A
# b
# B'
#     expected: 'A
# a
# B
# b'
ok 4 - ICU locale must be specified for ICU provider: exit code not 0
ok 5 - ICU locale must be specified for ICU provider: error message
1..5

This is a run building with meson. But I've now triggered builds with autoconf
on centos 7 as well and that also failed. See
https://cirrus-ci.com/task/6194007767252992?logs=test_world#L378

So it looks like older ICU versions don't work?

Greetings,

Andres Freund

PS: I had not yet passed with_icu in the initdb tests for meson, that's why
there's two failures with autoconf but only one with meson.

Re: ICU for global collation

From

Julien Rouhaud

Date:

18 March 2022, 06:52:17

On Thu, Mar 17, 2022 at 02:14:52PM +0100, Peter Eisentraut wrote:
> On 17.03.22 13:01, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > Thank you to all the developers.
> > I found that the description of the pg_database.daticulocale column was not written in the documentation.
> > The attached small patch adds a description of the daticulocale column to catalogs.sgml.
>
> committed, thanks

Thanks a lot both!  Glad to finally have that feature, as soon as we'll fix
the few reported problems.

Re: ICU for global collation

From

Andres Freund

Date:

19 March 2022, 01:47:07

Hi,

On 2022-03-17 14:14:52 +0100, Peter Eisentraut wrote:
> committed, thanks

Just noticed that this adds a new warning when building with -O3:

In file included from /home/andres/src/postgresql/src/include/catalog/pg_collation.h:22,
                 from /home/andres/src/postgresql/src/backend/commands/dbcommands.c:39:
In function ‘collprovider_name’,
    inlined from ‘createdb’ at /home/andres/src/postgresql/src/backend/commands/dbcommands.c:514:4:
../../../src/include/catalog/pg_collation_d.h:47:9: warning: ‘src_locprovider’ may be used uninitialized
[-Wmaybe-uninitialized]
   47 |         switch (c)
      |         ^~~~~~
/home/andres/src/postgresql/src/backend/commands/dbcommands.c: In function ‘createdb’:
/home/andres/src/postgresql/src/backend/commands/dbcommands.c:112:25: note: ‘src_locprovider’ was declared here
  112 |         char            src_locprovider;
      |                         ^~~~~~~~~~~~~~~

I'd fixed that for nearby variables in 3f6b3be39ca9... Gonna just NULL
initialize it as well.

Greetings,

Andres Freund

Re: ICU for global collation

From

Justin Pryzby

Date:

25 June 2022, 15:19:30

commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places,
but in pg_upgrade says <=1500, which looks wrong for upgrades from v15.
I think it should say <= 1400.

On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote:
> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -2792,6 +2794,10 @@ dumpDatabase(Archive *fout)
>          appendPQExpBuffer(dbQry, "datminmxid, ");
>      else
>          appendPQExpBuffer(dbQry, "0 AS datminmxid, ");
> +    if (fout->remoteVersion >= 150000)
> +        appendPQExpBuffer(dbQry, "datcollprovider, ");
> +    else
> +        appendPQExpBuffer(dbQry, "'c' AS datcollprovider, ");
>      appendPQExpBuffer(dbQry,
>                        "(SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, "
>                        "shobj_description(oid, 'pg_database') AS description "

> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
> index 69ef23119f..2a9ca0e389 100644
> --- a/src/bin/pg_upgrade/info.c
> +++ b/src/bin/pg_upgrade/info.c
> @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster)
>                  i_spclocation;
>      char        query[QUERY_ALLOC];
>  
>      snprintf(query, sizeof(query),
> -             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "
> +             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
> +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
> +                 "'c' AS datcollprovider, NULL AS daticucoll, ");
> +    else
> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
> +                 "datcollprovider, daticucoll, ");
> +    snprintf(query + strlen(query), sizeof(query) - strlen(query),
>               "pg_catalog.pg_tablespace_location(t.oid) AS spclocation "
>               "FROM pg_catalog.pg_database d "
>               " LEFT OUTER JOIN pg_catalog.pg_tablespace t "

> --- a/src/bin/psql/describe.c
> +++ b/src/bin/psql/describe.c
> @@ -896,6 +896,18 @@ listAllDbs(const char *pattern, bool verbose)
>                        gettext_noop("Encoding"),
>                        gettext_noop("Collate"),
>                        gettext_noop("Ctype"));
> +    if (pset.sversion >= 150000)
> +        appendPQExpBuffer(&buf,
> +                          "       d.daticucoll as \"%s\",\n"
> +                          "       CASE d.datcollprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS
\"%s\",\n",
> +                          gettext_noop("ICU Collation"),
> +                          gettext_noop("Coll. Provider"));
> +    else
> +        appendPQExpBuffer(&buf,
> +                          "       d.datcollate as \"%s\",\n"
> +                          "       'libc' AS \"%s\",\n",
> +                          gettext_noop("ICU Collation"),
> +                          gettext_noop("Coll. Provider"));
>      appendPQExpBufferStr(&buf, "       ");
>      printACLColumn(&buf, "d.datacl");
>      if (verbose)

> @@ -4617,6 +4629,15 @@ listCollations(const char *pattern, bool verbose, bool showSystem)
>                        gettext_noop("Collate"),
>                        gettext_noop("Ctype"));
>  
> +    if (pset.sversion >= 150000)
> +        appendPQExpBuffer(&buf,
> +                          ",\n       c.collicucoll AS \"%s\"",
> +                          gettext_noop("ICU Collation"));
> +    else
> +        appendPQExpBuffer(&buf,
> +                          ",\n       c.collcollate AS \"%s\"",
> +                          gettext_noop("ICU Collation"));
> +
>      if (pset.sversion >= 100000)
>          appendPQExpBuffer(&buf,
>                            ",\n       CASE c.collprovider WHEN 'd' THEN 'default' WHEN 'c' THEN 'libc' WHEN 'i' THEN
'icu'END AS \"%s\"",

Re: ICU for global collation

From

Julien Rouhaud

Date:

26 June 2022, 03:51:24

Hi,

On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
> commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places,
> but in pg_upgrade says <=1500, which looks wrong for upgrades from v15.
> I think it should say <= 1400.
> 
> On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote:
> > diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
> > index 69ef23119f..2a9ca0e389 100644
> > --- a/src/bin/pg_upgrade/info.c
> > +++ b/src/bin/pg_upgrade/info.c
> > @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster)
> >                  i_spclocation;
> >      char        query[QUERY_ALLOC];
> >  
> >      snprintf(query, sizeof(query),
> > -             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "
> > +             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
> > +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
> > +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
> > +                 "'c' AS datcollprovider, NULL AS daticucoll, ");
> > +    else
> > +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
> > +                 "datcollprovider, daticucoll, ");
> > +    snprintf(query + strlen(query), sizeof(query) - strlen(query),
> >               "pg_catalog.pg_tablespace_location(t.oid) AS spclocation "
> >               "FROM pg_catalog.pg_database d "
> >               " LEFT OUTER JOIN pg_catalog.pg_tablespace t "

Indeed!

Re: ICU for global collation

From

Michael Paquier

Date:

26 June 2022, 07:20:34

On Sun, Jun 26, 2022 at 11:51:24AM +0800, Julien Rouhaud wrote:
> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
>>> +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
>>> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>> +                 "'c' AS datcollprovider, NULL AS daticucoll, ");
>>> +    else
>>> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>> +                 "datcollprovider, daticucoll, ");
>>> +    snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>>               "pg_catalog.pg_tablespace_location(t.oid) AS spclocation "
>>>               "FROM pg_catalog.pg_database d "
>>>               " LEFT OUTER JOIN pg_catalog.pg_tablespace t "
>
> Indeed!

Oops.  Beta2 tagging is very close by, so I think that it would be
better to not take a risk on that now, and this is an issue only when
upgrading from v15 where datcollprovider is ICU for a database.
As things stand, someone using beta1 with this new feature, running
pg_upgrade to beta2 would lose any non-libc locale provider set.
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Peter Eisentraut

Date:

27 June 2022, 06:23:59

On 26.06.22 05:51, Julien Rouhaud wrote:
> Hi,
> 
> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
>> commit f2553d43060edb210b36c63187d52a632448e1d2 says >=1500 in a few places,
>> but in pg_upgrade says <=1500, which looks wrong for upgrades from v15.
>> I think it should say <= 1400.
>>
>> On Wed, Feb 02, 2022 at 02:01:23PM +0100, Peter Eisentraut wrote:
>>> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
>>> index 69ef23119f..2a9ca0e389 100644
>>> --- a/src/bin/pg_upgrade/info.c
>>> +++ b/src/bin/pg_upgrade/info.c
>>> @@ -312,11 +312,20 @@ get_db_infos(ClusterInfo *cluster)
>>>                   i_spclocation;
>>>       char        query[QUERY_ALLOC];
>>>   
>>>       snprintf(query, sizeof(query),
>>> -             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, "
>>> +             "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
>>> +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)

I think the fix here is to change <= to < ?

>>> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>> +                 "'c' AS datcollprovider, NULL AS daticucoll, ");
>>> +    else
>>> +        snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>> +                 "datcollprovider, daticucoll, ");
>>> +    snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>>                "pg_catalog.pg_tablespace_location(t.oid) AS spclocation "
>>>                "FROM pg_catalog.pg_database d "
>>>                " LEFT OUTER JOIN pg_catalog.pg_tablespace t "
> 
> Indeed!
> 
>

Re: ICU for global collation

From

Michael Paquier

Date:

27 June 2022, 06:42:19

On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote:
> On 26.06.22 05:51, Julien Rouhaud wrote:
>> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
>>>> +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
>
> I think the fix here is to change <= to < ?

Yes.
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Peter Eisentraut

Date:

27 June 2022, 07:10:29

On 27.06.22 08:42, Michael Paquier wrote:
> On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote:
>> On 26.06.22 05:51, Julien Rouhaud wrote:
>>> On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
>>>>> +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
>>
>> I think the fix here is to change <= to < ?
> 
> Yes.

Ok, committed.

(I see now that in the context of pg_upgrade, writing <= 1400 is 
equivalent, but I find that confusing, so I did < 1500.)

Re: ICU for global collation

From

Justin Pryzby

Date:

28 June 2022, 01:32:05

On Mon, Jun 27, 2022 at 09:10:29AM +0200, Peter Eisentraut wrote:
> On 27.06.22 08:42, Michael Paquier wrote:
> > On Mon, Jun 27, 2022 at 08:23:59AM +0200, Peter Eisentraut wrote:
> > > On 26.06.22 05:51, Julien Rouhaud wrote:
> > > > On Sat, Jun 25, 2022 at 10:19:30AM -0500, Justin Pryzby wrote:
> > > > > > +    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
> > > 
> > > I think the fix here is to change <= to < ?
> > 
> > Yes.
> 
> Ok, committed.
> 
> (I see now that in the context of pg_upgrade, writing <= 1400 is equivalent,
> but I find that confusing, so I did < 1500.)

I suggested using <= 1400 for consistency with the other code, and per
bc1fbc960.  But YMMV.

-- 
Justin

Re: ICU for global collation

From

Marina Polyakova

Date:

15 August 2022, 12:06:32

Hello everyone in this thread!

While reading and testing the patch that adds ICU for global collations 
[1] I noticed on master (1c5818b9c68e5c2ac8f19d372f24cce409de1a26) and 
REL_15_STABLE (63b64d8270691894a9a8f2d4e929e7780020edb8) that:

1) pg_upgrade from REL_14_STABLE 
(63b64d8270691894a9a8f2d4e929e7780020edb8) does not always work:

For REL_14_STABLE:

$ initdb -D data_old

For REL_15_STABLE or master:

$ initdb -D data_new --locale-provider icu --icu-locale ru-RU
$ pg_upgrade -d .../data_old -D data_new -b ... -B ...
...
Restoring database schemas in the new cluster
   template1
*failure*

Consult the last few lines of 
"data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_dump_1.log" 
for
the probable cause of the failure.
Failure, exiting

In 
data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_dump_1.log:

pg_restore: error: could not execute query: server closed the connection 
unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
Command was: CREATE DATABASE "template1" WITH TEMPLATE = template0 OID = 
1 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';

In 
data_new/pg_upgrade_output.d/20220815T142454.223/log/pg_upgrade_server.log:

TRAP: FailedAssertion("(dblocprovider == COLLPROVIDER_ICU && 
dbiculocale) || (dblocprovider != COLLPROVIDER_ICU && !dbiculocale)", 
File: "dbcommands.c", Line: 1292, PID: 69247)
postgres: marina postgres [local] CREATE 
DATABASE(ExceptionalCondition+0xb9)[0xb4d8ec]
postgres: marina postgres [local] CREATE 
DATABASE(createdb+0x1abc)[0x68ca99]
postgres: marina postgres [local] CREATE 
DATABASE(standard_ProcessUtility+0x651)[0x9b1d82]
postgres: marina postgres [local] CREATE 
DATABASE(ProcessUtility+0x122)[0x9b172a]
postgres: marina postgres [local] CREATE DATABASE[0x9b01cf]
postgres: marina postgres [local] CREATE DATABASE[0x9b0433]
postgres: marina postgres [local] CREATE 
DATABASE(PortalRun+0x2fe)[0x9af95d]
postgres: marina postgres [local] CREATE DATABASE[0x9a953b]
postgres: marina postgres [local] CREATE 
DATABASE(PostgresMain+0x733)[0x9ada6b]
postgres: marina postgres [local] CREATE DATABASE[0x8ec632]
postgres: marina postgres [local] CREATE DATABASE[0x8ebfbb]
postgres: marina postgres [local] CREATE DATABASE[0x8e8653]
postgres: marina postgres [local] CREATE 
DATABASE(PostmasterMain+0x1226)[0x8e7f26]
postgres: marina postgres [local] CREATE DATABASE[0x7bbccb]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7eff082f90b3]
postgres: marina postgres [local] CREATE DATABASE(_start+0x2e)[0x49c29e]
2022-08-15 14:24:56.124 MSK [69231] LOG:  server process (PID 69247) was 
terminated by signal 6: Aborted
2022-08-15 14:24:56.124 MSK [69231] DETAIL:  Failed process was running: 
CREATE DATABASE "template1" WITH TEMPLATE = template0 OID = 1 ENCODING = 
'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';

1.1) It looks like there's a bug in the function get_db_infos 
(src/bin/pg_upgrade/info.c), where the version of the old cluster is 
always checked:

if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
    snprintf(query + strlen(query), sizeof(query) - strlen(query),
             "'c' AS datlocprovider, NULL AS daticulocale, ");
else
    snprintf(query + strlen(query), sizeof(query) - strlen(query),
             "datlocprovider, daticulocale, ");

With the simple patch

diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 
df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 
100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster)

      snprintf(query, sizeof(query),
               "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
-    if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
+    if (GET_MAJOR_VERSION(cluster->major_version) < 1500)
          snprintf(query + strlen(query), sizeof(query) - strlen(query),
                   "'c' AS datlocprovider, NULL AS daticulocale, ");
      else

I got the expected error during the upgrade:

locale providers for database "template1" do not match:  old "libc", new 
"icu"
Failure, exiting

1.2) It looks like the mentioned asserion in dbcommands.c conflicts with 
the following lines earlier:

if (dbiculocale == NULL)
    dbiculocale = src_iculocale;

The following patch works for me:

diff --git a/src/backend/commands/dbcommands.c 
b/src/backend/commands/dbcommands.c
index 
b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9 
100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt 
*stmt)
                      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                       errmsg("ICU locale must be specified")));
      }
+    else
+        dbiculocale = NULL;

      if (dblocprovider == COLLPROVIDER_ICU)
          check_icu_locale(dbiculocale);

2) CREATE DATABASE does not always require the icu locale unlike initdb 
and createdb:

$ initdb -D data --locale en_US.UTF-8 --locale-provider icu
...
initdb: error: ICU locale must be specified

$ initdb -D data --locale en_US.UTF-8
$ pg_ctl -D data -l logfile start

$ createdb mydb --locale en_US.UTF-8 --template template0 
--locale-provider icu
createdb: error: database creation failed: ERROR:  ICU locale must be 
specified

$ psql -c "CREATE DATABASE mydb LOCALE \"en_US.UTF-8\" TEMPLATE 
template0 LOCALE_PROVIDER icu" postgres
CREATE DATABASE

$ psql -c "CREATE DATABASE mydb TEMPLATE template0 LOCALE_PROVIDER icu" 
postgres
ERROR:  ICU locale must be specified

I'm wondering if this is not a fully-supported feature (because createdb 
creates an SQL command with LC_COLLATE and LC_CTYPE options instead of 
LOCALE option) or is it a bug in CREATE DATABASE?.. From 
src/backend/commands/dbcommands.c:

if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale)
{
    if (dlocale && dlocale->arg)
        dbiculocale = defGetString(dlocale);
}

[1] 
https://github.com/postgres/postgres/commit/f2553d43060edb210b36c63187d52a632448e1d2

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Julien Rouhaud

Date:

17 August 2022, 16:53:59

Hi,

On Mon, Aug 15, 2022 at 03:06:32PM +0300, Marina Polyakova wrote:
>
> 1.1) It looks like there's a bug in the function get_db_infos
> (src/bin/pg_upgrade/info.c), where the version of the old cluster is always
> checked:
>
> if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
>     snprintf(query + strlen(query), sizeof(query) - strlen(query),
>              "'c' AS datlocprovider, NULL AS daticulocale, ");
> else
>     snprintf(query + strlen(query), sizeof(query) - strlen(query),
>              "datlocprovider, daticulocale, ");
>
> With the simple patch
>
> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
> index df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87
> 100644
> --- a/src/bin/pg_upgrade/info.c
> +++ b/src/bin/pg_upgrade/info.c
> @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster)
>
>      snprintf(query, sizeof(query),
>               "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
> -    if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
> +    if (GET_MAJOR_VERSION(cluster->major_version) < 1500)
>          snprintf(query + strlen(query), sizeof(query) - strlen(query),
>                   "'c' AS datlocprovider, NULL AS daticulocale, ");
>      else
>
> I got the expected error during the upgrade:
>
> locale providers for database "template1" do not match:  old "libc", new
> "icu"
> Failure, exiting

Good catch.  There's unfortunately not a lot of regression tests for
ICU-initialized clusters.  I'm wondering if the build-farm client could be
taught about the locale provider rather than assuming libc.  Clearly that
wouldn't have caught this issue, but it should still increase the coverage a
bit (I'm thinking of the recent problem with the abbreviated keys).

> 1.2) It looks like the mentioned asserion in dbcommands.c conflicts with the
> following lines earlier:
>
> if (dbiculocale == NULL)
>     dbiculocale = src_iculocale;
>
> The following patch works for me:
>
> diff --git a/src/backend/commands/dbcommands.c
> b/src/backend/commands/dbcommands.c
> index b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9
> 100644
> --- a/src/backend/commands/dbcommands.c
> +++ b/src/backend/commands/dbcommands.c
> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>                      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                       errmsg("ICU locale must be specified")));
>      }
> +    else
> +        dbiculocale = NULL;
>
>      if (dblocprovider == COLLPROVIDER_ICU)
>          check_icu_locale(dbiculocale);

I think it would be better to do that in the various variable initialization.
Maybe switch the dblocprovider and dbiculocale initialization, and only
initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU.

> 2) CREATE DATABASE does not always require the icu locale unlike initdb and
> createdb:
>
> $ createdb mydb --locale en_US.UTF-8 --template template0 --locale-provider
> icu
> createdb: error: database creation failed: ERROR:  ICU locale must be
> specified
>
> $ psql -c "CREATE DATABASE mydb LOCALE \"en_US.UTF-8\" TEMPLATE template0
> LOCALE_PROVIDER icu" postgres
> CREATE DATABASE
>
> I'm wondering if this is not a fully-supported feature (because createdb
> creates an SQL command with LC_COLLATE and LC_CTYPE options instead of
> LOCALE option) or is it a bug in CREATE DATABASE?.. From
> src/backend/commands/dbcommands.c:
>
> if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale)
> {
>     if (dlocale && dlocale->arg)
>         dbiculocale = defGetString(dlocale);
> }

This discrepancy between createdb and CREATE DATABASE looks like an oversight,
as createdb indeed interprets --locale as:

    if (locale)
    {
        if (lc_ctype)
            pg_fatal("only one of --locale and --lc-ctype can be specified");
        if (lc_collate)
            pg_fatal("only one of --locale and --lc-collate can be specified");
        lc_ctype = locale;
        lc_collate = locale;
    }

AFAIK the fallback in the CREATE DATABASE case is expected as POSIX locale
names should be accepted by icu, so this should work for createdb too.

Re: ICU for global collation

From

Marina Polyakova

Date:

20 August 2022, 09:41:51

On 2022-08-17 19:53, Julien Rouhaud wrote:
> Good catch.  There's unfortunately not a lot of regression tests for
> ICU-initialized clusters.  I'm wondering if the build-farm client could 
> be
> taught about the locale provider rather than assuming libc.  Clearly 
> that
> wouldn't have caught this issue, but it should still increase the 
> coverage a
> bit (I'm thinking of the recent problem with the abbreviated keys).

Looking at installchecks with different locales (e.g. see [1] with 
sv_SE.UTF-8) - why not?..

>> diff --git a/src/backend/commands/dbcommands.c
>> b/src/backend/commands/dbcommands.c
>> index 
>> b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9
>> 100644
>> --- a/src/backend/commands/dbcommands.c
>> +++ b/src/backend/commands/dbcommands.c
>> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt 
>> *stmt)
>>                      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>                       errmsg("ICU locale must be specified")));
>>      }
>> +    else
>> +        dbiculocale = NULL;
>> 
>>      if (dblocprovider == COLLPROVIDER_ICU)
>>          check_icu_locale(dbiculocale);
> 
> I think it would be better to do that in the various variable 
> initialization.
> Maybe switch the dblocprovider and dbiculocale initialization, and only
> initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU.

diff --git a/src/backend/commands/dbcommands.c 
b/src/backend/commands/dbcommands.c
index 
b31a30550b025d48ba3cc250dc4c15f41f9a80be..883f381f3453142790f728a3725586cebe2e2f20 
100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1012,10 +1012,10 @@ createdb(ParseState *pstate, const CreatedbStmt 
*stmt)
          dbcollate = src_collate;
      if (dbctype == NULL)
          dbctype = src_ctype;
-    if (dbiculocale == NULL)
-        dbiculocale = src_iculocale;
      if (dblocprovider == '\0')
          dblocprovider = src_locprovider;
+    if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
+        dbiculocale = src_iculocale;

      /* Some encodings are client only */
      if (!PG_VALID_BE_ENCODING(encoding))

Then it seemed to me that it was easier to first get all the parameters 
from the template database as usual and then process them as needed. But 
with your suggestion the failed assertion will check the code above more 
accurately...

> This discrepancy between createdb and CREATE DATABASE looks like an 
> oversight,
> as createdb indeed interprets --locale as:
> 
>     if (locale)
>     {
>         if (lc_ctype)
>             pg_fatal("only one of --locale and --lc-ctype can be specified");
>         if (lc_collate)
>             pg_fatal("only one of --locale and --lc-collate can be specified");
>         lc_ctype = locale;
>         lc_collate = locale;
>     }
> 
> AFAIK the fallback in the CREATE DATABASE case is expected as POSIX 
> locale
> names should be accepted by icu, so this should work for createdb too.

Oh, great, thanks!

> > > I spent some time looking at the ICU api trying to figure out if using a
> > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > > same locale, but I might be wrong.  I also didn't find a way to figure out how
> > > to ask ICU if the locale identifier passed is complete garbage or not.  One
> > > sure thing is that the system collation we import are of the form 'en-us', so
> > > it seems weird to have this form in pg_collation and by default another form in
> > > pg_database.
> >
> > Yeah it seems to be inconsistent about that.  The locale ID documentation
> > appears to indicate that "en_US" is the canonical form, but when you ask it
> > to list all the locales it knows about it returns "en-US".
> 
> Yeah I saw that too when checking is POSIX locale names were valid, and 
> that's
> not great.

I'm sorry but IIUC pg_import_system_collations uses uloc_getAvailable to 
get the locale ID and then specifically calls uloc_toLanguageTag?..

> I don't think that initdb --collation-provider icu should be allowed 
> without
> --icu-locale, same for --collation-provider libc *with* --icu-locale.

> > initdb has some specific processing to transform the default libc locale to
> > something more appropriate, but as far as I can see creatdb / CREATE DATABASE
> > aren't doing that.  It seems inconsistent, and IMHO another reason why
> > defaulting to the libc locale looks like a bad idea.
> 
> This has all been removed.  The separate ICU locale option should now 
> be
> required everywhere (initdb, createdb, CREATE DATABASE).

If it's a feature and not a bug in CREATE DATABASE, why should not it 
work in initdb too? Here we define locale/lc_collate/lc_ctype for the 
first 3 databases in the cluster in much the same way...

P.S. FYI there seems to be a bug for very old ICU versions: in master 
(92fce4e1eda9b24d73f583fbe9b58f4e03f097a4):

$ initdb -D data &&
pg_ctl -D data -l logfile start &&
psql -c "CREATE DATABASE mydb LOCALE \"C.UTF-8\" LOCALE_PROVIDER icu 
TEMPLATE template0" postgres &&
psql -c "SELECT 1" mydb

WARNING:  database "mydb" has a collation version mismatch
DETAIL:  The database was created using collation version 49.192.0.42, 
but the operating system provides version 49.192.5.42.
HINT:  Rebuild all objects in this database that use the default 
collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or 
build PostgreSQL with the right library version.

See the additional output (diff_log_icu_collator_locale.patch) in the 
logfile:

2022-08-20 11:38:30.162 MSK [136546] LOG:  check_icu_locale 
uloc_getDefault() en_US
2022-08-20 11:38:30.162 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  check_icu_locale icu_locale 
C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  get_collation_actual_version 
uloc_getDefault() en_US
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  get_collation_actual_version 
icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.224 MSK [136548] LOG:  make_icu_collator 
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG:  make_icu_collator icu_locale 
C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] LOG:  get_collation_actual_version 
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG:  get_collation_actual_version 
icu_locale C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] WARNING:  database "mydb" has a 
collation version mismatch
2022-08-20 11:38:30.225 MSK [136548] DETAIL:  The database was created 
using collation version 49.192.0.42, but the operating system provides 
version 49.192.5.42.
2022-08-20 11:38:30.225 MSK [136548] HINT:  Rebuild all objects in this 
database that use the default collation and run ALTER DATABASE mydb 
REFRESH COLLATION VERSION, or build PostgreSQL with the right library 
version.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2022-08-18%2006%3A25%3A17

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

diff_log_icu_collator_locale.patch

Re: ICU for global collation

From

Peter Eisentraut

Date:

22 August 2022, 14:10:59

On 15.08.22 14:06, Marina Polyakova wrote:
> 1.1) It looks like there's a bug in the function get_db_infos 
> (src/bin/pg_upgrade/info.c), where the version of the old cluster is 
> always checked:
> 
> if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
>      snprintf(query + strlen(query), sizeof(query) - strlen(query),
>               "'c' AS datlocprovider, NULL AS daticulocale, ");
> else
>      snprintf(query + strlen(query), sizeof(query) - strlen(query),
>               "datlocprovider, daticulocale, ");
> 
> With the simple patch
> 
> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
> index 
> df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 
> 100644
> --- a/src/bin/pg_upgrade/info.c
> +++ b/src/bin/pg_upgrade/info.c
> @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster)
> 
>       snprintf(query, sizeof(query),
>                "SELECT d.oid, d.datname, d.encoding, d.datcollate, 
> d.datctype, ");
> -    if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
> +    if (GET_MAJOR_VERSION(cluster->major_version) < 1500)
>           snprintf(query + strlen(query), sizeof(query) - strlen(query),
>                    "'c' AS datlocprovider, NULL AS daticulocale, ");
>       else

fixed

> 1.2) It looks like the mentioned asserion in dbcommands.c conflicts with 
> the following lines earlier:
> 
> if (dbiculocale == NULL)
>      dbiculocale = src_iculocale;

fixed

> I'm wondering if this is not a fully-supported feature (because createdb 
> creates an SQL command with LC_COLLATE and LC_CTYPE options instead of 
> LOCALE option) or is it a bug in CREATE DATABASE?.. From 
> src/backend/commands/dbcommands.c:
> 
> if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale)
> {
>      if (dlocale && dlocale->arg)
>          dbiculocale = defGetString(dlocale);
> }

I think this piece of code was left over from some earlier attempts to 
specify the libc locale and the icu locale with one option, which never 
really worked well.  The CREATE DATABASE man page does not mention that 
LOCALE provides the default for ICU_LOCALE.  Hence, I think we should 
delete this.

Re: ICU for global collation

From

Marina Polyakova

Date:

22 August 2022, 14:17:31

On 2022-08-22 17:10, Peter Eisentraut wrote:
> On 15.08.22 14:06, Marina Polyakova wrote:
>> 1.1) It looks like there's a bug in the function get_db_infos 
>> (src/bin/pg_upgrade/info.c), where the version of the old cluster is 
>> always checked:
>> 
>> if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
>>      snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>               "'c' AS datlocprovider, NULL AS daticulocale, ");
>> else
>>      snprintf(query + strlen(query), sizeof(query) - strlen(query),
>>               "datlocprovider, daticulocale, ");
>> 
>> With the simple patch
>> 
>> diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
>> index 
>> df374ce4b362b4c6c87fc1fd0e476e5d6d353d9e..53ea348e211d3ac38334292bc16cb814bc13bb87 
>> 100644
>> --- a/src/bin/pg_upgrade/info.c
>> +++ b/src/bin/pg_upgrade/info.c
>> @@ -319,7 +319,7 @@ get_db_infos(ClusterInfo *cluster)
>> 
>>       snprintf(query, sizeof(query),
>>                "SELECT d.oid, d.datname, d.encoding, d.datcollate, 
>> d.datctype, ");
>> -    if (GET_MAJOR_VERSION(old_cluster.major_version) < 1500)
>> +    if (GET_MAJOR_VERSION(cluster->major_version) < 1500)
>>           snprintf(query + strlen(query), sizeof(query) - 
>> strlen(query),
>>                    "'c' AS datlocprovider, NULL AS daticulocale, ");
>>       else
> 
> fixed
> 
>> 1.2) It looks like the mentioned asserion in dbcommands.c conflicts 
>> with the following lines earlier:
>> 
>> if (dbiculocale == NULL)
>>      dbiculocale = src_iculocale;
> 
> fixed
> 
>> I'm wondering if this is not a fully-supported feature (because 
>> createdb creates an SQL command with LC_COLLATE and LC_CTYPE options 
>> instead of LOCALE option) or is it a bug in CREATE DATABASE?.. From 
>> src/backend/commands/dbcommands.c:
>> 
>> if (dblocprovider == COLLPROVIDER_ICU && !dbiculocale)
>> {
>>      if (dlocale && dlocale->arg)
>>          dbiculocale = defGetString(dlocale);
>> }
> 
> I think this piece of code was left over from some earlier attempts to
> specify the libc locale and the icu locale with one option, which
> never really worked well.  The CREATE DATABASE man page does not
> mention that LOCALE provides the default for ICU_LOCALE.  Hence, I
> think we should delete this.

Thank you!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Michael Paquier

Date:

23 August 2022, 04:16:31

On Mon, Aug 22, 2022 at 04:10:59PM +0200, Peter Eisentraut wrote:
> I think this piece of code was left over from some earlier attempts to
> specify the libc locale and the icu locale with one option, which never
> really worked well.  The CREATE DATABASE man page does not mention that
> LOCALE provides the default for ICU_LOCALE.  Hence, I think we should delete
> this.

As of 36f729e, is there anything left to address on this thread or
should this open item be closed?
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Marina Polyakova

Date:

23 August 2022, 17:59:02

My colleague Andrew Bille found another bug in master 
(b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE 
(2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale 
is not dumped. See check_icu_locale.sh:

In the old cluster:
SELECT collname, colliculocale FROM pg_collation WHERE collname = 
'testcoll_backwards'
       collname      |   colliculocale
--------------------+-------------------
  testcoll_backwards | @colBackwards=yes
(1 row)

In the new cluster:
SELECT collname, colliculocale FROM pg_collation WHERE collname = 
'testcoll_backwards'
       collname      | colliculocale
--------------------+---------------
  testcoll_backwards |
(1 row)

diff_dump_colliculocale.patch works for me.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: ICU for global collation

From

Michael Paquier

Date:

24 August 2022, 04:38:44

On Tue, Aug 23, 2022 at 08:59:02PM +0300, Marina Polyakova wrote:
> My colleague Andrew Bille found another bug in master
> (b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE
> (2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale is
> not dumped. See check_icu_locale.sh:
>
> In the old cluster:
> SELECT collname, colliculocale FROM pg_collation WHERE collname =
> 'testcoll_backwards'
>       collname      |   colliculocale
> --------------------+-------------------
>  testcoll_backwards | @colBackwards=yes
> (1 row)
>
> In the new cluster:
> SELECT collname, colliculocale FROM pg_collation WHERE collname =
> 'testcoll_backwards'
>       collname      | colliculocale
> --------------------+---------------
>  testcoll_backwards |
> (1 row)
>
> diff_dump_colliculocale.patch works for me.

Ugh.  Good catch, again!  I have not tested the patch in details but
this looks rather sane to me on a quick read.  Peter?
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Julien Rouhaud

Date:

24 August 2022, 08:59:23

On Wed, Aug 24, 2022 at 01:38:44PM +0900, Michael Paquier wrote:
> On Tue, Aug 23, 2022 at 08:59:02PM +0300, Marina Polyakova wrote:
> > My colleague Andrew Bille found another bug in master
> > (b4e936859dc441102eb0b6fb7a104f3948c90490) and REL_15_STABLE
> > (2c63b0930aee1bb5c265fad4a65c9d0b62b1f9da): pg_collation.colliculocale is
> > not dumped. See check_icu_locale.sh:
> > 
> > In the old cluster:
> > SELECT collname, colliculocale FROM pg_collation WHERE collname =
> > 'testcoll_backwards'
> >       collname      |   colliculocale
> > --------------------+-------------------
> >  testcoll_backwards | @colBackwards=yes
> > (1 row)
> > 
> > In the new cluster:
> > SELECT collname, colliculocale FROM pg_collation WHERE collname =
> > 'testcoll_backwards'
> >       collname      | colliculocale
> > --------------------+---------------
> >  testcoll_backwards |
> > (1 row)
> > 
> > diff_dump_colliculocale.patch works for me.
> 
> Ugh.  Good catch, again!

+1

> I have not tested the patch in details but
> this looks rather sane to me on a quick read.  Peter?

Patch looks good to me too.

Re: ICU for global collation

From

Peter Eisentraut

Date:

24 August 2022, 18:28:24

On 24.08.22 10:59, Julien Rouhaud wrote:
>> I have not tested the patch in details but
>> this looks rather sane to me on a quick read.  Peter?
> Patch looks good to me too.

Committed, thanks.

(This should conclude all the issues discussed in this thread recently.)

Re: ICU for global collation

From

Michael Paquier

Date:

30 August 2022, 13:03:28

On Wed, Aug 24, 2022 at 08:28:24PM +0200, Peter Eisentraut wrote:
> Committed, thanks.
>
> (This should conclude all the issues discussed in this thread recently.)

Please note that this open item was still listed as open.  I have
closed it now.
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Marina Polyakova

Date:

05 September 2022, 15:14:00

Hello!

IMO after adding ICU for global collations [1] the behaviour of createdb 
and CREATE DATABASE is a bit inconsistent when both locale and 
lc_collate (or locale and lc_ctype) options are used:

$ createdb mydb --locale C --lc-collate C --template template0
createdb: error: only one of --locale and --lc-collate can be specified
$ psql -c "create database mydb locale = 'C' lc_collate = 'C' template = 
'template0'" postgres
CREATE DATABASE

 From the CREATE DATABASE documentation [2]:

locale
This is a shortcut for setting LC_COLLATE and LC_CTYPE at once. If you 
specify this, you cannot specify either of those parameters.

The patch diff_return_back_create_database_error.patch returns back the 
removed code for CREATE DATABASE so it behaves like createdb as 
before...

[1] 
https://github.com/postgres/postgres/commit/f2553d43060edb210b36c63187d52a632448e1d2
[2] https://www.postgresql.org/docs/devel/sql-createdatabase.html

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

diff_return_back_create_database_error.patch

Re: ICU for global collation

From

Justin Pryzby

Date:

09 September 2022, 16:46:25

In pg14:
|postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C;
|ERROR:  conflicting or redundant options
|DETAIL:  LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE.

In pg15:
|postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE "en_US.UTF-8" LOCALE "en_US.UTF-8" ;
|CREATE DATABASE

f2553d430 actually relaxed the restriction by removing this check:

-       if (dlocale && (dcollate || dctype))
-               ereport(ERROR,
-                               (errcode(ERRCODE_SYNTAX_ERROR),
-                                errmsg("conflicting or redundant options"),
-                                errdetail("LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE.")));

But isn't the right fix to do the corresponding thing in createdb
(relaxing the frontend restriction rather than reverting its relaxation
in the backend).

diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index e523e58b218..5b80e56dfd9 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -159,15 +159,10 @@ main(int argc, char *argv[])
             exit(1);
     }
 
-    if (locale)
-    {
-        if (lc_ctype)
-            pg_fatal("only one of --locale and --lc-ctype can be specified");
-        if (lc_collate)
-            pg_fatal("only one of --locale and --lc-collate can be specified");
+    if (locale && !lc_ctype)
         lc_ctype = locale;
+    if (locale && !lc_collate)
         lc_collate = locale;
-    }
 
     if (encoding)
     {


BTW it's somewhat crummy that it uses a string comparison, so if you
write "UTF8" without a dash, it says this; it took me a few minutes to
see the difference...

postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE "en_US.UTF8" LOCALE "en_US.UTF8";
ERROR:  new collation (en_US.UTF8) is incompatible with the collation of the template database (en_US.UTF-8)

Re: ICU for global collation

From

Marina Polyakova

Date:

13 September 2022, 05:34:16

On 2022-09-09 19:46, Justin Pryzby wrote:
> In pg14:
> |postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C;
> |ERROR:  conflicting or redundant options
> |DETAIL:  LOCALE cannot be specified together with LC_COLLATE or 
> LC_CTYPE.
> 
> In pg15:
> |postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE
> "en_US.UTF-8" LOCALE "en_US.UTF-8" ;
> |CREATE DATABASE
> 
> f2553d430 actually relaxed the restriction by removing this check:
> 
> -       if (dlocale && (dcollate || dctype))
> -               ereport(ERROR,
> -                               (errcode(ERRCODE_SYNTAX_ERROR),
> -                                errmsg("conflicting or redundant 
> options"),
> -                                errdetail("LOCALE cannot be specified
> together with LC_COLLATE or LC_CTYPE.")));
> 
> But isn't the right fix to do the corresponding thing in createdb
> (relaxing the frontend restriction rather than reverting its relaxation
> in the backend).
> 
> diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
> index e523e58b218..5b80e56dfd9 100644
> --- a/src/bin/scripts/createdb.c
> +++ b/src/bin/scripts/createdb.c
> @@ -159,15 +159,10 @@ main(int argc, char *argv[])
>              exit(1);
>      }
> 
> -    if (locale)
> -    {
> -        if (lc_ctype)
> -            pg_fatal("only one of --locale and --lc-ctype can be specified");
> -        if (lc_collate)
> -            pg_fatal("only one of --locale and --lc-collate can be specified");
> +    if (locale && !lc_ctype)
>          lc_ctype = locale;
> +    if (locale && !lc_collate)
>          lc_collate = locale;
> -    }
> 
>      if (encoding)
>      {

I agree with you that it is more comfortable and more similar to what 
has already been done in initdb. IMO it would be easier to do it like 
this:

diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 
e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 
100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -161,12 +161,10 @@ main(int argc, char *argv[])

      if (locale)
      {
-        if (lc_ctype)
-            pg_fatal("only one of --locale and --lc-ctype can be specified");
-        if (lc_collate)
-            pg_fatal("only one of --locale and --lc-collate can be specified");
-        lc_ctype = locale;
-        lc_collate = locale;
+        if (!lc_ctype)
+            lc_ctype = locale;
+        if (!lc_collate)
+            lc_collate = locale;
      }

      if (encoding)

Should we change the behaviour of createdb and CREATE DATABASE in 
previous major versions?..

> BTW it's somewhat crummy that it uses a string comparison, so if you
> write "UTF8" without a dash, it says this; it took me a few minutes to
> see the difference...
> 
> postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE
> "en_US.UTF8" LOCALE "en_US.UTF8";
> ERROR:  new collation (en_US.UTF8) is incompatible with the collation
> of the template database (en_US.UTF-8)

Perhaps we could check the locale itself with the function 
normalize_libc_locale_name (collationcmds.c). But ISTM that the current 
check is a safety net in case the function pg_get_encoding_from_locale 
(chklocale.c) returns -1 or PG_SQL_ASCII...

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Peter Eisentraut

Date:

13 September 2022, 12:41:41

On 13.09.22 07:34, Marina Polyakova wrote:
> I agree with you that it is more comfortable and more similar to what 
> has already been done in initdb. IMO it would be easier to do it like this:
> 
> diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
> index 
> e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 100644
> --- a/src/bin/scripts/createdb.c
> +++ b/src/bin/scripts/createdb.c
> @@ -161,12 +161,10 @@ main(int argc, char *argv[])
> 
>       if (locale)
>       {
> -        if (lc_ctype)
> -            pg_fatal("only one of --locale and --lc-ctype can be 
> specified");
> -        if (lc_collate)
> -            pg_fatal("only one of --locale and --lc-collate can be 
> specified");
> -        lc_ctype = locale;
> -        lc_collate = locale;
> +        if (!lc_ctype)
> +            lc_ctype = locale;
> +        if (!lc_collate)
> +            lc_collate = locale;
>       }
> 
>       if (encoding)

done that way

> Should we change the behaviour of createdb and CREATE DATABASE in 
> previous major versions?..

I don't see a need for that.

>> BTW it's somewhat crummy that it uses a string comparison, so if you
>> write "UTF8" without a dash, it says this; it took me a few minutes to
>> see the difference...
>>
>> postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE
>> "en_US.UTF8" LOCALE "en_US.UTF8";
>> ERROR:  new collation (en_US.UTF8) is incompatible with the collation
>> of the template database (en_US.UTF-8)
> 
> Perhaps we could check the locale itself with the function 
> normalize_libc_locale_name (collationcmds.c). But ISTM that the current 
> check is a safety net in case the function pg_get_encoding_from_locale 
> (chklocale.c) returns -1 or PG_SQL_ASCII...

This is not new behavior in PG15, is it?

Re: ICU for global collation

From

Marina Polyakova

Date:

13 September 2022, 15:51:39

On 2022-09-13 15:41, Peter Eisentraut wrote:
> On 13.09.22 07:34, Marina Polyakova wrote:
>> I agree with you that it is more comfortable and more similar to what 
>> has already been done in initdb. IMO it would be easier to do it like 
>> this:
>> 
>> diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
>> index 
>> e523e58b2189275dc603a06324a2f28b0f49d8b7..a1482df3d981a680dd3322052e7c03ddacc8dc26 
>> 100644
>> --- a/src/bin/scripts/createdb.c
>> +++ b/src/bin/scripts/createdb.c
>> @@ -161,12 +161,10 @@ main(int argc, char *argv[])
>> 
>>       if (locale)
>>       {
>> -        if (lc_ctype)
>> -            pg_fatal("only one of --locale and --lc-ctype can be 
>> specified");
>> -        if (lc_collate)
>> -            pg_fatal("only one of --locale and --lc-collate can be 
>> specified");
>> -        lc_ctype = locale;
>> -        lc_collate = locale;
>> +        if (!lc_ctype)
>> +            lc_ctype = locale;
>> +        if (!lc_collate)
>> +            lc_collate = locale;
>>       }
>> 
>>       if (encoding)
> 
> done that way

Thank you!

>>> BTW it's somewhat crummy that it uses a string comparison, so if you
>>> write "UTF8" without a dash, it says this; it took me a few minutes 
>>> to
>>> see the difference...
>>> 
>>> postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE
>>> "en_US.UTF8" LOCALE "en_US.UTF8";
>>> ERROR:  new collation (en_US.UTF8) is incompatible with the collation
>>> of the template database (en_US.UTF-8)
>> 
>> Perhaps we could check the locale itself with the function 
>> normalize_libc_locale_name (collationcmds.c). But ISTM that the 
>> current check is a safety net in case the function 
>> pg_get_encoding_from_locale (chklocale.c) returns -1 or 
>> PG_SQL_ASCII...
> 
> This is not new behavior in PG15, is it?

No, it has always existed [1] AFAICS..

[1] 
https://github.com/postgres/postgres/commit/61d967498802ab86d8897cb3c61740d7e9d712f6

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Marina Polyakova

Date:

14 September 2022, 14:19:34

Hello!

I was surprised that it is allowed to create clusters/databases where 
the default ICU collations do not actually work due to unsupported 
encodings:

$ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US 
-D data &&
pg_ctl -D data -l logfile start &&
psql -c "SELECT 'a' < 'b'" template1
...
waiting for server to start.... done
server started
ERROR:  encoding "SQL_ASCII" not supported by ICU

$ createdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US 
--template template0 mydb &&
psql -c "SELECT 'a' < 'b'" mydb
ERROR:  encoding "SQL_ASCII" not supported by ICU

The patch diff_check_icu_encoding.patch prohibits the creation of such 
objects...

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

diff_check_icu_encoding.patch

Re: ICU for global collation

From

Kyotaro Horiguchi

Date:

15 September 2022, 06:52:10

At Wed, 14 Sep 2022 17:19:34 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in 
> Hello!
> 
> I was surprised that it is allowed to create clusters/databases where
> the default ICU collations do not actually work due to unsupported
> encodings:
> 
> $ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US
> -D data &&
> pg_ctl -D data -l logfile start &&
> psql -c "SELECT 'a' < 'b'" template1
> ...
> waiting for server to start.... done
> server started
> ERROR:  encoding "SQL_ASCII" not supported by ICU

Indeed. If I did the following, the direction of the patch looks fine
to me.

If I executed initdb as follows, I would be told to specify
--icu-locale option.

> $ initdb --encoding sql-ascii --locale-provider icu hoge
> ...
> initdb: error: ICU locale must be specified

However, when I reran the command, it complains about incompatible
encoding this time.  I think it's more user-friendly to check for the
encoding compatibility before the check for missing --icu-locale
option.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: ICU for global collation

From

Marina Polyakova

Date:

15 September 2022, 15:41:31

On 2022-09-15 09:52, Kyotaro Horiguchi wrote:
> If I executed initdb as follows, I would be told to specify
> --icu-locale option.
> 
>> $ initdb --encoding sql-ascii --locale-provider icu hoge
>> ...
>> initdb: error: ICU locale must be specified
> 
> However, when I reran the command, it complains about incompatible
> encoding this time.  I think it's more user-friendly to check for the
> encoding compatibility before the check for missing --icu-locale
> option.
> 
> regards.

I agree with you. Here's another version of the patch. The 
locale/encoding checks and reports in initdb have been reordered, 
because now the encoding is set first and only then the ICU locale is 
checked.

P.S. While working on the patch, I discovered that UTF8 encoding is 
always used for the ICU provider in initdb unless it is explicitly 
specified by the user:

if (!encoding && locale_provider == COLLPROVIDER_ICU)
    encodingid = PG_UTF8;

IMO this creates additional errors for locales with other encodings:

$ initdb --locale de_DE.iso885915@euro --locale-provider icu 
--icu-locale de-DE
...
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that 
the selected locale uses (LATIN9) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

And ICU supports many encodings, see the contents of pg_enc2icu_tbl in 
encnames.c...

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

v2-diff_check_icu_encoding.patch

Re: ICU for global collation

From

Michael Paquier

Date:

16 September 2022, 02:28:09

On Wed, Sep 14, 2022 at 05:19:34PM +0300, Marina Polyakova wrote:
> I was surprised that it is allowed to create clusters/databases where the
> default ICU collations do not actually work due to unsupported encodings:
>
> $ initdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US -D
> data &&
> pg_ctl -D data -l logfile start &&
> psql -c "SELECT 'a' < 'b'" template1
> ...
> waiting for server to start.... done
> server started
> ERROR:  encoding "SQL_ASCII" not supported by ICU
>
> $ createdb --encoding SQL_ASCII --locale-provider icu --icu-locale en-US
> --template template0 mydb &&
> psql -c "SELECT 'a' < 'b'" mydb
> ERROR:  encoding "SQL_ASCII" not supported by ICU
>
> The patch diff_check_icu_encoding.patch prohibits the creation of such
> objects...

Agreed that it is a bit confusing to get this type of error after the
database has been created when querying it due to a mix of unsupported
options.  Peter?
--
Michael

Attachment

signature.asc

Re: ICU for global collation

From

Kyotaro Horiguchi

Date:

16 September 2022, 04:55:19

At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in 
> P.S. While working on the patch, I discovered that UTF8 encoding is
> always used for the ICU provider in initdb unless it is explicitly
> specified by the user:
> 
> if (!encoding && locale_provider == COLLPROVIDER_ICU)
>     encodingid = PG_UTF8;
> 
> IMO this creates additional errors for locales with other encodings:
> 
> $ initdb --locale de_DE.iso885915@euro --locale-provider icu
> --icu-locale de-DE
> ...
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (UTF8) and the encoding that
> the selected locale uses (LATIN9) do not match. This would lead to
> misbehavior in various character string processing functions.
> initdb: hint: Rerun initdb and either do not specify an encoding
> explicitly, or choose a matching combination.
> 
> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in
> encnames.c...

It seems to me the best default that fits almost all cases using icu
locales.

So, we need to specify encoding explicitly in that case.

$ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro --locale-provider icu --icu-locale de-DE

However, I think it is hardly understantable from the documentation.

(I checked this using euc-jp [1] so it might be wrong..)

[1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider icu --icu-locale ja-x-icu

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: ICU for global collation

From

Marina Polyakova

Date:

16 September 2022, 06:49:28

On 2022-09-15 09:52, Kyotaro Horiguchi wrote:
> If I executed initdb as follows, I would be told to specify
> --icu-locale option.
> 
>> $ initdb --encoding sql-ascii --locale-provider icu hoge
>> ...
>> initdb: error: ICU locale must be specified
> 
> However, when I reran the command, it complains about incompatible
> encoding this time.  I think it's more user-friendly to check for the
> encoding compatibility before the check for missing --icu-locale
> option.
> 
> regards.

In continuation of options check: AFAICS the following checks in initdb

    if (locale_provider == COLLPROVIDER_ICU)
    {
        if (!icu_locale)
            pg_fatal("ICU locale must be specified");

        /*
         * In supported builds, the ICU locale ID will be checked by the
         * backend during post-bootstrap initialization.
         */
#ifndef USE_ICU
        pg_fatal("ICU is not supported in this build");
#endif
    }

are executed approximately when they are executed in create database 
after getting all the necessary data from the template database:

if (dblocprovider == COLLPROVIDER_ICU)
{
    /*
     * This would happen if template0 uses the libc provider but the new
     * database uses icu.
     */
    if (!dbiculocale)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("ICU locale must be specified")));
}

if (dblocprovider == COLLPROVIDER_ICU)
    check_icu_locale(dbiculocale);

But perhaps the check that --icu-locale cannot be specified unless 
locale provider icu is chosen should also be moved here? So all these 
checks will be in one place and it will use the provider from the 
template database (which could be icu):

$ initdb --locale-provider icu --icu-locale en-US -D data &&
pg_ctl -D data -l logfile start &&
createdb --icu-locale ru-RU --template template0 mydb
...
createdb: error: database creation failed: ERROR:  ICU locale cannot be 
specified unless locale provider is ICU

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Marina Polyakova

Date:

16 September 2022, 07:31:42

On 2022-09-16 07:55, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova
> <m.polyakova@postgrespro.ru> wrote in
>> P.S. While working on the patch, I discovered that UTF8 encoding is
>> always used for the ICU provider in initdb unless it is explicitly
>> specified by the user:
>> 
>> if (!encoding && locale_provider == COLLPROVIDER_ICU)
>>     encodingid = PG_UTF8;
>> 
>> IMO this creates additional errors for locales with other encodings:
>> 
>> $ initdb --locale de_DE.iso885915@euro --locale-provider icu
>> --icu-locale de-DE
>> ...
>> initdb: error: encoding mismatch
>> initdb: detail: The encoding you selected (UTF8) and the encoding that
>> the selected locale uses (LATIN9) do not match. This would lead to
>> misbehavior in various character string processing functions.
>> initdb: hint: Rerun initdb and either do not specify an encoding
>> explicitly, or choose a matching combination.
>> 
>> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in
>> encnames.c...
> 
> It seems to me the best default that fits almost all cases using icu
> locales.
> 
> So, we need to specify encoding explicitly in that case.
> 
> $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro
> --locale-provider icu --icu-locale de-DE
> 
> However, I think it is hardly understantable from the documentation.
> 
> (I checked this using euc-jp [1] so it might be wrong..)
> 
> [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider
> icu --icu-locale ja-x-icu
> 
> regards.

Thank you!

IMO it is hardly understantable from the program output either - it 
looks like I manually chose the encoding UTF8. Maybe first inform about 
selected encoding?..

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 
6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 
100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
      }

      if (!encoding && locale_provider == COLLPROVIDER_ICU)
+    {
          encodingid = PG_UTF8;
+        printf(_("The default database encoding has been set to \"%s\" for a 
better experience with the ICU provider.\n"),
+               pg_encoding_to_char(encodingid));
+    }
      else if (!encoding)
      {
          int            ctype_enc;

ISTM that such choices (e.g. UTF8 for Windows in some cases) are 
described in the documentation [1] as

By default, initdb uses the locale provider libc, takes the locale 
settings from the environment, and determines the encoding from the 
locale settings. This is almost always sufficient, unless there are 
special requirements.

[1] https://www.postgresql.org/docs/devel/app-initdb.html

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Peter Eisentraut

Date:

16 September 2022, 07:56:31

On 15.09.22 17:41, Marina Polyakova wrote:
> I agree with you. Here's another version of the patch. The 
> locale/encoding checks and reports in initdb have been reordered, 
> because now the encoding is set first and only then the ICU locale is 
> checked.

I committed something based on the first version of your patch.  This 
reordering of the messages here was a little too much surgery for me at 
this point.  For instance, there are also messages in #ifdef WIN32 code 
that would need to be reordered as well.  I kept the overall structure 
of the code the same and just inserted the additional proposed checks.

If you want to pursue the reordering of the checks and messages overall, 
a patch for the master branch could be considered.

Re: ICU for global collation

From

Peter Eisentraut

Date:

16 September 2022, 07:57:39

On 16.09.22 09:31, Marina Polyakova wrote:
> IMO it is hardly understantable from the program output either - it 
> looks like I manually chose the encoding UTF8. Maybe first inform about 
> selected encoding?..

Yes, I included something like that in the patch just committed.

> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> index 
> 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
> @@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
>       }
> 
>       if (!encoding && locale_provider == COLLPROVIDER_ICU)
> +    {
>           encodingid = PG_UTF8;
> +        printf(_("The default database encoding has been set to \"%s\" 
> for a better experience with the ICU provider.\n"),
> +               pg_encoding_to_char(encodingid));
> +    }
>       else if (!encoding)
>       {
>           int            ctype_enc;

Re: ICU for global collation

From

Kyotaro Horiguchi

Date:

16 September 2022, 08:11:16

At Fri, 16 Sep 2022 09:49:28 +0300, Marina Polyakova <m.polyakova@postgrespro.ru> wrote in 
> On 2022-09-15 09:52, Kyotaro Horiguchi wrote:
> > However, when I reran the command, it complains about incompatible
> > encoding this time.  I think it's more user-friendly to check for the
> > encoding compatibility before the check for missing --icu-locale
> > option.
> > regards.
> 
> In continuation of options check: AFAICS the following checks in
> initdb
> 
>     if (locale_provider == COLLPROVIDER_ICU)
>     {
>         if (!icu_locale)
>             pg_fatal("ICU locale must be specified");
> 
>         /*
>          * In supported builds, the ICU locale ID will be checked by the
>          * backend during post-bootstrap initialization.
>          */
> #ifndef USE_ICU
>         pg_fatal("ICU is not supported in this build");
> #endif
>     }
> 
> are executed approximately when they are executed in create database
> after getting all the necessary data from the template database:

initdb doesn't work that way, but anyway, I realized that I am
proposing to move that code in setlocales() to the caller function as
the result. I don't think setlocales() is the place for the code
because icu locale has no business with what the function does.  That
being said there's no obvious reason we *need* to move the code out to
its caller.

> if (dblocprovider == COLLPROVIDER_ICU)
> {
>     /*
>      * This would happen if template0 uses the libc provider but the new
>      * database uses icu.
>      */
>     if (!dbiculocale)
>         ereport(ERROR,
>                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                  errmsg("ICU locale must be specified")));
> }
> 
> if (dblocprovider == COLLPROVIDER_ICU)
>     check_icu_locale(dbiculocale);
> 
> But perhaps the check that --icu-locale cannot be specified unless
> locale provider icu is chosen should also be moved here? So all these
> checks will be in one place and it will use the provider from the
> template database (which could be icu):
> 
> $ initdb --locale-provider icu --icu-locale en-US -D data &&
> pg_ctl -D data -l logfile start &&
> createdb --icu-locale ru-RU --template template0 mydb
> ...
> createdb: error: database creation failed: ERROR: ICU locale cannot be
> specified unless locale provider is ICU

And, I realized that this causes bigger churn than I thought. So, I'm
sorry but I withdraw the comment.

Thus the first proposed patch will be more or less the direction we
would go. And the patch looks good to me as a whole.

+                     errmsg("encoding \"%s\" is not supported with ICU provider",

+        pg_log_error("encoding \"%s\" is not supported with ICU provider",
+                     pg_encoding_to_char(encodingid));

I might be wrong, but the messages look wrong to me.  The alternatives
below might work.

"encoding \"%s\" is not supported by ICU"
"encoding \"%s\" cannot be used for/with ICU locales"

+        pg_log_error_hint("Rerun %s and choose a matching combination.",
+                          progname);

This doesn't seem to provide users with useful information.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: ICU for global collation

From

Kyotaro Horiguchi

Date:

16 September 2022, 08:14:25

At Fri, 16 Sep 2022 09:56:31 +0200, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote in 
> On 15.09.22 17:41, Marina Polyakova wrote:
> > I agree with you. Here's another version of the patch. The
> > locale/encoding checks and reports in initdb have been reordered,
> > because now the encoding is set first and only then the ICU locale is
> > checked.
> 
> I committed something based on the first version of your patch.  This
> reordering of the messages here was a little too much surgery for me
> at this point.  For instance, there are also messages in #ifdef WIN32
> code that would need to be reordered as well.  I kept the overall
> structure of the code the same and just inserted the additional
> proposed checks.

Yeah, as I sent just before, I reached the same conclusion.

> If you want to pursue the reordering of the checks and messages
> overall, a patch for the master branch could be considered.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: ICU for global collation

From

Peter Eisentraut

Date:

17 September 2022, 04:58:36

On 16.09.22 08:49, Marina Polyakova wrote:
> But perhaps the check that --icu-locale cannot be specified unless 
> locale provider icu is chosen should also be moved here? So all these 
> checks will be in one place and it will use the provider from the 
> template database (which could be icu):
> 
> $ initdb --locale-provider icu --icu-locale en-US -D data &&
> pg_ctl -D data -l logfile start &&
> createdb --icu-locale ru-RU --template template0 mydb
> ...
> createdb: error: database creation failed: ERROR:  ICU locale cannot be 
> specified unless locale provider is ICU

Can you be more specific about what you are proposing here?  I'm not 
following.

Re: ICU for global collation

From

Marina Polyakova

Date:

17 September 2022, 06:25:00

On 2022-09-16 10:57, Peter Eisentraut wrote:
> On 16.09.22 09:31, Marina Polyakova wrote:
>> IMO it is hardly understantable from the program output either - it 
>> looks like I manually chose the encoding UTF8. Maybe first inform 
>> about selected encoding?..
> 
> Yes, I included something like that in the patch just committed.
> 
>> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
>> index 
>> 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 
>> 100644
>> --- a/src/bin/initdb/initdb.c
>> +++ b/src/bin/initdb/initdb.c
>> @@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
>>       }
>> 
>>       if (!encoding && locale_provider == COLLPROVIDER_ICU)
>> +    {
>>           encodingid = PG_UTF8;
>> +        printf(_("The default database encoding has been set to 
>> \"%s\" for a better experience with the ICU provider.\n"),
>> +               pg_encoding_to_char(encodingid));
>> +    }
>>       else if (!encoding)
>>       {
>>           int            ctype_enc;

Thank you!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Marina Polyakova

Date:

17 September 2022, 07:05:32

On 2022-09-16 11:11, Kyotaro Horiguchi wrote:
> At Fri, 16 Sep 2022 09:49:28 +0300, Marina Polyakova
> <m.polyakova@postgrespro.ru> wrote in
>> In continuation of options check: AFAICS the following checks in
>> initdb
>> 
>>     if (locale_provider == COLLPROVIDER_ICU)
>>     {
>>         if (!icu_locale)
>>             pg_fatal("ICU locale must be specified");
>> 
>>         /*
>>          * In supported builds, the ICU locale ID will be checked by the
>>          * backend during post-bootstrap initialization.
>>          */
>> #ifndef USE_ICU
>>         pg_fatal("ICU is not supported in this build");
>> #endif
>>     }
>> 
>> are executed approximately when they are executed in create database
>> after getting all the necessary data from the template database:
> 
> initdb doesn't work that way, but anyway, I realized that I am
> proposing to move that code in setlocales() to the caller function as
> the result. I don't think setlocales() is the place for the code
> because icu locale has no business with what the function does.  That
> being said there's no obvious reason we *need* to move the code out to
> its caller.

Excuse me, but could you explain your last sentence in more detail? I 
read that this code is not for setlocales and then - that it should not 
moved from here, so I'm confused...

> +                     errmsg("encoding \"%s\" is not supported with ICU provider",
> 
> +        pg_log_error("encoding \"%s\" is not supported with ICU provider",
> +                     pg_encoding_to_char(encodingid));
> 
> I might be wrong, but the messages look wrong to me.  The alternatives
> below might work.
> 
> "encoding \"%s\" is not supported by ICU"
> "encoding \"%s\" cannot be used for/with ICU locales"

The message indicates that the selected encoding cannot be used with the 
ICU provider because it does not support it. But if the text of the 
message becomes better and clearer, I will only be glad.

> +        pg_log_error_hint("Rerun %s and choose a matching combination.",
> +                          progname);
> 
> This doesn't seem to provide users with useful information.

It was commited in more verbose form:

pg_log_error_hint("Rerun %s and either do not specify an encoding 
explicitly, "
                  "or choose a matching combination.",

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Marina Polyakova

Date:

17 September 2022, 08:33:30

Thanks to Kyotaro Horiguchi review we found out that there're 
interesting cases due to the order of some ICU checks:

1. ICU locale vs supported encoding:

1.1.

On 2022-09-15 09:52, Kyotaro Horiguchi wrote:
> If I executed initdb as follows, I would be told to specify
> --icu-locale option.
> 
>> $ initdb --encoding sql-ascii --locale-provider icu hoge
>> ...
>> initdb: error: ICU locale must be specified
> 
> However, when I reran the command, it complains about incompatible
> encoding this time.  I think it's more user-friendly to check for the
> encoding compatibility before the check for missing --icu-locale
> option.

1.2. (ok?)

$ initdb --encoding sql-ascii --icu-locale en-US hoge
initdb: error: --icu-locale cannot be specified unless locale provider 
"icu" is chosen

$ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu 
hoge
...
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (SQL_ASCII) is not supported 
with the ICU provider.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

$ createdb --encoding sql-ascii --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU locale cannot be 
specified unless locale provider is ICU
$ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu 
hoge
createdb: error: database creation failed: ERROR:  encoding "SQL_ASCII" 
is not supported with ICU provider

2. For builds without ICU:

2.1.

$ initdb --locale-provider icu hoge
...
initdb: error: ICU locale must be specified

$ initdb --locale-provider icu --icu-locale en-US hoge
...
initdb: error: ICU is not supported in this build

$ createdb --locale-provider icu hoge
createdb: error: database creation failed: ERROR:  ICU locale must be 
specified

$ createdb --locale-provider icu --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build

IMO, it would be more user-friendly to inform an unsupported build in 
the first runs too..

2.2. (ok?)

$ initdb --icu-locale en-US hoge
initdb: error: --icu-locale cannot be specified unless locale provider 
"icu" is chosen
$ initdb --icu-locale en-US --locale-provider icu hoge
...
initdb: error: ICU is not supported in this build

$ createdb --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU locale cannot be 
specified unless locale provider is ICU
$ createdb --icu-locale en-US --locale-provider icu hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build

2.3.

$ createdb --locale-provider icu --icu-locale en-US --encoding sql-ascii 
hoge
createdb: error: database creation failed: ERROR:  encoding "SQL_ASCII" 
is not supported with ICU provider
$ createdb --locale-provider icu --icu-locale en-US --encoding utf8 hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build

IMO, it would be more user-friendly to inform an unsupported build in 
the first run too..

3.

The locale provider is ICU, but it has not yet been set from the 
template database:

> $ initdb --locale-provider icu --icu-locale en-US -D data &&
> pg_ctl -D data -l logfile start &&
> createdb --icu-locale ru-RU --template template0 mydb
> ...
> createdb: error: database creation failed: ERROR:  ICU locale cannot be
> specified unless locale provider is ICU

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Marina Polyakova

Date:

17 September 2022, 08:42:24

On 2022-09-16 10:56, Peter Eisentraut wrote:
> On 15.09.22 17:41, Marina Polyakova wrote:
>> I agree with you. Here's another version of the patch. The 
>> locale/encoding checks and reports in initdb have been reordered, 
>> because now the encoding is set first and only then the ICU locale is 
>> checked.
> 
> I committed something based on the first version of your patch.  This
> reordering of the messages here was a little too much surgery for me
> at this point.  For instance, there are also messages in #ifdef WIN32
> code that would need to be reordered as well.  I kept the overall
> structure of the code the same and just inserted the additional
> proposed checks.
> 
> If you want to pursue the reordering of the checks and messages
> overall, a patch for the master branch could be considered.

Thank you! I already wrote about the order of the ICU checks in 
initdb/create database, they were the only reason to propose such 
changes...

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Peter Eisentraut

Date:

20 September 2022, 09:59:04

On 17.09.22 10:33, Marina Polyakova wrote:
> Thanks to Kyotaro Horiguchi review we found out that there're 
> interesting cases due to the order of some ICU checks:
> 
> 1. ICU locale vs supported encoding:
> 
> 1.1.
> 
> On 2022-09-15 09:52, Kyotaro Horiguchi wrote:
>> If I executed initdb as follows, I would be told to specify
>> --icu-locale option.
>>
>>> $ initdb --encoding sql-ascii --locale-provider icu hoge
>>> ...
>>> initdb: error: ICU locale must be specified
>>
>> However, when I reran the command, it complains about incompatible
>> encoding this time.  I think it's more user-friendly to check for the
>> encoding compatibility before the check for missing --icu-locale
>> option.

This a valid point, but it would require quite a bit of work to move all 
those checks around and re-verify the result, so I don't want to do it 
in PG15.

> 1.2. (ok?)
> 
> $ initdb --encoding sql-ascii --icu-locale en-US hoge
> initdb: error: --icu-locale cannot be specified unless locale provider 
> "icu" is chosen
> 
> $ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge
> ...
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (SQL_ASCII) is not supported 
> with the ICU provider.
> initdb: hint: Rerun initdb and either do not specify an encoding 
> explicitly, or choose a matching combination.
> 
> $ createdb --encoding sql-ascii --icu-locale en-US hoge
> createdb: error: database creation failed: ERROR:  ICU locale cannot be 
> specified unless locale provider is ICU
> $ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu 
> hoge
> createdb: error: database creation failed: ERROR:  encoding "SQL_ASCII" 
> is not supported with ICU provider

I don't see a problem here.

> 2. For builds without ICU:
> 
> 2.1.
> 
> $ initdb --locale-provider icu hoge
> ...
> initdb: error: ICU locale must be specified
> 
> $ initdb --locale-provider icu --icu-locale en-US hoge
> ...
> initdb: error: ICU is not supported in this build
> 
> $ createdb --locale-provider icu hoge
> createdb: error: database creation failed: ERROR:  ICU locale must be 
> specified
> 
> $ createdb --locale-provider icu --icu-locale en-US hoge
> createdb: error: database creation failed: ERROR:  ICU is not supported 
> in this build
> 
> IMO, it would be more user-friendly to inform an unsupported build in 
> the first runs too..

Again, this would require reorganizing a bunch of code to get some 
cosmetic benefit, which isn't a good idea now for PG15.

> 2.2. (ok?)
> 2.3.

same here

> 3.
> 
> The locale provider is ICU, but it has not yet been set from the 
> template database:
> 
>> $ initdb --locale-provider icu --icu-locale en-US -D data &&
>> pg_ctl -D data -l logfile start &&
>> createdb --icu-locale ru-RU --template template0 mydb
>> ...
>> createdb: error: database creation failed: ERROR:  ICU locale cannot be
>> specified unless locale provider is ICU

Please see attached patch for a fix.  Does that work for you?

Attachment

0001-Improve-ICU-option-handling-in-CREATE-DATABASE.patch

Re: ICU for global collation

From

Marina Polyakova

Date:

21 September 2022, 06:50:55

On 2022-09-20 12:59, Peter Eisentraut wrote:
> On 17.09.22 10:33, Marina Polyakova wrote:
>> 3.
>> 
>> The locale provider is ICU, but it has not yet been set from the 
>> template database:
>> 
>>> $ initdb --locale-provider icu --icu-locale en-US -D data &&
>>> pg_ctl -D data -l logfile start &&
>>> createdb --icu-locale ru-RU --template template0 mydb
>>> ...
>>> createdb: error: database creation failed: ERROR:  ICU locale cannot 
>>> be
>>> specified unless locale provider is ICU
> 
> Please see attached patch for a fix.  Does that work for you?

Yes, it works. The following test checks this fix:

diff --git a/src/bin/scripts/t/020_createdb.pl 
b/src/bin/scripts/t/020_createdb.pl
index 
b87d8fc63b5246b02bcd4499aae815269b60df7c..c2464a99618cd7ca5616cc21121e1e4379b52baf 
100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -71,6 +71,14 @@ if ($ENV{with_icu} eq 'yes')
      $node2->command_ok(
          [ 'createdb', '-T', 'template0', '--locale-provider=libc', 'foobar55' 
],
          'create database with libc provider from template database with icu 
provider');
+
+    $node2->command_ok(
+        [
+            'createdb', '-T', 'template0', '--icu-locale',
+            'en-US', 'foobar56'
+        ],
+        'create database with icu locale from template database with icu 
provider'
+    );
  }
  else
  {

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ICU for global collation

From

Peter Eisentraut

Date:

21 September 2022, 14:53:00

On 21.09.22 08:50, Marina Polyakova wrote:
> On 2022-09-20 12:59, Peter Eisentraut wrote:
>> On 17.09.22 10:33, Marina Polyakova wrote:
>>> 3.
>>>
>>> The locale provider is ICU, but it has not yet been set from the 
>>> template database:
>>>
>>>> $ initdb --locale-provider icu --icu-locale en-US -D data &&
>>>> pg_ctl -D data -l logfile start &&
>>>> createdb --icu-locale ru-RU --template template0 mydb
>>>> ...
>>>> createdb: error: database creation failed: ERROR:  ICU locale cannot be
>>>> specified unless locale provider is ICU
>>
>> Please see attached patch for a fix.  Does that work for you?
> 
> Yes, it works. The following test checks this fix:

Committed with that test, thanks.  I think that covers all the ICU 
issues you reported for PG15 for now?

Re: ICU for global collation

From

Marina Polyakova

Date:

22 September 2022, 18:06:07

On 2022-09-21 17:53, Peter Eisentraut wrote:
> Committed with that test, thanks.  I think that covers all the ICU
> issues you reported for PG15 for now?

I thought about the order of the ICU checks - if it is ok to check that 
the selected encoding is supported by ICU after printing all the locale 
& encoding information, why not to move almost all the ICU checks 
here?..

Examples of the work of the attached patch:

1. ICU locale vs supported encoding:

1.1.

$ initdb --encoding sql-ascii --locale-provider icu hoge
...
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (SQL_ASCII) is not supported 
with the ICU provider.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

1.2. (like before)

$ initdb --encoding sql-ascii --icu-locale en-US hoge
initdb: error: --icu-locale cannot be specified unless locale provider 
"icu" is chosen

$ createdb --encoding sql-ascii --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU locale cannot be 
specified unless locale provider is ICU

2. For builds without ICU:

2.1.

$ initdb --locale-provider icu hoge
...
initdb: error: ICU is not supported in this build

$ createdb --locale-provider icu hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build

2.2. (like before)

$ initdb --icu-locale en-US hoge
initdb: error: --icu-locale cannot be specified unless locale provider 
"icu" is chosen

$ createdb --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU locale cannot be 
specified unless locale provider is ICU

2.3.

$ createdb --locale-provider icu --icu-locale en-US --encoding sql-ascii 
hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build

4. About errors in initdb:

4.1. If icu_locale is not specified, but it is required, then we get 
this:

$ initdb --locale-provider icu hoge
The files belonging to this database system will be owned by user 
"marina".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
   provider:    icu
   LC_COLLATE:  en_US.UTF-8
   LC_CTYPE:    en_US.UTF-8
   LC_MESSAGES: en_US.UTF-8
   LC_MONETARY: ru_RU.UTF-8
   LC_NUMERIC:  ru_RU.UTF-8
   LC_TIME:     ru_RU.UTF-8
The default database encoding has been set to "UTF8".
initdb: error: ICU locale must be specified

Almost the same if ICU is not supported in this build:

$ initdb --locale-provider icu hoge
The files belonging to this database system will be owned by user 
"marina".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
   provider:    icu
   LC_COLLATE:  en_US.UTF-8
   LC_CTYPE:    en_US.UTF-8
   LC_MESSAGES: en_US.UTF-8
   LC_MONETARY: ru_RU.UTF-8
   LC_NUMERIC:  ru_RU.UTF-8
   LC_TIME:     ru_RU.UTF-8
The default database encoding has been set to "UTF8".
initdb: error: ICU is not supported in this build

4.2. If icu_locale is specified for the wrong provider, the error will 
be at the beginning of the program start as before:

$ initdb --icu-locale en-US hoge
initdb: error: --icu-locale cannot be specified unless locale provider 
"icu" is chosen

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

diff_icu_options_check_order.patch

Re: ICU for global collation

From

Peter Eisentraut

Date:

01 October 2022, 12:07:29

On 22.09.22 20:06, Marina Polyakova wrote:
> On 2022-09-21 17:53, Peter Eisentraut wrote:
>> Committed with that test, thanks.  I think that covers all the ICU
>> issues you reported for PG15 for now?
> 
> I thought about the order of the ICU checks - if it is ok to check that 
> the selected encoding is supported by ICU after printing all the locale 
> & encoding information, why not to move almost all the ICU checks here?..

It's possible that we can do better, but I'm not going to add things 
like that to PG 15 at this point unless it fixes a faulty behavior.

Re: ICU for global collation

From

Marina Polyakova

Date:

08 October 2022, 18:08:18

On 2022-10-01 15:07, Peter Eisentraut wrote:
> On 22.09.22 20:06, Marina Polyakova wrote:
>> On 2022-09-21 17:53, Peter Eisentraut wrote:
>>> Committed with that test, thanks.  I think that covers all the ICU
>>> issues you reported for PG15 for now?
>> 
>> I thought about the order of the ICU checks - if it is ok to check 
>> that the selected encoding is supported by ICU after printing all the 
>> locale & encoding information, why not to move almost all the ICU 
>> checks here?..
> 
> It's possible that we can do better, but I'm not going to add things
> like that to PG 15 at this point unless it fixes a faulty behavior.

Will PG 15 always have this order of ICU checks, is the current 
behaviour correct enough? On the other hand, there may be a better fix 
for PG 16+ and not all changes can be backported...

On 2022-09-16 10:56, Peter Eisentraut wrote:
> On 15.09.22 17:41, Marina Polyakova wrote:
>> I agree with you. Here's another version of the patch. The 
>> locale/encoding checks and reports in initdb have been reordered, 
>> because now the encoding is set first and only then the ICU locale is 
>> checked.
> 
> I committed something based on the first version of your patch.  This
> reordering of the messages here was a little too much surgery for me
> at this point.  For instance, there are also messages in #ifdef WIN32
> code that would need to be reordered as well.  I kept the overall
> structure of the code the same and just inserted the additional
> proposed checks.
> 
> If you want to pursue the reordering of the checks and messages
> overall, a patch for the master branch could be considered.

I've worked on this again (see attached patch) but I'm not sure if the 
messages of encoding mismatches are clear enough without the full locale 
information. For

$ initdb -D data --icu-locale en --locale-provider icu

compare the outputs:

The database cluster will be initialized with this locale configuration:
   provider:    icu
   ICU locale:  en
   LC_COLLATE:  de_DE.iso885915@euro
   LC_CTYPE:    de_DE.iso885915@euro
   LC_MESSAGES: en_US.utf8
   LC_MONETARY: de_DE.iso885915@euro
   LC_NUMERIC:  de_DE.iso885915@euro
   LC_TIME:     de_DE.iso885915@euro
The default database encoding has been set to "UTF8".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that 
the selected locale uses (LATIN9) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

and

Encoding "UTF8" implied by locale will be set as the default database 
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that 
the selected locale uses (LATIN9) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

The same without ICU, e.g. for

$ initdb -D data

the output with locale information:

The database cluster will be initialized with this locale configuration:
   provider:    libc
   LC_COLLATE:  en_US.utf8
   LC_CTYPE:    de_DE.iso885915@euro
   LC_MESSAGES: en_US.utf8
   LC_MONETARY: de_DE.iso885915@euro
   LC_NUMERIC:  de_DE.iso885915@euro
   LC_TIME:     de_DE.iso885915@euro
The default database encoding has accordingly been set to "LATIN9".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that 
the selected locale uses (UTF8) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

and the "shorter" version:

Encoding "LATIN9" implied by locale will be set as the default database 
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that 
the selected locale uses (UTF8) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

BTW, what did you mean that "there are also messages in #ifdef WIN32 
code that would need to be reordered as well"?..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

v2-diff_icu_options_check_order.patch

Re: ICU for global collation

From

Marina Polyakova

Date:

21 October 2022, 14:32:38

Hello!

I discovered an interesting behaviour during installcheck runs when the 
cluster was initialized with ICU locale provider:

$ initdb --locale-provider icu --icu-locale en-US -D data &&
pg_ctl -D data -l logfile start

1) The ECPG tests fail because they use the SQL_ASCII encoding [1], the 
database template0 uses the ICU locale provider and SQL_ASCII is not 
supported by ICU:

$ make -C src/interfaces/ecpg/ installcheck
...
============== creating database "ecpg1_regression"   ==============
ERROR:  encoding "SQL_ASCII" is not supported with ICU provider
ERROR:  database "ecpg1_regression" does not exist
command failed: "/home/marina/postgresql/master/my/inst/bin/psql" -X -c 
"CREATE DATABASE \"ecpg1_regression\" TEMPLATE=template0 
ENCODING='SQL_ASCII'" -c "ALTER DATABASE \"ecpg1_regression\" SET 
lc_messages TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_monetary 
TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_numeric TO 'C';ALTER 
DATABASE \"ecpg1_regression\" SET lc_time TO 'C';ALTER DATABASE 
\"ecpg1_regression\" SET bytea_output TO 'hex';ALTER DATABASE 
\"ecpg1_regression\" SET timezone_abbreviations TO 'Default';" 
"postgres"

2) The option --no-locale in pg_regress is described as "use C locale" 
[2]. But in this case the created databases actually use the ICU locale 
provider with the ICU cluster locale from template0 (see 
diff_check_backend_used_provider.patch):

$ make NO_LOCALE=1 installcheck

In regression.diffs:

diff -U3 
/home/marina/postgresql/master/src/test/regress/expected/test_setup.out 
/home/marina/postgresql/master/src/test/regress/results/test_setup.out
--- 
/home/marina/postgresql/master/src/test/regress/expected/test_setup.out    2022-09-27 
05:31:27.674628815 +0300
+++ 
/home/marina/postgresql/master/src/test/regress/results/test_setup.out    2022-10-21 
15:09:31.232992885 +0300
@@ -143,6 +143,798 @@
  \set filename :abs_srcdir '/data/person.data'
  COPY person FROM :'filename';
  VACUUM ANALYZE person;
+NOTICE:  varstrfastcmp_locale sss->collate_c 0 sss->locale 0xefacd0
+NOTICE:  varstrfastcmp_locale sss->locale->provider i
+NOTICE:  varstrfastcmp_locale sss->locale->info.icu.locale en-US
...

The patch diff_fix_pg_regress_create_database.patch fixes both issues 
for me.

[1] 

https://github.com/postgres/postgres/blob/ce20f8b9f4354b46b40fd6ebf7ce5c37d08747e0/src/interfaces/ecpg/test/Makefile#L18
[2] 
https://github.com/postgres/postgres/blob/ce20f8b9f4354b46b40fd6ebf7ce5c37d08747e0/src/test/regress/pg_regress.c#L1992

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: ICU for global collation

From

Michael Paquier

Date:

01 November 2022, 07:03:11

On Fri, Oct 21, 2022 at 05:32:38PM +0300, Marina Polyakova wrote:
> Hello!
>
> I discovered an interesting behaviour during installcheck runs when the
> cluster was initialized with ICU locale provider:
>
> $ initdb --locale-provider icu --icu-locale en-US -D data &&
> 2) The option --no-locale in pg_regress is described as "use C locale" [2].
> But in this case the created databases actually use the ICU locale provider
> with the ICU cluster locale from template0 (see
> diff_check_backend_used_provider.patch):
>
> $ make NO_LOCALE=1 installcheck

Yes, this looks wrong on the ground on what -no-locale is expected to
do, aka use a C locale.  Peter?
--
Michael

Attachment

signature.asc