Re: ICU for global collation - Mailing list pgsql-hackers

From Marina Polyakova
Subject Re: ICU for global collation
Date
Msg-id 78ac48c77a365908d11d19519cb6ee56@postgrespro.ru
Whole thread Raw
In response to Re: ICU for global collation  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
On 2022-08-17 19:53, Julien Rouhaud wrote:
> Good catch.  There's unfortunately not a lot of regression tests for
> ICU-initialized clusters.  I'm wondering if the build-farm client could 
> be
> taught about the locale provider rather than assuming libc.  Clearly 
> that
> wouldn't have caught this issue, but it should still increase the 
> coverage a
> bit (I'm thinking of the recent problem with the abbreviated keys).

Looking at installchecks with different locales (e.g. see [1] with 
sv_SE.UTF-8) - why not?..

>> diff --git a/src/backend/commands/dbcommands.c
>> b/src/backend/commands/dbcommands.c
>> index 
>> b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9
>> 100644
>> --- a/src/backend/commands/dbcommands.c
>> +++ b/src/backend/commands/dbcommands.c
>> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt 
>> *stmt)
>>                      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>                       errmsg("ICU locale must be specified")));
>>      }
>> +    else
>> +        dbiculocale = NULL;
>> 
>>      if (dblocprovider == COLLPROVIDER_ICU)
>>          check_icu_locale(dbiculocale);
> 
> I think it would be better to do that in the various variable 
> initialization.
> Maybe switch the dblocprovider and dbiculocale initialization, and only
> initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU.

diff --git a/src/backend/commands/dbcommands.c 
b/src/backend/commands/dbcommands.c
index 
b31a30550b025d48ba3cc250dc4c15f41f9a80be..883f381f3453142790f728a3725586cebe2e2f20 
100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1012,10 +1012,10 @@ createdb(ParseState *pstate, const CreatedbStmt 
*stmt)
          dbcollate = src_collate;
      if (dbctype == NULL)
          dbctype = src_ctype;
-    if (dbiculocale == NULL)
-        dbiculocale = src_iculocale;
      if (dblocprovider == '\0')
          dblocprovider = src_locprovider;
+    if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
+        dbiculocale = src_iculocale;

      /* Some encodings are client only */
      if (!PG_VALID_BE_ENCODING(encoding))

Then it seemed to me that it was easier to first get all the parameters 
from the template database as usual and then process them as needed. But 
with your suggestion the failed assertion will check the code above more 
accurately...

> This discrepancy between createdb and CREATE DATABASE looks like an 
> oversight,
> as createdb indeed interprets --locale as:
> 
>     if (locale)
>     {
>         if (lc_ctype)
>             pg_fatal("only one of --locale and --lc-ctype can be specified");
>         if (lc_collate)
>             pg_fatal("only one of --locale and --lc-collate can be specified");
>         lc_ctype = locale;
>         lc_collate = locale;
>     }
> 
> AFAIK the fallback in the CREATE DATABASE case is expected as POSIX 
> locale
> names should be accepted by icu, so this should work for createdb too.

Oh, great, thanks!

> > > I spent some time looking at the ICU api trying to figure out if using a
> > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name.
> > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the
> > > same locale, but I might be wrong.  I also didn't find a way to figure out how
> > > to ask ICU if the locale identifier passed is complete garbage or not.  One
> > > sure thing is that the system collation we import are of the form 'en-us', so
> > > it seems weird to have this form in pg_collation and by default another form in
> > > pg_database.
> >
> > Yeah it seems to be inconsistent about that.  The locale ID documentation
> > appears to indicate that "en_US" is the canonical form, but when you ask it
> > to list all the locales it knows about it returns "en-US".
> 
> Yeah I saw that too when checking is POSIX locale names were valid, and 
> that's
> not great.

I'm sorry but IIUC pg_import_system_collations uses uloc_getAvailable to 
get the locale ID and then specifically calls uloc_toLanguageTag?..

> I don't think that initdb --collation-provider icu should be allowed 
> without
> --icu-locale, same for --collation-provider libc *with* --icu-locale.

> > initdb has some specific processing to transform the default libc locale to
> > something more appropriate, but as far as I can see creatdb / CREATE DATABASE
> > aren't doing that.  It seems inconsistent, and IMHO another reason why
> > defaulting to the libc locale looks like a bad idea.
> 
> This has all been removed.  The separate ICU locale option should now 
> be
> required everywhere (initdb, createdb, CREATE DATABASE).

If it's a feature and not a bug in CREATE DATABASE, why should not it 
work in initdb too? Here we define locale/lc_collate/lc_ctype for the 
first 3 databases in the cluster in much the same way...

P.S. FYI there seems to be a bug for very old ICU versions: in master 
(92fce4e1eda9b24d73f583fbe9b58f4e03f097a4):

$ initdb -D data &&
pg_ctl -D data -l logfile start &&
psql -c "CREATE DATABASE mydb LOCALE \"C.UTF-8\" LOCALE_PROVIDER icu 
TEMPLATE template0" postgres &&
psql -c "SELECT 1" mydb

WARNING:  database "mydb" has a collation version mismatch
DETAIL:  The database was created using collation version 49.192.0.42, 
but the operating system provides version 49.192.5.42.
HINT:  Rebuild all objects in this database that use the default 
collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or 
build PostgreSQL with the right library version.

See the additional output (diff_log_icu_collator_locale.patch) in the 
logfile:

2022-08-20 11:38:30.162 MSK [136546] LOG:  check_icu_locale 
uloc_getDefault() en_US
2022-08-20 11:38:30.162 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  check_icu_locale icu_locale 
C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  get_collation_actual_version 
uloc_getDefault() en_US
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.163 MSK [136546] LOG:  get_collation_actual_version 
icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42
2022-08-20 11:38:30.163 MSK [136546] STATEMENT:  CREATE DATABASE mydb 
LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0
2022-08-20 11:38:30.224 MSK [136548] LOG:  make_icu_collator 
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG:  make_icu_collator icu_locale 
C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] LOG:  get_collation_actual_version 
uloc_getDefault() c
2022-08-20 11:38:30.225 MSK [136548] LOG:  get_collation_actual_version 
icu_locale C.UTF-8 valid_locale root version 49.192.5.42
2022-08-20 11:38:30.225 MSK [136548] WARNING:  database "mydb" has a 
collation version mismatch
2022-08-20 11:38:30.225 MSK [136548] DETAIL:  The database was created 
using collation version 49.192.0.42, but the operating system provides 
version 49.192.5.42.
2022-08-20 11:38:30.225 MSK [136548] HINT:  Rebuild all objects in this 
database that use the default collation and run ALTER DATABASE mydb 
REFRESH COLLATION VERSION, or build PostgreSQL with the right library 
version.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2022-08-18%2006%3A25%3A17

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: shadow variables - pg15 edition
Next
From: Zhihong Yu
Date:
Subject: Re: including pid's for `There are XX other sessions using the database`