Re: ICU for global collation - Mailing list pgsql-hackers

From Marina Polyakova
Subject Re: ICU for global collation
Msg-id 79f410460c4fc9534000785adb8bf39a@postgrespro.ru
In response to Re: ICU for global collation  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
Responses Re: ICU for global collation  (Marina Polyakova <m.polyakova@postgrespro.ru>)
List pgsql-hackers
On 2022-10-01 15:07, Peter Eisentraut wrote:
> On 22.09.22 20:06, Marina Polyakova wrote:
>> On 2022-09-21 17:53, Peter Eisentraut wrote:
>>> Committed with that test, thanks.  I think that covers all the ICU
>>> issues you reported for PG15 for now?
>> 
>> I thought about the order of the ICU checks - if it is ok to check 
>> that the selected encoding is supported by ICU after printing all the 
>> locale & encoding information, why not to move almost all the ICU 
>> checks here?..
> 
> It's possible that we can do better, but I'm not going to add things
> like that to PG 15 at this point unless it fixes a faulty behavior.

Will PG 15 always keep this order of ICU checks, i.e. is the current 
behaviour correct enough? On the other hand, there may be a better fix 
for PG 16+, and not all changes can be backported...

On 2022-09-16 10:56, Peter Eisentraut wrote:
> On 15.09.22 17:41, Marina Polyakova wrote:
>> I agree with you. Here's another version of the patch. The 
>> locale/encoding checks and reports in initdb have been reordered, 
>> because now the encoding is set first and only then the ICU locale is 
>> checked.
> 
> I committed something based on the first version of your patch.  This
> reordering of the messages here was a little too much surgery for me
> at this point.  For instance, there are also messages in #ifdef WIN32
> code that would need to be reordered as well.  I kept the overall
> structure of the code the same and just inserted the additional
> proposed checks.
> 
> If you want to pursue the reordering of the checks and messages
> overall, a patch for the master branch could be considered.

I've worked on this again (see attached patch), but I'm not sure whether 
the encoding mismatch messages are clear enough without the full locale 
information. For

$ initdb -D data --icu-locale en --locale-provider icu

compare the outputs:

The database cluster will be initialized with this locale configuration:
   provider:    icu
   ICU locale:  en
   LC_COLLATE:  de_DE.iso885915@euro
   LC_CTYPE:    de_DE.iso885915@euro
   LC_MESSAGES: en_US.utf8
   LC_MONETARY: de_DE.iso885915@euro
   LC_NUMERIC:  de_DE.iso885915@euro
   LC_TIME:     de_DE.iso885915@euro
The default database encoding has been set to "UTF8".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that 
the selected locale uses (LATIN9) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

and

Encoding "UTF8" implied by locale will be set as the default database 
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that 
the selected locale uses (LATIN9) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

The same applies without ICU, e.g. for

$ initdb -D data

the output with locale information:

The database cluster will be initialized with this locale configuration:
   provider:    libc
   LC_COLLATE:  en_US.utf8
   LC_CTYPE:    de_DE.iso885915@euro
   LC_MESSAGES: en_US.utf8
   LC_MONETARY: de_DE.iso885915@euro
   LC_NUMERIC:  de_DE.iso885915@euro
   LC_TIME:     de_DE.iso885915@euro
The default database encoding has accordingly been set to "LATIN9".
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that 
the selected locale uses (UTF8) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.

and the "shorter" version:

Encoding "LATIN9" implied by locale will be set as the default database 
encoding.
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (LATIN9) and the encoding that 
the selected locale uses (UTF8) do not match. This would lead to 
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding 
explicitly, or choose a matching combination.
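
To make the two cases above concrete, here is a rough Python sketch of the 
comparison behind these messages. It is not initdb's code: the function 
names and the small codeset table are assumptions made for illustration, 
and the locale's encoding is guessed from the locale name only to keep the 
sketch self-contained.

```python
# Illustrative sketch only: these names are hypothetical, not initdb's.
# Tiny table for the two codesets that appear in the outputs above.
CODESET_TO_ENCODING = {
    "utf8": "UTF8",
    "utf-8": "UTF8",
    "iso885915": "LATIN9",
    "iso-8859-15": "LATIN9",
}


def locale_encoding(locale_name):
    """Guess the encoding from a name such as 'de_DE.iso885915@euro'."""
    base = locale_name.split("@", 1)[0]      # drop the '@euro' modifier
    if "." not in base:
        return None                          # no codeset in the name
    codeset = base.split(".", 1)[1].lower()
    return CODESET_TO_ENCODING.get(codeset)  # None if the codeset is unknown


def mismatch_detail(selected, locale_name):
    """Return the detail text on a mismatch, or None if the pair is fine."""
    implied = locale_encoding(locale_name)
    if implied is None or implied == selected:
        return None
    return (f"The encoding you selected ({selected}) and the encoding that "
            f"the selected locale uses ({implied}) do not match.")


# ICU case: UTF8 was chosen, but LC_CTYPE is a LATIN9 locale.
print(mismatch_detail("UTF8", "de_DE.iso885915@euro"))
# libc case: LATIN9 follows from LC_CTYPE, but LC_COLLATE is a UTF8 locale.
print(mismatch_detail("LATIN9", "en_US.utf8"))
```

In both cases the detail line names only the two encodings, not the locale 
setting that caused the conflict, which is why I wonder whether the shorter 
output is enough.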

BTW, what did you mean by "there are also messages in #ifdef WIN32 
code that would need to be reordered as well"?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment