Re: ICU for global collation - Mailing list pgsql-hackers

From Marina Polyakova
Subject Re: ICU for global collation
Msg-id 1989d430b926be3c08735f97fffc6294@postgrespro.ru
In response to Re: ICU for global collation  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: ICU for global collation
List pgsql-hackers
On 2022-09-16 07:55, Kyotaro Horiguchi wrote:
> At Thu, 15 Sep 2022 18:41:31 +0300, Marina Polyakova
> <m.polyakova@postgrespro.ru> wrote in
>> P.S. While working on the patch, I discovered that UTF8 encoding is
>> always used for the ICU provider in initdb unless it is explicitly
>> specified by the user:
>> 
>> if (!encoding && locale_provider == COLLPROVIDER_ICU)
>>     encodingid = PG_UTF8;
>> 
>> IMO this creates additional errors for locales with other encodings:
>> 
>> $ initdb --locale de_DE.iso885915@euro --locale-provider icu
>> --icu-locale de-DE
>> ...
>> initdb: error: encoding mismatch
>> initdb: detail: The encoding you selected (UTF8) and the encoding that
>> the selected locale uses (LATIN9) do not match. This would lead to
>> misbehavior in various character string processing functions.
>> initdb: hint: Rerun initdb and either do not specify an encoding
>> explicitly, or choose a matching combination.
>> 
>> And ICU supports many encodings, see the contents of pg_enc2icu_tbl in
>> encnames.c...
> 
> It seems to me to be the best default, one that fits almost all cases
> using ICU locales.
> 
> So, we need to specify encoding explicitly in that case.
> 
> $ initdb --encoding iso-8859-15 --locale de_DE.iso885915@euro
> --locale-provider icu --icu-locale de-DE
> 
> However, I think it is hardly understandable from the documentation.
> 
> (I checked this using euc-jp [1] so it might be wrong..)
> 
> [1] initdb --encoding euc-jp --locale ja_JP.eucjp --locale-provider
> icu --icu-locale ja-x-icu
> 
> regards.

Thank you!

IMO it is hardly understandable from the program output either: it
looks as if I had manually chosen the encoding UTF8. Maybe initdb should
first report the encoding it selected?

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6aeec8d426c52414b827686781c245291f27ed1f..348bbbeba0f5bc7ff601912bf883510d580b814c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2310,7 +2310,11 @@ setup_locale_encoding(void)
      }

      if (!encoding && locale_provider == COLLPROVIDER_ICU)
+    {
          encodingid = PG_UTF8;
+        printf(_("The default database encoding has been set to \"%s\" for a better experience with the ICU provider.\n"),
+               pg_encoding_to_char(encodingid));
+    }
      else if (!encoding)
      {
          int            ctype_enc;

ISTM that such choices (e.g. UTF8 for Windows in some cases) are 
described in the documentation [1] as

By default, initdb uses the locale provider libc, takes the locale 
settings from the environment, and determines the encoding from the 
locale settings. This is almost always sufficient, unless there are 
special requirements.
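
To make the sequence of checks discussed in this thread easier to
follow, here is a rough standalone sketch. It is not the actual initdb
code: provider_t, choose_encoding() and encoding_matches_locale() are
hypothetical names made up for illustration, encodings are plain strings
instead of encoding ids, and the mismatch check is simplified (the real
one has exceptions, e.g. for SQL_ASCII).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the COLLPROVIDER_* values used by initdb. */
typedef enum { PROVIDER_LIBC, PROVIDER_ICU } provider_t;

/*
 * Pick the database encoding (illustrative only).
 *
 * user_encoding:   value of --encoding, or NULL if the user gave none
 * locale_encoding: encoding implied by the libc locale, e.g. "LATIN9"
 */
const char *
choose_encoding(const char *user_encoding, provider_t provider,
                const char *locale_encoding)
{
    if (user_encoding != NULL)
        return user_encoding;   /* an explicit choice always wins */
    if (provider == PROVIDER_ICU)
        return "UTF8";          /* the silent default discussed here */
    return locale_encoding;     /* libc: derive it from the locale */
}

/* Simplified mismatch check: a plain comparison of encoding names. */
bool
encoding_matches_locale(const char *encoding, const char *locale_encoding)
{
    return strcmp(encoding, locale_encoding) == 0;
}
```

In this sketch, choose_encoding(NULL, PROVIDER_ICU, "LATIN9") yields
"UTF8", which then fails encoding_matches_locale() against "LATIN9",
even though the user never asked for UTF8. Passing "LATIN9" explicitly
makes both steps agree.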

[1] https://www.postgresql.org/docs/devel/app-initdb.html

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


