Re: ICU for global collation - Mailing list pgsql-hackers
From | Julien Rouhaud |
---|---|
Subject | Re: ICU for global collation |
Date | |
Msg-id | 20220107090329.lnsy6tnodrrbmyy5@jrouhaud Whole thread Raw |
In response to | Re: ICU for global collation (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Responses |
Re: ICU for global collation
Re: ICU for global collation |
List | pgsql-hackers |
Hi, I looked a bit more in this patch and I have some additional remarks. On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote: > > So this is a different approach: If you choose ICU as the default locale for > a database, you still need to specify lc_ctype and lc_collate settings, as > before. Unlike in the previous patch, where the ICU collation name was > written in datcollate, there is now a third column (daticucoll), so we can > store all three values. This fixes the described problem. Other than that, > once you get all the initial settings right, it basically just works: The > places that have ICU support now will use a database-wide ICU collation if > appropriate, the places that don't have ICU support continue to use the > global libc locale settings. So just to confirm a database can now have 2 different *default* collations: a libc-based one for everything that doesn't work with ICU and a ICU-based (if specified) for everything else, and the ICU-based is optional, so if not provided everything works as before, using the libc based default collation. As I mentioned I think this approach is sensible. However, should we document what are the things that are not ICU-aware? > I changed the datcollate, datctype, and the new daticucoll fields to type > text (from name). That way, the daticucoll field can be set to null if it's > not applicable. Also, the limit of 63 characters can actually be a problem > if you want to use some combination of the options that ICU locales offer. > And for less extreme uses, having variable-length fields will save some > storage, since typical locale names are much shorter. I understand the need to have daticucoll as text, however it's not clear to me why this has to be changed for datcollate and datctype? IIUC those will only ever contain libc-based collation and are still mandatory? > > For the same reasons and to keep things consistent, I also changed the > analogous pg_collation fields like that. The respective fields in pg_collation are now nullable, so the changes there sounds ok. Digging a bit more in the patch here are some things that looks problematic. - pg_upgrade It checks (in check_locale_and_encoding()) the compatibility for each database, and it looks like the daticucoll field should also be verified. Other than that I don't think there is anything else needed for the pg_upgrade part as everything else should be handled by pg_dump (I didn't look at the changes yet given the below problems). - CREATE DATABASE There's a new COLLATION_PROVIDER option, but the overall behavior seems quite unintuitive. As far as I can see the idea is to use LOCALE for the ICU default collation, but as it's providing a default for the other values it's quite annoying. For instance: =# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'fr-x-icu' LC_COLLATE 'en_GB.UTF-8';; ERROR: 42809: invalid locale name: "fr-x-icu" LOCATION: createdb, dbcommands.c:397 Looking at the code it's actually complaining about LC_CTYPE. If you want a database with an ICU default collation the lc_collate and lc_ctype should inherit what's in the template database and not what was provided in the LOCALE I think. You could still probably overload them in some scenario, but without a list of what isn't ICU-aware I can't really be sure of how often one might have to do it. Now, if I specify everything as needed it looks like it's missing some checks on the ICU default collation when not using template0: =# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';; CREATE DATABASE =# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu (4 rows) Unless I'm missing something the same concerns about collation incompatibility with objects in the source database should also apply for the ICU collation? While at it, I'm not exactly sure of what the COLLATION_PROVIDER is supposed to mean, as the same commands but with a libc provider is accepted and has the exact same result: =# CREATE DATABASE db2 COLLATION_PROVIDER libc LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';; CREATE DATABASE =# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu db2 | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu (5 rows) Shouldn't db2 have a NULL daticucoll, and if so also complain about incompatibility for it? - initdb I don't think that initdb --collation-provider icu should be allowed without --icu-locale, same for --collation-provider libc *with* --icu-locale. When trying that, I can also see that the NULL handling for daticucoll is broken in the BKI: $ initdb -k --collation-provider icu [...] Success. You can now start the database server using: =# select datname, datcollate, datctype, daticucoll from pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | en_GB template1 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB template0 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB (3 rows) There's a fallback on my LANG/LC_* env settings, but I don't think it can ever be correct given the different naming convention in ICU (at least the s/_/-/). And $ initdb -k --collation-provider libc --icu-locale test [...] Success. You can now start the database server using: =# select datname, datcollate, datctype, daticucoll from pg_database ; datname | datcollate | datctype | daticucoll -----------+-------------+-------------+------------ postgres | en_GB.UTF-8 | en_GB.UTF-8 | _null_ template1 | en_GB.UTF-8 | en_GB.UTF-8 | _null_ template0 | en_GB.UTF-8 | en_GB.UTF-8 | _null_ (3 rows)
pgsql-hackers by date: