Re: [HACKERS] ICU integration - Mailing list pgsql-hackers

From Andreas Karlsson
Subject Re: [HACKERS] ICU integration
Date
Msg-id 29b8c74d-4029-b7f4-1dae-6e622e9ce70b@proxel.se
Whole thread Raw
In response to Re: [HACKERS] ICU integration  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: [HACKERS] ICU integration  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On 03/15/2017 05:33 PM, Peter Eisentraut wrote:
> Updated patch attached.

Ok, I applied to patch again and ran the tests. I get a test failure on 
make check-world in the pg_dump tests but it can be fixed with the below.

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl 
b/src/bin/pg_dump/t/002_pg_dump.pl
index 3cac4a9ae0..7d9c90363b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2422,7 +2422,7 @@ qr/^\QINSERT INTO test_fifth_table (col1, col2, 
col3, col4, col5) VALUES (NULL,          'CREATE COLLATION test0 FROM "C";',        regexp =>          qr/^
-         \QCREATE COLLATION test0 (lc_collate = 'C', lc_ctype = 'C');\E/xm,
+         \QCREATE COLLATION test0 (provider = 'libc', locale = 'C');\E/xm,        like => {            binary_upgrade
        => 1,            clean                    => 1,
 

>> - I do not like how it is not obvious which is the default version of
>> every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
>> "sv_standard%icu" as one might expect. Is this inherent to ICU or
>> something we can work around?
>
> We get these keywords from ucol_getKeywordValuesForLocale(), which says
> "Given a key and a locale, returns an array of string values in a
> preferred order that would make a difference."  So all those are
> supposedly different from each other.

I believe you are mistaken. The locale "sv" is just an alias for 
"sv-u-standard" as far as I can tell. See the definition of the Swedish 
locale 
(http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt) 
and there are just three collations: reformed (default), standard, 
search (plus eot and emoji which are inherited).

I am also quite annoyed at col-emoji and col-eor (European Ordering 
Rules). They are defined at the root and inherited by all languages, but 
no language modifies col-emoji for their needs which makes it a foot 
gun. See the Danish sorting example below where at least I expected the 
same order. For col-eor it makes a bit more sense since I would expect 
the locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE 
"da-x-icu"; x
---- a k aa
(3 rows)

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE 
"da-u-co-emoji-x-icu"; x
---- a aa k
(3 rows)

It seems ICU has made quite the mess here, and I am not sure to what 
degree we need to handle it to avoid our users getting confused. I need 
to think some of it, and would love input from others. Maybe the right 
thing to do is to ignore the issue with col-emoji, but I would want to 
do something about the default collations.

>> - ICU talks about a new syntax for locale extensions (old:
>> de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
>> http://userguide.icu-project.org/collation/api. Is this something we
>> need to car about? It looks like we currently use the old format, and
>> while I personally prefer it I am not sure we should rely on an old syntax.
>
> Interesting.  I hadn't see this before, and the documentation is sparse.
>  But it seems that this refers to BCP 47 language tags, which seem like
> a good idea.
>
> So what I have done is change all the predefined ICU collations to be
> named after the BCP 47 scheme, with a "private use" -x-icu suffix
> (instead of %icu).  The preserves the original idea but uses a standard
> naming scheme.
>
> I'm not terribly worried that they are going to remove the "old" locale
> naming, but just to be forward looking, I have changed it so that the
> collcollate entries are made using the "new" naming for ICU >=54.

Sounds good.

>> - I get an error when creating a new locale.
>>
>> #CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR:  could not create locale "sv": Success
>>
>> # CREATE COLLATION sv2 (LOCALE = 'sv');
>> ERROR:  could not create locale "sv": Resource temporarily unavailable
>> Time: 1.109 ms
>
> Hmm, that's pretty straightforward code.  What is your operating system?
>  What are the build options?  Does it work without this patch?

This issue is unrelated to ICU. I had forgot to specify provider so the 
eorrs are correct (even though that the value of the errno is weird).

>> - For the collprovider is it really correct to say that 'd' is the
>> default value as it does in catalogs.sgml?
>
> It doesn't say it's the default value, it says it uses the database
> default.  This is all a bit confusing.  We have a collation object named
> "default", which uses the locale set for the database.  That's been that
> way for a while.  Now when introducing the collation providers, that
> "default" collation gets its own collprovider category 'd'.  That is not
> really used anywhere, but we have to put something there.

Ah, I see now. Hm, that is a bit awkward but I cannot think of a cleaner 
alternative.

>> - I do not know what the policy for formatting the documentation is, but
>> some of the paragraphs are in need of re-wrapping.
>
> Hmm, I don't see anything terribly bad.

Maybe it is just me who is sensitive to wrapping, but her is an example 
which sticks out to me with its 98 character line.

+   <para>
+    A collation object provided by <literal>libc</literal> maps to a 
combination
+    of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol> 
settings.  (As     the name would suggest, the main purpose of a collation is to set     <symbol>LC_COLLATE</symbol>,
whichcontrols the sort order.  But     it is rarely necessary in practice to have an     <symbol>LC_CTYPE</symbol>
settingthat is different from     <symbol>LC_COLLATE</symbol>, so it is more convenient to collect     these under one
conceptthan to create another infrastructure for
 
-    setting <symbol>LC_CTYPE</symbol> per expression.)  Also, a collation
+    setting <symbol>LC_CTYPE</symbol> per expression.)  Also, a 
<literal>libc</literal> collation     is tied to a character set encoding (see <xref linkend="multibyte">).     The
samecollation name may exist for different encodings.    </para>
 

>> - Add a hint to "ERROR:  conflicting or redundant options". The error
>> message is pretty unclear.
>
> I don't see that in my patch.  Example?

It is for the below. But after some thinking I am fine not fixing it.

# CREATE COLLATION svi (LOCALE = 'sv', LC_COLLATE = 'sv', PROVIDER =  icu);
ERROR:  conflicting or redundant options

>> - I am not a fan of this patch comparing the encoding with a -1 literal.
>> How about adding -1 as a value to the enum? See the example below for a
>> place where the patch compares with -1.
>
> That's existing practice.  Not a great practice, probably, but widespread.

Ok, if it is widespread in the code then let's keep using it. In a case 
like this consistency is just as important.

>> - The patch adds "FIXME" in the below. Is that a left-over from
>> development or something which still needs doing?
>>
>>      /*
>>       * Also forbid matching an any-encoding entry.  This test of course
>> is not
>>       * backed up by the unique index, but it's not a problem since we don't
>> -    * support adding any-encoding entries after initdb.
>> +    * support adding any-encoding entries after initdb. FIXME
>>       */
>
> I had mentioned that upthread.  It technically needs "doing" as you say,
> but it's not clear how and it's not terribly important, arguably.

The comment is no longer true since for ICU we can do that (it is not an 
issue though). At the very least this comment needs to be updated.

Andreas



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] [COMMITTERS] pgsql: Fix and simplify check for whether we're running as Windows serv
Next
From: Peter Eisentraut
Date:
Subject: [HACKERS] Re: [COMMITTERS] pgsql: Add TAP tests for password-basedauthentication methods.