Thread: ICU_LOCALE set database default icu collation but not working as intended.
Base on this thread: https://www.postgresql.org/message-id/20220305083830.lpz3k3yku5lmm5xs%40jrouhaud
ordering reference: https://unicode-org.github.io/cldr-staging/charts/latest/collation/en_US_POSIX.html
CREATE DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE 'en-u-kf-upper' TEMPLATE 'template0';
CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE 'en-u-kr-latn-digit' TEMPLATE 'template0';
--same script apply to dbicu1 dbicu2
BEGIN;
CREATE COLLATION upperfirst (
provider = icu,
locale = 'en-u-kf-upper'
);
CREATE TABLE icu (
def text,
en text COLLATE "en_US",
upfirst text COLLATE upperfirst,
test_kr text
);
INSERT INTO icu
VALUES ('a', 'a', 'a', '1 a'), ('b', 'b', 'b', 'A 11'), ('A', 'A', 'A', 'A 19'), ('B', 'B', 'B', '8 p');
INSERT INTO icu
VALUES ('a', 'a', 'a', 'a 7');
INSERT INTO icu
VALUES ('a', 'a', 'a', 'Œ 1');
COMMIT;
CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE 'en-u-kr-latn-digit' TEMPLATE 'template0';
--same script apply to dbicu1 dbicu2
BEGIN;
CREATE COLLATION upperfirst (
provider = icu,
locale = 'en-u-kf-upper'
);
CREATE TABLE icu (
def text,
en text COLLATE "en_US",
upfirst text COLLATE upperfirst,
test_kr text
);
INSERT INTO icu
VALUES ('a', 'a', 'a', '1 a'), ('b', 'b', 'b', 'A 11'), ('A', 'A', 'A', 'A 19'), ('B', 'B', 'B', '8 p');
INSERT INTO icu
VALUES ('a', 'a', 'a', 'a 7');
INSERT INTO icu
VALUES ('a', 'a', 'a', 'Œ 1');
COMMIT;
-----------------------
--dbicu1
SELECT def AS def FROM icu ORDER BY def; --since only character. all works fine.
SELECT test_kr FROM icu ORDER BY def;
/*
test_kr
---------
A 19
1 a
a 7
Œ 1
8 p
A 11
*/
--dbicu2
SELECT def AS def FROM icu ORDER BY def; --since only character. all works fine.
SELECT test_kr FROM icu ORDER BY def;
/*
test_kr
---------
1 a
a 7
Œ 1
A 19
A 11
8 p
(6 rows)
*/
SELECT def AS def FROM icu ORDER BY def; --since only character. all works fine.
SELECT test_kr FROM icu ORDER BY def;
/*
test_kr
---------
A 19
1 a
a 7
Œ 1
8 p
A 11
*/
--dbicu2
SELECT def AS def FROM icu ORDER BY def; --since only character. all works fine.
SELECT test_kr FROM icu ORDER BY def;
/*
test_kr
---------
1 a
a 7
Œ 1
A 19
A 11
8 p
(6 rows)
*/
Since dbicu1 and dbicu2 set the default collation then
In dbicu1, I should expect the ordering:
number >> Upper case alphabet letter >> lower case alphabet letter>> character Œ (U+0153)
In dbicu2, I should expect the ordering:
lower case alphabet letter >> Upper case alphabet letter >> number >> character Œ (U+0153)
As you can see one letter works fine for dbicu1, dbicu2. However, it does not work on more characters.
Or The result is correct, but something I misunderstood?
I am not sure this is my personal misunderstanding.
In the above examples, the first character of column test_kr
is so different that the comparison is based on the first letter.
If the first letter is the same then compute the second letter..
So for whatever collation, I should expect 'A 19' to be adjacent with 'A 11'?
In the above examples, the first character of column test_kr
is so different that the comparison is based on the first letter.
If the first letter is the same then compute the second letter..
So for whatever collation, I should expect 'A 19' to be adjacent with 'A 11'?
--
I recommend David Deutsch's <<The Beginning of Infinity>>
Jian
Re: ICU_LOCALE set database default icu collation but not working as intended.
From
"Daniel Verite"
Date:
jian he wrote: > CREATE > DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE > 'en-u-kf-upper' TEMPLATE 'template0'; > CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE > 'en-u-kr-latn-digit' TEMPLATE 'template0'; > [...] > I am not sure this is my personal misunderstanding. > In the above examples, the first character of column *test_kr* > is so different that the comparison is based on the first letter. > If the first letter is the same then compute the second letter.. > So for whatever collation, I should expect 'A 19' to be adjacent with 'A > 11'? The query "SELECT test_kr FROM icu ORDER BY def;" does not order by test_kr, so the contents of test_kr have no bearing on the order of the results. If you order by test_kr, the results look like what you're expecting: dbicu1=# SELECT test_kr,def FROM icu ORDER BY test_kr; test_kr | def ---------+----- 1 a | a 8 p | B A 11 | b A 19 | A a 7 | a Œ 1 | a dbicu2=# SELECT test_kr,def FROM icu ORDER BY test_kr ; test_kr | def ---------+----- A 11 | b A 19 | A a 7 | a Œ 1 | a 1 a | a 8 p | B Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Hi, here are some other trigger cases.
CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kf-upper-kn-true' TEMPLATE 'template0';
CREATE DATABASE dbicu4 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kn-true' TEMPLATE 'template0';
--mistake
CREATE DATABASE dbicu5 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kr-upper' TEMPLATE 'template0';
CREATE DATABASE dbicu6 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kf-upper' TEMPLATE 'template0';
ICU_LOCALE 'en-u-kr-latn-digit-kf-upper-kn-true' TEMPLATE 'template0';
CREATE DATABASE dbicu4 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kn-true' TEMPLATE 'template0';
--mistake
CREATE DATABASE dbicu5 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kr-upper' TEMPLATE 'template0';
CREATE DATABASE dbicu6 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8'
ICU_LOCALE 'en-u-kr-latn-digit-kf-upper' TEMPLATE 'template0';
--same script applies to dbicu3, dbicu4, dbicu5, dbicu6.
begin;
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst, test_kr text);
INSERT INTO icu VALUES ('a', 'a', 'a', '1 a'), ('b','b','b', 'A 11'), ('A','A','A','A 19'), ('B','B','B', '8 p');
INSERT INTO icu VALUES ('a', 'a', 'a', 'a 7'),('a', 'a', 'a', 'a 117');
INSERT INTO icu VALUES ('a', 'a', 'a', 'a 70'), ('a', 'a', 'a', 'A 70');
INSERT INTO icu VALUES ('a', 'a', 'a', 'Œ 1');
commit ;
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE TABLE icu(def text, en text COLLATE "en_US", upfirst text COLLATE upperfirst, test_kr text);
INSERT INTO icu VALUES ('a', 'a', 'a', '1 a'), ('b','b','b', 'A 11'), ('A','A','A','A 19'), ('B','B','B', '8 p');
INSERT INTO icu VALUES ('a', 'a', 'a', 'a 7'),('a', 'a', 'a', 'a 117');
INSERT INTO icu VALUES ('a', 'a', 'a', 'a 70'), ('a', 'a', 'a', 'A 70');
INSERT INTO icu VALUES ('a', 'a', 'a', 'Œ 1');
commit ;
-----------------------
localhost:5433 admin@dbicu3=# SELECT test_kr FROM icu ORDER BY test_kr ;
test_kr
---------
a 7
A 11
A 19
A 70
a 70
a 117
Œ 1
1 a
8 p
(9 rows)
---------
a 7
A 11
A 19
A 70
a 70
a 117
Œ 1
1 a
8 p
(9 rows)
--------------------------------------
localhost:5433 admin@dbicu4=# SELECT test_kr FROM icu ORDER BY test_kr ;
test_kr
---------
a 7
A 11
A 19
a 70
A 70
a 117
Œ 1
1 a
8 p
(9 rows)
test_kr
---------
a 7
A 11
A 19
a 70
A 70
a 117
Œ 1
1 a
8 p
(9 rows)
------------------------------------------------------------------------
localhost:5433 admin@dbicu6=# SELECT test_kr FROM icu ORDER BY test_kr ;
test_kr
---------
A 11
a 117
A 19
a 7
A 70
a 70
Œ 1
1 a
8 p
(9 rows)
test_kr
---------
A 11
a 117
A 19
a 7
A 70
a 70
Œ 1
1 a
8 p
(9 rows)
-----------------------------------------------------------------------------
- dbicu3, ICU_LOCALE 'en-u-kr-latn-digit-kf-upper-kn-true' seems 'kf-upper' not grouped strings beginning with character 'A' together?
- dbicu4, ICU_LOCALE 'en-u-kr-latn-digit-kn-true' since upper/lower not explicitly mentioned, and since the collation is deterministic, so character 'A' should be grouped together first then do the numeric value comparison.
- dbicu6, ICU_LOCALE 'en-u-kr-latn-digit-kf-upper' , from the result, kr-latn-digit is working as intended. But kf-upper seems not working.
maybe this link( https://www.unicode.org/reports/tr35/tr35-collation.html#314-case-parameters ) can help.
Can I specify as many key-value settings options (https://www.unicode.org/reports/tr35/tr35-collation.html#table-collation-settings) as I want in ICU_LOCALE while I create a new database?
On Fri, May 27, 2022 at 1:44 AM Daniel Verite <daniel@manitou-mail.org> wrote:
jian he wrote:
> CREATE
> DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE
> 'en-u-kf-upper' TEMPLATE 'template0';
> CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'en_US.UTF-8' ICU_LOCALE
> 'en-u-kr-latn-digit' TEMPLATE 'template0';
> [...]
> I am not sure this is my personal misunderstanding.
> In the above examples, the first character of column *test_kr*
> is so different that the comparison is based on the first letter.
> If the first letter is the same then compute the second letter..
> So for whatever collation, I should expect 'A 19' to be adjacent with 'A
> 11'?
The query "SELECT test_kr FROM icu ORDER BY def;"
does not order by test_kr, so the contents of test_kr have no bearing
on the order of the results.
If you order by test_kr, the results look like what you're expecting:
dbicu1=# SELECT test_kr,def FROM icu ORDER BY test_kr;
test_kr | def
---------+-----
1 a | a
8 p | B
A 11 | b
A 19 | A
a 7 | a
Œ 1 | a
dbicu2=# SELECT test_kr,def FROM icu ORDER BY test_kr ;
test_kr | def
---------+-----
A 11 | b
A 19 | A
a 7 | a
Œ 1 | a
1 a | a
8 p | B
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
--
I recommend David Deutsch's <<The Beginning of Infinity>>
Jian
Re: ICU_LOCALE set database default icu collation but not working as intended.
From
"Daniel Verite"
Date:
jian he wrote: > - dbicu3, ICU_LOCALE 'en-u-kr-latn-digit-kf-upper-kn-true' seems > 'kf-upper' not grouped strings beginning with character 'A' together? You seem to expect that the sort algorithm takes characters from left to right, and when it compares 'A' and 'a', it will sort the string with the 'A' before, no matter what other characters are in the rest of the string. I don't think that's what kf-upper does. I think kf-upper kicks in only for strings that are identical at the secondary level. In your example, its effect is to make 'A 70' sort before 'a 70' . The other strings are unaffected. > - dbicu4, ICU_LOCALE 'en-u-kr-latn-digit-kn-true' since upper/lower not > explicitly mentioned, and since the collation is deterministic, so > character 'A' should be grouped together first then do the numeric value The deterministic property is only relevant when strings are compared equal by ICU. Since your collations use the default strength setting (tertiary) and the strings in your example are all different at this level, the fact that the collation is deterministic does not play a role in the results. Besides, the TR35 doc says for "kn" (numeric ordering) "If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44]) is sorted at a primary level with its numeric value" which means that the order of numbers (7, 11, 19, 70, 117) is "stronger" (primary level) than the relative order of the 'a' and 'A' (case difference=secondary level) that precede them. That's why these numbers drive the sort for these strings that are otherwise identical at the primary level. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite