Thread: Character expansion with ICU collations

Character expansion with ICU collations

From
"Finnerty, Jim"
Date:

 

Is there a way to get ‘character expansions’ with the ICU collations that are available in PostgreSQL?

 

Using this example on a database with UTF-8 encoding:

 

CREATE COLLATION CI_AS (provider = icu, locale=’utf8@colStrength=secondary’, deterministic = false);

 

CREATE TABLE MyTable3

(

    ID INT IDENTITY(1, 1),

    Comments VARCHAR(100)

)

 

 

INSERT INTO MyTable3 (Comments) VALUES ('strasse')

INSERT INTO MyTable3 (Comments) VALUES ('straße')

 

SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse'

SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße'

 

We would like to control whether each SELECT statement finds both records (because the sort key of ‘ß’ equals the sort key of ‘ss’), or whether each SELECT statement finds just one record.  ICU supports character expansions and other tailorings that support advanced features like changing the collation order for specific characters, and while CREATE COLLATION doesn’t expose tailoring directives that do either character expansion or specific character reorderings (other than @colReorder to reorder entire categories of characters such as Greek vs Roman) , it seems to be the expectation that many <language> <country> pairs such as en_US should already cause ‘ß’ to match ‘ss’, not just to have them sort close together (which they do).

 

If PostgreSQL supports character expansion with ICU collations, can someone provide an example where 'strasse' = 'straße'?

 

Re: Character expansion with ICU collations

From
Peter Eisentraut
Date:
On 09.06.21 17:31, Finnerty, Jim wrote:
> CREATE COLLATION CI_AS (provider = icu, 
> locale=’utf8@colStrength=secondary’, deterministic = false);
> 
> CREATE TABLE MyTable3
> (
> 
>      ID INT IDENTITY(1, 1),
>      Comments VARCHAR(100)
> 
> )
> 
> INSERT INTO MyTable3 (Comments) VALUES ('strasse')
> INSERT INTO MyTable3 (Comments) VALUES ('straße')
> SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse'
> SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße'
> 
> We would like to control whether each SELECT statement finds both 
> records (because the sort key of ‘ß’ equals the sort key of ‘ss’), or 
> whether each SELECT statement finds just one record.

You can have these queries return both rows if you use an 
accent-ignoring collation, like this example in the documentation:

CREATE COLLATION ignore_accents (provider = icu, locale = 
'und-u-ks-level1-kc-true', deterministic = false);



Re: Character expansion with ICU collations

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> You can have these queries return both rows if you use an 
> accent-ignoring collation, like this example in the documentation:

> CREATE COLLATION ignore_accents (provider = icu, locale = 
> 'und-u-ks-level1-kc-true', deterministic = false);

It occurs to me to wonder whether texteq() still obeys transitivity
when using such a collation.

            regards, tom lane



Re: Character expansion with ICU collations

From
"Finnerty, Jim"
Date:
>>
    You can have these queries return both rows if you use an
    accent-ignoring collation, like this example in the documentation:

    CREATE COLLATION ignore_accents (provider = icu, locale =
    'und-u-ks-level1-kc-true', deterministic = false);
<<

Indeed.  Is the dependency between the character expansion capability and accent-insensitive collations documented
anywhere?

Another unexpected dependency appears to be @colCaseFirst=upper.  If specified in combination with
colStrength=secondary,it appears that the upper/lower case ordering is random within a group of characters that are
secondaryequal, e.g. 'A' < 'a', but 'b' < 'B', 'c' < 'C', ...  , but then 'L' < 'l'.  It is not even consistently
orderedwith respect to case.  If I make it a nondeterministic CS_AI collation, then it sorts upper before lower
consistently. The rule seems to be that you can't sort by case within a group that is case-insensitive. 
 

Can a CI collation be ordered upper case first, or is this a limitation of ICU?

For example, this is part of the sort order that I'd like to achieve with ICU, with the code point in column 1 and
dense_rank()shown in the rightmost column indicating that 'b' = 'B', for example:
 

66    B    B    138    151
98    b    b    138    151   <- so within a group that is CI_AS equal, the sort order needs to be upper case first
67    C    C    139    152
99    c    c    139    152
199    Ç    Ç    140    153
231    ç    ç    140    153
68    D    D    141    154
100    d    d    141    154
208    Ð    Ð    142    199
240    ð    ð    142    199
69    E    E    143    155
101    e    e    143    155

Can this sort order be achieved with ICU?

More generally, is there any interest in leveraging the full power of ICU tailoring rules to get whatever order someone
mayneed, subject to the limitations of ICU itself?  what would be required to extend CREATE COLLATION to accept an
optionalsequence of tailoring rules that we would store in the pg_collation catalog and apply along with the modifiers
inthe locale string?
 

    /Jim




Re: Character expansion with ICU collations

From
Peter Eisentraut
Date:
On 11.06.21 22:05, Finnerty, Jim wrote:
>>>
>      You can have these queries return both rows if you use an
>      accent-ignoring collation, like this example in the documentation:
> 
>      CREATE COLLATION ignore_accents (provider = icu, locale =
>      'und-u-ks-level1-kc-true', deterministic = false);
> <<
> 
> Indeed.  Is the dependency between the character expansion capability and accent-insensitive collations documented
anywhere?

The above is merely a consequence of what the default collation elements 
for 'ß' are.

Expansion isn't really a relevant concept in collation.  Any character 
can map to 1..N collation elements.  The collation algorithm doesn't 
care how many it is.

> Can a CI collation be ordered upper case first, or is this a limitation of ICU?

I don't know the authoritative answer to that, but to me it doesn't make 
sense, since the effect of a case-insensitive collation is to throw away 
the third-level weights, so there is nothing left for "upper case first" 
to operate on.

> More generally, is there any interest in leveraging the full power of ICU tailoring rules to get whatever order
someonemay need, subject to the limitations of ICU itself?  what would be required to extend CREATE COLLATION to accept
anoptional sequence of tailoring rules that we would store in the pg_collation catalog and apply along with the
modifiersin the locale string?
 

yes



Re: Character expansion with ICU collations

From
"Finnerty, Jim"
Date:
Re: 
    >> Can a CI collation be ordered upper case first, or is this a limitation of ICU?

    > I don't know the authoritative answer to that, but to me it doesn't make
    > sense, since the effect of a case-insensitive collation is to throw away
    > the third-level weights, so there is nothing left for "upper case first"
    > to operate on.

It wouldn't make sense for the ICU sort key of a CI collation itself because the sort keys need to be binary equal, but
whatthe collation of interest does is equivalent to adding a secondary "C"-collated expression to the ORDER BY clause.
Forexample:
 

SELECT ... ORDER BY expr COLLATE ci_as;

Is ordered as if the query had been written:

SELECT ... ORDER BY expr COLLATE ci_as, expr COLLATE "C";

Re: 
    > tailoring rules
    >> yes

It looks like the relevant API call is ucol_openRules(), 
    Interface documented here: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html
    example usage from C here:
https://android.googlesource.com/platform/external/icu/+/db20b09/source/test/cintltst/citertst.c

for example:

    /* Test with an expanding character sequence */
    u_uastrcpy(rule, "&a < b < c/abd < d");
    c2 = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status);

and a reordering rule test:

    u_uastrcpy(rule, "&z < AB");
    coll = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status);

that looks encouraging.  It returns a UCollator object, like ucol_open(const char *localeString, ...), so it's an
alternativeto ucol_open().  One of the parameters is the equivalent of colStrength, so then the question would be, how
arethe other keyword/value pairs like colCaseFirst, colAlternate, etc. specified via the rules argument?  In the same
waywith the exception of colStrength?
 

e.g. is "colAlternate=shifted;&z < AB" a valid rules string?

The ICU documentation says simply:

" rules    A string describing the collation rules. For the syntax of the rules please see users guide."

Transform rules are documented here: http://userguide.icu-project.org/transforms/general/rules

But there are no examples of using the keyword/value pairs that may appear in a locale string with the transform rules,
andthere's no locale argument on ucol_openRules.  How can the keyword/value pairs that may appear in the locale string
beapplied in combination with tailoring rules (with the exception of colStrength)?
 






Re: Character expansion with ICU collations

From
"Finnerty, Jim"
Date:
I have a proposal for how to support tailoring rules in ICU collations: The ucol_openRules() function is an alternative
tothe ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one if its parameters so
thelocale string would need to be parsed before creating the collator.  After the collator is created using either
ucol_openRulesor ucol_open, the ucol_setAttribute() function may be used to set individual attributes from
keyword=valuepairs in the locale string as it does now, except that the strength probably can't be changed after
openingthe collator with ucol_openRules.  So the logic in pg_locale.c would need to be reorganized a little bit, but
thatsounds straightforward.
 

One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as
colTailoringRules=<rulestring>. Since the <rulestring> may contain single quote characters or PostgreSQL escape
characters,any single quote characters or escapes would need to be escaped using PostgreSQL escape rules.  If
colTailoringRulesis present, colStrength would also be known prior to opening the collator, or would default to
tertiary,and we would keep a local flag indicating that we should not process the colStrength keyword again, if
specified.
 

Representing the TailoringRules as just another keyword=value in the locale string means that we don't need any change
tothe catalog to store it.  It's just part of the locale specification.  I think we wouldn't even need to bump the
catversion.

Are there any tailoring rules, such as expansions and contractions, that we should disallow?  I realize that we don't
handlenondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a
databasewith a UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still
guaranteedthat expr must sort BETWEEN 'a' AND ('a' || E'/uFFFF') ?