Thread: Character expansion with ICU collations
Is there a way to get ‘character expansions’ with the ICU collations that are available in PostgreSQL?
Using this example on a database with UTF-8 encoding:
CREATE COLLATION CI_AS (provider = icu, locale=’utf8@colStrength=secondary’, deterministic = false);
CREATE TABLE MyTable3
(
ID INT IDENTITY(1, 1),
Comments VARCHAR(100)
)
INSERT INTO MyTable3 (Comments) VALUES ('strasse')
INSERT INTO MyTable3 (Comments) VALUES ('straße')
SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse'
SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße'
We would like to control whether each SELECT statement finds both records (because the sort key of ‘ß’ equals the sort key of ‘ss’), or whether each SELECT statement finds just one record. ICU supports character expansions and other tailorings that support advanced features like changing the collation order for specific characters, and while CREATE COLLATION doesn’t expose tailoring directives that do either character expansion or specific character reorderings (other than @colReorder to reorder entire categories of characters such as Greek vs Roman) , it seems to be the expectation that many <language> <country> pairs such as en_US should already cause ‘ß’ to match ‘ss’, not just to have them sort close together (which they do).
If PostgreSQL supports character expansion with ICU collations, can someone provide an example where 'strasse' = 'straße'?
On 09.06.21 17:31, Finnerty, Jim wrote: > CREATE COLLATION CI_AS (provider = icu, > locale=’utf8@colStrength=secondary’, deterministic = false); > > CREATE TABLE MyTable3 > ( > > ID INT IDENTITY(1, 1), > Comments VARCHAR(100) > > ) > > INSERT INTO MyTable3 (Comments) VALUES ('strasse') > INSERT INTO MyTable3 (Comments) VALUES ('straße') > SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse' > SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße' > > We would like to control whether each SELECT statement finds both > records (because the sort key of ‘ß’ equals the sort key of ‘ss’), or > whether each SELECT statement finds just one record. You can have these queries return both rows if you use an accent-ignoring collation, like this example in the documentation: CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > You can have these queries return both rows if you use an > accent-ignoring collation, like this example in the documentation: > CREATE COLLATION ignore_accents (provider = icu, locale = > 'und-u-ks-level1-kc-true', deterministic = false); It occurs to me to wonder whether texteq() still obeys transitivity when using such a collation. regards, tom lane
>> You can have these queries return both rows if you use an accent-ignoring collation, like this example in the documentation: CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false); << Indeed. Is the dependency between the character expansion capability and accent-insensitive collations documented anywhere? Another unexpected dependency appears to be @colCaseFirst=upper. If specified in combination with colStrength=secondary,it appears that the upper/lower case ordering is random within a group of characters that are secondaryequal, e.g. 'A' < 'a', but 'b' < 'B', 'c' < 'C', ... , but then 'L' < 'l'. It is not even consistently orderedwith respect to case. If I make it a nondeterministic CS_AI collation, then it sorts upper before lower consistently. The rule seems to be that you can't sort by case within a group that is case-insensitive. Can a CI collation be ordered upper case first, or is this a limitation of ICU? For example, this is part of the sort order that I'd like to achieve with ICU, with the code point in column 1 and dense_rank()shown in the rightmost column indicating that 'b' = 'B', for example: 66 B B 138 151 98 b b 138 151 <- so within a group that is CI_AS equal, the sort order needs to be upper case first 67 C C 139 152 99 c c 139 152 199 Ç Ç 140 153 231 ç ç 140 153 68 D D 141 154 100 d d 141 154 208 Ð Ð 142 199 240 ð ð 142 199 69 E E 143 155 101 e e 143 155 Can this sort order be achieved with ICU? More generally, is there any interest in leveraging the full power of ICU tailoring rules to get whatever order someone mayneed, subject to the limitations of ICU itself? what would be required to extend CREATE COLLATION to accept an optionalsequence of tailoring rules that we would store in the pg_collation catalog and apply along with the modifiers inthe locale string? /Jim
On 11.06.21 22:05, Finnerty, Jim wrote: >>> > You can have these queries return both rows if you use an > accent-ignoring collation, like this example in the documentation: > > CREATE COLLATION ignore_accents (provider = icu, locale = > 'und-u-ks-level1-kc-true', deterministic = false); > << > > Indeed. Is the dependency between the character expansion capability and accent-insensitive collations documented anywhere? The above is merely a consequence of what the default collation elements for 'ß' are. Expansion isn't really a relevant concept in collation. Any character can map to 1..N collation elements. The collation algorithm doesn't care how many it is. > Can a CI collation be ordered upper case first, or is this a limitation of ICU? I don't know the authoritative answer to that, but to me it doesn't make sense, since the effect of a case-insensitive collation is to throw away the third-level weights, so there is nothing left for "upper case first" to operate on. > More generally, is there any interest in leveraging the full power of ICU tailoring rules to get whatever order someonemay need, subject to the limitations of ICU itself? what would be required to extend CREATE COLLATION to accept anoptional sequence of tailoring rules that we would store in the pg_collation catalog and apply along with the modifiersin the locale string? yes
Re: >> Can a CI collation be ordered upper case first, or is this a limitation of ICU? > I don't know the authoritative answer to that, but to me it doesn't make > sense, since the effect of a case-insensitive collation is to throw away > the third-level weights, so there is nothing left for "upper case first" > to operate on. It wouldn't make sense for the ICU sort key of a CI collation itself because the sort keys need to be binary equal, but whatthe collation of interest does is equivalent to adding a secondary "C"-collated expression to the ORDER BY clause. Forexample: SELECT ... ORDER BY expr COLLATE ci_as; Is ordered as if the query had been written: SELECT ... ORDER BY expr COLLATE ci_as, expr COLLATE "C"; Re: > tailoring rules >> yes It looks like the relevant API call is ucol_openRules(), Interface documented here: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html example usage from C here: https://android.googlesource.com/platform/external/icu/+/db20b09/source/test/cintltst/citertst.c for example: /* Test with an expanding character sequence */ u_uastrcpy(rule, "&a < b < c/abd < d"); c2 = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status); and a reordering rule test: u_uastrcpy(rule, "&z < AB"); coll = ucol_openRules(rule, u_strlen(rule), UCOL_OFF, UCOL_DEFAULT_STRENGTH, NULL, &status); that looks encouraging. It returns a UCollator object, like ucol_open(const char *localeString, ...), so it's an alternativeto ucol_open(). One of the parameters is the equivalent of colStrength, so then the question would be, how arethe other keyword/value pairs like colCaseFirst, colAlternate, etc. specified via the rules argument? In the same waywith the exception of colStrength? e.g. is "colAlternate=shifted;&z < AB" a valid rules string? The ICU documentation says simply: " rules A string describing the collation rules. For the syntax of the rules please see users guide." Transform rules are documented here: http://userguide.icu-project.org/transforms/general/rules But there are no examples of using the keyword/value pairs that may appear in a locale string with the transform rules, andthere's no locale argument on ucol_openRules. How can the keyword/value pairs that may appear in the locale string beapplied in combination with tailoring rules (with the exception of colStrength)?
I have a proposal for how to support tailoring rules in ICU collations: The ucol_openRules() function is an alternative tothe ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one if its parameters so thelocale string would need to be parsed before creating the collator. After the collator is created using either ucol_openRulesor ucol_open, the ucol_setAttribute() function may be used to set individual attributes from keyword=valuepairs in the locale string as it does now, except that the strength probably can't be changed after openingthe collator with ucol_openRules. So the logic in pg_locale.c would need to be reorganized a little bit, but thatsounds straightforward. One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as colTailoringRules=<rulestring>. Since the <rulestring> may contain single quote characters or PostgreSQL escape characters,any single quote characters or escapes would need to be escaped using PostgreSQL escape rules. If colTailoringRulesis present, colStrength would also be known prior to opening the collator, or would default to tertiary,and we would keep a local flag indicating that we should not process the colStrength keyword again, if specified. Representing the TailoringRules as just another keyword=value in the locale string means that we don't need any change tothe catalog to store it. It's just part of the locale specification. I think we wouldn't even need to bump the catversion. Are there any tailoring rules, such as expansions and contractions, that we should disallow? I realize that we don't handlenondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a databasewith a UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still guaranteedthat expr must sort BETWEEN 'a' AND ('a' || E'/uFFFF') ?