Re: Character expansion with ICU collations - Mailing list pgsql-hackers

From Finnerty, Jim
Subject Re: Character expansion with ICU collations
Date
Msg-id 10F78B0E-3C4B-4BF8-9EF0-BEE684F4C8CC@amazon.com
In response to Re: Character expansion with ICU collations  ("Finnerty, Jim" <jfinnert@amazon.com>)
List pgsql-hackers
I have a proposal for how to support tailoring rules in ICU collations. The ucol_openRules() function is an
alternative to the ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one of its
parameters, so the locale string would need to be parsed before creating the collator.  After the collator is created
using either ucol_openRules() or ucol_open(), the ucol_setAttribute() function may be used to set individual attributes
from keyword=value pairs in the locale string, as PostgreSQL does now, except that the strength probably can't be
changed after opening the collator with ucol_openRules().  So the logic in pg_locale.c would need to be reorganized a
little bit, but that sounds straightforward.
 

One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as
colTailoringRules=<rulestring>. Since the <rulestring> may contain single quote characters or PostgreSQL escape
characters, any single quote characters or escapes would need to be escaped using PostgreSQL escape rules.  If
colTailoringRules is present, colStrength would also be known prior to opening the collator, or would default to
tertiary, and we would keep a local flag indicating that we should not process the colStrength keyword again, if
specified.
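
To make the ordering concrete, here is a minimal sketch in plain C of the "parse first, then decide how to open the
collator" step. It is only an illustration under stated assumptions, not the pg_locale.c code: CollatorPlan,
SketchStrength and plan_collator() are hypothetical names, no ICU function is called, and the sketch assumes that
colTailoringRules is the last keyword and consumes the rest of the locale string (because rule strings can themselves
contain ';'). A real patch would have to settle that delimiter question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ICU's strength values; not the ICU enum. */
typedef enum {
    STRENGTH_PRIMARY, STRENGTH_SECONDARY, STRENGTH_TERTIARY,
    STRENGTH_QUATERNARY, STRENGTH_IDENTICAL
} SketchStrength;

/* Hypothetical result of pre-parsing the locale string. */
typedef struct {
    bool use_open_rules;         /* true: would call ucol_openRules() */
    bool skip_strength_keyword;  /* true: colStrength consumed at open time */
    SketchStrength strength;     /* defaults to tertiary */
    const char *rules;           /* points into the locale string, or NULL */
} CollatorPlan;

#define RULES_KEY    "colTailoringRules"   /* keyword proposed in this mail */
#define STRENGTH_KEY "colStrength"

/* Map a strength name of the given length to its enum value. */
static bool parse_strength(const char *s, size_t len, SketchStrength *out)
{
    static const struct { const char *name; SketchStrength value; } table[] = {
        {"primary", STRENGTH_PRIMARY},       {"secondary", STRENGTH_SECONDARY},
        {"tertiary", STRENGTH_TERTIARY},     {"quaternary", STRENGTH_QUATERNARY},
        {"identical", STRENGTH_IDENTICAL},
    };
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strlen(table[i].name) == len && strncmp(s, table[i].name, len) == 0) {
            *out = table[i].value;
            return true;
        }
    }
    return false;
}

/* Scan "locale@key=value;key=value" before any collator is opened.
 * Returns false on a malformed keyword section or unknown strength. */
bool plan_collator(const char *locale, CollatorPlan *plan)
{
    assert(locale != NULL && plan != NULL);
    plan->use_open_rules = false;
    plan->skip_strength_keyword = false;
    plan->strength = STRENGTH_TERTIARY;
    plan->rules = NULL;

    const char *p = strchr(locale, '@');
    if (p == NULL)
        return true;                        /* no keywords: plain ucol_open() */
    p++;

    while (*p != '\0') {
        size_t seglen = strcspn(p, ";");    /* one key=value segment */
        const char *eq = memchr(p, '=', seglen);
        if (eq == NULL)
            return false;
        size_t keylen = (size_t) (eq - p);

        if (keylen == strlen(RULES_KEY) && strncmp(p, RULES_KEY, keylen) == 0) {
            /* Assumption: the rule string runs to the end of the locale string. */
            plan->rules = eq + 1;
            plan->use_open_rules = true;
            plan->skip_strength_keyword = true;
            return true;
        }
        if (keylen == strlen(STRENGTH_KEY) && strncmp(p, STRENGTH_KEY, keylen) == 0) {
            if (!parse_strength(eq + 1, seglen - keylen - 1, &plan->strength))
                return false;
        }
        p += seglen;
        if (*p == ';')
            p++;
    }
    return true;
}
```

With a plan like this in hand, the caller would pick ucol_openRules() (passing plan.strength) when use_open_rules is
set and ucol_open() otherwise, then run the existing keyword loop with ucol_setAttribute(), skipping colStrength
whenever skip_strength_keyword is true.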
 

Representing the tailoring rules as just another keyword=value pair in the locale string means that we don't need any
change to the catalog to store them.  They are just part of the locale specification.  I think we wouldn't even need to
bump the catversion.

Are there any tailoring rules, such as expansions and contractions, that we should disallow?  I realize that we don't
handle nondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a
database with a UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still
guaranteed that expr must sort BETWEEN 'a' AND ('a' || E'\uFFFF') ?
 

