Character expansion with ICU collations - Mailing list pgsql-hackers

From Finnerty, Jim
Subject Character expansion with ICU collations
Date
Msg-id 5FF0F55B-1593-4FBC-A81A-7F3F6D1E4388@amazon.com
Whole thread Raw
Responses Re: Character expansion with ICU collations
List pgsql-hackers

 

Is there a way to get ‘character expansions’ with the ICU collations that are available in PostgreSQL?

 

Using this example on a database with UTF-8 encoding:

 

CREATE COLLATION CI_AS (provider = icu, locale=’utf8@colStrength=secondary’, deterministic = false);

 

CREATE TABLE MyTable3

(

    ID INT IDENTITY(1, 1),

    Comments VARCHAR(100)

)

 

 

INSERT INTO MyTable3 (Comments) VALUES ('strasse')

INSERT INTO MyTable3 (Comments) VALUES ('straße')

 

SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse'

SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße'

 

We would like to control whether each SELECT statement finds both records (because the sort key of ‘ß’ equals the sort key of ‘ss’), or whether each SELECT statement finds just one record.  ICU supports character expansions and other tailorings that support advanced features like changing the collation order for specific characters, and while CREATE COLLATION doesn’t expose tailoring directives that do either character expansion or specific character reorderings (other than @colReorder to reorder entire categories of characters such as Greek vs Roman) , it seems to be the expectation that many <language> <country> pairs such as en_US should already cause ‘ß’ to match ‘ss’, not just to have them sort close together (which they do).

 

If PostgreSQL supports character expansion with ICU collations, can someone provide an example where 'strasse' = 'straße'?

 

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Decoding speculative insert with toast leaks memory
Next
From: Matthias van de Meent
Date:
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic