Home > mailing lists

Re: insensitive collations - Mailing list pgsql-hackers

From	Jim Finnerty
Subject	Re: insensitive collations
Date	April 3, 2021 21:47:46
Msg-id	1617475666203-0.post@n3.nabble.com Whole thread Raw
In response to	Re: insensitive collations (Jim Finnerty <jfinnert@amazon.com>)
Responses	Re: insensitive collations
List	pgsql-hackers

Tree view

My previous post had a flaw, but fixing that flaw led me to what may be a
bug?

Using column COLLATE "C" ILIKE pattern almost does what you'd like, but the
single-character wildcard is treated as a single byte with "C", and that
won't match a multi-byte character.  The fix, I thought, would be to use a
deterministic CS_AS ICU collation, since we can handle the per-character
advance correctly in that case (we think).  Well, maybe not.  It looks like
single-character wildcards using a deterministic ICU collation match a
single byte instead of a single character:

This creates a deterministic case-sensitive, accent-sensitive collation in a
utf8-encoded database:

SET client_encoding = WIN1252;
CREATE COLLATION CS_AS (
    provider = icu,
    locale = 'utf8@colStrength=secondary;colCaseLevel=yes',
    deterministic = true);

CREATE TABLE locations (location VARCHAR(255) COLLATE CS_AS);
CREATE INDEX location_index ON locations (location);

INSERT INTO locations VALUES ('Franche-Comté')
                        , ('Midi-Pyrénées')
                        , ('midi-Pyrénées')
                        , ('midi-Pyrenées')
                        , ('Brian Bruß')
                        , ('Brian Bruss')
                        , ('Steven Sossmix')
                        , ('Provence-Alpes-Côte d Azur');

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comté';
    location
----------------
 Franche-Comté
(1 row)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt_';   --
is this a bug?
 location
----------
(0 rows)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__';  --
the wildcard is applied byte by byte instead of character by character, so
the 2-byte accented character is matched only by 2 '_'s
    location
----------------
 Franche-Comté
(1 row)





-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

pgsql-hackers by date:

From: "Joel Jacobson"
Date: 03 April 2021, 21:24:10
Subject: Re: [PATCH] Implement motd for PostgreSQL

From: Chapman Flack
Date: 03 April 2021, 21:48:53
Subject: Re: [PATCH] Implement motd for PostgreSQL

Re: insensitive collations - Mailing list pgsql-hackers

Previous

Next