Re: insensitive collations - Mailing list pgsql-hackers

From Jim Finnerty
Subject Re: insensitive collations
Date
Msg-id 1617475666203-0.post@n3.nabble.com
Whole thread Raw
In response to Re: insensitive collations  (Jim Finnerty <jfinnert@amazon.com>)
Responses Re: insensitive collations  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-hackers
My previous post had a flaw, but fixing that flaw led me to what may be a
bug?

Using column COLLATE "C" ILIKE pattern almost does what you'd like, but the
single-character wildcard is treated as a single byte with "C", and that
won't match a multi-byte character.  The fix, I thought, would be to use a
deterministic CS_AS ICU collation, since we can handle the per-character
advance correctly in that case (we think).  Well, maybe not.  It looks like
single-character wildcards using a deterministic ICU collation match a
single byte instead of a single character:

This creates a deterministic case-sensitive, accent-sensitive collation in a
utf8-encoded database:

SET client_encoding = WIN1252;
CREATE COLLATION CS_AS (
    provider = icu,
    locale = 'utf8@colStrength=secondary;colCaseLevel=yes',
    deterministic = true);

CREATE TABLE locations (location VARCHAR(255) COLLATE CS_AS);
CREATE INDEX location_index ON locations (location);

INSERT INTO locations VALUES ('Franche-Comté')
                        , ('Midi-Pyrénées')
                        , ('midi-Pyrénées')
                        , ('midi-Pyrenées')
                        , ('Brian Bruß')
                        , ('Brian Bruss')
                        , ('Steven Sossmix')
                        , ('Provence-Alpes-Côte d Azur');

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comté';
    location
----------------
 Franche-Comté
(1 row)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt_';   --
is this a bug?
 location
----------
(0 rows)

postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__';  --
the wildcard is applied byte by byte instead of character by character, so
the 2-byte accented character is matched only by 2 '_'s
    location
----------------
 Franche-Comté
(1 row)





-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: Re: [PATCH] Implement motd for PostgreSQL
Next
From: Chapman Flack
Date:
Subject: Re: [PATCH] Implement motd for PostgreSQL