My previous post had a flaw, but fixing that flaw led me to what may be a
bug?
Using column COLLATE "C" ILIKE pattern almost does what you'd like, but the
single-character wildcard is treated as a single byte with "C", and that
won't match a multi-byte character. The fix, I thought, would be to use a
deterministic CS_AS ICU collation, since we can handle the per-character
advance correctly in that case (we think). Well, maybe not. It looks like
single-character wildcards using a deterministic ICU collation match a
single byte instead of a single character:
This creates a deterministic case-sensitive, accent-sensitive collation in a
utf8-encoded database:
SET client_encoding = WIN1252;
CREATE COLLATION CS_AS (
provider = icu,
locale = 'utf8@colStrength=secondary;colCaseLevel=yes',
deterministic = true);
CREATE TABLE locations (location VARCHAR(255) COLLATE CS_AS);
CREATE INDEX location_index ON locations (location);
INSERT INTO locations VALUES ('Franche-Comté')
, ('Midi-Pyrénées')
, ('midi-Pyrénées')
, ('midi-Pyrenées')
, ('Brian Bruß')
, ('Brian Bruss')
, ('Steven Sossmix')
, ('Provence-Alpes-Côte d Azur');
postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comté';
location
----------------
Franche-Comté
(1 row)
postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt_'; --
is this a bug?
location
----------
(0 rows)
postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__'; --
the wildcard is applied byte by byte instead of character by character, so
the 2-byte accented character is matched only by 2 '_'s
location
----------------
Franche-Comté
(1 row)
-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html