Hi hackers,
In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
for converting trigrams to lower-case. Here are some examples where
today the collation is ignored:
CREATE EXSTENSION pg_trgm;
CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
CREATE TABLE test(col TEXT COLLATE "turkish");
INSERT INTO test VALUES ('ISTANBUL');
postgres=# select show_trgm(col) FROM test;
show_trgm
---------------------------------------------
{" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
0.5
If the database is initialized via initdb --locale="tr_TR.utf8", the
output changes:
postgres=# SELECT show_trgm('ISTANBUL');
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
and
postgres=# select show_trgm(col) FROM test;
show_trgm
--------------------------------------------------------
{0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
COLLATE "turkish");
similarity
------------
1
tr_TR.utf8 converts capital I to ı which is a multibyte character, while
my default collation converts I to i.
The attached patch attempts to fix that. I grepped for all occurrences
of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
collation OID instead DEFAULT_COLLATION_OID.
The corresponding regression tests pass.
[1]
https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
--
David Geier