Re: Wrong results with equality search using trigram index and non-deterministic collation - Mailing list pgsql-hackers

From Laurenz Albe
Subject Re: Wrong results with equality search using trigram index and non-deterministic collation
Date
Msg-id 79e467049c97b0b96f1b5d714644a1f09cef1c54.camel@cybertec.at
Whole thread
In response to Re: Wrong results with equality search using trigram index and non-deterministic collation  (David Geier <geidav.pg@gmail.com>)
Responses Re: Wrong results with equality search using trigram index and non-deterministic collation
List pgsql-hackers
On Mon, 2026-05-04 at 13:53 +0200, David Geier wrote:
> >
> Attached patch makes your case work, including the % case. It builds on
> top of the other patches from [1] that makes pg_trgm use the inferred
> collation trigram extraction.
>
> Instead of using btint4cmp() to compare trigrams, the patch uses a
> collation-aware string comparison function.

Thanks!  I tried your patch, and it does indeed fix the bug I reported.

I looked at your patch, and it is pretty straightforward.
("git am" complained about an empty line at the end of
"pg_trgm--1.6--1.7.sql", but that's merely cosmetic.)

> This is just a PoC. I haven't given much thought to the details but e.g.
> when three consecutive characters exceed 3 bytes then compact_trigram()
> uses a truncated 32-bit hash value as trigram instead. Such trigrams
> won't work in all cases. We could omit them from the query string but
> for languages where the majority of trigrams are hashed or where the
> query string consists of only a few trigrams, the look-up performance
> would suffer.

Does that mean that you could end up with wrong results (which would not
be acceptable), or that you could end up with false positives that
later get eliminated by the recheck (which would be fine)?

I am worried about collations that have digraphs - the letters would be
split when trigrams are formed, and that might cause trouble.

And indeed, I am able to break it with a "quadrigraph":

  CREATE COLLATION crazy (
     PROVIDER = icu,
     LOCALE = 'da-DK',
     DETERMINISTIC = FALSE,
     RULES = '& a = zzzz'
  );

  CREATE TABLE boom2 (id integer PRIMARY KEY, t text COLLATE crazy);

  INSERT INTO boom2 VALUES (1, 'myad'), (2, 'myzzzzd');

  SELECT * FROM boom2 WHERE t = 'myad';

   id │    t
  ════╪═════════
    1 │ myad
    2 │ myzzzzd
  (2 rows)

  CREATE INDEX trgm_idx2 ON boom2 USING gin (t gin_trgm_ops);

  SET enable_seqscan = off;

  SELECT * FROM boom2 WHERE t = 'myad';

   id │  t
  ════╪══════
    1 │ myad
  (1 row)

> I guess better would be using a collation-aware hash function that maps
> different values that compare equal to the same hash value. hashtext()
> does that already. The new comparison function would then have to
> distinguish between plain text trigrams and hash trigrams.
> Alternatively, we could store all trigrams as hashes but that would
> break functions such as show_trgm().

But that would probably not fix the above problem, right?

My initial thought about this bug was to just not consider a trigram
index if a non-deterministic collation is involved, but I can't see
how that could be done in the planner.

Still, I think that the first two patches of your set do the right thing.

Yours,
Laurenz Albe



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Why clearing the VM doesn't require registering vm buffer in wal record
Next
From: Ayush Tiwari
Date:
Subject: Re: [PATCH] Fix duplicate errmsg in ALTER TABLE SPLIT PARTITION