Re: Wrong results with equality search using trigram index and non-deterministic collation - Mailing list pgsql-hackers

From David Geier
Subject Re: Wrong results with equality search using trigram index and non-deterministic collation
Date
Msg-id 9b850976-8f0d-4957-9308-e1c053a35559@gmail.com
Whole thread
In response to Re: Wrong results with equality search using trigram index and non-deterministic collation  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-hackers
>> I think we should change that. It's very counter intuitive that a query
>> can change behavior when the planner flips from using e.g. a Seq Scan to
>> a Bitmap Index Scan or the other way around. There's already a patch for
>> that, see [1].
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com
> 
> 
> That's not only unintuitive, it is a clear bug.
> An index is not allowed to change the semantics of a query.
> 
> Does your patch fix the bug, that is, will the query with "WHERE t = 'right'"
> return both results?  That's the case that is mostly in need of fixing.
> I am not sure if the behavior for the % operator should also be considered
> a bug.

Not yet :).

>>> I don't know what the correct fix would be.  Perhaps just refusing to use
>>> the index for equality comparisons with non-deterministic collations.
>>
>> If we merge [1], then not only = but also LIKE would be incorrect. How
>> about disabling CREATE INDEX USING gin on columns with non-deterministic
>> collations?
> 
> Oh, I see.  So your patch won't fix the bug.

Indeed.

> I am not sure if refusing to *create* the index is the best solution.
> Perhaps a warning will be better:
> 
> WARNING:  GIN indexes won't be used columns with non-deterministic collations

At this point we could more generally say

WARNING:  pg_trgm does not use inferred column collations but always
uses the default database collation.

But I'm hoping the patch in [1] gets merged and we can also use the
inferred collation to fix the bug you found.

>> Or is there maybe a way to make these cases work correctly for
>> non-deterministic collations by applying the collation when extracting
>> the search trigrams? I take a look into that.

Attached patch makes your case work, including the % case. It builds on
top of the other patches from [1] that makes pg_trgm use the inferred
collation trigram extraction.

Instead of using btint4cmp() to compare trigrams, the patch uses a
collation-aware string comparison function.

This is just a PoC. I haven't given much thought to the details but e.g.
when three consecutive characters exceed 3 bytes then compact_trigram()
uses a truncated 32-bit hash value as trigram instead. Such trigrams
won't work in all cases. We could omit them from the query string but
for languages where the majority of trigrams are hashed or where the
query string consists of only a few trigrams, the look-up performance
would suffer.

I guess better would be using a collation-aware hash function that maps
different values that compare equal to the same hash value. hashtext()
does that already. The new comparison function would then have to
distinguish between plain text trigrams and hash trigrams.
Alternatively, we could store all trigrams as hashes but that would
break functions such as show_trgm().

--
David Geier
Attachment

pgsql-hackers by date:

Previous
From: Ayush Tiwari
Date:
Subject: Re: Enforce INSERT RLS checks for FOR PORTION OF leftovers?
Next
From: Nisha Moond
Date:
Subject: Re: Support EXCEPT for TABLES IN SCHEMA publications