While looking at all the places where we currently use CRC, I bumped
into this:

postgres=# select 'penomaha'::tsquery @> 'lbgimpca'::tsquery;
 ?column?
----------
 t
(1 row)

The @> operator is supposed to return true if the first query contains
all the terms of the second query. The above result is bogus; the
strings are completely different. It returns true because both terms
have the same CRC (with our funky CRC algorithm), and the tsq_mcontains
function only compares the CRCs, not the actual values.
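
To illustrate the failure mode, here is a minimal sketch of why comparing
only checksums gives false positives. It does not use our CRC; toy_checksum
is a deliberately weak stand-in (a plain sum of bytes), and all function
names here are hypothetical, not taken from the tsquery code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy checksum (sum of bytes), standing in for the real CRC. */
static uint32_t
toy_checksum(const char *s)
{
    uint32_t sum = 0;

    for (; *s; s++)
        sum += (unsigned char) *s;
    return sum;
}

/* Buggy pattern: equal checksums are taken to mean equal terms. */
static bool
terms_equal_by_checksum(const char *a, const char *b)
{
    return toy_checksum(a) == toy_checksum(b);
}

/* Safer pattern: the checksum is only a fast reject, the bytes decide. */
static bool
terms_equal(const char *a, const char *b)
{
    if (toy_checksum(a) != toy_checksum(b))
        return false;
    return strcmp(a, b) == 0;
}
```

With this toy checksum, "ab" and "ba" collide, so terms_equal_by_checksum
reports them as equal while terms_equal does not. A real CRC collides far
less often, but the principle is the same.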
Another bug is that the function performs a length check first, and
returns false if the second query is larger than the first. The
thinking goes that the first query cannot possibly contain the second
if the second is larger. But that doesn't take into account that a
query can contain duplicate terms (this is basically the same bug that
was recently fixed in jsonb):

postgres=# select 'a & b' @> 'a & a'::tsquery; /* CORRECT */
 ?column?
----------
 t
(1 row)

postgres=# select 'a' @> 'a & a'::tsquery; /* WRONG */
 ?column?
----------
 f
(1 row)

I propose the attached fix. It completely rewrites the tsq_mcontains
function, so that it first extracts all the strings from both tsqueries,
then sorts them and removes duplicates, and then compares the arrays.
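
The approach can be sketched roughly as follows. This is not the attached
patch: it works on plain arrays of C strings rather than on tsquery items,
and the names cmp_term, sort_unique and terms_contain are made up for the
illustration. It assumes the terms have already been extracted from both
queries.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int
cmp_term(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort terms in place and drop duplicates; returns the new count. */
static size_t
sort_unique(const char **terms, size_t n)
{
    size_t i, out;

    if (n == 0)
        return 0;
    qsort(terms, n, sizeof(terms[0]), cmp_term);
    out = 1;
    for (i = 1; i < n; i++)
    {
        if (strcmp(terms[i], terms[out - 1]) != 0)
            terms[out++] = terms[i];
    }
    return out;
}

/*
 * True if every distinct term of "sub" also occurs in "super".
 * Both arrays are reordered as a side effect.
 */
static bool
terms_contain(const char **super, size_t nsuper,
              const char **sub, size_t nsub)
{
    size_t i = 0, j = 0;

    nsuper = sort_unique(super, nsuper);
    nsub = sort_unique(sub, nsub);

    /* The length shortcut is only valid once duplicates are gone. */
    if (nsub > nsuper)
        return false;

    /* Merge-style walk over the two sorted, duplicate-free arrays. */
    while (i < nsuper && j < nsub)
    {
        int c = strcmp(super[i], sub[j]);

        if (c == 0)
        {
            i++;
            j++;
        }
        else if (c < 0)
            i++;
        else
            return false;       /* sub[j] is missing from super */
    }
    return j == nsub;
}
```

Because the actual strings are compared, a CRC collision can no longer
produce a match, and because duplicates are removed before the length
check, {"a"} is correctly found to contain {"a", "a"}.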
(I actually find the whole operator pretty useless. What is it good for?
But that's a different story..)
- Heikki