Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>> I agree. That is inconsistent. But if anything, shouldn't we change
>> tsvector/tsquery to as well adhere to the inferred collation?
> I am not sure either way.
>
> It's easy to specify a COLLATE clause to affect the interpretation of
> the input. But once you parse the inputs into a stored value, you can't
> later reinterpret those values by specifying a COLLATE clause. The
> parsing already happened and the original input string was lost.
>
> You can end up with a table full of values, some of which were parsed
> with one set of semantics, and others parsed with a different set of
> semantics. That may make sense or it may just cause confusion. It's
> tough for me to say.

The rule that text search goes by is that it's okay to be a bit
fuzzy about this because people are usually looking for approximate
matches, so that even if you have sets of lexemes that were extracted
under slightly different parsing rules you can probably still find
what you want. While that argument still works for pg_trgm's original
"similarity" functions, it falls flat for the LIKE/ILIKE/regex index
support functionality: people will be justifiably unhappy if the index
doesn't find the exact same matches that a seqscan-and-filter would.

I've not experimented, but I rather imagine that things are already
buggy as heck, in that optimizing a LIKE or regex expression that's
got collation A applied to it into an indexscan on a pg_trgm index
made with collation B will not work if different trigrams get
extracted. I think we have to insist that the index collation match
the query. Once we've done that, a change like this becomes less
worrying: you will not get wrong answers; rather, the
planner will refuse to use an incompatible index.
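
To make the failure mode concrete, here is a toy sketch in Python. It is
not pg_trgm's code, and the two folding functions are hypothetical
stand-ins for "collation A" and "collation B" (the second imitates
Turkish-style handling of dotted and dotless I). It only illustrates how
an index built under one set of folding rules can miss a row that a
seqscan under the other rules would match.

```python
# Toy model only: NOT pg_trgm's implementation. The fold functions are
# hypothetical stand-ins for case folding under two different collations.

def fold_default(s: str) -> str:
    # Stand-in for "collation A": plain lowercasing.
    return s.lower()


def fold_turkish_like(s: str) -> str:
    # Stand-in for "collation B": 'I' folds to dotless 'ı', 'İ' folds to 'i'.
    return s.replace("I", "ı").replace("İ", "i").lower()


def trigrams(word: str, fold) -> set:
    # Fold the word, pad it with two leading blanks and one trailing blank,
    # then collect every run of three consecutive characters.
    padded = "  " + fold(word) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def index_can_match(indexed_value: str, pattern_word: str,
                    index_fold, query_fold) -> bool:
    # An index scan can only return the row if every trigram the query
    # asks for is among the trigrams stored for the indexed value.
    needed = trigrams(pattern_word, query_fold)
    stored = trigrams(indexed_value, index_fold)
    return needed <= stored


# Same folding on both sides: the index agrees with a seqscan.
same = index_can_match("TITLE", "title", fold_default, fold_default)

# Index built with fold B, query folded with fold A: trigrams such as
# "tit" are absent from the index entry, so the row is silently missed.
mixed = index_can_match("TITLE", "title", fold_turkish_like, fold_default)
```

In this sketch `same` is true and `mixed` is false, even though a
case-insensitive comparison under the query's own folding rules would
accept the row. Refusing to use the index in the mixed case trades a
slower plan for a correct answer.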
regards, tom lane