Re: Use correct collation in pg_trgm - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Use correct collation in pg_trgm
Msg-id 2c15502fd399128ee27fbe1a305e006780159f66.camel@j-davis.com
In response to Use correct collation in pg_trgm  (David Geier <geidav.pg@gmail.com>)
Responses Re: Use correct collation in pg_trgm
List pgsql-hackers
On Wed, 2026-01-21 at 16:36 +0100, David Geier wrote:
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
...

> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead of DEFAULT_COLLATION_OID.

Hi,

Thank you for working on this.

This area is a bit awkward conceptually. The case you found is not
about the *sort order* of the values; it's about the casing semantics.
We mix those two concepts into a single "collation oid" that determines
both sort order and casing semantics (and pattern matching semantics,
too).

LOWER() and UPPER() take the casing semantics from the inferred
collation, so that's a good argument that you're doing the right thing
here.
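
To make the casing point concrete, here is a rough toy model (not
pg_trgm's actual code). The collation names, the Turkish-style casing
rule, and the helper names are all assumptions made up for
illustration. The padding (two leading spaces, one trailing) follows
what the pg_trgm documentation describes for a single word.

```python
# Toy model only: not pg_trgm's implementation. The collation names and
# the casing rule below are illustrative assumptions.

def lower_with_collation(text, collation):
    """Lowercase text under a toy notion of collation-specific casing."""
    if collation == "tr-TR":
        # Turkish-style rule: 'I' lowercases to dotless 'ı', 'İ' to 'i'.
        text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()


def trigrams(word, collation):
    """Lowercase the word, pad it, and collect every 3-character window."""
    padded = "  " + lower_with_collation(word, collation) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


# The same input yields different trigram sets depending on which
# casing rules are applied before extraction.
default_set = trigrams("ILIK", "default")
turkish_set = trigrams("ILIK", "tr-TR")
```

In this sketch the two sets share no trigram at all, so a similarity
score computed from them would depend entirely on which casing rules
were chosen. Sort order never enters into it.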

But full text search does not; it uses DEFAULT_COLLATION_OID for
parsing the input. That sort of makes sense, because tsvector/tsquery
don't have a collatable sort order -- it's more about the parsing
semantics to create the values in the first place, not about how the
tsvector/tsquery values are sorted.

So that leaves me wondering: why would pg_trgm use the inferred
collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
conceptually similar, and the only real difference I see is that
tsvector/tsquery are types and pg_trgm is a set of functions.
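
A hypothetical sketch of that types-versus-functions distinction, again
as a toy model rather than anything taken from the PostgreSQL source.
Every name here (fold_case, parse_to_vector, words_equal_ci,
DEFAULT_COLLATION) is invented for illustration.

```python
# Toy model only: all names are hypothetical, not PostgreSQL internals.

DEFAULT_COLLATION = "default"


def fold_case(text, collation):
    """Toy casing rule: Turkish-style handling of 'I' and 'İ', else plain."""
    if collation == "tr-TR":
        text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()


def parse_to_vector(document):
    """Type-style: casing is applied once, when the value is created,
    using the default collation. Later operations see only the result."""
    return frozenset(fold_case(w, DEFAULT_COLLATION) for w in document.split())


def words_equal_ci(a, b, collation):
    """Function-style: a collation is available on every call, so the
    casing rules can follow whatever was inferred for the arguments."""
    return fold_case(a, collation) == fold_case(b, collation)


vector = parse_to_vector("ILIK Su")
same_in_turkish = words_equal_ci("ILIK", "ılık", "tr-TR")
same_in_default = words_equal_ci("ILIK", "ılık", "default")
```

In the type-style path the casing decision is baked into the stored
value, while the function-style path can make that decision per call.
That is the only structural difference the sketch captures; it does not
settle which behavior is the right one.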

Note that I made some changes here recently: full text search and ltree
used to use libc unconditionally or a mix of libc and
DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
conceptual problem of whether we should use the inferred collation (as
you suggest) or not.

Regards,
    Jeff Davis



