On Wed, 2026-01-21 at 16:36 +0100, David Geier wrote:
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
...
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead of DEFAULT_COLLATION_OID.

Hi,

Thank you for working on this.

This area is a bit awkward conceptually. The case you found is not
about the *sort order* of the values; it's about the casing semantics.
We mix those two concepts into a single "collation oid" that determines
both sort order and casing semantics (and pattern matching semantics,
too).

LOWER() and UPPER() take the casing semantics from the inferred
collation, so that's a good argument that you're doing the right thing
here.
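
To make the casing point concrete, here is a rough Python sketch (a toy
model, not pg_trgm's actual C code) of how the trigram set for one word
changes with the casing rules applied before extraction. The function
names and the two hand-written rule sets are hypothetical; in PostgreSQL
the case mapping comes from the collation's provider. The padding (two
leading spaces, one trailing space) follows the pg_trgm documentation.

```python
# Toy model only: the two lowercasing rule sets below are hand-written
# stand-ins for "a non-Turkish collation" and "a Turkish collation".

def lower_default(s):
    """Locale-independent lowercasing ('I' -> 'i')."""
    return s.lower()


def lower_turkish(s):
    """Turkish rules: 'I' -> dotless 'ı', dotted 'İ' -> 'i'."""
    return s.replace("I", "ı").replace("İ", "i").lower()


def trigrams(word, lower):
    """Trigram set of a single word, padded the way pg_trgm documents:
    two spaces in front, one space behind."""
    padded = "  " + lower(word) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


default_set = trigrams("ISTANBUL", lower_default)
turkish_set = trigrams("ISTANBUL", lower_turkish)

# Trigrams that touch the first letter differ between the two rule sets;
# the rest are shared.
only_default = default_set - turkish_set
only_turkish = turkish_set - default_set
shared = default_set & turkish_set
```

In this sketch the same input yields different trigrams near the leading
"I", so which casing rules are used affects what gets indexed and
matched, independent of any sort order.
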
But full text search does not; it uses DEFAULT_COLLATION_OID for
parsing the input. That sort of makes sense, because tsvector/tsquery
don't have a collatable sort order -- it's more about the parsing
semantics to create the values in the first place, not about how the
tsvector/tsquery values are sorted.

So that leaves me wondering: why would pg_trgm use the inferred
collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
conceptually similar, and the only real difference I see is that
tsvector/tsquery are types and pg_trgm is a set of functions.

Note that I made some changes here recently: full text search and ltree
used to use libc unconditionally or a mix of libc and
DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
conceptual problem of whether we should use the inferred collation (as
you suggest) or not.

Regards,
Jeff Davis