Re: Use correct collation in pg_trgm - Mailing list pgsql-hackers

From	David Geier
Subject	Re: Use correct collation in pg_trgm
Date	March 26 11:50:28
Msg-id	7e11acde-9d5b-49a1-9c41-23096d51d2e2@gmail.com
In response to	Re: Use correct collation in pg_trgm (Jeff Davis <pgsql@j-davis.com>)
List	pgsql-hackers

Hi Jeff!

> This area is a bit awkward conceptually. The case you found is not
> about the *sort order* of the values; it's about the casing semantics.
> We mix those two concepts into a single "collation oid" that determines
> both sort order and casing semantics (and pattern matching semantics,
> too).
> 
> LOWER() and UPPER() take the casing semantics from the inferred
> collation, so that's a good argument that you're doing the right thing
> here.
> 
> But full text search does not; it uses DEFAULT_COLLATION_OID for
> parsing the input. That sort of makes sense, because tsvector/tsquery
> don't have a collatable sort order -- it's more about the parsing
> semantics to create the values in the first place, not about how the
> tsvector/tsquery values are sorted.

For pg_trgm it's not only about casing but also about parsing: the
decision of what is considered an alphanumeric character in ISWORDCHR()
depends on the collation.
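
To illustrate why the word-character test matters, here is a minimal
sketch in Python. It is not pg_trgm's code: the two classifier functions
are hypothetical stand-ins for "ISWORDCHR() under two different
collations", and only the padding rule (two spaces before each word, one
after) follows what the pg_trgm documentation describes.

```python
# Toy model: the same string yields different trigrams depending on
# which characters the classifier treats as part of a word.

def is_word_char_ascii(ch):
    # Hypothetical stand-in for a "C"-like locale: only ASCII alphanumerics.
    return ch.isascii() and ch.isalnum()

def is_word_char_unicode(ch):
    # Hypothetical stand-in for a Unicode-aware locale.
    return ch.isalnum()

def split_words(text, is_word_char):
    # Split text into maximal runs of word characters.
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

def trigrams(text, is_word_char):
    # Lowercase each word, pad with two leading spaces and one trailing
    # space, then collect every 3-character window.
    result = set()
    for word in split_words(text, is_word_char):
        padded = "  " + word.lower() + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

ascii_trgm = trigrams("naïve", is_word_char_ascii)
unicode_trgm = trigrams("naïve", is_word_char_unicode)
print(sorted(ascii_trgm))
print(sorted(unicode_trgm))
```

With the ASCII-only classifier, "naïve" is split into the two words "na"
and "ve"; with the Unicode-aware one it stays a single word, so the two
trigram sets differ. An index built under one rule and queried under the
other would therefore compare mismatched trigram sets.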

> So that leaves me wondering: why would pg_trgm use the inferred
> collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
> conceptually similar, and the only real difference I see is that
> tsvector/tsquery are types and pg_trgm is a set of functions.

I agree. That is inconsistent. But if anything, shouldn't we change
tsvector/tsquery to adhere to the inferred collation as well?

For example, when a user specifies a collation for some table column, they
expect the collation to affect more than just sort order. With, say, collation
en-US-x-icu, the B-tree lookup will be case-insensitive. Why would a GIN
index suddenly not adhere to the collation? That seems counter-intuitive
and confusing. The same applies when using tsvector/tsquery.

More generally: shouldn't it, from a user's point of view, be all or
nothing, to avoid surprises?

If not, we should come up with easy-to-understand and easy-to-remember
rules for what adheres to the inferred collation and what adheres to the
default collation, and document them.

> Note that I made some changes here recently: full text search and ltree
> used to use libc unconditionally or a mix of libc and
> DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
> consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
> conceptual problem of whether we should use the inferred collation (as
> you suggest) or not.

Thanks for the heads-up.

--
David Geier
