Re: Hash join not finding which collation to use for string hashing - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Hash join not finding which collation to use for string hashing
Date
Msg-id 13291.1580416177@sss.pgh.pa.us
Whole thread Raw
In response to Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Hash join not finding which collation to use for string hashing  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Would it make sense to catch a qual with unassigned collation
> somewhere in the planner, where the qual's operator family is
> estatblished, by checking if the operator family behavior is sensitive
> to collations?

> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would
proceedthere? 

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*. 

But of course it matters.  How do you know whether the operation is
deterministic if you don't know the collation?

> 3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 

Maybe this could work.  I think it would get messy when bubbling up
collations, but as long as you're talking about "sets" not "pairs"
it might be possible to postpone collation resolution.

To me, though, the main advantage of this is that we could throw a
more explicit error like "collations "ja_JP" and "tr_TR" cannot be
unified", since that information would still be there at runtime.
I'm still pretty dubious that having texteq special-case the situation
where the collations are different but all deterministic is a reasonable
thing to do.

One practical problem is that postponing that work to runtime could be
a huge performance hit, because you'd have to do it over again on each
call of the operator.  I suppose some caching might be possible.

Another issue is that you're still putting far too much emphasis on
the fact that a hash-join plan manages to avoid this error, and ignoring
the problem that a lot of other plans for the same query will not avoid
it.  What if the planner had chosen a merge-join, for instance?  How
useful is it to allow the join if things still break the moment you
add an ORDER BY?

            regards, tom lane



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: Hash join not finding which collation to use for string hashing
Next
From: Melanie Plageman
Date:
Subject: Re: BufFileRead() error signalling