Re: Hash join not finding which collation to use for string hashing - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Hash join not finding which collation to use for string hashing
Date
Msg-id 5C771C6B-704C-4DD6-955F-21C9EBE48980@enterprisedb.com
Whole thread Raw
In response to Re: Hash join not finding which collation to use for string hashing  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Hash join not finding which collation to use for string hashing  (Amit Langote <amitlangote09@gmail.com>)
List pgsql-hackers

> On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?
>
>> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for that.  Can you recommend how I would
proceedthere? 
>
> There's no such information attached to opfamilies, which is more or less
> forced by the fact that individual operators don't expose it either.
> There's not much hope of making that better without incompatible changes
> in the requirements for extensions to define operators and/or operator
> families.

Thanks, Tom, for confirming this.

Given the excellent explanations you and Robert have given, I think I’m retracting this whole idea and accepting your
positionsthat it’s not worth it. 

For the archives, I’m still going to respond to the rest of what you say:

>> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first
thingsit does.  That’s where the error that annoys me comes out.  But I don’t think it really needs to be doing this.
Itshould first be determining if collation *matters*. 
>
> But of course it matters.  How do you know whether the operation is
> deterministic if you don't know the collation?
>
>> 3) Extend the concept of collations to collation sets.  Right now, I’m only thinking about a collation set as having
twovalues, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get
builtup and need to work.  Anyway, at the point in the executor that the collations don’t match, instead of passing
NULLdown the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with
twodifferent collations and decide if they can deal with that or if they need to throw an error. 
>
> Maybe this could work.  I think it would get messy when bubbling up
> collations, but as long as you're talking about "sets" not "pairs"
> it might be possible to postpone collation resolution.
>
> To me, though, the main advantage of this is that we could throw a
> more explicit error like "collations "ja_JP" and "tr_TR" cannot be
> unified", since that information would still be there at runtime.
> I'm still pretty dubious that having texteq special-case the situation
> where the collations are different but all deterministic is a reasonable
> thing to do.

On my mac, when I run “SELECT * FROM pg_collation”, every one of the 271 rows I get back have collisdeterministic true.
Iknow that which collations you get on a system is variable, so I’m not saying that nobody has nondeterministic
collations,but it seems common enough that mismatched collations will both be deterministic.  That’s the common case,
notsome weird edge case. 

So the issue here seems to be whether equality should get different treatment from other operators, and I obviously am
arguingthat it should, though you and Robert have both made really good points against that position. 

>
> One practical problem is that postponing that work to runtime could be
> a huge performance hit, because you'd have to do it over again on each
> call of the operator.  I suppose some caching might be possible.

Yes, Robert mentioned performance implications, too.

> Another issue is that you're still putting far too much emphasis on
> the fact that a hash-join plan manages to avoid this error, and ignoring
> the problem that a lot of other plans for the same query will not avoid
> it.  What if the planner had chosen a merge-join, for instance?

You’re looking at the problem from the point of view of how postgres is currently and historically implemented, and
seeingthat this problem is hard.  I was looking at it more from the perspective of a user who gets the error message
andthinks, “this is stupid, the query is refusing to run for want of a collation being specified, but I can clearly see
thatit doesn’t actually need one.”  I think that same reaction from the user would happen if the planner chose a
merge-join. The user would just say, “gee, what a stupid planner, why did it choose a merge join for this when the lack
ofa collation clause clearly indicates that it should have limited itself to something that only needs equality
comparison.”

I’m not calling the planner stupid, nor the system generally, but I know that people get frustrated with systems that
haveunintuitive limitations like this when they don’t know the internals of the system that give rise to the
limitations.

> How
> useful is it to allow the join if things still break the moment you
> add an ORDER BY?

I think that’s apples-to-oranges.  If you ask the system to order the data, and you’ve got ambiguity about which
orderingyou mean, then of course it can’t continue until you tell it which collation you want. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Enabling B-Tree deduplication by default
Next
From: Peter Geoghegan
Date:
Subject: Re: Enabling B-Tree deduplication by default