Re: [HACKERS] Re: Improve OR conditions on joined columns (commonstar schema problem) - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: [HACKERS] Re: Improve OR conditions on joined columns (commonstar schema problem)
Date
Msg-id 6cf3ba25-7dbc-20fe-4ff2-4f60a369a577@openscg.com
Whole thread Raw
In response to Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 3/19/17 2:32 PM, Tom Lane wrote:
> Jim Nasby <jim.nasby@openscg.com> writes:
>> Having thought about it, I share Tom's concerns. And now I'm worried
>> about what happens if there are multiple separate OR clauses. I guess
>> those would be handled by separate UNIONs?
>
> As proposed, the patch would try to optimize by splitting each OR clause
> independently, and would choose whichever way gave the best cost estimate.
> I'm not sure it's possible to do better than that, and even if it is,
> I think improving it could be left for later.

Agreed.

> I'd also considered an approach of de-duping on the basis of all relation
> ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
> in which the rel was eliminated entirely.  But that doesn't fix it,
> because in this example the first arm would return (a.ctid, NULL) while
> the second arm would return (NULL, b.ctid), so that the UNION step would
> still fail to detect any duplication.  To make it work, we'd have to not
> eliminate the joins, which would pretty much defeat the usefulness of
> the optimization for your original example case.

It might still be worth-while in some circumstances. In your example, if 
there were these indexes:

a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and union 
that with b__y=43 nested to a__id=b.id.

That said, now isn't the time to be adding more complexity.

> So full joins definitely break this whole optimization.  I think it's okay
> with left joins though, because when starting from "a left join b" it will
> never be possible to remove "a" so we'll always include a.ctid in the
> UNION de-duping step.  If b is removed in some arm, then it must be true
> that we get exactly one left-join output row per a row, regardless of the
> contents of b, in that arm.  The argument for the patch being okay is
> essentially that we must get exactly one left-join output row per a row,
> regardless of the contents of b, in *every* arm, because the various
> modified versions of the OR clause can't affect that conclusion.  In some
> of the arms we might not remove b, and we might even be able to reduce the
> left join to an inner join, but there should still be no more than one
> join output row per a row.  That being the case, it should be sufficient
> to de-dup using a.ctid while ignoring b.ctid.

The only case I can think of is: would it be possible (perhaps not 
today; maybe in the future) for other parts of the query to affect join 
elimination? I can't conceive of how that might happen, but if it did 
then it's possible that the elimination would work differently with the 
UNION than it would with an OR.

The comment on join_is_removable() does mention that there's other 
potentially interesting cases that we can't handle now; it's maybe worth 
mentioning

Other than that, I can't see any issues with your logic.

> Any clearer yet?

Absolutely. I think it would be very valuable to include that with the 
initial comment in planunionor.c. Join reduction and removal is already 
tricky enough to wrap your head around.

Other comments:

> +  * is retty mechanical, but we can't do it until we have a RelOptInfo for the

s/retty/pretty/


I suspect that in many systems single-table queries are far more common 
than CTEs, so maybe it's worth reversing those two tests in 
split_join_or_clauses().


For the Unique path, it would be nice if the code did what would be 
necessary to consider a TID hash join, since that means a user could 
create the appropriate operator and it would just be picked up. 
Certainly not worth much effort at this point though.

> +     /*
> +      * Must not have any volatile functions in FROM or WHERE (see notes at
> +      * head of file).
> +      */
> +     if (contain_volatile_functions((Node *) parse->jointree))

Is there by chance anywhere else that needs to check that? Maybe worth 
adding the info to the Query struct if so.

> +      * We insist that all baserels used in the query be plain relations, so

Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query already 
had setOps? AIUI setOps work has already been done by the time 
split_join_or_clauses() happens; I just want to check that that's OK.

I'm not sure a GUC is worth it... I suspect that any query with multiple 
rels and an OR condition is going to be expensive enough that whatever 
additional plan time there is won't be noticeable.

I've verified that the patch still applies and make check-world is clean.
-- 
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG                 http://OpenSCG.com



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers