Re: Patch to support SEMI and ANTI join removal - Mailing list pgsql-hackers

From David Rowley
Subject Re: Patch to support SEMI and ANTI join removal
Date
Msg-id CAApHDvqrW2QjhLGyp5YQzk36Hb+VjhBWqtPbVE=KP5peBXDSOA@mail.gmail.com
Whole thread Raw
In response to Re: Patch to support SEMI and ANTI join removal  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Patch to support SEMI and ANTI join removal
List pgsql-hackers
On Wed, Oct 1, 2014 at 1:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-01 01:03:35 +1300, David Rowley wrote:
> On Wed, Oct 1, 2014 at 12:01 AM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>
> > On 2014-09-30 23:25:45 +1300, David Rowley wrote:
> > >
> > > I've not quite gotten my head around how we might stop the unneeded
> > > relation from being the primary path to join the other inner relations,
> > > i.e. what would stop the planner making a plan that hashed all the other
> > > relations and planned to perform a sequence scan on the relation that we
> > > have no need to scan (because the foreign key tells us the join is
> > > pointless). If we were not use use that relation then we'd just have a
> > > bunch of hash tables with no path to join them up. If we did anything to
> > > force the planner into creating a plan that would suit skipping
> > relations,
> > > then we could possibly be throwing away the optimal plan..... Right?
> >
> > I'm not 100% sure I understand your problem description, but let me
> > describe how I think this would work. During planning, you'd emit the
> > exactly same plan as you'd today, with two exceptions:
> > a) When costing a node where one side of a join is very likely to be
> >    removable, you'd cost it nearly as if there wasn't a join.
> >
>
> Ok given the tables:
> create table t1 (x int primary key);
> create table t2 (y int primary key);
>
> suppose the planner came up with something like:
>
> test=# explain (costs off) select t2.* from t1 inner join t2 on t1.x=t2.y;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t1.x = t2.y)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
>
> If we had a foreign key...
>
> alter table t2 add constraint t2_y_fkey foreign key (y) references t1 (x);
>
> ...the join to t1 could possibly be "ignored" by the executor... but
> there's a problem as the plan states we're going to seqscan then hash that
> relation, then seqscan t1 with a hash lookup on each of t1's rows. In this
> case how would the executor skip the scan on t1? I can see how this might
> work if it was t2 that we were removing, as we'd just skip the hash lookup
> part in the hash join node.

Hm, right. But that doesn't seem like a fatal problem to me. The planner
knows about t1/t2 and Seq(t1), Seq(t2), not just Hash(Seq(t2)). So it
can tell the HashJoin node that when the 'shortcut' qualifier is true,
it should source everything from Seq(t2). Since the sequence scan
doesn't care about the node ontop that doesn't seem to be overly
dramatic?
Obviously reality makes this a bit more complicated...


Ok, after a bit of study on the hash join code I can see what you mean, it shouldn't really matter.
I've started working on this now and I've made some changes in analyzejoins.c so that instead of removing joins, it just marks the RangeTblEntry, setting a new flag named skipJoinPossible to true. (I'll think of a better name later)

I'm starting off with hash joins and I'm hacking a bit at ExecInitHashJoin. I want to add 2 bool flags to HashJoinState, outerIsRequired and innerIsRequired. I think if both of these flags are set then we can just abort the join altogether. Though in order to set these flags I need to identify which relations are for the outer and which are for the inner side of the join. I've got the logic for that only partially worked out. My understanding of it so far is that I just need to look at the hjstate->js.ps.lefttree and righttree. Inner being right, and outer being left. I'm a little stuck on more complex cases where the scan is nested deeper in the tree and I'm not quite sure on the logic I should be using to navigate to it.

Take the following plan: (which I've amended to mark the left and right nodes)

explain select t1.* from t1 inner join t2 on t1.t2_id= t2.id inner join t3 on t2.t3_id=t3.id;
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=122.15..212.40 rows=2140 width=8)
   Hash Cond: (t2.t3_id = t3.id)
   ->  Hash Join  (cost=58.15..118.98 rows=2140 width=12) (left node)
         Hash Cond: (t1.t2_id = t2.id)
         ->  Seq Scan on t1  (cost=0.00..31.40 rows=2140 width=8) (left node)
         ->  Hash  (cost=31.40..31.40 rows=2140 width=8) (right node)
               ->  Seq Scan on t2  (cost=0.00..31.40 rows=2140 width=8) (left node)
   ->  Hash  (cost=34.00..34.00 rows=2400 width=4) (right node)
         ->  Seq Scan on t3  (cost=0.00..34.00 rows=2400 width=4) (left node)

The schema is set up in such a way that the joins to t2 and t3 can be... "skipped", and my new code in analyzejoin.c marks the RangeTblEntry records for this relation to reflect that.

During ExecInitHashJoin for the join between t2 and t3, if I want to find t2 in the tree, I'd need to do hjstate->js.ps.lefttree->righttree->lefttree... (which I know just from looking at the explain output) I just can't work out the logic behind where the scan node will actually be. At first I had thought something like, loop down the lefttree path until I reach a deadend, and that's the outer scan node, but in this case there's a right turn in there too, so that won't work. If I keep going down the left path I'd end up at t1, which is completely wrong.

Can anyone shed any light on how I might determine where the scan rel is in the tree? I need to find it so I can check if the RangeTblEntry is marked as skip-able. 

Regards

David Rowley

pgsql-hackers by date:

Previous
From: Emre Hasegeli
Date:
Subject: Re: KNN-GiST with recheck
Next
From: Etsuro Fujita
Date:
Subject: Improve automatic analyze messages for inheritance trees