Re: Patch to support SEMI and ANTI join removal - Mailing list pgsql-hackers

From David Rowley
Subject Re: Patch to support SEMI and ANTI join removal
Date
Msg-id CAApHDvpWtMw6L6kcvymp-r8xL7j__6nC_+OaCFd-rgj2rc_rzQ@mail.gmail.com
Whole thread Raw
In response to Re: Patch to support SEMI and ANTI join removal  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Patch to support SEMI and ANTI join removal
List pgsql-hackers
On Sun, Nov 16, 2014 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 15 October 2014 11:03, David Rowley <dgrowleyml@gmail.com> wrote:

> The explain analyze from the above query looks like:
> test=# explain (analyze, costs off, timing off) select count(*) from t1
> inner join t2 on t1.t2_id=t2.id;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Aggregate (actual rows=1 loops=1)
>    ->  Nested Loop (actual rows=1000000 loops=1)
>          ->  Seq Scan on t1 (actual rows=1000000 loops=1)
>          ->  Index Only Scan using t2_pkey on t2 (never executed)
>                Index Cond: (id = t1.t2_id)
>                Heap Fetches: 0
>  Execution time: 124.990 ms
> (7 rows)
>
> As you can see the scan on t2 never occurred.

Very good, happy to see this happening (yay FKs!) and with
PostgreSQL-style rigour.


:)
 
I've reviewed the patch from cold and I have a few comments.


Thanks!
 
The plan you end up with here works quite differently from left outer
join removal, where the join is simply absent. That inconsistency
causes most of the other problems I see.

I propose that we keep track of whether there are any potentially
skippable joins at the top of the plan. When we begin execution we do
a single if test to see if there is run-time work to do. If we pass
the run-time tests we then descend the tree and prune the plan to
completely remove unnecessary nodes. We end with an EXPLAIN and
EXPLAIN ANALYZE that looks like this

>                             QUERY PLAN
> ------------------------------------------------------------------
>  Aggregate (actual rows=1 loops=1)
>          ->  Seq Scan on t1 (actual rows=1000000 loops=1)

Doing that removes all the overheads and complexity; it also matches
how join removal currently works.


This sounds much cleaner than what I have at the moment, although, you say EXPLAIN would look like that... I don't think that's quite true as the EXPLAIN still would have the un-pruned version, as the pruning would be done as executor start-up. Would it cause problems to have the EXPLAIN have a different looking plan than EXPLAIN ANALYZE?

I'll need to look into how the plan is stored in the case of PREPARE statements, as no doubt I can't go vandalising any plans that are stored in the PREPARE hashtable. I'd need to make a copy first, unless that's already done for me. But I guess I'd only have to do that if some flag on PlannerInfo hasSkippableNodes was true, so likely the overhead of such a copy would be regained by skipping some joins.
 
The alternative is accepting some pretty horrible additional code in
most join types, plus a small regression on nested loop joins which I
would have to call out as regrettably unacceptable. (Horrible in this
sense that we don't want that code, not that David's code is poor).


Yeah it is quite horrid. I did try and keep it as simple and as non-invasive as possible, but for nest loop it seemed there was just no better way.
 
The tests on the patch are pretty poor. If we should use EXPLAINs to
show a join removal that works and a join removal that fails. With a
few of the main permutations.


Agreed. To be honest I abandoned the tests due to a problem with EXPLAIN ANALYZE outputting the variable timing information at the bottom. There's no way to disable this! So that makes testing much harder.

I added myself to the list of complainers over here -> http://www.postgresql.org/message-id/CAApHDvoVzBTzLJbD9VfaznWo6jooK1k6-7rFQ8zYM9H7ndCcSA@mail.gmail.com but the proposed solution (diff tool which supports regex matching) is not all that simple, and I've not gotten around to attempting to make one yet.

I've also attached a rebased patch, as the old one is no longer applying. 

Regards

David Rowley
Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: controlling psql's use of the pager a bit more
Next
From: Simon Riggs
Date:
Subject: Re: Improve automatic analyze messages for inheritance trees