Re: [BUGS] BUG #14811: Nested IN SUBQERY that returns empty results executed multiple times. - Mailing list pgsql-bugs

From Tom Lane
Subject Re: [BUGS] BUG #14811: Nested IN SUBQERY that returns empty results executed multiple times.
Date
Msg-id 9311.1505171327@sss.pgh.pa.us
Whole thread Raw
In response to [BUGS] BUG #14811: Nested IN SUBQERY that returns empty results executedmultiple times.  (serovov@gmail.com)
Responses Re: [BUGS] BUG #14811: Nested IN SUBQERY that returns empty resultsexecuted multiple times.
List pgsql-bugs
serovov@gmail.com writes:
> I have a query planner bug that executes the same subquery multiple times.

This is not a bug, it's a RFE, and not really a planner RFE either.

AFAICS, the issue is that the scan on betta2zetta returns zero rows.
The plan chosen for ANY(ARRAY()) happens to exploit that case
successfully:
Limit  (cost=4155.01..4183.81 rows=6 width=22) (actual time=0.122..0.122 rows=0 loops=1)  InitPlan 1 (returns $1)    ->
Nested Loop  (cost=134.14..4154.58 rows=6639 width=4) (actual time=0.078..0.078 rows=0 loops=1)          ->  Seq Scan
onbetta2zetta  (cost=0.00..5.75 rows=1 width=4) (actual time=0.077..0.077 rows=0 loops=1)                Filter:
(zetta_id= 3001)                Rows Removed by Filter: 300          ->  Bitmap Heap Scan on alpha2betta
(cost=134.14..4079.52rows=6931 width=8) (never executed)                Recheck Cond: (betta_id = betta2zetta.betta_id)
              ->  Bitmap Index Scan on alpha2betta_betta_id_idx  (cost=0.00..132.41 rows=6931 width=0) (never executed)
                    Index Cond: (betta_id = betta2zetta.betta_id)  ->  Index Scan using alpha_pkey on alpha
(cost=0.42..48.43rows=10 width=22) (actual time=0.118..0.118 rows=0 loops=1)        Index Cond: (id = ANY ($1))Planning
time:0.944 msExecution time: 0.294 ms 

Note that alpha2betta isn't read at all, and neither is alpha because the
ANY doesn't get any values to scan for.

But the plan used for IN doesn't optimize this case as well:
Limit  (cost=0.85..48.91 rows=6 width=22) (actual time=272.964..272.964 rows=0 loops=1)  ->  Merge Semi Join
(cost=0.85..53179.99rows=6639 width=22) (actual time=272.962..272.962 rows=0 loops=1)        Merge Cond: (alpha.id =
alpha2betta.alpha_id)       ->  Index Scan using alpha_pkey on alpha  (cost=0.42..22654.42 rows=700000 width=22)
(actualtime=0.017..0.017 rows=1 loops=1)        ->  Nested Loop  (cost=0.42..28694.18 rows=6639 width=4) (actual
time=272.939..272.939rows=0 loops=1)              Join Filter: (alpha2betta.betta_id = betta2zetta.betta_id)
 ->  Index Only Scan using alpha2betta_alpha_id_betta_id_key on alpha2betta  (cost=0.42..18188.42 rows=700000 width=8)
(actualtime=0.100..117.954 rows=700000 loops=1)                    Heap Fetches: 0              ->  Materialize
(cost=0.00..5.75rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=700000)                    ->  Seq Scan on
betta2zetta (cost=0.00..5.75 rows=1 width=4) (actual time=0.071..0.071 rows=0 loops=1)                          Filter:
(zetta_id= 3001)                          Rows Removed by Filter: 300Planning time: 1.423 msExecution time: 273.148 ms 

We do stop short on the alpha scan, but alpha2betta gets read to the end,
looking for a joinable row that it won't ever find.

The planner doesn't ever optimize on the basis that a subquery will return
zero rows (unless that's provably true), and I do not think it would be
wise to do so.  We have enough trouble with plans being optimized on the
assumption of one row out and that proving to be an underestimate.
So I would not like to make it prefer the form with an initplan to the
form without, even if it were to believe that there are zero rows with
zetta_id = 3001.  If it's wrong that will be a horribly bad choice,
as the estimated costs indicate.

We could, however, imagine optimizing this case at runtime.  If
nodeNestloop.c were to note that it got no rows from the inner scan on the
first iteration, and that it isn't passing any nestloop parameters into
the inner side, then it could plausibly assume that all future executions
of the inner plan will also give zero rows, and so it can't produce any
join rows [if it's a standard inner join] and it can stop scanning the
outer side.  In that case we'd stop the alpha2betta indexscan after the
first tuple and win.  Merge and hash joins both have optimizations of this
ilk for empty input relations, so it's reasonable for nestloop to do so
also.

Now a possible objection to this argument is that if the inner side
contains volatile quals or set-returning functions, it might produce some
rows in a future execution even though it didn't this time.  But we've
already decided that we aren't making any guarantees about such behavior.
In this example, we broke any chance of behaving "correctly" for such an
inner scan by sticking a Material node on top of it.  More generally, if
the inner scan doesn't have any parameter dependency on the outer, we're
free to implement the join as a merge or hash join, which would read the
inner scan only once anyway.  So I think that objection can be rejected.

Whether this is worth doing, given that it hasn't really come up before,
I'm not sure of.  It would add a little bit of overhead to nestloops:
we'd need at least something like "node->nl_gotInnerTuple = true" added
to each successful iteration.  That's probably negligible, but maybe not.
The equivalent optimizations for merge and hash add only one-time tests,
not once-per-tuple overhead, so it's hard to argue that they cost much.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: serovov@gmail.com
Date:
Subject: [BUGS] BUG #14811: Nested IN SUBQERY that returns empty results executedmultiple times.
Next
From: Thomas Munro
Date:
Subject: Re: [BUGS] BUG #14808: V10-beta4, backend abort