Thread: Why will hashed SubPlan not use multiple batches

Why will hashed SubPlan not use multiple batches

From

Jeff Janes

Date:

25 January 2013, 20:50:51

A hashed SubPlan will not be used if it would need more than one
batch.  Is there a fundamental reason for that, or just that no one
got around to adding it?

A small decrease in work_mem leads to a 38000 fold change in estimated
query execution (and that might be accurate, as the actual change in
execution is too large to measure)

I have no control over the real query itself (otherwise changing it
from NOT IN to NOT EXISTS would fix it, because that hash plan will
use multiple batches).

I have temporarily fixed it by increasing work_mem, but it would be
better if the planner did the best with the resources it had.

This example works with default settings on "PostgreSQL 9.2.2 on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red
Hat 4.4.6-4), 64-bit".   Same behavior on 9.1.7 and 9.3dev.

Is this a Todo item?

test case below.

create table foo as select (random()*10000)::integer as bar from
generate_series(1,100000);
create table foo2 as select (random()*10000)::integer as bar2 from
generate_series(1,100000);
analyze;


set work_mem TO 3300;
explain select foo.bar from foo where bar not in (select bar2 from foo2);                            QUERY PLAN
--------------------------------------------------------------------Seq Scan on foo  (cost=1693.00..3386.00 rows=50000
width=4) Filter: (NOT (hashed SubPlan 1))  SubPlan 1    ->  Seq Scan on foo2  (cost=0.00..1443.00 rows=100000 width=4)
 
(4 rows)


set work_mem TO 3100;
explain select foo.bar from foo where bar not in (select bar2 from foo2);                               QUERY PLAN
--------------------------------------------------------------------------Seq Scan on foo  (cost=0.00..129201693.00
rows=50000width=4)  Filter: (NOT (SubPlan 1))  SubPlan 1    ->  Materialize  (cost=0.00..2334.00 rows=100000 width=4)
      ->  Seq Scan on foo2  (cost=0.00..1443.00 rows=100000 width=4)
 


Cheers,

Jeff

Re: Why will hashed SubPlan not use multiple batches

From

Tom Lane

Date:

25 January 2013, 21:38:01

Jeff Janes <jeff.janes@gmail.com> writes:
> A hashed SubPlan will not be used if it would need more than one
> batch.  Is there a fundamental reason for that, or just that no one
> got around to adding it?

It can't, really.  Batching a hash join requires freedom to reorder the
rows on both sides of the join.  A SubPlan, by definition, must deliver
the correct answer for the current outer row on-demand.

The only real fix for your problem would be to teach the regular hash
join machinery how to handle NOT IN semantics accurately, so that we
could transform this query into a regular kind of join instead of a
seqscan with a SubPlan wart attached to it.  In the past it hasn't
really seemed worth it, since 99% of the time, once you question
somebody about why they're insisting on NOT IN rather than NOT EXISTS,
you find out that they didn't really want NOT IN semantics after all.

We could also consider adding logic to notice NOT NULL constraints on
the inner select's outputs, which would allow the planner to prove that
the query can be transformed to a regular antijoin.  That only helps
people who've put on such constraints though ...

> I have no control over the real query itself

Sigh.  Another badly-written ORM, I suppose?
        regards, tom lane