Thread: Why will hashed SubPlan not use multiple batches
A hashed SubPlan will not be used if it would need more than one batch. Is there a fundamental reason for that, or just that no one got around to adding it? A small decrease in work_mem leads to a 38000 fold change in estimated query execution (and that might be accurate, as the actual change in execution is too large to measure) I have no control over the real query itself (otherwise changing it from NOT IN to NOT EXISTS would fix it, because that hash plan will use multiple batches). I have temporarily fixed it by increasing work_mem, but it would be better if the planner did the best with the resources it had. This example works with default settings on "PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit". Same behavior on 9.1.7 and 9.3dev. Is this a Todo item? test case below. create table foo as select (random()*10000)::integer as bar from generate_series(1,100000); create table foo2 as select (random()*10000)::integer as bar2 from generate_series(1,100000); analyze; set work_mem TO 3300; explain select foo.bar from foo where bar not in (select bar2 from foo2); QUERY PLAN --------------------------------------------------------------------Seq Scan on foo (cost=1693.00..3386.00 rows=50000 width=4) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on foo2 (cost=0.00..1443.00 rows=100000 width=4) (4 rows) set work_mem TO 3100; explain select foo.bar from foo where bar not in (select bar2 from foo2); QUERY PLAN --------------------------------------------------------------------------Seq Scan on foo (cost=0.00..129201693.00 rows=50000width=4) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..2334.00 rows=100000 width=4) -> Seq Scan on foo2 (cost=0.00..1443.00 rows=100000 width=4) Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > A hashed SubPlan will not be used if it would need more than one > batch. Is there a fundamental reason for that, or just that no one > got around to adding it? It can't, really. Batching a hash join requires freedom to reorder the rows on both sides of the join. A SubPlan, by definition, must deliver the correct answer for the current outer row on-demand. The only real fix for your problem would be to teach the regular hash join machinery how to handle NOT IN semantics accurately, so that we could transform this query into a regular kind of join instead of a seqscan with a SubPlan wart attached to it. In the past it hasn't really seemed worth it, since 99% of the time, once you question somebody about why they're insisting on NOT IN rather than NOT EXISTS, you find out that they didn't really want NOT IN semantics after all. We could also consider adding logic to notice NOT NULL constraints on the inner select's outputs, which would allow the planner to prove that the query can be transformed to a regular antijoin. That only helps people who've put on such constraints though ... > I have no control over the real query itself Sigh. Another badly-written ORM, I suppose? regards, tom lane