Thread: Execution plans for tpc-h
We have Postgres 7.0.2 . There is a query in TPC-H Benchmark that produces this execution plan: Aggregate (cost=698221486855.00..698221486855.00 rows=1 width=72) -> Nested Loop (cost=0.00..698221486855.00 rows=1 width=72) -> Seq Scan on part (cost=0.00..6855.00 rows=200000 width=32) -> Seq Scan on lineitem (cost=0.00..190439.15 rows=6001215 width=40) The functional definition of this query (Q19) is : select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#12' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 1 and l_quantity <= 1 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#23' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 10 and l_quantity <= 10 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#34' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 20 and l_quantity <= 20 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ); There is an xjoin (p_partkey = l_partkey) so, why Postgres utilize Nestloop??, Would not be the HashJoin more useful??. I have tried to put the variable ENABLE_NESTLOOP to OFF but it continues utilising NestLoop. With this plan the time execution of this query is eternal. Can anybody explain me the reason the reason because Postgres utilize NestLoop in this query? Thanks in advance.
From: "Victor Muntes Mutero" <vmuntes@ac.upc.es> > We have Postgres 7.0.2 . > > There is a query in TPC-H Benchmark that produces this execution plan: > > Aggregate (cost=698221486855.00..698221486855.00 rows=1 width=72) > -> Nested Loop (cost=0.00..698221486855.00 rows=1 width=72) > -> Seq Scan on part (cost=0.00..6855.00 rows=200000 width=32) > -> Seq Scan on lineitem (cost=0.00..190439.15 rows=6001215 > width=40) I've got to say that's the largest cost estimate I've ever seen (should there be some sort of award?) Look at the "rows=" values - PG thinks it's got to check zillions, so it's obviously missed the p_partkey=l_partkey. Assuming indexes etc are OK, try moving this out of the brackets: ... where ( p_partkey = l_partkey and ( p_brand='Brand#12' ... ) or ( p_brand='Brand#23' ... Alternatively, try and explicit join and see if PG gets the message. - Richard Huxton
Victor Muntes Mutero <vmuntes@ac.upc.es> writes: > There is an xjoin (p_partkey = l_partkey) Not in that form of the query. You have WHERE ( ... ) OR ( ... ) OR ( ... ) If Postgres were to reduce the WHERE expression to CNF, it would discover that the p_partkey = l_partkey clause is common to all three ORs, whereupon it would have WHERE (p_partkey = l_partkey) AND (some big OR) and could use p_partkey = l_partkey as a merge or hash join key. But I suspect it decides that the WHERE is too complex to try to simplify to CNF. We used to apply cnfify() unconditionally, but that has some unpleasant exponential behavior on large expressions :-(. What's worse is that DNF is actually the preferred form in some cases --- especially when a repeated indexscan would be useful. The heuristics that decide whether to convert to CNF could probably use further refinement. If you are interested in playing with them, see canonicalize_qual() in src/backend/optimizer/prep/prepqual.c. But it would not be easy to recognize that this query would be better off in CNF form: that requires noticing that the OR'd clauses have a lot of terms in common, and you can't discover that without actually doing most of the canonicalization work... regards, tom lane