Thread: Execution plans for tpc-h

Execution plans for tpc-h

From
Victor Muntes Mutero
Date:
We have Postgres 7.0.2 .

There is a query in TPC-H Benchmark that produces this execution plan:

Aggregate  (cost=698221486855.00..698221486855.00 rows=1 width=72)
  ->  Nested Loop  (cost=0.00..698221486855.00 rows=1 width=72)
        ->  Seq Scan on part  (cost=0.00..6855.00 rows=200000 width=32)
        ->  Seq Scan on lineitem  (cost=0.00..190439.15 rows=6001215
width=40)

The functional definition of this query (Q19) is :

select
    sum(l_extendedprice* (1 - l_discount)) as revenue
from
    lineitem,
    part
where
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#12'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 1 and l_quantity <= 1 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 10 and l_quantity <= 10 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#34'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 20 and l_quantity <= 20 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    );


There is an xjoin (p_partkey = l_partkey) so, why Postgres utilize
Nestloop??,
Would not be the HashJoin more useful??. I have tried to put the
variable ENABLE_NESTLOOP to OFF but it continues utilising NestLoop.
With this plan the
time execution of this query is eternal.

Can anybody explain me the reason the reason because Postgres utilize
NestLoop in this query?

Thanks in advance.

Re: Execution plans for tpc-h

From
"Richard Huxton"
Date:
From: "Victor Muntes Mutero" <vmuntes@ac.upc.es>


> We have Postgres 7.0.2 .
>
> There is a query in TPC-H Benchmark that produces this execution plan:
>
> Aggregate  (cost=698221486855.00..698221486855.00 rows=1 width=72)
>   ->  Nested Loop  (cost=0.00..698221486855.00 rows=1 width=72)
>         ->  Seq Scan on part  (cost=0.00..6855.00 rows=200000 width=32)
>         ->  Seq Scan on lineitem  (cost=0.00..190439.15 rows=6001215
> width=40)

I've got to say that's the largest cost estimate I've ever seen (should
there be some sort of award?)

Look at the "rows=" values - PG thinks it's got to check zillions, so it's
obviously missed the p_partkey=l_partkey.

Assuming indexes etc are OK, try moving this out of the brackets:

...
where (
p_partkey = l_partkey and
( p_brand='Brand#12'
...
)
or ( p_brand='Brand#23'
...

Alternatively, try and explicit join and see if PG gets the message.

- Richard Huxton



Re: Execution plans for tpc-h

From
Tom Lane
Date:
Victor Muntes Mutero <vmuntes@ac.upc.es> writes:
> There is an xjoin (p_partkey = l_partkey)

Not in that form of the query.  You have

    WHERE ( ... ) OR ( ... ) OR ( ... )

If Postgres were to reduce the WHERE expression to CNF, it would discover
that the p_partkey = l_partkey clause is common to all three ORs,
whereupon it would have WHERE (p_partkey = l_partkey) AND (some big OR)
and could use p_partkey = l_partkey as a merge or hash join key.
But I suspect it decides that the WHERE is too complex to try to
simplify to CNF.  We used to apply cnfify() unconditionally, but that
has some unpleasant exponential behavior on large expressions :-(.
What's worse is that DNF is actually the preferred form in some cases
--- especially when a repeated indexscan would be useful.

The heuristics that decide whether to convert to CNF could probably use
further refinement.  If you are interested in playing with them, see
canonicalize_qual() in src/backend/optimizer/prep/prepqual.c.  But it
would not be easy to recognize that this query would be better off in
CNF form: that requires noticing that the OR'd clauses have a lot of
terms in common, and you can't discover that without actually doing
most of the canonicalization work...

            regards, tom lane