Re: 7.0.2 and 6.5.3 performance - Mailing list pgsql-general

From Tom Lane
Subject Re: 7.0.2 and 6.5.3 performance
Date
Msg-id 9634.969244010@sss.pgh.pa.us
Whole thread Raw
In response to Re: 7.0.2 and 6.5.3 performance  (lec <englim@pc.jaring.my>)
List pgsql-general
lec <englim@pc.jaring.my> writes:
> explain  select itm_section, itm_brand, itm_itemno, sum(stl_qty)
>      from stl, item where stl_itemno = itm_itemno
>         and stl_discounttype in ('','S','V')
>         and stl_trxdate>='Jul 01, 2000' and stl_trxdate<='Jul 02, 2000'
>         and stl_status='N'
>         and itm_sectiontype='O'
>         and itm_section >='101'
>         and itm_section <='135'
>         group by itm_section, itm_brand, itm_itemno;

> The following is the query/explain under 7.0.2:
>    ->  Nested Loop  (cost=0.00..27770.88 rows=1 width=72)
>          ->  Seq Scan on item  (cost=0.00..2549.45 rows=1 width=36)
>          ->  Index Scan using stl_datetrx on stl (cost=0.00..25157.34 rows=5127 width=36)
>
> The following is the query/explain under 6.5.3:
>    ->  Hash Join  (cost=25780.13 rows=55386 width=72)
>          ->  Index Scan using stl_datetrx on stl (cost=15488.75 rows=210658 width=36)
>          ->  Hash  (cost=3225.66 rows=2879 width=36)
>                ->  Seq Scan on item  (cost=3225.66 rows=2879 width=36)

What's driving the difference in plan choice is evidently the much
smaller number of rows that 7.0 estimates will match the query
restrictions.  Specifically, it's figuring that only one item row
will match
        and itm_sectiontype='O'
        and itm_section >='101'
        and itm_section <='135'
whereas 6.5 estimates 2879.  With only one row to match, it figures it
might as well just do the join with a nested loop and not bother to set
up a hashtable.

Since you're complaining, I suppose this statistical estimate is way
off :-(.  What do you actually get from

     select count(*) from stl where
            stl_discounttype in ('','S','V')
        and stl_trxdate>='Jul 01, 2000' and stl_trxdate<='Jul 02, 2000'
        and stl_status='N';

     select count(*) from item where
            itm_sectiontype='O'
        and itm_section >='101'
        and itm_section <='135';
?

I suspect the reason for the difference in estimated row counts is
that 7.0 recognizes that these are range queries; instead of treating
"stl_trxdate>='Jul 01, 2000' and stl_trxdate<='Jul 02, 2000'" as two
independent clauses, or "itm_section >='101' and itm_section <='135'"
as independent clauses, it combines those clauses and tries to estimate
the fraction of the data range being requested.  This is usually a win
but it's possible to lose rather badly if the requested range covers
a spike in the data's frequency distribution.  We don't yet have
adequate statistics to know that a spike may be involved...

            regards, tom lane

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Large Objects Across a Network
Next
From: "Robert D Oden, IS Director"
Date:
Subject: Re: Regression test failures