Thread: mis-estimation on data-warehouse aggregate creation

mis-estimation on data-warehouse aggregate creation

From

Kris Jurka

Date:

16 November 2004, 09:09:46

I've have a miniature data-warehouse in which I'm trying to rebuild
pre-calcuated aggregate data directly in the database and I'm geting some
poor plans due to a bad mis-estimation of the number of rows involved.

In a standard star schema I have a sales fact table and dimensions
product, customer, and period.  From those dimensions I have created
"shrunken" versions of them that only have productline, salesperson and
month data.  Now I want to rollup the base fact table to a "shrunken"
version with data summed up for these smaller aggregate dimensions.

The idea is to take a sales table (productid, customerid, periodid,
quantity, usdamount) and create a table with the same columns that have
the "id" columns pointing to the matching smaller dimensions and total up
the quantity and usdamount.  Since the shrunken dimension tables have
renumbered ids we look these up by joining on all of the common columns
between the base and shrunken dimensions.  The following query does just
that:

CREATE TABLE shf_sales_by_salesperson_productline_month AS
SELECT SUM(sales.quantity) AS quantity,
    SUM(sales.usdamount) AS usdamount,
    shd_productline.id AS productid,
    shd_month.id AS periodid,
        shd_salesperson.id AS customerid
FROM sales
JOIN (
        SELECT shd_productline.id, product.id AS productid
    FROM product, shd_productline
    WHERE product.productline = shd_productline.productline
        AND product.category = shd_productline.category
        AND product.platform = shd_productline.platform
        ) shd_productline
ON sales.productid = shd_productline.productid
JOIN (
SELECT shd_month.id, period.id AS periodid
        FROM period, shd_month
        WHERE period.monthnumber = shd_month.monthnumber
        AND period.monthname = shd_month.monthname
        AND period.year = shd_month.year
        AND period.monthyear = shd_month.monthyear
        AND period.quarter = shd_month.quarter
        AND period.quarteryear = shd_month.quarteryear
        ) shd_month
ON sales.periodid = shd_month.periodid
JOIN (
        SELECT shd_salesperson.id, customer.id AS customerid
        FROM customer, shd_salesperson
        WHERE customer.salesperson = shd_salesperson.salesperson
        ) shd_salesperson
ON sales.customerid = shd_salesperson.customerid

GROUP BY shd_productline.id, shd_month.id, shd_salesperson.id

This generates the following EXPLAIN ANALYZE plan for the SELECT portion:

 HashAggregate  (cost=32869.33..32869.34 rows=1 width=36) (actual time=475182.855..475188.304 rows=911 loops=1)
   ->  Nested Loop  (cost=377.07..32869.32 rows=1 width=36) (actual time=130.179..464299.167 rows=1232140 loops=1)
         Join Filter: ("outer".salesperson = "inner".salesperson)
         ->  Nested Loop  (cost=377.07..32868.18 rows=1 width=44) (actual time=130.140..411975.760 rows=1232140
loops=1)
               Join Filter: ("outer".customerid = "inner".id)
               ->  Hash Join  (cost=377.07..32864.32 rows=1 width=32) (actual time=130.072..23167.501 rows=1232140
loops=1)
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Hash Join  (cost=194.23..32679.08 rows=375 width=28) (actual time=83.118..14019.802
rows=1232140loops=1) 
                           Hash Cond: ("outer".periodid = "inner".id)
                           ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual
time=0.109..3335.275rows=1232140 loops=1) 
                           ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=81.548..81.548 rows=0 loops=1)
                                 ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.544..72.798
rows=3288loops=1) 
                                       Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname =
"inner".monthname)AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND
("outer".quarter= "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear)) 
                                       ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual
time=0.009..9.960rows=3288 loops=1) 
                                       ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.643..1.643 rows=0
loops=1)
                                             ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual
time=0.079..0.940rows=108 loops=1) 
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=45.431..45.431 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.205..40.216 rows=1932
loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category =
"inner".category)AND ("outer".platform = "inner".platform)) 
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual
time=0.013..6.179rows=1932 loops=1) 
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.199..0.199 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual
time=0.048..0.083rows=13 loops=1) 
               ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.005..0.174 rows=83
loops=1232140)
         ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.004..0.019 rows=6
loops=1232140)
 Total runtime: 475197.372 ms
(25 rows)

Note that the estimated number of input rows to the final HashAggreggate
is 1 while the actual number is 1.2 million.  By rewriting the JOIN
conditions to LEFT JOIN we force the planner to recognize that there will
be a match for every row in the sales table:

 HashAggregate  (cost=74601.88..74644.00 rows=8424 width=36) (actual time=39956.115..39961.507 rows=911 loops=1)
   ->  Hash Left Join  (cost=382.43..59200.13 rows=1232140 width=36) (actual time=140.879..30765.373 rows=1232140
loops=1)
         Hash Cond: ("outer".customerid = "inner".id)
         ->  Hash Left Join  (cost=377.07..40712.67 rows=1232140 width=32) (actual time=136.069..22721.760 rows=1232140
loops=1)
               Hash Cond: ("outer".periodid = "inner".id)
               ->  Hash Left Join  (cost=182.84..34353.99 rows=1232140 width=28) (actual time=50.815..14742.610
rows=1232140loops=1) 
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.099..4490.148
rows=1232140loops=1) 
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=49.114..49.114 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.331..43.662 rows=1932
loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category =
"inner".category)AND ("outer".platform = "inner".platform)) 
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual
time=0.128..11.246rows=1932 loops=1) 
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.200..0.200 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual
time=0.047..0.081rows=13 loops=1) 
               ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=83.651..83.651 rows=0 loops=1)
                     ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.675..74.693 rows=3288 loops=1)
                           Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname =
"inner".monthname)AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND
("outer".quarter= "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear)) 
                           ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126
rows=3288loops=1) 
                           ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.658..1.658 rows=0 loops=1)
                                 ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual
time=0.081..0.947rows=108 loops=1) 
         ->  Hash  (cost=5.15..5.15 rows=83 width=12) (actual time=3.131..3.131 rows=0 loops=1)
               ->  Hash Join  (cost=1.07..5.15 rows=83 width=12) (actual time=1.937..2.865 rows=83 loops=1)
                     Hash Cond: ("outer".salesperson = "inner".salesperson)
                     ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.137..0.437 rows=83
loops=1)
                     ->  Hash  (cost=1.06..1.06 rows=6 width=24) (actual time=0.152..0.152 rows=0 loops=1)
                           ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.045..0.064
rows=6loops=1) 
 Total runtime: 39974.236 ms
(27 rows)

Given better row estimates the resulting plan runs more than ten times
faster.  Why is the planner doing so poorly with estimating the number of
rows returned?  I tried:

SET default_statistics_target = 1000;
VACUUM FULL ANALYZE;

but the results were the same.  This is on 8.0beta4.  Any ideas?

Kris Jurka

Re: mis-estimation on data-warehouse aggregate creation

From

"F. Senault"

Date:

16 November 2004, 10:56:13

Tuesday, November 16, 2004, 10:10:17 AM, you wrote:


>  HashAggregate  (cost=32869.33..32869.34 rows=1 width=36)
                                                ^
>                 (actual time=475182.855..475188.304 rows=911 loops=1)
                                                           ^^^
>    ->  Nested Loop  (cost=377.07..32869.32 rows=1 width=36)
                                                  ^
>                     (actual time=130.179..464299.167 rows=1232140 loops=1)
                                                            ^^^^^^^
Let me guess...  You've never run "analyze" on your tables ?

Fred

Re: mis-estimation on data-warehouse aggregate creation

From

Kris Jurka

Date:

16 November 2004, 19:49:14

On Tue, 16 Nov 2004, F. Senault wrote:

> Let me guess...  You've never run "analyze" on your tables ?
>

No, I have.  I mentioned that I did in my email, but you can also tell by
the exactly correct guesses for some other plan steps:

->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126 rows=3288 loops=1)

Kris Jurka

Re: mis-estimation on data-warehouse aggregate creation

From

Simon Riggs

Date:

16 November 2004, 21:33:28

On Tue, 2004-11-16 at 09:10, Kris Jurka wrote:
>  By rewriting the JOIN
> conditions to LEFT JOIN we force the planner to recognize that there will
> be a match for every row in the sales table:
>

You realise that returns a different answer (or at least it potentially
does, depending upon your data?

>                      ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.675..74.693 rows=3288 loops=1)
>                            Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname =
"inner".monthname)AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND
("outer".quarter= "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear)) 
>                            ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126
rows=3288loops=1) 
>                            ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.658..1.658 rows=0 loops=1)
>                                  ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual
time=0.081..0.947rows=108 loops=1) 

ISTM your trouble starts here ^^^
estimate=1, but rows=3288

The join condition has so many ANDed predicates that we assume that this
will reduce the selectivity considerably. It does not, and so you pay
the cost dearly later on.

In both plans, the trouble starts at this point.

If you pre-build tables that have only a single join column between the
full.oldids and shrunken.renumberedids then this will most likely work
correctly, since the planner will be able to correctly estimate the join
selectivity. i.e. put product.id onto shd_productline ahead of time, so
you can avoid the complex join.

Setting join_collapse_limit lower doesn't look like it would help, since
the plan already shows joining the sub-queries together first.

--
Best Regards, Simon Riggs

Re: mis-estimation on data-warehouse aggregate creation

From

Kris Jurka

Date:

17 November 2004, 08:19:22

On Tue, 16 Nov 2004, Simon Riggs wrote:

> The join condition has so many ANDed predicates that we assume that this
> will reduce the selectivity considerably. It does not, and so you pay
> the cost dearly later on.
>

Yes, that makes a lot of sense.  Without some incredibly good cross-column
statistics there is no way it could expect all of the rows to match.
Thanks for the analysis.

Kris Jurka