mis-estimation on data-warehouse aggregate creation - Mailing list pgsql-performance

From Kris Jurka
Subject mis-estimation on data-warehouse aggregate creation
Date
Msg-id Pine.BSO.4.56.0411160339500.31841@leary.csoft.net
Whole thread Raw
Responses Re: mis-estimation on data-warehouse aggregate creation
Re: mis-estimation on data-warehouse aggregate creation
List pgsql-performance
I've have a miniature data-warehouse in which I'm trying to rebuild
pre-calcuated aggregate data directly in the database and I'm geting some
poor plans due to a bad mis-estimation of the number of rows involved.

In a standard star schema I have a sales fact table and dimensions
product, customer, and period.  From those dimensions I have created
"shrunken" versions of them that only have productline, salesperson and
month data.  Now I want to rollup the base fact table to a "shrunken"
version with data summed up for these smaller aggregate dimensions.

The idea is to take a sales table (productid, customerid, periodid,
quantity, usdamount) and create a table with the same columns that have
the "id" columns pointing to the matching smaller dimensions and total up
the quantity and usdamount.  Since the shrunken dimension tables have
renumbered ids we look these up by joining on all of the common columns
between the base and shrunken dimensions.  The following query does just
that:

CREATE TABLE shf_sales_by_salesperson_productline_month AS
SELECT SUM(sales.quantity) AS quantity,
    SUM(sales.usdamount) AS usdamount,
    shd_productline.id AS productid,
    shd_month.id AS periodid,
        shd_salesperson.id AS customerid
FROM sales
JOIN (
        SELECT shd_productline.id, product.id AS productid
    FROM product, shd_productline
    WHERE product.productline = shd_productline.productline
        AND product.category = shd_productline.category
        AND product.platform = shd_productline.platform
        ) shd_productline
ON sales.productid = shd_productline.productid
JOIN (
SELECT shd_month.id, period.id AS periodid
        FROM period, shd_month
        WHERE period.monthnumber = shd_month.monthnumber
        AND period.monthname = shd_month.monthname
        AND period.year = shd_month.year
        AND period.monthyear = shd_month.monthyear
        AND period.quarter = shd_month.quarter
        AND period.quarteryear = shd_month.quarteryear
        ) shd_month
ON sales.periodid = shd_month.periodid
JOIN (
        SELECT shd_salesperson.id, customer.id AS customerid
        FROM customer, shd_salesperson
        WHERE customer.salesperson = shd_salesperson.salesperson
        ) shd_salesperson
ON sales.customerid = shd_salesperson.customerid

GROUP BY shd_productline.id, shd_month.id, shd_salesperson.id

This generates the following EXPLAIN ANALYZE plan for the SELECT portion:

 HashAggregate  (cost=32869.33..32869.34 rows=1 width=36) (actual time=475182.855..475188.304 rows=911 loops=1)
   ->  Nested Loop  (cost=377.07..32869.32 rows=1 width=36) (actual time=130.179..464299.167 rows=1232140 loops=1)
         Join Filter: ("outer".salesperson = "inner".salesperson)
         ->  Nested Loop  (cost=377.07..32868.18 rows=1 width=44) (actual time=130.140..411975.760 rows=1232140
loops=1)
               Join Filter: ("outer".customerid = "inner".id)
               ->  Hash Join  (cost=377.07..32864.32 rows=1 width=32) (actual time=130.072..23167.501 rows=1232140
loops=1)
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Hash Join  (cost=194.23..32679.08 rows=375 width=28) (actual time=83.118..14019.802
rows=1232140loops=1) 
                           Hash Cond: ("outer".periodid = "inner".id)
                           ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual
time=0.109..3335.275rows=1232140 loops=1) 
                           ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=81.548..81.548 rows=0 loops=1)
                                 ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.544..72.798
rows=3288loops=1) 
                                       Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname =
"inner".monthname)AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND
("outer".quarter= "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear)) 
                                       ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual
time=0.009..9.960rows=3288 loops=1) 
                                       ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.643..1.643 rows=0
loops=1)
                                             ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual
time=0.079..0.940rows=108 loops=1) 
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=45.431..45.431 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.205..40.216 rows=1932
loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category =
"inner".category)AND ("outer".platform = "inner".platform)) 
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual
time=0.013..6.179rows=1932 loops=1) 
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.199..0.199 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual
time=0.048..0.083rows=13 loops=1) 
               ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.005..0.174 rows=83
loops=1232140)
         ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.004..0.019 rows=6
loops=1232140)
 Total runtime: 475197.372 ms
(25 rows)

Note that the estimated number of input rows to the final HashAggreggate
is 1 while the actual number is 1.2 million.  By rewriting the JOIN
conditions to LEFT JOIN we force the planner to recognize that there will
be a match for every row in the sales table:

 HashAggregate  (cost=74601.88..74644.00 rows=8424 width=36) (actual time=39956.115..39961.507 rows=911 loops=1)
   ->  Hash Left Join  (cost=382.43..59200.13 rows=1232140 width=36) (actual time=140.879..30765.373 rows=1232140
loops=1)
         Hash Cond: ("outer".customerid = "inner".id)
         ->  Hash Left Join  (cost=377.07..40712.67 rows=1232140 width=32) (actual time=136.069..22721.760 rows=1232140
loops=1)
               Hash Cond: ("outer".periodid = "inner".id)
               ->  Hash Left Join  (cost=182.84..34353.99 rows=1232140 width=28) (actual time=50.815..14742.610
rows=1232140loops=1) 
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.099..4490.148
rows=1232140loops=1) 
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=49.114..49.114 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.331..43.662 rows=1932
loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category =
"inner".category)AND ("outer".platform = "inner".platform)) 
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual
time=0.128..11.246rows=1932 loops=1) 
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.200..0.200 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual
time=0.047..0.081rows=13 loops=1) 
               ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=83.651..83.651 rows=0 loops=1)
                     ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.675..74.693 rows=3288 loops=1)
                           Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname =
"inner".monthname)AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND
("outer".quarter= "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear)) 
                           ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126
rows=3288loops=1) 
                           ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.658..1.658 rows=0 loops=1)
                                 ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual
time=0.081..0.947rows=108 loops=1) 
         ->  Hash  (cost=5.15..5.15 rows=83 width=12) (actual time=3.131..3.131 rows=0 loops=1)
               ->  Hash Join  (cost=1.07..5.15 rows=83 width=12) (actual time=1.937..2.865 rows=83 loops=1)
                     Hash Cond: ("outer".salesperson = "inner".salesperson)
                     ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.137..0.437 rows=83
loops=1)
                     ->  Hash  (cost=1.06..1.06 rows=6 width=24) (actual time=0.152..0.152 rows=0 loops=1)
                           ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.045..0.064
rows=6loops=1) 
 Total runtime: 39974.236 ms
(27 rows)

Given better row estimates the resulting plan runs more than ten times
faster.  Why is the planner doing so poorly with estimating the number of
rows returned?  I tried:

SET default_statistics_target = 1000;
VACUUM FULL ANALYZE;

but the results were the same.  This is on 8.0beta4.  Any ideas?

Kris Jurka


pgsql-performance by date:

Previous
From: "dentfirst13@earthlink.net"
Date:
Subject: Re: Insertion puzzles
Next
From: "F. Senault"
Date:
Subject: Re: mis-estimation on data-warehouse aggregate creation