mis-estimation on data-warehouse aggregate creation - Mailing list pgsql-performance
From:      Kris Jurka
Subject:   mis-estimation on data-warehouse aggregate creation
Msg-id:    Pine.BSO.4.56.0411160339500.31841@leary.csoft.net
Responses: Re: mis-estimation on data-warehouse aggregate creation
           Re: mis-estimation on data-warehouse aggregate creation
List:      pgsql-performance
I have a miniature data warehouse in which I'm trying to rebuild pre-calculated aggregate data directly in the database, and I'm getting some poor plans due to a bad mis-estimation of the number of rows involved.

In a standard star schema I have a sales fact table and dimensions product, customer, and period. From those dimensions I have created "shrunken" versions that hold only productline, salesperson, and month data. Now I want to roll up the base fact table to a "shrunken" version with data summed up for these smaller aggregate dimensions. The idea is to take a sales table (productid, customerid, periodid, quantity, usdamount) and create a table with the same columns, where the "id" columns point to the matching smaller dimensions and quantity and usdamount are totaled. Since the shrunken dimension tables have renumbered ids, we look these up by joining on all of the common columns between the base and shrunken dimensions. The following query does just that:

CREATE TABLE shf_sales_by_salesperson_productline_month AS
SELECT SUM(sales.quantity) AS quantity,
       SUM(sales.usdamount) AS usdamount,
       shd_productline.id AS productid,
       shd_month.id AS periodid,
       shd_salesperson.id AS customerid
FROM sales
JOIN (
    SELECT shd_productline.id, product.id AS productid
    FROM product, shd_productline
    WHERE product.productline = shd_productline.productline
      AND product.category = shd_productline.category
      AND product.platform = shd_productline.platform
) shd_productline ON sales.productid = shd_productline.productid
JOIN (
    SELECT shd_month.id, period.id AS periodid
    FROM period, shd_month
    WHERE period.monthnumber = shd_month.monthnumber
      AND period.monthname = shd_month.monthname
      AND period.year = shd_month.year
      AND period.monthyear = shd_month.monthyear
      AND period.quarter = shd_month.quarter
      AND period.quarteryear = shd_month.quarteryear
) shd_month ON sales.periodid = shd_month.periodid
JOIN (
    SELECT shd_salesperson.id, customer.id AS customerid
    FROM customer, shd_salesperson
    WHERE customer.salesperson = shd_salesperson.salesperson
) shd_salesperson ON sales.customerid = shd_salesperson.customerid
GROUP BY shd_productline.id, shd_month.id, shd_salesperson.id;

This generates the following EXPLAIN ANALYZE plan for the SELECT portion:

 HashAggregate  (cost=32869.33..32869.34 rows=1 width=36) (actual time=475182.855..475188.304 rows=911 loops=1)
   ->  Nested Loop  (cost=377.07..32869.32 rows=1 width=36) (actual time=130.179..464299.167 rows=1232140 loops=1)
         Join Filter: ("outer".salesperson = "inner".salesperson)
         ->  Nested Loop  (cost=377.07..32868.18 rows=1 width=44) (actual time=130.140..411975.760 rows=1232140 loops=1)
               Join Filter: ("outer".customerid = "inner".id)
               ->  Hash Join  (cost=377.07..32864.32 rows=1 width=32) (actual time=130.072..23167.501 rows=1232140 loops=1)
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Hash Join  (cost=194.23..32679.08 rows=375 width=28) (actual time=83.118..14019.802 rows=1232140 loops=1)
                           Hash Cond: ("outer".periodid = "inner".id)
                           ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.109..3335.275 rows=1232140 loops=1)
                           ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=81.548..81.548 rows=0 loops=1)
                                 ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.544..72.798 rows=3288 loops=1)
                                       Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname = "inner".monthname) AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND ("outer".quarter = "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear))
                                       ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.009..9.960 rows=3288 loops=1)
                                       ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.643..1.643 rows=0 loops=1)
                                             ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual time=0.079..0.940 rows=108 loops=1)
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=45.431..45.431 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.205..40.216 rows=1932 loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category = "inner".category) AND ("outer".platform = "inner".platform))
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual time=0.013..6.179 rows=1932 loops=1)
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.199..0.199 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual time=0.048..0.083 rows=13 loops=1)
               ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.005..0.174 rows=83 loops=1232140)
         ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.004..0.019 rows=6 loops=1232140)
 Total runtime: 475197.372 ms
(25 rows)

Note that the estimated number of input rows to the final HashAggregate is 1 while the actual number is 1.2 million. By rewriting the JOIN conditions to LEFT JOIN (the rewritten query is sketched at the end of this message) we force the planner to recognize that there will be a match for every row in the sales table:

 HashAggregate  (cost=74601.88..74644.00 rows=8424 width=36) (actual time=39956.115..39961.507 rows=911 loops=1)
   ->  Hash Left Join  (cost=382.43..59200.13 rows=1232140 width=36) (actual time=140.879..30765.373 rows=1232140 loops=1)
         Hash Cond: ("outer".customerid = "inner".id)
         ->  Hash Left Join  (cost=377.07..40712.67 rows=1232140 width=32) (actual time=136.069..22721.760 rows=1232140 loops=1)
               Hash Cond: ("outer".periodid = "inner".id)
               ->  Hash Left Join  (cost=182.84..34353.99 rows=1232140 width=28) (actual time=50.815..14742.610 rows=1232140 loops=1)
                     Hash Cond: ("outer".productid = "inner".id)
                     ->  Seq Scan on sales  (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.099..4490.148 rows=1232140 loops=1)
                     ->  Hash  (cost=182.18..182.18 rows=265 width=12) (actual time=49.114..49.114 rows=0 loops=1)
                           ->  Hash Join  (cost=1.23..182.18 rows=265 width=12) (actual time=1.331..43.662 rows=1932 loops=1)
                                 Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category = "inner".category) AND ("outer".platform = "inner".platform))
                                 ->  Seq Scan on product  (cost=0.00..149.32 rows=1932 width=32) (actual time=0.128..11.246 rows=1932 loops=1)
                                 ->  Hash  (cost=1.13..1.13 rows=13 width=45) (actual time=0.200..0.200 rows=0 loops=1)
                                       ->  Seq Scan on shd_productline  (cost=0.00..1.13 rows=13 width=45) (actual time=0.047..0.081 rows=13 loops=1)
               ->  Hash  (cost=194.23..194.23 rows=1 width=12) (actual time=83.651..83.651 rows=0 loops=1)
                     ->  Hash Join  (cost=4.70..194.23 rows=1 width=12) (actual time=2.675..74.693 rows=3288 loops=1)
                           Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname = "inner".monthname) AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND ("outer".quarter = "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear))
                           ->  Seq Scan on period  (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126 rows=3288 loops=1)
                           ->  Hash  (cost=3.08..3.08 rows=108 width=58) (actual time=1.658..1.658 rows=0 loops=1)
                                 ->  Seq Scan on shd_month  (cost=0.00..3.08 rows=108 width=58) (actual time=0.081..0.947 rows=108 loops=1)
         ->  Hash  (cost=5.15..5.15 rows=83 width=12) (actual time=3.131..3.131 rows=0 loops=1)
               ->  Hash Join  (cost=1.07..5.15 rows=83 width=12) (actual time=1.937..2.865 rows=83 loops=1)
                     Hash Cond: ("outer".salesperson = "inner".salesperson)
                     ->  Seq Scan on customer  (cost=0.00..2.83 rows=83 width=20) (actual time=0.137..0.437 rows=83 loops=1)
                     ->  Hash  (cost=1.06..1.06 rows=6 width=24) (actual time=0.152..0.152 rows=0 loops=1)
                           ->  Seq Scan on shd_salesperson  (cost=0.00..1.06 rows=6 width=24) (actual time=0.045..0.064 rows=6 loops=1)
 Total runtime: 39974.236 ms
(27 rows)

Given better row estimates, the resulting plan runs more than ten times faster. Why is the planner doing so poorly at estimating the number of rows returned? I tried:

SET default_statistics_target = 1000;
VACUUM FULL ANALYZE;

but the results were the same. This is on 8.0beta4. Any ideas?

Kris Jurka
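P.S. In case the setup matters: the shrunken dimensions are distinct rollups of the base dimensions with their own renumbered ids, which is why the query above has to join back on all the shared columns. For illustration only, a shrunken dimension like shd_productline can be derived from product with something along these lines (a sketch; the sequence-based id generation here is an assumption, not necessarily how my tables were actually created):

-- Sketch: build a "shrunken" productline dimension from the product dimension.
-- The id column comes from a new sequence; this naming is assumed for illustration.
CREATE SEQUENCE shd_productline_id_seq;

CREATE TABLE shd_productline AS
SELECT nextval('shd_productline_id_seq') AS id,
       productline, category, platform
FROM (SELECT DISTINCT productline, category, platform
      FROM product) AS distinct_productlines;

shd_month and shd_salesperson follow the same pattern from period and customer.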
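For reference, here is a sketch of the LEFT JOIN rewrite mentioned above, reconstructed from the second plan; only the JOIN keywords change from the original query:

CREATE TABLE shf_sales_by_salesperson_productline_month AS
SELECT SUM(sales.quantity) AS quantity,
       SUM(sales.usdamount) AS usdamount,
       shd_productline.id AS productid,
       shd_month.id AS periodid,
       shd_salesperson.id AS customerid
FROM sales
LEFT JOIN (
    SELECT shd_productline.id, product.id AS productid
    FROM product, shd_productline
    WHERE product.productline = shd_productline.productline
      AND product.category = shd_productline.category
      AND product.platform = shd_productline.platform
) shd_productline ON sales.productid = shd_productline.productid
LEFT JOIN (
    SELECT shd_month.id, period.id AS periodid
    FROM period, shd_month
    WHERE period.monthnumber = shd_month.monthnumber
      AND period.monthname = shd_month.monthname
      AND period.year = shd_month.year
      AND period.monthyear = shd_month.monthyear
      AND period.quarter = shd_month.quarter
      AND period.quarteryear = shd_month.quarteryear
) shd_month ON sales.periodid = shd_month.periodid
LEFT JOIN (
    SELECT shd_salesperson.id, customer.id AS customerid
    FROM customer, shd_salesperson
    WHERE customer.salesperson = shd_salesperson.salesperson
) shd_salesperson ON sales.customerid = shd_salesperson.customerid
GROUP BY shd_productline.id, shd_month.id, shd_salesperson.id;

Since every sales row has a matching row in each derived table, the LEFT JOIN produces the same result set as the inner JOIN; only the planner's row estimates differ.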
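In case it matters for any suggestions: raising the statistics target on just the join keys, rather than globally, would look something like the following (a sketch; I'm guessing at which columns are the relevant ones):

-- Sketch: raise the per-column statistics target on the fact table's join keys,
-- then gather fresh statistics. The choice of columns here is an assumption.
ALTER TABLE sales ALTER COLUMN productid SET STATISTICS 1000;
ALTER TABLE sales ALTER COLUMN periodid SET STATISTICS 1000;
ALTER TABLE sales ALTER COLUMN customerid SET STATISTICS 1000;
ANALYZE sales;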