Hashaggregate estimates - Mailing list pgsql-performance
From: Jorge Montero
Subject: Hashaggregate estimates
Msg-id: 4B4DEDB5.2E1C.0042.0@homedecorators.com
Responses: Re: Hashaggregate estimates
List: pgsql-performance
I'm having some performance problems in a few sales reports running on Postgres 8.3 on Redhat 4.1.2. The hardware is a bit old, but it performs well enough. The reports are typical sales reporting fare: gather the sales for a time period based on some criteria, aggregate them by product, and then join with a bunch of other tables to display different kinds of product information.
I'll spare you all the pain of looking at the entire queries, since the ultimate issue appears to be the same in each of them: the innermost table of the queries is an inline view that aggregates the data by product. It runs rather quickly, but Postgres underestimates the number of rows that come out of it, making the rest of the query plan rather suboptimal. The inline view looks like this:
select sku_id, sum(rs.price) as dollarsSold, sum(rs.quantity) as units
from reporting.sales rs
where rs.sale_date between ? AND ? group by sku_id
In some cases we see extra conditions besides the dates, but they have the same shape. Barring a massive date range, the rest of the filters are less selective than the date, so Postgres uses an index on (sale_date, sku_id). I have increased the statistics target on sale_date quite a bit to make sure Postgres makes decent row estimates. The problem is in the aggregation:
"HashAggregate (cost=54545.20..54554.83 rows=642 width=24) (actual time=87.945..98.219 rows=11462 loops=1)"
" -> Index Scan using reporting_sales_sale_date_idx on sales rs (cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744 loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time zone))"
"Total runtime: 10.110 ms"
" -> Index Scan using reporting_sales_sale_date_idx on sales rs (cost=0.00..54288.63 rows=34209 width=24) (actual time=0.042..34.194 rows=23744 loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time zone))"
"Total runtime: 10.110 ms"
As you can see, the index scan's estimate is pretty close when I use a single condition, but the aggregate estimate is off by a factor of 20. When I add further conditions, the estimate just gets worse and worse.
"HashAggregate (cost=8894.83..8894.85 rows=1 width=24) (actual time=6.444..6.501 rows=92 loops=1)"
" -> Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs (cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time zone) AND ((sale_channel)::text = 'RETAIL'::text))"
" Filter: ((activity_type)::text = 'RETURN'::text)"
"Total runtime: 6.583 ms"
" -> Index Scan using reporting_sales_sale_date_sku_id_idx on sales rs (cost=0.00..8894.76 rows=9 width=24) (actual time=0.103..6.278 rows=94 loops=1)"
" Index Cond: ((sale_date >= '2009-07-01 00:00:00'::timestamp without time zone) AND (sale_date <= '2009-07-06 00:00:00'::timestamp without time zone) AND ((sale_channel)::text = 'RETAIL'::text))"
" Filter: ((activity_type)::text = 'RETURN'::text)"
"Total runtime: 6.583 ms"
I think I've done what I can when it comes to altering statistics. For example, activity_type and sale_channel have full statistics, and they are fairly independent as filters: if all Postgres did to estimate their combined selectivity was multiply the frequency of each value, the estimates would not be far off.
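For what it's worth, a rough way to check that independence claim over the same date range is a query along these lines (just a sketch run by hand, using the literals from the plans above):

-- compare each filter's selectivity with the combined selectivity over the date range;
-- if the product of the first two is close to the third, the columns really are
-- close to independent (avg() skips rows where the column is null)
select avg((sale_channel = 'RETAIL')::int)                              as sel_channel,
       avg((activity_type = 'RETURN')::int)                             as sel_activity,
       avg((sale_channel = 'RETAIL' and activity_type = 'RETURN')::int) as sel_both
from reporting.sales
where sale_date between '2009-07-01' and '2009-07-06';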
The killer seems to be the row aggregation. There are about 95K distinct values of sku_id in the sales table, and even the best-selling items are a very small percentage of all rows, so expecting the aggregation to consolidate the rows 50:1 like it does in one of the explains above is a pipe dream. I've increased the statistics target on sku_id into the three digits, but the results are not any better.
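The per-column statistics for sku_id are shown below; they can be pulled from the pg_stats view with a query along these lines:

select schemaname, tablename, attname, null_frac, avg_width, n_distinct, most_common_freqs
from pg_stats
where schemaname = 'reporting' and tablename = 'sales' and attname = 'sku_id';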
schemaname;tablename;attname;null_frac;avg_width;n_distinct;most_common_freqs
"reporting";"sales";"sku_id";0;11;58337;"{0.00364167,0.0027125,0.00230417,0.00217083,0.00178333,0.001675,0.00136667,0.00135,0.0012875,0.0011875,...."
Is there any way I can coax Postgres into making a more realistic aggregation estimate? I could just delay the aggregation until after the rest of the data is joined, making the estimate's failure moot, but the price would be quite hefty in some of the reports, which can return 20K products with row widths of over 150, so it's not optimal: a query that right now only has to request 100 aggregated rows could end up requesting 80K.
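To clarify what I mean by delaying the aggregation, the rewrite would look roughly like the sketch below; the product table and its columns are stand-ins for the real product tables:

-- delayed aggregation: join the raw sales rows to the product data first,
-- then aggregate once at the top ("product" and its columns are placeholders)
select p.sku_id, p.description,
       sum(rs.price) as dollarsSold, sum(rs.quantity) as units
from reporting.sales rs
join product p on p.sku_id = rs.sku_id
where rs.sale_date between ? and ?
group by p.sku_id, p.description;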