Thread: Optimizer produces wildly different row count estimate depending on casts
Standard cygwin Postgresql 7.4.1 package on Windows 2000 box. I have a relatively small table (about 1 million records) that contains inventory transaction data with an index on transaction date (date datatype). Since the data was inserted chronologically, the transaction date column is strongly correlated with the physical row ordering. As a result, I created a non-unique index on the date column in order to see how much of a performance boost I could get for queries that filter over a range of dates. My initial problem was the typical newbie issue of convincing the optimizer to use the index I had built on the transaction date. I had forgotten that Postgresql is very picky about data types when utilizing indexes so it didn't take me long to figure out how to rewrite my query in order to take advantage of the index. However, as a result of my experimentation I've run across some puzzling results where the optimizer returned wildly inaccurate row count estimates depending on how I cast my date interval. Consider the following queries: #1) explain analyze select txn_code, count(*) from inventory_txns where txn_date > '07-FEB-2004' and txn_date <= current_date group by txn_code; HashAggregate (cost=1643.56..1643.56 rows=1 width=8) (actual time=731.000..731.000 rows=11 loops=1) -> Index Scan using idx_inventory_txns_txndate on inventory_txns (cost=0.00..1619.94 rows=4723 width=8) (actual time=76.000..709.000 rows=8425 loops=1) Index Cond: ((txn_date > '2004-02-07'::date) AND (txn_date <= ('now'::text)::date)) Total runtime: 742.000 ms #2) explain analyze select txn_code, count(*) from inventory_txns where txn_date > current_date - 30 and txn_date <= current_date group by txn_code; HashAggregate (cost=1667.17..1667.17 rows=1 width=8) (actual time=44.000..44.000 rows=11 loops=1) -> Index Scan using idx_inventory_txns_txndate on inventory_txns (cost=0.00..1643.55 rows=4723 width=8) (actual time=0.000..27.000 rows=8425 loops=1) Index Cond: ((txn_date > (('now'::text)::date - 30)) AND (txn_date <= ('now'::text)::date)) Total runtime: 44.000 ms For the queries above, the optimizer correctly identified that using the index is a big gain over a sequential scan and the estimated vs actual row counts are within striking distance of one another. Now, lets see what happens when I intentionally use casts that prevent the optimizer from utilizing the index on txn_date: #3) explain analyze select txn_code, count(*) from inventory_txns where txn_date > (current_date - 30)::timestamp and txn_date <= current_date group by txn_code; HashAggregate (cost=66035.52..66035.53 rows=5 width=8) (actual time=9149.000..9149.000 rows=11 loops=1) -> Seq Scan on inventory_txns (cost=0.00..65510.82 rows=104941 width=8) (actual time=56.000..9125.000 rows=8425 loops=1) Filter: (((txn_date)::timestamp without time zone > ((('now'::text)::date - 30))::timestamp without time zone) AND (txn_date <= ('now'::text)::date)) Total runtime: 9149.000 ms #4) explain analyze select txn_code, count(*) from inventory_txns where txn_date > (current_date - 30)::timestamp and txn_date <= current_date::timestamp group by txn_code; HashAggregate (cost=70256.76..70256.76 rows=1 width=8) (actual time=10193.000..10193.000 rows=11 loops=1) -> Seq Scan on inventory_txns (cost=0.00..70233.15 rows=4723 width=8) (actual time=14.000..10178.000 rows=8425 loops=1) Filter: (((txn_date)::timestamp without time zone > ((('now'::text)::date - 30))::timestamp without time zone) AND ((txn_date)::timestamp without time zone <= (('now'::text)::date)::timestamp without time zone)) Total runtime: 10194.000 ms So finally, my question to the group is why did my intentional cast on #3 throw the optimizer row count estimate so far off but not for #4 even though both (correctly) ignored the index on txn_date and opted for a sequential scan instead? The txn_date column has statsistics set to 500 and these results are after a full vacuum and analyze. Regards, Shelby Cain __________________________________ Do you Yahoo!? Yahoo! Search - Find what you�re looking for faster http://search.yahoo.com
Shelby Cain <alyandon@yahoo.com> writes: > Consider the following queries: > #1) explain analyze select txn_code, count(*) from > inventory_txns where txn_date > '07-FEB-2004' and > txn_date <= current_date group by txn_code; > #2) explain analyze select txn_code, count(*) from > inventory_txns where txn_date > current_date - 30 and > txn_date <= current_date group by txn_code; > For the queries above, the optimizer correctly > identified that using the index is a big gain over a > sequential scan and the estimated vs actual row counts > are within striking distance of one another. The optimizer has no idea about the selectivity of the clauses involving current_date, since it hasn't got a constant value to compare to the statistical histogram. (In the case of expressions involving current_date and similar functions, it might be reasonable to compute the current value and use that as an estimate, but I'm unsure of the conditions under which that's safe. In any case there is no such code at present.) So for query #1 we have a hopefully-pretty-good estimate for "txn_date > '07-FEB-2004'", and then we knock that down by an arbitrary percentage because we don't have a clue about "txn_date <= current_date"; which is why the estimate is too small. In query #2 we are, plain and simply, guessing. The optimizer can however see that this is a range constraint on txn_date, and the default guess in such cases is chosen to favor an indexscan. > #3) explain analyze select txn_code, count(*) from > inventory_txns where txn_date > (current_date - > 30)::timestamp and txn_date <= current_date group by > txn_code; The first clause is not indexable (before 7.5) because the effective meaning is txn_date::timestamp > something, and your index is not on txn_date::timestamp. Indexing using only the second clause would be a loser and the planner won't do it. What's more, this is no longer obviously a range limitation, because the constrained variables aren't the same. So the estimate is pretty stupid --- it assumes the two WHERE clauses are independent. > #4) explain analyze select txn_code, count(*) from > inventory_txns where txn_date > (current_date - > 30)::timestamp and txn_date <= current_date::timestamp > group by txn_code; Neither clause is indexable, but at least the planner can see that it's a range constraint on txn_date::timestamp, so you get a smaller guesstimate. regards, tom lane
Re: Optimizer produces wildly different row count estimate depending on casts
From
Shelby Cain
Date:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote: > The optimizer has no idea about the selectivity of > the clauses involving > current_date, since it hasn't got a constant value > to compare to the > statistical histogram. (In the case of expressions > involving > current_date and similar functions, it might be > reasonable to compute > the current value and use that as an estimate, but > I'm unsure of the > conditions under which that's safe. In any case > there is no such code > at present.) So for query #1 we have a > hopefully-pretty-good estimate > for "txn_date > '07-FEB-2004'", and then we knock > that down by an > arbitrary percentage because we don't have a clue > about "txn_date <= > current_date"; which is why the estimate is too > small. In query #2 we > are, plain and simply, guessing. The optimizer can > however see that > this is a range constraint on txn_date, and the > default guess in such > cases is chosen to favor an indexscan. > Interesting. Thanks for the clarification. > > Neither clause is indexable, but at least the > planner can see that it's > a range constraint on txn_date::timestamp, so you > get a smaller > guesstimate. > That makes 100% sense. Thanks again. Regards, Shelby Cain __________________________________ Do you Yahoo!? Yahoo! Search - Find what you�re looking for faster http://search.yahoo.com