Optimizer produces wildly different row count estimate depending on casts - Mailing list pgsql-general

From Shelby Cain
Subject Optimizer produces wildly different row count estimate depending on casts
Date
Msg-id 20040308213004.2218.qmail@web41604.mail.yahoo.com
Whole thread Raw
Responses Re: Optimizer produces wildly different row count estimate depending on casts
List pgsql-general
Standard cygwin Postgresql 7.4.1 package on Windows
2000 box.

I have a relatively small table (about 1 million
records) that contains inventory transaction data with
an index on transaction date (date datatype).  Since
the data was inserted chronologically, the transaction
date column is strongly correlated with the physical
row ordering.  As a result, I created a non-unique
index on the date column in order to see how much of a
performance boost I could get for queries that filter
over a range of dates.

My initial problem was the typical newbie issue of
convincing the optimizer to use the index I had built
on the transaction date.  I had forgotten that
Postgresql is very picky about data types when
utilizing indexes so it didn't take me long to figure
out how to rewrite my query in order to take advantage
of the index.  However, as a result of my
experimentation I've run across some puzzling results
where the optimizer returned wildly inaccurate row
count estimates depending on how I cast my date
interval.

Consider the following queries:

#1) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > '07-FEB-2004' and
txn_date <= current_date group by txn_code;

HashAggregate  (cost=1643.56..1643.56 rows=1 width=8)
(actual time=731.000..731.000 rows=11 loops=1)
   ->  Index Scan using idx_inventory_txns_txndate on
inventory_txns  (cost=0.00..1619.94 rows=4723 width=8)
(actual time=76.000..709.000 rows=8425 loops=1)
         Index Cond: ((txn_date > '2004-02-07'::date)
AND (txn_date <= ('now'::text)::date))
 Total runtime: 742.000 ms


#2) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > current_date - 30 and
txn_date <= current_date group by txn_code;

 HashAggregate  (cost=1667.17..1667.17 rows=1 width=8)
(actual time=44.000..44.000 rows=11 loops=1)
   ->  Index Scan using idx_inventory_txns_txndate on
inventory_txns  (cost=0.00..1643.55 rows=4723 width=8)
(actual time=0.000..27.000 rows=8425 loops=1)
         Index Cond: ((txn_date > (('now'::text)::date
- 30)) AND (txn_date <= ('now'::text)::date))
 Total runtime: 44.000 ms

For the queries above, the optimizer correctly
identified that using the index is a big gain over a
sequential scan and the estimated vs actual row counts
are within striking distance of one another.

Now, lets see what happens when I intentionally use
casts that prevent the optimizer from utilizing the
index on txn_date:

#3) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > (current_date -
30)::timestamp and txn_date <= current_date group by
txn_code;

 HashAggregate  (cost=66035.52..66035.53 rows=5
width=8) (actual time=9149.000..9149.000 rows=11
loops=1)
   ->  Seq Scan on inventory_txns
(cost=0.00..65510.82 rows=104941 width=8) (actual
time=56.000..9125.000 rows=8425 loops=1)
         Filter: (((txn_date)::timestamp without time
zone > ((('now'::text)::date - 30))::timestamp without
time zone) AND (txn_date <= ('now'::text)::date))
 Total runtime: 9149.000 ms

#4) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > (current_date -
30)::timestamp and txn_date <= current_date::timestamp
group by txn_code;

 HashAggregate  (cost=70256.76..70256.76 rows=1
width=8) (actual time=10193.000..10193.000 rows=11
loops=1)
   ->  Seq Scan on inventory_txns
(cost=0.00..70233.15 rows=4723 width=8) (actual
time=14.000..10178.000 rows=8425 loops=1)
         Filter: (((txn_date)::timestamp without time
zone > ((('now'::text)::date - 30))::timestamp without
time zone) AND ((txn_date)::timestamp without time
zone <= (('now'::text)::date)::timestamp without time
zone))
 Total runtime: 10194.000 ms

So finally, my question to the group is why did my
intentional cast on #3 throw the optimizer row count
estimate so far off but not for #4 even though both
(correctly) ignored the index on txn_date and opted
for a sequential scan instead?  The txn_date column
has statsistics set to 500 and these results are after
a full vacuum and analyze.

Regards,

Shelby Cain


__________________________________
Do you Yahoo!?
Yahoo! Search - Find what you�re looking for faster
http://search.yahoo.com

pgsql-general by date:

Previous
From: javier garcia - CEBAS
Date:
Subject: creating sequential timestamp
Next
From: Joe Conway
Date:
Subject: Re: creating sequential timestamp