Thread: Optimizer produces wildly different row count estimate depending on casts

Optimizer produces wildly different row count estimate depending on casts

From
Shelby Cain
Date:
Standard cygwin Postgresql 7.4.1 package on Windows
2000 box.

I have a relatively small table (about 1 million
records) that contains inventory transaction data with
an index on transaction date (date datatype).  Since
the data was inserted chronologically, the transaction
date column is strongly correlated with the physical
row ordering.  As a result, I created a non-unique
index on the date column in order to see how much of a
performance boost I could get for queries that filter
over a range of dates.

My initial problem was the typical newbie issue of
convincing the optimizer to use the index I had built
on the transaction date.  I had forgotten that
Postgresql is very picky about data types when
utilizing indexes so it didn't take me long to figure
out how to rewrite my query in order to take advantage
of the index.  However, as a result of my
experimentation I've run across some puzzling results
where the optimizer returned wildly inaccurate row
count estimates depending on how I cast my date
interval.

Consider the following queries:

#1) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > '07-FEB-2004' and
txn_date <= current_date group by txn_code;

HashAggregate  (cost=1643.56..1643.56 rows=1 width=8)
(actual time=731.000..731.000 rows=11 loops=1)
   ->  Index Scan using idx_inventory_txns_txndate on
inventory_txns  (cost=0.00..1619.94 rows=4723 width=8)
(actual time=76.000..709.000 rows=8425 loops=1)
         Index Cond: ((txn_date > '2004-02-07'::date)
AND (txn_date <= ('now'::text)::date))
 Total runtime: 742.000 ms


#2) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > current_date - 30 and
txn_date <= current_date group by txn_code;

 HashAggregate  (cost=1667.17..1667.17 rows=1 width=8)
(actual time=44.000..44.000 rows=11 loops=1)
   ->  Index Scan using idx_inventory_txns_txndate on
inventory_txns  (cost=0.00..1643.55 rows=4723 width=8)
(actual time=0.000..27.000 rows=8425 loops=1)
         Index Cond: ((txn_date > (('now'::text)::date
- 30)) AND (txn_date <= ('now'::text)::date))
 Total runtime: 44.000 ms

For the queries above, the optimizer correctly
identified that using the index is a big gain over a
sequential scan and the estimated vs actual row counts
are within striking distance of one another.

Now, lets see what happens when I intentionally use
casts that prevent the optimizer from utilizing the
index on txn_date:

#3) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > (current_date -
30)::timestamp and txn_date <= current_date group by
txn_code;

 HashAggregate  (cost=66035.52..66035.53 rows=5
width=8) (actual time=9149.000..9149.000 rows=11
loops=1)
   ->  Seq Scan on inventory_txns
(cost=0.00..65510.82 rows=104941 width=8) (actual
time=56.000..9125.000 rows=8425 loops=1)
         Filter: (((txn_date)::timestamp without time
zone > ((('now'::text)::date - 30))::timestamp without
time zone) AND (txn_date <= ('now'::text)::date))
 Total runtime: 9149.000 ms

#4) explain analyze select txn_code, count(*) from
inventory_txns where txn_date > (current_date -
30)::timestamp and txn_date <= current_date::timestamp
group by txn_code;

 HashAggregate  (cost=70256.76..70256.76 rows=1
width=8) (actual time=10193.000..10193.000 rows=11
loops=1)
   ->  Seq Scan on inventory_txns
(cost=0.00..70233.15 rows=4723 width=8) (actual
time=14.000..10178.000 rows=8425 loops=1)
         Filter: (((txn_date)::timestamp without time
zone > ((('now'::text)::date - 30))::timestamp without
time zone) AND ((txn_date)::timestamp without time
zone <= (('now'::text)::date)::timestamp without time
zone))
 Total runtime: 10194.000 ms

So finally, my question to the group is why did my
intentional cast on #3 throw the optimizer row count
estimate so far off but not for #4 even though both
(correctly) ignored the index on txn_date and opted
for a sequential scan instead?  The txn_date column
has statsistics set to 500 and these results are after
a full vacuum and analyze.

Regards,

Shelby Cain


__________________________________
Do you Yahoo!?
Yahoo! Search - Find what you�re looking for faster
http://search.yahoo.com

Shelby Cain <alyandon@yahoo.com> writes:
> Consider the following queries:

> #1) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > '07-FEB-2004' and
> txn_date <= current_date group by txn_code;

> #2) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > current_date - 30 and
> txn_date <= current_date group by txn_code;

> For the queries above, the optimizer correctly
> identified that using the index is a big gain over a
> sequential scan and the estimated vs actual row counts
> are within striking distance of one another.

The optimizer has no idea about the selectivity of the clauses involving
current_date, since it hasn't got a constant value to compare to the
statistical histogram.  (In the case of expressions involving
current_date and similar functions, it might be reasonable to compute
the current value and use that as an estimate, but I'm unsure of the
conditions under which that's safe.  In any case there is no such code
at present.)  So for query #1 we have a hopefully-pretty-good estimate
for "txn_date > '07-FEB-2004'", and then we knock that down by an
arbitrary percentage because we don't have a clue about "txn_date <=
current_date"; which is why the estimate is too small.  In query #2 we
are, plain and simply, guessing.  The optimizer can however see that
this is a range constraint on txn_date, and the default guess in such
cases is chosen to favor an indexscan.

> #3) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date group by
> txn_code;

The first clause is not indexable (before 7.5) because the effective
meaning is txn_date::timestamp > something, and your index is not on
txn_date::timestamp.  Indexing using only the second clause would be
a loser and the planner won't do it.  What's more, this is no longer
obviously a range limitation, because the constrained variables aren't
the same.  So the estimate is pretty stupid --- it assumes the two WHERE
clauses are independent.

> #4) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date::timestamp
> group by txn_code;

Neither clause is indexable, but at least the planner can see that it's
a range constraint on txn_date::timestamp, so you get a smaller
guesstimate.

            regards, tom lane

--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The optimizer has no idea about the selectivity of
> the clauses involving
> current_date, since it hasn't got a constant value
> to compare to the
> statistical histogram.  (In the case of expressions
> involving
> current_date and similar functions, it might be
> reasonable to compute
> the current value and use that as an estimate, but
> I'm unsure of the
> conditions under which that's safe.  In any case
> there is no such code
> at present.)  So for query #1 we have a
> hopefully-pretty-good estimate
> for "txn_date > '07-FEB-2004'", and then we knock
> that down by an
> arbitrary percentage because we don't have a clue
> about "txn_date <=
> current_date"; which is why the estimate is too
> small.  In query #2 we
> are, plain and simply, guessing.  The optimizer can
> however see that
> this is a range constraint on txn_date, and the
> default guess in such
> cases is chosen to favor an indexscan.
>

Interesting.  Thanks for the clarification.

>
> Neither clause is indexable, but at least the
> planner can see that it's
> a range constraint on txn_date::timestamp, so you
> get a smaller
> guesstimate.
>

That makes 100% sense.  Thanks again.

Regards,

Shelby Cain

__________________________________
Do you Yahoo!?
Yahoo! Search - Find what you�re looking for faster
http://search.yahoo.com