Re: Optimizer produces wildly different row count estimate depending on casts - Mailing list pgsql-general

From Tom Lane
Subject Re: Optimizer produces wildly different row count estimate depending on casts
Date
Msg-id 16813.1078784050@sss.pgh.pa.us
Whole thread Raw
In response to Optimizer produces wildly different row count estimate depending on casts  (Shelby Cain <alyandon@yahoo.com>)
Responses Re: Optimizer produces wildly different row count estimate depending on casts
List pgsql-general
Shelby Cain <alyandon@yahoo.com> writes:
> Consider the following queries:

> #1) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > '07-FEB-2004' and
> txn_date <= current_date group by txn_code;

> #2) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > current_date - 30 and
> txn_date <= current_date group by txn_code;

> For the queries above, the optimizer correctly
> identified that using the index is a big gain over a
> sequential scan and the estimated vs actual row counts
> are within striking distance of one another.

The optimizer has no idea about the selectivity of the clauses involving
current_date, since it hasn't got a constant value to compare to the
statistical histogram.  (In the case of expressions involving
current_date and similar functions, it might be reasonable to compute
the current value and use that as an estimate, but I'm unsure of the
conditions under which that's safe.  In any case there is no such code
at present.)  So for query #1 we have a hopefully-pretty-good estimate
for "txn_date > '07-FEB-2004'", and then we knock that down by an
arbitrary percentage because we don't have a clue about "txn_date <=
current_date"; which is why the estimate is too small.  In query #2 we
are, plain and simply, guessing.  The optimizer can however see that
this is a range constraint on txn_date, and the default guess in such
cases is chosen to favor an indexscan.

> #3) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date group by
> txn_code;

The first clause is not indexable (before 7.5) because the effective
meaning is txn_date::timestamp > something, and your index is not on
txn_date::timestamp.  Indexing using only the second clause would be
a loser and the planner won't do it.  What's more, this is no longer
obviously a range limitation, because the constrained variables aren't
the same.  So the estimate is pretty stupid --- it assumes the two WHERE
clauses are independent.

> #4) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date::timestamp
> group by txn_code;

Neither clause is indexable, but at least the planner can see that it's
a range constraint on txn_date::timestamp, so you get a smaller
guesstimate.

            regards, tom lane

pgsql-general by date:

Previous
From: "scott.marlowe"
Date:
Subject: Re: creating sequential timestamp
Next
From: Shelby Cain
Date:
Subject: Re: Optimizer produces wildly different row count estimate depending on casts