Home > mailing lists

Re: Optimizer produces wildly different row count estimate depending on casts - Mailing list pgsql-general

From	Tom Lane
Subject	Re: Optimizer produces wildly different row count estimate depending on casts
Date	March 8, 2004 18:14:15
Msg-id	16813.1078784050@sss.pgh.pa.us Whole thread Raw
In response to	Optimizer produces wildly different row count estimate depending on casts (Shelby Cain <alyandon@yahoo.com>)
Responses	Re: Optimizer produces wildly different row count estimate depending on casts
List	pgsql-general

Tree view

Shelby Cain <alyandon@yahoo.com> writes:
> Consider the following queries:

> #1) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > '07-FEB-2004' and
> txn_date <= current_date group by txn_code;

> #2) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > current_date - 30 and
> txn_date <= current_date group by txn_code;

> For the queries above, the optimizer correctly
> identified that using the index is a big gain over a
> sequential scan and the estimated vs actual row counts
> are within striking distance of one another.

The optimizer has no idea about the selectivity of the clauses involving
current_date, since it hasn't got a constant value to compare to the
statistical histogram.  (In the case of expressions involving
current_date and similar functions, it might be reasonable to compute
the current value and use that as an estimate, but I'm unsure of the
conditions under which that's safe.  In any case there is no such code
at present.)  So for query #1 we have a hopefully-pretty-good estimate
for "txn_date > '07-FEB-2004'", and then we knock that down by an
arbitrary percentage because we don't have a clue about "txn_date <=
current_date"; which is why the estimate is too small.  In query #2 we
are, plain and simply, guessing.  The optimizer can however see that
this is a range constraint on txn_date, and the default guess in such
cases is chosen to favor an indexscan.

> #3) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date group by
> txn_code;

The first clause is not indexable (before 7.5) because the effective
meaning is txn_date::timestamp > something, and your index is not on
txn_date::timestamp.  Indexing using only the second clause would be
a loser and the planner won't do it.  What's more, this is no longer
obviously a range limitation, because the constrained variables aren't
the same.  So the estimate is pretty stupid --- it assumes the two WHERE
clauses are independent.

> #4) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date::timestamp
> group by txn_code;

Neither clause is indexable, but at least the planner can see that it's
a range constraint on txn_date::timestamp, so you get a smaller
guesstimate.

            regards, tom lane

pgsql-general by date:

From: "scott.marlowe"
Date: 08 March 2004, 17:56:03
Subject: Re: creating sequential timestamp

From: Shelby Cain
Date: 08 March 2004, 19:57:53
Subject: Re: Optimizer produces wildly different row count estimate depending on casts

Re: Optimizer produces wildly different row count estimate depending on casts - Mailing list pgsql-general

Previous

Next