Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From mlw
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id 3CBDDAD3.D8744B3A@mohawksoft.com
Whole thread Raw
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  (mlw <markw@mohawksoft.com>)
Responses Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Andrew Sullivan <andrew@libertyrms.info>)
Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Andrew Sullivan wrote: 
> You haven't shown anything except a couple of anecdotal reports as
> evidence against his view.  Anyone who asks you for more evidence
> gets treated to a remark that statistics won't do everything in this
> case. 

I do not, currently, have access to systems which exhibit the behavior, but I
have worked with PostgreSQL quite a bit, and have done a number of projects
with it, and have seen the issue first hand and have had to work around it. I
have posted detailed data to this list in the past.

> You'll need something stronger than an antipathy for
> statistical methods to support your position.

I do not have an antipathy for statistics at all, however statistics are a
reduction of data. They represent a number of properties obtained from a larger
group. For "statistics" to be useful, the trends or characteristics you
calculate must apply to the problem.

Oracle has a cost based optimizer, and they allow you to override it, offer
hints as to what it should do, or use the rules based optimizer. They know that
a cost based optimizer can not generate the best query all the time.


>  The heuristic model
> you propose is a fundamental shift from the current attempts to make
> the planner choose better plans on the basis of what's in the
> database.  You're saying it can never know enough.  And I say, prove
> it.

I say it is obvious it can never know enough, since statistics are a summation
of the data set from which they were obtained, thus they can not contain all
the information about that data set unless they are at least as large as the
data set.
> 
> > to one another. This is where heuristics and a bit of fuzziness
> > needs to be applied. Favoring an index scan over a sequential scan
> > would probably generate a better query.
> 
> Tom has argued, several times, with reference to actual cases, why
> that is false.  Repeating your view doesn't make it so.

For some reason, it seems that Tom is under the impression that I am saying
"always use and index" when that is not what I am saying at all.

Here is the logical argument: (correct me if I am wrong)

(1) The table statistics are a summation of the properties of the table which,
among other things, are thought to affect query performance.

(2) The planner uses the statistics to create estimates about how a strategy
will perform.

(3) The "estimates" based on the table statistics have a degree of uncertainty,
because they are based on the statistical information about the table not the
table itself. The only way to be 100% sure you get the best query is to try all
the permutations of the query.

(4) Since the generated estimates have a degree of uncertainty, when multiple
query paths are evaluated, the planner will choose a suboptimal query once in a
while.

Now my argument, based on my personal experience, and based on Tom's own
statement that +- 2 seconds on a 10 second query is not something the gets
excited about, is this:

When the planner is presented with a number of possible plans, it must weigh
the cost estimates. If there is a choice between two plans which are within
some percent range of each other, we can fall into a risk analysis.

For instance: say we have two similarly performing plans, close to one another,
say within 20%, one plan uses an index, and one does not. It is unlikely that
the index plan will perform substantially worse than the non-index plan, right?
That is the point of the cost estimates, right?

Now, given the choice of the two strategies on a table, both pretty close to
one another, the risk of poor performance for using the index scan is minimal
based on the statistics, but the risk of poor performance for using the
sequential scan is quite high on a large table.

Does anyone disagree?


pgsql-hackers by date:

Previous
From: "Rod Taylor"
Date:
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Next
From: Andrew Sullivan
Date:
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE