Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From Thomas Lockhart
Subject Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date
Msg-id 3CBD837E.D03E45C@fourpalms.org
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Index Scans become Seq Scans after VACUUM ANALYSE  (Lincoln Yeoh <lyeoh@pop.jaring.my>)
List pgsql-hackers
> Oh, come on Tom, surely I have been around long enough to lend credence that I
> wish to have a positive effect on PostgreSQL development.

:) Tom does have a way with words sometimes, eh?

> enable_seqscan=0, disallows sequential scan, that is not what I am saying. This
> is a problem I (and others) have been yapping about for a long time.
> I just think there is sufficient evidence to suggest that if a DBA creates an
> index, there is strong evidence (better than statistics) that the index need be
> used. In the event that an index exists, there is a strong indication that,
> without overwhelming evidence, the index should be used. You have admitted
> that statistics suck, but the existence of an index must weigh (heavily) on
> the evaluation on whether or not to use an index.

Tom is a mathematician by training, and is trying to balance the
optimizer decisions right on the transition between best and next-best
possibility. Biasing it to one decision or another when all of his test
cases clearly show the *other* choice would be better puts it in the
realm of an arbitrary choice *not* supported by the data!

afaict there are *two* areas which might benefit from analysis or
adjustments, but they have to be supported by real test cases.

1) the cost estimates used for each property of the data and storage.

2) the statistical sampling done on actual data during analysis.

The cost numbers have been tested on what is hopefully a representative
set of machines. It may be possible to have a test suite which would
allow folks to run the same data on many different platforms, and to
contribute other test cases for consideration. Perhaps you would like to
organize a test suite? Tom, do you already have cases you would like to
see in this test?
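
To make that concrete (this example is mine, not something from the thread,
and the table and column names are made up): a shared test case could create
a small schema, pin the cost parameters explicitly, load the same data
everywhere, and show what the planner picks.

  -- hypothetical test case; the SET names are the standard cost knobs
  CREATE TABLE orders (id serial, customer_id integer, amount numeric);
  CREATE INDEX orders_customer_idx ON orders (customer_id);
  -- ... load the agreed-upon data here ...
  VACUUM ANALYZE orders;

  SET random_page_cost = 4.0;        -- nonsequential page fetch, relative to sequential
  SET cpu_tuple_cost = 0.01;         -- CPU cost per tuple processed
  SET cpu_index_tuple_cost = 0.001;  -- CPU cost per index tuple processed
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

Re-running the same script with different SET values (or on different
hardware) would show where the seqscan/indexscan crossover sits.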

The statistical sampling (using somewhere between 1 and 10% of the data
afaicr) *could* get fooled by pathological storage topologies. So for
cases which seem to have reached the "wrong" conclusion we should show
*what* values for the above would have arrived at the correct result
*without* biasing other potential results in the wrong direction.
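
As an illustration of that (again my own sketch, reusing the hypothetical
table above): the per-column statistics target can be raised above the
default of 10 so that ANALYZE samples more rows, which helps separate "the
sample was fooled by the storage layout" from "the cost estimates are off".

  -- raise the sampling target for one column and resample
  ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
  ANALYZE orders;
  -- see what the planner now believes about that column
  SELECT attname, n_distinct, correlation
    FROM pg_stats
   WHERE tablename = 'orders' AND attname = 'customer_id';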

Systems which have optimizing planners can *never* be guaranteed to
generate the actual lowest-cost query plan. Any impression that Oracle,
for example, actually does do that may come from a lack of visibility
into the process, and a lack of forum for discussing these edge cases.

Please don't take Tom's claim that he doesn't get excited about planner
choices which are wrong, but not too wrong, as an indication that he isn't
interested. The point is that for *edge cases* the "correct answer" can
never be known until the query is actually run two different ways. And
the planner is *never* allowed to do that. So tuning the optimizer over
time is the only way to improve things, and with edge cases a factor of
two in timing is, statistically, an indication that the results are
close to optimal.
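
(For completeness, a sketch of what "running it two different ways" looks
like when done by hand, reusing the hypothetical table from above; EXPLAIN
ANALYZE executes the query and reports actual runtimes alongside the
estimates.)

  -- the plan the optimizer prefers
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

  -- force the alternative and compare the measured times
  SET enable_seqscan = off;
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
  SET enable_seqscan = on;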

We rarely get reports that the planner made the best choice for a plan,
but of course people usually don't consider optimal performance to be a
reportable problem ;)
                     - Thomas

