Re: Index Scans become Seq Scans after VACUUM ANALYSE - Mailing list pgsql-hackers

From: Dann Corbit
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date:
Msg-id: D90A5A6C612A39408103E6ECDD77B82920CD6B@voyager.corporate.connx.com
In response to: Index Scans become Seq Scans after VACUUM ANALYSE (Louis-David Mitterrand <vindex@apartia.org>)
List: pgsql-hackers
-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Wednesday, April 17, 2002 2:39 PM
To: mlw
Cc: Andrew Sullivan; PostgreSQL-development; Tom Lane
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

mlw wrote:
> Bruce Momjian wrote:
> >
> > mlw wrote:
> > > Now, given the choice of the two strategies on a table, both pretty
> > > close to one another, the risk of poor performance for using the
> > > index scan is minimal based on the statistics, but the risk of poor
> > > performance for using the sequential scan is quite high on a large
> > > table.
>
> > My second point, that index scan is more risky than sequential scan,
> > is outlined above.  A sequential scan reads each page once, and uses
> > the file system read-ahead code to prefetch the disk buffers.  Index
> > scans are random, and could easily re-read disk pages to plow through
> > a significant portion of the table, and because the reads are random,
> > the file system will not prefetch the rows so the index scan will
> > have to wait for each non-cache-resident row to come in from disk.
>
> That is a very interesting point, but shouldn't that be factored into
> the cost (random_tuple_cost?)  In which case my point still stands.

Yes, I see your point.  I think on the high end that index scans can get
very expensive if you start to get lots of cache misses and have to wait
for I/O.  I know the random cost is 4, but I think that number is not
linear.  It can be much higher with lots of cache misses and waiting for
I/O, and I think that is why it feels more risky to do an index scan on a
sample size that is not perfectly known.

Actually, you pretty much can know the sequential scan cost because you
know the number of blocks in the table.  It is the index scan that is the
bigger unknown, because you don't know how many index lookups you will
need, or how well they will stay in the cache.

Does that help?  Wow, this _is_ confusing.  I am still looking for that
holy grail that will allow this all to be codified so others can learn
from it and we don't have to rehash this repeatedly, but frankly, this
whole discussion is covering new ground that we haven't covered yet.

(Maybe TODO.detail this discussion and point to it from the FAQ.)
>>
General rules of thumb (don't know if they apply to postgres or not):

Index scans become increasingly costly when the data contains only a few
distinct values.  For instance, an index on a single bit column makes for
a very expensive scan.  Once a query touches more than about 5% of the
data, it is usually cheaper to scan the whole table without using the
index.
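
To make that break-even point concrete, here is a minimal sketch (written
as Python pseudo-code) of the kind of arithmetic involved.  The constants,
the function names, and the assumption that every matching row costs one
random page fetch are my own simplifications, not the actual planner cost
model:

# Crude break-even model for sequential scan vs. index scan.
# All constants and assumptions here are illustrative, not PostgreSQL's.

def seq_scan_cost(table_pages, seq_page_cost=1.0):
    # A sequential scan reads every heap page exactly once.
    return table_pages * seq_page_cost

def index_scan_cost(matching_rows, index_pages=10, random_page_cost=4.0):
    # Pessimistic model: each matching row may force a random heap fetch,
    # plus the random reads needed to descend and walk the index itself.
    return (matching_rows + index_pages) * random_page_cost

def index_scan_wins(table_pages, rows_per_page, selectivity):
    matching = table_pages * rows_per_page * selectivity
    return index_scan_cost(matching) < seq_scan_cost(table_pages)

if __name__ == "__main__":
    # 10,000-page table, 100 rows per page: under this model the index
    # scan stops paying off well before 5% of the table is touched.
    for pct in (0.1, 1, 5, 10):
        print(f"{pct}% of rows -> index scan wins: "
              f"{index_scan_wins(10_000, 100, pct / 100)}")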

Now, if you have a clustered index (where the data is *physically*
ordered by the index order) you can use index scans much more cheaply
than when the data is not ordered in this way.

A golden rule of thumb when accessing data in a relational database
is to use the unique clustered index whenever possible (even if it
is not the primary key).
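
The effect of physical ordering is easy to sketch in the same spirit; the
numbers and names below are again purely illustrative, but they show why
clustered matches collapse many row fetches into few page fetches:

# Illustrative only: how many heap pages must be fetched for the same
# number of matching rows, clustered vs. not clustered.

def heap_pages_fetched(matching_rows, rows_per_page, clustered):
    if clustered:
        # Matches are physically adjacent, so roughly one page is read
        # per rows_per_page matching rows (ceiling division).
        return -(-matching_rows // rows_per_page)
    # Worst case for an unclustered index: every match can land on a
    # different page, fetched in random order.
    return matching_rows

matching, rows_per_page = 50_000, 100
print("clustered:    ", heap_pages_fetched(matching, rows_per_page, True))   # 500
print("not clustered:", heap_pages_fetched(matching, rows_per_page, False))  # 50000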

These decisions are always heuristic in nature.  If a table is small
it may be cheapest to load the whole table into memory and sort it.

If the vacuum command could gather statistics about the data (some RDBMS
systems do this), then you could look at that statistical data and make
much smarter choices about how to use the index relations.  The
statistical information saved could be as simple as the min, max, mean,
median, mode, and standard deviation, or it might also include quartiles
or deciles, or some other measure to show even more about the actual
distribution.  You could save what is essentially a binned histogram of
the data that is present in the table.  A bit of imagination will quickly
show how useful this could be (some commercial database systems actually
do this).
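
As a sketch of what such a binned histogram might look like and how a
planner could use it, here is some illustrative Python; the bin count, the
equal-height binning, and the range-predicate estimate are assumptions of
mine, not a description of any existing ANALYZE implementation:

# Build an equal-height histogram of a column's values and use it to
# estimate the selectivity of a range predicate.  Illustrative only.

import random
from bisect import bisect_right

def build_histogram(values, bins=10):
    ordered = sorted(values)
    step = len(ordered) / bins
    # Keep one boundary value per bin (roughly the deciles for bins=10).
    return [ordered[min(int(step * i), len(ordered) - 1)]
            for i in range(1, bins + 1)]

def estimate_fraction_below(boundaries, upper):
    # Fraction of rows with value <= upper, resolved to bin granularity.
    return bisect_right(boundaries, upper) / len(boundaries)

data = [random.gauss(100, 15) for _ in range(10_000)]
hist = build_histogram(data)
print("estimated fraction <= 110:", estimate_fraction_below(hist, 110))
print("actual fraction    <= 110:", sum(v <= 110 for v in data) / len(data))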

Another notion is to super optimize some queries.  By this, I mean
that if someone says that a particular query is very important, it
might be worthwhile to actually try a dozen or so different plans
(of the potentially billions that are possible) against the query
and store the best one.  They could also run the super optimize
feature again later or even automatically if the vacuum operation
detects that the data distributions or cardinality have changed in
some significant manner.
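
A hedged sketch of how that could be wired up (the plan objects, the
timing harness, and the statistics fingerprint below are hypothetical
scaffolding for the idea, not an existing facility in any particular
system):

# Try each candidate plan once, keep the cheapest, and remember enough
# about the table statistics to know when the choice has gone stale.

import time

def pick_best_plan(candidate_plans, execute):
    timings = {}
    for plan in candidate_plans:
        start = time.perf_counter()
        execute(plan)                     # run the query using this plan
        timings[plan] = time.perf_counter() - start
    return min(timings, key=timings.get)

class StoredPlan:
    def __init__(self, plan, stats_fingerprint):
        self.plan = plan
        # e.g. row count, number of distinct values, histogram shape
        self.stats_fingerprint = stats_fingerprint

    def still_valid(self, current_fingerprint):
        # Re-run the super optimizer when vacuum/analyze shows the data
        # distribution or cardinality has shifted.
        return self.stats_fingerprint == current_fingerprint

if __name__ == "__main__":
    # Toy demonstration: the "plans" are just sleeps standing in for
    # genuinely different execution strategies.
    plans = {"nested loop": 0.03, "hash join": 0.01, "merge join": 0.02}
    best = pick_best_plan(list(plans), lambda p: time.sleep(plans[p]))
    print("cheapest plan:", best)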

Better yet, let them store the plan and hand edit it, if need be.
Rdb is an example of a commercial database that allows this.
A maintenance nightmare when dimwits do it, of course, but such is
life.
<<

