
From: Kevin Brown
Subject: Re: Column correlation drifts, index ignored again
Date:
Msg-id: 20040224201419.GD3090@filer
In response to: Re: Column correlation drifts, index ignored again  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Kevin,
> >> 1.  set enable_seqscan = on
> >> 2.  set random_page_cost = <some really high value to force seqscans>
> >> 3.  EXPLAIN ANALYZE query
> >> 4.  record the ratio of estimated to actual scan times.
> >> 5.  set enable_seqscan = off
> >> 6.  set random_page_cost = <rough estimate of what it should be>
> >> 7.  EXPLAIN ANALYZE query
> >> 8.  record the actual index scan time(s)
> >> 9.  tweak random_page_cost
> >> 10.  EXPLAIN query
> >> 11.  If ratio of estimate to actual (recorded in step 8) is much
> >> different than that recorded in step 4, then go back to step 9.
> >> Reduce random_page_cost if the random ratio is larger than the
> >> sequential ratio, increase if it's smaller.
>
> > Nice, we ought to post that somewhere people can find it in the future.
>
> If we post it as recommended procedure we had better put big caveat
> notices on it.  The pitfalls with doing this are:
>
> 1. If you repeat the sequence exactly as given, you will be homing in on
> a RANDOM_PAGE_COST that describes your system's behavior with a fully
> cached query.  It is to be expected that you will end up with 1.0 or
> something very close to it.  The only way to avoid that is to use a
> query that is large enough to blow out your kernel's RAM cache; which of
> course will take long enough that iterating step 10 will be no fun,
> and people will be mighty tempted to take shortcuts.

Oops.  You're right.  I did this on my system, but forgot to put it in
the list of things to do:

0.  Fill the page cache with something other than PG data, e.g. by
    repeatedly catting several large files and redirecting the output to
    /dev/null.  The sum total size of the files should exceed the amount
    of memory on the system.
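
To make the whole sequence concrete, here's roughly what it looks like in
a psql session (the table, column and constant are made up, and the
random_page_cost values are only placeholders):

    -- step 0 happens at the shell: cat files whose total size exceeds
    -- RAM to /dev/null, so the kernel cache holds non-PG data
    set enable_seqscan = on;
    set random_page_cost = 100;     -- absurdly high, to get the seqscan plan
    EXPLAIN ANALYZE SELECT * FROM bigtable WHERE somecol = 42;
    -- record the ratio of estimated to actual time for the seq scan

    set enable_seqscan = off;
    set random_page_cost = 3;       -- rough first guess
    EXPLAIN ANALYZE SELECT * FROM bigtable WHERE somecol = 42;
    -- record the actual index scan time, then iterate with plain EXPLAIN,
    -- adjusting random_page_cost until the index scan's estimate/actual
    -- ratio roughly matches the seq scan's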

The reason you might not have to do this between the two EXPLAIN ANALYZE
queries is that the first query will scan the table itself while the second
one will scan the index.  But that was probably specific to the query I was
testing.  If the one you're doing is complex enough that the system has to
read data pages from the table after fetching the index pages, you'll want
to fill the page cache between the queries as well.

> 2. Of course, you are computing a RANDOM_PAGE_COST that is relevant to
> just this single query.  Prudence would suggest repeating the process
> with several different queries and taking some sort of average.

Right.  And the average should probably be weighted by the relative
frequency with which each query is executed.
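
Just to make the arithmetic concrete (the weights and per-query values
below are invented): if three queries account for 70%, 20% and 10% of the
load and individually calibrate to 2.2, 3.0 and 4.5, the weighted average
comes out to

    SELECT (0.7 * 2.2 + 0.2 * 3.0 + 0.1 * 4.5)
         / (0.7 + 0.2 + 0.1) AS weighted_random_page_cost;   -- about 2.6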

In my case, the query I was experimenting with was by far the biggest
query that occurs on my system (though it turns out that there are
others in that same process that I should look at as well).

> When I did the experiments that led up to choosing 4.0 as the default,
> some years ago, it took several days of thrashing the disks on a couple
> of different machines before I had numbers that I didn't think were
> mostly noise :-(.  I am *real* suspicious of any replacement numbers
> that have been derived in just a few minutes.

One problem I've been running into is the merge join spilling to disk
because sort_mem isn't big enough.  The problem isn't that this is
happening, it's that I think the planner is underestimating the impact
that doing this will have on the time the merge join takes.  Does the
planner even account for the possibility that a sort or join will spill
to disk?  Spilling to disk like that will suddenly cause sequential
reads to perform much more like random reads, unless the sequential
scans are performed in their entirety between sorts/merges.
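
A crude way to see how much the spill is costing is to raise sort_mem for
one session and compare the merge join's actual time in EXPLAIN ANALYZE
(the tables below are just stand-ins):

    show sort_mem;                      -- current value, in kilobytes
    EXPLAIN ANALYZE SELECT a.*, b.*
        FROM big_a a JOIN big_b b ON a.id = b.id;
    set sort_mem = 65536;               -- 64MB, for this session only
    EXPLAIN ANALYZE SELECT a.*, b.*
        FROM big_a a JOIN big_b b ON a.id = b.id;
    -- if the merge join's actual time drops sharply, the sort was spilling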


In any case, one thing that none of this really accounts for is that
it's better to set random_page_cost too low than too high.  The reason is
that an index scan only touches the data the query actually needs: a
sequential scan reads the entire table every time, whereas an index scan
reads only the index pages (and their parents) that match the query, plus
the table pages they point to.  My experience is that when the planner
gets the selectivity of a query wrong (e.g., because the statistics aren't
good enough or up to date), it usually overestimates the number of rows
the query will return, and thus picks a sequential scan when an index
scan would be more efficient.

The auto vacuum daemon helps in this regard, by keeping the statistics
more up-to-date.
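
If the estimates are off because of thin statistics, raising the
statistics target on the relevant column(s) and re-analyzing can help as
well (table and column names below are made up):

    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;  -- default target is 10
    ANALYZE orders;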

Certainly you shouldn't go overboard and set random_page_cost artificially
low "just in case", but it does mean that if you run tests to determine
the proper value, you should probably pick something in the lower part of
the range of values you end up with.
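
So if, say, your tests put the plausible range at 2 to 3 (numbers
invented), err toward the low end:

    set random_page_cost = 2.2;     -- try it out in the current session
    -- once satisfied, put random_page_cost = 2.2 in postgresql.conf so it
    -- applies to every backend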


--
Kevin Brown                          kevin@sysexperts.com
