Re: More tablescanning fun - Mailing list pgsql-performance

From Tom Lane
Subject Re: More tablescanning fun
Date
Msg-id 13549.1051248190@sss.pgh.pa.us
Whole thread Raw
In response to Re: More tablescanning fun  ("Jim C. Nasby" <jim@nasby.net>)
Responses Re: More tablescanning fun  ("Jim C. Nasby" <jim@nasby.net>)
List pgsql-performance
"Jim C. Nasby" <jim@nasby.net> writes:
> On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
>> There's been some previous debate about the equation used to correct
>> for correlation, which is certainly bogus (I picked it more or less
>> out of the air ;-)).  But so far no one has proposed a replacement
>> equation with any better foundation ... take a look in
>> src/backend/optimizer/path/costsize.c if you want to get involved.

> Are you reffering to the PF formula?

The PF formula is good as far as I know, but it assumes an uncorrelated
table order.  The debate is how to correct it for nonzero correlation.
Specifically, this bit:

     * When the index ordering is exactly correlated with the table ordering
     * (just after a CLUSTER, for example), the number of pages fetched should
     * be just sT.  What's more, these will be sequential fetches, not the
     * random fetches that occur in the uncorrelated case.  So, depending on
     * the extent of correlation, we should estimate the actual I/O cost
     * somewhere between s * T * 1.0 and PF * random_cost.  We currently
     * interpolate linearly between these two endpoints based on the
     * correlation squared (XXX is that appropriate?).

I believe the endpoints s*T and PF*random_cost, I think, but the curve
between them is anyone's guess.  It's also quite possible that the
correlation stat that we currently compute is inadequate to model what's
going on.

>> No.  It's not apparent to me how you could do that without abandoning
>> MVCC, which we're not likely to do.

> Hmm... does MVCC mandate inserts go at the end?

Anywhere that there's free space.  The point is that you can't promise
updates will fit on the same page as the original tuple.  So whatever
desirable physical ordering you may have started with will surely
degrade over time.

> On the other hand, it might be possible to get the advantages of a
> clustered index without doing a *true* clustered index. The real point
> is to be able to use indexes; I've heard things like 'if you need to
> access more than 10% of a table then using an index would be
> disasterous', and that's not good... that number should really be over
> 50% for most reasonable ratios of fields indexed to fields in table (of
> course field size plays a factor).

If you have to read 50% of a table, you certainly should be doing a
linear scan.  There will be hardly any pages you can skip (unless the
table is improbably well clustered), and the extra I/O needed to read
the index will buy you nothing.

            regards, tom lane


pgsql-performance by date:

Previous
From: "Jim C. Nasby"
Date:
Subject: Re: More tablescanning fun
Next
From: Hannu Krosing
Date:
Subject: Re: Important speed difference between a query and a