Adding -performance back in.
On Tue, Oct 03, 2006 at 05:10:04PM -0700, Ron Mayer wrote:
> Jim C. Nasby wrote:
> >
> > Index scans are also pretty picky about correlation. If you have really
> > low correlation you don't want to index scan,
>
> I'm still don't think "correlation" is the right metric
> at all for making this decision.
>
> If you have a list of addresses clustered by "zip"
> the "correlation" of State, City, County, etc will all be zero (since
> the zip codes don't match the alphabetical order of state or city names)
> but index scans are still big wins because the data for any given
> state or city will be packed on the same few pages - and in fact
> the pages could be read mostly sequentially.
That's a good point that I don't think has been considered before. I
think correlation is still somewhat important, but what's probably far
more important is data localization.
One possible way to calculate this would be to note the location of
every tuple with a given value in the heap. Calculate the geometric mean
of those locations (I think you could essentially average all the
ctids), and divide that by the average distance of each tuple from that
mean (or maybe the reciprocal of that would be more logical).
Obviously we don't want to scan the whole table to do that, but there
should be some way to do it via sampling as well.
Or perhaps someone knows of a research paper with real data on how to do
this instead of hand-waving. :)
> > but I think our current
> > estimates make it too eager to switch to a seqscan.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net