Re: new correlation metric - Mailing list pgsql-hackers

From Tom Lane
Subject Re: new correlation metric
Date
Msg-id 11984.1225039446@sss.pgh.pa.us
Whole thread Raw
In response to Re: new correlation metric  (Martijn van Oosterhout <kleptog@svana.org>)
Responses Re: new correlation metric
List pgsql-hackers
Martijn van Oosterhout <kleptog@svana.org> writes:
> I think the code is in the right direction, but I think want you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunatly I don't see
> a way of calculating it other than a full simulation.

I wonder whether we ought to rethink the problem entirely.  In
particular, the notion of associating correlation stats with particular
columns doesn't seem amazingly useful when you get right down to it.
We're wasting our time calculating the correlation of a column that has
no index; and if the column is part of a multicolumn index it's not that
easy to figure out what the index correlation is from the per-column
numbers.  Not to mention functional and partial indexes.

So it occurs to me that maybe we should forget about per-column
correlations altogether, and instead try directly to calculate *per
index* correlations.  You could do this now by doing an index-only scan
and looking at the series of tuple block numbers that come back.
However, there's no obvious way to do any sampling in that approach
--- you can't read random subsets of the index without cooperation from
the index AM.

So the direction I'm headed in is to imagine that we should add an
"analyze" entry point to the index AM API, with the definition that it
will be called once per index per ANALYZE, and is in charge of putting
useful stats into someplace or other.  We might need to invent some
other catalog besides pg_statistic if we want to represent per-index
properties like correlation.  A minimal solution would be to add a
correlation column to pg_class or pg_index, but that doesn't scale well
if you imagine that different index AMs might want different sorts of
stats.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: new correlation metric
Next
From: Jeff Davis
Date:
Subject: array_agg and array_accum (patch)