Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers

From: David Fetter
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-02 01:04:12
Msg-id: 20060602010412.GI10906@fetter.org
In response to: Re: More thoughts about planner's cost estimates (Greg Stark <gsstark@mit.edu>)
Responses: Re: More thoughts about planner's cost estimates (Greg Stark <gsstark@mit.edu>)
List: pgsql-hackers
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
> 
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > Greg, Tom,
> > 
> > > a) We already use block based sampling to reduce overhead.  If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your
> > > sample will be biased.
> > 
> > There are actually some really good equations to work with this,
> > estimating both row correlation and n-distinct based on sampling
> > complete blocks.  See prior discussions on this list on
> > N-distinct.
> 
> In the prior discussions someone posted the paper with the algorithm
> I mentioned.  That paper mentions that previous work showed poor
> results at estimating n_distinct even with sample sizes as large as
> 50% or more.

Which paper?  People have referenced several different ones.
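
For reference, the estimators those papers benchmark tend to look
like the sketch below.  This is the classic Chao (1984) lower bound
in Python, chosen purely for illustration (it is not the algorithm
from any particular paper in those threads), but it shows where the
trouble comes from: the whole correction hangs off the singleton and
doubleton counts, and a skewed column starves both.

    import random
    from collections import Counter

    def chao1_estimate(sample):
        """Chao (1984) lower-bound estimate of n_distinct from a
        sample, driven by singleton (f1) and doubleton (f2) counts."""
        freqs = Counter(Counter(sample).values())
        d  = sum(freqs.values())      # distinct values seen in sample
        f1 = freqs.get(1, 0)          # values seen exactly once
        f2 = freqs.get(2, 0)          # values seen exactly twice
        if f2 == 0:
            return d + f1 * (f1 - 1) / 2.0   # bias-corrected variant
        return d + f1 * f1 / (2.0 * f2)

    # Zipf-ish column: a few hot values plus a long tail of rare ones.
    column = [random.randint(1, 100) for _ in range(90000)] + \
             [random.randint(101, 50000) for _ in range(10000)]
    sample = random.sample(column, 5000)     # a 5% sample
    print("true n_distinct :", len(set(column)))
    print("naive from sample:", len(set(sample)))
    print("Chao1 estimate   :", round(chao1_estimate(sample)))

On a column like that, the naive per-sample count undershoots badly,
and even the corrected estimate, being a lower bound, closes only
part of the gap.  That is consistent with the pessimism quoted above.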

> > > 1) You have n^2 possible two-column combinations.  That's a lot
> > > of processing and storage.
> > 
> > Yes, that's the hard problem to solve.  Actually, btw, it's n!,
> > not n^2.
> 
> You have O(n^2) possible *two-column* combinations and O(n!)
> n-column combinations.  I was hoping two-column combinations would
> be enough information to extrapolate from for larger combinations.

The math nerd in me says that this is bad math, but it might work
"well enough" by ass-u-ming a lack of higher-order correlations.

> > > 2) It isn't even clear what data you're exactly looking for.
> > > Certainly "correlation" is just shorthand here and isn't what
> > > you're actually looking for. 
> > 
> > Actually, I'd think that a correlation number estimate (0 =
> > completely uncorrelated, 1 = completely correlated) would be
> > sufficient to improve row count estimation significantly, without
> > incurring the vast overhead of histogram x histogram manipulation.
> 
> No, correlation is actually quite uninteresting here.  Consider for
> example two columns "state" and "area code".  The "area code" is
> pretty much 100% dependent on state, but will show very little
> correlation.

There are quantitative tests of independence available, the
chi-squared test being the classic one.  I'm not sure whether any of
these has been applied, even theoretically, in DBMS-land.
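
For instance, Pearson's chi-squared test on the contingency table
catches exactly Greg's state/area-code case, where linear correlation
sees nothing.  A quick sketch (Python with numpy/scipy; the
state-to-code mapping here is invented):

    import random
    import numpy as np
    from scipy.stats import chi2_contingency

    random.seed(1)
    # Invented mapping: each "state" owns two "area codes", chosen so
    # the code numbers carry no linear signal about the state number.
    area_codes = {0: [212, 914], 1: [415, 707], 2: [305, 813],
                  3: [608, 507]}
    rows = [(s, random.choice(area_codes[s]))
            for s in (random.randint(0, 3) for _ in range(10000))]
    state = np.array([s for s, _ in rows])
    code = np.array([c for _, c in rows])

    # Linear correlation is near zero: codes are arbitrary labels.
    print("Pearson r:", np.corrcoef(state, code)[0, 1])

    # ...yet the chi-squared test finds total dependence.
    code_ix = {c: i for i, c in enumerate(sorted({c for _, c in rows}))}
    table = np.zeros((4, len(code_ix)))
    for s, c in rows:
        table[s, code_ix[c]] += 1
    chi2, p, dof, _ = chi2_contingency(table)
    print("chi2 = %.0f, dof = %d, p = %.3g" % (chi2, dof, p))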

> Hopefully you're right that you don't need complete histograms.
> Perhaps there's some statistics concept they don't teach in stats
> 101 that would cover this need.  What we're looking for is a
> function f(a,b,x) that gives the net selectivity given a and b, the
> selectivity of two clauses based on two columns, and x some simple
> value that can be easily calculated by looking at the data in
> advance.

That would be neat :)
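
Speculating, the simplest shape for such an f I can imagine would
just interpolate between the independence estimate a*b and the
fully-dependent estimate min(a, b), with x as the mixing weight.  A
naive sketch, not anything from the literature:

    def combined_selectivity(a, b, x):
        """Guess at f(a, b, x): x = 0 means the clauses are
        independent (multiply them); x = 1 means fully dependent
        (the more selective clause subsumes the other)."""
        assert 0.0 <= x <= 1.0
        return x * min(a, b) + (1.0 - x) * a * b

    # state = 'CA' (sel 0.05) AND area_code = '415' (sel 0.01):
    print(combined_selectivity(0.05, 0.01, 0.0))  # 0.0005, today's answer
    print(combined_selectivity(0.05, 0.01, 1.0))  # 0.01, code implies state

Whether a single scalar x per column pair can be precomputed cheaply
enough is, of course, exactly the open question.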

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666    Skype: davidfetter

Remember to vote!

