Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers

From: Greg Stark
Subject: Re: More thoughts about planner's cost estimates
Date:
Msg-id: 87lksgfl6n.fsf@stark.xeocode.com
In response to: Re: More thoughts about planner's cost estimates (Josh Berkus <josh@agliodbs.com>)
Responses: Re: More thoughts about planner's cost estimates (David Fetter <david@fetter.org>)
List: pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> Greg, Tom,
> 
> > a) We already use block based sampling to reduce overhead. If you're
> > talking about using the entire block and not just randomly sampled
> > tuples from within those blocks then your sample will be biased.
> 
> There are actually some really good equations to work with this, estimating 
> both row correlation and n-distinct based on sampling complete blocks.  
> See prior discussions on this list on N-distinct.

In the prior discussions someone posted the paper with the algorithm I
mentioned. That paper notes that previous work showed poor results at
estimating n_distinct even with sample sizes as large as 50% of the table or more.
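
For reference, if I'm reading analyze.c right, the sample-based estimator we
use today is the Haas-Stokes Duj1 estimator. A toy version (invented data,
purely for illustration) makes it easy to see how badly it can behave on a
skewed column even with a fairly large sample:

    import random
    from collections import Counter

    random.seed(0)

    # Invented skewed column: a few very common values plus a long tail.
    N = 1_000_000
    population = [random.choice(["a", "b", "c"]) if random.random() < 0.9
                  else f"rare-{random.randrange(50_000)}" for _ in range(N)]

    def duj1(sample, total_rows):
        """Haas-Stokes Duj1: n*d / (n - f1 + f1*n/N), where d is the number
        of distinct values in the sample and f1 the number seen exactly once."""
        n = len(sample)
        counts = Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / total_rows)

    true_d = len(set(population))
    for sample_size in (30_000, 300_000):   # 3% and 30% of the table
        sample = random.sample(population, sample_size)
        print(sample_size, "->", round(duj1(sample, N)), "estimated vs", true_d, "actual")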

> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
> 
> Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

You have O(n^2) possible *two-column* combinations and O(n!) n-column
combinations. I was hoping two-column statistics would give enough information
to extrapolate to the larger combinations.
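
Just to put rough numbers on the storage side (the column counts here are
picked arbitrarily for illustration), the unordered two-column combinations
alone grow quadratically:

    from math import comb

    # C(n, 2) = n * (n - 1) / 2 column pairs for a table with n columns.
    for n in (10, 50, 100):
        print(f"{n:>3} columns -> {comb(n, 2):>4} column pairs to keep cross-column stats for")

    #  10 columns ->   45 column pairs to keep cross-column stats for
    #  50 columns -> 1225 column pairs to keep cross-column stats for
    # 100 columns -> 4950 column pairs to keep cross-column stats for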

> > 2) It isn't even clear what data you're exactly looking for. Certainly
> >    "correlation" is just shorthand here and isn't what you're actually
> > looking for. 
> 
> Actually, I'd think that a correlation number estimate (0 = completely 
> uncorrelated, 1 = completely correlated) would be sufficient to improve 
> row count estimation significantly, without incurring the vast overhead of 
> histogram x histogram manipulation.

No, correlation is actually quite uninteresting here. Consider for example two
columns "state" and "area code". The "area code" is pretty much 100% dependent
on the state, but the pair will show very little correlation. Similarly,
things like "product name" and "catalog number", or even "author" and "genre",
would be problem cases but show little correlation. And you can easily imagine
queries that put restrictive clauses on two such columns, for which the
existing statistics will overestimate how selective the combination is (and so
underestimate the row count), because they assume the clauses are independent.
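
Here's a small sketch of that (the data and the numeric encoding of "state"
are invented just for illustration): a column pair with a total functional
dependency, little linear correlation, and a combined selectivity that the
independence assumption gets badly wrong:

    import random
    from statistics import correlation  # Python 3.10+

    random.seed(0)

    # Invented mapping: each state owns a disjoint set of area codes,
    # so the area code determines the state completely.
    area_codes = {"CA": [213, 415, 619], "NY": [212, 315, 718], "TX": [254, 361, 512]}
    rows = [(s, random.choice(area_codes[s]))
            for s in random.choices(list(area_codes), k=10_000)]

    # Pearson correlation between a numeric encoding of "state" and the
    # area code stays small even though the dependency is total.
    state_num = {"CA": 0, "NY": 1, "TX": 2}
    print("correlation:", round(correlation([state_num[s] for s, _ in rows],
                                            [a for _, a in rows]), 3))

    # Selectivity of state = 'CA' AND area_code = 415: multiplying the
    # per-column selectivities lands well below the actual matching fraction.
    sel_state = sum(s == "CA" for s, _ in rows) / len(rows)
    sel_area = sum(a == 415 for _, a in rows) / len(rows)
    actual = sum(s == "CA" and a == 415 for s, a in rows) / len(rows)
    print("independence estimate:", round(sel_state * sel_area, 4))
    print("actual selectivity:   ", round(actual, 4))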

Hopefully you're right that you don't need complete histograms. Perhaps
there's some statistics concept they don't teach in stats 101 that would cover
this need. What we're looking for is a function f(a,b,x) that gives the net
selectivity of the two clauses combined, where a and b are the individual
selectivities of the clauses on the two columns and x is some simple value
that can be calculated cheaply by looking at the data in advance.
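
For what it's worth, one shape such a function could take (this is only a
sketch of one possibility, not something I can back with any theory) is a
single per-pair "dependency degree" x that interpolates between the
independent case and the fully dependent one:

    def combined_selectivity(a: float, b: float, x: float) -> float:
        """Sketch of f(a, b, x): x = 0 treats the clauses as independent
        (selectivity a * b); x = 1 means one clause implies the other
        (selectivity min(a, b)); values in between interpolate linearly."""
        return x * min(a, b) + (1 - x) * a * b

    # With the state / area-code example: a = P(state = 'CA'), b = P(area = 415).
    a, b = 1 / 3, 1 / 9
    print(combined_selectivity(a, b, 0.0))   # independence: ~0.037
    print(combined_selectivity(a, b, 1.0))   # total dependency: ~0.111

The open question would still be whether a single x per column pair can be
computed cheaply at ANALYZE time, and whether it degrades gracefully when the
dependency only holds over part of the domain.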

-- 
greg


