From Jim C. Nasby
Subject Re: More thoughts about planner's cost estimates
Date
Msg-id 20060601230948.GW53487@pervasive.com
In response to Re: More thoughts about planner's cost estimates  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
> 
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > 1. n-distinct estimation is bad, as previously discussed;
> > 
> > 2. our current heuristics sampling methods prevent us from sampling more than
> > 0.5% of any reasonably large table, causing all statistics on those tables to
> > be off for any table with irregular distribution of values;
> 
> I'm convinced these two are more connected than you believe. For distributions
> of values in a range, using small samples is on a solid statistical basis. It's
> the same reason poll results based on only a few hundred people can accurately
> predict elections in a country with hundreds of millions of voters.
> 
> However for things like estimating discrete values there's absolutely no solid
> foundation for these small samples. Your accuracy is going to be simply
> proportional to the sample size, meaning you can't get anything trustworthy
> without sampling large enough portions of the table that a full sequential
> scan would be just as fast.
> 
> I might be interested in implementing that algorithm that was posted a while
> back that involved generating good unbiased samples of discrete values. The
> algorithm was quite clever and well described, and the paper claimed
> impressively good results.
> 
> However it will only make sense if people are willing to accept that analyze
> will need a full table scan -- at least for tables where the DBA knows that
> good n_distinct estimates are necessary.
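
To put a number on Greg's point, here's a quick toy simulation (purely
illustrative; the distribution and the sample fractions are invented,
and this is not the algorithm from the paper he mentions). With a long
tail of rare values, a tiny sample simply never sees most of the
distinct values:

# Toy illustration only: a skewed "column" with a few common values
# and a long tail of values that each appear exactly once.
import random

random.seed(42)
table = [random.randint(1, 100) for _ in range(900_000)]  # common values
table += list(range(101, 100_101))                        # 100,000 rare ones
true_ndistinct = len(set(table))                          # 100,100

for frac in (0.005, 0.05, 0.5):
    sample = random.sample(table, int(len(table) * frac))
    seen = len(set(sample))
    print(f"{frac:5.1%} sample sees {seen:6d} of "
          f"{true_ndistinct} distinct values")
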
There *might* be an alternative....

Suppose that the executor could keep track of what values it's seeing
from a table as it's actually running a query. Those values could be
used to build statistical information without actually running ANALYZE.

Of course, the problem is that you could end up with some seriously
skewed statistics, depending on what your query patterns are. But in
some ways that's not bad; you're optimizing for the queries you most
often run.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the ones gathered by the traditional ANALYZE method.
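
Very roughly, I'm picturing bookkeeping along these lines (a made-up
sketch, not actual executor code; the class name, the tracking cap, and
the 50/50 blend weight are all invented):

import time
from collections import defaultdict

class ColumnFeedback:
    """Per-column bookkeeping fed by the executor as queries run."""

    def __init__(self, max_tracked=1000):
        self.max_tracked = max_tracked   # cap memory, like a stats target
        self.counts = defaultdict(int)   # value -> times seen
        self.last_seen = {}              # value -> unix timestamp

    def observe(self, value):
        # Only start tracking new values while we're under the cap.
        if value in self.counts or len(self.counts) < self.max_tracked:
            self.counts[value] += 1
            self.last_seen[value] = time.time()

    def frequencies(self, n=10):
        # Frequencies as actually seen by queries -- which is exactly
        # where the query-pattern skew comes from.
        total = sum(self.counts.values()) or 1
        top = sorted(self.counts.items(), key=lambda kv: kv[1],
                     reverse=True)[:n]
        return [(value, count / total) for value, count in top]

def blend(analyze_freq, feedback_freq, weight=0.5):
    # Naive mix of an ANALYZE-derived frequency with executor feedback;
    # picking the weight sensibly is the hard part.
    if feedback_freq is None:
        return analyze_freq
    return (1 - weight) * analyze_freq + weight * feedback_freq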

> > 3. We don't have any method to analyze inter-column correlation within a table;
> > 
> > 4. We don't keep statistics on foreign key correlation;
> 
> Gosh these would be nice but they sound like hard problems. Has anybody even
> made any headway in brainstorming how to tackle them?
A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in this context, but
others (such as correlation) would.
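
As a toy example (invented data, nothing to do with the actual planner
code): with correlated columns like city and zip code, the per-column
independence assumption overestimates the number of distinct
combinations by about 50x, which is exactly the kind of thing stats on
the composite index key would catch:

import random

random.seed(1)

# 50 cities, each with 20 zip codes that belong only to that city.
pairs = [(f"city_{c}", f"zip_{c:02d}{z:02d}")
         for c in range(50) for z in range(20)]
rows = [random.choice(pairs) for _ in range(100_000)]

nd_city = len({city for city, _ in rows})   # 50
nd_zip = len({z for _, z in rows})          # 1,000
nd_actual = len(set(rows))                  # 1,000: zip implies city
nd_assumed = nd_city * nd_zip               # 50,000 under independence

print(nd_city, nd_zip, nd_actual, nd_assumed)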
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

