Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers
From: Jim C. Nasby
Subject: Re: More thoughts about planner's cost estimates
Date:
Msg-id: 20060601230948.GW53487@pervasive.com
In response to: Re: More thoughts about planner's cost estimates (Greg Stark <gsstark@mit.edu>)
List: pgsql-hackers
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > 1. n-distinct estimation is bad, as previously discussed;
> >
> > 2. our current heuristic sampling methods prevent us from sampling more
> > than 0.5% of any reasonably large table, causing all statistics on those
> > tables to be off for any table with an irregular distribution of values;
>
> I'm convinced these two are more connected than you believe. For
> distributions of values in a range, using small samples is on a solid
> statistical basis. It's the same reason poll results based on only a few
> hundred people can accurately predict elections in a country with hundreds
> of millions of voters.
>
> However, for things like estimating discrete values there's absolutely no
> solid foundation for these small samples. Your accuracy is going to be
> simply proportional to the sample size, meaning you can't get anything
> trustworthy without sampling large enough portions of the table that a
> full sequential scan would be just as fast.
>
> I might be interested in implementing that algorithm that was posted a
> while back that involved generating good unbiased samples of discrete
> values. The algorithm was quite clever and well described, and the paper
> claimed impressively good results.
>
> However, it will only make sense if people are willing to accept that
> analyze will need a full table scan -- at least for tables where the DBA
> knows that good n_distinct estimates are necessary.

There *might* be an alternative...

Suppose the executor could keep track of what values it's seeing from a
table as it's actually running a query. Those values could be used to build
statistical information without actually running analyze. (A quick
simulation of the sampling problem Greg describes, and a toy sketch of the
bookkeeping I have in mind, are at the end of this mail.)

Of course, the problem is that you could end up with some seriously skewed
statistics, depending on what your query patterns are. But in some ways
that's not bad: you're optimizing for the queries you run most often.

One possible way to handle this would be to keep track of how many times
each value has been seen, as well as when it was last seen, so you have
some idea of the quality of the data. Another idea is to somehow blend
these stats with the ones gathered by the traditional analyze method.

> > 3. We don't have any method to analyze inter-column correlation within
> > a table;
> >
> > 4. We don't keep statistics on foreign key correlation;
>
> Gosh, these would be nice, but they sound like hard problems. Has anybody
> even made any headway in brainstorming how to tackle them?

A *huge* improvement would be gathering statistics on all multi-column
indexes. Some of the stats might not make sense in that context, but others
(such as correlation) would.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
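
To put a rough number on Greg's point about sample size, here's a standalone
toy C program (not backend code; the row counts, value space, and skewed
distribution are all made up for illustration). It builds a million-row
"table" where most rows share a handful of values, takes a 0.5% random
sample roughly the way analyze would, and compares distinct counts. On my
box the sample's distinct count is a few hundred against tens of thousands
in the real table, and no fixed scaling factor rescues that for an
arbitrary distribution -- which is the untrustworthiness Greg is talking
about. It uses random()/srandom(), so build it on a POSIX-ish system.

    /*
     * Toy demonstration: distinct counts from a 0.5% sample of a skewed
     * table bear little relation to the table's true n_distinct.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_ROWS   1000000
    #define SAMPLE_ROWS  5000       /* 0.5% of the table */
    #define VALUE_SPACE  50000      /* possible distinct values */

    static int
    count_distinct(const int *vals, int n, char *seen)
    {
        int     i;
        int     distinct = 0;

        for (i = 0; i < VALUE_SPACE; i++)
            seen[i] = 0;
        for (i = 0; i < n; i++)
        {
            if (!seen[vals[i]])
            {
                seen[vals[i]] = 1;
                distinct++;
            }
        }
        return distinct;
    }

    int
    main(void)
    {
        int    *table = malloc(sizeof(int) * TABLE_ROWS);
        int    *sample = malloc(sizeof(int) * SAMPLE_ROWS);
        char   *seen = malloc(VALUE_SPACE);
        int     i;

        if (table == NULL || sample == NULL || seen == NULL)
            return 1;

        srandom(42);

        /* skewed distribution: 90% of rows use only 100 values */
        for (i = 0; i < TABLE_ROWS; i++)
        {
            if (random() % 100 < 90)
                table[i] = (int) (random() % 100);
            else
                table[i] = (int) (random() % VALUE_SPACE);
        }

        /* uniform random sample of the rows */
        for (i = 0; i < SAMPLE_ROWS; i++)
            sample[i] = table[random() % TABLE_ROWS];

        printf("distinct values in the table:       %d\n",
               count_distinct(table, TABLE_ROWS, seen));
        printf("distinct values in the 0.5%% sample: %d\n",
               count_distinct(sample, SAMPLE_ROWS, seen));

        free(table);
        free(sample);
        free(seen);
        return 0;
    }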
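
And here's a minimal sketch of the executor-side bookkeeping I'm imagining:
a small per-column table that records each value seen, how many times, and
when it was last seen. This is a standalone toy in C, not a proposed patch;
a real implementation would need Datum handling, eviction when the table
fills, shared-memory or stats-collector plumbing, and a way to fold the
results into pg_statistic, none of which is shown here.

    /*
     * Toy sketch of executor-side value tracking: count how often each
     * value has been seen and when it was last seen, so there's some
     * basis for judging how stale or query-skewed the numbers are.
     */
    #include <stdio.h>
    #include <time.h>

    #define TRACK_SLOTS 1024    /* fixed-size table just for the sketch */

    typedef struct SeenValue
    {
        int     value;          /* stand-in for a real column value */
        long    times_seen;     /* how often the executor has seen it */
        time_t  last_seen;      /* rough freshness indicator */
        int     in_use;
    } SeenValue;

    static SeenValue track_table[TRACK_SLOTS];

    static void
    track_value(int value)
    {
        int     base = ((unsigned int) value) % TRACK_SLOTS;
        int     i;

        /* open addressing with linear probing */
        for (i = 0; i < TRACK_SLOTS; i++)
        {
            SeenValue  *sv = &track_table[(base + i) % TRACK_SLOTS];

            if (!sv->in_use)
            {
                sv->in_use = 1;
                sv->value = value;
                sv->times_seen = 1;
                sv->last_seen = time(NULL);
                return;
            }
            if (sv->value == value)
            {
                sv->times_seen++;
                sv->last_seen = time(NULL);
                return;
            }
        }
        /* table full: silently drop the value; a real version would evict */
    }

    int
    main(void)
    {
        /* pretend these are values flowing past a scan node */
        int     stream[] = {1, 5, 5, 9, 1, 5, 42};
        int     n = (int) (sizeof(stream) / sizeof(stream[0]));
        int     i;
        int     distinct = 0;
        SeenValue *best = NULL;

        for (i = 0; i < n; i++)
            track_value(stream[i]);

        /* summarize what we've learned so far */
        for (i = 0; i < TRACK_SLOTS; i++)
        {
            if (!track_table[i].in_use)
                continue;
            distinct++;
            if (best == NULL || track_table[i].times_seen > best->times_seen)
                best = &track_table[i];
        }

        printf("distinct values seen so far: %d\n", distinct);
        if (best != NULL)
            printf("most common so far: %d (seen %ld times)\n",
                   best->value, best->times_seen);
        return 0;
    }

The last_seen field is the part that addresses the data-quality worry
above: it gives the planner (or a DBA) something to look at before deciding
whether to trust numbers that were gathered only from the queries that
happened to run.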