Re: Better default_statistics_target - Mailing list pgsql-patches
From | Chris Browne |
---|---|
Subject | Re: Better default_statistics_target |
Date | |
Msg-id | 608x48yk5f.fsf@dba2.int.libertyrms.com |
In response to | Re: Better default_statistics_target (Simon Riggs <simon@2ndquadrant.com>) |
Responses |
Re: Better default_statistics_target
Re: Better default_statistics_target |
List | pgsql-patches |
guillaume.smet@gmail.com ("Guillaume Smet") writes:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
>> there a reason not to make this change? I know I've been lazy and not run
>> any absolute figures, but rough tests show that raising it (from 10 to
>> 100) results in a very minor increase in analyze time, even for large
>> databases. I think the burden of a slightly slower analyze time, which
>> can be easily adjusted, both in postgresql.conf and right before running
>> an analyze, is very small compared to the pain of some queries - which worked
>> before - suddenly running much, much slower for no apparent reason at all.
>
> As Tom stated it earlier, the ANALYZE slow down is far from being the
> only consequence. The planner will also have more work to do and
> that's the hard point IMHO.
>
> Without studying the impacts of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

I have the theory (thus far not borne out by any numbers) that it
might be a useful approach to try to go through the DB schema and use
what information is there to try to come up with better numbers on a
per-column basis.

As a "first order" perspective on things:

- Any columns marked "unique" could keep to having somewhat smaller
  numbers of bins in the histogram because we know that uniqueness
  will keep values dispersed at least somewhat. Ditto for "SERIAL"
  types.

- Columns NOT marked unique should imply adding some bins to the
  histogram.

- Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
  bins." Similar for floats.

- Discrete values (integer, text) frequently see less dispersion,
  -> "more bins"

Then could come a "second order" perspective, where data would
actually get sampled from pg_statistics.

- If we look at the number of distinct histogram bins used, for a
  particular column, and find that there are some not used, we might
  drop bins.

- We might try doing some summary statistics to see how many unique
  values there actually are, on each column, and increase the number
  of bins if they're all in use, and there are other values that *are*
  frequently used.

  Maybe cheaper, if we find that pg_statistics tells us that all bins
  are in use, and extrapolation shows that there's a lot of the table
  NOT represented, we increase the number of bins.

There might even be a "third order" analysis, where you'd try to
collect additional data from the table, and analytically try to
determine appropriate numbers of bins...

Thus, we don't have a universal increase in the amount of statistics
collected - the added stats are localized to places where there is
some reason to imagine them useful.
-- 
let name="cbbrowne" and tld="acm.org" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
There was a young lady of Crewe
Whose limericks stopped at line two.
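
As a rough illustration of the "first order" idea in the message above, the per-column rules could be sketched in Python along the following lines. Nothing here comes from the thread except the base value of 10 (the default under discussion) and the direction of each adjustment. The names `Column`, `suggest_target` and `alter_statement`, the type groupings, and the halve/double multipliers are all hypothetical choices made for the sketch, not measured or recommended values.

```python
from dataclasses import dataclass

BASE_TARGET = 10  # the default_statistics_target being debated in the thread

# Hypothetical type groupings, chosen only for illustration.
DISPERSED_TYPES = {"date", "timestamp", "timestamptz", "real", "double precision"}
DISCRETE_TYPES = {"smallint", "integer", "bigint", "text", "varchar"}


@dataclass
class Column:
    table: str
    name: str
    type_name: str
    is_unique: bool = False
    is_serial: bool = False


def suggest_target(col: Column, base: int = BASE_TARGET) -> int:
    """Apply the first-order rules; the 0.5 and 2.0 factors are assumptions."""
    factor = 1.0
    # Unique or SERIAL columns: values are dispersed, so fewer bins.
    # Columns not marked unique: add some bins.
    factor *= 0.5 if (col.is_unique or col.is_serial) else 2.0
    # Datestamps and floats: somewhat fewer bins.
    # Discrete values (integer, text): more bins.
    if col.type_name in DISPERSED_TYPES:
        factor *= 0.5
    elif col.type_name in DISCRETE_TYPES:
        factor *= 2.0
    # Never suggest a target below 1.
    return max(1, int(base * factor))


def alter_statement(col: Column) -> str:
    """Render the per-column override (identifiers are not quoted here)."""
    return (
        f"ALTER TABLE {col.table} ALTER COLUMN {col.name} "
        f"SET STATISTICS {suggest_target(col)};"
    )


if __name__ == "__main__":
    for c in (
        Column("orders", "id", "integer", is_unique=True, is_serial=True),
        Column("orders", "status", "text"),
        Column("orders", "created_at", "timestamp"),
    ):
        print(alter_statement(c))
```

The output is a list of `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS` statements, so any increase stays local to the columns the rules single out rather than being applied globally. The "second order" step would need actual data from the statistics catalog and is not attempted in this sketch.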