Thread: Better default_statistics_target
Per a recent bug in which the planner can behave very differently at < 100, and accounting for the fact that analyze is still plenty fast on today's systems even at a tenfold increase, attached is a patch to change default_statistics_target from 10 to 100.
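For anyone following along, default_statistics_target is an ordinary GUC, so the proposed change amounts to a new built-in default; a minimal sketch of how it is set and inspected today (values as described above):

    # postgresql.conf -- the patch would change the built-in default
    # from 10 to 100
    default_statistics_target = 100

    -- from psql, check the value currently in effect
    SHOW default_statistics_target;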
Greg Sabino Mullane <greg@turnstep.com> writes:
> Per a recent bug in which the planner can behave very differently at
> < 100, and accounting for the fact that analyze is still plenty fast
> on today's systems even at a tenfold increase, attached is a patch to
> change default_statistics_target from 10 to 100.

This is not happening without a whole lot more evidence (as in, more
than zero) to back up the choice of value.

                        regards, tom lane
On Wed, 2007-11-14 at 00:00 -0500, Tom Lane wrote:
> Greg Sabino Mullane <greg@turnstep.com> writes:
> > Per a recent bug in which the planner can behave very differently at
> > < 100, and accounting for the fact that analyze is still plenty fast
> > on today's systems even at a tenfold increase, attached is a patch to
> > change default_statistics_target from 10 to 100.
>
> This is not happening without a whole lot more evidence (as in, more
> than zero) to back up the choice of value.

The choice of 100 is because of the way the LIKE estimator is
configured. Greg is not suggesting he measured it and found 100 to be
best; he is saying that the LIKE operator is hard-coded at 100 and so
the stats_target should reflect that.

Setting it to 100 for all columns because of LIKE doesn't make much
sense. I think we should set the stats target differently depending
upon the data type, but that's probably an 8.4 thing. Long text fields
that might use LIKE should be set to 100. CHAR(1) and general fields
should be set to 10.

Two thoughts:
- why did we pick 100 for the LIKE operator?
- should we document the better selectivity for LIKE operators at 100?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
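The per-column tuning Simon describes can already be done by hand today; a minimal sketch, using a hypothetical table and columns (nothing here is taken from the thread itself):

    -- hypothetical table and columns, for illustration only
    ALTER TABLE customers ALTER COLUMN notes  SET STATISTICS 100;  -- long text, searched with LIKE
    ALTER TABLE customers ALTER COLUMN status SET STATISTICS 10;   -- CHAR(1) flag
    ANALYZE customers;  -- the new targets take effect at the next ANALYZE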
"Simon Riggs" <simon@2ndquadrant.com> writes: > Long text fields that might use LIKE should be set to 100. CHAR(1) and > general fields should be set to 10. I could see arguing that either way. Longer fields are capable of more precision and so may need more buckets to predict. On the other hand longer fields take more space and take longer to compare so to make consistent use of resources you would want to avoid storing and comparing large numbers of them whereas you could afford much larger targets for small quick columns. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Hi,

> - why did we pick 100 for the LIKE operator?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg01715.php

> - should we document the better selectivity for LIKE operators at 100?

Probably... if I'm the only one who knows about it ;)

--
Matteo Beccati
Openads - http://www.openads.org
Simon spoke:
> The choice of 100 is because of the way the LIKE estimator is
> configured. Greg is not suggesting he measured it and found 100 to be
> best; he is saying that the LIKE operator is hard-coded at 100 and so
> the stats_target should reflect that.

Exactly.

> Setting it to 100 for all columns because of LIKE doesn't make much
> sense. I think we should set the stats target differently depending
> upon the data type, but that's probably an 8.4 thing. Long text fields
> that might use LIKE should be set to 100. CHAR(1) and general fields
> should be set to 10.

Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
there a reason not to make this change? I know I've been lazy and not
run any absolute figures, but rough tests show that raising it (from 10
to 100) results in a very minor increase in analyze time, even for
large databases. I think the burden of a slightly slower analyze time,
which can be easily adjusted, both in postgresql.conf and right before
running an analyze, is very small compared to the pain of some queries
- which worked before - suddenly running much, much slower for no
apparent reason at all.

Sure, 100 may have been chosen somewhat arbitrarily for the LIKE thing,
but this is a current real-world performance regression (aka a bug,
according to a nearby thread). Almost everyone agrees that 10 is too
low, so why not make it 100, throw a big warning in the release notes,
and then start some serious re-evaluation for 8.4?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200712050920
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
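For what it's worth, the "right before running an analyze" adjustment can be done per session without touching postgresql.conf; a minimal sketch (the table name is hypothetical):

    -- override for this session only, then re-gather statistics
    SET default_statistics_target = 100;
    ANALYZE orders;            -- hypothetical table
    RESET default_statistics_target;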
On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
> there a reason not to make this change? I know I've been lazy and not
> run any absolute figures, but rough tests show that raising it (from 10
> to 100) results in a very minor increase in analyze time, even for
> large databases. I think the burden of a slightly slower analyze time,
> which can be easily adjusted, both in postgresql.conf and right before
> running an analyze, is very small compared to the pain of some queries
> - which worked before - suddenly running much, much slower for no
> apparent reason at all.

As Tom stated earlier, the ANALYZE slowdown is far from being the only
consequence. The planner will also have more work to do, and that's the
hard point IMHO.

Without studying the impact of this change on a large set of queries in
different cases, it's quite hard to know for sure that it won't have a
negative impact in a lot of cases.

It's a bit too late in the cycle to change that IMHO, especially
without any numbers.

--
Guillaume
guillaume.smet@gmail.com ("Guillaume Smet") writes:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
>> there a reason not to make this change? I know I've been lazy and not
>> run any absolute figures, but rough tests show that raising it (from 10
>> to 100) results in a very minor increase in analyze time, even for
>> large databases. I think the burden of a slightly slower analyze time,
>> which can be easily adjusted, both in postgresql.conf and right before
>> running an analyze, is very small compared to the pain of some queries
>> - which worked before - suddenly running much, much slower for no
>> apparent reason at all.
>
> As Tom stated earlier, the ANALYZE slowdown is far from being the only
> consequence. The planner will also have more work to do, and that's
> the hard point IMHO.
>
> Without studying the impact of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

I have the theory (thus far not borne out by any numbers) that it might
be a useful approach to go through the DB schema and use what
information is there to try to come up with better numbers on a
per-column basis.

As a "first order" perspective on things:

- Any columns marked "unique" could keep to somewhat smaller numbers of
  bins in the histogram, because we know that uniqueness will keep
  values dispersed at least somewhat. Ditto for SERIAL types.

- Columns NOT marked unique should imply adding some bins to the
  histogram.

- Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
  bins." Similar for floats.

- Discrete values (integer, text) frequently see less dispersion ->
  "more bins."

Then could come a "second order" perspective, where data would actually
get sampled from pg_statistic:

- If we look at the number of distinct histogram bins used for a
  particular column and find that some are not used, we might drop
  bins.

- We might try doing some summary statistics to see how many unique
  values there actually are on each column, and increase the number of
  bins if they're all in use and there are other values that *are*
  frequently used. Maybe cheaper: if we find that pg_statistic tells us
  that all bins are in use, and extrapolation shows that there's a lot
  of the table NOT represented, we increase the number of bins.

There might even be a "third order" analysis, where you'd try to
collect additional data from the table and analytically try to
determine appropriate numbers of bins...

Thus, we don't have a universal increase in the amount of statistics
collected - the added stats are localized to places where there is some
reason to imagine them useful.

--
let name="cbbrowne" and tld="acm.org" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
There was a young lady of Crewe
Whose limericks stopped at line two.
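A rough sketch of the kind of "second order" inspection Chris describes, reading the pg_stats view (the human-readable face of pg_statistic); the cut-off value is an arbitrary placeholder, not something measured:

    -- columns with many distinct values are candidates for a larger
    -- per-column target; a negative n_distinct means "a fraction of the
    -- row count", i.e. cardinality grows with the table
    SELECT schemaname, tablename, attname, n_distinct
    FROM   pg_stats
    WHERE  schemaname NOT IN ('pg_catalog', 'information_schema')
      AND  (n_distinct > 100 OR n_distinct < 0)  -- 100 is an arbitrary cut-off
    ORDER  BY schemaname, tablename, attname;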
"Chris Browne" <cbbrowne@acm.org> writes: > - Any columns marked "unique" could keep to having somewhat smaller > numbers of bins in the histogram because we know that uniqueness > will keep values dispersed at least somewhat. I think you're on the wrong track. It's not dispersal that's significant but how evenly the values are dispersed. If the values are evenly spread throughout the region from low to high bound then we just need the single bucket telling us the low and high bound and how many values there are. If they're unevenly distributed then we need enough buckets to be able to distinguish the dense areas from the sparse areas. Perhaps something like starting with 1 bucket, splitting it into 2, seeing if the distributions are similar in which case we stop. If not repeat for each bucket. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
On Wed, 2007-12-05 at 15:13 -0500, Chris Browne wrote:
> I have the theory (thus far not borne out by any numbers) that it might
> be a useful approach to go through the DB schema and use what
> information is there to try to come up with better numbers on a
> per-column basis.

Yeh, agreed. The difficulty is making this work for generic datatypes.

> - Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
>   bins." Similar for floats.

Hmmm, not sure about that one. Some date/time columns can change very
quickly over time, so the stats are frequently out of date.

> Then could come a "second order" perspective, where data would actually
> get sampled from pg_statistic:
>
> - If we look at the number of distinct histogram bins used for a
>   particular column and find that some are not used, we might drop
>   bins.

The histograms are height-balanced, so they are always all used.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
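For anyone wanting to see the height-balanced behaviour Simon mentions, the stored bucket boundaries can be inspected directly; a small sketch against a hypothetical table and column (skewed data shows up as unevenly spaced boundaries, never as empty buckets):

    -- each adjacent pair of bounds delimits a bucket holding roughly the
    -- same number of sampled rows
    SELECT histogram_bounds
    FROM   pg_stats
    WHERE  tablename = 'orders'        -- hypothetical table
      AND  attname   = 'order_date';   -- hypothetical column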
On Wed, Dec 05, 2007 at 06:49:00PM +0100, Guillaume Smet wrote:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
> > Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
> > there a reason not to make this change? I know I've been lazy and not
> > run any absolute figures, but rough tests show that raising it (from 10
> > to 100) results in a very minor increase in analyze time, even for
> > large databases. I think the burden of a slightly slower analyze time,
> > which can be easily adjusted, both in postgresql.conf and right before
> > running an analyze, is very small compared to the pain of some queries
> > - which worked before - suddenly running much, much slower for no
> > apparent reason at all.
>
> As Tom stated earlier, the ANALYZE slowdown is far from being the only
> consequence. The planner will also have more work to do, and that's
> the hard point IMHO.

How much more? Doesn't it now use a binary search? If so, ISTM that
going from 10 to 100 would at worst double the time spent finding the
bucket we need. Considering that we're talking about something that
takes microseconds, and that there's a huge penalty to be paid if you
have bad stats estimates, that doesn't seem like a big deal. And on
modern machines it's not like the additional space in the catalogs is
going to kill us.

FWIW, I've never seen anything but a performance increase or no change
when going from 10 to 100. In most cases there's a noticeable
improvement, since it's common to have over 100k rows in a table, and
there's just no way to capture any kind of a real picture of that with
only 10 buckets.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
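The "at worst double" figure follows from the depth of a binary search over the bucket boundaries; a quick back-of-the-envelope check (assuming a plain binary search, as suggested above):

    -- ~4 comparisons to locate a bucket among 10 boundaries vs. ~7 among
    -- 100: roughly a factor of two
    SELECT ceil(log(2, 10))  AS steps_at_target_10,
           ceil(log(2, 100)) AS steps_at_target_100;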
> As Tom stated earlier, the ANALYZE slowdown is far from being the only
> consequence. The planner will also have more work to do, and that's
> the hard point IMHO.
>
> Without studying the impact of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

The decision to add the magic "99/100" number was made without any such
analysis either, and I can assure you it has caused lots of real-world
problems. Going from 10 to 100 adds a small amount of planner overhead.
The 99/100 change adds an order-of-magnitude speed difference to SELECT
queries. I still cannot see that as anything other than a major
performance regression.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200802032259
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
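A sketch of how one might observe the LIKE-estimator difference being discussed, using a hypothetical table and column; no figures are shown because the estimates depend entirely on the data, and the point at which the estimator starts trusting the histogram is the hard-coded 100 mentioned earlier in the thread:

    -- hypothetical table and column; compare the planner's row estimates
    -- at the old default target and at 100
    ALTER TABLE customers ALTER COLUMN name SET STATISTICS 10;
    ANALYZE customers;
    EXPLAIN SELECT * FROM customers WHERE name LIKE 'Smi%';

    ALTER TABLE customers ALTER COLUMN name SET STATISTICS 100;
    ANALYZE customers;
    EXPLAIN SELECT * FROM customers WHERE name LIKE 'Smi%';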