Thread: raising the default default_statistics_target
From time to time, people on IRC ask for help with performance problems, and the cause of the difficulty is ultimately traced to a poor query plan that is chosen because default_statistics_target is too low. While there will always need to be *some* tuning of the statistics target by advanced users, I wanted to see what the performance penalty would be of increasing the default stats target out of the box -- that way, more users would get good query plans without needing to manually tweak the configuration.

In the simple test I performed, raising default_statistics_target from 10 to 25 resulted in a 40% increase in the time to ANALYZE a large table. (I picked 25 more or less at random -- would 15 or 20 be better?) That's a larger hit than I was hoping to see; however, it can be argued that ANALYZE isn't really performance-critical anyway, since it doesn't hold the same kind of locks that VACUUM and especially VACUUM FULL hold. Also, I only have anecdotal evidence that this is actually a problem, and it may be the case that for those people for whom 10 is an insufficient stats target, 25 is also insufficient.

Any comments on whether increasing the default stats target is a good idea for 7.5? (Details on the test I performed are included below.)

-Neil

I created a 2.1 GB table with 3 columns (int, varchar, and float):

nconway=# select relpages from pg_class where relname = 'abc';
 relpages
----------
   279621
(1 row)

nconway=# select reltuples from pg_class where relname = 'abc';
  reltuples
-------------
 3.35545e+07
(1 row)

I tested two default_statistics_target settings: 10 (the current default), and 25. The test machine is a P4 1.8 GHz with 768 MB of RAM and a pretty mediocre 7200 RPM IDE disk running Linux 2.6.3. I rebooted the machine before and between tests.

ANALYZE w/ stats target = 10: 51.643 seconds
ANALYZE w/ stats target = 25: 71.969 seconds

(Additional tests performed w/o rebooting seem to be consistent with these numbers.)
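A minimal psql sketch of the kind of timing comparison described above, assuming Neil's table name abc; driving it with \timing and a per-session SET is just one reasonable way to run it, not necessarily how he did:

\timing on

-- current default
SET default_statistics_target = 10;
ANALYZE abc;

-- proposed higher default
SET default_statistics_target = 25;
ANALYZE abc;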
Neil Conway <neilc@samurai.com> writes:
> Any comments on whether increasing the default stats target is a good
> idea for 7.5? (Details on the test I performed are included below)

This is something we need to consider, but we'll need more evidence before making a choice. One thing that we have very little data about is how much difference it makes in the quality of planner choices. (There's no point in slowing down ANALYZE unless the plans get better.)

Also, I would expect that larger stats targets would slow down the parts of the planner that look at the stats, since there are more data values to examine. I do not have any numbers about this cost though --- do you want to try to get some?

regards, tom lane
Tom Lane wrote:
> This is something we need to consider, but we'll need more evidence
> before making a choice. One thing that we have very little data about
> is how much difference it makes in the quality of planner choices.

Right, but is there a practical way to actually get this data? If the distribution of data in the table is irregular, a higher stats target is needed to allow good planning choices. Therefore, the effect that the stats target has on planner choices depends on how regularly the data is distributed at a given installation, and there is no way to know that in general AFAICS.

> Also, I would expect that larger stats targets would slow down the parts
> of the planner that look at the stats, since there are more data values
> to examine. I do not have any numbers about this cost though --- do you
> want to try to get some?

Given the magnitude of the change (25 data elements versus 10), I wouldn't expect this to produce a major change in the total runtime of the optimizer. However, I don't know the optimizer that well, so I'll do some benchmarks when I get a chance.

-Neil
Neil Conway <neilc@samurai.com> writes:
> Tom Lane wrote:
>> This is something we need to consider, but we'll need more evidence
>> before making a choice. One thing that we have very little data about
>> is how much difference it makes in the quality of planner choices.

> Right, but is there a practical way to actually get this data?

I haven't thought of one yet, but perhaps someone will have an idea.

>> Also, I would expect that larger stats targets would slow down the parts
>> of the planner that look at the stats, since there are more data values
>> to examine. I do not have any numbers about this cost though --- do you
>> want to try to get some?

> Given the magnitude of the change (25 data elements versus 10), I
> wouldn't expect this to produce a major change in the total runtime of
> the optimizer.

I wouldn't either, but if we need to raise the stats target to 100 or 1000 to make a meaningful difference, then the question becomes more urgent.

regards, tom lane
Neil,

> In the simple test I performed, raising the default_statistics_target
> from 10 to 25 resulted in a 40% increase in the time to ANALYZE a
> large table. (I picked 25 more or less at random -- would 15 or 20 be
> better?)

I find that very interesting, since I haven't found much higher increases to carry a proportionate penalty. For example, on an 11-column table, raising 3 columns to statistics = 250 merely doubled the ANALYZE time. I have not done exact timing, but would be happy to ....

> It may also be the case that for those people for whom 10
> is an insufficient stats target, 25 is also insufficient.

It is. I've found that "problem" queries, especially those caused by real, uneven distribution of data, require raising statistics to 150-400 in order to fix. This is much too high a level to assign as a default.

> Any comments on whether increasing the default stats target is a good
> idea for 7.5? (Details on the test I performed are included below)

No. I don't think it's a good idea to raise the default for *all* columns; for one thing, I'd really hate to think what, say, a default stats target of 100 would do to a TEXT column with an average of 8K of data per row. Further, in 7.5 we'll be introducing correlated stats for multi-column indexes (unless something's gone off with that?), which should help a lot of problem queries and change our whole emphasis on brute-forcing analyze by increasing stats into the 100's.

If you really want to tackle this issue, though, here's what I suggest:

1) Add a GUC called default_statistics_indexed, which starts at, say, 100 or 50.

2) Whenever the user indexes a column, automatically increase its stats target to the level in default_statistics_indexed, if it is currently at the level in default_statistics_target.

This will then give indexed columns "automatically" a somewhat higher level of stats analysis than other columns. This should help a lot of "slow query" problems, yet effectively leave the selection of "important" columns in the hands of the DBA. Make sense?

Also, another great feature in this department would be to extend the multi-column correlation statistics to cover foreign keys, as a way of improving cross-table estimates.

Anyway, keep me in the loop on this; I have a lot of very complex databases I can test such issues on.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
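For reference, the per-column overrides Josh is talking about (raising problem columns into the 150-400 range) are what a DBA does today with ALTER TABLE ... SET STATISTICS; a minimal sketch, with the table and column names invented purely for illustration:

-- Raise the stats target only for the skewed "problem" columns, then re-gather stats.
-- 250 is picked from the 150-400 range mentioned above; orders/customer_id/order_date are hypothetical.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 250;
ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 250;
ANALYZE orders;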
Josh Berkus <josh@agliodbs.com> writes:
> It is. I've found that "problem" queries, especially those caused by real,
> uneven distribution of data, require raising statistics to 150-400 in order
> to fix. This is much too high a level to assign as a default.

That's basically what's bothering me about the suggestion to increase to 25 --- I'm dubious that it will do any good.

> Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
> (unless something's gone off with that?)

News to me. It's certainly not there now.

> This will then give indexed columns "automatically" a somewhat higher
> level of stats analysis than other columns.

That is potentially a good idea. There's still the question of what is a reasonable default, though.

regards, tom lane
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > It is. I've found that "problem" queries, especially those caused by real,
> > uneven distribution of data, require raising statistics to 150-400 in order
> > to fix. This is much too high a level to assign as a default.
>
> That's basically what's bothering me about the suggestion to increase to
> 25 --- I'm dubious that it will do any good.
>
> > Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
> > (unless something's gone off with that?)
>
> News to me. It's certainly not there now.
>
> > This will then give indexed columns "automatically" a somewhat higher
> > level of stats analysis than other columns.
>
> That is potentially a good idea. There's still the question of what is
> a reasonable default, though.

Do all the columns have to have the same number of statistics buckets? Could the stats collector adjust the number of buckets based on the data somehow?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do all the columns have to have the same number of statistics buckets?

They do not, but the effort spent by ANALYZE is proportional to the largest stats target among all the columns of the table.

regards, tom lane
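To make the per-column point concrete: the target is stored per column in pg_attribute.attstattarget (-1 means "follow default_statistics_target"), and ANALYZE's sample size tracks the largest value among a table's columns (roughly 300 * target rows, per Tom's note later in this thread). The query below is just one way to inspect those settings for Neil's table abc, not something from the thread itself:

-- Show each column's stats target; -1 = use default_statistics_target.
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'abc'::regclass
  AND attnum > 0 AND NOT attisdropped;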
On 7 Mar, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>> Tom Lane wrote:
>>> This is something we need to consider, but we'll need more evidence
>>> before making a choice. One thing that we have very little data about
>>> is how much difference it makes in the quality of planner choices.
>
>> Right, but is there a practical way to actually get this data?
>
> I haven't thought of one yet, but perhaps someone will have an idea.

If DBT-3 is an appropriate test, we can get those EXPLAIN ANALYZE changes I have working into the STP version and queue up tests varying the default_statistics_target parameter.

Mark
Tom,

> > Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
> > (unless something's gone off with that?)

This was discussed on Hackers in October; a complete implementation was shown, and I thought it was committed at that time. If not, what happened to it?

Dammit, it's impossible to find anything in the archives if you don't have some good keywords or at least the author. Is the author reading this? Will you speak up?

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
>> Further, in 7.5 we'll be introducing correlated stats for multi-column
>> indexes (unless something's gone off with that?)

> This was discussed on Hackers in October, a complete implementation was shown,
> I thought it was committed at that time. If not, what happened to it?

Are you sure you're not thinking of stats for functional indexes?

regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Do all the columns have to have the same number of statistics buckets?
>
> They do not, but the effort spent by ANALYZE is proportional to the
> largest stats target among all the columns of the table.

Could we use previous stats to determine how many buckets to use when running ANALYZE? Also, if columns can have different numbers of buckets, does that mean that we don't have the same per-query overhead for a larger stats target?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Tom,

> Are you sure you're not thinking of stats for functional indexes?

Positive. I even remember seeing that the patch was accepted.

The patch specifically had to do with a multi-column correlation algorithm for improving the selectivity of multi-column indexes.

Problem is, with 1400 posts per month August to October, I can't find it, and the keywords that I think are obvious don't turn anything up.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
I think the thread you're thinking of is on or about this post:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php

Manfred Koizar produced a patch that modified index correlation by sorting equal key values based on item pointers. The patch went as far as getting accepted into the patch queue, but Tom raised some doubts about it and it was subsequently removed.

Robert Treat

On Mon, 2004-03-08 at 14:41, Josh Berkus wrote:
> Tom,
>
> > Are you sure you're not thinking of stats for functional indexes?
>
> Positive. I even remember seeing that the patch was accepted.
>
> The patch specifically had to do with a multi-column correlation algorithm for
> improving the selectivity of multi-column indexes.
>
> Problem is, with 1400 posts per month August to October, I can't find it, and
> the keywords that I think are obvious don't turn anything up.
>
> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco

--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
> I think the thread you're thinking of is on or about this post:
> http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php

> Manfred Koizar produced a patch that modified index correlation by
> sorting equal key values based on item pointers. The patch went as far
> as getting accepted into the patch queue, but Tom raised some doubts
> about it and it was subsequently removed.

Hm, that had nothing to do with multi-column correlation though. I'm at a loss to think of any work that matches with Josh's recollection.

regards, tom lane
Guys,

> Hm, that had nothing to do with multi-column correlation though.
> I'm at a loss to think of any work that matches with Josh's
> recollection.

Hmmmm .... it's possible that early e-mails about Manfred's patch claimed to improve performance for multi-column indexes. But it's also possible I'm remembering something else.

Darn it, though! 'cause multi-column correlation is one of our big issues on estimates for complex queries.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
On Sun, 7 Mar 2004, Tom Lane wrote:

> Neil Conway <neilc@samurai.com> writes:
> > Tom Lane wrote:
> >> This is something we need to consider, but we'll need more evidence
> >> before making a choice. One thing that we have very little data about
> >> is how much difference it makes in the quality of planner choices.
>
> > Right, but is there a practical way to actually get this data?
>
> I haven't thought of one yet, but perhaps someone will have an idea.

Hi Tom. I ran some very simple tests of analyze times and query plan times on a very simple table, with data randomly distributed. The index was on a date field, since that's what I was testing last. This was all done on my 1.1 GHz Celeron workstation with 512 MB of memory and an IDE drive. I'd love more input on better testing methodologies here...

The tables have 100k or 1M rows that look kinda like this (I'll test 10M rows later, which means the dataset won't fit in memory, so there'll be lots of disk access going on; right now the 1M row table is 80 MB):

select * from test2 limit 5;
                     info                     |         dt          |   id
----------------------------------------------+---------------------+---------
 Francize perfectible swirling fluctuates     | 2004-05-20 20:12:04 | 2721995
 Fields chauffeur attentionality grandmother  | 2004-04-07 14:36:02 | 2721996
 Belgium bilked explosively defendant         | 2004-09-16 16:27:22 | 2721997
 perspectives Buenos Pollux discriminates     | 2004-11-11 12:28:31 | 2721998
 Victorianize Savonarola blackmails sufficed  | 2004-02-27 21:17:20 | 2721999
(5 rows)

Here's what I get with different statistics targets for analyze times:

          100k      1M        1M
          analyze   analyze   plan
 target   ms        ms        ms
     10       250       875     2
     20       350      1250
     30       430      1500
     40       520      1725
     50       580      1900
     60       690      2100
     70       775      2175
     80       850      2300
     90       950      2400
    100      1000      2600     2.5
    200      1806      3700
    300      2600      4800
    400      2600      5900
    500      2600      7200
    700      2600      9500
   1000      2600     13000     5

Since this data is randomly distributed, I didn't bother doing a lot of testing to see how accurate each target setting was. If that would be useful to know I'd gladly test it, but I was only setting out to test the time to analyze and the time to plan.

Note that I only tested 3 targets for planning time, as it didn't seem to make a very big difference. The query was:

select * from test2 where dt between 'march 11, 2004' and 'march 13, 2004';

I also ran some quick tests on smaller tables (1000 and 10k rows) and there, the plateau that we see in the 100k analyze shows up much quicker, at something like 50 or so. I.e. the analyze time flattened out quickly and higher numbers cost very little if anything.

Since this query was quite an easy plan, I'd expect to need a much more complex one to test the increase in planning time, say something that has to look at a lot of statistics. Any particular join type or something that's likely to do that?
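A rough sketch of how one might rerun Scott's measurement; the table and query are taken from his message, while driving the loop by hand with SET, \timing, and a bare EXPLAIN for the planning side is an assumption about methodology, not necessarily what he did:

\timing on

SET default_statistics_target = 100;   -- vary: 10, 20, ... 1000
ANALYZE test2;                         -- analyze time at this target

EXPLAIN select * from test2
 where dt between 'march 11, 2004' and 'march 13, 2004';   -- plan time only, no execution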
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Hi Tom. I ran some very simple tests on analyze times and query plan > times on a very simple table, with data randomly distributed. The index > was on a date field, since that's what I was testing last. Thanks. > I also ran some quick tests on smaller tables (1000 and 10k rows) and > there, the plateau that we see in the 100k analyze shows up much quicker, > at something like 50 or so. I.e. the analyze time flattened out quickly > and higher numbers cost very little if anything. The sample size is (IIRC) 300 times stats_target rows, so the "plateau" that you're seeing occurs when the sample size becomes the entire table. It would be useful to note how large the ANALYZE process got to be during these runs. > Since this query was quite an easy plan, I'd expect to need a much more > complex one to test the increase in planning time, say something that has > to look at a lot of statistics. Any particular join type or something > that's likely to do that? I'd say try a join on any reasonably plausible foreign-key relationship (unique key on one side, not-unique data on the other). That's probably the most common situation. As for making it complicated, just stack up a bunch of such joins ... regards, tom lane