Thread: Expected accuracy of planner statistics
I have some databases that have grown significantly over time (as
databases do). As the databases have grown, I have noticed that the
statistics have grown less and less accurate. In particular, the
n_distinct values have become many orders of magnitude (OOM) too small
for certain foreign key columns. Predictably this leads to poor query
plans. The databases in question were all using the default stats
target value, so naturally the thing to do is to increase that and see
what happens.

First I'll show you one table in question:

qa_full=# \d fk
                           Table "public.fk"
    Column     |            Type             |   Modifiers
---------------+-----------------------------+---------------
 fk_id         | bigint                      | not null
 st_id         | bigint                      | not null
 is_positive   | boolean                     | not null
 mc_id         | character varying(20)       | not null
 matching_seed | character varying(20)       |
 ft_id         | character varying(20)       |
 s_title       | text                        | not null
 a_summary     | text                        | not null
 date_created  | timestamp without time zone | default now()
 qx_id         | bigint                      |
Indexes:
    "fk_pkey" PRIMARY KEY, btree (fk_id)
    "fk_st_mc_id_idx" UNIQUE, btree (st_id, mc_id)
    "fk_date_created_is_positive_idx" btree (is_positive, date_created)
    "fk_st_id_idx" btree (st_id)
Foreign-key constraints:
    "fk_qx_id_fkey" FOREIGN KEY (qx_id) REFERENCES st(st_id) ON DELETE RESTRICT
    "fk_st_id_fkey" FOREIGN KEY (st_id) REFERENCES st(st_id) ON DELETE RESTRICT

qa_full=# select count(*) from fk;
   count
-----------
 195555889

Here are the n_distinct stats on the st_id column with stock stats
settings:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |      14910

And here's the actual distinct count:

qa_full=# select count(distinct st_id) from fk;
  count
----------
 15191387
(1 row)

Here's what it looks like after turning the stats target up to 100:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     136977

Still way off (3 OOM), so let's pull out the stops and
go for 1000:

qa_full=# select attname, n_distinct from pg_stats
          where tablename='fk' and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     860796

Better, but still way off. Here's more of the pg_stats row for the
curious, with the stats target at 1000:

schemaname        | public
tablename         | fk
attname           | st_id
null_frac         | 0
avg_width         | 8
n_distinct        | 860796
most_common_vals  | {9822972459012807,81553350123749183,50260420266636724,
                    16953859416556337,57992478091506908,6789385517968759,
                    13155841310992808,4649594156182905,11950505984130111,
                    19815690615418387,23232929805154508,24940819255590358,
                    25304517086243633,30084673952005845,33845252828401578,
                    36510232790970904,44301350711321256,47572440754042499,
                    66302045808587415,106949745150210138,7948257888859857,
                    11709841786637953,12034360925626832,17311819170902574,
                    21933556169120032,31401742852411043,37178443803282644,
                    39714175315169346,42699954975194688,63648700912541567,
                    73785794393665562,...many elided...}
most_common_freqs | {7.33333e-05,6.66667e-05,5.33333e-05,5e-05,5e-05,
                    4.66667e-05,4.66667e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4e-05,4e-05,
                    4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,
                    4e-05,4e-05,4e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3e-05,3e-05,
                    3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,
                    ...many elided...}
histogram_bounds  | {9474697855526,186642098833097,425502410065792,
                    655064117100237,917344884999940,1135224280975580,
                    1510900775316064,1919850381534192,2391828632704465,
                    2773745714634569,3197981109338899,3601128214604953,
                    3887435029566307,4289757501117626,4604286546172963,
                    5030605000015434,5410915764179364,5712662986537560,
                    6096452674229658,6531206443844232,6761515475182966,
                    6924281850823004,7145897868348599,7357502317108796,
                    7537560231072453,7737194605867515,7923617661480232,
                    8094845122681350,8304911973154200,8504211340608556,
                    8735469559703009,9008968782181381,9233161779966219,
                    ...many elided...}
correlation       | 0.770339

The correlation is likely high here because this table has been
clustered on this column in the past. I don't know if that contributes
to the n_distinct inaccuracy, and I don't know if I have the patience
to reorder the table to find out ;^)

Note that new st_ids are also being added all the time, at a rate
roughly proportional to fk rows (fk rows being added more frequently).
So a fractional value for n_distinct here would actually be more ideal.
The docs hint that analyze will sometimes decide to use a fractional
(negative) value. What triggers that?

I was also trying to figure out how big the sample really is. Does a
stats target of 1000 mean 1000 rows sampled? If the sample really is a
fixed number of rows, it would seem to my naive eyes that sampling a
fraction of the rows (like 0.1% or something) would be better
(especially in cases like this), but maybe it already tries to do that.

Any insights appreciated.

-Casey
On Thu, Sep 28, 2006 at 03:19:46PM -0700, Casey Duncan wrote:
> I have some databases that have grown significantly over time (as
> databases do). As the databases have grown, I have noticed that the
> statistics have grown less and less accurate. In particular, the
> n_distinct values have become many OOM too small for certain foreign
> key columns. Predictably this leads to poor query plans.

Search the -hackers archives. The problem is that you can't actually
get a good n_distinct estimate if you're sampling less than a very
large chunk of the table. Since our sampling maxes out at something
like 30k pages, at some point the n_distinct estimates just degrade. :(

Patches/solutions welcome. :)
--
Jim Nasby                                  jim@nasby.net
EnterpriseDB     http://enterprisedb.com   512.569.9461 (cell)
Casey Duncan <casey@pandora.com> writes:
> I was also trying to figure out how big the sample really is. Does a
> stats target of 1000 mean 1000 rows sampled?

No. From memory, the sample size is 300 times the stats target (eg,
3000 rows sampled for the default target of 10). This is based on some
math that says that's enough for a high probability of getting good
histogram estimates. Unfortunately, that math promises nothing about
n_distinct.

The information we've seen says that the only statistically reliable
way to arrive at an accurate n_distinct estimate is to examine most of
the table :-(. Which seems infeasible for extremely large tables, which
is exactly where the problem is worst. Marginal increases in the sample
size seem unlikely to help much ... as indeed your experiment shows.

We could also diddle the estimator equation to inflate the estimate.
I'm not sure whether such a cure would be worse than the disease, but
certainly the current code was not given to us on stone tablets. IIRC
I picked an equation out of the literature partially on the basis of
it being simple and fairly cheap to compute...

			regards, tom lane
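To put those numbers in perspective against the fk table above, here is
a quick back-of-the-envelope sketch of the "300 rows per unit of stats
target" rule (the actual sampling logic in analyze.c is more involved;
this is only the arithmetic):

```python
# ANALYZE samples roughly 300 rows per unit of statistics target.
# The 300x constant comes from histogram error bounds, which is why
# it promises nothing about n_distinct accuracy.
table_rows = 195_555_889  # the fk table from the original post

for target in (10, 100, 1000):
    sample = 300 * target
    print(f"target={target:5d}  sample={sample:7d}  "
          f"fraction of table={sample / table_rows:.5%}")
```

Even at a stats target of 1000, the sample is about 300k rows, i.e.
roughly 0.15% of this table — close to the "0.1% or something" fraction
Casey speculated about, but fixed in absolute size rather than scaling
with the table.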
On Sep 28, 2006, at 8:51 PM, Tom Lane wrote:
[..]
> The information we've seen says that the only statistically reliable
> way to arrive at an accurate n_distinct estimate is to examine most
> of the table :-(. Which seems infeasible for extremely large tables,
> which is exactly where the problem is worst. Marginal increases in
> the sample size seem unlikely to help much ... as indeed your
> experiment shows.

I think a first step might be to introduce a new analyze command, such
as ANALYZE FULL. This would be executed deliberately (IOW not by
autovacuum), like CLUSTER or VACUUM FULL, when deemed necessary by the
dba. The command as implied would scan the entire table and fill in the
stats based on that (as analyze used to do, IIRC).

It would also be useful if this command froze the stats so that
autovacuum didn't clobber them with inaccurate ones shortly thereafter.
Perhaps an explicit ANALYZE FULL FREEZE command would be useful for
that case, the behavior being that a normal ANALYZE would not overwrite
the stats for a stats-frozen table; another ANALYZE FULL would,
however. Such a frozen state would also be useful if you wanted to
hand-tweak stats for a single table, have it stick, and still use
autovac. As I understand it now, with autovac on, you cannot do that
unless you hack the pg_autovacuum table (i.e., set anl_base_thresh to
an artificially high value).

Another option (that I think others have suggested) would be to make
this the behavior of VACUUM ANALYZE. That saves the baggage of a new
command, at least. Another advantage would be that the autovac daemon
could run it.

Perhaps some smarts could also be built in. What if VACUUM ANALYZE
first runs a normal (sampled) ANALYZE, then performs the VACUUM with a
full ANALYZE pass? The stats gathered by the latter full pass are
compared to those of the first sampled pass.
If the full ANALYZE statistics are sufficiently different from the
sampled pass, then the table is flagged so that a normal ANALYZE is not
performed by the autovac daemon on that table. Also, a global ANALYZE
could ignore it (though this seems more magical).

A more pie-in-the-sky idea could take advantage of the fact that the
larger a table is, the less likely its statistics are to change much
over time. If we cannot afford to sample many rows in a given analyze
pass, then perhaps we should use a "Newton's method" approach, where we
attempt to converge on an accurate value over time, with each analyze
pass contributing more samples to the statistics and honing them
incrementally rather than simply replacing the old ones.

I'm no statistician, so it's not clear to me how much more state you
would need to keep between analyze passes to make this viable, but in
order for this to work the following would need to be true:

1) Analyze would need to be run on a regular basis (luckily we have
   autovacuum to help). You would want to analyze the table
   periodically even if nothing much changed, however. Perhaps tuning
   the autovac parameters is enough here.

2) Each analyze pass would need to sample randomly, so that multiple
   passes tend to sample different rows.

3) The stats would need to somehow be cumulative. Perhaps this means
   storing sample values between passes, or some other statistical
   voodoo.

4) It needs to be smart enough to realize when a table has changed
   drastically, and toss out the old stats in that case. Either that,
   or we require a human to tell us via ANALYZE FULL/VACUUM ANALYZE.

I think that the incremental stats approach would more or less depend
on the full ANALYZE functionality for bootstrapping. I think when you
first load the table, you want to get the stats right immediately and
not wait some indeterminate amount of time for them to "converge" on
the right value.

-Casey
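One classical way to get the "cumulative sample" behavior sketched in
points 2 and 3 is reservoir sampling, which maintains a fixed-size
uniform sample that can absorb new rows incrementally. The following is
only an illustrative sketch (the Reservoir class is invented here; this
is not how ANALYZE maintains its sample):

```python
import random

class Reservoir:
    """Fixed-size uniform random sample over all rows ever offered
    (Vitter's Algorithm R). Each "analyze pass" can feed rows in, and
    the sample remains uniform over everything seen so far, rather
    than being discarded and rebuilt on every pass."""

    def __init__(self, k):
        self.k = k        # sample size to maintain
        self.seen = 0     # total rows offered so far
        self.sample = []

    def offer(self, rows):
        for row in rows:
            self.seen += 1
            if len(self.sample) < self.k:
                self.sample.append(row)
            else:
                # Keep each incoming row with probability k / seen,
                # evicting a uniformly chosen current member.
                j = random.randrange(self.seen)
                if j < self.k:
                    self.sample[j] = row

r = Reservoir(k=1000)
r.offer(range(50_000))            # first pass over some rows
r.offer(range(50_000, 100_000))   # later pass; sample now covers both
```

The hard part the mail alludes to in point 4 remains: a reservoir has
no notion of deleted or updated rows, so detecting when the accumulated
history should be thrown away is a separate problem.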
In article <20060929010011.GQ34238@nasby.net>, jim@nasby.net
("Jim C. Nasby") wrote:
> The problem is that you can't actually get a good n_distinct estimate
> if you're sampling less than a very large chunk of the table. Since
> our sampling maxes out at something like 30k pages, at some point the
> n_distinct estimates just degrade. :(

Can the DBA just set n_distinct? Sometimes s/he just knows what the
value should be.

Then, of course, the question becomes how to keep vacuum et al. from
messing it up.

-arturo
Arturo Perez <aperez@hayesinc.com> writes:
> Can the DBA just set n_distinct? Sometimes s/he just knows what the
> value should be.

Having an expensive process run once in a while to set this value also
sounds interesting. If it has to be calculated every time, that's a bad
thing, but having some kind of command or function to update it that
could be called when the database has a lower load would be
interesting. For companies that work from 8am to 5pm, this could be
scheduled to run every night...

> Then, of course, the question becomes how to keep vacuum et al from
> messing it up.

Vacuum would not touch these settings unless the specific command is
called; it could gain a new parameter "VACUUM FULL N_DISTINCT ..." to
touch them (and then we could probably discard the extra command /
function), or it could update these settings only when called with
ANALYZE...

--
Jorge Godoy <jgodoy@gmail.com>
> [snip]
> Having an expensive process run once in a while and setting this
> value also sounds interesting.
> [snip]

What about externalizing the statistics calculation? I mean, would it
make sense to have, e.g., a WAL-fed standby with an additional process
that keeps the statistics in sync based on the incoming WAL records,
and feeds the stats back to the master as soon as they change
significantly? The standby would be able to crunch ALL the data, in
almost real time... with almost no overhead for the master.

It would require another server, though... but I guess that where
analyze is a problem, throwing another server at it is not a problem.

Cheers,
Csaba.
Tom Lane wrote:
> The information we've seen says that the only statistically reliable
> way to arrive at an accurate n_distinct estimate is to examine most
> of the table :-(.
> IIRC I picked an equation out of the literature partially on the
> basis of it being simple and fairly cheap to compute...

I'm very curious about this - can you recall where you got this, or at
least point me to where in the code this happens?

Thanks.

- John Burger
  MITRE
"John D. Burger" <john@mitre.org> writes:
> Tom Lane wrote:
>> IIRC I picked an equation out of the literature partially on the
>> basis of it being simple and fairly cheap to compute...

> I'm very curious about this - can you recall where you got this, or
> at least point me to where in the code this happens?

src/backend/commands/analyze.c, around line 1930 as of CVS HEAD:

	/*----------
	 * Estimate the number of distinct values using the estimator
	 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
	 *		n*d / (n - f1 + f1*n/N)
	 * where f1 is the number of distinct values that occurred
	 * exactly once in our sample of n rows (from a total of N),
	 * and d is the total number of distinct values in the sample.
	 * This is their Duj1 estimator; the other estimators they
	 * recommend are considerably more complex, and are numerically
	 * very unstable when n is much smaller than N.
	 *
	 * Overwidth values are assumed to have been distinct.
	 *----------
	 */

			regards, tom lane
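For anyone who wants to play with that estimator outside the backend,
here is a direct transcription of the Duj1 formula from that comment (a
sketch for experimentation, not the analyze.c code itself):

```python
from collections import Counter

def duj1(sample, total_rows):
    """Haas/Stokes Duj1 estimator: n*d / (n - f1 + f1*n/N), where n is
    the sample size, N the table row count, d the number of distinct
    values in the sample, and f1 the number of values that occurred
    exactly once in the sample."""
    n, N = len(sample), total_rows
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / N)

# Two boundary cases that follow directly from the formula:
print(duj1(list(range(100)), 10_000))  # all sample values unique -> 10000.0 (= N)
print(duj1([42] * 100, 10_000))        # one value everywhere -> 1.0
```

The boundary cases show the estimator's behavior: when every sampled
value is unique (n = d = f1), the estimate collapses to N, and when the
sample holds a single repeated value it collapses to 1 — the trouble
Casey hit lies in between, where a small sample makes many moderately
frequent values look rarer than they are.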
> src/backend/commands/analyze.c, around line 1930 as of CVS HEAD:
>
> 	/*----------
> 	 * Estimate the number of distinct values using the estimator
> 	 * proposed by Haas and Stokes in IBM Research Report RJ 10025:

Thanks for the pointer, Tom. I shouldn't have been surprised to find
such a nice comment pointing me at the literature.

- John D. Burger
  MITRE