Re: ANALYZE sampling is too good - Mailing list pgsql-hackers
From | Mark Kirkwood
---|---
Subject | Re: ANALYZE sampling is too good
Date |
Msg-id | 52A3B812.4070407@catalyst.net.nz
In response to | Re: ANALYZE sampling is too good (Mark Kirkwood <mark.kirkwood@catalyst.net.nz>)
Responses | Re: ANALYZE sampling is too good
List | pgsql-hackers
On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not. So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size. If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based
>> on a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan
>> Boley has spoken on. For example, our stats code has no way to
>> recognize a normal or exponential distribution -- it assumes that all
>> columns are randomly distributed. If we could recognize common
>> distribution patterns, then not only could we have better query
>> estimates, those would require keeping *fewer* stats, since all you
>> need for a normal distribution are the end points and the variance.
>
> From src/backend/commands/analyze.c
>
>  * As of May 2004 we use a new two-stage method: Stage one selects up
>  * to targrows random blocks (or all blocks, if there aren't so many).
>  * Stage two scans these blocks and uses the Vitter algorithm to create
>  * a random sample of targrows rows (or less, if there are less in the
>  * sample of blocks). The two stages are executed simultaneously: each
>  * block is processed as soon as stage one returns its number and while
>  * the rows are read stage two controls which ones are to be inserted
>  * into the sample.
>
> I don't think we always read every block (been a while since I looked
> at this code, so I'll recheck). From what I understand this stuff is
> based on reasonable research (Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not
> just dreamed up with no valid research!
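To make the two-stage scheme quoted above a bit more concrete, here is a
toy, self-contained illustration (ordinary C, *not* the backend code - the
real thing is acquire_sample_rows() in analyze.c, which picks blocks
without replacement and uses Vitter's algorithm Z rather than this naive
reservoir; the rows-per-block figure is just made up for the demo):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS_PER_BLOCK 60          /* assumed rows per 8kB page */

int
main(void)
{
    int     nblocks = 163935;      /* e.g. pgbench_accounts at scale 100 */
    int     targrows = 30000;      /* 300 * statistics target of 100 */
    int    *sample = malloc(sizeof(int) * targrows);
    int     nsampled = 0;          /* rows currently held in the reservoir */
    long    rows_seen = 0;

    srandom((unsigned) time(NULL));

    /* Stage one: visit up to targrows blocks (all of them if fewer). */
    int     blocks_to_visit = (nblocks < targrows) ? nblocks : targrows;

    for (int b = 0; b < blocks_to_visit; b++)
    {
        int     blockno = random() % nblocks;

        /* Stage two: reservoir-sample the rows found in this block. */
        for (int r = 0; r < ROWS_PER_BLOCK; r++)
        {
            int     rowid = blockno * ROWS_PER_BLOCK + r;

            if (nsampled < targrows)
                sample[nsampled++] = rowid;     /* reservoir not full yet */
            else
            {
                /* keep the new row with probability targrows / (rows_seen + 1) */
                long    k = random() % (rows_seen + 1);

                if (k < targrows)
                    sample[k] = rowid;
            }
            rows_seen++;
        }
    }

    printf("visited %d blocks, saw %ld rows, kept %d in the sample\n",
           blocks_to_visit, rows_seen, nsampled);
    free(sample);
    return 0;
}

The point being that the I/O cost is driven by the number of *blocks*
visited, not by the final number of rows kept.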
Since I volunteered to recheck :-) - from analyze.c again:

/*--------------------
 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.
 *--------------------
 */
stats->minrows = 300 * attr->attstattarget;

So for typical settings (default_statistics_target = 100), we try to get
30000 rows. This means we will sample up to about 30000 blocks.

Indeed, quickly checking with a scale 100 pgbench db and a simple patch
to make the block sampler say how many blocks it reads (note that
pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 1 blocks
ANALYZE
Time: 16.935 ms
bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q

Regards

Mark
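P.S. For anyone who wants to redo the arithmetic in that comment, a
throwaway program along these lines (plain C, nothing backend-specific)
reproduces both the 305.82 * k figure and the <= 0.66 error bound at
n = 10^12:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double  f = 0.5;               /* maximum relative error in bin size */
    double  gamma_ = 0.01;         /* error probability */
    double  sizes[] = {1e6, 1e9, 1e12};

    for (int i = 0; i < 3; i++)
    {
        double  n = sizes[i];

        /* minimum sample size per Chaudhuri et al., divided by k */
        double  r_per_k = 4.0 * log(2.0 * n / gamma_) / (f * f);

        /* error bound f implied by capping the sample at 300 * k rows */
        double  f_at_300k = sqrt(4.0 * log(2.0 * n / gamma_) / 300.0);

        printf("n = %.0e: r = %.2f * k, f at 300*k rows = %.2f\n",
               n, r_per_k, f_at_300k);
    }
    return 0;
}

For n = 10^6 it prints r = 305.82 * k, and even at n = 10^12 the bin size
error with a 300 * k sample stays around 0.66, which is the justification
for the flat "stats->minrows = 300 * attr->attstattarget".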