Re: ANALYZE sampling is too good - Mailing list pgsql-hackers
From | Mark Kirkwood
---|---
Subject | Re: ANALYZE sampling is too good
Date |
Msg-id | 52A3B812.4070407@catalyst.net.nz
In response to | Re: ANALYZE sampling is too good (Mark Kirkwood <mark.kirkwood@catalyst.net.nz>)
Responses | Re: ANALYZE sampling is too good
List | pgsql-hackers
On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not. So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size. If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based
>> on a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan
>> Boley has spoken on. For example, our stats code has no way to
>> recognize a normal or exponential distribution -- it assumes that all
>> columns are randomly distributed. If we could recognize common
>> distribution patterns, then not only could we have better query
>> estimates, those would require keeping *fewer* stats, since all you
>> need for a normal distribution are the end points and the variance.
>
> From src/backend/commands/analyze.c
>
>  * As of May 2004 we use a new two-stage method: Stage one selects up
>  * to targrows random blocks (or all blocks, if there aren't so many).
>  * Stage two scans these blocks and uses the Vitter algorithm to create
>  * a random sample of targrows rows (or less, if there are less in the
>  * sample of blocks). The two stages are executed simultaneously: each
>  * block is processed as soon as stage one returns its number and while
>  * the rows are read stage two controls which ones are to be inserted
>  * into the sample.
>
> I don't think we always read every block (been a while since I looked
> at this code, so I'll recheck). From what I understand this stuff is
> based on reasonable research (Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not
> just dreamed up with no valid research!
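To make the two-stage scheme quoted above a bit more concrete, here is a
toy, self-contained illustration (ordinary C, *not* the backend code - the
real thing is acquire_sample_rows() in analyze.c, which picks blocks
without replacement and uses Vitter's algorithm Z rather than this naive
reservoir; the rows-per-block figure is just made up for the demo):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS_PER_BLOCK 60          /* assumed rows per 8kB page */

int
main(void)
{
    int     nblocks = 163935;      /* e.g. pgbench_accounts at scale 100 */
    int     targrows = 30000;      /* 300 * statistics target of 100 */
    int    *sample = malloc(sizeof(int) * targrows);
    int     nsampled = 0;          /* rows currently held in the reservoir */
    long    rows_seen = 0;

    srandom((unsigned) time(NULL));

    /* Stage one: visit up to targrows blocks (all of them if fewer). */
    int     blocks_to_visit = (nblocks < targrows) ? nblocks : targrows;

    for (int b = 0; b < blocks_to_visit; b++)
    {
        int     blockno = random() % nblocks;

        /* Stage two: reservoir-sample the rows found in this block. */
        for (int r = 0; r < ROWS_PER_BLOCK; r++)
        {
            int     rowid = blockno * ROWS_PER_BLOCK + r;

            if (nsampled < targrows)
                sample[nsampled++] = rowid;     /* reservoir not full yet */
            else
            {
                /* keep the new row with probability targrows / (rows_seen + 1) */
                long    k = random() % (rows_seen + 1);

                if (k < targrows)
                    sample[k] = rowid;
            }
            rows_seen++;
        }
    }

    printf("visited %d blocks, saw %ld rows, kept %d in the sample\n",
           blocks_to_visit, rows_seen, nsampled);
    free(sample);
    return 0;
}

The point being that the I/O cost is driven by the number of *blocks*
visited, not by the final number of rows kept.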
Since I volunteered to recheck :-) - from analyze.c again:

/*--------------------
 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.
 *--------------------
 */
stats->minrows = 300 * attr->attstattarget;

So for typical settings (default_statistics_target = 100), we try to get
30000 rows. This means we will sample up to about 30000 blocks.

Indeed, quickly checking with a scale 100 pgbench db and a simple patch
to make the block sampler say how many blocks it reads (note that
pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 1 blocks
ANALYZE
Time: 16.935 ms
bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q

Regards

Mark
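P.S. For anyone who wants to redo the arithmetic in that comment, a
throwaway program along these lines (plain C, nothing backend-specific)
reproduces both the 305.82 * k figure and the <= 0.66 error bound at
n = 10^12:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double  f = 0.5;               /* maximum relative error in bin size */
    double  gamma_ = 0.01;         /* error probability */
    double  sizes[] = {1e6, 1e9, 1e12};

    for (int i = 0; i < 3; i++)
    {
        double  n = sizes[i];

        /* minimum sample size per Chaudhuri et al., divided by k */
        double  r_per_k = 4.0 * log(2.0 * n / gamma_) / (f * f);

        /* error bound f implied by capping the sample at 300 * k rows */
        double  f_at_300k = sqrt(4.0 * log(2.0 * n / gamma_) / 300.0);

        printf("n = %.0e: r = %.2f * k, f at 300*k rows = %.2f\n",
               n, r_per_k, f_at_300k);
    }
    return 0;
}

For n = 10^6 it prints r = 305.82 * k, and even at n = 10^12 the bin size
error with a 300 * k sample stays around 0.66, which is the justification
for the flat "stats->minrows = 300 * attr->attstattarget".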