Re: ANALYZE sampling is too good - Mailing list pgsql-hackers

From: Mark Kirkwood
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: 52A3AEE3.2030706@catalyst.net.nz
In response to: Re: ANALYZE sampling is too good (Josh Berkus <josh@agliodbs.com>)
On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not.  So what I
>> think we should do is auto-tune the statistics target based on the
>> table size.  If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables.  We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan Boley
> has spoken on.  For example, our stats code has no way to recognize a
> normal or exponential distribution -- it assumes that all columns are
> randomly distributed.  If we could recognize common distribution
> patterns, then not only could we have better query estimates, those
> would require keeping *fewer* stats, since all you need for a normal
> distribution are the end points and the variance.
>
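Purely as an illustration of the kind of size-based mapping Robert describes
above, something like the sketch below would give roughly 10 for small tables
and roughly 400 for very large ones. Nothing like this exists in the tree; the
function name and constants are made up:

#include <math.h>

/*
 * Hypothetical sketch only: derive a statistics target from the table
 * size in pages, clamped to the 10..400 range mentioned above.  The
 * constants are illustrative, not tuned: ~10 at a hundred pages,
 * ~400 at ten million pages, log-scaled in between.
 */
int
autotune_stats_target(double relpages)
{
    double  target;

    if (relpages < 1.0)
        relpages = 1.0;

    target = 10.0 + 78.0 * (log10(relpages) - 2.0);

    if (target < 10.0)
        target = 10.0;
    if (target > 400.0)
        target = 400.0;

    return (int) target;
}
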
From src/backend/commands/analyze.c

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 

I don't think we always read every block (it's been a while since I looked at 
this code, so I'll recheck). From what I understand this stuff is based 
on reasonable research (the Vitter algorithm). That's not to say it 
couldn't or shouldn't be looked at again and improved - but it was not just 
dreamed up with no valid research behind it!
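
To illustrate the invariant stage two maintains, here is a naive
reservoir-sampling sketch. This is not the analyze.c code - the real
implementation uses Vitter's Algorithm Z, which skips ahead instead of
drawing a random number per row - and the names (reservoir_add, Row)
are made up for the example:

#include <stdlib.h>

typedef struct Row Row;        /* stand-in for a sampled heap tuple */

/*
 * Naive reservoir sampling, for illustration only.  After t rows have
 * been offered, each of them is in the sample with probability
 * targrows / t (ignoring modulo bias).
 */
void
reservoir_add(Row **sample, int targrows, Row *row, long *rowcount)
{
    (*rowcount)++;

    if (*rowcount <= targrows)
    {
        /* sample not full yet: keep every row */
        sample[*rowcount - 1] = row;
    }
    else
    {
        /* replace a random slot with probability targrows / rowcount */
        long    k = random() % *rowcount;

        if (k < targrows)
            sample[k] = row;
    }
}

Stage one bounds which blocks those rows can come from in the first place,
which is what keeps the I/O bounded rather than scanning the whole table.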

Cheers

Mark


