Re: ANALYZE sampling is too good - Mailing list pgsql-hackers
From: Mark Kirkwood
Subject: Re: ANALYZE sampling is too good
Msg-id: 52A3AEE3.2030706@catalyst.net.nz
In response to: Re: ANALYZE sampling is too good (Josh Berkus <josh@agliodbs.com>)
Responses: Re: ANALYZE sampling is too good
List: pgsql-hackers
On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not. So what I
>> think we should do is auto-tune the statistics target based on the
>> table size. If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based
> on a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan
> Boley has spoken on. For example, our stats code has no way to
> recognize a normal or exponential distribution -- it assumes that all
> columns are randomly distributed. If we could recognize common
> distribution patterns, then not only could we have better query
> estimates, those would require keeping *fewer* stats, since all you
> need for a normal distribution are the end points and the variance.

From src/backend/commands/analyze.c:

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.

I don't think we always read every block (it's been a while since I
looked at this code, so I'll recheck).

From what I understand, this stuff is based on reasonable research (the
Vitter algorithm). That's not to say it couldn't/shouldn't be looked at
again and improved - but it was not just dreamed up with no valid
research behind it!

Cheers

Mark
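For anyone who wants the gist of "stage two" without opening analyze.c,
here is a minimal reservoir-sampling sketch (plain Algorithm R). It is
illustrative only: the real analyze.c uses Vitter's Algorithm Z with
computed skip distances, and SampleRow/reservoir_add are stand-in names
invented here, not PostgreSQL code.

#include <stdlib.h>

typedef struct SampleRow
{
    int         dummy;              /* stand-in for real column values */
} SampleRow;

/*
 * Offer one row to a reservoir of capacity targrows.  rows_seen counts
 * every row offered so far; after the scan, each row seen has an equal
 * chance of being in the sample.
 */
static void
reservoir_add(SampleRow *reservoir, int targrows, long *rows_seen,
              SampleRow row)
{
    (*rows_seen)++;

    if (*rows_seen <= targrows)
        reservoir[*rows_seen - 1] = row;        /* still filling up */
    else
    {
        long        k = random() % *rows_seen;  /* index in 0..rows_seen-1,
                                                 * ignoring modulo bias */

        if (k < targrows)
            reservoir[k] = row;     /* replace, prob = targrows/rows_seen */
    }
}

Feeding every row of every block chosen in stage one through a routine
like this leaves a uniform sample of (at most) targrows rows, which is
the property the Vitter paper establishes.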
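And just to make the auto-tune suggestion quoted at the top concrete,
one possible shape for such a formula - a purely hypothetical sketch;
the log curve and its constants are invented here, and the thread does
not settle on any particular formula:

#include <math.h>

/*
 * Map table size (in pages) onto a statistics target in the 10..400
 * range: 10 for tiny tables, 400 for really big ones, and a log-scaled
 * value in between.  The 50.0 scale factor is arbitrary.
 */
static int
autotune_stats_target(double table_pages)
{
    double      target = 10.0 + 50.0 * log10(table_pages + 1.0);

    if (target > 400.0)
        target = 400.0;             /* already >= 10 for any table size */
    return (int) target;
}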