Re: ANALYZE sampling is too good - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: ANALYZE sampling is too good |
Date | |
Msg-id | CAM-w4HOQfAVvUMYCXOfmkZwg=8ye9t7jpqYBdyE7+ny-hZtZqA@mail.gmail.com |
In response to | Re: ANALYZE sampling is too good (Josh Berkus <josh@agliodbs.com>) |
Responses | Re: ANALYZE sampling is too good |
List | pgsql-hackers |
On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based
> on a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.

This is nonsense. The math behind the deductions the planner makes is
well understood, and we know how to estimate the precision based on the
sample size. There are some questions that really do need a proportional
sample, such as n_distinct (and, as Robert points out, the MCV list),
but the main driver of the sample size is the histograms, and those are
used to answer questions which very clearly do not need a proportional
sample. The statistics are very clear there. Using a proportional sample
for the histograms would just be silly; it would be substituting a gut
feeling for real statistics.

The problem with using a proportional sample for things like n_distinct
or the MCV list is that you're talking about sorting or hashing an
unboundedly large set of values and storing an unboundedly large array
in the statistics table. I don't think that would be tenable without
dramatically changing the way we process and store that data to make it
more scalable. Using a lossy counting algorithm and something more
scalable than toasted arrays would be prerequisites, I think. And as
Robert mentions, even if we solved those problems it wouldn't help
n_distinct: you really need to read all the values to deal with
n_distinct.

> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.

It would be helpful if you could specify what you mean by "block-based
sampling". The reason these previous pleas didn't go anywhere is not
because we can't do math; it's because of the lack of specifics here.
What we do *today* is called "block-based sampling" by the literature.

What I'm saying is that we need to change the "block-based sampling" we
do today to extract more rows per block. We currently extract the
"correct" number of rows to get a strong guarantee of uniformity. If we
could accept a weaker guarantee, a sample that is "close enough" to
uniform for the deductions we want to make, and extract many more rows
per block, then we could read far fewer blocks. Or, to put it another
way, people could increase the statistics target dramatically and still
read the same number of blocks as today. In an ideal world, perhaps
people could read 1% of the blocks they read today and get statistics
targets 10x better than today.

--
greg
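A minimal simulation sketching the statistical claim above (illustrative
Python, not PostgreSQL source; the 30,000-row sample, i.e. roughly
300 * default_statistics_target rows, and the table sizes are assumptions
for the demo): the error of a histogram bucket boundary estimated from a
fixed-size uniform random sample tracks the sample size, not the fraction
of the table sampled.

```python
# Illustrative sketch only: estimate the median of a uniformly distributed
# column from a fixed-size random sample and report the mean relative error.
import random

def quantile_error(table_size, sample_size, q=0.5, trials=50):
    """Mean relative error of the q-th quantile estimated from a random sample."""
    errors = []
    for _ in range(trials):
        sample = sorted(random.sample(range(table_size), sample_size))
        estimate = sample[int(q * sample_size)]
        true_value = q * table_size
        errors.append(abs(estimate - true_value) / table_size)
    return sum(errors) / trials

for table_size in (10**5, 10**6, 10**7):
    print(table_size, round(quantile_error(table_size, 30000), 5))
# The error stays around 0.2-0.3% whether the 30,000-row sample is 30% of
# the table or 0.3% of it: precision depends on the sample size alone.
```

The message also names a lossy counting algorithm as a prerequisite for
building an MCV list from a much larger (or full) scan without keeping
every value in memory. Below is a sketch of the standard Manku-Motwani
lossy counting algorithm, not anything in the PostgreSQL tree; the
function name and epsilon value are hypothetical.

```python
import math

def lossy_count(stream, epsilon):
    """Manku-Motwani lossy counting: counts underestimated by at most epsilon*N."""
    width = math.ceil(1 / epsilon)       # bucket width
    counts = {}                          # item -> [estimated count, max undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:               # prune infrequent items at bucket edges
            counts = {k: v for k, v in counts.items() if v[0] + v[1] > bucket}
    return counts, n

# Hypothetical usage: values seen in at least 1% of the stream are guaranteed
# to survive, with memory bounded by roughly (1/epsilon) * log(epsilon * N)
# entries rather than one entry per distinct value.
# counts, n = lossy_count(column_values, epsilon=0.001)
# mcv = {k: c for k, (c, d) in counts.items() if c >= (0.01 - 0.001) * n}
```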