On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>>> Maybe what should be done about this is to have separate sizes for the
>>> MCV list and the histogram, where the MCV list is automatically sized
>>> during ANALYZE.
>
> It's been suggested multiple times that we should base our sample size
> on a % of the table, or at least offer that as an option.

Why? AFAICT this has been suggested multiple times by people who don't
justify it in any way except with the handwavy claim that larger samples
are better. The sample size is picked based on what sampling statistics
tells us we need in order to achieve a given 95% confidence interval for
the given bucket size.

Robert pointed out one reason we would want smaller buckets for larger
tables, but nobody has explained why we would want smaller confidence
intervals for the same size buckets. That amounts to querying larger
tables for the same percentage of the table but wanting more precise
estimates than you want for smaller tables.

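To make the "what the statistics tell us" part concrete, here is a rough
sketch of the bound that, as far as I recall, the comment in analyze.c
cites from Chaudhuri, Motwani and Narasayya: r = 4 * k * ln(2*n/gamma) / f^2.
The function name and the Python rendering are mine, and the default f and
gamma are the values I remember from that comment, so treat them as
assumptions rather than gospel.

```python
import math


def min_sample_rows(k, n, f=0.5, gamma=0.01):
    """Minimum random sample size r = 4*k*ln(2*n/gamma) / f**2.

    k     -- number of histogram buckets
    n     -- number of rows in the table
    f     -- maximum relative error in bucket size (assumed default)
    gamma -- probability of exceeding that error (assumed default)
    """
    return 4 * k * math.log(2 * n / gamma) / f ** 2


if __name__ == "__main__":
    # Same bucket count, tables a thousand and a million times larger.
    for n in (10**6, 10**9, 10**12):
        print(n, round(min_sample_rows(100, n)))
```

With these parameters a million-row table needs about 305.82 * k rows, and
a billion-row table only about 36% more, because n appears only inside
the logarithm. A fixed percentage of the table grows linearly in n
instead, which is what would need justifying.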
> I've pointed
> out (with math, which Simon wrote a prototype for) that doing
> block-based sampling instead of random-row sampling would allow us to
> collect, say, 2% of a very large table without more I/O than we're doing
> now.

Can you explain when this would and wouldn't bias the sample, so that
users can decide whether to use it?

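The case I have in mind is a column whose values are correlated with
physical position. Here is a toy simulation of that, assuming a made-up
table of 1000 blocks of 100 rows in which 10% of rows carry a flag; none
of these names or sizes come from ANALYZE or from Simon's prototype. It
compares how much the estimated flag frequency varies between reading
1000 random rows and reading 10 whole random blocks.

```python
import random
import statistics

ROWS_PER_BLOCK = 100   # hypothetical rows per block
N_BLOCKS = 1000        # hypothetical table: 100,000 rows
SAMPLE_ROWS = 1000     # rows examined per sample, same for both methods


def make_table(clustered, rng):
    """10% of rows are flagged; clustered means they are physically adjacent."""
    n = ROWS_PER_BLOCK * N_BLOCKS
    table = [1] * (n // 10) + [0] * (n - n // 10)
    if not clustered:
        rng.shuffle(table)
    return table


def row_sample_estimate(table, rng):
    """Estimate the flagged fraction from individually chosen random rows."""
    rows = rng.sample(range(len(table)), SAMPLE_ROWS)
    return sum(table[i] for i in rows) / SAMPLE_ROWS


def block_sample_estimate(table, rng):
    """Estimate the flagged fraction from every row of a few random blocks."""
    blocks = rng.sample(range(N_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK)
    total = 0
    for b in blocks:
        start = b * ROWS_PER_BLOCK
        total += sum(table[start:start + ROWS_PER_BLOCK])
    return total / SAMPLE_ROWS


def spread(estimator, table, rng, trials=200):
    """Standard deviation of the estimate over repeated samples."""
    return statistics.pstdev([estimator(table, rng) for _ in range(trials)])


def compare(seed=1):
    """Return {layout: (row-sampling spread, block-sampling spread)}."""
    rng = random.Random(seed)
    results = {}
    for name, clustered in (("clustered", True), ("shuffled", False)):
        table = make_table(clustered, rng)
        results[name] = (spread(row_sample_estimate, table, rng),
                         spread(block_sample_estimate, table, rng))
    return results


if __name__ == "__main__":
    for name, (row_sd, block_sd) in compare().items():
        print(f"{name}: row sd={row_sd:.4f} block sd={block_sd:.4f}")
```

On the shuffled layout the two methods should behave about the same. On
the clustered layout each block is all-flagged or all-clear, so the 10
blocks act like 10 observations rather than 1000 and the estimate is far
noisier for the same number of rows examined. This says nothing about I/O
cost; it only illustrates the kind of caveat users would need spelled out.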
> Nathan Boley has also shown that we could get tremendously better
> estimates without additional sampling if our statistics collector
> recognized common patterns such as normal, linear and geometric
> distributions. Right now our whole stats system assumes a completely
> random distribution.

That's interesting; I hadn't seen that.

> So, I think we could easily be quite a bit smarter than just increasing
> the MCV. Although that might be a nice start.

I think just increasing the MCV list size is too simplistic, since we
don't really have a basis for choosing any particular value. What we need
is for some statistics nerds to come along and say: here's a nice tool
from which you can make the following predictions, and which lets you
understand how increasing or decreasing the data set size affects the
accuracy of those predictions.

--
greg