> -----Original Message-----
> From: Josh Berkus [mailto:josh@agliodbs.com]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, it's more to characterize how large of a sample
> we need. For example, if we sample 0.005 of disk pages, and
> get an estimate, and then sample another 0.005 of disk pages
> and get an estimate which is not even close to the first
> estimate, then we have an idea that this is a table which
> defies analysis based on small samples.
I buy that.
> Whereas if the two estimates are < 1.0 stdev apart, we can
> have good confidence that the table is easily estimated.
I don't buy that. A negative indication is nothing more than
proof by contradiction: one large disagreement suffices to
show that small samples fail on this table. A positive
indication, however, amounts to mathematical induction over
the set, which in this type of context is logically unsound.
There is no reason to believe that two small samples with a
small difference imply that a table is easily estimated,
rather than that you simply got unlucky in your samples.
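To make that concrete, here is a toy sketch in Python. The
names, the 50% tolerance (standing in for the "< 1.0 stdev"
test), and the naive scale-up estimator are all mine, purely
for illustration:

import random

def sample_block_rows(table, frac):
    # `table` is a list of blocks (pages), each a list of values --
    # a toy stand-in for heap pages.  Sample whole blocks; return
    # every row on the sampled blocks.
    k = max(1, int(len(table) * frac))
    rows = []
    for block in random.sample(table, k):
        rows.extend(block)
    return rows

def naive_ndistinct(rows, total_rows):
    # Scale distinct-in-sample up to the whole table.  Purely
    # illustrative; real estimators (Chao, Haas-Stokes, ...)
    # behave differently, but the agreement test is the same.
    return len(set(rows)) * total_rows / len(rows)

def defies_small_samples(table, frac=0.005, tolerance=0.5):
    # Josh's check: two independent 0.5% block samples; a big
    # gap between the estimates flags the table as hard.
    total = sum(len(b) for b in table)
    e1 = naive_ndistinct(sample_block_rows(table, frac), total)
    e2 = naive_ndistinct(sample_block_rows(table, frac), total)
    return abs(e1 - e2) / max(e1, e2) > tolerance

The asymmetry is the point: when defies_small_samples() returns
true, the disagreement itself is the proof. When it returns
false, you know only that these two particular samples happened
to agree.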
> [...]
> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages
> (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are
> "easily estimated"
> 1 for tables which are not easily estimated but we can't
> afford to sample a large % of pages.
I don't buy that the first and second need to be different
estimation methods. I think you can use the same block-sample
estimator for both, and simply stop sampling at different
points. If you set the default to a fixed number of blocks,
you automatically get a large % of pages on small tables and
a small % of pages on large tables, which is exactly how you
define the first two cases. However, I think such a default
should also be overridable, either to a % of the table or to
a desired accuracy.
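A sketch of what I mean (the knob names and the 300-block
default are made up for illustration; they are not existing
ANALYZE settings):

def blocks_to_sample(total_blocks, default_blocks=300, frac=None):
    # One stopping rule for one block-sample estimator.  The fixed
    # default is automatically a large % of a small table and a
    # small % of a big one, covering the first two cases with a
    # single method; `frac` is the explicit override.  An
    # accuracy-based override would keep sampling until the
    # estimator's own confidence bound hit a target; omitted here
    # because it depends on which estimator is plugged in.
    if frac is not None:
        return max(1, min(total_blocks, int(total_blocks * frac)))
    return min(total_blocks, default_blocks)

blocks_to_sample(1000)     # 300 blocks = 30% of a small table (case 1)
blocks_to_sample(1000000)  # 300 blocks = 0.03% of a big table (case 2)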
Of course, I would recommend the distinct sample technique
for the third case.
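For reference, a minimal skeleton of the hash-based idea behind
distinct sampling (after Gibbons). This is my reading of the
core trick only; the real algorithm also keeps per-value counts,
and the toy hash stands in for the datatype's own hash function:

import hashlib

def h64(value):
    # Toy 64-bit hash of a value; illustration only.
    d = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def distinct_sample_ndistinct(rows, budget=1024):
    # One full pass, bounded memory: keep a value iff its hash
    # lands in the current 1/2**level slice of hash space; when
    # the kept set outgrows the budget, halve the slice and evict.
    # Each kept value then stands for about 2**level distinct
    # values, so |kept| * 2**level estimates n_distinct.
    space = 1 << 64
    level = 0
    kept = set()
    for v in rows:
        if h64(v) < space >> level:
            kept.add(v)
            while len(kept) > budget:
                level += 1
                kept = {x for x in kept if h64(x) < space >> level}
    return len(kept) * (1 << level)

The catch, of course, is that it wants one full pass over the
table (bounded memory, but every row), which is a different
cost model than page sampling.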
> If we're doing sampling-based estimation, I really don't
> want people to lose sight of the fact that page-based random
> sampling is much less expensive than row-based random
> sampling. We should really be focusing on methods which
> are page-based.
Of course, that savings comes at the expense of having to
account for factors like clustering within blocks: the rows
on a page are rarely independent draws from the table. So
block sampling is more efficient, but it can also be less
accurate. Nonetheless, I agree that of the sampling
estimators, block sampling is the better technique.
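As a toy demonstration of how much physical layout alone can
matter (the sizes and the 1% sampling fraction are arbitrary):

import random

def make_table(n_blocks=1000, per_block=100, n_distinct=1000,
               clustered=True):
    # The same multiset of values, either packed so duplicates
    # share pages (as after an ordered load) or scattered
    # uniformly across pages.
    values = [i % n_distinct for i in range(n_blocks * per_block)]
    if clustered:
        values.sort()
    else:
        random.shuffle(values)
    return [values[i * per_block:(i + 1) * per_block]
            for i in range(n_blocks)]

def distinct_seen(table, frac=0.01):
    # Distinct values visible in a random 1% sample of whole pages.
    k = max(1, int(len(table) * frac))
    seen = set()
    for block in random.sample(table, k):
        seen.update(block)
    return len(seen)

random.seed(1)
print(distinct_seen(make_table(clustered=True)))   # ~10
print(distinct_seen(make_table(clustered=False)))  # ~630

A row-based sample of the same 1,000 rows would see roughly 630
distinct values under either layout; the block sample sees 10 or
630 for identical data depending purely on physical order, which
is exactly the clustering effect a block-based estimator has to
correct for.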
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129