On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
> Greg Stark wrote:
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
> > samples only work for things like the histograms that can be analyzed
> > through standard statistics population sampling which depends on the law of
> > large numbers.
ISTM Greg's comments are correct. There is no way to calculate n_distinct
with consistent accuracy from a constant-sized sample. (If there were,
people wouldn't bother to hold large databases...)
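To make that concrete, here is a toy simulation (plain Python, nothing to
do with the code in analyze.c; the 3000-row sample size is arbitrary). The
number of distinct values visible in a fixed-size sample stops tracking
the true n_distinct as soon as the table outgrows the sample, simply
because the sample can never show more distinct values than it has rows:

    # Toy simulation: a constant-sized sample cannot track n_distinct.
    import random

    SAMPLE_SIZE = 3000                       # fixed, whatever the table size

    for n_rows in (10_000, 100_000, 1_000_000):
        n_distinct = n_rows // 10            # true number of distinct values
        table = list(range(n_distinct)) * 10 # exactly 10 copies of each value
        random.shuffle(table)
        sample = random.sample(table, SAMPLE_SIZE)
        print(n_rows, n_distinct, len(set(sample)))

Once nearly every value in the sample is unique, the sample is consistent
with almost any n_distinct between that count and the number of rows in
the table, which is exactly the problem.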
> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in analyze.c, which seem to
> think that we're *overestimating* n_distinct.
The only information you can determine from a sample is the smallest
number of distinct values that would reasonably produce the given
sample. There is no meaningful concept of a median one... (You do have
an upper bound: the number of rows in the table, but I cannot see any
meaning in taking (Nrows + estimatedN_distinct)/2.)
Even if you use Zipf's Law to predict the frequency of occurrence, you'd
still need to estimate the parameters for the distribution.
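For instance, in a sketch like the one below (NumPy's Zipf generator with
made-up exponents, nothing from analyze.c), both the true n_distinct and
the number of distinct values a fixed sample can see swing by orders of
magnitude as the exponent changes, and the exponent would itself have to
be estimated from that same small sample:

    # Sketch: distinct values seen in a fixed sample vs. the Zipf exponent.
    import numpy as np

    rng = np.random.default_rng(1)
    N_ROWS, SAMPLE_SIZE = 1_000_000, 3000

    for exponent in (1.5, 2.0, 3.0):
        table = rng.zipf(exponent, N_ROWS)            # skewed column values
        sample = rng.choice(table, SAMPLE_SIZE, replace=False)
        print(exponent,
              len(np.unique(table)),                  # true n_distinct
              len(np.unique(sample)))                 # distinct in the sample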
Most other RDBMSs collect optimizer statistics with an unsampled scan.
Some offer the full scan as one of several options, alongside the ability
to define the sample size as either a fixed number of rows or a fixed
proportion of the table.
My suggested hack for PostgreSQL is to have an option to *not* sample,
just to scan the whole table and find n_distinct accurately. Then
anybody who doesn't like the estimated statistics has a clear path to
take.
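Conceptually the no-sampling path is just one pass that counts distinct
values exactly, much as SELECT count(DISTINCT col) does today. A sketch of
the idea only, where fetch_rows is a hypothetical row source rather than
anything in analyze.c:

    # One full scan gives n_distinct exactly; no extrapolation is involved.
    # 'fetch_rows' is a hypothetical callable yielding the column's values.
    def exact_n_distinct(fetch_rows):
        seen = set()
        for value in fetch_rows():
            seen.add(value)
        return len(seen)

The only cost beyond the scan itself is remembering which values have
already been seen; the result is exact rather than extrapolated.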
The problem of poorly derived base statistics is a difficult one. When
considering join estimates we already go to the trouble of including MFV
(most-frequent-value) comparisons to ensure an upper bound on join
selectivity is known. If the derived statistics are themselves inaccurate,
the error propagates into every other calculation in the optimizer. GIGO.
What price a single scan of a table, however large, when incorrect
statistics can force scans and sorts that aren't actually needed?
Best Regards, Simon Riggs