On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Overall, our formula is inherently conservative of n_distinct. That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather than the
> > *median* one. This is contrary to the notes in analyze.c, which seem to
> > think that we're *overestimating* n_distinct.
>
> Well, the notes are there because the early tests I ran on that formula
> did show it overestimating n_distinct more often than not. Greg is
> correct that this is inherently a hard problem :-(
>
> I have nothing against adopting a different formula, if you can find
> something with a comparable amount of math behind it ... but I fear
> it'd only shift the failure cases around.
>
Perhaps the formula is not actually being applied?
The code looks like this...
if (nmultiple == 0)
{
/* If we found no repeated values, assume it's a unique column */
stats->stadistinct = -1.0;
}
else if (toowide_cnt == 0 && nmultiple == ndistinct)
{
/*
* Every value in the sample appeared more than once. Assume
* the column has just these values.
*/
stats->stadistinct = ndistinct;
}
else
{
/*----------
* Estimate the number of distinct values using the estimator
* proposed by Haas and Stokes in IBM Research Report RJ 10025:
The middle chunk of code looks to me like if we find a distribution
where values all occur at least twice, then we won't bother to apply the
Haas and Stokes equation. That type of frequency distribution would be
very common in a set of values with very high ndistinct, especially when
sampled.
The comment
* Every value in the sample appeared more than once. Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.
Looking at Josh's application it does seem likely that when taking a
sample, all site visitors clicked more than once during their session,
especially if they include home page, adverts, images etc for each page.
Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?
Best Regards, Simon Riggs