On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote:
> I spent a few hours hacking on this, and it turns out calculating the
> right number of MCVs taking into account both uniform and highly
> non-uniform distributions is too delicate a problem for me to solve
> right now. The logic suggested by Dean Rasheed in [1] always produces
> no MCVs for a perfectly uniform distribution (which is good), but very
> often also for other distributions, which is not good. My efforts to
> tweak that didn't work, so I didn't get as far as adapting it for the
> problem Jeff is trying to solve.
Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict, because the estimated number of
distinct values is often too low. That might explain what you're
seeing.
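
To illustrate why an ndistinct underestimate makes that test stricter,
here is a toy sketch (the numbers are invented and this is only the
rough shape of the current cutoff, not the actual analyze.c code): the
cutoff is proportional to samplerows / ndistinct, so underestimating
ndistinct by a factor of four quadruples the number of occurrences a
value needs before it qualifies as an MCV.

/*
 * Toy illustration, not the actual analyze.c code: the existing cutoff
 * is roughly 1.25 times the average per-value count implied by the
 * ndistinct estimate, so underestimating ndistinct inflates the cutoff
 * and excludes values that arguably belong in the MCV list.
 */
#include <stdio.h>

int
main(void)
{
    double samplerows = 30000.0;     /* rows in the sample (hypothetical) */
    double true_ndistinct = 10000.0; /* actual number of distinct values */
    double est_ndistinct = 2500.0;   /* a typical underestimate */

    /* minimum occurrences in the sample for a value to become an MCV */
    printf("cutoff with true ndistinct: %.2f\n",
           1.25 * samplerows / true_ndistinct);   /* 3.75 */
    printf("cutoff with underestimate:  %.2f\n",
           1.25 * samplerows / est_ndistinct);    /* 15.00 */
    return 0;
}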
It occurs to me that a better test for whether to include a value in
the MCV list might be to demand that the relative standard error of
its sample frequency not be too high -- i.e., to exclude any value
whose estimate we cannot trust. Such a test, in addition to the
existing ones, might also be sufficient to solve the opposite problem
of too many values in the MCV list, because the real problem there is
including a value after having seen relatively few occurrences of it
in the sample, and thus having a wildly inaccurate estimate for it.
Bounding the relative standard error would mean that we could have a
reasonable degree of confidence in the estimates produced from the
sample.
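
As a rough sketch of what I have in mind (only a sketch: the exact
standard-error formula, the finite population correction and the 20%
cutoff below are assumptions for illustration, not a worked-out
proposal), the test would treat a candidate's sample frequency as a
proportion estimated from n sample rows drawn without replacement from
N table rows, and admit the value only if the relative standard error
of that proportion is small enough:

#include <math.h>
#include <stdbool.h>

/*
 * Sketch only: admit a value to the MCV list only if the relative
 * standard error of its sample frequency is at most max_rse (say 0.2).
 * The standard error uses the usual binomial formula with a finite
 * population correction for sampling without replacement.
 */
static bool
mcv_candidate_ok(double k,       /* occurrences of the value in the sample */
                 double n,       /* number of sample rows */
                 double N,       /* estimated total rows in the table */
                 double max_rse) /* maximum relative standard error allowed */
{
    double p = k / n;                                /* sample frequency */
    double fpc = (N > n) ? (N - n) / (N - 1) : 0.0;  /* finite pop. correction */
    double se = sqrt(p * (1.0 - p) / n * fpc);       /* std error of p */

    if (p <= 0.0)
        return false;
    return se / p <= max_rse;        /* relative standard error test */
}

With that formula, and assuming the table is much larger than the
sample, a value seen 5 times in a 30000-row sample has a relative
standard error of roughly 45% and would be rejected by a 20% cutoff,
whereas a value seen 100 times comes in at about 10% and would pass.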
Regards,
Dean