On 17 March 2018 at 17:16, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Using the calculator above, you can see that the distribution is
> fairly normal-like, but with a noticeable positive skew. The 2-stddev
> interval is 0.6 to 9.4, and according to the calculator the
> probability of the value being less than or equal to 1 is in the
> ballpark of the 2.5% figure expected. So even with just 5 occurrences
> in the sample, it's fairly close to a normal distribution.
>
One thing this does illustrate is that the hypergeometric distribution
is a discrete distribution and there can be quite large jumps in the
probability from one value to the next, so care may be needed when
approximating it with a continuous distribution. The standard
technique used to handle this is to apply what is known as a
continuity correction factor. Suppose that X is a random variable with
a discrete hypergeometric distribution, and Y is a continuous normal
distribution, with the same mean and variance, being used to
approximate X. Then P(X>i) for some integer i is the same as
P(X>=i+1), because X can only be integer-valued. The idea is then that
you can use P(Y>i+0.5) to get a fairly good approximation to P(X>i).
That would correspond to adding 0.5 to the right-hand side of the
current test, i.e.,
if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
=> Common enough to be included in MCV-list
A lot of the time that won't make much difference, except when dealing
with the smaller counts at the tail end of the MCV list, where it
might help avoid the too-many-mcvs problem, so I think it's worth
trying out.
Apparently, in textbooks, an interval like the mean +/- 2*stddev is
known as a Wald confidence interval, and the mean +/- (2*devdev+0.5)
is the continuity-corrected Wald interval, so it's probably worth
mentioning that in the comments. They are generally regarded as quite
crude approximations for hypergeometric distributions, and there's
quite a bit of research around establishing more accurate confidence
intervals for this kind of distribution, but the formulae involved
tend to be quite complicated, whereas the Wald interval has the
advantage of being very simple. In this context, I don't think we need
to establish a particularly accurate confidence interval. We just want
to be able to say that the value is probably more common than the
non-mcv values, without being too rigorous about what "probably"
means, as long as it works in practice to discount values that just
happen to be a bit more common in the sample, but aren't actually more
common in the table as a whole.
Regards,
Dean