Home > mailing lists

Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From	Dean Rasheed
Subject	Re: MCV lists for highly skewed distributions
Date	March 17, 2018 20:21:59
Msg-id	CAEZATCVHEEg+CrP+-0JUsVeNPu5rs_S23oJVeH4VF=fgDwhfLQ@mail.gmail.com Whole thread Raw
In response to	Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>)
List	pgsql-hackers

Tree view

On 17 March 2018 at 17:16, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Using the calculator above, you can see that the distribution is
> fairly normal-like, but with a noticeable positive skew. The 2-stddev
> interval is 0.6 to 9.4, and according to the calculator the
> probability of the value being less than or equal to 1 is in the
> ballpark of the 2.5% figure expected. So even with just 5 occurrences
> in the sample, it's fairly close to a normal distribution.
>

One thing this does illustrate is that the hypergeometric distribution
is a discrete distribution and there can be quite large jumps in the
probability from one value to the next, so care may be needed when
approximating it with a continuous distribution. The standard
technique used to handle this is to apply what is known as a
continuity correction factor. Suppose that X is a random variable with
a discrete hypergeometric distribution, and Y is a continuous normal
distribution, with the same mean and variance, being used to
approximate X. Then P(X>i) for some integer i is the same as
P(X>=i+1), because X can only be integer-valued. The idea is then that
you can use P(Y>i+0.5) to get a fairly good approximation to P(X>i).
That would correspond to adding 0.5 to the right-hand side of the
current test, i.e.,

    if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
        => Common enough to be included in MCV-list

A lot of the time that won't make much difference, except when dealing
with the smaller counts at the tail end of the MCV list, where it
might help avoid the too-many-mcvs problem, so I think it's worth
trying out.

Apparently, in textbooks, an interval like the mean +/- 2*stddev is
known as a Wald confidence interval, and the mean +/- (2*devdev+0.5)
is the continuity-corrected Wald interval, so it's probably worth
mentioning that in the comments. They are generally regarded as quite
crude approximations for hypergeometric distributions, and there's
quite a bit of research around establishing more accurate confidence
intervals for this kind of distribution, but the formulae involved
tend to be quite complicated, whereas the Wald interval has the
advantage of being very simple. In this context, I don't think we need
to establish a particularly accurate confidence interval. We just want
to be able to say that the value is probably more common than the
non-mcv values, without being too rigorous about what "probably"
means, as long as it works in practice to discount values that just
happen to be a bit more common in the sample, but aren't actually more
common in the table as a whole.

Regards,
Dean

pgsql-hackers by date:

From: Michael Meskes
Date: 17 March 2018, 20:16:19
Subject: Re: ECPG oracle mode test program patch

From: Magnus Hagander
Date: 17 March 2018, 20:24:42
Subject: Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

Previous

Next