Home > mailing lists

Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From	Dean Rasheed
Subject	Re: MCV lists for highly skewed distributions
Date	March 17, 2018 23:16:14
Msg-id	CAEZATCUb5xY-Q9g2hebL35RU4E4SURYd-v7nHrQGAO6xFr+9hg@mail.gmail.com Whole thread Raw
In response to	Re: MCV lists for highly skewed distributions (John Naylor <jcnaylor@gmail.com>)
Responses	Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>)
List	pgsql-hackers

Tree view

On 13 March 2018 at 08:39, John Naylor <jcnaylor@gmail.com> wrote:
>> Also, this is common enough that in fact that distribution
>> can be reasonably approximated by a normal distribution.
>
> For the archives, because it's typically seen 10 times in the sample,
> per the rule of thumb mention upthread?
>

Actually, I think that the rule of thumb of at least 10 instances in
the sample for a normal approximation is probably too strict.

Looking around, values as low as 5 seem to be quite commonly used. Of
course there is no right answer here, it depends on what you're doing
with it, and how close you want the normal distribution to be to the
hypergeometric one. For our purposes, I don't think we really need it
to be that close. We just want to be able to justify statements like
"the value is *probably* more common than X, and it was unlikely that
that was just random sampling error".

>> Making use of
>> the non-MCV-based estimate allows us to do better -- if the MCV-based
>> estimate is statistically significantly higher than the non-MCV-based
>> estimate, then it would almost certainly be better to include that
>> value in the MCV list, even if its spread is quite large.
>
> Because we can treat the sample as normal, testing for > 2 standard
> deviations means that we're "~95% sure" it's more common in the table
> than the non-MCV selectivity would suggest, right? (I realize I'm not
> using technically accurate language)
>

Actually, it's more like 97.5% (remember the normal distribution is
symmetric, so there's around 2.5% below the 2-stddev interval, around
95% in it, and another 2.5% above it), but that's just nit-picking.

There are several nice online calculators for playing around with
hypergeometric distributions, such as
http://keisan.casio.com/exec/system/1180573201

Consider this distribution:

N = 1000000
n = 10000
K = 500
Mean = n * K / N = 5
Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9
Stddev = sqrt(Variance) = 2.2

Using the calculator above, you can see that the distribution is
fairly normal-like, but with a noticeable positive skew. The 2-stddev
interval is 0.6 to 9.4, and according to the calculator the
probability of the value being less than or equal to 1 is in the
ballpark of the 2.5% figure expected. So even with just 5 occurrences
in the sample, it's fairly close to a normal distribution.

Regards,
Dean

pgsql-hackers by date:

From: Christos Maris
Date: 17 March 2018, 22:12:42
Subject: Re: Google Summer of Code: Potential Applicant

From: Michael Meskes
Date: 17 March 2018, 23:16:19
Subject: Re: ECPG oracle mode test program patch

Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

Previous

Next