Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: MCV lists for highly skewed distributions
Date
Msg-id CAEZATCUb5xY-Q9g2hebL35RU4E4SURYd-v7nHrQGAO6xFr+9hg@mail.gmail.com
Whole thread Raw
In response to Re: MCV lists for highly skewed distributions  (John Naylor <jcnaylor@gmail.com>)
Responses Re: MCV lists for highly skewed distributions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 13 March 2018 at 08:39, John Naylor <jcnaylor@gmail.com> wrote:
>> Also, this is common enough that in fact that distribution
>> can be reasonably approximated by a normal distribution.
>
> For the archives, because it's typically seen 10 times in the sample,
> per the rule of thumb mention upthread?
>

Actually, I think that the rule of thumb of at least 10 instances in
the sample for a normal approximation is probably too strict.

Looking around, values as low as 5 seem to be quite commonly used. Of
course there is no right answer here, it depends on what you're doing
with it, and how close you want the normal distribution to be to the
hypergeometric one. For our purposes, I don't think we really need it
to be that close. We just want to be able to justify statements like
"the value is *probably* more common than X, and it was unlikely that
that was just random sampling error".


>> Making use of
>> the non-MCV-based estimate allows us to do better -- if the MCV-based
>> estimate is statistically significantly higher than the non-MCV-based
>> estimate, then it would almost certainly be better to include that
>> value in the MCV list, even if its spread is quite large.
>
> Because we can treat the sample as normal, testing for > 2 standard
> deviations means that we're "~95% sure" it's more common in the table
> than the non-MCV selectivity would suggest, right? (I realize I'm not
> using technically accurate language)
>

Actually, it's more like 97.5% (remember the normal distribution is
symmetric, so there's around 2.5% below the 2-stddev interval, around
95% in it, and another 2.5% above it), but that's just nit-picking.

There are several nice online calculators for playing around with
hypergeometric distributions, such as
http://keisan.casio.com/exec/system/1180573201

Consider this distribution:

N = 1000000
n = 10000
K = 500
Mean = n * K / N = 5
Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9
Stddev = sqrt(Variance) = 2.2

Using the calculator above, you can see that the distribution is
fairly normal-like, but with a noticeable positive skew. The 2-stddev
interval is 0.6 to 9.4, and according to the calculator the
probability of the value being less than or equal to 1 is in the
ballpark of the 2.5% figure expected. So even with just 5 occurrences
in the sample, it's fairly close to a normal distribution.

Regards,
Dean


pgsql-hackers by date:

Previous
From: Christos Maris
Date:
Subject: Re: Google Summer of Code: Potential Applicant
Next
From: Michael Meskes
Date:
Subject: Re: ECPG oracle mode test program patch