Thread: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about
it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values as
MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.
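
For concreteness, here's a quick sketch of that arithmetic in Python (the
300x sample-size multiplier is the one ANALYZE derives from the statistics
target; the function names are just for illustration):

    # Sketch of the arithmetic above; not the actual analyze.c code.
    def sample_size(stats_target):
        return 300 * stats_target

    def implied_min_frequency(min_occurrences, stats_target):
        """Smallest frequency estimate a value seen min_occurrences times can get."""
        return min_occurrences / sample_size(stats_target)

    def occurrences_for_floor(freq_floor, stats_target):
        """Sample count a value needs in order to clear a given frequency floor."""
        return freq_floor * sample_size(stats_target)

    print(implied_min_frequency(2, 10))         # 1/1500, the original design point
    print(implied_min_frequency(2, 100))        # 1/15000 with today's default target
    print(occurrences_for_floor(1.0/1000, 100)) # 30.0, i.e. 30 rows in a 30000-row sample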

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".
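
To make the shape of that rule concrete, here is a simplified Python sketch
(this is just the rule as described above, not the actual analyze.c code, and
the numbers are illustrative):

    # Candidate MCV filter: sample count must beat some multiple of the
    # average count per distinct value.
    def passes_mcv_filter(count, sample_rows, ndistinct, multiplier=1.25):
        avg_count = sample_rows / ndistinct      # average occurrences per distinct value
        return count > avg_count * multiplier

    # 30000-row sample, 500 distinct values => average count 60:
    print(passes_mcv_filter(90, 30000, 500, multiplier=1.25))   # True  (90 > 75)
    print(passes_mcv_filter(90, 30000, 500, multiplier=2.0))    # False (90 <= 120)

    # With more distinct values than sample rows, avg_count < 1 and the rule
    # filters nothing, whatever the multiplier:
    print(passes_mcv_filter(2, 30000, 100000, multiplier=2.0))  # True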

Thoughts?
        regards, tom lane



On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.

This kind of math isn't my strong point, but it seems to me that,
while sampling noise is a problem, it's not obvious how to tell
whether a given result is signal or noise.  I mean, if we sample
30,000 rows from a table with a billion rows and we see the same
value twice, then it could be the case that the row occurs just twice
in the whole table and we unluckily saw both occurrences.
Contrariwise, it could also be that the typical value in that table
occurs 1000 times and the value in question occurs 50,000 times, so
our sample was right on the money.  There's no way to know whether, if
we were to take a bunch more samples, we'd see that value coming up
twice in most of those samples or whether the fact that it showed up
twice this time is a statistical fluke.
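
To put rough numbers on those two scenarios, here is a small sketch (a plain
binomial approximation, ignoring the fact that we sample without replacement):

    # Chance of seeing a value at least twice in a 30,000-row sample of a
    # billion-row table, under the two hypotheses above.
    def prob_seen_at_least_twice(occurrences_in_table, table_rows=10**9,
                                 sample_rows=30000):
        p = occurrences_in_table / table_rows                # chance one sampled row matches
        p0 = (1 - p) ** sample_rows                          # seen zero times
        p1 = sample_rows * p * (1 - p) ** (sample_rows - 1)  # seen exactly once
        return 1 - p0 - p1

    print(prob_seen_at_least_twice(2))       # ~2e-9: very unlucky indeed
    print(prob_seen_at_least_twice(50000))   # ~0.44: entirely plausible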

What makes me a bit cautious about this approach overall is that I've
seen cases where the length of the MCV list turns out to be key to
getting a good plan, and I needed to make it longer in order for
things to work.  Any non-MCV is, IIRC, assumed to be less frequent
than the least-common MCV.  If you raise the minimum frequency that is
required for something to be considered an MCV, then the cap on the
frequency of a non-MCV also goes up, and I think it's not impossible
to imagine that being a problem for somebody.  Now maybe you're going
to say those people can fix it by twiddling n_distinct using DDL.  I'm
not sure whether that works in all cases, but even if it does, some of
those people might not need to twiddle n_distinct today, so from their
point of view it would be a regression.

Another way to state it is: is this problem one-sided?  If we *only*
have a problem with things that aren't really MCVs being considered as
MCVs, then tightening up the rules for including something in the MCV
list will fix it.  Also, the statistics will be more stable and
planning will be faster, which all sounds like a win.  However, I'm
guessing (on the basis of some personal experience) that we *also*
have a problem with things that really are MCVs *not* being considered
MCVs.  If it's true that both of those things are problems, then the
only real fix is to increase the sample size -- which I notice you
recommended on the original thread.

In general, I'm pretty skeptical of the idea that sampling 30,000
rows out of an arbitrarily large table will produce a
sufficiently-accurate MCV list.  There's a comment in std_typanalyze()
citing a 1998 paper by Chaudhuri, Motwani, and Narasayya for the
proposition that we'll get a reasonably accurate histogram with that
sample size, but histogram bins are not MCVs.  Moreover, in practice,
we need *increased* accuracy, not just *equal* accuracy, as the table
size increases.  On a billion row table, a minimum MCV frequency of
1/1000 means that we can't distinguish between a value that occurs 1
time and a value that occurs 999,999 times -- they are both non-MCVs.
As you shrink the table that sort of thing becomes a smaller and
smaller problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".

> This kind of math isn't my strong point, but it seems to me that,
> while sampling noise is a problem, it's not obvious how to tell
> whether a given result is signal or noise.

I think that a single count in a 30K-row sample is noise by definition.
We really ought to be setting the threshold for "what is an MCV" high
enough that it's not drastically affected by variations that are clearly
at the sampling-error level.

> What makes me a bit cautious about this approach overall is that I've
> seen cases where the length of the MCV list turns out to be key to
> getting a good plan, and I needed to make it longer in order for
> things to work.

As long as they actually are MCVs, sure.  The problem I've got with the
current behavior is that it manufactures a spiky distribution where there
is none.  That leads directly to bad estimates, as shown in Marko's
example.  We'd be much better off, both as to planning time and accuracy,
if we'd concluded that the table had no MCVs.

> Another way to state it is: is this problem one-sided?

You know as well as I do that there's no free lunch in this area.
Anything we change at all will make things worse for somebody, if only
by accident.  But I do not think that choosing a bunch of values entirely
at random and claiming (incorrectly) that they are more common than other
values in the table can possibly lead to better results except by
accident.

> In general, I'm pretty skeptical of the idea that sampling 30,000
> rows out of an arbitrarily large table will produce a
> sufficiently-accurate MCV list.

Perhaps not, but allowing two occurrences to define an MCV surely isn't
helping with that.  More to the point maybe, it's a behavior that doesn't
go away simply by making the sample larger.  Making the sample larger
just allows us to falsely invent more pseudo-MCVs.

I'm not by any means wedded to the proposition that we have to fix it
simply by changing the filter rule.  One idea that seems worth considering
is to keep a track list that's a bit longer than the maximum allowed
number of MCVs, and then to say that we accept only MCVs whose counts are
significantly greater than what we find at the tail of the list.  I'm not
sure what "significantly" should be exactly, but surely a distribution
as flat as the ones I was showing upthread should be a red flag.
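
One way that heuristic could look, as a purely illustrative Python sketch
(the 1.5x "significantly greater" factor is an invented placeholder, and this
is not a patch against analyze.c):

    def select_mcvs(track_counts, num_mcv, tail_factor=1.5):
        """track_counts: per-value sample counts, sorted descending, tracked a
        bit beyond num_mcv.  Accept only values that clearly beat the tail."""
        tail = track_counts[num_mcv:]            # the extra entries beyond the MCV limit
        tail_count = tail[0] if tail else 0
        return [c for c in track_counts[:num_mcv] if c > tail_factor * tail_count]

    print(select_mcvs([500, 300, 120, 4, 3, 3, 2, 2], num_mcv=5))  # [500, 300, 120]
    print(select_mcvs([3, 3, 2, 2, 2, 2, 2, 2], num_mcv=5))        # [] -- flat, noisy list
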
        regards, tom lane



On 05/06/17 09:30, Tom Lane wrote:

> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

Or possibly calculate the sample standard deviation and make use of that 
to help decide on a more flexible cutoff than twice the avg frequency?

Are there any research papers that might help us here (I'm drowning in a 
sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.

On the other hand I do have access to some mathematicians specializing 
in statistics - so can get their thoughts on this issue if you feel it 
would be worthwhile.

Cheers

Mark



On 06/06/17 09:41, Mark Kirkwood wrote:
> On 05/06/17 09:30, Tom Lane wrote:
>
>> I've been thinking about the behavior discussed in
>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org 
>>
>> and it seems to me that there are a couple of things we ought to do 
>> about
>> it.
>>
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".  The existing coding is willing to believe that anything that
>> appears at least twice in the sample is a potential MCV, but that design
>> originated when we were envisioning stats samples of just a few thousand
>> rows --- specifically, default_statistics_target was originally just 10,
>> leading to a 3000-row sample size.  So accepting two-appearance 
>> values as
>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
>> could be a tenth or a hundredth of that.
>>
>> As a round number, I'm thinking that a good floor would be a frequency
>> estimate of 1/1000.  With today's typical sample size of 30000 rows,
>> a value would have to appear at least 30 times in the sample to be
>> believed to be an MCV.  That seems like it gives us a reasonable margin
>> of error against the kind of sampling noise seen in the above-cited
>> thread.
>>
>> Second, the code also has a rule that potential MCVs need to have an
>> estimated frequency at least 25% larger than what it thinks the 
>> "average"
>> value's frequency is.  A rule of that general form seems like a good 
>> idea,
>> but I now think the 25% threshold is far too small to do anything 
>> useful.
>> In particular, in any case like this where there are more distinct 
>> values
>> than there are sample rows, the "average frequency" estimate will
>> correspond to less than one occurrence in the sample, so that this 
>> rule is
>> totally useless to filter anything that we would otherwise consider 
>> as an
>> MCV.  I wonder if we shouldn't make it be "at least double the estimated
>> average frequency".
>>
>
> Or possibly calculate the sample standard deviation and make use of 
> that to help decide on a more flexible cutoff than twice the avg 
> frequency?
>
> Are there any research papers that might help us here (I'm drowning in 
> a sea of barely relevant search results for most phrases I've tried so 
> far)? I recall there were some that Tom referenced when this stuff was 
> originally written.
>
> On the other hand I do have access to some mathematicians specializing 
> in statistics - so can get their thoughts on this issue if you feel it 
> would be worthwhile.
>
> Cheers
>
> Mark
>
>
The standard deviation (sd) is proportional to the square root of the 
number in the sample in a Normal Distribution.

In a Normal Distribution, about 2/3 of the values will be within plus or 
minus one sd of the mean.

There seems to be an implicit assumption that the distribution of values 
follows the Normal Distribution - has this been verified? I suspect that 
real data will have a skewed distribution of values, and may even be 
multi-modal (multiple peaks).  The Normal Distribution has one central 
peak with 2 tails of the same shape & size.

So in a sample of 100 the sd is proportional to 10%,
for 10,000 the sd is proportional to 1%.

So essentially, the larger the sample size, the more reliable our 
knowledge of the most common values (ignoring pathologically extreme 
distributions!) - the measure of reliability depends on the distribution.

How about selecting the cutoff as the mean plus one sd, or something of 
that nature?  Note that the cutoff point may result in no MCVs being 
selected - especially for small samples.
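
Something along those lines could look as follows (a hypothetical sketch,
not tied to the current analyze.c logic; the counts are made up):

    from statistics import mean, pstdev

    def mcv_cutoff(sample_counts):
        """Accept as MCVs only counts more than one sd above the mean count."""
        return mean(sample_counts) + pstdev(sample_counts)

    counts = [40, 5, 3, 2, 2, 2, 1, 1, 1, 1]    # made-up per-value sample counts
    cutoff = mcv_cutoff(counts)
    print(round(cutoff, 1))                     # ~17.3
    print([c for c in counts if c > cutoff])    # [40] -- possibly no MCVs at all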

If practicable, it would be good to sample real datasets. Suggest 
looking at datasets where the current mechanism looks reasonable, and 
ones where the estimates are too far off.  Also, if possible, try any new 
selection method on the datasets and see what the difference is.

The above is based on what I remember from my university statistics 
papers, I took it up to 4th year level many moons ago.


Cheers,
Gavin





On 06/06/17 10:12, Gavin Flower wrote:
> On 06/06/17 09:41, Mark Kirkwood wrote:
>> On 05/06/17 09:30, Tom Lane wrote:
>>
>>> I've been thinking about the behavior discussed in
>>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org 
>>>
>>> and it seems to me that there are a couple of things we ought to do 
>>> about
>>> it.
>>>
>>> First, I think we need a larger hard floor on the number of occurrences
>>> of a value that're required to make ANALYZE decide it is a "most common
>>> value".  The existing coding is willing to believe that anything that
>>> appears at least twice in the sample is a potential MCV, but that 
>>> design
>>> originated when we were envisioning stats samples of just a few 
>>> thousand
>>> rows --- specifically, default_statistics_target was originally just 
>>> 10,
>>> leading to a 3000-row sample size.  So accepting two-appearance 
>>> values as
>>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
>>> could be a tenth or a hundredth of that.
>>>
>>> As a round number, I'm thinking that a good floor would be a frequency
>>> estimate of 1/1000.  With today's typical sample size of 30000 rows,
>>> a value would have to appear at least 30 times in the sample to be
>>> believed to be an MCV.  That seems like it gives us a reasonable margin
>>> of error against the kind of sampling noise seen in the above-cited
>>> thread.
>>>
>>> Second, the code also has a rule that potential MCVs need to have an
>>> estimated frequency at least 25% larger than what it thinks the 
>>> "average"
>>> value's frequency is.  A rule of that general form seems like a good 
>>> idea,
>>> but I now think the 25% threshold is far too small to do anything 
>>> useful.
>>> In particular, in any case like this where there are more distinct 
>>> values
>>> than there are sample rows, the "average frequency" estimate will
>>> correspond to less than one occurrence in the sample, so that this 
>>> rule is
>>> totally useless to filter anything that we would otherwise consider 
>>> as an
>>> MCV.  I wonder if we shouldn't make it be "at least double the 
>>> estimated
>>> average frequency".
>>>
>>
>> Or possibly calculate the sample standard deviation and make use of 
>> that to help decide on a more flexible cutoff than twice the avg 
>> frequency?
>>
>> Are there any research papers that might help us here (I'm drowning 
>> in a sea of barely relevant search results for most phrases I've 
>> tried so far)? I recall there were some that Tom referenced when this 
>> stuff was originally written.
>>
>> On the other hand I do have access to some mathematicians 
>> specializing in statistics - so can get their thoughts on this issue 
>> if you feel it would be worthwhile.
>>
>> Cheers
>>
>> Mark
>>
>>
> The standard deviation (sd) is proportional to the square root of the 
> number in the sample in a Normal Distribution.
>
> In a Normal Distribution, about 2/3 of the values will be within plus or 
> minus one sd of the mean.
>
> There seems to be an implicit assumption that the distribution of 
> values follows the Normal Distribution - has this been verified? I 
> suspect that real data will have a skewed distribution of values, and 
> may even be multi-modal (multiple peaks).  The Normal Distribution has 
> one central peak with 2 tails of the same shape & size.
>
> So in a sample of 100 the sd is proportional to 10%,
> for 10,000 the sd is proportional to 1%.
>
> So essentially, the larger the sample size the more reliable is 
> knowledge of the most common values (ignoring pathologically extreme 
> distributions!) - the measure of reliability depends on the distribution.
>
> How about selecting the cut off as the mean plus one sd, or something 
> of that nature?  Note that the cut off point may result in no mcv's 
> being selected - especially for small samples.
>
> If practicable, it would be good to sample real datasets. Suggest 
> looking at datasets where the current mechanism looks reasonable, and 
> ones where the estimates are too far off.  Also, if possible, try any 
> new selection method on the datasets and see what the difference is.
>
> The above is based on what I remember from my university statistics 
> papers, I took it up to 4th year level many moons ago.
>
>
> Cheers,
> Gavin
>
>
>
>
Suddenly realized that the FREQUENCIES of values drawn from a Normal 
Distribution are probably NOT themselves normally distributed!

For some shapes, and fancy names for them (like 'leptokurtic'), see: 
http://onlinestatbook.com/2/introduction/distributions.html

For multi-modal examples: 
http://www.statisticshowto.com/multimodal-distribution

I tried looking for useful stuff to clarify things without success - so 
I think that asking a practising statistician is probably a very good 
idea!  :-)


Cheers,
Gavin






On Mon, Jun 5, 2017 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think that a single count in a 30K-row sample is noise by definition.

Not if the table only *has* 30K rows.  Or even 100K.  And anyway we're
talking about what to do with a value we hit at least twice, which is
not quite the same thing.

> As long as they actually are MCVs, sure.  The problem I've got with the
> current behavior is that it manufactures a spiky distribution where there
> is none.  That leads directly to bad estimates, as shown in Marko's
> example.  We'd be much better off, both as to planning time and accuracy,
> if we'd concluded that the table had no MCVs.

That is so, but that example is also constructed to show the case
where the current code falls down.  The case where the distribution
actually is spiky but the frequency is near the minimum that ANALYZE
can detect isn't tested by that example.

>> Another way to state it is: is this problem one-sided?
>
> You know as well as I do that there's no free lunch in this area.
> Anything we change at all will make things worse for somebody, if only
> by accident.

Yep.

> But I do not think that choosing a bunch of values entirely
> at random and claiming (incorrectly) that they are more common than other
> values in the table can possibly lead to better results except by
> accident.

But accidents happen, and sometimes they happen frequently.  I
maintain that the root of this problem is that you want to pretend
like a 30k row sample on (in this instance) a 100m row table ought to
be expected to be enough to reliably produce good results, and I think
that's doomed.  If you tweak the algorithm to remove what is in this
case noise, you'll just break some other case instead.  Maybe it'll
take somebody a few years to find and post an example of such a case,
but surely you can see that there must be cases where most ANALYZE
runs will find 2 of some value precisely because it is a lot more
common than average.

I think what we ought to do is install some kind of auto-scaling
behavior so that when we discover that a table contains a very large
number of rows, we use a higher statistics target.  Admittedly there are
some circularity problems there -- if we haven't sampled the table
yet, we don't know how wide the rows are.  But we could estimate based
on the on-disk size for the first ANALYZE and based on the
previously-observed row width thereafter, and probably improve things
measurably.
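
As a rough sketch of that kind of auto-scaling (every threshold and the
scaling rule here are invented purely for illustration, and the row-count
estimate is just relation size divided by an assumed row width):

    def estimate_rows(relation_bytes, observed_row_width=None, default_width=100):
        # Fall back to a guessed width for the first ANALYZE of the table.
        width = observed_row_width or default_width
        return relation_bytes // width

    def auto_stats_target(estimated_rows, base_target=100, cap=1000):
        """Double the statistics target for each factor-of-ten of table growth
        beyond 10 million rows, up to some cap."""
        target, threshold = base_target, 10_000_000
        while estimated_rows > threshold and target < cap:
            target *= 2
            threshold *= 10
        return min(target, cap)

    # A ~200 GB table with 200-byte rows is about a billion rows:
    print(auto_stats_target(estimate_rows(200 * 1024**3, observed_row_width=200)))  # 800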

Now if, in conjunction with that, we also get more aggressive about
filtering out low-order MCVs, fine.  I think there are plenty of small
tables where the low-order MCVs make very little difference and
chucking them will do nothing but save planning time.  But I believe
that for very large tables, shortening the MCV list will hurt -- in
some cases, probably a lot -- so I'd like to see some compensating
change to avoid that.

> I'm not by any means wedded to the proposition that we have to fix it
> simply by changing the filter rule.  One idea that seems worth considering
> is to keep a track list that's a bit longer than the maximum allowed
> number of MCVs, and then to say that we accept only MCVs whose counts are
> significantly greater than what we find at the tail of the list.  I'm not
> sure what "significantly" should be exactly, but surely a distribution
> as flat as the ones I was showing upthread should be a red flag.

I'm not sure exactly which bit upthread (or on the other thread)
you're referring to here, but I'm worried that your thinking is being
excessively shaped by the example presently in front of us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

Gavin Flower wrote:

> The standard deviation (sd) is proportional to the square root of
> the number in the sample in a Normal Distribution.
> 
> In a Normal Distribution, about 2/3 of the values will be within plus
> or minus one sd of the mean.
> 
> There seems to be an implicit assumption that the distribution of
> values follows the Normal Distribution - has this been verified?

The whole problem here is precisely to determine what is the data
distribution -- one side of it is how to represent it for the planner
(which we do by storing a number of distinct values, a list of MCVs and
their respective frequencies, and a histogram representing values not in
the MCV list); the other side is how to figure out what data to put in
the MCV list and histogram (i.e. what to compute during ANALYZE).

If we knew the distribution was a normal, we wouldn't need any of these
things -- we'd just store the mean and standard deviation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 06/06/17 10:12, Gavin Flower wrote:

> On 06/06/17 09:41, Mark Kirkwood wrote:
>> On 05/06/17 09:30, Tom Lane wrote:
>>
>>> I've been thinking about the behavior discussed in
>>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org 
>>>
>>> and it seems to me that there are a couple of things we ought to do 
>>> about
>>> it.
>>>
>>> First, I think we need a larger hard floor on the number of occurrences
>>> of a value that're required to make ANALYZE decide it is a "most common
>>> value".  The existing coding is willing to believe that anything that
>>> appears at least twice in the sample is a potential MCV, but that 
>>> design
>>> originated when we were envisioning stats samples of just a few 
>>> thousand
>>> rows --- specifically, default_statistics_target was originally just 
>>> 10,
>>> leading to a 3000-row sample size.  So accepting two-appearance 
>>> values as
>>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
>>> could be a tenth or a hundredth of that.
>>>
>>> As a round number, I'm thinking that a good floor would be a frequency
>>> estimate of 1/1000.  With today's typical sample size of 30000 rows,
>>> a value would have to appear at least 30 times in the sample to be
>>> believed to be an MCV.  That seems like it gives us a reasonable margin
>>> of error against the kind of sampling noise seen in the above-cited
>>> thread.
>>>
>>> Second, the code also has a rule that potential MCVs need to have an
>>> estimated frequency at least 25% larger than what it thinks the 
>>> "average"
>>> value's frequency is.  A rule of that general form seems like a good 
>>> idea,
>>> but I now think the 25% threshold is far too small to do anything 
>>> useful.
>>> In particular, in any case like this where there are more distinct 
>>> values
>>> than there are sample rows, the "average frequency" estimate will
>>> correspond to less than one occurrence in the sample, so that this 
>>> rule is
>>> totally useless to filter anything that we would otherwise consider 
>>> as an
>>> MCV.  I wonder if we shouldn't make it be "at least double the 
>>> estimated
>>> average frequency".
>>>
>>
>> Or possibly calculate the sample standard deviation and make use of 
>> that to help decide on a more flexible cutoff than twice the avg 
>> frequency?
>>
>> Are there any research papers that might help us here (I'm drowning 
>> in a sea of barely relevant search results for most phrases I've 
>> tried so far)? I recall there were some that Tom referenced when this 
>> stuff was originally written.
>>
>> On the other hand I do have access to some mathematicians 
>> specializing in statistics - so can get their thoughts on this issue 
>> if you feel it would be worthwhile.
>>
>> Cheers
>>
>> Mark
>>
>>
> The standard deviation (sd) is proportional to the square root of the 
> number in the sample in a Normal Distribution.
>
> In a Normal Distribution, about 2/3 of the values will be within plus or 
> minus one sd of the mean.
>
> There seems to be an implicit assumption that the distribution of 
> values follows the Normal Distribution - has this been verified? I 
> suspect that real data will have a skewed distribution of values, and 
> may even be multi-modal (multiple peaks).  The Normal Distribution has 
> one central peak with 2 tails of the same shape & size.
>
> So in a sample of 100 the sd is proportional to 10%,
> for 10,000 the sd is proportional to 1%.
>
> So essentially, the larger the sample size the more reliable is 
> knowledge of the most common values (ignoring pathologically extreme 
> distributions!) - the measure of reliability depends on the distribution.
>
> How about selecting the cut off as the mean plus one sd, or something 
> of that nature?  Note that the cut off point may result in no mcv's 
> being selected - especially for small samples.
>
> If practicable, it would be good to sample real datasets. Suggest 
> looking at datasets where the current mechanism looks reasonable, and 
> ones where the estimates are too far off.  Also, if possible, try any 
> new selection method on the datasets and see what the difference is.
>
> The above is based on what I remember from my university statistics 
> papers, I took it up to 4th year level many moons ago.
>
>
>

With respect to the assumption of a Normal distribution - it is easy to 
come up with an example that shows it need not hold: consider our 
pgbench 'accounts' table. The distribution of 'bid' values is not Normal 
(it is Uniform).

Now I realized I made people think about Normal distributions by 
mentioning computing the standard deviation of the sample (and I 
probably should have said 'standard deviation of the *sample 
frequencies*') - but this was only a suggestion for working out how to 
(maybe) be less arbitrary about how we decide what values to put in the 
MCV list (currently 25% above the mean frequency).

regards

Mark




On 05/06/17 09:30, Tom Lane wrote:
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value"...
>
> Second, the code also has a rule that potential MCVs need to have an
> estimated frequency at least 25% larger than what it thinks the "average"
> value's frequency is.  A rule of that general form seems like a good idea,
> but I now think the 25% threshold is far too small to do anything useful.
> In particular, in any case like this where there are more distinct values
> than there are sample rows, the "average frequency" estimate will
> correspond to less than one occurrence in the sample, so that this rule is
> totally useless to filter anything that we would otherwise consider as an
> MCV.  I wonder if we shouldn't make it be "at least double the estimated
> average frequency".
>

I think we should attempt to come up with a more principled approach
to this, taking into account the table and sample sizes. Here's what I
found, after a bit of research:

A common initial rule of thumb is that the value should occur at least
10 times in the sample - see, for example [1], [2]. The idea behind
that goes as follows:

Suppose that N is the population size (total number of rows in the
table), and n is the sample size, and that some particular candidate
value appears x times in the sample.

Then the "sample proportion" is given by
 p = x/n.

Taking a different sample would, in general, give a different value
for x, and hence for the sample proportion, so repeatedly taking
different random samples of the table would lead to a distribution of
values of p, and in general, this distribution would be a binomial
distribution, since x is the count of sampled rows matching a yes/no
question (does the row match the candidate value?).

The idea behind the rule of thumb above is that if n is large enough,
and x is not too close to 0 or n, then the binomial distribution of p
can be reasonably approximated by a normal distribution. There are
various rules of thumb for this to be true, but this one is actually
one of the stronger ones, as can be seen in [2]. Technically the rule
should be:
 x >= 10 and x <= n-10

but it's probably not worth bothering with the second test, since
we're looking for MCVs.

Note that this says nothing about the margin of error of p, just that
it is reasonable to treat it as having a normal distribution, which
then allows the margin of error to be analysed using standard
techniques.

The standard way of doing this is to calculate the "standard error" of
the sample proportion - see, for example [3], [4]:
 SE = sqrt(p*(1-p)/n)

Note, however, that this formula assumes that the sample size n is
small compared to the population size N, which is not necessarily the
case. This can be taken into account by applying the "finite
population correction" (see, for example [5]), which involves
multiplying by an additional factor:
 SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

This correction factor becomes 0 when n=N (whole table sampled => no
error) and it approaches 1 when n is much smaller than N.
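
In code form, those two formulas are just (a minimal sketch):

    from math import sqrt

    def standard_error(p, n, N=None):
        """Standard error of a sample proportion p from an n-row sample; if the
        population size N is given, apply the finite population correction."""
        se = sqrt(p * (1 - p) / n)
        if N is not None:
            se *= sqrt((N - n) / (N - 1))
        return se

    print(standard_error(0.0026, 30000))              # ~0.000294, ignoring the FPC
    print(standard_error(0.0026, 30000, N=1000000))   # ~0.000290 with the correction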

Having computed the standard error of the sample proportion, it is
then possible to establish a confidence interval around p. For
example, one could say that there is a 95% probability that the total
population proportion lies in the range
 [ pmin = p-2*SE, pmax = p+2*SE ]

This gives rise to the possibility of writing another rule for candidate MCVs:

If there are Nd distinct values in the table, so that the average
frequency of occurrence of any particular value is 1/Nd, then the test
 pmin > 1/Nd

would imply that there is a 97.5% probability that the candidate value is
more common than the average (since the 5% of the distribution of p
outside the confidence interval is split evenly below pmin and above
pmax).

To take a concrete example, suppose that the table has N=1,000,000
rows, with Nd=500 distinct values, so that each value occurs on
average 2000 times. If the sample size is 30,000 then the current rule
that a value needs to have an estimated frequency 25% over the average
means that a value would need to be seen 75 times to be considered as
an MCV. If that rule were changed to "at least double the estimated
average frequency", then a value would need to be seen at least 150
times. On the other hand, using the above rule based on a 95%
confidence interval, a value would only need to be seen 78 times, in
which case the estimated total number of occurrences of the value in
the table would be 2600+/-579, and using the MCV estimate would almost
certainly be better than not using it.
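
For anyone who wants to check those numbers, here is a short sketch that
reproduces them:

    from math import sqrt

    N, Nd, n = 1000000, 500, 30000
    avg_freq = 1.0 / Nd                                   # 0.002

    def pmin(x):
        p = x / n
        se = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
        return p - 2 * se

    # Smallest sample count whose 95% interval lies entirely above the average:
    x = next(x for x in range(1, n) if pmin(x) > avg_freq)
    p = x / n
    se = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
    print(x)                                 # 78
    print(round(p * N), round(2 * se * N))   # 2600 579, i.e. 2600 +/- 579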

Regards,
Dean


[1] https://onlinecourses.science.psu.edu/stat200/node/43
[2] https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation
[3] http://mathworld.wolfram.com/SampleProportion.html
[4] https://en.wikipedia.org/wiki/Population_proportion
[5] http://stattrek.com/statistics/dictionary.aspx?Definition=finite_population_correction
[6] https://en.wikipedia.org/wiki/Margin_of_error



Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I think we should attempt to come up with a more principled approach
> to this, taking into account the table and sample sizes. Here's what I
> found, after a bit of research:

Thanks for doing some legwork on this!

> A common initial rule of thumb is that the value should occur at least
> 10 times in the sample - see, for example [1], [2]. ...
> Note that this says nothing about the margin of error of p, just that
> it is reasonable to treat it as having a normal distribution, which
> then allows the margin of error to be analysed using standard
> techniques.

Got it.  Still, insisting on >= 10 occurrences feels a lot better to me
than insisting on >= 2.  It's interesting that that can be correlated
to whether the margin of error is easily analyzed.

> The standard way of doing this is to calculate the "standard error" of
> the sample proportion - see, for example [3], [4]:
>   SE = sqrt(p*(1-p)/n)
> Note, however, that this formula assumes that the sample size n is
> small compared to the population size N, which is not necessarily the
> case. This can be taken into account by applying the "finite
> population correction" (see, for example [5]), which involves
> multiplying by an additional factor:
>   SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

It's been a long time since college statistics, but that wikipedia article
reminds me that the binomial distribution isn't really the right thing for
our problem anyway.  We're doing sampling without replacement, so that the
correct model is the hypergeometric distribution.  The article points out
that the binomial distribution is a good approximation as long as n << N.
Can this FPC factor be justified as converting binomial estimates into
hypergeometric ones, or is it ad hoc?

> This gives rise to the possibility of writing another rule for candidate MCVs:
> If there are Nd distinct values in the table, so that the average
> frequency of occurrence of any particular value is 1/Nd, then the test
>   pmin > 1/Nd

Interesting indeed.  We'd have to be working with our estimated Nd, of
course, not the true Nd.  We know empirically that the estimate usually
lowballs the true value unless our sample is a fairly large fraction of
the whole table.  That would tend to make our cutoff pmin too high,
so that we'd be excluding some candidate MCVs even though a better sample
would show they almost certainly had a frequency more common than average.
That behavior seems fine to me; it'd tend to drop questionable MCVs in
small samples, which is exactly what I think we should be doing.  But it
is probably worth doing some empirical experiments with this rule to see
what we get.
        regards, tom lane



On 11 June 2017 at 20:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The standard way of doing this is to calculate the "standard error" of
>> the sample proportion - see, for example [3], [4]:
>>   SE = sqrt(p*(1-p)/n)
>> Note, however, that this formula assumes that the sample size n is
>> small compared to the population size N, which is not necessarily the
>> case. This can be taken into account by applying the "finite
>> population correction" (see, for example [5]), which involves
>> multiplying by an additional factor:
>>   SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
>
> It's been a long time since college statistics, but that wikipedia article
> reminds me that the binomial distribution isn't really the right thing for
> our problem anyway.  We're doing sampling without replacement, so that the
> correct model is the hypergeometric distribution.

Yes that's right.

>  The article points out
> that the binomial distribution is a good approximation as long as n << N.
> Can this FPC factor be justified as converting binomial estimates into
> hypergeometric ones, or is it ad hoc?

No, it's not just ad hoc. It comes from the variance of the
hypergeometric distribution [1] divided by the variance of a binomial
distribution [2] with p=K/N, in the notation of those articles.

This is actually a very widely used formula in fields like the
analysis of survey data, which inherently involves sampling without
replacement (assuming the questioners don't survey the same people
more than once!).
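
And a quick numerical check of that, in the notation of [1] and [2] (a small
sketch):

    # Population N, number of "successes" K, sample size n.
    N, K, n = 1000000, 2000, 30000
    p = K / N

    var_binomial = n * p * (1 - p)
    var_hypergeometric = n * p * (1 - p) * (N - n) / (N - 1)

    # The finite population correction applied to the standard error is the
    # square root of the ratio of the two variances:
    print((var_hypergeometric / var_binomial) ** 0.5)   # ~0.98489
    print(((N - n) / (N - 1)) ** 0.5)                   # sqrt((N-n)/(N-1)), same value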

Regards,
Dean


[1] https://en.wikipedia.org/wiki/Hypergeometric_distribution
[2] https://en.wikipedia.org/wiki/Binomial_distribution