Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"? - Mailing list pgsql-hackers

From Gavin Flower
Subject Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?
Date
Msg-id 63623eb3-1fdb-b2f8-7f0f-261f641a77c3@archidevsys.co.nz
Whole thread Raw
In response to Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?  (Mark Kirkwood <mark.kirkwood@catalyst.net.nz>)
Responses Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?
Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?
List pgsql-hackers
On 06/06/17 09:41, Mark Kirkwood wrote:
> On 05/06/17 09:30, Tom Lane wrote:
>
>> I've been thinking about the behavior discussed in
>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org 
>>
>> and it seems to me that there are a couple of things we ought to do 
>> about
>> it.
>>
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".  The existing coding is willing to believe that anything that
>> appears at least twice in the sample is a potential MCV, but that design
>> originated when we were envisioning stats samples of just a few thousand
>> rows --- specifically, default_statistics_target was originally just 10,
>> leading to a 3000-row sample size.  So accepting two-appearance 
>> values as
>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
>> could be a tenth or a hundredth of that.
>>
>> As a round number, I'm thinking that a good floor would be a frequency
>> estimate of 1/1000.  With today's typical sample size of 30000 rows,
>> a value would have to appear at least 30 times in the sample to be
>> believed to be an MCV.  That seems like it gives us a reasonable margin
>> of error against the kind of sampling noise seen in the above-cited
>> thread.
>>
>> Second, the code also has a rule that potential MCVs need to have an
>> estimated frequency at least 25% larger than what it thinks the 
>> "average"
>> value's frequency is.  A rule of that general form seems like a good 
>> idea,
>> but I now think the 25% threshold is far too small to do anything 
>> useful.
>> In particular, in any case like this where there are more distinct 
>> values
>> than there are sample rows, the "average frequency" estimate will
>> correspond to less than one occurrence in the sample, so that this 
>> rule is
>> totally useless to filter anything that we would otherwise consider 
>> as an
>> MCV.  I wonder if we shouldn't make it be "at least double the estimated
>> average frequency".
>>
>
> Or possibly calculate the sample standard deviation and make use of 
> that to help decide on a more flexible cutoff than twice the avg 
> frequency?
>
> Are there any research papers that might help us here (I'm drowning in 
> a sea of barely relevant search results for most phrases I've tried so 
> far)? I recall there were some that Tom referenced when this stuff was 
> originally written.
>
> On the other hand I do have access to some mathematicians specializing 
> in statistics - so can get their thoughts on this issue if you feel it 
> would be worthwhile.
>
> Cheers
>
> Mark
>
>
The standard deviation (sd) is proportional to the square root of the 
number in the sample in a Normal Distribution.

In a Normal Distribution, about 2/3 the values will be within plus or 
minus one sd of the mean.

There seems to be an implicit assumption that the distribution of values 
follows the Normal Distribution - has this been verified? I suspect that 
real data will have a skewed distribution of values, and may even be 
multi modal (multiple peaks)  The Normal Distribution has one central 
peak with 2 tails of the same shape & size.

So in a sample of 100 the sd is proportional to 10%,
for 10,000 the sd is proportional to 1%.

So essentially, the larger the sample size the more reliable is 
knowledge of the most common values (ignoring pathologically extreme 
distributions!) - the measure of reliability depends on the distribution.

How about selecting the cut off as the mean plus one sd, or something of 
that nature?  Note that the cut off point may result in no mcv's being 
selected - especially for small samples.

If practicable, it would be good to sample real datasets. Suggest 
looking at datasets were the current mechanism looks reasonable, and 
ones were the estimates are too far off.  Also, if possible, try any new 
selection method on the datasets and see what the difference is.

The above is based on what I remember from my university statistics 
papers, I took it up to 4th year level many moons ago.


Cheers,
Gavin





pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table
Next
From: Gavin Flower
Date:
Subject: Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?