Mark Kirkwood <markir@slingshot.co.nz> writes:
> However Tom's observation is still valid (in spite of my math) - all the
> frequencies are overestimated, rather than the expected "some bigger,
> some smaller" sort of thing.
No, that makes sense. The values that get into the most-common-values
list are only going to be ones that are significantly more common (in
the sample) than the estimated average frequency. So if the thing makes
a good estimate of the average frequency, you'll only see upside
outliers in the MCV list. The relevant logic is in analyze.c:
    /*
     * Decide how many values are worth storing as most-common values.
     * If we are able to generate a complete MCV list (all the values
     * in the sample will fit, and we think these are all the ones in
     * the table), then do so.  Otherwise, store only those values
     * that are significantly more common than the (estimated)
     * average.  We set the threshold rather arbitrarily at 25% more
     * than average, with at least 2 instances in the sample.  Also,
     * we won't suppress values that have a frequency of at least 1/K
     * where K is the intended number of histogram bins; such values
     * might otherwise cause us to emit duplicate histogram bin
     * boundaries.
     */
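
As a rough sketch of the rule that comment describes (this is not the actual analyze.c code; the function name choose_num_mcv and its parameters are made up for illustration, and the "complete MCV list" case is left out), the cutoff works something like this:

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT the real analyze.c code.
 *
 * counts[]       per-value occurrence counts in the sample, sorted in
 *                descending order
 * sample_rows    number of rows in the sample
 * est_ndistinct  estimated number of distinct values (assumed > 0)
 * num_bins       intended number of histogram bins, K (assumed > 0)
 *
 * Returns how many leading entries of counts[] would be kept as MCVs.
 */
size_t
choose_num_mcv(const int *counts, size_t ncounts,
               int sample_rows, double est_ndistinct, int num_bins)
{
    /* expected number of sample occurrences of a "typical" value */
    double avgcount = (double) sample_rows / est_ndistinct;
    /* threshold: 25% more than average ... */
    double mincount = avgcount * 1.25;
    /* ... but never demand more than a 1/K share of the sample */
    double maxmincount = (double) sample_rows / (double) num_bins;
    size_t i;

    /* require at least 2 instances in the sample */
    if (mincount < 2)
        mincount = 2;
    if (mincount > maxmincount)
        mincount = maxmincount;

    /* counts[] is sorted, so stop at the first value below the cutoff */
    for (i = 0; i < ncounts; i++)
    {
        if (counts[i] < mincount)
            break;
    }
    return i;
}
```

With 300 sample rows, an estimated 100 distinct values and 10 histogram bins, the average count is 3 and the cutoff is 3.75, so sample counts of {40, 12, 6, 3, 1} would keep only the first three. Everything that survives is by construction above the estimated average, which is why only upside outliers show up in the list.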
regards, tom lane