Re: On Distributions In 7.2.1 - Mailing list pgsql-general

From Tom Lane
Subject Re: On Distributions In 7.2.1
Msg-id 7233.1020348710@sss.pgh.pa.us
In response to Re: On Distributions In 7.2.1  (Mark kirkwood <markir@slingshot.co.nz>)
Responses Tracking down Database growth
Re: On Distributions In 7.2.1
List pgsql-general
Mark kirkwood <markir@slingshot.co.nz> writes:
> However Tom's observation is still valid (in spite of my math) - all the
> frequencies are overestimated, rather than the expected "some bigger,
> some smaller" sort of thing.

No, that makes sense.  The values that get into the most-common-values
list are only going to be ones that are significantly more common (in
the sample) than the estimated average frequency.  So if ANALYZE makes
a good estimate of the average frequency, you'll only see upside
outliers in the MCV list.  The relevant logic is in analyze.c:

        /*
         * Decide how many values are worth storing as most-common values.
         * If we are able to generate a complete MCV list (all the values
         * in the sample will fit, and we think these are all the ones in
         * the table), then do so.    Otherwise, store only those values
         * that are significantly more common than the (estimated)
         * average. We set the threshold rather arbitrarily at 25% more
         * than average, with at least 2 instances in the sample.  Also,
         * we won't suppress values that have a frequency of at least 1/K
         * where K is the intended number of histogram bins; such values
         * might otherwise cause us to emit duplicate histogram bin
         * boundaries.
         */
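
To make the cutoff concrete, here is a rough sketch of the rule that comment describes. It is written in Python for brevity and is not the analyze.c code; the names mcv_min_count and choose_mcvs and their parameters are invented for illustration, and the "complete MCV list" case from the comment is left out.

```python
def mcv_min_count(sample_rows, est_distinct, num_bins):
    """Smallest sample count a value needs in order to be kept as an MCV.

    Hypothetical helper: parameter names are assumptions, not PostgreSQL's.
    """
    # Estimated number of times a typical value shows up in the sample.
    avg_count = sample_rows / est_distinct
    # "25% more than average, with at least 2 instances in the sample".
    min_count = max(avg_count * 1.25, 2.0)
    # Don't suppress values with frequency >= 1/K (K = histogram bins).
    return min(min_count, sample_rows / num_bins)


def choose_mcvs(counts, sample_rows, est_distinct, num_bins, max_mcv):
    """Return (value, sample frequency) pairs that pass the threshold."""
    threshold = mcv_min_count(sample_rows, est_distinct, num_bins)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(v, c / sample_rows) for v, c in ranked[:max_mcv] if c >= threshold]


if __name__ == "__main__":
    # Made-up sample: 3000 rows, about 1000 distinct values estimated,
    # so the estimated average frequency is 0.001.
    counts = {"a": 12, "b": 5, "c": 4, "d": 3, "e": 2}
    for value, freq in choose_mcvs(counts, 3000, 1000, 10, max_mcv=10):
        print(value, freq)
```

With those made-up numbers the threshold works out to 3.75 occurrences, so only "a", "b" and "c" are kept, and every stored frequency is above the 0.001 average. Values at or below average never make the list, which is why the MCV frequencies all look high.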

            regards, tom lane
