Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: MCV lists for highly skewed distributions
Date
Msg-id CAEZATCXMttz+Ot+STT4zH8g-oGDtHcLVhhK8-bEFPO5TmScurw@mail.gmail.com
Whole thread Raw
In response to Re: MCV lists for highly skewed distributions  (John Naylor <jcnaylor@gmail.com>)
Responses Re: MCV lists for highly skewed distributions
List pgsql-hackers
On 21 January 2018 at 07:26, John Naylor <jcnaylor@gmail.com> wrote:
> I spent a few hours hacking on this, and it turns out calculating the
> right number of MCVs taking into account both uniform and highly
> non-uniform distributions is too delicate a problem for me to solve
> right now. The logic suggested by Dean Rasheed in [1] always produces
> no MCVs for a perfectly uniform distribution (which is good), but very
> often also for other distributions, which is not good. My efforts to
> tweak that didn't work, so I didn't get as far as adapting it for the
> problem Jeff is trying to solve.

Hmm, Tom suggested that the test based on the average frequency over
all values might be too strict because the estimated number of
distinct values is often too low, so that might explain what you're
seeing.

It occurs to me that maybe a better test to exclude a value from the
MCV list would be to demand that its relative standard error not be
too high. Such a test, in addition to the existing tests, might be
sufficient to solve the opposite problem of too many values in the MCV
list, because the real problem there is including a value after having
seen relatively few occurrences of it in the sample, and thus having a
wildly inaccurate estimate for it. Setting a bound on the relative
standard error would mean that we could have a reasonable degree of
confidence in estimates produced from the sample.

Regards,
Dean


pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: New gist vacuum.
Next
From: Andrey Borodin
Date:
Subject: Re: [HACKERS] WIP: Covering + unique indexes.