Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From Robert Haas
Subject Re: MCV lists for highly skewed distributions
Date
Msg-id CA+TgmoZOHwXjyrw4Fi-Lrdz5ChB6GN77xg4adJQvgg70_9kjAA@mail.gmail.com
Whole thread Raw
In response to Re: MCV lists for highly skewed distributions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: MCV lists for highly skewed distributions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> For a highly skewed distribution, it is possible for there to be
> hardly any values (maybe only one) that appears more than 1.25 times
> the average frequency, and so lots of otherwise perfectly good common
> values get discarded. For such a distribution, I don't think that the
> average frequency is particularly interesting, and it shouldn't be
> used to filter the MCV list.

Although I don't think I have enough math background to analyze this
as rigorously as you, I agree with you on this point: the average
frequency doesn't seem very interesting.

One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs.  This would be less important if we had reliable
n_distinct estimates, but we don't.  So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.

> Testing it with the example from [1], it does indeed appear to solve
> the too-many-MCVs problem in that particular case (actually generating
> no MCVs).
>
> Testing it with the highly skewed example at the start of this thread,
> it typically generates a couple more MCVs, producing somewhat better
> estimates, but to get really good estimates for who=17, you need to
> crank up the stats target. It does at least appear to no longer be the
> case that cranking up the stats target has a weak effect.

Hmm, nice result.  I agree with you that it would be nice if we can
come up with a good general-purpose algorithm for this, rather than
making it pluggable.  I don't know whether we can succeed in doing
that but we should try hard, because it's always better when things
just work automatically...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: JIT compiling with LLVM v9.0
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] why not parallel seq scan for slow functions