Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".
> This kind of math isn't my strong point, but it seems to me that,
> while sampling noise is a problem, it's not obvious how to tell
> whether a given result is signal or noise.

I think that a single count in a 30K-row sample is noise by definition.
We really ought to be setting the threshold for "what is an MCV" high
enough that it's not drastically affected by variations that are clearly
at the sampling-error level.
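
To put rough numbers on that (back-of-the-envelope Python, not anything
analyze.c actually computes): if we treat a value's sample count as
roughly Poisson, the relative sampling error of an observed count c is
about 1/sqrt(c), so counts of 1 or 2 tell us almost nothing about the
true frequency:

from math import sqrt

sample_size = 30000          # rows sampled at the default statistics target

for c in (1, 2, 5, 10, 25, 100):
    est_freq = c / sample_size        # the frequency ANALYZE would ascribe
    rel_err = 1 / sqrt(c)             # rough relative standard error
    print("count %3d -> est. frequency %.6f +/- %3.0f%%"
          % (c, est_freq, 100 * rel_err))

A count of 1 is +/- 100% and a count of 2 is +/- 70%; until the count
gets well into double digits we're mostly measuring the noise, not the
frequency.
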
> What makes me a bit cautious about this approach overall is that I've
> seen cases where the length of the MCV list turns out to be key to
> getting a good plan, and I needed to make it longer in order for
> things to work.

As long as they actually are MCVs, sure. The problem I've got with the
current behavior is that it manufactures a spiky distribution where there
is none. That leads directly to bad estimates, as shown in Marko's
example. We'd be much better off, both as to planning time and accuracy,
if we'd concluded that the table had no MCVs.
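
As a toy illustration (made-up numbers, and only approximating how eqsel
treats an MCV entry): take a perfectly flat column, 10M rows spread over
1M distinct values, so every value occurs exactly 10 times. A value that
happens to be sampled twice gets stored with frequency 2/30000, and an
equality estimate against it then comes out around 667 rows instead of 10:

table_rows  = 10_000_000
n_distinct  = 1_000_000            # every value occurs exactly 10 times
sample_size = 30000

true_rows       = table_rows / n_distinct    # 10
pseudo_mcv_freq = 2.0 / sample_size          # what the two-occurrence rule stores
print("pseudo-MCV estimate: %.0f rows, actual: %.0f rows"
      % (pseudo_mcv_freq * table_rows, true_rows))

That's a ~67x overestimate manufactured purely by sampling luck.
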
> Another way to state it is: is this problem one-sided?

You know as well as I do that there's no free lunch in this area.
Anything we change at all will make things worse for somebody, if only
by accident. But I do not think that choosing a bunch of values entirely
at random and claiming (incorrectly) that they are more common than other
values in the table can possibly lead to better results except by
accident.

> In general, I'm pretty skeptical of the idea that sampling 30,000
> rows out of an arbitrarily large table will produce a
> sufficiently-accurate MCV list.

Perhaps not, but allowing two occurrences to define an MCV surely isn't
helping with that. More to the point maybe, it's a behavior that doesn't
go away simply by making the sample larger. Making the sample larger
just allows us to falsely invent more pseudo-MCVs.
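
For what it's worth, on a dead flat column the expected number of values
that clear a two-occurrence bar grows roughly with the square of the
sample size (it's a birthday-collision effect). A quick sketch, using a
sampling-with-replacement approximation and ignoring the other filters
in compute_scalar_stats:

def expected_pseudo_mcvs(sample_size, n_distinct):
    # Chance that any one of n_distinct equally-common values is sampled
    # zero times or exactly once; everything else clears the
    # two-occurrence bar purely by luck.
    p = 1.0 / n_distinct
    p0 = (1 - p) ** sample_size
    p1 = sample_size * p * (1 - p) ** (sample_size - 1)
    return n_distinct * (1 - p0 - p1)

for n in (30000, 60000, 120000, 300000):
    print(n, round(expected_pseudo_mcvs(n, n_distinct=10_000_000)))

With 10M distinct values that's around 45 accidental "MCVs" at the
current 30K sample, and it roughly quadruples each time the sample
doubles.
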
I'm not by any means wedded to the proposition that we have to fix it
simply by changing the filter rule. One idea that seems worth considering
is to keep a track list that's a bit longer than the maximum allowed
number of MCVs, and then to say that we accept only MCVs whose counts are
significantly greater than what we find at the tail of the list. I'm not
sure what "significantly" should be exactly, but surely a distribution
as flat as the ones I was showing upthread should be a red flag.
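
Concretely, the shape would be something like this (a standalone Python
sketch, nothing like actual analyze.c code; the two-standard-errors
cutoff is purely a placeholder for whatever "significantly" should mean):

from math import sqrt

def filter_mcvs(track_counts, max_mcvs):
    # track_counts: sample counts of tracked values, sorted descending,
    # with the track list kept somewhat longer than max_mcvs so we can
    # see what its tail looks like.
    tail = track_counts[max_mcvs]
    cutoff = tail + 2 * sqrt(tail)       # placeholder definition of
                                         # "significantly greater"
    return [c for c in track_counts[:max_mcvs] if c > cutoff]

# A genuinely skewed column keeps its MCVs ...
print(filter_mcvs([120, 95, 60, 3, 2, 2, 2, 2], max_mcvs=5))   # [120, 95, 60]
# ... while a flat one keeps none.
print(filter_mcvs([3, 3, 2, 2, 2, 2, 2, 2], max_mcvs=5))       # []

The second case is the point: when the head of the track list is no
taller than its tail, we'd be better off storing no MCVs at all than
blessing the noise.
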
regards, tom lane