Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers

From Tom Lane
Subject Re: More stable query plans via more predictable column statistics
Date
Msg-id 12927.1459724809@sss.pgh.pa.us
Whole thread Raw
In response to Re: More stable query plans via more predictable column statistics  (Alex Shulgin <alex.shulgin@gmail.com>)
Responses Re: More stable query plans via more predictable column statistics  (Alex Shulgin <alex.shulgin@gmail.com>)
List pgsql-hackers
Alex Shulgin <alex.shulgin@gmail.com> writes:
> On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The reason for checking toowide_cnt is that if it's greater than zero,
>> then in fact the track list does NOT include all values seen, and it's
>> flat-out wrong to claim that it is an exhaustive set of values.

> But do we really state that with the short path?

Well, yes: the point of the short path is that we're hypothesizing that
the track list contains all values in the table, and that they should all
be considered MCVs.  Another way to think about it is that if we didn't
have the width-limit implementation restriction, those values would appear
in the track list, almost certainly with count 1, and so we would not
have taken the short path anyway.

Now you can argue that the long path would have accepted all the real
track-list entries as MCVs, and have rejected all these hypothetical
count-1 entries for too-wide values, and so the end result would be the
same.  But that gets back to the fact that that's not necessarily how
the long path behaves, either today or with the proposed patch.

The design intention was that the short path would handle columns
with a finite, small set of values (think booleans or enums) where the
ideal thing is that the table population is completely represented by
the MCV list.  As soon as we have good reason to think that the MCV
list won't represent the table contents fully, we should switch over
to a different approach where we're trying to identify which sample
values are common enough to justify putting in the MCV list.  In that
situation there are good reasons to not blindly fill the MCV list all
the way to the stats-target limit, but to try to cut it off at the
point of diminishing returns, so that the planner isn't saddled with
a huge MCV list that doesn't really contain a lot of useful information.

So that's the logic behind there being two code paths with discontinuous
behavior.  I'm not sure whether we need to try to narrow the discontinuity
or whether it's OK to act that way and we just need to refine the decision
rule about which path to take.  But anyway, comparisons of frequencies
of candidate MCVs seem to me to make sense in a large-ndistinct scenario
(where we have to be selective) but not a small-ndistinct scenario
(where we should just take 'em all).

>> The point of the original logic was to try to decide whether the
>> values in the sample are significantly more common than typical values
>> in the whole table population.  I think we may have broken that with
>> 3d3bf62f3: you can't make any such determination if you consider only
>> what's in the sample without trying to estimate what is not in the
>> sample.

> Speaking of rabbit holes...
> I'm out of ideas, unfortunately.  We badly need more eyes/brainpower on
> this, which is why I have submitted a talk proposal on this topic to
> PGDay.ru this summer in St. Petersburg, fearing that it might be too late
> to commit a satisfactory version during the current dev cycle for 9.6, and
> in hope to draw at least some attention to it.

If you're thinking it's too late to get more done for 9.6, I'm inclined to
revert the aspect of 3d3bf62f3 that made us work from "d" (the observed
number of distinct values in the sample) rather than stadistinct (the
extrapolated estimate for the table).  On reflection I think that that's
inconsistent with the theory behind the old MCV-cutoff rule.  It wouldn't
matter if we were going to replace the cutoff rule with something else,
but it's beginning to sound like that won't happen for 9.6.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: psql metaqueries with \gexec
Next
From: Greg Stark
Date:
Subject: Re: Using quicksort for every external sort run