From: Alex Shulgin
Subject: Re: More stable query plans via more predictable column statistics
Msg-id: CAM-UEKRnHmE-LHao2MqC86bME6-e_-jS0su=HroAkxw4mEJojg@mail.gmail.com
In response to: Re: More stable query plans via more predictable column statistics (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On Mon, Apr 4, 2016 at 1:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex Shulgin <alex.shulgin@gmail.com> writes:
>> On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The reason for checking toowide_cnt is that if it's greater than zero,
>>> then in fact the track list does NOT include all values seen, and it's
>>> flat-out wrong to claim that it is an exhaustive set of values.
>
>> But do we really state that with the short path?
>
> Well, yes: the point of the short path is that we're hypothesizing that
> the track list contains all values in the table, and that they should all
> be considered MCVs.  Another way to think about it is that if we didn't
> have the width-limit implementation restriction, those values would appear
> in the track list, almost certainly with count 1, and so we would not
> have taken the short path anyway.
>
> Now you can argue that the long path would have accepted all the real
> track-list entries as MCVs, and have rejected all these hypothetical
> count-1 entries for too-wide values, and so the end result would be the
> same.  But that gets back to the fact that that's not necessarily how
> the long path behaves, either today or with the proposed patch.

Agreed.
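
For the archives, the short-path test we are talking about looks roughly
like this -- a paraphrase from memory of the 9.5-era compute_scalar_stats()
in analyze.c, so treat the variable names and details as approximate:

/*
 * Paraphrased sketch, not the actual source: ndistinct is the number of
 * distinct values seen in the sample, track_cnt the number of entries in
 * the track list, toowide_cnt the number of values skipped for exceeding
 * the width limit.
 */
if (track_cnt == ndistinct && toowide_cnt == 0 &&
    stats->stadistinct > 0 &&
    track_cnt <= num_mcv)
{
    /* Track list is exhaustive and fits: take it all as MCVs */
    num_mcv = track_cnt;
}
else
{
    /* fall through to the frequency-based cutoff rule */
}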

> The design intention was that the short path would handle columns
> with a finite, small set of values (think booleans or enums) where the
> ideal thing is that the table population is completely represented by
> the MCV list.  As soon as we have good reason to think that the MCV
> list won't represent the table contents fully, we should switch over
> to a different approach where we're trying to identify which sample
> values are common enough to justify putting in the MCV list.

This is a precious detail that, unfortunately, I couldn't find in any of the sources of information available to me online. :-)

I don't have a habit of hanging out on IRC channels, but now I wonder how likely it is that I could have learned this by just asking around on #postgresql (or by mailing you directly as the committer of this early implementation--is that OK at all?).

Again, having this type of design decision documented in the code might save some time and confusion for the sociopath^W introvert type of folks like myself. ;-)

> In that
> situation there are good reasons to not blindly fill the MCV list all
> the way to the stats-target limit, but to try to cut it off at the
> point of diminishing returns, so that the planner isn't saddled with
> a huge MCV list that doesn't really contain a lot of useful information.

This also came to be my understanding at some point.
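
Also for the archives, the cutoff rule in question is approximately the
following -- again a paraphrase of the 9.5-era code, so the exact constants
and names may be off:

/*
 * Paraphrased sketch of the long-path cutoff: avgcount is the count an
 * average value would have; a candidate must beat that by 25% and occur
 * at least twice to make the MCV list.  Which denominator feeds avgcount
 * (the sample "d" vs. the extrapolated stadistinct) is exactly the
 * question discussed below.
 */
avgcount = (double) nonnull_cnt / ndistinct;
mincount = avgcount * 1.25;
if (mincount < 2)
    mincount = 2;
for (i = 0; i < num_mcv; i++)
{
    if (track[i].count < mincount)
        break;              /* the point of diminishing returns */
}
num_mcv = i;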

> So that's the logic behind there being two code paths with discontinuous
> behavior.  I'm not sure whether we need to try to narrow the discontinuity
> or whether it's OK to act that way and we just need to refine the decision
> rule about which path to take.  But anyway, comparisons of frequencies
> of candidate MCVs seem to me to make sense in a large-ndistinct scenario
> (where we have to be selective) but not a small-ndistinct scenario
> (where we should just take 'em all).

Yeah, this seems to be an open question.  And a totally new one to me in the light of recent revelations.
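
(A made-up illustration of the two regimes: with a stats target of 100, a
status column with 4 distinct values clearly wants all 4 in the MCV list,
while a column with ~100,000 distinct values in the table wants only those
sample values whose frequency clearly stands out above the average -- the
two cases really do call for different rules.)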

>>> The point of the original logic was to try to decide whether the
>>> values in the sample are significantly more common than typical values
>>> in the whole table population.  I think we may have broken that with
>>> 3d3bf62f3: you can't make any such determination if you consider only
>>> what's in the sample without trying to estimate what is not in the
>>> sample.
>
>> Speaking of rabbit holes...
>> I'm out of ideas, unfortunately.  We badly need more eyes/brainpower on
>> this, which is why I have submitted a talk proposal on this topic to
>> PGDay.ru this summer in St. Petersburg, fearing that it might be too late
>> to commit a satisfactory version during the current dev cycle for 9.6, and
>> in the hope of drawing at least some attention to it.

> If you're thinking it's too late to get more done for 9.6,

Not necessarily, but given the time constraints and some personal issues that keep popping up, I'm no longer as optimistic as I was just 24h ago.
 
> I'm inclined to
> revert the aspect of 3d3bf62f3 that made us work from "d" (the observed
> number of distinct values in the sample) rather than stadistinct (the
> extrapolated estimate for the table).  On reflection I think that that's
> inconsistent with the theory behind the old MCV-cutoff rule.  It wouldn't
> matter if we were going to replace the cutoff rule with something else,
> but it's beginning to sound like that won't happen for 9.6.

Please feel free to do what you think is in the best interest of the people preparing 9.6 for the freeze.  I'm not all that familiar with the process, but I guess reverting early might save some head-scratching if interesting interactions between this change and some others show up.
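
For readers following along, my understanding of the aspect in question,
as a rough sketch (names only approximate analyze.c; the substance is
per your description of 3d3bf62f3 above):

/*
 * Sketch of the two denominators, not actual source.  A negative
 * stadistinct means "minus this fraction of the row count".
 */
double      ndistinct_table = stats->stadistinct;

if (ndistinct_table < 0)
    ndistinct_table = -ndistinct_table * totalrows;

/* pre-3d3bf62f3: average frequency extrapolated to the whole table */
avgcount = (double) nonnull_cnt / ndistinct_table;

/* with 3d3bf62f3: average frequency within the sample alone */
avgcount = (double) nonnull_cnt / (double) d;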

Cheers!
--
Alex
