Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers

From Alex Shulgin
Subject Re: More stable query plans via more predictable column statistics
Date
Msg-id CAM-UEKSyrBf5dyHMdr0Am0Zo6ohtTUoeKq4SEErrQFf=Fi4nEA@mail.gmail.com
Whole thread Raw
In response to Re: More stable query plans via more predictable column statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: More stable query plans via more predictable column statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alex Shulgin <alex.shulgin@gmail.com> writes:
> This recalled observation can now also explain to me why in the regression
> you've seen, the short path was not followed: my bet is that stadistinct
> appeared negative.

Yes, I think that's right.  The table under consideration had just a few
live rows (I think 3), so that even though there was only one value in
the sample, the "if (stats->stadistinct > 0.1 * totalrows)" condition
succeeded.

Yeah, this part of the logic can be really surprising at times.

> Given that we change the logic in the complex path substantially, the
> assumptions that lead to the "Take all MVCs" condition above might no
> longer hold, and I see it as a pretty compelling argument to remove the
> extra checks, thus keeping the only one: track_cnt == ndistinct.  This
> should also bring the patch's effect more close to the thread's topic,
> which is "More stable query plans".

The reason for checking toowide_cnt is that if it's greater than zero,
then in fact the track list does NOT include all values seen, and it's
flat-out wrong to claim that it is an exhaustive set of values.

But do we really state that with the short path?

If there would be only one too wide value, it might be the only thing left for the histogram in the end and will be discarded anyway, so from the end result perspective there is no difference.

If there are multiple too wide values, they will be effectively discarded by the histogram calculation part also, so again no difference from the perspective of the end result.

The reason for the track_cnt <= num_mcv condition is that if that's not
true, the track list has to be trimmed to meet the statistics target.
Again, that's not optional.

Yes, but this check we only need in compute_distinct_stats() and we are talking about compute_scalar_stats() now where track_cnt is always less than or equals to num_mcv (again, please see at the bottom of the thread-starting email), or is my analysis broken on this part?

I think the reasoning for having the stats->stadistinct > 0 test in there
was that if we'd set it negative, then we think that the set of distinct
values will grow --- which again implies that the set of values actually
seen should not be considered exhaustive.

This is actually very neat.  So the idea here as I get it is that if we have enough distinct values to suspect that more unique ones will be added later as the table grows (which is a natural tendency with most of the tables anyway), then at the moment the statistics we produce are going to be actually used by the planner, it is likely that we no longer cover all the distinct values by the MCV list, right?

I would *love* to see that documented in code comments at the least.

Of course, with a table as
small as that regression-test example, we have little evidence to support
either that conclusion or its opposite.

I think it might be possible to record historical ndistinct values between the ANALYZE runs and use that as better evidence that the number of distincts is actually growing rather than basing that decision on that hard-coded 10% limit rule.  What do you think?

We do not support migration of pg_statistic system table during major version upgrades (yet), so if we somehow achieve what I've just described, it might be not a compatibility-breaking change anyway.

It's possible that what we should do to eliminate the sudden change
of behaviors is to drop the "track list includes all values seen, and all
will fit" code path entirely, and always go through the track list
one-at-a-time.

That could also be an option, that I have considered initially.  Now that I read your explanation of each check, I'm not that sure anymore.

If we do, though, the currently-proposed filter rules aren't going to
be too satisfactory: if we have a relatively small group of roughly
equally common MCVs, this logic would reject all of them, which is
surely not what we want.

Indeed. :-( 
 
The point of the original logic was to try to decide whether the
values in the sample are significantly more common than typical values
in the whole table population.  I think we may have broken that with
3d3bf62f3: you can't make any such determination if you consider only
what's in the sample without trying to estimate what is not in the
sample.

Speaking of rabbit holes...

I'm out of ideas, unfortunately.  We badly need more eyes/brainpower on this, which is why I have submitted a talk proposal on this topic to PGDay.ru this summer in St. Petersburg, fearing that it might be too late to commit a satisfactory version during the current dev cycle for 9.6, and in hope to draw at least some attention to it.

Regards,
--
Alex

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Next
From: Corey Huinker
Date:
Subject: Re: psql metaqueries with \gexec