Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers

From Alex Shulgin
Subject Re: More stable query plans via more predictable column statistics
Date
Msg-id CAM-UEKS=S0hVRTJXBoYvFfJgK3=sv4YvUDn=FEoktaEnRzK2nw@mail.gmail.com
Whole thread Raw
In response to Re: More stable query plans via more predictable column statistics  (Alex Shulgin <alex.shulgin@gmail.com>)
Responses Re: More stable query plans via more predictable column statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Apr 3, 2016 at 8:24 AM, Alex Shulgin <alex.shulgin@gmail.com> wrote:
>
> On Sun, Apr 3, 2016 at 7:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Alex Shulgin <alex.shulgin@gmail.com> writes:
>> > On Sun, Apr 3, 2016 at 7:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> Well, we have to do *something* with the last (possibly only) value.
>> >> Neither "include always" nor "omit always" seem sane to me.  What other
>> >> decision rule do you want there?
>>
>> > Well, what implies that the last value is somehow special?  I would think
>> > we should just do with it whatever we do with the rest of the candidate
>> > MCVs.
>>
>> Sure, but both of the proposed decision rules break down when there are no
>> values after the one under consideration.  We need to do something sane
>> there.
>
>
> Hm... There are indeed the case where it would beneficial to have at least 2 values in the histogram (to have at least the low/high bounds for inequality comparison selectivity) instead of taking both to the MCV list or taking one to the MCVs and having to discard the other.

I was thinking about this in the background...

Popularity of the last sample value (which is not the only) one can be:

a) As high as 50%, in case we have an even division between the only two values in the sample.  Quite obviously, we should take this one into the MCV list (well, unless the user has specified statistics_target of 1 for some bizarre reason, but that should not be our problem).

b) As low as 2/(statistics_target*300), which is with the target set to a maximum allowed value of 10,000 amounts to 2/(10,000*300) = 1 in 1,500,000.  This seems like a really tiny number, but if your table has some tens of billions of rows, for example, seeing such a value at least twice means that it might correspond to some thousands of rows in the table, whereas seeing a value only once might mean just that: it's a unique value.

In this case, putting such a duplicate value in the MCV list will allow a much better selectivity estimate for equality comparison, as I've mentioned earlier.  It also allows for better estimate with inequality comparison, since MCVs are also consulted in this case.  I see no good reason to discard such a value.

c) Or anything in between the above figures.

In my opinion that amounts to "include always" being the sane option.  Do you see anything else as a problem here?

> Obviously, we need a fresh idea on how to handle this.

On reflection, the case where we have a duplicate value in the track list which is not followed by any other sample should be covered by the short path where we put all the tracked values in the MCV list, so there should be no point to even consider all of the above!

But the exact short path condition is formulated like this:

if (track_cnt == ndistinct && toowide_cnt == 0 &&
    stats->stadistinct > 0 &&
    track_cnt <= num_mcv)
{
    /* Track list includes all values seen, and all will fit */

So the execution path here is additionally put in dependence of two factors: whether we've seen at least one too wide sample or the distinct estimation produced a number higher than 10% of the estimated total table size (which is yet another arbitrary limit, but that's not in scope of this patch).

I've been puzzled by these conditions a lot, as I have mentioned in the last section of this thread's starting email and I could not find anything that would hint why they exist there, in the documentation, code comments or emails on hackers leading to the introduction of analyze.c in the form we know it today.  Probably we will never know, unless Tom still has some notes on this topic from 15 years ago. ;-)

This recalled observation can now also explain to me why in the regression you've seen, the short path was not followed: my bet is that stadistinct appeared negative.

Given that we change the logic in the complex path substantially, the assumptions that lead to the "Take all MVCs" condition above might no longer hold, and I see it as a pretty compelling argument to remove the extra checks, thus keeping the only one: track_cnt == ndistinct.  This should also bring the patch's effect more close to the thread's topic, which is "More stable query plans".

Regards,
--
Alex

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Next
From: Tom Lane
Date:
Subject: Re: psql metaqueries with \gexec