Re: More stable query plans via more predictable column statistics - Mailing list pgsql-hackers

From Shulgin, Oleksandr
Subject Re: More stable query plans via more predictable column statistics
Date
Msg-id CACACo5Qhjudn+8=fnh4v_VTe3UxQRwSCtJy2oPLXqL6Hjy6APA@mail.gmail.com
In response to Re: More stable query plans via more predictable column statistics  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: More stable query plans via more predictable column statistics  ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
List pgsql-hackers
On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:

Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently, so let me post the review here ...

Thank you for the review!

2) mincount = 1.25 * avgcount
-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here), I do think we better keep this, otherwise we can just get rid of the mincount entirely I think - the most frequent value will always be above the (remaining) average count, making the threshold pointless.

Correct.

It might have an impact in the original code, but in the new one it's quite useless (see the next point), unless I'm missing some detail.


3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

 * We also decrease ndistinct in the process such that going forward
 * it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

    /* Narrow our view of samples left for the histogram */
    sample_cnt -= track[i].count;
    ndistinct--;

but that immediately lowers the avgcount, as that's re-evaluated within the same loop:

    avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold, although that's partially mitigated by keeping the 1.25 coefficient.

I was going to write "not necessarily lowering", but this is actually accurate.  The following holds due to track[i].count > avgcount (= sample_cnt / ndistinct):

 sample_cnt     sample_cnt - track[i].count
------------ > -----------------------------
  ndistinct           ndistinct - 1
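To make that concrete, here is a toy check with made-up numbers (sample_cnt = 1000, ndistinct = 10, track[i].count = 150; none of this is analyze.c code), showing that removing an above-average count always lowers the remaining average:

    #include <stdio.h>

    int
    main(void)
    {
        double      sample_cnt = 1000.0;    /* samples not yet moved to the MCV list */
        double      ndistinct = 10.0;       /* distinct values left for the histogram */
        double      track_count = 150.0;    /* next MCV candidate's count, above average */

        printf("avg before: %.2f\n", sample_cnt / ndistinct);                       /* 100.00 */
        printf("avg after:  %.2f\n", (sample_cnt - track_count) / (ndistinct - 1)); /* 94.44 */
        return 0;
    }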

It however makes reasoning about the algorithm much more complicated.

Unfortunately, yes.

4) for (i = 0; /* i < num_mcv */; i++)
---------------------------------------

The way the loop is coded seems rather awkward, I guess. Not only is there an unexpected comment in the "for" clause, but the condition also says this:

    /* Another way to say "while (i < num_mcv)" */
    if (i >= num_mcv)
        break;

Why not write it as a while loop, then? Possibly including the (num_hist >= 2) condition, like this:

    while ((i < num_mcv) && (num_hist >= 2))
    {
        ...
    }

In any case, the loop would deserve a comment explaining why we think computing the thresholds like this makes sense.

This is partially explained by a comment inside the loop:

! for (i = 0; /* i < num_mcv */; i++)
  {
!     /*
!      * We have to put this before the loop condition, otherwise
!      * we'll have to repeat this code before the loop and after
!      * decreasing ndistinct.
!      */
!     num_hist = ndistinct;
!     if (num_hist > num_bins)
!         num_hist = num_bins + 1;

I guess this is a case where code duplication can be traded for more apparent control flow, i.e.:

  num_hist = ndistinct;
  if (num_hist > num_bins)
      num_hist = num_bins + 1;

  for (i = 0; i < num_mcv && num_hist >= 2; i++)
  {
      ...
+     /* Narrow our view of samples left for the histogram */
+     sample_cnt -= track[i].count;
+     ndistinct--;

      num_hist = ndistinct;
      if (num_hist > num_bins)
          num_hist = num_bins + 1;
  }
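Putting points 2-4 together, the control flow would then look roughly like the sketch below. This is a self-contained illustration, not the actual analyze.c code: CandidateMCV and choose_num_mcv are made-up names, track[] is assumed to be sorted by descending count, and the mincount test from point 2 is assumed to be the break condition:

    typedef struct
    {
        int         count;      /* occurrences of this candidate MCV value */
    } CandidateMCV;

    /*
     * Sketch only: decide how many of the num_mcv candidates to keep,
     * re-evaluating avgcount/mincount against the samples and distinct
     * values that would remain for the histogram.
     */
    static int
    choose_num_mcv(CandidateMCV *track, int num_mcv, int num_bins,
                   int sample_cnt, int ndistinct)
    {
        int         i;
        int         num_hist = (ndistinct > num_bins) ? num_bins + 1 : ndistinct;

        for (i = 0; i < num_mcv && num_hist >= 2; i++)
        {
            double      avgcount = (double) sample_cnt / (double) ndistinct;
            double      mincount = 1.25 * avgcount;

            if ((double) track[i].count < mincount)
                break;          /* not frequent enough to stay in the MCV list */

            /* Narrow our view of samples left for the histogram */
            sample_cnt -= track[i].count;
            ndistinct--;

            num_hist = (ndistinct > num_bins) ? num_bins + 1 : ndistinct;
        }

        return i;               /* number of entries to keep as MCVs */
    }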

Summary
-------

Overall, I think this is really about deciding when to cut off the MCV list, so that it does not grow needlessly large - as Robert pointed out, the larger the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we believe we've seen all the values and we can fit them into the MCV list), it may not make sense to do so. The ultimate goal is to estimate conditions, and if we can do that reasonably well even after cutting off the least frequent values from the MCV list, then why not?

From this point of view, the analysis deals just with the ANALYZE part and does not discuss the estimation counterpart at all.

True, this aspect still needs verification.  As stated, my primary motivation was to improve the plan stability for relatively short MCV lists.

Longer MCV lists might be a different story, but see the "Increasing stats target" section of the original mail: increasing the target doesn't give quite the expected results with unpatched code either.

5) ndistinct estimation vs. NULLs
---------------------------------

While looking at the patch, I started wondering whether we're actually handling NULLs correctly when estimating ndistinct, because that part also uses samplerows directly and entirely ignores NULLs; it does this:

    numer = (double) samplerows * (double) d;

    denom = (double) (samplerows - f1) +
        (double) f1 * (double) samplerows / totalrows;

    ...
    if (stadistinct > totalrows)
        stadistinct = totalrows;

For tables with a large fraction of NULLs, this seems to significantly underestimate the ndistinct value - for example, consider a trivial table with 95% NULL values and ~10k distinct values with a skewed distribution:

    create table t (id int);

    insert into t
    select (case when random() < 0.05 then (10000 * random() * random())
                 else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in my sample:

    test=# select count(distinct id) from t;
     count
    -------
      8325
    (1 row)

But after ANALYZE with default statistics target (100), ndistinct is estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider fraction of NULLs, the estimates look like this:

    statistics target |   master  |  patched
   ------------------------------------------
                  100 |     1302  |     5356
                 1000 |     6022  |     6791

So this seems to significantly improve the ndistinct estimate (patch attached).

Hm... this looks correct.  And compute_distinct_stats() needs the same treatment, obviously.
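For reference, a rough sketch of how I read the proposed fix (function name and parameters below are made up, and this is not necessarily exactly what the attached patch does): base the estimator on the non-null sample count and clamp against a NULL-adjusted row count.

    /*
     * Sketch only: ndistinct estimate computed over non-null values.
     * nonnull_cnt = sampled non-null values, d = distinct values seen,
     * f1 = values seen exactly once, null_frac = observed fraction of NULLs.
     */
    static double
    estimate_ndistinct(double totalrows, int nonnull_cnt, int d, int f1,
                       double null_frac)
    {
        double      nonnull_rows = totalrows * (1.0 - null_frac);
        double      numer,
                    denom,
                    stadistinct;

        numer = (double) nonnull_cnt * (double) d;

        denom = (double) (nonnull_cnt - f1) +
            (double) f1 * (double) nonnull_cnt / nonnull_rows;

        stadistinct = numer / denom;

        /* Clamp to a sane range in case of roundoff error */
        if (stadistinct < (double) d)
            stadistinct = (double) d;
        if (stadistinct > nonnull_rows)
            stadistinct = nonnull_rows;

        return stadistinct;
    }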

--
Alex
