Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCW9AXzf+1T44B0=rBd8a0AD+pWO_PpkhApqLBE9JS1Cyg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On 18 March 2018 at 23:57, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is an updated version of the patch series, addressing issues
> pointed out by Alvaro.

I've just been reading the new code in
statext_clauselist_selectivity() and mcv_clauselist_selectivity(), and
I'm having a hard time convincing myself that it's correct.

This code in statext_clauselist_selectivity() looks a bit odd:

    /*
     * Evaluate the MCV selectivity. See if we got a full match and the
     * minimal selectivity.
     */
    if (stat->kind == STATS_EXT_MCV)
        s1 = mcv_clauselist_selectivity(root, stat, clauses, varRelid,
                                        jointype, sjinfo, rel,
                                        &fullmatch, &mcv_lowsel);

    /*
     * If we got a full equality match on the MCV list, we're done (and the
     * estimate is likely pretty good).
     */
    if (fullmatch && (s1 > 0.0))
        return s1;

    /*
     * If it's a full match (equalities on all columns) but we haven't found
     * it in the MCV, then we limit the selectivity by frequency of the last
     * MCV item. Otherwise reset it to 1.0.
     */
    if (!fullmatch)
        mcv_lowsel = 1.0;

    return Min(s1, mcv_lowsel);

So if fullmatch is true and s1 is greater than 0, it will return s1.
If fullmatch is true and s1 equals 0, it will return Min(s1,
mcv_lowsel) which will also be s1. If fullmatch is false, mcv_lowsel
will be set to 1 and it will return Min(s1, mcv_lowsel) which will
also be s1. So it always just returns s1, no? Maybe there's no point
in computing fullmatch.

Also, wouldn't mcv_lowsel potentially be a significant overestimate
anyway? Perhaps 1 minus the sum of the MCV frequencies might be
closer, but even that ought to take into account the number of
distinct values remaining, although that information may not always be
available.

Also, just above that, in statext_clauselist_selectivity(), it
computes the list stat_clauses, then doesn't appear to use it
anywhere. I think that would have been the appropriate thing to pass
to mcv_clauselist_selectivity(). Otherwise, passing unrelated clauses
into mcv_clauselist_selectivity() will cause it to fail to find any
matches and then underestimate.

I've also come across a few incorrect/out-of-date comments:

/*
 * mcv_clauselist_selectivity
 *      Return the estimated selectivity of the given clauses using MCV list
 *      statistics, or 1.0 if no useful MCV list statistic exists.
 */

-- I can't see any code path that returns 1.0 if there are no MCV
stats. The last part of that comment is probably more appropriate to
statext_clauselist_selectivity()


/*
 * mcv_update_match_bitmap
 * [snip]
 * The function returns the number of items currently marked as 'match', and
 * ...

-- it doesn't seem to return the number of items marked as 'match'.

Then inside that function, this comment is wrong (copied from the
preceding comment):

                /* AND clauses assume nothing matches, initially */
                memset(bool_matches, STATS_MATCH_FULL, sizeof(char) *
mcvlist->nitems);

Still reading...

Regards,
Dean


pgsql-hackers by date:

Previous
From: Damir Simunic
Date:
Subject: Re: Proposal: http2 wire format
Next
From: Andres Freund
Date:
Subject: Re: Proposal: http2 wire format