Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCV5Xq7FGFYvigJWPZQrV9Hx37WvHfTgKdf6Ak3smbip2w@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Wed, 26 Dec 2018 at 22:09, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Attached is an updated version of the patch - rebased and fixing the
> warnings reported by Thomas Munro.
>

Here are a few random review comments based on what I've read so far:


On the CREATE STATISTICS doc page, the syntax in the new examples
added to the bottom of the page is incorrect. E.g., instead of

CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2;

it should read

CREATE STATISTICS s2 (mcv) ON a, b FROM t2;

I think perhaps there should be also be a short explanatory sentence
after each example (as in the previous one) just to explain what the
example is intended to demonstrate. E.g., for the new MCV example,
perhaps say

   These statistics give the planner more detailed information about the
   specific values that commonly appear in the table, as well as an upper
   bound on the selectivities of combinations of values that do not appear in
   the table, allowing it to generate better estimates in both cases.

I don't think there's a need for too much detail there, since it's
explained more fully elsewhere, but it feels like it needs a little
more just to explain the purpose of the example.


There is additional documentation in perform.sgml that needs updating
-- about what kinds of stats the planner keeps. Those docs are
actually quite similar to the ones on planstats.sgml. It seems the
former focus more one what stats the planner stores, while the latter
focus on how the planner uses those stats.


In func.sgml, the docs for pg_mcv_list_items need extending to include
the base frequency column. Similarly for the example query in
planstats.sgml.


Tab-completion for the CREATE STATISTICS statement should be extended
for the new kinds.


Looking at mcv_update_match_bitmap(), it's called 3 times (twice
recursively from within itself), and I think the pattern for calling
it is a bit messy. E.g.,

            /* by default none of the MCV items matches the clauses */
            bool_matches = palloc0(sizeof(char) * mcvlist->nitems);

            if (or_clause(clause))
            {
                /* OR clauses assume nothing matches, initially */
                memset(bool_matches, STATS_MATCH_NONE, sizeof(char) *
mcvlist->nitems);
            }
            else
            {
                /* AND clauses assume everything matches, initially */
                memset(bool_matches, STATS_MATCH_FULL, sizeof(char) *
mcvlist->nitems);
            }

            /* build the match bitmap for the OR-clauses */
            mcv_update_match_bitmap(root, bool_clauses, keys,
                                    mcvlist, bool_matches,
                                    or_clause(clause));

the comment for the AND case directly contradicts the initial comment,
and the final comment is wrong because it could be and AND clause. For
a NOT clause it does:

            /* by default none of the MCV items matches the clauses */
            not_matches = palloc0(sizeof(char) * mcvlist->nitems);

            /* NOT clauses assume nothing matches, initially */
            memset(not_matches, STATS_MATCH_FULL, sizeof(char) *
mcvlist->nitems);

            /* build the match bitmap for the NOT-clause */
            mcv_update_match_bitmap(root, not_args, keys,
                                    mcvlist, not_matches, false);

so the second comment is wrong. I understand the evolution that lead
to this function existing in this form, but I think that it can now be
refactored into a "getter" function rather than an "update" function.
I.e., something like mcv_get_match_bitmap() which first allocates the
array to be returned and initialises it based on the passed-in value
of is_or. That way, all the calling sites can be simplified to
one-liners like

            /* get the match bitmap for the AND/OR clause */
            bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys,
                                    mcvlist, or_clause(clause));


In the previous discussion around UpdateStatisticsForTypeChange(), the
consensus appeared to be that we should just unconditionally drop all
extended statistics when ALTER TABLE changes the type of an included
column (just as we do for per-column stats), since such a type change
can rewrite the data in arbitrary ways, so there's no reason to assume
that the old stats are still valid. I think it makes sense to extract
that as a separate patch to be committed ahead of these ones, and I'd
also argue for back-patching it.


That's it for now. I'll try to keep reviewing if time permits.

Regards,
Dean


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] pgbench - allow to store select results into variables
Next
From: Alvaro Herrera
Date:
Subject: Re: speeding up planning with partitions