Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 19b77ce3-83ec-c04b-ff7a-02475f044f4d@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List pgsql-hackers
On 1/10/19 6:09 PM, Dean Rasheed wrote:
> On Wed, 26 Dec 2018 at 22:09, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Attached is an updated version of the patch - rebased and fixing the
>> warnings reported by Thomas Munro.
>>
> 
> Here are a few random review comments based on what I've read so far:
> 
> 
> On the CREATE STATISTICS doc page, the syntax in the new examples
> added to the bottom of the page is incorrect. E.g., instead of
> 
> CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2;
> 
> it should read
> 
> CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
> 

Fixed.

> I think perhaps there should be also be a short explanatory sentence
> after each example (as in the previous one) just to explain what the
> example is intended to demonstrate. E.g., for the new MCV example,
> perhaps say
> 
>    These statistics give the planner more detailed information about the
>    specific values that commonly appear in the table, as well as an upper
>    bound on the selectivities of combinations of values that do not appear in
>    the table, allowing it to generate better estimates in both cases.
> 
> I don't think there's a need for too much detail there, since it's
> explained more fully elsewhere, but it feels like it needs a little
> more just to explain the purpose of the example.
> 

I agree, this part of docs can be quite terse. I've adopted the wording
you proposed, and I've done something similar for the histogram patch,
which needs to add something too. It's a bit repetitive, though.

> 
> There is additional documentation in perform.sgml that needs updating
> -- about what kinds of stats the planner keeps. Those docs are
> actually quite similar to the ones on planstats.sgml. It seems the
> former focus more one what stats the planner stores, while the latter
> focus on how the planner uses those stats.
> 

OK, I've expanded this part a bit too.

> 
> In func.sgml, the docs for pg_mcv_list_items need extending to include
> the base frequency column. Similarly for the example query in
> planstats.sgml.
> 

Fixed.

> 
> Tab-completion for the CREATE STATISTICS statement should be extended
> for the new kinds.
> 

Fixed.

> 
> Looking at mcv_update_match_bitmap(), it's called 3 times (twice
> recursively from within itself), and I think the pattern for calling
> it is a bit messy. E.g.,
> 
>             /* by default none of the MCV items matches the clauses */
>             bool_matches = palloc0(sizeof(char) * mcvlist->nitems);
> 
>             if (or_clause(clause))
>             {
>                 /* OR clauses assume nothing matches, initially */
>                 memset(bool_matches, STATS_MATCH_NONE, sizeof(char) *
> mcvlist->nitems);
>             }
>             else
>             {
>                 /* AND clauses assume everything matches, initially */
>                 memset(bool_matches, STATS_MATCH_FULL, sizeof(char) *
> mcvlist->nitems);
>             }
> 
>             /* build the match bitmap for the OR-clauses */
>             mcv_update_match_bitmap(root, bool_clauses, keys,
>                                     mcvlist, bool_matches,
>                                     or_clause(clause));
> 
> the comment for the AND case directly contradicts the initial comment,
> and the final comment is wrong because it could be and AND clause. For
> a NOT clause it does:
> 
>             /* by default none of the MCV items matches the clauses */
>             not_matches = palloc0(sizeof(char) * mcvlist->nitems);
> 
>             /* NOT clauses assume nothing matches, initially */
>             memset(not_matches, STATS_MATCH_FULL, sizeof(char) *
> mcvlist->nitems);
> 
>             /* build the match bitmap for the NOT-clause */
>             mcv_update_match_bitmap(root, not_args, keys,
>                                     mcvlist, not_matches, false);
> 
> so the second comment is wrong. I understand the evolution that lead
> to this function existing in this form, but I think that it can now be
> refactored into a "getter" function rather than an "update" function.
> I.e., something like mcv_get_match_bitmap() which first allocates the
> array to be returned and initialises it based on the passed-in value
> of is_or. That way, all the calling sites can be simplified to
> one-liners like
> 
>             /* get the match bitmap for the AND/OR clause */
>             bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys,
>                                     mcvlist, or_clause(clause));
> 

Yes, I agree. I've reworked the function per your proposal, and I've
done the same for the histogram too.

> 
> In the previous discussion around UpdateStatisticsForTypeChange(), the
> consensus appeared to be that we should just unconditionally drop all
> extended statistics when ALTER TABLE changes the type of an included
> column (just as we do for per-column stats), since such a type change
> can rewrite the data in arbitrary ways, so there's no reason to assume
> that the old stats are still valid. I think it makes sense to extract
> that as a separate patch to be committed ahead of these ones, and I'd
> also argue for back-patching it.
> 

Wasn't the agreement to keep stats that don't include column values
(functional dependencies and ndistinct coefficients), and reset only
more complex stats? That's what happens in master and how it's extended
by the patch for MCV lists and histograms.

> 
> That's it for now. I'll try to keep reviewing if time permits.
> 

Thanks!


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Next
From: Paul Martinez
Date:
Subject: PATCH: Include all columns in default names for foreign key constraints.