Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From David Rowley
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAKJS1f-gQVQ6Je+vA6GjK_NKcNAL8haa47=gWet6vgkNMXQ+hQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On Fri, 18 Jan 2019 at 10:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> thanks for the review. The attached patches address most of the issues
> mentioned in the past several messages, both in the MCV and histogram parts.

I made another pass over the 0001 patch. I've not read through mcv.c
again yet. Will try to get to that soon.

0001-multivariate-MCV-lists-20190117.patch

1. The following mentions "multiple functions", but lists just 1 function.

   <para>
    To inspect statistics defined using <command>CREATE STATISTICS</command>
    command, <productname>PostgreSQL</productname> provides multiple functions.
   </para>

2. There's a mix of usages of <literal>MCV</literal> and
<acronym>MCV</acronym> around the docs. Should these be the same?

3. analyze_mcv_list() is modified to make it an external function, but
it's not used anywhere out of analyze.c

4. The following can be simplified further:

* We can also leave the record as it is if there are no statistics
* including the datum values, like for example MCV lists.
*/
if (statext_is_kind_built(oldtup, STATS_EXT_MCV))
reset_stats = true;

/*
* If we can leave the statistics as it is, just do minimal cleanup
* and we're done.
*/
if (!reset_stats)
{
ReleaseSysCache(oldtup);
return;
}

to just:

/*
* When none of the defined statistics types contain datum values
* from the table's columns then there's no need to reset the stats.
* Functional dependencies and ndistinct stats should still hold true.
*/
if (!statext_is_kind_built(oldtup, STATS_EXT_MCV))
{
ReleaseSysCache(oldtup);
return;
}

5. "so that we can ignore them below." seems misplaced now since
you've moved all the code below into clauselist_selectivity_simple().
   Maybe you can change it to "so that we can inform
clauselist_selectivity_simple about clauses that it should ignore" ?

* filled with the 0-based list positions of clauses used that way, so
* that we can ignore them below.

6. README.mcv: multi-variate -> multivariate

are large the list may be quite large. This is especially true for multi-variate

7. README.mcv: similar -> a similar

it impossible to use anyarrays. It might be possible to produce similar

8. I saw you added IS NOT NULL to README.mcv, but README just mentions:

    (b) MCV lists - equality and inequality clauses (AND, OR, NOT), IS NULL

Should that mention IS NOT NULL too?

9. The header comment for build_attnums_array() claims that it
"transforms an array of AttrNumber values into a bitmap", but it does
the opposite.

 * Transforms an array of AttrNumber values into a bitmap.

10. The following Assert is not entirely useless.  The bitmapset could
have a 0 member, but it can't store negative values.

while ((j = bms_next_member(attrs, j)) >= 0)
{
/* user-defined attributes only */
Assert(AttrNumberIsForUserDefinedAttr(j));

Just checking you thought of that when you added it?

11. XXX comments are normally reserved for things we may wish to
reconsider later, but the following seems more like a "Note:"

 * XXX All the memory is allocated in a single chunk, so that the caller
 * can simply pfree the return value to release all of it.

12. In statext_is_compatible_clause_internal() there's still a comment
that mentions "Var op Const", but Const op Var is also okay too.

13. This is not fall-through.  Generally, such a comment is reserved
to confirm that the "break;" is meant to be missing.

default:
/* fall-through */
return false;

https://developers.redhat.com/blog/2017/03/10/wimplicit-fallthrough-in-gcc-7/
mentions various comment patterns that are used for that case.  Your
case seems misplaced since it's right about a return, and not another
case.

14. The header comment for statext_is_compatible_clause() is not
accurate.  It mentions only opexprs with equality operations are
allowed, but none of those are true.

 * Only OpExprs with two arguments using an equality operator are supported.
 * When returning True attnum is set to the attribute number of the Var within
 * the supported clause.

15. statext_clauselist_selectivity(): "a number" -> "the number" ?

 * Selects the best extended (multi-column) statistic on a table (measured by
 * a number of attributes extracted from the clauses and covered by it), and

16. I understand you're changing this to a bitmask in the 0002 patch,
but int is the wrong type here;
/* we're interested in MCV lists */
int types = STATS_EXT_MCV;

Maybe just pass the STATS_EXT_MCV directly, or at least make it a char.

17. bms_membership(clauses_attnums) != BMS_MULTIPLE seems better here.
It can stop once it finds 2. No need to count them all.

/* We need at least two attributes for MCV lists. */
if (bms_num_members(clauses_attnums) < 2)
return 1.0;

18. The following comment in statext_is_compatible_clause_internal()
does not seem to be true.  I see OpExprs are supported and NULL test,
including others too.

/* We only support plain Vars for now */

19. The header comment for clauselist_selectivity_simple() does not
mention what estimatedclauses is for.

20. New line. Also, missing "the" before "maximum"

+ * We
+ * iteratively search for multivariate n-distinct with maximum number

21. This comment seems like it's been copied from
estimate_num_groups() without being edited.

/* we're done with this relation */
varinfos = NIL;

Looks like it's using this to break out of the loop.

22. I don't see any dividing going on below this comment:

/*
* Sanity check --- don't divide by zero if empty relation.
*/

23. I see a few tests mentioning: "-- check change of unrelated column
type does not reset the MCV statistics"

Would it be better to just look at pg_statistic_ext there and do something like:

SELECT COUNT(*) FROM pg_statistic_ext WHERE stxname = 'whatever' AND
stxmcv IS NOT NULL;

Otherwise, you seem to be ensuring the stats were not reset by looking
at a query plan, so it's a bit harder to follow and likely testing
more than it needs to.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: Refactoring the checkpointer's fsync request queue
Next
From: Kevin Grittner
Date:
Subject: Re: TestForOldSnapshot() seems to be in the wrong place