Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 20190313.131951.178195776.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List pgsql-hackers
Hello.

At Wed, 13 Mar 2019 02:25:40 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<19f76496-dcf3-ccea-dd82-26fbed57b8f5@2ndquadrant.com>
> Hi,
> 
> attached is an updated version of the patch, addressing most of the
> issues raised in the recent reviews. There are two main exceptions:
> 
> 1) I haven't reworked the regression tests to use a function to check
> cardinality estimates and making them faster.
> 
> 2) Review handling of bitmap in statext_is_compatible_clause_internal
> when processing AND/OR/NOT clauses.
> 
> I plan to look into those items next, but I don't want block review of
> other parts of the patch unnecessarily.

I briefly looked it and have some comments.

0001-multivariate-MCV-lists-20190312.patch

+/*
+ * bms_member_index
+ *        determine 0-based index of the varattno in the bitmap
+ *
+ * Returns (-1) when the value is not a member.

I think the comment should be more generic.

"determine 0-based index of member x among the bitmap members"
" Returns -1 when x is not a member."


(cont'ed)
+    if (a == NULL)
+        return 0;

Isn't the case of "not a member"?

bms_member_index seems working differently than maybe expected.

 bms_member_index((2, 4), 0) => 0, (I think) should be -1
 bms_member_index((2, 4), 1) => 0, should be -1
 bms_member_index((2, 4), 2) => 0, should be 0
 bms_member_index((2, 4), 3) => 1, should be -1
 bms_member_index((2, 4), 4) => 1, should be 1
 bms_member_index((2, 4), 5) => 2, should be -1
 bms_member_index((2, 4), 6) => 2, should be -1
...
 bms_member_index((2, 4), 63) => 2, should be -1
 bms_member_index((2, 4), 64) => -1, correct

It works correctly only when x is a member - the way the function
is maybe actually used in this patch -, or needs to change the
specifiction (or the comment) of the function.




+    if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+    {
+        /*
+         * Estimate selectivity on any clauses applicable by stats tracking
+         * actual values first, then apply functional dependencies on the
+         * remaining clauses.

The comment doesn't seem needed since it is mentioning the detail
of statext_clauselist_selectivity() called just below.



+        if (statext_is_kind_built(htup, STATS_EXT_MCV))
+        {
+            StatisticExtInfo *info = makeNode(StatisticExtInfo);
+
+            info->statOid = statOid;
+            info->rel = rel;
+            info->kind = STATS_EXT_MCV;
+            info->keys = bms_copy(keys);
+
+            stainfos = lcons(info, stainfos);
+        }


We are to have four kinds of extended statistics, at worst we
have a list containing four StatisticExtInfos with the same
statOid, rel, keys and only different kind. Couldn't we reverse
the structure so that StatisticExtIbfo be something like:

>  struct StatsticExtInfo
>  {
>     NodeTag    type;
>     Oid        statOid;
>     RelOptInfo *rel;
!     char       kind[8];  /* arbitrary.. */
>     Bitmapset *keys;     



+OBJS = extended_stats.o dependencies.o mcv.o mvdistinct.o

The module for MV distinctness is named 'mvdistinct', but mcv
doesn't have the prefix. I'm not sure we need to unify the
names, though.


+Multivariate MCV (most-common values) lists are a straightforward extension of

  "lists are *a*" is wrong?



@@ -223,26 +220,16 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,

I haven't read it in datil, but why MV-MCV patch contains (maybe)
improvement of functional dependency code?



+int
+compare_scalars_simple(const void *a, const void *b, void *arg)

Seems to need a comment. "compare_scalars without tupnoLink maininance"?


+int
+compare_datums_simple(Datum a, Datum b, SortSupport ssup)
+{
+    return ApplySortComparator(a, false, b, false, ssup);
+}

This wrapper function doesn't seem to me required.



+/* simple counterpart to qsort_arg */
+void *
+bsearch_arg(const void *key, const void *base, size_t nmemb, size_t size,

We have some functions named *_bsearch. If it is really
qsort_arg's bsearch versoin, it might be better to be placed in
qsort_arg.c or new file bsearch_arg.c?


+int *
+build_attnums_array(Bitmapset *attrs)

If the attrs is not offset, I'd like that it is named
differently, say, attrs_nooffset or something.


+    int            i,
+                j,
+                len;

I'm not sure but is it following our coding convention?


+            items[i].values[j] = heap_getattr(rows[i],

items is needed by qsort_arg and as return value. It seems to me
that using just values[] and isnull[] make the code simpler there.


+    /* Look inside any binary-compatible relabeling (as in examine_variable) */
+    if (IsA(clause, RelabelType))
+        clause = (Node *) ((RelabelType *) clause)->arg;

This is quite a common locution so it's enough that the comment
just mention what it does, like "Remove any relabel
decorations". And relabelling can happen recursively so the 'if'
should be 'while'?


+        /* we also better ensure the Var is from the current level */
+        if (var->varlevelsup > 0)
+            return false;


I don't get the meaning of the "better". If it cannot/don't
accept subquery's output, it would be "we refuse Vars from ...",
or if the function is not assumed to receive such Vars, it should
be an assertion.




+        /* see if it actually has the right shape (one Var, one Const) */
+        ok = (NumRelids((Node *) expr) == 1) &&
+            (is_pseudo_constant_clause(lsecond(expr->args)) ||
+             (varonleft = false,
+              is_pseudo_constant_clause(linitial(expr->args))));

I don't think such "expression" with unidentifiable side-effect
is a good thing. Counldn't it in more plain code? (Yeah, it is
already used in clauselist_selectivity so I don't insist on
that.)


+         * This uses the function for estimating selectivity, not the operator
+         * directly (a bit awkward, but well ...).

Not only it is the right thing but actually the operators for the
type path don't have operrst.



+ * statext_is_compatible_clause
+ *        Determines if the clause is compatible with MCV lists.

I think the name should contain the word "mcv". Isn't the name
better to be "staext_clause_is_mcv_compatibe"?


(Sorry, further comments may come later..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pgsql: Add support for hyperbolic functions, as well as log10().
Next
From: "Matsumura, Ryo"
Date:
Subject: RE: ECPG regression with DECLARE STATEMENT support