Re: Using multiple extended statistics for estimates - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Using multiple extended statistics for estimates
Date
Msg-id 20191107110514.ythldmnq4tvjm6we@development
Whole thread Raw
In response to Re: Using multiple extended statistics for estimates  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Thu, Nov 07, 2019 at 01:38:20PM +0900, Kyotaro Horiguchi wrote:
>Hello.
>
>At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
>> >Here is a slightly polished v2 of the patch, the main difference being
>> >that computing clause_attnums was moved to a separate function.
>> >
>>
>> This time with the attachment ;-)
>
>This patch is a kind of straight-forward, which repeats what the
>previous statext_mcv_clauselist_selectivity did as long as remaining
>clauses matches any of MV-MCVs. Almost no regression in the cases
>where zero or just one MV-MCV applies to the given clause list.
>
>It applies cleanly on the current master and seems working as
>expected.
>
>
>I have some comments.
>
>Could we have description in the documentation on what multiple
>MV-MCVs are used in a query? And don't we need some regression tests?
>

Yes, regression tests are certainly needed - I though I've added them,
but it seems I failed to include them in the patch. Will fix.

I agree it's probably worth mentioning we can consider multiple stats,
but I'm a bit hesitant to put the exact rules how we pick the "best"
statistic to the docs. It's not 100% deterministic and it's likely
we'll need to tweak it a bit in the future.

I'd prefer showing the stats in EXPLAIN, but that's a separate patch.

>
>+/*
>+ * statext_mcv_clause_attnums
>+ *        Recalculate attnums from compatible but not-yet-estimated clauses.
>
>It returns attnums collected from multiple clause*s*. Is the name OK
>with "clause_attnums"?
>
>The comment says as if it checks the compatibility of each clause but
>the work is done in the caller side. I'm not sure such strictness is
>required, but it might be better that the comment represents what
>exactly the function does.
>

But the incompatible clauses have the pre-computed attnums set to NULL,
so technically the comment is correct. But I'll clarify.

>
>+ */
>+static Bitmapset *
>+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
>+                           Bitmapset **list_attnums)
>
>The last two parameters are in the same type in notation but in
>different actual types.. that is, one is a pointer to Bitmapset*, and
>another is an array of Bitmaptset*. The code in the function itself
>suggests that, but it would be helpful if a brief explanation of the
>parameters is seen in the function comment.
>

OK, will explain in a comment.

>+        /*
>+         * Recompute attnums in the remaining clauses (we simply use the bitmaps
>+         * computed earlier, so that we don't have to inspect the clauses again).
>+         */
>+        clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),
>
>Couldn't we avoid calling this function twice with the same parameters
>at the first round in the loop?
>

Hmmm, yeah. That's a good point.

>+        foreach(l, clauses)
>         {
>-            stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>-            *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+            /*
>+             * If the clause is compatible with the selected statistics, mark it
>+             * as estimated and add it to the list to estimate.
>+             */
>+            if (list_attnums[listidx] != NULL &&
>+                bms_is_subset(list_attnums[listidx], stat->keys))
>+            {
>+                stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>+                *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+            }
>
>The loop runs through all clauses every time. I agree that that is
>better than using a copy of the clauses to avoid to step on already
>estimated clauses, but maybe we need an Assertion that the listidx is
>not a part of estimatedclauses to make sure no clauses are not
>estimated twice.
>

Well, we can't really operate on a smaller "copy" of the list anyway,
because that would break the precalculation logic (the listidx value
would be incorrect for the new list), and tweaking it would be more
expensive than just iterating over all clauses. The assumption is that
we won't see extremely large number of clauses here.

Adding an assert seems reasonable. And maybe a comment why we should not
see any already-estimated clauses here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: [PATCH][DOC] Fix for PREPARE TRANSACTION doc and postgres_fdw message.
Next
From: Fabien COELHO
Date:
Subject: Re: Add Change Badges to documentation