Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Parallel Append implementation
Date
Msg-id CA+Tgmoa2roUeraFvMbE9+pEaFZ2o-z_EAX+mz9qjMw_3uJQ1Rg@mail.gmail.com
In response to Re: [HACKERS] Parallel Append implementation  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: [HACKERS] Parallel Append implementation  (Amit Khandekar <amitdkhan.pg@gmail.com>)
List pgsql-hackers
On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Attached is the updated patch that handles the changes for all the
> comments except the cost changes part. Details about the specific
> changes are after the cost-related points discussed below.
>
> For non-partial paths, I was checking the following 3 options :
>
> Option 1. Just take the sum of total non-partial child costs and
> divide it by number of workers. It seems to be getting close to the
> actual cost.

If the costs for all children are about equal, then that works fine.
But when they are very unequal, then it's highly misleading.

> Option 2. Calculate exact cost by an algorithm which I mentioned
> before, which is pasted below for reference :
> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
> After 10 time units (this is the minimum of the first 3, i.e. 20, 16,
> 10), the times remaining are :
> 10  6  0 8 3 1
> After 6 units (minimum of 10, 6, 8), the times remaining are :
> 4  0  0 2 3 1
> After 2 units (minimum of 4, 2, 3), the times remaining are :
>  2  0  0 0 1 1
> After 1 unit (minimum of 2, 1, 1), the times remaining are :
>  1  0  0 0 0 0
> After 1 unit (minimum of 1, 0, 0), the times remaining are :
>  0  0  0 0 0 0
> Now add up the above time chunks : 10 + 6 + 2 + 1 + 1 = 20

This gives the same answer as what I was proposing, but I believe it's
more complicated to compute.  The way my proposal would work in this
case is that we would start with an array C[3] (since there are three
workers), with all entries 0.  Logically C[i] represents the amount of
work to be performed by worker i.  We add each path in turn to the
worker whose array entry is currently smallest; in the case of a tie,
just pick the first such entry.

So in your example we do this:

C[0] += 20;
C[1] += 16;
C[2] += 10;
/* C[2] is smaller than C[0] or C[1] at this point, so we add the next
path to C[2] */
C[2] += 8;
/* after the previous line, C[1] is now the smallest, so add to that
entry next */
C[1] += 3;
/* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
C[2] += 1;
/* final result: C[0] = 20, C[1] = 19, C[2] = 19 */

Now we just take the highest entry in the array, which in this case is
C[0], as the total cost.
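
To make that concrete, here's a minimal standalone sketch of the scheme
(the names are mine, not the patch's, and it assumes the subpaths are
already sorted by descending cost, as in the example).  For comparison,
Option 1 would estimate (20+16+10+8+3+1)/3 = 58/3 ≈ 19.3 here, slightly
under the 20 that both Option 2 and this scheme produce:

#include <stdio.h>

#define NUM_WORKERS 3

static double
append_nonpartial_cost(const double *costs, int npaths)
{
    double      work[NUM_WORKERS] = {0.0};
    double      max_cost = 0.0;
    int         i;

    for (i = 0; i < npaths; i++)
    {
        int         min_idx = 0;
        int         w;

        /* Add each path to the worker with the least work so far. */
        for (w = 1; w < NUM_WORKERS; w++)
            if (work[w] < work[min_idx])
                min_idx = w;
        work[min_idx] += costs[i];
    }

    /* The estimated total cost is the largest per-worker total. */
    for (i = 0; i < NUM_WORKERS; i++)
        if (work[i] > max_cost)
            max_cost = work[i];
    return max_cost;
}

int
main(void)
{
    double      costs[] = {20, 16, 10, 8, 3, 1};

    printf("%.0f\n", append_nonpartial_cost(costs, 6));    /* prints 20 */
    return 0;
}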

Comments on this latest version:

In my previous review, I said that you should "define and document a
new builtin tranche ID"; you did the first but not the second.  See
the table in monitoring.sgml.

The definition of exec_append_goto_next_plan should have a line break
after the return type, per the usual PostgreSQL style rules.
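
That is, the layout should be as follows (the return type here is a
guess, and the argument list is elided, since only the layout matters):

static bool
exec_append_goto_next_plan(...)
{
    ...
}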

-     * initialize to scan first subplan
+     * In case it's a sequential Append, initialize to scan first subplan.

This comment is confusing because the code is executed whether it's
parallel or not.  I think it might be better to write something like
"initialize to scan first subplan (but note that we'll override this
later in the case of a parallel append)"

        /*
+         * Check if we are already finished plans from parallel append. This
+         * can happen if all the subplans are finished when this worker
+         * has not even started returning tuples.
+         */
+        if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);

There seems to be no reason why this couldn't be hoisted out of the
loop.  Actually, I think Ashutosh pointed this out before, but I
didn't understand at that time what his point was.  Looking back, I
see that he also pointed out that the as_padesc test isn't necessary,
which is also true.
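
In other words, something along these lines at the top of ExecAppend,
before entering the loop (a sketch based on the hunk above, untested):

/*
 * Parallel append: if every subplan has already been marked finished,
 * there is nothing to return.  as_whichplan can be PA_INVALID_PLAN
 * only in the parallel case, so there's no need to test as_padesc
 * separately.
 */
if (node->as_whichplan == PA_INVALID_PLAN)
    return ExecClearTuple(node->ps.ps_ResultTupleSlot);

for (;;)
{
    ...
}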

+        if (node->as_padesc)
+            node->as_padesc->pa_finished[node->as_whichplan] = true;

I think you should move this logic inside exec_append_parallel_next.
That would avoid testing node->as_padesc an extra time for non-parallel
append.  I note that the comment doesn't explain why it's safe to do
this without taking the lock.  I think we could consider doing it with
the lock held, but it probably is safe, because we're only setting it
from false to true.  If someone else does the same thing, that won't
hurt anything, and if someone else fails to see our update, then the
worst-case scenario is that they'll try to execute a plan that has no
chance of returning any more rows.  That's not so bad.  Actually,
looking further, you do have a comment explaining that, but it's in
exec_append_parallel_next() where the value is used, rather than here.
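
Concretely, the flag could be set inside exec_append_parallel_next()
itself, with the justification right there (sketch; the local variable
names are guesses):

/*
 * Mark the subplan we just exhausted as finished.  Doing this without
 * the lock is safe: the flag only ever goes from false to true, so
 * concurrent updates can't conflict, and a backend that misses the
 * update merely retries a subplan that has no more rows to return.
 */
padesc->pa_finished[whichplan] = true;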

+    memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
+
+    shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
+    node->as_padesc = padesc;

Putting the shm_toc_insert call after we fully initialize the
structure seems better than putting it after we've done some of the
initialization and before we've done the rest.
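
That is, something like:

/* Finish initializing the shared structure ... */
memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
/* (... along with any other fields of *padesc) */

/* ... and only then make it visible to the workers. */
shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
node->as_padesc = padesc;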

+    /* Choose the optimal subplan to be executed. */

I think the word "first" would be more accurate than "optimal".  We
can only hope to pick the optimal one, but whichever one we pick is
definitely the one we're executing first!

I think the loop in exec_append_parallel_next() is a bit confusing.
It has three exit conditions, one checked at the top of the loop and
two other ways to exit via break statements.  Sometimes it exits
because whichplan == PA_INVALID_PLAN was set by
exec_append_goto_next_plan(), and other times it exits because
whichplan == initial_plan and then it sets whichplan ==
PA_INVALID_PLAN itself.  I feel like this whole function could be
written more simply somehow.
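
For example, a version with a single loop condition might be clearer
(untested, and I'm guessing at the signature of
exec_append_goto_next_plan):

whichplan = initial_plan;
while (whichplan != PA_INVALID_PLAN && padesc->pa_finished[whichplan])
{
    whichplan = exec_append_goto_next_plan(whichplan);
    if (whichplan == initial_plan)
        whichplan = PA_INVALID_PLAN;    /* wrapped around; all finished */
}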

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


