Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id CA+TgmobU1dipfJEFOLN4VNgNTh-r0Kh_7PwapoV9zc7wvXhLLA@mail.gmail.com
In response to Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: [DESIGN] ParallelAppend
List pgsql-hackers
On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I entirely agree with your suggestion.
>
> We may be able to use an analogy between PartialSeqScan and the
> parallel-aware Append node.
> PartialSeqScan fetches blocks pointed to by an index in a shared memory
> segment, so multiple workers eventually cooperate to scan a table in a
> round-robin manner even though each individual worker fetches comb-shaped
> blocks.
> If we treat individual blocks as individual sub-plans of the
> parallel-aware Append, it behaves very similarly. A certain number of
> workers (more than zero) is launched by the Gather node, and then the
> parallel-aware Append node fetches one of the sub-plans, if any remain.

Exactly, except for the part about the blocks being "comb-shaped",
which doesn't seem to make sense.

> I think this design also gives additional flexibility according to
> the parallelism required by the underlying sub-plans.
> Suppose the "PartialSeqScan on p2" in the above example wants
> 3 workers to process the scan; we can expand the virtual array of
> sub-plans as follows. Then, if the Gather node launches 5 workers,
> each worker is assigned to one of the plans. If the Gather node
> launches fewer than 5 workers, the first worker to exit picks the
> second sub-plan, so it eventually provides the best parallelism.
>
> +----------+
> | sub-plan |       * Sub-Plan 1 ... Index Scan on p1
> | index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
> | shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
> | memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
> +----------+       * Sub-Plan 3 ... Index Scan on p3

I don't think the shared memory chunk should be indexed by worker, but
by sub-plan.  So with 3 subplans, we would initially have [0,0,0].
The first worker would grab the first subplan, and we get [1,0,0].
The second worker grabs the third subplan, so now we have [1,0,1].
The remaining workers can't join the execution of those plans because
they are not truly parallel, so they all gang up on the second
subplan.  At 5 workers we have [1,3,1].  Workers need not ever
decrement the array entries because they only pick a new sub-plan when
the one they picked previously is exhausted; thus, at the end of the
plan, each element in the array shows the total number of workers that
touched it at some point during its execution.

> For more generic plan construction, a Plan node may have a field for
> the number of "desirable" workers, even though most plan nodes are not
> parallel-aware and the number is not guaranteed.
> In the above case, the parallel-aware Append will want 5 workers in
> total (2 for the two index scans, plus 3 for the partial seq scan). It
> is at the Gather node's discretion how many workers are actually
> launched, but this gives clear information about how many workers are
> best.

Yeah, maybe.  I haven't thought about this deeply just yet, but I
agree it needs more consideration.

>> First, we can teach Append that, when running in parallel,
>> it should initialize a chunk of dynamic shared memory with an array
>> indicating how many workers are currently working on each subplan.
> Can the parallel-aware Append recognize the current mode by checking
> whether MyBgworkerEntry is valid or not?

No - that would be quite wrong.  What it needs to do is define
ExecAppendEstimate and ExecAppendInitializeDSM and call those
functions from ExecParallelEstimate and ExecParallelInitializeDSM.  It
also needs to define a third callback ExecAppendInitializeWorker which
will be called from the ExecParallelInitializeWorker function added by
the latest partial seq scan patch.  ExecAppendEstimate must estimate
required shared memory usage for the shared memory state;
ExecAppendInitializeDSM must initialize that state, store a pointer to
it in the planstate node, and add a TOC entry;
ExecAppendInitializeWorker will run in the worker and should look up
the TOC entry and store the result in the same planstate node that
ExecAppendInitializeDSM
populated in the leader.  Then ExecAppend can decide what to do based
on whether that pointer is set, and based on the data to which it
points.

Are you going to look at implementing this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


