Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [DESIGN] ParallelAppend
Msg-id: CA+TgmobU1dipfJEFOLN4VNgNTh-r0Kh_7PwapoV9zc7wvXhLLA@mail.gmail.com
In response to: Re: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses: Re: [DESIGN] ParallelAppend
List: pgsql-hackers
On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I entirely agree with your suggestion.
>
> We may be able to use an analogy between PartialSeqScan and the
> parallel-aware Append node.
> PartialSeqScan fetches blocks pointed by the index on shared memory
> segment, thus multiple workers eventually co-operate to scan a table
> using round-robin manner even though individual worker fetches
> comb-shaped blocks.
> If we assume individual blocks are individual sub-plans on the parallel
> aware Append, it performs very similar. A certain number of workers
> (more than zero) is launched by Gather node, then the parallel aware
> Append node fetches one of the sub-plans if any.

Exactly, except for the part about the blocks being "comb-shaped",
which doesn't seem to make sense.

> I think, this design also gives additional flexibility according to
> the required parallelism by the underlying sub-plans.
> Please assume the "PartialSeqScan on p2" in the above example wants
> 3 workers to process the scan, we can expand the virtual array of
> the sub-plans as follows. Then, if Gather node kicks 5 workers,
> individual workers are assigned on some of plans. If Gather node
> could kick less than 5 workers, the first exit worker picks the
> second sub-plan, then it eventually provides the best parallelism.
>
> +---------+
> |sub-plan |       * Sub-Plan 1 ... Index Scan on p1
> |index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
> |shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
> |memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
> +---------+       * Sub-Plan 3 ... Index Scan on p3

I don't think the shared memory chunk should be indexed by worker, but
by sub-plan.  So with 3 subplans, we would initially have [0,0,0].
The first worker would grab the first subplan, and we get [1,0,0].
The second worker grabs the third subplan, so now we have [1,0,1].
The remaining workers can't join the execution of those plans because
they are not truly parallel, so they all gang up on the second
subplan.  At 5 workers we have [1,3,1].  Workers need not ever
decrement the array entries, because they only pick a new sub-plan
when the one they picked previously is exhausted; thus, at the end of
the plan, each element in the array shows the total number of workers
that touched it at some point during its execution.

> For more generic plan construction, Plan node may have a field for
> number of "desirable" workers even though most of plan-nodes are not
> parallel aware, and it is not guaranteed.
> In above case, the parallel aware Append will want 5 workers in total
> (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> of Gather node how many workers are actually launched, however, it
> will give clear information how many workers are best.

Yeah, maybe.  I haven't thought about this deeply just yet, but I
agree it needs more consideration.

>> First, we can teach Append that, when running in parallel,
>> it should initialize a chunk of dynamic shared memory with an array
>> indicating how many workers are currently working on each subplan.
> Can the parallel-aware Append can recognize the current mode using
> MyBgworkerEntry whether it is valid or not?

No - that would be quite wrong.  What it needs to do is define
ExecAppendEstimate and ExecAppendInitializeDSM and call those
functions from ExecParallelEstimate and ExecParallelInitializeDSM.
It also needs to define a third callback ExecAppendInitializeWorker
which will be called from the ExecParallelInitializeWorker function
added by the latest partial seq scan patch.
ExecAppendEstimate must estimate the shared memory required for the
shared state; ExecAppendInitializeDSM must initialize that state,
store a pointer to it in the planstate node, and add a TOC entry;
ExecAppendInitializeWorker will run in the worker and should look up
the TOC entry and store the result in the same planstate node that
ExecAppendInitializeDSM populated in the leader.  Then ExecAppend can
decide what to do based on whether that pointer is set, and based on
the data to which it points.

Are you going to look at implementing this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company