Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F80115F83B@BPXM15GP.gisp.nec.co.jp
In response to Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: [DESIGN] ParallelAppend  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
At PGconf.EU, I had a chance to talk with Robert about this topic,
and it became clear that we have the same idea.

> +---------+
> |sub-plan |       * Sub-Plan 1 ... Index Scan on p1
> |index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
> |shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
> |memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
> +---------+       * Sub-Plan 3 ... Index Scan on p3
>
In the above example, a non-parallel sub-plan occupies only one
slot of the array, while the PartialSeqScan takes three slots.
This is a strict rule: a non-parallel-aware sub-plan can be picked
up only once.
The index into the sub-plan array is initialized to 0, then
incremented (up to 5) by each worker when it processes the
parallel-aware Append. So, once a worker has taken a non-parallel
sub-plan, no other worker can ever take the same slot again; thus,
no duplicated rows will be produced by a non-parallel sub-plan
under the parallel-aware Append.
This array structure also prevents too many workers from piling
onto a particular parallel-aware sub-plan: the PartialSeqScan
occupies three slots, which means at most three workers can pick
up that sub-plan. If the 1st worker took the IndexScan on p1, and
the 2nd-4th workers took the PartialSeqScan on p2, then the 5th
worker (if any) will pick up the IndexScan on p3 even if the
PartialSeqScan on p2 has not yet completed.



One other thought experiment: what happens if a parallel-aware
Append sits underneath another parallel-aware Append?
As the name says, a parallel-aware Append is itself parallel-aware,
thus it can occupy multiple slots in the array of sub-plans, like:

subplans[0] ... SeqScan on p1
subplans[1] ... Parallel Append on p2+p3+p4
subplans[2] ... Parallel Append on p2+p3+p4
subplans[3] ... Parallel Append on p2+p3+p4
subplans[4] ... Parallel Append on p2+p3+p4
subplans[5] ... IndexScan on p5

Also, assume the child parallel-aware Append has the following
array:
subplans[0] ... SeqScan on p2
subplans[1] ... PartialSeqScan on p3
subplans[2] ... PartialSeqScan on p3
subplans[3] ... SeqScan on p4

The Gather node located on top of the upper Append node will
launch (at most) 6 workers, according to the requirement of
the upper Append node.
Each worker picks up a particular sub-plan in the array of the
upper Append node, so some of them (4 workers at most) will
execute the child Append node.
These 4 workers will then also pick up a particular sub-plan
in the array of the child Append node.
It will work even if the number of workers is less than optimal,
so I believe the overall design is reasonable.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Tuesday, October 27, 2015 6:46 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> 
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> > Sent: Monday, October 26, 2015 8:53 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> > Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> >
> > On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > I entirely agree with your suggestion.
> > >
> > > We may be able to use an analogy between PartialSeqScan and the
> > > parallel- aware Append node.
> > > PartialSeqScan fetches blocks pointed by the index on shared memory
> > > segment, thus multiple workers eventually co-operate to scan a table
> > > using round-robin manner even though individual worker fetches comb-
> > > shaped blocks.
> > > If we assume individual blocks are individual sub-plans on the parallel
> > > aware Append, it performs very similar. A certain number of workers
> > > (more than zero) is launched by Gather node, then the parallel aware
> > > Append node fetches one of the sub-plans if any.
> >
> > Exactly, except for the part about the blocks being "comb-shaped",
> > which doesn't seem to make sense.
> >
> > > I think, this design also gives additional flexibility according to
> > > the required parallelism by the underlying sub-plans.
> > > Please assume the "PartialSeqScan on p2" in the above example wants
> > > 3 workers to process the scan, we can expand the virtual array of
> > > the sub-plans as follows. Then, if Gather node kicks 5 workers,
> > > individual workers are assigned on some of plans. If Gather node
> > > could kick less than 5 workers, the first exit worker picks the
> > > second sub-plan, then it eventually provides the best parallelism.
> > >
> > > +---------+
> > > |sub-plan |       * Sub-Plan 1 ... Index Scan on p1
> > > |index on *-----> * Sub-Plan 2 ... PartialSeqScan on p2
> > > |shared   |       * Sub-Plan 2 ... PartialSeqScan on p2
> > > |memory   |       * Sub-Plan 2 ... PartialSeqScan on p2
> > > +---------+       * Sub-Plan 3 ... Index Scan on p3
> >
> > I don't think the shared memory chunk should be indexed by worker, but
> > by sub-plan.  So with 3 subplans, we would initially have [0,0,0].
> > The first worker would grab the first subplan, and we get [1,0,0].
> > The second worker grabs the third subplan, so now we have [1,0,1].
> > The remaining workers can't join the execution of those plans because
> > they are not truly parallel, so they all gang up on the second
> > subplan.  At 5 workers we have [1,3,1].  Workers need not ever
> > decrement the array entries because they only pick a new sub-plan when
> > the one they picked previously is exhausted; thus, at the end of the
> > plan, each element in the array shows the total number of workers that
> > touched it at some point during its execution.
> >
> Sorry, I could not get the point of the above explanation.
> The 1st worker would grab the first subplan, then [1,0,0]. It's OK.
> The 2nd worker would grab the last subplan, then [1,0,1]. That is
> understandable, even though I'm uncertain why it does not pick up
> the 2nd one.
> Why do the remaining workers have to gang up on the 2nd subplan
> (the PartialSeqScan; that is, a parallel-aware execution node)?
> 
> Even if only one worker is assigned to the PartialSeqScan, it scans
> the relation sequentially (because there are no other parallel workers).
> Then, once another worker is additionally launched to scan the same
> relation, both workers begin to co-operate using a common block index
> kept in shared memory.
> So, do we need to wait for completion of the non-parallel-aware nodes here?
> 
> I assume it is better to launch the PartialSeqScan, even with only one
> worker, than to synchronize, because other workers can join the execution later.
> 
> > > For more generic plan construction, Plan node may have a field for
> > > number of "desirable" workers even though most of plan-nodes are not
> > > parallel aware, and it is not guaranteed.
> > > In above case, the parallel aware Append will want 5 workers in total
> > > (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> > > of Gather node how many workers are actually launched, however, it
> > > will give clear information how many workers are best.
> >
> > Yeah, maybe.  I haven't thought about this deeply just yet, but I
> > agree it needs more consideration.
> >
> OK, I'll build a patch including the concept.
> 
> > >> First, we can teach Append that, when running in parallel,
> > >> it should initialize a chunk of dynamic shared memory with an array
> > >> indicating how many workers are currently working on each subplan.
> > > Can the parallel-aware Append can recognize the current mode using
> > > MyBgworkerEntry whether it is valid or not?
> >
> > No - that would be quite wrong.  What it needs to do is define
> > ExecAppendEstimate and ExecAppendInitializeDSM and call those
> > functions from ExecParallelEstimate and ExecParallelInitializeDSM.  It
> > also needs to define a third callback ExecAppendInitializeWorker which
> > will be called from the ExecParallelInitializeWorker function added by
> > the latest partial seq scan patch.  ExecAppendEstimate must estimate
> > required shared memory usage for the shared memory state;
> > ExecAppendInitializeDSM must initialize that state, store a pointer to
> > it in the planstate note, and add a TOC entry; ExecAppendWorker will
> > run in the worker and should look up the TOC entry and store the
> > result in the same planstate node that ExecAppendInitializeDSM
> > populated in the leader.  Then ExecAppend can decide what to do based
> > on whether that pointer is set, and based on the data to which it
> > points.
> >
> Thanks for the clear direction.
> 
> > Are you going to look at implementing this?
> >
> I feel the scale of the implementation is not large, as long as the
> Append node itself is not required to launch new worker processes.
> Let me try it.
> 
> Best regards,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
