Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F801170BD5@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: [DESIGN] ParallelAppend  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
> 
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
> 
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
> 
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
> 
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
> 
> That's a much better plan.
> 
> To make this work, this plan introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path for form a partial path for the joinrel.
>
This idea will solve my concern gracefully.
The new partial_pathlist keeps candidate of path-nodes to be gathered
in this level or upper. Unlike path-nodes in the pathlist already, we
don't need to rip off GatherPath later.

Can we expect any path-nodes in the partial_pathlist don't contain
underlying GatherPath even if and when we would apply this design on
joinrel also?

If we would be able to ensure the path-nodes in partial_pathlist is
safe to run under the Gather node - it never contains Gather itself
or any others should not perform on the background worker context,
it will make path consideration much simpler than my expectation.

> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
>
I agree with. The new parallel_degree will give useful information,
and one worker per child relation is a good start.

> > Type-A) It can be performed on background worker context and
> >    picked up by multiple worker processes concurrently.
> >   (e.g: Parallel SeqScan)
> > Type-B) It can be performed on background worker context but
> >   cannot be picked up by multiple worker processes.
> >   (e.g: non-parallel aware node)
> > Type-C) It should not be performed on background worker context.
> >    (e.g: plan/path node with volatile functions)
> 
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
>
I also think we eventually have to have a per-path flag when we support
parallel capability on joinrel, although, it is not an immediate action.

> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are thus prone to be the limiting factor on
> finishing the whole Append.  Having the plans divided in advance
> between these three lists (say, restricted_plans, safe_plans,
> parallel_plans) makes that easy to implement.
>
I'd like to agree with this idea. If Append can handle restricted_plans
concurrently with safe_plans and parallel_plans, we don't need to give
up parallelism even if any of child relation has neither safe- nor
parallel-plans.
One thing we need to pay attention is, we have to inform Gather node
to kick local sub-plans if underlying Append node has any restricted
plans. It also needs to distinguish the case when Gather node cannot
launch any background workers, because the first case runs only type-C
but the second case has to run all the sub-plans in local context.

> Incidentally, I think it's subtly wrong to think of the parallel_aware
> flag as telling you whether the plan can absorb multiple workers.
> That's not really what it's for.  It's to tell you whether the plan is
> doing *something* parallel aware - that is, whether its Estimate,
> InitializeDSM, and InitializeWorker callbacks should do anything.  For
> SeqScan, flipping parallel_aware actually does split the input among
> all the workers, but for Append it's probably just load balances and
> for other nodes it might be something else again.  The term I'm using
> to indicate a path/plan that returns only a subset of the results in
> each worker is "partial".
>
Therefore, a NestLoop that takes underlying ParallelSeqScan and IndexScan
may not be parallel aware by itself, however, it is exactly partial.
This NestLoop will has parallel_degree likely larger than "1", won't it?

It seems to me the "partial" is more clear concept to introduce how sub-
plan will perform.

> Whether or not a path is partial is, in the
> design embodied in this patch, indicated both by whether
> path->parallel_degree > 0 and whether the path is in rel->pathlist or
> rel->partial_pathlist.
>
We should have Assert to detect paths with parallel_degree==0 but in
the rel->partial_pathlist or parallel_degree > 1 but not appear in
the rel->partial_pathlist?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: [PATCH] SQL function to report log message
Next
From: Alvaro Herrera
Date:
Subject: Re: Making tab-complete.c easier to maintain