From: Robert Haas
Subject: Re: [DESIGN] ParallelAppend
Msg-id: CA+Tgmob18VZEhghSzeWqqEHBPQ89j7B9gJadytN=v0OEBtNM1A@mail.gmail.com
In response to: Re: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses: Re: [DESIGN] ParallelAppend (Thom Brown <thom@linux.com>)
           Re: [DESIGN] ParallelAppend (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I'm now designing the parallel feature of Append...
>
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?

I've been thinking about these questions for a bit now, and I think we
can work on improving Append in multiple phases.  The attached patch
shows what I have in mind for phase 1.  Currently, if you set up an
inheritance hierarchy, you might get an Append some of whose children
are Gather nodes with Parallel Seq Scans under them, like this:

Append
-> Gather
  -> Parallel Seq Scan
-> Gather
  -> Parallel Seq Scan
-> Seq Scan

This is a crappy plan.  Each Gather node will try to launch its own
bunch of workers, which sucks.  The attached patch lets you get this
plan instead:

Gather
-> Append
  -> Partial Seq Scan
  -> Partial Seq Scan
  -> Partial Seq Scan

That's a much better plan.

To make this work, this patch introduces a couple of new concepts.
Each RelOptInfo gets a new partial_pathlist field, which stores paths
that need to be gathered to produce a workable plan.  Once we've
populated the partial_pathlist with whatever partial paths we know how
to generate, we can either push a Gather node on top of one of those
partial paths to create a real path; or we can use those partial paths
to build more partial paths.  The current patch handles only the
simplest case: we can build a partial path for an appendrel by
appending a partial path for each member rel.  But eventually I hope
to handle joinrels this way as well: you can join a partial path with
an ordinary path to form a partial path for the joinrel.

This requires some way of figuring out how many workers to request for
the append-path, so this patch also adds a parallel_degree field to
the path object, which is 0 for non-parallel things and the number of
workers that the path believes to be ideal otherwise.  Right now, it
just percolates the highest worker count of any child up to the
appendrel, which might not be right, especially once the append node
knows how to balance workers, but it seems like a reasonable place to
start.

> Type-A) It can be performed on background worker context and
>    picked up by multiple worker processes concurrently.
>   (e.g: Parallel SeqScan)
> Type-B) It can be performed on background worker context but
>   cannot be picked up by multiple worker processes.
>   (e.g: non-parallel aware node)
> Type-C) It should not be performed on background worker context.
>    (e.g: plan/path node with volatile functions)

At the time that we're forming an AppendPath, we can identify what
you're calling type-A paths very easily: they're the ones that are
coming from the partial_pathlist.  We can distinguish between type-B
and type-C paths coming from the childrel's pathlist based on the
childrel's consider_parallel flag: if it's false, it's type-C, else
type-B.  At some point, we might need a per-path flag to distinguish
cases where a particular path is type-C even though some other plan
for that relation might be type-B.  But right now that case doesn't
arise.

Now, of course, it's not good enough to have this information
available when we're generating the AppendPath; it has to survive
until execution time.  But that doesn't mean the paths need to be
self-identifying.  We could, of course, decide to make them so, and
maybe that's the best design, but I'm thinking about another approach:
suppose the append node itself, instead of having one list of child
plans, has a list of type-A plans, a list of type-B plans, and a list
of type-C plans.  This actually seems quite convenient, because at
execution time, you presumably want the leader to prioritize type-C
plans over any of the others, since it's the only one that can execute
them, and the workers to prioritize type-B plans, since they can only
take one worker each and are thus prone to be the limiting factor on
finishing the whole Append.  Having the plans divided in advance
between these three lists (say, restricted_plans, safe_plans,
parallel_plans) makes that easy to implement.

Incidentally, I think it's subtly wrong to think of the parallel_aware
flag as telling you whether the plan can absorb multiple workers.
That's not really what it's for.  It's to tell you whether the plan is
doing *something* parallel aware - that is, whether its Estimate,
InitializeDSM, and InitializeWorker callbacks should do anything.  For
SeqScan, flipping parallel_aware actually does split the input among
all the workers, but for Append it would probably just load-balance,
and for other nodes it might be something else again.  The term I'm using
to indicate a path/plan that returns only a subset of the results in
each worker is "partial".  Whether or not a path is partial is, in the
design embodied in this patch, indicated both by whether
path->parallel_degree > 0 and whether the path is in rel->pathlist or
rel->partial_pathlist.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
