Re: [DESIGN] ParallelAppend

From: Robert Haas
Subject: Re: [DESIGN] ParallelAppend
Msg-id: CA+TgmoYu2yQRjtZKZH_pY+F1DQnuOkCsvMNCFfhUNqG2P5hSsw@mail.gmail.com
In response to: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
List: pgsql-hackers
On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I've recently been working on a ParallelAppend feature for the
> next commitfest. Below is my design proposal.
>
> 1. Concept
> ----------
> The concept is quite simple; anybody might have considered it
> more than once. A ParallelAppend node kicks off background worker
> processes to execute its child nodes in parallel / asynchronously.
> It aims to improve the performance of scanning large partitioned
> tables in terms of overall throughput; latency to the first few
> hundred rows is out of scope for this project.
> From the standpoint of technology trends, it primarily tries to
> exploit the multi-core capability of a single system, but it also
> makes it possible to scale out to a distributed database
> environment using the foreign-table inheritance feature.
> Its behavior is very similar to the Funnel node except on a few
> points, so we can reuse the infrastructure we discussed at length
> during the v9.5 development cycle.

Now that I've got more of the parallel infrastructure committed, I've
been starting to have a little time to think about what we might want
to do after we get PartialSeqScan committed.  I'm positive on doing
something with the Append node and parallelism, but I'm negative on
doing what you've described here.

I don't think the Append node has any business launching workers.
That's the job of Gather.  Append may need to do something
parallel-aware, but starting workers is not that thing.  Without
making any changes at all to Append, we can use it like this:

Gather
-> Append
  -> Partial Seq Scan on p1
  -> Partial Seq Scan on p2
  -> Partial Seq Scan on p3

The Gather node will spin up workers and in each worker we will ask
the Append nodes for tuples.  Append will ask the first
not-yet-completed child for tuples, so the workers will cooperatively
scan first p1, then p2, then p3.  This is great: instead of merely
doing a parallel seq scan of a single table, we can do a parallel seq
scan of a partitioned table.  However, there are two improvements we
can make.  First, we can teach Append that, when running in parallel,
it should initialize a chunk of dynamic shared memory with an array
indicating how many workers are currently working on each subplan.
Each new worker should join a subplan with the minimum number of
workers, work on that one until it's completely done, and then pick a new
subplan.  This minimizes contention.  Second, we can teach the Append
to serve as a parent not only for intrinsically parallel nodes like
Partial Seq Scan, but also for other nodes, say, an Index Scan.  When
an Append is running in parallel but with non-parallel-aware children,
each such child can be claimed by at most one worker process and will
be run to completion by that worker process.  For example:

Gather
-> Append
  -> Index Scan on p1
  -> Partial Seq Scan on p2
  -> Index Scan on p3

The first worker which executes the Append should begin the index scan
on p1 and the second should begin the index scan on p3.  The remaining
workers, and those two once they're finished, can work on p2.
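
To make that concrete, here is a minimal sketch of the shared-memory
bookkeeping I'm imagining.  Every name below is invented for
illustration; don't read it as actual or proposed source code:

typedef struct ParallelAppendInfo
{
    slock_t     pa_mutex;       /* protects pa_workers[] */

    /*
     * Workers currently running each subplan, or -1 if the subplan
     * is finished or, being non-parallel-aware, has already been
     * claimed by some worker.
     */
    int         pa_workers[FLEXIBLE_ARRAY_MEMBER];
} ParallelAppendInfo;

/*
 * Called by a worker that needs a subplan to work on: pick the
 * available subplan with the fewest workers.  Returns -1 when no
 * subplans remain.
 */
static int
choose_next_subplan(ParallelAppendInfo *painfo, int nplans,
                    bool *parallel_aware)
{
    int         best = -1;
    int         i;

    SpinLockAcquire(&painfo->pa_mutex);
    for (i = 0; i < nplans; i++)
    {
        if (painfo->pa_workers[i] < 0)
            continue;           /* finished or exclusively claimed */
        if (best < 0 || painfo->pa_workers[i] < painfo->pa_workers[best])
            best = i;
    }
    if (best >= 0)
    {
        if (parallel_aware[best])
            painfo->pa_workers[best]++;     /* join this subplan */
        else
            painfo->pa_workers[best] = -1;  /* claim it exclusively */
    }
    SpinLockRelease(&painfo->pa_mutex);

    return best;
}

A worker that exhausts a parallel-aware subplan would mark its entry
-1 so that nobody else joins it; I'm glossing over that, and over
tie-breaking, here.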

Proceeding in this way, I don't think we need a separate Parallel
Append node.  Rather, we can just give the existing Append node some
extra smarts that are used only when it's running in parallel.

We can also push other things in between the Gather and the Append,
which wouldn't be possible in your design.  For example, consider a
join between a partitioned table p and an unpartitioned table q.  We
could do this:

Gather
-> Nested Loop
  -> Append
    -> Index Scan on p1
    -> Partial Seq Scan on p2
    -> Index Scan on p3
  -> Index Scan on q
       Index Cond: q.x = p.x

That's a pretty sweet plan.  Assuming p1, p2, and p3 are all
reasonably large, we could probably benefit from throwing at least 3
processes at this plan tree - maybe more, if p2 is really big.  Only
one process can work on each of p1 and p3, but p2, since it has a
truly parallel plan, can soak up as many as we want to throw at it
(although performance may top out at some point if we're I/O-bound).
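
In case it isn't obvious why p2 can absorb any number of workers: the
scan position of a partial scan lives in shared memory, and each
worker just claims the next unscanned block.  Very roughly, with
invented names (this is not the actual parallel seq scan code):

typedef struct PartialScanShared
{
    pg_atomic_uint64 ps_nextblock;  /* next block nobody has claimed */
    uint64           ps_nblocks;    /* total blocks in the relation */
} PartialScanShared;

static BlockNumber
partial_scan_next_block(PartialScanShared *shared)
{
    /* atomically claim the next unscanned block */
    uint64      next = pg_atomic_fetch_add_u64(&shared->ps_nextblock, 1);

    return next < shared->ps_nblocks ? (BlockNumber) next
                                     : InvalidBlockNumber;
}

The more workers call this, the faster the blocks get handed out,
until the disk becomes the bottleneck.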

Sorry for taking so long to give a really substantive reply on this,
but it wasn't until this last week or so that I really had time to
think about this in detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


