Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel Seq Scan
Msg-id CAA4eK1KxH4MiD77851MBvRr8LaOBfNtRr-4c4XtdzdTx1FHTog@mail.gmail.com
In response to Re: Parallel Seq Scan  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On Tue, Apr 21, 2015 at 2:29 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> I've also been thinking about how, instead of having a special PartialSeqScan node containing a bunch of code to store tuples in a shared memory queue, we could have a "TupleBuffer" or "ParallelTupleReader" node, one of which would always be the root node of a plan branch that's handed off to a worker process. This node would just try to keep its shared tuple store full; once it fills, it could sleep for a bit and be woken when there's more space in the queue. When no more tuples are available from the node below, the worker can exit (provided no rescan is required).
>
> I think that between the Funnel node and a ParallelTupleReader we could actually parallelise plans that don't even have parallel-safe nodes. Let me explain:
>
> Let's say we have a 4-way join, and the join order must be {a,b}, {c,d} => {a,b,c,d}. Assuming the costs of joining a to b and of joining c to d are about the same, the parallelizer might notice this and inject a Funnel and then a ParallelTupleReader just below the node for c join d, so that c is joined to d in parallel. Meanwhile the main process could be executing the root node as normal. That way the main process wouldn't have to go to the trouble of joining c to d itself, since the worker would have done that work already.
>
> I know the current patch is still very early in the evolution of PostgreSQL's parallel query, but how would that work with the current method of selecting which parts of the plan to parallelise?
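The buffer-filling loop described above might look, in very rough outline, like the sketch below. This is illustrative C only, not PostgreSQL code: all names are hypothetical, and a pthread mutex/condvar ring buffer stands in for the shared-memory tuple queue. The worker blocks (the "bit of a sleep") while the queue is full and exits once its subplan is exhausted; the leader drains the queue until the worker signals completion.

```c
/* Toy sketch of a "ParallelTupleReader"-style worker: keep a shared tuple
 * queue full, block while it is full, exit when the subplan is exhausted.
 * A fixed-size ring buffer protected by a mutex and condition variables
 * stands in for PostgreSQL's shared-memory tuple queue. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_SIZE 8
#define NTUPLES    100          /* tuples "produced" by the worker's subplan */

typedef struct {
    int             items[QUEUE_SIZE];
    int             head, tail, count;
    bool            done;       /* set by the worker when no more tuples */
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
} TupleQueue;

static void queue_init(TupleQueue *q)
{
    q->head = q->tail = q->count = 0;
    q->done = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Worker: push tuples, blocking ("sleeping") while the queue is full. */
static void *worker_main(void *arg)
{
    TupleQueue *q = arg;

    for (int i = 0; i < NTUPLES; i++)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_SIZE)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = i;
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }
    /* No more tuples from the node below: mark done and exit the worker. */
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Leader: drain the queue until the worker reports completion. */
static long drain_queue(TupleQueue *q)
{
    long sum = 0;

    for (;;)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0 && q->done)
        {
            pthread_mutex_unlock(&q->lock);
            return sum;
        }
        sum += q->items[q->head];
        q->head = (q->head + 1) % QUEUE_SIZE;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
    }
}

long run_demo(void)
{
    TupleQueue q;
    pthread_t  worker;

    queue_init(&q);
    pthread_create(&worker, NULL, worker_main, &q);
    long sum = drain_queue(&q);
    pthread_join(worker, NULL);
    return sum;                 /* sum of the 100 tuples produced */
}
```

The key property the sketch shows is backpressure: the worker never overruns the fixed-size queue, so a small shared-memory buffer suffices regardless of how many tuples the subplan produces.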

The Funnel node is quite generic and can handle the case you describe if we add a Funnel on top of the join node (c join d). It currently passes the plannedstmt to the worker, and that can contain any type of plan (though we need some more code before the worker can execute any node other than a Result or PartialSeqScan node).
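The hand-off described here — the leader serializes a plan, and a generic worker entry point executes whatever node tree it finds — can be sketched as follows. This is an illustrative toy, not PostgreSQL code: the "plan" is a hypothetical struct describing a filtered range scan, and a thread sharing memory stands in for a background worker reading a serialized PlannedStmt out of shared memory.

```c
/* Toy sketch of the Funnel hand-off: the leader hands a "serialized" plan
 * to a worker, and the worker executes whatever plan it was given.  The
 * ToyPlan struct is purely illustrative; in PostgreSQL the serialized
 * PlannedStmt would travel through dynamic shared memory. */
#include <assert.h>
#include <pthread.h>

typedef struct {
    int  lo, hi;    /* scan range [lo, hi) */
    int  modulus;   /* filter: keep values where value % modulus == 0 */
    long result;    /* filled in by the worker */
} ToyPlan;

/* Generic worker entry point: interpret whatever plan it was handed. */
static void *toy_worker_main(void *arg)
{
    ToyPlan *plan = arg;
    long     count = 0;

    for (int v = plan->lo; v < plan->hi; v++)
        if (v % plan->modulus == 0)
            count++;
    plan->result = count;
    return NULL;
}

long run_funnel_demo(void)
{
    /* Leader: build a plan and launch a worker to execute it. */
    ToyPlan   plan = { .lo = 0, .hi = 1000, .modulus = 7, .result = 0 };
    pthread_t worker;

    pthread_create(&worker, NULL, toy_worker_main, &plan);
    pthread_join(worker, NULL);
    return plan.result;
}
```

Because the worker interprets the plan rather than hard-coding a node type, extending it to new node types is a matter of teaching the executor side to handle them, which matches the point above that only "some more code" is needed beyond Result and PartialSeqScan.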
 
> I really think the plan needs to be a complete plan before it can best be analysed for how to divide the workload between workers, and it would also be quite useful to know how many workers will be able to lend a hand, in order to divide the plan up as evenly as possible.


I think there is some advantage in changing an already-built plan into a parallel plan based on available resources, and there is some literature on that approach, but I think we would lose much more by not considering parallelism during planning time.  If I remember correctly, some of the other databases do tackle this problem of resource shortage during execution, as I mentioned upthread, but I don't think that requires a Parallel Planner as a separate layer. I believe it is important to have some way to handle a shortage of resources during execution, but that can be done at a later stage.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
