Re: Parallel Seq Scan - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Parallel Seq Scan
Date: 2015-02-10 14:08:38
Msg-id: 20150210140838.GI21017@alap3.anarazel.de
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2015-02-10 08:52:09 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Note that I'm not saying that Amit's patch is right - I haven't read it
> > - but that I don't think a 'scan this range of pages' heapscan API would
> > be a bad idea. Not even just for parallelism, but for a bunch of use
> > cases.
> 
> We do have that, already.  heap_setscanlimits().  I'm just not
> convinced that that's the right way to split up a parallel scan.
> There's too much risk of ending up with a very-uneven distribution of
> work.

If you make the chunks small enough, and then coordinate only the chunk
distribution, not really.
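
Just to illustrate what I mean - a rough sketch (ParallelScanShared,
CHUNK_BLOCKS and worker_scan_chunks() are all invented here, this is not
working code; heap_setscanlimits() is the existing heapam call):

#include "postgres.h"

#include "access/heapam.h"
#include "port/atomics.h"

#define CHUNK_BLOCKS	2048	/* 16MB worth of 8kB blocks */

typedef struct ParallelScanShared
{
	pg_atomic_uint32 next_block;	/* next unclaimed block number */
	uint32		nblocks;			/* total blocks in the relation */
} ParallelScanShared;

static void
worker_scan_chunks(Relation rel, Snapshot snap, ParallelScanShared *shared)
{
	HeapScanDesc scan = heap_beginscan(rel, snap, 0, NULL);
	HeapTuple	tup;

	for (;;)
	{
		uint32		start = pg_atomic_fetch_add_u32(&shared->next_block,
													CHUNK_BLOCKS);

		if (start >= shared->nblocks)
			break;				/* no chunks left */

		/* restrict this worker's scan to the claimed block range */
		heap_rescan(scan, NULL);
		heap_setscanlimits(scan, start,
						   Min((uint32) CHUNK_BLOCKS,
							   shared->nblocks - start));

		while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
		{
			/* tuple stays in this worker; feed it to the local plan */
		}
	}
	heap_endscan(scan);
}

With 16MB chunks the handout is cheap (one atomic add per chunk) and the
worst case straggler wait is bounded by a single chunk.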

> >> Regarding tuple flow between backends, I've thought about that before,
> >> I agree that we need it, and I don't think I know how to do it.  I can
> >> see how to have a group of processes executing a single node in
> >> parallel, or a single process executing a group of nodes we break off
> >> from the query tree and push down to it, but what you're talking about
> >> here is a group of processes executing a group of nodes jointly.
> >
> > I don't think it really is that. I think you'd do it essentially by
> > introducing a couple more nodes. Something like
> >
> >                               SomeUpperLayerNode
> >                                       |
> >                                       |
> >                                  AggCombinerNode
> >                                     /   \
> >                                    /     \
> >                                   /       \
> >                    PartialHashAggNode   PartialHashAggNode ... PartialHashAggNode
> >                            |                    |
> >                            |                    |
> >                            |                    |
> >                            |                    |
> >                     PartialSeqScan        PartialSeqScan
> >
> > The only thing that might potentially need to end up working jointly
> > would be the block selection of the individual PartialSeqScans, to
> > avoid having to wait too long for stragglers. E.g. each might just ask
> > for a range of 16 megabytes or so that it scans sequentially.
> >
> > In such a plan - a pretty sensible and not that uncommon thing for
> > parallelized aggregates - you'd need to be able to tell the heap scans
> > which blocks to scan. Right?
> 
> For this case, what I would imagine is that there is one parallel heap
> scan, and each PartialSeqScan attaches to it.  The executor says "give
> me a tuple" and heapam.c provides one.  Details like the chunk size
> are managed down inside heapam.c, and the executor does not know about
> them.  It just knows that it can establish a parallel scan and then
> pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple flow
is organized to happen only in the necessary cases, the performance will
be horrible. And good chunk sizes et al. depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

An individual heap scan's state lives in process-private memory. And if
the results inside the separate workers should be used directly in these
workers, without shipping them over the network, it'd be horrible to
have that logic in the heapscan. How would you otherwise model an
executor tree that does the seqscan and aggregation combined, in
multiple processes at the same time?
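
To make that concrete, here's roughly the state split I'm arguing for
(struct and field names invented; ParallelScanShared is the sketched
chunk allocator from above): everything about the scan itself stays
backend-private, and the only thing reachable from other processes is
the tiny chunk allocator:

typedef struct PartialSeqScanState
{
	ScanState	ss;				/* private: standard executor scan node */
	HeapScanDesc scan;			/* private: this backend's heap scan */
	ParallelScanShared *coord;	/* shared: block-range allocator only */
} PartialSeqScanState;

Each worker would instantiate the whole PartialSeqScan -> PartialHashAgg
subtree locally and aggregate the tuples it scans itself; nothing but
the chunk allocator crosses process boundaries.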

> >> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> >> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> >> (sort, hash, materialize), or something like that, but I'm really
> >> fuzzy on the details.
> >
> > I don't think we really should have individual nodes that produce
> > multiple streams - that seems like it'd end up being really
> > complicated. I'd more say that we have distinct nodes (like the
> > PartialSeqScan ones above) that do a teensy bit of coordination about
> > which work to perform.
> 
> I think we're in violent agreement here, except for some
> terminological confusion.  Are there N PartialSeqScan nodes, one
> running in each worker, or is there one ParallelSeqScan node, which is
> copied and run jointly across N workers?  You can talk about it either
> way and have it make sense, but we haven't had enough conversations
> about this on this list to have settled on a consistent set of
> vocabulary yet.

I pretty strongly believe that it has to be independent scan nodes, both
from an implementation and from a conversational POV. They might have
some very light cooperation between them (e.g. coordinating block ranges
or such), but everything else should be separate. From an implementation
POV it seems pretty awful to have an executor node that's accessed by
multiple separate backends - that'd mean it would have to be concurrency
safe, have its state in shared memory, and everything.

Now, there'll be a node that needs to do some parallel magic - but in
the above example that should be the AggCombinerNode, which would not
ask for tuples from just one child at a time, but from multiple children
in parallel. But even then it doesn't have to deal with concurrency
around its own state.
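
In pseudo-C, roughly (everything below is invented, including the child
queue API - just to show the shape): the combiner polls its children
round-robin instead of draining one child at a time, and every piece of
state it mutates is its own:

static TupleTableSlot *
ExecAggCombiner(AggCombinerState *node)
{
	while (node->nrunning > 0)
	{
		int			i = node->next_child;
		TupleTableSlot *slot;

		node->next_child = (i + 1) % node->nchildren;

		switch (poll_child(node, i, &slot))		/* non-blocking read */
		{
			case CHILD_TUPLE:
				combine_partial_agg(node, slot);	/* merge partial agg */
				break;
			case CHILD_DONE:
				node->nrunning--;	/* this child is exhausted */
				break;
			case CHILD_EMPTY:
				/* nothing ready; a real version would wait on a latch
				 * here instead of busy-looping */
				break;
		}
	}
	/* all children done; emit the combined results */
	return emit_combined_result(node);
}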

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


