Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Parallel Seq Scan
Date
Msg-id 20150210074841.GA21017@alap3.anarazel.de
In response to Re: Parallel Seq Scan  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel Seq Scan  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2015-02-07 17:16:12 -0500, Robert Haas wrote:
> On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > [ criticism of Amit's heapam integration ]

> > I'm not convinced that that reasoning is generally valid. While it may
> > work out nicely for seqscans - which might be useful enough on its own -
> > the more stuff we parallelize the *more* the executor will have to know
> > about it to make it sane. To actually scale nicely e.g. a parallel sort
> > will have to execute the nodes below it on each backend, instead of
> > doing that in one as a separate step, ferrying over all tuples to
> > individual backends through queues, and only then parallelizing the
> > sort.
> >
> > Now. None of that is likely to matter immediately, but I think starting
> > to build the infrastructure at the points where we'll later need it does
> > make some sense.
>
> Well, I agree with you, but I'm not really sure what that has to do
> with the issue at hand.  I mean, if we were to apply Amit's patch,
> we'd been in a situation where, for a non-parallel heap scan, heapam.c
> decides the order in which blocks get scanned, but for a parallel heap
> scan, nodeParallelSeqscan.c makes that decision.  Maybe I'm an old
> fuddy-duddy[1] but that seems like an abstraction violation to me.  I
> think the executor should see a parallel scan as a stream of tuples
> that streams into a bunch of backends in parallel, without really
> knowing how heapam.c is dividing up the work.  That's how it's
> modularized today, and I don't see a reason to change it.  Do you?

I don't really agree. Normally heapam just sequentially scans the heap
in one go; there's not much logic to that. Ok, then there's also the
synchronized seqscan stuff - which just about every user of heapscans
but the executor promptly disables again. I don't think a
heap_scan_page() or similar API would constitute a relevant layering
violation over what we already have.

Note that I'm not saying that Amit's patch is right - I haven't read it
- but I don't think a 'scan this range of pages' heapscan API would be
a bad idea. Not even just for parallelism, but for a bunch of use
cases.

> Regarding tuple flow between backends, I've thought about that before,
> I agree that we need it, and I don't think I know how to do it.  I can
> see how to have a group of processes executing a single node in
> parallel, or a single process executing a group of nodes we break off
> from the query tree and push down to it, but what you're talking about
> here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by
introducing a couple more nodes. Something like
                  SomeUpperLayerNode
                          |
                          |
                   AggCombinerNode
                    /           \
                   /             \
                  /               \
   PartialHashAggNode        PartialHashAggNode ... PartialHashAggNode ...
          |                         |
          |                         |
    PartialSeqScan            PartialSeqScan

The only thing that'd potentially need to work jointly would be the
block selection of the individual PartialSeqScans, to avoid having to
wait too long for stragglers. E.g. each might just ask for a range of
16 megabytes or so that it scans sequentially.

In such a plan - a pretty sensible and not that uncommon thing for
parallelized aggregates - you'd need to be able to tell the heap scans
which blocks to scan. Right?

> That seems like an excellent idea, but I don't know how to design it.
> Actually routing the tuples between whichever backends we want to
> exchange them between is easy enough, but how do we decide whether to
> generate such a plan?  What does the actual plan tree look like?

I described above roughly how I think it'd look. Whether to generate it
probably would be dependent on the cardinality (not much point in doing
the above if all groups are distinct) and possibly the aggregates in
use (if we have a parallelizable sum/count/avg etc).

> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> (sort, hash, materialize), or something like that, but I'm really
> fuzzy on the details.

I don't think we really should have individual nodes that produce
multiple streams - that seems like it'd end up being really
complicated. I'd more say that we have distinct nodes (like the
PartialSeqScan ones above) that do a teensy bit of coordination about
which work to perform.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


