Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel Seq Scan
Date
Msg-id CA+TgmobM7X6jgre442638b+33h1EWa=vcZqnsvzEdX057ZHVuw@mail.gmail.com
In response to Re: Parallel Seq Scan  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Parallel Seq Scan  (Kohei KaiGai <kaigai@kaigai.gr.jp>)
Re: Parallel Seq Scan  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I still think this belongs in heapam.c somehow or other.  If the logic
>> is all in the executor, then it becomes impossible for any code that
>> doesn't use the executor to do a parallel heap scan, and that's
>> probably bad.  It's not hard to imagine something like CLUSTER wanting
>> to reuse that code, and that won't be possible if the logic is up in
>> some higher layer.
>
> Yea.
>
>> If the logic we want is to start with a large chunk size and then
>> switch to a small chunk size when there's not much of the relation
>> left to scan, there's still no reason that can't be encapsulated in
>> heapam.c.
>
> I don't mind having some logic in there, but I think you put in too
> much. The snapshot stuff should imo go, and the next page logic should
> be caller provided.

If we need a way for the caller to supply the next-page logic, then I
think that should be done via configuration arguments or
flags, not a callback.  There's just no way that the needs of the
executor are going to be so radically different from a utility command
that only a callback will do.
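
To make the "configuration arguments, not a callback" idea concrete, here is a minimal sketch in Python rather than C. Everything in it (ChunkPolicy, ParallelHeapScan, next_chunk, and the field names) is hypothetical and is not anything in heapam.c. It only shows that "start with large chunks, switch to small chunks near the end" can be expressed with a few caller-supplied numbers. A real shared scan would also need locking around the block counter, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPolicy:
    """Hypothetical knobs a caller passes instead of a next-page callback."""
    large_chunk: int = 64   # pages handed out per request early in the scan
    small_chunk: int = 1    # pages handed out per request near the end
    tail_pages: int = 128   # use small_chunk once this few pages remain


class ParallelHeapScan:
    """Toy stand-in for shared scan state (no locking, illustration only)."""

    def __init__(self, nblocks: int, policy: ChunkPolicy):
        self.nblocks = nblocks
        self.policy = policy
        self.next_block = 0

    def next_chunk(self):
        """Return (first_block, page_count) for the next chunk, or None when done."""
        remaining = self.nblocks - self.next_block
        if remaining <= 0:
            return None
        if remaining <= self.policy.tail_pages:
            size = self.policy.small_chunk
        else:
            size = self.policy.large_chunk
        size = min(size, remaining)
        start = self.next_block
        self.next_block += size
        return (start, size)
```

An executor node and a utility command such as CLUSTER could then differ only in the ChunkPolicy they pass in.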

>> I think it makes sense to think of a set of tasks in which workers can
>> assist.  So you have a query tree which is just one query tree, with no
>> copies of the nodes, and then there are certain places in that query
>> tree where a worker can jump in and assist that node.  To do that, it
>> will have a copy of the node, but that doesn't mean that all of the
>> stuff inside the node becomes shared data at the code level, because
>> that would be stupid.
>
> My only "problem" with that description is that I think workers will
> have to work on more than one node - it'll be entire subtrees of the
> executor tree.

Amit and I had a long discussion about this on Friday while in Boston
together.  I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan.  However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be
running two different nodes.  This led us to cast about for a better
design, and we came up with something that I think will be much
better.

The basic idea is to introduce a new node called Funnel.  A Funnel
node will have a left child but no right child, and its job will be to
fire up a given number of workers.  Each worker will execute the plan
which is the left child of the funnel.  The funnel node itself will
pull tuples from all of those workers, and can also (if there are no
tuples available from any worker) execute the plan itself.  So a
parallel sequential scan will look something like this:

Funnel
  Workers: 4
  -> Partial Heap Scan on xyz

What this is saying is that each worker is going to scan part of the
heap for xyz; together, they will scan the whole thing.
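
As a rough illustration of that behavior, here is a toy simulation in Python using threads in place of background workers. All names (SharedScan, partial_heap_scan, funnel) are made up for this sketch, and block-at-a-time claiming is an assumption chosen for simplicity. It only demonstrates the control flow: workers each scan whatever blocks they claim, and the funnel pulls tuples from the workers, running the child plan itself when no worker tuple is ready.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the master's own scan


class SharedScan:
    """Hands out block numbers one at a time to workers and master alike."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.next_block = 0
        self.lock = threading.Lock()

    def claim(self):
        with self.lock:
            if self.next_block >= self.nblocks:
                return None
            block = self.next_block
            self.next_block += 1
            return block


def partial_heap_scan(heap, scan):
    """Yield tuples from whichever blocks this participant manages to claim."""
    while (block := scan.claim()) is not None:
        yield from heap[block]


def funnel(heap, nworkers):
    """Yield every tuple in heap exactly once, in no particular order."""
    scan = SharedScan(len(heap))
    tuples = queue.Queue()

    def worker():
        for tup in partial_heap_scan(heap, scan):
            tuples.put(tup)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()

    own = partial_heap_scan(heap, scan)
    while True:
        try:
            yield tuples.get_nowait()
        except queue.Empty:
            # No worker tuple is ready, so the master runs the plan itself.
            tup = next(own, _DONE)
            if tup is _DONE:
                break  # every block has been claimed by someone
            yield tup

    # Workers may still be emitting tuples from their final blocks.
    for t in threads:
        t.join()
    while not tuples.empty():
        yield tuples.get_nowait()
```

With nworkers set to 0 the master simply scans the whole heap by itself, which matches the idea that the funnel can execute the plan when nobody else is producing tuples.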

The neat thing about this way of separating things out is that we can
eventually write code to push more stuff down into the funnel.  For
example, consider this:

Nested Loop
  -> Seq Scan on foo
  -> Index Scan on bar
       Index Cond: bar.x = foo.x

Now, if a parallel sequential scan is cheaper than a regular
sequential scan, we can instead do this:

Nested Loop
  -> Funnel
       -> Partial Heap Scan on foo
  -> Index Scan on bar
       Index Cond: bar.x = foo.x

The problem with this is that the nested loop/index scan is happening
entirely in the master.  But we can have logic that fixes that by
knowing that a nested loop can be pushed through a funnel, yielding
this:

Funnel
  -> Nested Loop
       -> Partial Heap Scan on foo
       -> Index Scan on bar
            Index Cond: bar.x = foo.x
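
The rewrite itself is just a local tree transformation. A minimal sketch in Python, with a hypothetical Plan class and function name that are not planner code: it turns Nested Loop(Funnel(X), Y) into Funnel(Nested Loop(X, Y)). It deliberately ignores the real questions, such as whether the inner side is safe to run in every worker and whether the result is actually cheaper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    """Toy plan node: a label plus optional left (outer) and right (inner) children."""
    kind: str
    left: Optional["Plan"] = None
    right: Optional["Plan"] = None


def push_nestloop_through_funnel(plan: Plan) -> Plan:
    """Rewrite NestLoop(Funnel(X), Y) as Funnel(NestLoop(X, Y)).

    Any other plan shape is returned unchanged.
    """
    if (plan.kind == "Nested Loop"
            and plan.left is not None
            and plan.left.kind == "Funnel"):
        funnel = plan.left
        joined = Plan("Nested Loop", left=funnel.left, right=plan.right)
        return Plan("Funnel", left=joined)
    return plan
```

The result keeps the Funnel's shape from earlier in this mail: a left child and no right child.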

Now that's pretty neat.  One can also imagine doing this with
aggregates.  Consider:

HashAggregate
  -> Funnel
       -> Partial Heap Scan on foo
            Filter: x = 1

Here, we can't just push the HashAggregate through the funnel, but
given infrastructure for partial aggregation, we could convert that to
something like this:

HashAggregateFinish
  -> Funnel
       -> HashAggregatePartial
            -> Partial Heap Scan on foo
                 Filter: x = 1

That'd be swell.
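
For a sense of what the two halves would do, here is a small Python sketch for a count(*) grouped by some key. The function names mirror the node names above but are otherwise invented, and nothing here is actual executor code. Each worker builds a partial hash table over the rows it scanned, and the finish step merges those tables.

```python
from collections import Counter


def hash_aggregate_partial(rows, key):
    """Per-worker step: count the rows this worker scanned, per group."""
    return Counter(key(row) for row in rows)


def hash_aggregate_finish(partials):
    """Master step: merge the per-worker partial counts into final counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

This works for count because partial counts can simply be added. An aggregate like avg would have to carry a (sum, count) pair as its partial state and divide only in the finish step.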

You can see that something like this will also work for breaking off
an entire plan tree and shoving it down into a worker.  The master
can't participate in the computation in that case, but it's otherwise
the same idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


