Re: Parallel Seq Scan - Mailing list pgsql-hackers

From: Kohei KaiGai
Subject: Re: Parallel Seq Scan
Msg-id: CADyhKSVn=yy1jQkFhX-SPMv4fpwGkGBXA3QrU_gsqsPffU-9fw@mail.gmail.com
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
> Amit and I had a long discussion about this on Friday while in Boston
> together.  I previously argued that the master and the slave should be
> executing the same node, ParallelSeqScan.  However, Amit argued
> persuasively that what the master is doing is really pretty different
> from what the worker is doing, and that they really ought to be
> running two different nodes.  This led us to cast about for a better
> design, and we came up with something that I think will be much
> better.
>
> The basic idea is to introduce a new node called Funnel.  A Funnel
> node will have a left child but no right child, and its job will be to
> fire up a given number of workers.  Each worker will execute the plan
> which is the left child of the funnel.  The funnel node itself will
> pull tuples from all of those workers, and can also (if there are no
> tuples available from any worker) execute the plan itself.  So a
> parallel sequential scan will look something like this:
>
> Funnel
> Workers: 4
> -> Partial Heap Scan on xyz
>
> What this is saying is that each worker is going to scan part of the
> heap for xyz; together, they will scan the whole thing.
>
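
For concreteness, the leader's pull loop in such a Funnel node might look roughly like the sketch below. FunnelState and the funnel_* helpers are invented names for illustration, not code from an actual patch; the sketch assumes one tuple queue per worker:

    /*
     * Minimal sketch of the Funnel pull loop, assuming one tuple queue
     * per worker.  FunnelState, funnel_try_recv(), and
     * funnel_all_queues_detached() are hypothetical names.
     */
    static TupleTableSlot *
    ExecFunnel(FunnelState *node)
    {
        for (;;)
        {
            int     i;

            /* First, poll each worker's tuple queue without blocking. */
            for (i = 0; i < node->nworkers; i++)
            {
                TupleTableSlot *slot = funnel_try_recv(node, i);

                if (!TupIsNull(slot))
                    return slot;
            }

            /* No worker tuple ready: pull from the child plan ourselves. */
            if (!node->local_scan_done)
            {
                TupleTableSlot *slot = ExecProcNode(outerPlanState(node));

                if (!TupIsNull(slot))
                    return slot;
                node->local_scan_done = true;   /* leader's share is done */
            }

            /* Done only when the local scan and every worker are finished. */
            if (funnel_all_queues_detached(node))
                return NULL;
        }
    }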

What is the best way to determine the number of workers?
A fixed number is one idea. It may also make sense to add a new common field
to Path nodes that indicates how much of the node's execution can be
parallelized, or that it cannot run in parallel at all.
Alternatively, instead of deciding at plan time, we may be able to determine
the number at execution time, according to the number of already-running
workers and the number of CPU cores, as sketched below.
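
A rough sketch of that execution-time variant; currently_running_workers() and get_num_cpu_cores() are assumed helpers, not existing PostgreSQL functions:

    /*
     * Hypothetical execution-time heuristic: clamp the degree requested
     * by the planner to the CPU cores and background-worker slots still
     * free at the moment the Funnel starts up.
     */
    static int
    choose_num_workers(int requested)
    {
        int     free_slots = max_worker_processes - currently_running_workers();
        int     num_cores = get_num_cpu_cores();

        return Max(0, Min(requested, Min(free_slots, num_cores)));
    }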

> The neat thing about this way of separating things out is that we can
> eventually write code to push more stuff down into the funnel.  For
> example, consider this:
>
> Nested Loop
> -> Seq Scan on foo
> -> Index Scan on bar
>     Index Cond: bar.x = foo.x
>
> Now, if a parallel sequential scan is cheaper than a regular
> sequential scan, we can instead do this:
>
> Nested Loop
> -> Funnel
>     -> Partial Heap Scan on foo
> -> Index Scan on bar
>     Index Cond: bar.x = foo.x
>
> The problem with this is that the nested loop/index scan is happening
> entirely in the master.  But we can have logic that fixes that by
> knowing that a nested loop can be pushed through a funnel, yielding
> this:
>
> Funnel
> -> Nested Loop
>     -> Partial Heap Scan on foo
>     -> Index Scan on bar
>         Index Cond: bar.x = foo.x
>
> Now that's pretty neat.  One can also imagine doing this with
> aggregates.  Consider:
>
I guess the planner enhancement would sit around add_paths_to_joinrel().
When any of the underlying join paths support multi-node execution, the new
logic would add a Funnel node on top of those join paths, roughly as in the
sketch below. Just my thought.
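
Something along these lines, where create_funnel_path() and the parallel-capability test are hypothetical names for illustration:

    /*
     * Hypothetical addition around add_paths_to_joinrel(): when a join
     * path is capable of multi-node execution, also consider running it
     * under a Funnel that fires up some number of workers.
     */
    static void
    consider_funnel_over_join(PlannerInfo *root, RelOptInfo *joinrel,
                              Path *joinpath, int num_workers)
    {
        if (path_supports_multi_node_execution(joinpath))
            add_path(joinrel,
                     (Path *) create_funnel_path(root, joinrel,
                                                 joinpath, num_workers));
    }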

> HashAggregate
> -> Funnel
>     -> Partial Heap Scan on foo
>         Filter: x = 1
>
> Here, we can't just push the HashAggregate through the Funnel, but
> given infrastructure for that, we could convert it to something like this:
>
> HashAggregateFinish
> -> Funnel
>     -> HashAggregatePartial
>         -> Partial Heap Scan on foo
>              Filter: x = 1
>
> That'd be swell.
>
> You can see that something like this will also work for breaking off
> an entire plan tree and shoving it down into a worker.  The master
> can't participate in the computation in that case, but it's otherwise
> the same idea.
>
I believe this is the same overall vision we've discussed in the thread about
combining aggregate functions, although people have primarily considered
applying that feature to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a combining
function for an aggregate. It would inform the planner that the aggregate
supports two-stage execution. In addition, we may need a planner enhancement
to inject the partial-aggregate node during path construction.
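
For example, assuming a new aggcombinefn column in pg_aggregate (an assumption, not an existing catalog column), the planner's two-stage test might be sketched as:

    #include "postgres.h"
    #include "catalog/pg_aggregate.h"
    #include "utils/syscache.h"

    /*
     * Hypothetical planner test: an aggregate supports two-stage
     * execution if a combining function (state + state -> state) is
     * defined for it.
     */
    static bool
    aggregate_supports_partial(Oid aggfnoid)
    {
        HeapTuple   tup;
        bool        result;

        tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(aggfnoid));
        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for aggregate %u", aggfnoid);

        result = OidIsValid(((Form_pg_aggregate) GETSTRUCT(tup))->aggcombinefn);
        ReleaseSysCache(tup);
        return result;
    }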

We probably also have to set a flag to inform the later stage (which
constructs the Agg plan) that the underlying scan/join node performs partial
aggregation, so the final aggregation has to expect transition-state data
instead of the usual row-by-row arguments.
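
As a data-structure sketch, such a flag might live on a hypothetical path node like this (all names invented for illustration):

    /*
     * Hypothetical path node carrying that flag.  When states_as_input
     * is set, the stage that builds the Agg plan must emit a
     * finalize-only Agg that consumes transition states rather than
     * raw input rows.
     */
    typedef struct PartialAggPath
    {
        Path        path;
        Path       *subpath;          /* scan/join emitting partial states */
        bool        states_as_input;  /* final Agg receives state data */
    } PartialAggPath;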

Also, I think a HashJoin with a very large outer relation but a
disproportionately small inner relation is a good candidate for distribution
across multiple nodes, as in the plan shape below. Even if the multi-node
HashJoin has to read the small inner relation N times, splitting up the very
large outer relation should yield a net gain.
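
In plan-tree form (using the notation above), that might come out as:

Funnel
Workers: 4
-> Hash Join
    Hash Cond: big.x = small.x
    -> Partial Heap Scan on big
    -> Hash
        -> Seq Scan on small

Each worker builds its own copy of the hash table over "small" (the N-times read mentioned above), but the scan of the very large "big" is split among them.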

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


