Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From: Thom Brown
Subject: Re: [DESIGN] ParallelAppend
Msg-id: CAA-aLv7k8b0FgN+gMJ6e1TGAOQF7j1t+HL5K7tX4n2ogx8X+fA@mail.gmail.com
In response to: Re: [DESIGN] ParallelAppend (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [DESIGN] ParallelAppend (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 13 November 2015 at 22:09, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> I'm now designing the parallel feature of Append...
>>
>> Here is one challenge. How do we determine whether each sub-plan
>> allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
>
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
>
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
>
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>
> That's a much better plan.
>
> To make this work, this patch introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path to form a partial path for the joinrel.
>
> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
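
To make sure I follow, here's a toy model of those two ideas: the
appendrel only gets a partial path if every child can supply one, and
the append path's parallel_degree is just the highest degree among the
children.  All the names and types below are mine for illustration,
not the patch's actual structures:

#include <stdio.h>
#include <stddef.h>

/*
 * Each child rel is represented here only by its cheapest partial
 * path, or NULL if it has none (toy types, not the real
 * Path/RelOptInfo).
 */
typedef struct ToyPath
{
    const char *desc;
    int         parallel_degree;    /* ideal worker count, > 0 if partial */
} ToyPath;

/*
 * Collect one partial subpath per child and percolate the highest
 * parallel_degree up to the append path.  Returns -1 if some child
 * has no partial path, in which case no partial append path exists.
 */
static int
build_partial_append(ToyPath **children, int nchildren, ToyPath **subpaths)
{
    int         degree = 0;

    for (int i = 0; i < nchildren; i++)
    {
        if (children[i] == NULL)
            return -1;          /* one non-partial child spoils it all */
        subpaths[i] = children[i];
        if (children[i]->parallel_degree > degree)
            degree = children[i]->parallel_degree;
    }
    return degree;
}

int
main(void)
{
    ToyPath     p1 = {"Partial Seq Scan on t1", 2};
    ToyPath     p2 = {"Partial Seq Scan on t2", 3};
    ToyPath    *children[] = {&p1, &p2};
    ToyPath    *subpaths[2];
    int         degree = build_partial_append(children, 2, subpaths);

    if (degree >= 0)
    {
        printf("Partial Append (parallel_degree=%d)\n", degree);
        for (int i = 0; i < 2; i++)
            printf("  -> %s\n", subpaths[i]->desc);
    }
    return 0;
}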
>
>> Type-A) It can be performed on background worker context and
>>    picked up by multiple worker processes concurrently.
>>   (e.g: Parallel SeqScan)
>> Type-B) It can be performed on background worker context but
>>   cannot be picked up by multiple worker processes.
>>   (e.g: non-parallel aware node)
>> Type-C) It should not be performed on background worker context.
>>    (e.g: plan/path node with volatile functions)
>
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
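
That rule is simple enough to spell out in a few lines.  This is just
my restatement (the enum and function are illustrative, not from the
patch):

#include <stdbool.h>

typedef enum
{
    TYPE_A,                     /* partial: many workers can pull from it */
    TYPE_B,                     /* parallel-safe, but one process at most */
    TYPE_C                      /* must run in the leader */
} SubplanType;

static SubplanType
classify_subpath(bool from_partial_pathlist, bool consider_parallel)
{
    if (from_partial_pathlist)
        return TYPE_A;          /* came from childrel->partial_pathlist */
    return consider_parallel ? TYPE_B : TYPE_C;
}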
>
> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are thus prone to be the limiting factor on
> finishing the whole Append.  Having the plans divided in advance
> between these three lists (say, restricted_plans, safe_plans,
> parallel_plans) makes that easy to implement.
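
That prioritisation does seem easy once the lists are split.  A toy
version of the pick order (the names and the leader's fallback order
are my guesses, not the patch's):

#include <stdbool.h>

enum { RESTRICTED, SAFE, PARALLEL, NONE };

/*
 * Which list should this process pull its next subplan from?  The
 * leader drains restricted_plans first, since no one else can run
 * them; workers drain safe_plans first, since each of those can
 * absorb only one process and is prone to be the limiting factor.
 */
static int
next_list(bool is_leader, int nrestricted, int nsafe, int nparallel)
{
    if (is_leader)
    {
        if (nrestricted > 0)
            return RESTRICTED;
        if (nparallel > 0)
            return PARALLEL;    /* fallback order here is my guess */
        if (nsafe > 0)
            return SAFE;
    }
    else
    {
        if (nsafe > 0)
            return SAFE;
        if (nparallel > 0)
            return PARALLEL;
    }
    return NONE;                /* nothing this process is allowed to run */
}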
>
> Incidentally, I think it's subtly wrong to think of the parallel_aware
> flag as telling you whether the plan can absorb multiple workers.
> That's not really what it's for.  It's to tell you whether the plan is
> doing *something* parallel aware - that is, whether its Estimate,
> InitializeDSM, and InitializeWorker callbacks should do anything.  For
> SeqScan, flipping parallel_aware actually does split the input among
> all the workers, but for Append it's probably just load balancing, and
> for other nodes it might be something else again.  The term I'm using
> to indicate a path/plan that returns only a subset of the results in
> each worker is "partial".  Whether or not a path is partial is, in the
> design embodied in this patch, indicated both by whether
> path->parallel_degree > 0 and whether the path is in rel->pathlist or
> rel->partial_pathlist.
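
If I read that right, "partial" is then detectable two ways, and in
this design the two should always agree (a sketch with my names, not
the patch's):

#include <stdbool.h>

/* A path is partial iff it returns only a subset of the results in
 * each worker; in this patch that's signalled redundantly by a
 * nonzero parallel_degree and by living in rel->partial_pathlist. */
static bool
path_is_partial(int parallel_degree, bool in_partial_pathlist)
{
    return parallel_degree > 0 && in_partial_pathlist;
}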

Okay, I've tried this patch.  I created a pgbench database at scale
factor 300 (-s 300), and partitioned the pgbench_accounts table into
300 different children based on "bid".

# explain analyse select count(*) from pgbench_accounts;

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=634889.14..634889.15 rows=1 width=0) (actual time=14868.918..14868.918 rows=1 loops=1)
   ->  Gather  (cost=1000.00..559784.13 rows=30042001 width=0) (actual time=7.015..12319.699 rows=30000000 loops=1)
         Number of Workers: 2
         ->  Append  (cost=0.00..528742.13 rows=30042001 width=0) (actual time=0.019..24531.096 rows=59094295 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.006 rows=0 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_1  (cost=0.00..1711.60 rows=100000 width=0) (actual time=0.017..44.586 rows=170314 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_2  (cost=0.00..1711.60 rows=100000 width=0) (actual time=0.438..49.974 rows=198923 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_3  (cost=0.00..1711.60 rows=100000 width=0) (actual time=0.350..42.909 rows=198496 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_4  (cost=0.00..1711.60 rows=100000 width=0) (actual time=0.656..37.556 rows=198780 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_5  (cost=0.00..1711.60 rows=100000 width=0) (actual time=4.510..90.154 rows=193799 loops=1)
               ->  Parallel Seq Scan on pgbench_accounts_6  (cost=0.00..1711.60 rows=100000 width=0) (actual time=4.326..76.018 rows=192680 loops=1)

--snip--

Yes, it's working!

However, the first parallel seq scan shows it getting 170314 rows.
Another run shows it getting 194165 rows.  The final result is
correct, but the row count reported on the Append node (59094295)
doesn't match the row count on the Gather node (30000000).

And also, for some reason, I can no longer get this to use more than 2
workers, even with max_worker_processes = 16 and max_parallel_degree =
12.  I don't know whether that has anything to do with this patch,
though.

Thom


