Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F801121799@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to Re: [DESIGN] ParallelAppend  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [DESIGN] ParallelAppend  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
> On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > > On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: pgsql-hackers-owner@postgresql.org
> > > > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> > > > > Sent: Monday, July 27, 2015 11:07 PM
> > > > > To: Amit Kapila
> > > > > >
> > > > > > Is there a real need to have new node like ParallelAppendPath?
> > > > > > Can't we have Funnel node beneath AppendNode and then each
> > > > > > worker will be responsible to have SeqScan on each inherited child
> > > > > > relation.  Something like
> > > > > >
> > > > > > Append
> > > > > >    ---> Funnel
> > > > > >       --> SeqScan rel1
> > > > > >       --> SeqScan rel2
> > > > > >
> > > > > If Funnel can handle both of horizontal and vertical parallelism,
> > > > > it is a great simplification. I never stick a new node.
> > > > >
> > > > > Once Funnel get a capability to have multiple child nodes, probably,
> > > > > Append node above will have gone. I expect set_append_rel_pathlist()
> > > > > add two paths based on Append and Funnel, then planner will choose
> > > > > the cheaper one according to its cost.
> > > > >
> > > > In the latest v16 patch, Funnel is declared as follows:
> > > >
> > > >   typedef struct Funnel
> > > >   {
> > > >       Scan        scan;
> > > >       int         num_workers;
> > > >   } Funnel;
> > > >
> > > > If we try to add Append capability here, I expects the structure will
> > > > be adjusted as follows, for example:
> > > >
> > > >   typedef struct Funnel
> > > >   {
> > > >       Scan        scan;
> > > >       List       *funnel_plans;
> > > >       List       *funnel_num_workers;
> > > >   } Funnel;
> > > >
> > > > As literal, funnel_plans saves underlying Plan nodes instead of the
> > > > lefttree. Also, funnel_num_workers saves number of expected workers
> > > > to be assigned on individual child plans.
> > > >
> > >
> > > or shall we have a node like above and name it as FunnelAppend or
> > > AppenFunnel?
> > >
> > It is better to have smaller number of node types which are capable to
> > kick background workers because of simplification of path construction.
> >
> > Let's assume the case below. When planner considers a path to append
> > child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is
> > Funnel+PartialSeqScan, we cannot put Funnel here unless we don't pull
> > up Funnel of rel2, can we?
> >
> >   (Append? or Funnel)
> >    --> SeqScan on rel1
> >    --> Funnel
> >         --> PartialSeqScan on rel2
> >    --> IndexScan on rel3
> >
> 
> I am not sure, but what problem do you see in putting Funnel node
> for one of the relation scans and not for the others.
>
At this moment, I'm not certain whether background worker can/ought
to launch another background workers.
If sub-Funnel node is executed by 10-processes then it also launch
10-processes for each, 100-processes will run for each?

> > If we pull Funnel here, I think the plan shall be as follows:
> >   Funnel
> >    --> SeqScan on rel1
> >    --> PartialSeqScan on rel2
> >    --> IndexScan on rel3
> >
> 
> So if we go this route, then Funnel should have capability
> to execute non-parallel part of plan as well, like in this
> case it should be able to execute non-parallel IndexScan on
> rel3 as well and then it might need to distinguish between
> parallel and non-parallel part of plans.  I think this could
> make Funnel node complex.
>
It is difference from what I plan now. In the above example,
Funnel node has two non-parallel aware node (rel1 and rel3)
and one parallel aware node, then three PlannedStmt for each
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 and rel3, however, rel2
can be executed by multiple workers simultaneously.
(Note: if number of workers are less than three in this case,
PlannedStmt for rel3 shall not be picked up unless any other
worker don't complete to run other plan on rel1 or rel2).

From the standpoint of the Funnel, it just kicks background
workers with:- multiple PlannedStmt nodes- maximum number of workers for each plan
in addition to the current form.

Then, it continues to fetch records from the shm_mq.
Probably, it does not change the current form so much.

> > If all we have to pay attention is Funnel node, it makes the code
> > around path construction and pull-up logic much simpler, rather than
> > multiple node types can kick background workers.
> >
> 
> Okay, but I think pulling-up Funnel node makes sense only when all
> nodes beneath it needs to be executed parallely.
>
I think its decision should be based on the cost, that includes
additional startup_cost to launch background worker, as long as
non-parallel node is also capable to run on the worker side.

> > > > Even though create_parallelscan_paths() in v16 set num_workers not
> > > > larger than parallel_seqscan_degree, total number of the concurrent
> > > > background workers may exceed this configuration if more than two
> > > > PartialSeqScan nodes are underlying.
> > > > It is a different configuration from max_worker_processes, so it is
> > > > not a matter as long as we have another restriction.
> > > > However, how do we control the cap of number of worker processes per
> > > > "appendable" Funnel node? For example, if a parent table has 200
> > > > child tables but max_worker_processes are configured to 50.
> > > > It is obviously impossible to launch all the background workers
> > > > simultaneously. One idea I have is to suspend launch of some plans
> > > > until earlier ones are completed.
> > > >
> > >
> > > Okay, but I think in that idea you need to re-launch the workers again for
> > > new set of relation scan's which could turn out to be costly, how about
> > > designing some way where workers after completing their assigned work
> > > check for new set of task/'s (which in this case would be to scan a new) and
> > > then execute the same.  I think in this way we can achieve dynamic allocation
> > > of work and achieve maximum parallelism with available set of workers.
> > > We have achieved this in ParallelSeqScan by scanning at block level, once
> > > a worker finishes a block, it checks for new block to scan.
> > >
> > Is it possible to put multiple PlannedStmt on TOC, isn't it?
> 
> Yes, I don't see any problem in doing that way.  So here for
> each different (child) relation, you want to create a separate
> PlannedStmt or do you have something else in mind?
>
I plan to create a separate PlannedStmt for each sub-plan, then
a background worker will focus on a particular PlannedStmt until
it completes the current focused one.

> > If background worker picks up an uncompleted PlannedStmt first
> > (based on round-robin likely?), it may achieve the maximum
> > parallelism.
> 
> I think this can work well for the cases when there are insufficient
> number of workers to execute the different planned statements.
>
Yep, it is the biggest reason why I like the design than what
I initially proposed; fixed number of workers for each sub-plan.

> > Yep, it seems to me a good idea which I want to try.
> > If (num of worker) > (num of sub-plans), some of sub-plans can
> > have multiple workers from the beginning, then, other workers
> > also help to execute heavy plans later.
> > It may be better to put PlannedStmt in order of total_cost to
> > bias multi-workers execution from the beginning.
> >
> 
> Yeah, that might be better, but I think for doing so you might
> need to traverse each child plan and compare there costs while
> constructing multiple planned statements which might incur some
> overhead when number of plans are large, however OTOH this cost
> should be much smaller as compare to starting up workers, so
> probably it should be okay.
>
Yep. If we have to execute thousands of child plans, its execution
cost is relatively large, not only planning cost. :-)

> > TODO: Even if a heavy query occupied most of available worker slots,
> > another session wants to use parallel execution later but during
> > execution of the primary query. We may need to have a 'scoreboard'
> > on shared memory to know how many workers are potentially needed
> > and how much ones are overused by somebody. If someone overconsumed
> > background workers, it should exit first, rather than picking up
> > the next PlannedStmt.
> >
> 
> Actually distribution of workers among parallel queriesis a very
> tricky problem and I think we have to keep on working on it till
> we get some good solution for it.
>
I agree. Even if initial version adopts simple solution, we can
improve the logic according to our experiences.

> > > Another way to look at dividing the work in this case could be in terms of
> > > chunk-of-blocks, once a worker finishes it current set of block/'s, it should
> > > be
> > > able to get new set of block's to scan.  So let us assume if we decide
> > > chunk-size as 32 and total number of blocks in whole inheritance hierarchy
> > > are 3200, then the max workers we should allocate to this scan are 100 and
> > > if we have parallel_seqscan degree lesser than that then we can use those
> > > many workers and then let them scan 32-blocks-at-a-time.
> > >
> > If we use the above multi-PlannedStmt approach, TOC also need to have a counter
> > to track how many background workers are running on a particular PlannedStmt,
> > then if enough number of worker is running on the PlannedStmt, next available
> > worker will skip this PlannedStmt (even if round-robin) or just exit?
> 
> I think for a particular PlannedStmt, number of workers must have
> been decided before start of execution, so if those many workers are
> available to work on that particular PlannedStmt, then next/new
> worker should work on next PlannedStmt.
>
My concern about pre-determined number of workers is, it depends on the
run-time circumstances of concurrent sessions. Even if planner wants to
assign 10-workers on a particular sub-plan, only 4-workers may be
available on the run-time because of consumption by side sessions.
So, I expect only maximum number of workers is meaningful configuration.

> > Anyway, I think an infrastructure may be needed to avoid too aggressive
> > parallel execution.
> >
> 
> 
> Yes, I think we need some infrastructure for workers if we have
> to follow the design discussed above.
> 
> 
> So I think we have three main parts to work for this patch.
> 
> 1. Allocation of work among workers which needs some different
> mechanism than ParallelSeqScan Patch.
>
Yes, I expect to extend the format of TOC, to store multiple PlannedStmt
nodes and state information for each node like PartialSeqScanState.

> 2. Execution of work by workers and Funnel node and then pass
> the results back to upper node.  I think this needs some more
> work in addition to ParallelSeqScan patch.
>
I expect we can utilize existing infrastructure here. It just picks
up the records come from the underlying workers, then raise it to
the upper node.

> 3. Generation of parallel plan for Append node needs somewhat
> different mechanism as we might want to have some additional
> logic for transaformation of nodes.
>
I expect set_append_rel_pathlist() is the best location to add
FunnelPath in addition to AppendPath. If its cost is more attractive
than AppendPath, planner will pick up.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
Next
From: Kouhei Kaigai
Date:
Subject: Re: Foreign join pushdown vs EvalPlanQual