Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [DESIGN] ParallelAppend
Date
Msg-id CAA4eK1KvKe5LFzFZ+hC4y840nzpMSXLDqaFoN4zM_eid2yirrA@mail.gmail.com
In response to Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: [DESIGN] ParallelAppend  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
List pgsql-hackers
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>
> > On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: pgsql-hackers-owner@postgresql.org
> > > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> > > > Sent: Monday, July 27, 2015 11:07 PM
> > > > To: Amit Kapila
> > > > >
> > > > > Is there a real need to have new node like ParallelAppendPath?
> > > > > Can't we have Funnel node beneath AppendNode and then each
> > > > > worker will be responsible to have SeqScan on each inherited child
> > > > > relation.  Something like
> > > > >
> > > > > Append
> > > > >    ---> Funnel
> > > > >       --> SeqScan rel1
> > > > >       --> SeqScan rel2
> > > > >
> > > > If Funnel can handle both horizontal and vertical parallelism, it
> > > > is a great simplification. I don't insist on a new node.
> > > >
> > > > Once Funnel gets the capability to have multiple child nodes, the
> > > > Append node above it will probably be gone. I expect
> > > > set_append_rel_pathlist() to add two paths, based on Append and
> > > > Funnel, and the planner to then choose the cheaper one according to
> > > > its cost.
> > > >
> > > In the latest v16 patch, Funnel is declared as follows:
> > >
> > >   typedef struct Funnel
> > >   {
> > >       Scan        scan;
> > >       int         num_workers;
> > >   } Funnel;
> > >
> > > If we try to add Append capability here, I expect the structure will
> > > be adjusted as follows, for example:
> > >
> > >   typedef struct Funnel
> > >   {
> > >       Scan        scan;
> > >       List       *funnel_plans;
> > >       List       *funnel_num_workers;
> > >   } Funnel;
> > >
> > > As the names suggest, funnel_plans stores the underlying Plan nodes
> > > instead of the lefttree. Also, funnel_num_workers stores the number of
> > > workers expected to be assigned to each individual child plan.
> > >
> >
> > or shall we have a node like the above and name it FunnelAppend or
> > AppendFunnel?
> >
> It is better to have a smaller number of node types capable of kicking
> off background workers, because it simplifies path construction.
>
> Let's assume the case below. When the planner considers a path to append
> child scans on rel1, rel2 and rel3, but the cheapest path for rel2 is
> Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull up the
> Funnel of rel2, can we?
>
>   (Append? or Funnel)
>    --> SeqScan on rel1
>    --> Funnel
>         --> PartialSeqScan on rel2
>    --> IndexScan on rel3
>

I am not sure; what problem do you see in putting a Funnel node for one
of the relation scans and not for the others?

> If we pull up the Funnel here, I think the plan shall be as follows:
>   Funnel
>    --> SeqScan on rel1
>    --> PartialSeqScan on rel2
>    --> IndexScan on rel3
>

So if we go this route, then Funnel should have the capability to
execute the non-parallel part of the plan as well; in this case it
should be able to execute the non-parallel IndexScan on rel3, and then
it might need to distinguish between the parallel and non-parallel
parts of the plan.  I think this could make the Funnel node complex.
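
To illustrate the concern, here is a purely hypothetical shape such a
node could take (the field names are mine, not from the v16 patch),
where the node has to track which of its children are parallel-aware:

  typedef struct Funnel
  {
      Scan        scan;
      List       *funnel_plans;       /* all child plans */
      Bitmapset  *parallel_children;  /* which children are parallel-aware */
      List       *funnel_num_workers; /* planned workers per parallel child */
  } Funnel;

The executor would then have to run the non-parallel children (such as
the IndexScan on rel3) in the leader itself, while handing only the
parallel-aware ones to background workers.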


> If the Funnel node is all we have to pay attention to, it makes the code
> around path construction and pull-up logic much simpler than having
> multiple node types that can kick off background workers.
>

Okay, but I think pulling up the Funnel node makes sense only when all
nodes beneath it need to be executed in parallel.

> > > Even though create_parallelscan_paths() in v16 sets num_workers no
> > > larger than parallel_seqscan_degree, the total number of concurrent
> > > background workers may exceed this setting if multiple PartialSeqScan
> > > nodes are underneath.
> > > It is a different setting from max_worker_processes, so it is not a
> > > problem as long as we have another restriction.
> > > However, how do we control the cap on the number of worker processes
> > > per "appendable" Funnel node? For example, suppose a parent table has
> > > 200 child tables but max_worker_processes is configured to 50.
> > > It is obviously impossible to launch all the background workers
> > > simultaneously. One idea I have is to suspend the launch of some plans
> > > until earlier ones are completed.
> > >
> >
> > Okay, but I think with that idea you need to re-launch the workers again
> > for a new set of relation scans, which could turn out to be costly. How
> > about designing some way where workers, after completing their assigned
> > work, check for a new set of tasks (which in this case would be to scan
> > a new relation) and then execute the same?  I think in this way we can
> > achieve dynamic allocation of work and achieve maximum parallelism with
> > the available set of workers.  We have achieved this in ParallelSeqScan
> > by scanning at the block level: once a worker finishes a block, it
> > checks for a new block to scan.
> >
> It is possible to put multiple PlannedStmts on the TOC, isn't it?

Yes, I don't see any problem in doing it that way.  So here, for each
different (child) relation, do you want to create a separate
PlannedStmt, or do you have something else in mind?
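
If we go that route, a rough sketch (the key values and variable names
here are only illustrative, not from any existing patch) of how each
child's PlannedStmt could be stored under its own TOC key:

  #define KEY_CHILD_PSTMT(i)   (0xFA000000 + (i))   /* hypothetical keys */

  for (i = 0; i < num_child_plans; i++)
  {
      char   *pstmt_str = nodeToString(child_pstmts[i]);
      Size    len = strlen(pstmt_str) + 1;
      char   *space = shm_toc_allocate(toc, len);

      memcpy(space, pstmt_str, len);
      shm_toc_insert(toc, KEY_CHILD_PSTMT(i), space);
  }

A worker could then stringToNode() whichever entry it picks up and
execute that plan.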

> If a background worker picks up an uncompleted PlannedStmt first
> (likely based on round-robin?), it may achieve the maximum
> parallelism.

I think this can work well for the cases when there is an insufficient
number of workers to execute the different planned statements.

> Yep, it seems to me a good idea which I want to try.
> If (num of workers) > (num of sub-plans), some of the sub-plans can have
> multiple workers from the beginning; then, other workers can also help to
> execute the heavy plans later.
> It may be better to put the PlannedStmts in order of total_cost, to bias
> multi-worker execution toward the expensive plans from the beginning.
>

Yeah, that might be better, but I think for doing so you might need to
traverse each child plan and compare their costs while constructing the
multiple planned statements, which might incur some overhead when the
number of plans is large.  However, OTOH this cost should be much
smaller compared to starting up the workers, so probably it should be
okay.
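
As a rough sketch of that ordering step (assuming child_pstmts is an
array of PlannedStmt pointers; the names are illustrative only):

  /* Order child plans by descending total cost, so the most expensive
   * plans are placed first and get the extra workers first. */
  static int
  cmp_child_plan_cost(const void *a, const void *b)
  {
      const PlannedStmt *pa = *(PlannedStmt *const *) a;
      const PlannedStmt *pb = *(PlannedStmt *const *) b;

      if (pa->planTree->total_cost > pb->planTree->total_cost)
          return -1;
      if (pa->planTree->total_cost < pb->planTree->total_cost)
          return 1;
      return 0;
  }

  qsort(child_pstmts, num_child_plans, sizeof(PlannedStmt *),
        cmp_child_plan_cost);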

> TODO: Even if a heavy query has occupied most of the available worker
> slots, another session may want to use parallel execution later, during
> execution of the primary query. We may need to have a 'scoreboard' in
> shared memory to know how many workers are potentially needed and how
> many are overused by somebody. If someone has overconsumed background
> workers, it should exit first, rather than picking up the next
> PlannedStmt.
>

Actually, distribution of workers among parallel queries is a very
tricky problem, and I think we will have to keep working on it till we
get some good solution for it.
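
Something like the following (purely a sketch; the names are invented
here) might be a starting point for the 'scoreboard' bookkeeping you
describe:

  typedef struct ParallelWorkerScoreboard
  {
      slock_t     mutex;            /* protects the counters below */
      int         workers_limit;    /* derived from max_worker_processes */
      int         workers_in_use;   /* currently launched for parallelism */
      int         workers_wanted;   /* requested but not yet granted */
  } ParallelWorkerScoreboard;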

>
> > Another way to look at dividing the work in this case could be in terms
> > of chunks of blocks: once a worker finishes its current set of blocks,
> > it should be able to get a new set of blocks to scan.  So let us assume
> > we decide the chunk size as 32 and the total number of blocks in the
> > whole inheritance hierarchy is 3200; then the max workers we should
> > allocate to this scan are 100, and if we have parallel_seqscan_degree
> > less than that, then we can use that many workers and let them scan 32
> > blocks at a time.
> >
> If we use the above multi-PlannedStmt approach, the TOC also needs to
> have a counter to track how many background workers are running on a
> particular PlannedStmt; then, if enough workers are already running on
> the PlannedStmt, should the next available worker skip this PlannedStmt
> (even if round-robin) or just exit?

I think for a particular PlannedStmt, the number of workers must have
been decided before the start of execution, so if that many workers are
already working on that particular PlannedStmt, then the next/new
worker should work on the next PlannedStmt.
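
A minimal sketch of that bookkeeping (the structure and function names
are hypothetical) could keep one counter per child PlannedStmt in the
TOC:

  typedef struct ChildPlanSlot
  {
      pg_atomic_uint32    nworkers_active;   /* workers currently attached */
      int                 nworkers_planned;  /* decided before execution */
  } ChildPlanSlot;

  /* Claim the first child plan that is still short of its planned number
   * of workers; return -1 if every plan is fully staffed. */
  static int
  claim_next_child_plan(ChildPlanSlot *slots, int nplans)
  {
      int     i;

      for (i = 0; i < nplans; i++)
      {
          uint32  cur = pg_atomic_fetch_add_u32(&slots[i].nworkers_active, 1);

          if (cur < (uint32) slots[i].nworkers_planned)
              return i;
          pg_atomic_fetch_sub_u32(&slots[i].nworkers_active, 1);
      }
      return -1;
  }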

> Anyway, I think an infrastructure may be needed to avoid too aggressive
> parallel execution.
>

Yes, I think we need some infrastructure for workers if we have
to follow the design discussed above.


So I think we have three main parts to work on for this patch.

1. Allocation of work among workers, which needs a different mechanism
than the ParallelSeqScan patch.
2. Execution of work by the workers and the Funnel node, and then
passing the results back to the upper node.  I think this needs some
more work in addition to the ParallelSeqScan patch.
3. Generation of a parallel plan for the Append node, which needs a
somewhat different mechanism, as we might want some additional logic
for the transformation of nodes.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
