Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers
From: Kouhei Kaigai
Subject: Re: [DESIGN] ParallelAppend
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111E008@BPXM15GP.gisp.nec.co.jp
In response to: Re: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses: Re: [DESIGN] ParallelAppend
List: pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, July 27, 2015 11:07 PM
> To: Amit Kapila
> Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
>
> > On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > >
> > > Hello,
> > >
> > > I have recently been working on / investigating the ParallelAppend
> > > feature towards the next commit fest. Below is my design proposal.
> > >
> > > 1. Concept
> > > ----------
> > > The concept is quite simple; anybody might have considered it more
> > > than once. A ParallelAppend node kicks off background worker
> > > processes to execute its child nodes in parallel / asynchronously.
> > > It intends to improve the performance of scans over large partitioned
> > > tables from the standpoint of overall throughput; the latency of the
> > > first few hundred rows is not in the scope of this project.
> > > From the standpoint of technology trends, it primarily tries to
> > > utilize the multi-core capability within a system, but it also
> > > enables expansion to distributed database environments using the
> > > foreign-table inheritance feature.
> > > Its behavior is very similar to the Funnel node except for several
> > > points, so we can reuse the infrastructure we have built through the
> > > long-standing discussion over the v9.5 development cycle.
> > >
> > > 2. Problems to be solved
> > > -------------------------
> > > Typical OLAP workloads involve joins over many tables and scans on
> > > large tables which are often partitioned; the KPI is query response
> > > time, while only a very small number of sessions are active
> > > simultaneously. So, we are required to run a single query as fast as
> > > possible, even if it consumes more computing resources than typical
> > > OLTP workloads.
> > >
> > > The current implementation of heap scans is painful when we look at
> > > its behavior from this standpoint - how many rows we can read within
> > > a certain time - because of its synchronous manner.
> > > In the worst case, when a SeqScan node tries to fetch the next tuple,
> > > heap_getnext() looks up a block in the shared buffers, then ReadBuffer()
> > > calls the storage manager to read the target block from the filesystem
> > > if it is not in the buffers. Next, the operating system puts the
> > > calling process to sleep until the required I/O completes.
> > > Most cases are resolved at an earlier stage than the above worst
> > > case; however, the best scenario we can expect is that the next tuple
> > > already appears on top of the message queue (with visibility checks
> > > already done, of course) with no fall-through to the buffer manager
> > > or deeper.
> > > If we can run multiple scans in parallel / asynchronously, a CPU core
> > > will be assigned to another process by the operating system, which
> > > eventually improves the I/O density and enables higher processing
> > > throughput.
> > > The Append node is an ideal point to parallelize because:
> > > - child nodes can have physically different locations via
> > >   tablespaces, so further tuning is possible according to the system
> > >   landscape.
> > > - it can control whether a subplan is actually executed on a
> > >   background worker, on a per-subplan basis. If the subplans contain
> > >   large tables and small tables, ParallelAppend may kick off
> > >   background workers to scan the large tables only, but scan the
> > >   small tables by itself.
> > > - Like the Funnel node, we don't need to care about enhancement of
> > >   individual node types. SeqScan, IndexScan, ForeignScan and others
> > >   can perform as usual, but actually in parallel.
> > >
> > >
> > > 3. Implementation
> > > ------------------
> > > * Plan & Cost
> > >
> > > ParallelAppend shall appear wherever Append can appear, except for
> > > its usage as a dummy.
> > > So, I'll enhance set_append_rel_pathlist() to add
> > > both an AppendPath and a ParallelAppendPath, with a cost for each.
> > >
> >
> > Is there a real need to have a new node like ParallelAppendPath?
> > Can't we have a Funnel node beneath the Append node, and then each
> > worker will be responsible for a SeqScan on each inherited child
> > relation? Something like:
> >
> > Append
> >  ---> Funnel
> >        --> SeqScan rel1
> >        --> SeqScan rel2
> >
> If Funnel can handle both horizontal and vertical parallelism,
> that is a great simplification. I do not insist on a new node.
>
> Once Funnel gets the capability to have multiple child nodes, probably,
> the Append node above can go away. I expect set_append_rel_pathlist()
> to add two paths based on Append and Funnel, and the planner will
> choose the cheaper one according to its cost.
>
In the latest v16 patch, Funnel is declared as follows:

  typedef struct Funnel
  {
      Scan    scan;
      int     num_workers;
  } Funnel;

If we try to add the Append capability here, I expect the structure will
be adjusted as follows, for example:

  typedef struct Funnel
  {
      Scan    scan;
      List   *funnel_plans;
      List   *funnel_num_workers;
  } Funnel;

As the names suggest, funnel_plans stores the underlying Plan nodes
instead of using the lefttree. Also, funnel_num_workers stores the number
of workers expected to be assigned to the individual child plans.

Even though create_parallelscan_paths() in v16 sets num_workers no larger
than parallel_seqscan_degree, the total number of concurrent background
workers may exceed this configuration if multiple PartialSeqScan nodes
are underlying. It is a different configuration from max_worker_processes,
so it is not a problem as long as we have another restriction. However,
how do we control the cap on the number of worker processes per
"appendable" Funnel node? For example, suppose a parent table has 200
child tables but max_worker_processes is configured to 50. It is
obviously impossible to launch all the background workers simultaneously.
One idea I have is to suspend the launch of some plans until earlier ones
are completed.

> We will need to pay attention to other issues that arise when Funnel
> kicks off background workers towards asymmetric relations.
>
> If the numbers of rows of the individual child nodes vary, we may
> want to assign 10 background workers to scan rel1 with PartialSeqScan.
> On the other hand, rel2 may have a very small number of rows, so
> its total_cost may be smaller than the cost to launch a worker.
> In this case, Funnel has child nodes to be executed asynchronously and
> synchronously.
>
> If the cheapest path of a child relation is a pair of Funnel and
> PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
> Funnel node that performs like Append needs to pull up the underlying
> Funnel and assign an equivalent number of workers, as follows:
>
> Append
>  --> Funnel
>       --> PartialSeqScan on rel1 (num_workers = 4)
>  --> Funnel
>       --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3
>
> shall be rewritten to
>
> Funnel
>  --> PartialSeqScan on rel1 (num_workers = 4)
>  --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3 (num_workers = 1)
>
> We also need to consider whether Funnel will have a capability
> equivalent to MergeAppend, even though parallel sorting is
> a fantastic challenge.
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>