Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers
From: Kouhei Kaigai
Subject: Re: [DESIGN] ParallelAppend
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111E008@BPXM15GP.gisp.nec.co.jp
In response to: Re: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses: Re: [DESIGN] ParallelAppend
List: pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, July 27, 2015 11:07 PM
> To: Amit Kapila
> Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
>
> > On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > >
> > > Hello,
> > >
> > > I have recently been working on / investigating the ParallelAppend
> > > feature towards the next commit fest. Below is my design proposal.
> > >
> > > 1. Concept
> > > ----------
> > > The concept is quite simple; anybody might have considered it more
> > > than once. A ParallelAppend node kicks off background worker
> > > processes to execute its child nodes in parallel / asynchronously.
> > > It intends to improve the performance of scans over large partitioned
> > > tables from the standpoint of overall throughput; the latency of the
> > > first few hundred rows is not in the scope of this project.
> > > From the standpoint of technology trends, it primarily tries to
> > > utilize the multi-core capability within a system, but it also
> > > enables expansion to distributed database environments using the
> > > foreign-table inheritance feature.
> > > Its behavior is very similar to the Funnel node except for several
> > > points, so we can reuse the infrastructure we have built through the
> > > long-standing discussion over the v9.5 development cycle.
> > >
> > > 2. Problems to be solved
> > > -------------------------
> > > Typical OLAP workloads involve joins over many tables and scans on
> > > large tables which are often partitioned; the KPI is query response
> > > time, while only a very small number of sessions are active
> > > simultaneously. So, we are required to run a single query as fast as
> > > possible, even if it consumes more computing resources than typical
> > > OLTP workloads.
> > >
> > > The current implementation of heap scans is painful when we look at
> > > its behavior from this standpoint - how many rows we can read within
> > > a certain time - because of its synchronous manner.
> > > In the worst case, when a SeqScan node tries to fetch the next tuple,
> > > heap_getnext() looks up a block in the shared buffers, then ReadBuffer()
> > > calls the storage manager to read the target block from the filesystem
> > > if it is not in the buffers. Next, the operating system puts the
> > > calling process to sleep until the required I/O completes.
> > > Most cases are resolved at an earlier stage than the above worst
> > > case; however, the best scenario we can expect is that the next tuple
> > > already appears on top of the message queue (with visibility checks
> > > already done, of course) with no fall-through to the buffer manager
> > > or deeper.
> > > If we can run multiple scans in parallel / asynchronously, a CPU core
> > > will be assigned to another process by the operating system, which
> > > eventually improves the I/O density and enables higher processing
> > > throughput.
> > > The Append node is an ideal point to parallelize because:
> > > - child nodes can have physically different locations via
> > >   tablespaces, so further tuning is possible according to the system
> > >   landscape.
> > > - it can control whether a subplan is actually executed on a
> > >   background worker, on a per-subplan basis. If the subplans contain
> > >   large tables and small tables, ParallelAppend may kick off
> > >   background workers to scan the large tables only, but scan the
> > >   small tables by itself.
> > > - Like the Funnel node, we don't need to care about enhancement of
> > >   individual node types. SeqScan, IndexScan, ForeignScan and others
> > >   can perform as usual, but actually in parallel.
> > >
> > >
> > > 3. Implementation
> > > ------------------
> > > * Plan & Cost
> > >
> > > ParallelAppend shall appear wherever Append can appear, except for
> > > its usage as a dummy.
> > > So, I'll enhance set_append_rel_pathlist() to add
> > > both an AppendPath and a ParallelAppendPath, with a cost for each.
> > >
> >
> > Is there a real need to have a new node like ParallelAppendPath?
> > Can't we have a Funnel node beneath the Append node, and then each
> > worker will be responsible for a SeqScan on each inherited child
> > relation? Something like:
> >
> > Append
> >  ---> Funnel
> >        --> SeqScan rel1
> >        --> SeqScan rel2
> >
> If Funnel can handle both horizontal and vertical parallelism,
> that is a great simplification. I do not insist on a new node.
>
> Once Funnel gets the capability to have multiple child nodes, probably,
> the Append node above can go away. I expect set_append_rel_pathlist()
> to add two paths based on Append and Funnel, and the planner will
> choose the cheaper one according to its cost.
>
In the latest v16 patch, Funnel is declared as follows:

  typedef struct Funnel
  {
      Scan    scan;
      int     num_workers;
  } Funnel;

If we try to add the Append capability here, I expect the structure will
be adjusted as follows, for example:

  typedef struct Funnel
  {
      Scan    scan;
      List   *funnel_plans;
      List   *funnel_num_workers;
  } Funnel;

As the names suggest, funnel_plans stores the underlying Plan nodes
instead of using the lefttree. Also, funnel_num_workers stores the number
of workers expected to be assigned to the individual child plans.

Even though create_parallelscan_paths() in v16 sets num_workers no larger
than parallel_seqscan_degree, the total number of concurrent background
workers may exceed this configuration if multiple PartialSeqScan nodes
are underlying. It is a different configuration from max_worker_processes,
so it is not a problem as long as we have another restriction. However,
how do we control the cap on the number of worker processes per
"appendable" Funnel node? For example, suppose a parent table has 200
child tables but max_worker_processes is configured to 50. It is
obviously impossible to launch all the background workers simultaneously.
One idea I have is to suspend the launch of some plans until earlier ones
are completed.

> We will need to pay attention to other issues that arise when Funnel
> kicks off background workers towards asymmetric relations.
>
> If the numbers of rows of the individual child nodes vary, we may
> want to assign 10 background workers to scan rel1 with PartialSeqScan.
> On the other hand, rel2 may have a very small number of rows, so
> its total_cost may be smaller than the cost to launch a worker.
> In this case, Funnel has child nodes to be executed asynchronously and
> synchronously.
>
> If the cheapest path of a child relation is a pair of Funnel and
> PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
> Funnel node that performs like Append needs to pull up the underlying
> Funnel and assign an equivalent number of workers, as follows:
>
> Append
>  --> Funnel
>       --> PartialSeqScan on rel1 (num_workers = 4)
>  --> Funnel
>       --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3
>
> shall be rewritten to
>
> Funnel
>  --> PartialSeqScan on rel1 (num_workers = 4)
>  --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3 (num_workers = 1)
>
> We also need to consider whether Funnel will have a capability
> equivalent to MergeAppend, even though parallel sorting is
> a fantastic challenge.
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>