[DESIGN] ParallelAppend - Mailing list pgsql-hackers
From: Kouhei Kaigai
Subject: [DESIGN] ParallelAppend
Date:
Msg-id: 9A28C8860F777E439AA12E8AEA7694F80111CF22@BPXM15GP.gisp.nec.co.jp
Responses:
  Re: [DESIGN] ParallelAppend
  Re: [DESIGN] ParallelAppend
  Re: [DESIGN] ParallelAppend
List: pgsql-hackers
Hello,

I'm recently working/investigating on the ParallelAppend feature towards the next commit fest. Below is my design proposal.

1. Concept
----------
Its concept is quite simple; anybody might have considered it more than once. The ParallelAppend node kicks background worker processes to execute its child nodes in parallel / asynchronously. It intends to improve the performance of scans over large partitioned tables from the standpoint of entire throughput; latency of the first multi-hundred rows is out of the scope of this project. From the standpoint of technology trends, it primarily tries to utilize the multi-core capability within a system, but it also enables expansion to distributed database environments using the foreign-table inheritance feature. Its behavior is very similar to the Funnel node except for several points; thus, we can reuse the infrastructure on which we have had long-standing discussion through the v9.5 development cycle.

2. Problems to be solved
-------------------------
Typical OLAP workloads take tons of joins and scans on large tables which are often partitioned, and their KPI is query response time, while only a very small number of sessions are active simultaneously. So, we are required to run a single query as rapidly as possible, even if it consumes more computing resources than typical OLTP workloads.

The current implementation of heap scans is painful when we look at its behavior from that standpoint - how many rows we can read within a certain time - because of its synchronous manner. In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block on the shared buffers, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not on the buffers. Next, the operating system puts the calling process to sleep until the required I/O gets completed. Most cases are resolved at an earlier stage than this worst case, but the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course), with no fall-down to the buffer manager or deeper. If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system, which eventually improves the I/O density and enables higher processing throughput.

The Append node is an ideal point to be parallelized because:
- Child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
- It can control whether a subplan is actually executed on a background worker, per subplan basis. If the subplans contain both large tables and small tables, ParallelAppend may kick background workers to scan the large tables only, and scan the small tables by itself.
- Like the Funnel node, we don't need to care about enhancement of individual node types. SeqScan, IndexScan, ForeignScan or others can perform as usual, but actually in parallel.

3. Implementation
------------------
* Plan & Cost

ParallelAppend shall appear wherever Append can appear, except for the usage as a dummy plan. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath, with a cost for each. The cost estimation logic shall take further discussion; however, I expect the logic below to estimate the cost of ParallelAppend:
1. Sum startup_cost and run_cost for each child pathnode, but distinguish synchronous from asynchronous ones. Probably, a pathnode whose total cost is less than (parallel_setup_cost + its total cost / parallel_append_degree + number of rows * cpu_tuple_comm_cost) is nonsense to run on a background worker, so it is counted as synchronous.
2. parallel_setup_cost * (# of asynchronous nodes) is added to the sum of the startup_cost of the asynchronous nodes.
3. The sum of the run_cost of the asynchronous nodes is divided by parallel_append_degree, then cpu_tuple_comm_cost * (total # of rows from the asynchronous nodes) is added.
4. Both the synchronous and asynchronous costs are added; this becomes the cost of ParallelAppend.

Obviously, it stands on the viewpoint that says: cost reflects response time of the underlying plan. So, the cost of ParallelAppend can be smaller than the sum of its underlying child nodes.
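To illustrate the four steps above, here is a minimal sketch of such a cost function. Note this is a rough image, not a patch: cost_parallel_append(), ParallelAppendPath, parallel_append_degree, parallel_setup_cost and cpu_tuple_comm_cost are names assumed by this proposal, declared locally here with placeholder values so that the sketch is self-contained.

/* Proposed cost parameters of this design (placeholder values) */
static double parallel_append_degree = 4.0;
static Cost parallel_setup_cost = 1000.0;
static Cost cpu_tuple_comm_cost = 0.1;

static void
cost_parallel_append(ParallelAppendPath *pappend, List *subpaths)
{
    Cost        sync_startup = 0.0;
    Cost        sync_run = 0.0;
    Cost        async_startup = 0.0;
    Cost        async_run = 0.0;
    double      async_rows = 0.0;
    int         num_async = 0;
    ListCell   *lc;

    foreach(lc, subpaths)
    {
        Path   *sp = (Path *) lfirst(lc);
        Cost    run_cost = sp->total_cost - sp->startup_cost;
        /* step 1: estimated cost if this child ran on a worker */
        Cost    async_cost = parallel_setup_cost
                           + sp->total_cost / parallel_append_degree
                           + sp->rows * cpu_tuple_comm_cost;

        if (sp->total_cost < async_cost)
        {
            /* cheap child: nonsense to launch a worker, keep it sync */
            sync_startup += sp->startup_cost;
            sync_run += run_cost;
        }
        else
        {
            async_startup += sp->startup_cost;
            async_run += run_cost;
            async_rows += sp->rows;
            num_async++;
        }
    }
    /* step 2: worker launch cost for each asynchronous node */
    async_startup += parallel_setup_cost * num_async;
    /* step 3: async run cost shrinks by the degree, plus transfer cost */
    async_run = async_run / parallel_append_degree
              + cpu_tuple_comm_cost * async_rows;
    /* step 4: sum of the synchronous and asynchronous portions */
    pappend->path.startup_cost = sync_startup + async_startup;
    pappend->path.total_cost = sync_startup + async_startup
                             + sync_run + async_run;
}

A nice property of this shape is that a child path whose own total cost already beats the parallel estimate simply stays on the synchronous side, so small partitions never pay the worker setup and tuple transfer costs.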
* Execution

Like the Funnel node, it kicks background workers in the ExecProcNode handler; thus, its startup time may come later than in Fujita-san's approach if the first call of ParallelAppend is late - for example, when ParallelAppend is located under a HashJoin whose inner Hash loads billions of rows. Even though I expect ExecParallelAppend to take, at least, simple round-robin scheduling like funnel_getnext(), we may give priority to synchronous nodes over asynchronous ones just after the background worker startup.
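Here is a rough image of that round-robin fetch, modelled after funnel_getnext(). ParallelAppendState, its fields and fetch_from_worker_queue() are hypothetical names of this design, not existing code.

/*
 * Rough image of ExecParallelAppend: execute synchronous subplans in
 * the backend itself, and dequeue tuples of asynchronous subplans
 * from their worker queues, in a round-robin manner.
 */
static TupleTableSlot *
ExecParallelAppend(ParallelAppendState *node)
{
    while (node->nfinished < node->nplans)
    {
        int     i = node->whichplan;
        TupleTableSlot *slot;

        /* advance the cursor for the next call: simple round-robin */
        node->whichplan = (i + 1) % node->nplans;

        if (node->finished[i])
            continue;
        if (node->is_async[i])
        {
            /*
             * Dequeue a tuple sent by the background worker.  A real
             * implementation must distinguish "queue empty for now"
             * from "worker has finished"; this sketch ignores that.
             */
            slot = fetch_from_worker_queue(node, i);
        }
        else
        {
            /* synchronous subplan: run it in this backend */
            slot = ExecProcNode(node->subplans[i]);
        }

        if (TupIsNull(slot))
        {
            /* this subplan is exhausted */
            node->finished[i] = true;
            node->nfinished++;
            continue;
        }
        return slot;
    }
    return NULL;    /* every subplan is exhausted */
}

On top of this, the scheduling may prefer the synchronous subplans until the workers have actually started up, as mentioned above.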
4. Further challenges
----------------------
* Serialization of CustomScan via outfuncs.c/readfuncs.c
Because the methods field is, basically, a set of per-process pointers, we need an infrastructure to reproduce the same method table on the background worker process, identified by its name. (I'll also try to design it.)

* Duplication of parallelism
If Funnel+PartialSeqScan is located under ParallelAppend, directly or indirectly, it eventually leads the background worker process to launch other background workers. Is that an expected usage of the current background workers??

* Join pushdown
Distribution of nested-loops and hash-joins may gain an advantage from parallel processing, and from reduction of the hash size if the CHECK() constraints of the individual partitioned tables tell us which rows obviously cannot be joined. Also see the thread: [idea] table partition + hash join: http://bit.ly/1S2xpHT
My colleague has already started to investigate / develop this feature based on the existing Append, to reduce num_batches. As an aside, my GpuJoin feature works most effectively if the entire inner relations can be loaded onto the hash-table on GPU RAM, so such features are very welcome.

* Sort break-down
If a merge-join tries to have a ParallelAppend node on its left or right input, we may be able to compare its cost with MergeParallelAppend + Sort on the partial relations.

* Aggregate Push Down
It is exactly what I want to do.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>