Thread: Steps inside ExecEndGather
Hello,

I'm now trying to carry extra performance statistics of a CustomScan
(DMA transfer rate, execution time of GPU kernels, and so on) from the
parallel workers to the leader process using the DSM segment attached
by the parallel context.
An extension can request an arbitrary amount of DSM space through the
ExecCustomScanEstimate hook, so it looks like the leader and the workers
can share that DSM area. However, there is a problem with this design.

Below is the implementation of ExecEndGather():

void
ExecEndGather(GatherState *node)
{
    ExecShutdownGather(node);
    ExecFreeExprContext(&node->ps);
    ExecClearTuple(node->ps.ps_ResultTupleSlot);
    ExecEndNode(outerPlanState(node));
}

It calls ExecShutdownGather() before the recursive call of ExecEndNode().
The DSM segment is released by that call, so the child nodes can no longer
reference the DSM by the time ExecEndNode() runs.

Is there a technical reason why the parallel context needs to be released
before ExecEndNode() is called on the child nodes, or is it just a coding
convention?

I don't think I'm the only person who wants to use the CustomScan DSM area
to write back extra status from the parallel workers.
How about moving ExecShutdownGather() after the ExecEndNode() call?
(A rough sketch is attached below my signature.)

To avoid this problem, right now I allocate another DSM segment and pass
its handle to the parallel workers. That segment survives until
ExecEndCustomScan(), but of course it is not the most efficient way.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
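P.S. Just to illustrate the reordering I have in mind, a minimal sketch
(untested; I have not checked whether anything else relies on the current
ordering inside ExecEndGather):

void
ExecEndGather(GatherState *node)
{
    /*
     * End the child nodes first, while the parallel DSM is still mapped,
     * so that a CustomScan below the Gather can still read back the extra
     * statistics its workers wrote there.
     */
    ExecEndNode(outerPlanState(node));
    ExecFreeExprContext(&node->ps);
    ExecClearTuple(node->ps.ps_ResultTupleSlot);
    /* detach the workers and release the parallel context (and its DSM) last */
    ExecShutdownGather(node);
}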
> It calls ExecShutdownGather() before the recursive call of ExecEndNode().
> The DSM segment is released by that call, so the child nodes can no longer
> reference the DSM by the time ExecEndNode() runs.
>
> How about moving ExecShutdownGather() after the ExecEndNode() call?
>
My analysis was not quite correct. ExecShutdownNode(), called from
ExecutePlan(), is the primary place where ExecShutdownGather() gets invoked
(see the sketch below my signature), so the parallel context does not survive
until ExecEndPlan() regardless of how ExecEndGather() is implemented.

Hmm, what is the best way to do this? Or is it simply an abuse of the DSM
that is set up by the parallel context?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
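P.S. For reference, this is roughly the call site I mean in ExecutePlan()
(quoted from memory and simplified, so the details may differ slightly):

for (;;)
{
    slot = ExecProcNode(planstate);

    if (TupIsNull(slot))
    {
        /*
         * Plan is exhausted: nodes release their resources here, including
         * the Gather's parallel context and therefore its DSM segment.
         */
        (void) ExecShutdownNode(planstate);
        break;
    }
    ...
}

So by the time ExecutorEnd() reaches ExecEndPlan() / ExecEndNode(), the DSM
is already gone.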
On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I'm now trying to carry extra performance statistics of a CustomScan
> (DMA transfer rate, execution time of GPU kernels, and so on) from the
> parallel workers to the leader process using the DSM segment attached
> by the parallel context.
>
> It calls ExecShutdownGather() before the recursive call of ExecEndNode().
> The DSM segment is released by that call, so the child nodes can no longer
> reference the DSM by the time ExecEndNode() runs.
>
Before releasing the DSM, we already collect the statistics / instrumentation
information of each node; refer to ExecParallelFinish() ->
ExecParallelRetrieveInstrumentation(). So I am wondering why you can't
collect the additional information in the same way? (A rough sketch of the
pattern I mean is below my signature.)

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
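P.S. In other words, while the leader still has the DSM mapped, one could walk
the planstate tree and let each custom node pull its extra numbers out of its
own DSM chunk. A rough sketch of that pattern; my_fetch_worker_stats() and the
walker itself are made-up, extension-side names, not existing API:

static bool
my_collect_custom_stats_walker(PlanState *planstate, void *context)
{
    if (IsA(planstate, CustomScanState))
    {
        CustomScanState *css = (CustomScanState *) planstate;

        /*
         * Hypothetical extension function: copy DMA rates, GPU kernel
         * times, etc. from the node's DSM area into backend-local memory.
         */
        my_fetch_worker_stats(css);
    }

    return planstate_tree_walker(planstate, my_collect_custom_stats_walker,
                                 context);
}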
> Before releasing the DSM, we already collect the statistics /
> instrumentation information of each node; refer to ExecParallelFinish() ->
> ExecParallelRetrieveInstrumentation(). So I am wondering why you can't
> collect the additional information in the same way?
>
Thanks for the suggestion. Hmm, indeed, that is the more straightforward way
to do it, although a new hook would be needed for CSP/FDW.

What I want to collect is things like the DMA transfer rate between RAM and
GPU, the execution time of GPU kernels, and so on. These are obviously outside
the standard Instrumentation structure, so only the CSP/FDW itself knows their
size and format.

If we had a callback invoked just before the planstate_tree_walker() call when
the planstate is either a CustomScanState or a ForeignScanState, it looks to
me like the problem could be solved very cleanly. (A rough sketch follows
below my signature.)

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
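P.S. Roughly what I have in mind; the callback name and the exact call site
are only a sketch, nothing of this exists today:

/* hypothetical new member of CustomExecMethods (plus an FDW counterpart) */
void    (*RetrieveWorkerStats) (CustomScanState *node);

/*
 * Hypothetical call site: on the leader, inside the walk done by
 * ExecParallelRetrieveInstrumentation(), before planstate_tree_walker()
 * recurses and while the parallel DSM is still attached.
 */
if (IsA(planstate, CustomScanState))
{
    CustomScanState *css = (CustomScanState *) planstate;

    if (css->methods->RetrieveWorkerStats != NULL)
        css->methods->RetrieveWorkerStats(css);
}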