Thread: Steps inside ExecEndGather

Steps inside ExecEndGather

From
Kouhei Kaigai
Date:
Hello,

I'm now trying to carry extra performance statistics of a CustomScan
(like DMA transfer rate, execution time of GPU kernels, etc...)
from the parallel workers back to the leader process using the DSM
segment attached by the parallel context.
An extension can request an arbitrary length of DSM through the
ExecCustomScanEstimate hook, so it looks like the leader and workers
can share that DSM area. However, there is a problem with this design.

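For reference, a minimal sketch of the extension side of that setup; the
MyScanSharedStats struct and the my_* function names are hypothetical,
only the EstimateDSMCustomScan / InitializeDSMCustomScan signatures are
the real CustomScan callbacks they would be wired into:

    #include "postgres.h"
    #include "access/parallel.h"
    #include "nodes/execnodes.h"
    #include "port/atomics.h"

    /* Hypothetical per-scan statistics kept in the parallel DSM area */
    typedef struct MyScanSharedStats
    {
        pg_atomic_uint64    dma_bytes;          /* bytes moved over DMA */
        pg_atomic_uint64    gpu_kernel_usec;    /* GPU kernel time, usec */
    } MyScanSharedStats;

    /* To be set as EstimateDSMCustomScan in the node's CustomExecMethods:
     * report how much DSM space this node wants reserved. */
    static Size
    my_estimate_dsm(CustomScanState *node, ParallelContext *pcxt)
    {
        return MAXALIGN(sizeof(MyScanSharedStats));
    }

    /* To be set as InitializeDSMCustomScan: 'coordinate' points at the
     * chunk reserved above; workers update these counters, the leader
     * wants to read them back at the end of execution. */
    static void
    my_initialize_dsm(CustomScanState *node, ParallelContext *pcxt,
                      void *coordinate)
    {
        MyScanSharedStats *stats = (MyScanSharedStats *) coordinate;

        pg_atomic_init_u64(&stats->dma_bytes, 0);
        pg_atomic_init_u64(&stats->gpu_kernel_usec, 0);
    }
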
Below is the implementation of ExecEndGather().
  void
  ExecEndGather(GatherState *node)
  {
      ExecShutdownGather(node);
      ExecFreeExprContext(&node->ps);
      ExecClearTuple(node->ps.ps_ResultTupleSlot);
      ExecEndNode(outerPlanState(node));
  }

It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
The DSM segment is released by this call, so the child nodes cannot
reference the DSM anymore at the time of ExecEndNode().

Is there a technical reason why the parallel context needs to be released
prior to ExecEndNode() of the child nodes, or is it just a coding convention?

I don't think I'm the only person who wants to use the DSM of a CustomScan
to write back some extra status from the parallel workers.
How about moving ExecShutdownGather() after the ExecEndNode() call?
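
Concretely, the reordering I have in mind would look like this (untested
sketch against the ExecEndGather() shown above):

  void
  ExecEndGather(GatherState *node)
  {
      ExecEndNode(outerPlanState(node));   /* children can still see the DSM */
      ExecShutdownGather(node);            /* then release the parallel context */
      ExecFreeExprContext(&node->ps);
      ExecClearTuple(node->ps.ps_ResultTupleSlot);
  }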

To avoid this problem, right now I allocate another DSM segment and pass
its handle to the parallel workers. This segment survives until
ExecEndCustomScan(), but of course it is not the most efficient way.
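
In rough outline the workaround looks like this; the function names are
hypothetical, only the dsm_create() / dsm_attach() calls are the real
DSM API:

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Leader: create a private segment and obtain a handle that can be
     * handed to the workers (e.g. through the regular per-node DSM area). */
    static dsm_segment *extra_seg = NULL;

    static dsm_handle
    create_extra_stats_segment(Size nbytes)
    {
        extra_seg = dsm_create(nbytes, 0);
        memset(dsm_segment_address(extra_seg), 0, nbytes);
        return dsm_segment_handle(extra_seg);
    }

    /* Worker: map the same segment using the handle from the leader */
    static void *
    attach_extra_stats_segment(dsm_handle handle)
    {
        dsm_segment *seg = dsm_attach(handle);

        return dsm_segment_address(seg);
    }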

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: Steps inside ExecEndGather

From
Kouhei Kaigai
Date:
> I'm now trying to carry extra performance statistics of a CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from the parallel workers back to the leader process using the DSM
> segment attached by the parallel context.
> An extension can request an arbitrary length of DSM through the
> ExecCustomScanEstimate hook, so it looks like the leader and workers
> can share that DSM area. However, there is a problem with this design.
>
> Below is the implementation of ExecEndGather().
>
>   void
>   ExecEndGather(GatherState *node)
>   {
>       ExecShutdownGather(node);
>       ExecFreeExprContext(&node->ps);
>       ExecClearTuple(node->ps.ps_ResultTupleSlot);
>       ExecEndNode(outerPlanState(node));
>   }
>
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment is released by this call, so the child nodes cannot
> reference the DSM anymore at the time of ExecEndNode().
>
> Is there a technical reason why the parallel context needs to be released
> prior to ExecEndNode() of the child nodes, or is it just a coding convention?
>
> I don't think I'm the only person who wants to use the DSM of a CustomScan
> to write back some extra status from the parallel workers.
> How about moving ExecShutdownGather() after the ExecEndNode() call?
>
> To avoid this problem, right now I allocate another DSM segment and pass
> its handle to the parallel workers. This segment survives until
> ExecEndCustomScan(), but of course it is not the most efficient way.
>
My analysis was not quite correct.

ExecShutdownNode(), called from ExecutePlan(), is the primary point where
ExecShutdownGather() gets invoked, so the parallel context does not survive
to ExecEndPlan() regardless of the implementation of ExecEndGather().
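
To spell out the ordering (paraphrased call chain, not the verbatim
executor source):

    /*
     * ExecutorRun()
     *   ExecutePlan()
     *     ... ExecProcNode() fetches tuples ...
     *     ExecShutdownNode(planstate)      -- Gather shuts down the workers
     *                                         and releases the parallel DSM
     * ExecutorEnd()
     *   ExecEndPlan()
     *     ExecEndNode(planstate)           -- by now the DSM is already gone
     */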

Hmm, what is the best way to do this? Or is it simply an abuse of the DSM
that is set up by the parallel context?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: Steps inside ExecEndGather

From
Amit Kapila
Date:
On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Hello,
>
> I'm now trying to carry extra performance statistics of a CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from the parallel workers back to the leader process using the DSM
> segment attached by the parallel context.
> An extension can request an arbitrary length of DSM through the
> ExecCustomScanEstimate hook, so it looks like the leader and workers
> can share that DSM area. However, there is a problem with this design.
>
> Below is the implementation of ExecEndGather().
>
>   void
>   ExecEndGather(GatherState *node)
>   {
>       ExecShutdownGather(node);
>       ExecFreeExprContext(&node->ps);
>       ExecClearTuple(node->ps.ps_ResultTupleSlot);
>       ExecEndNode(outerPlanState(node));
>   }
>
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment is released by this call, so the child nodes cannot
> reference the DSM anymore at the time of ExecEndNode().
>

Before releasing the DSM, we do collect all the statistics or
instrumentation information of each node; refer to
ExecParallelFinish()->ExecParallelRetrieveInstrumentation(). So I am
wondering why you can't collect the additional information in the same
way?
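
To be a bit more explicit about the ordering (roughly paraphrased, not the
verbatim source), the instrumentation is pulled out of the DSM before the
parallel context is torn down:

    /*
     * ExecShutdownGather()
     *   ExecShutdownGatherWorkers()
     *     ExecParallelFinish()
     *       WaitForParallelWorkersToFinish()
     *       ExecParallelRetrieveInstrumentation()  -- walks the planstate
     *                                                 tree and copies the
     *                                                 per-node data out of
     *                                                 the DSM
     *   ExecParallelCleanup()                       -- only now is the DSM
     *                                                 released
     */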


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Steps inside ExecEndGather

From
Kouhei Kaigai
Date:
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics of a CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from the parallel workers back to the leader process using the DSM
> > segment attached by the parallel context.
> > An extension can request an arbitrary length of DSM through the
> > ExecCustomScanEstimate hook, so it looks like the leader and workers
> > can share that DSM area. However, there is a problem with this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >       ExecShutdownGather(node);
> >       ExecFreeExprContext(&node->ps);
> >       ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >       ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment is released by this call, so the child nodes cannot
> > reference the DSM anymore at the time of ExecEndNode().
> >
> 
> Before releasing the DSM, we do collect all the statistics or
> instrumentation information of each node; refer to
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(). So I am
> wondering why you can't collect the additional information in the same
> way?
>
Thanks for the suggestion.
Hmm, indeed, that is a more straightforward way to do it, although a new
hook would be needed for CSP/FDW.

What I want to collect is the DMA transfer rate between RAM and GPU, the
execution time of GPU kernels, and so on. These are obviously outside the
standard Instrumentation structure, so only the CSP/FDW can know their
size and format.

If we had a callback just before the planstate_tree_walker() call when the
planstate is either a CustomScanState or a ForeignScanState, it looks to me
like the problem could be solved very cleanly.
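
Something along these lines, where RetrieveDSMCustomScan is a purely
hypothetical name used for illustration, not an existing API:

    /* Hypothetical new member of CustomExecMethods (with a corresponding
     * one in FdwRoutine).  The executor would invoke it during the
     * instrumentation-retrieval walk, while the parallel DSM is still
     * mapped, passing the chunk this node reserved at
     * EstimateDSMCustomScan time. */
    typedef void (*RetrieveDSMCustomScan_callback) (CustomScanState *node,
                                                    void *coordinate);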

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>