Re: [parallel query] random server crash while running tpc-h query on power2 - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [parallel query] random server crash while running tpc-h query on power2
Date
Msg-id CAA4eK1LNxOnsuRS1my0PpbW=mqKu5wWXEfxg6xn3_CMWghOckw@mail.gmail.com
In response to [parallel query] random server crash while running tpc-h query on power2  (Rushabh Lathia <rushabh.lathia@gmail.com>)
Responses Re: [parallel query] random server crash while running tpc-h query on power2  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Aug 13, 2016 at 11:10 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Hi All,
>
> Recently, while running TPC-H queries on the PostgreSQL master branch, I
> noticed a random server crash. Most of the time the crash occurs while
> running TPC-H query number 3 (but it is very random).
>
>
> Here it is clear that worker_instrument is either corrupted or
> uninitialized, which is why it ends up crashing the server.
>
> After a bit more debugging and a look at the git history, I found that the
> issue started with commit af33039317ddc4a0e38a02e2255c2bf453115fd2.
> gather_readnext() calls ExecShutdownGatherWorkers() when nreaders == 0.
> ExecShutdownGatherWorkers() calls ExecParallelFinish(), which collects the
> instrumentation before marking the ParallelExecutorInfo as finished.
> ExecParallelRetrieveInstrumentation() does the allocation of
> planstate->worker_instrument.
>
> With commit af33039317 we now call gather_readnext() in the per-tuple
> context, but when nreaders == 0, ExecShutdownGatherWorkers() ends up
> allocating planstate->worker_instrument in the per-tuple context, which
> is wrong.
>
> Now the fix can be:
>
> 1) Avoid calling ExecShutdownGatherWorkers() from gather_readnext() and
> let ExecEndGather() do that.
>

I don't think we can wait until ExecEndGather() to collect the statistics,
as we need them before that for the explain path.  However, we do call
ExecShutdownNode() from ExecutePlan() when there are no more tuples,
which takes care of shutting down the Gather node.  I think the
advantage of calling it in gather_readnext() is that it allows
resources to be released early and the instrumentation/statistics to
be populated as early as possible.

> But with this change, gather_readnext() and
> gather_getnext() depend on the planstate->reader structure to continue
> reading tuples.
> Now we can either change those conditions to depend on
> planstate->nreaders, or just pfree(planstate->reader) in gather_readnext()
> instead of calling ExecShutdownGatherWorkers().
>
>
> Attaching a patch, which fixes the issue with approach 1).
>

AFAICS, your patch seems to be the right fix for this issue, unless we
need the instrumentation information during execution (other than for
explain) for some purpose.
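
For readers following the thread, the lifetime problem described in the
quoted analysis can be sketched with a toy arena allocator.  This is not
PostgreSQL code: Arena, arena_reset(), ToyPlanState and
toy_retrieve_instrumentation() are hypothetical stand-ins for a memory
context, a context reset, the Gather plan state and the instrumentation
retrieval step.  The sketch only assumes that the short-lived per-tuple
context is reset while the plan state, which still points at the
statistics, lives on.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ARENA_SIZE 256

/* Toy stand-in for a memory context: a bump allocator over one malloc'd chunk. */
typedef struct Arena
{
    unsigned char *buf;
    size_t         used;
} Arena;

static void
arena_init(Arena *a)
{
    a->buf = malloc(ARENA_SIZE);
    assert(a->buf != NULL);
    a->used = 0;
}

static void *
arena_alloc(Arena *a, size_t n)
{
    void *p;

    n = (n + 15u) & ~(size_t) 15u;      /* keep allocations 16-byte aligned */
    assert(n <= ARENA_SIZE - a->used);
    p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Like resetting a per-tuple context: old contents are poisoned and reused. */
static void
arena_reset(Arena *a)
{
    memset(a->buf, 0x7F, ARENA_SIZE);
    a->used = 0;
}

/* Hypothetical, heavily simplified analogues of the structures in the thread. */
typedef struct ToyInstr { long ntuples; } ToyInstr;
typedef struct ToyPlanState { ToyInstr *worker_instrument; } ToyPlanState;

/* Allocates the statistics in whichever arena the caller happens to pass. */
static void
toy_retrieve_instrumentation(ToyPlanState *ps, Arena *cxt, long ntuples)
{
    ps->worker_instrument = arena_alloc(cxt, sizeof(ToyInstr));
    ps->worker_instrument->ntuples = ntuples;
}

/*
 * Store 42 in either the per-tuple or the per-query arena, reset the
 * per-tuple arena (as happens between tuples), then read the value back.
 */
static long
toy_read_after_reset(int alloc_in_per_tuple)
{
    Arena        per_query, per_tuple;
    ToyPlanState ps = {NULL};
    long         seen;

    arena_init(&per_query);
    arena_init(&per_tuple);
    toy_retrieve_instrumentation(&ps,
                                 alloc_in_per_tuple ? &per_tuple : &per_query,
                                 42);
    arena_reset(&per_tuple);
    seen = ps.worker_instrument->ntuples;   /* stale if it lived in per_tuple */
    free(per_query.buf);
    free(per_tuple.buf);
    return seen;
}
```

Calling toy_read_after_reset(0) returns 42, while toy_read_after_reset(1)
returns the poison pattern instead.  In the real server the stale pointer
would reference freed or reused memory rather than a tidy poison value,
which would be consistent with the random crashes reported above.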


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


