Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date
Msg-id CAA4eK1KD=bz6mfA3p0-p7=FGF6DfH2A_HGV_ffPDDz0AnH6cRQ@mail.gmail.com
In response to Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Wed, Jan 24, 2018 at 10:03 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>> Hmm, I think that case will be addressed because tuple queues can
>>> detect if the leader is not attached.  It does in code path
>>> shm_mq_receive->shm_mq_counterparty_gone.  In
>>> shm_mq_counterparty_gone, it can detect if the worker is gone by using
>>> GetBackgroundWorkerPid.  Moreover, I have manually tested this
>>> particular case before saying your patch is fine.  Do you have some
>>> other case in mind which I am missing?
>>
>> Hmm.  Yeah.  I can't seem to reach a stuck case and was probably just
>> confused and managed to confuse Robert too.  If you make
>> fork_process() fail randomly (see attached), I see that there are a
>> couple of easily reachable failure modes (example session at bottom of
>> message):
>>
>> 1.  HandleParallelMessages() is reached and raises a "lost connection
>> to parallel worker" error because shm_mq_receive() returns
>> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
>> GetBackgroundWorkerPid() just as you said.  I guess that's happening
>> because some other process is (coincidentally) sending
>> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
>> process is unexpectedly stopped.
>>
>> 2.  WaitForParallelWorkersToFinish() is reached and raises a "parallel
>> worker failed to initialize" error.  TupleQueueReaderNext() set done
>> to true, because shm_mq_receive() returned SHM_MQ_DETACHED.  Once
>> again, that is because shm_mq_counterparty_gone() returned true.  This
>> is the bit Robert and I missed in our off-list discussion.
>>
>> As long as we always get our latch set by the postmaster after a fork
>> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is
>> guaranteed to return BGWH_STOPPED after that, and as long as we only
>> ever use latch/CFI loops to wait, and as long as we try to read from a
>> shm_mq, then I don't see a failure mode that hangs.
>
> What about the parallel_leader_participation=off case?
>

There is nothing special about that case.  There shouldn't be any
problem as long as we can detect worker failures appropriately.
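
To illustrate the point, here is a minimal standalone sketch.  It is
not PostgreSQL code: SimWorker, sim_receive() and leader_wait() are
hypothetical stand-ins that only model the idea that a read from a
tuple queue reports "detached" once the worker is known to have
stopped.  The leader_participates flag only changes whether the leader
contributes its own tuples; the detection path is the same either way.
The iteration cap stands in for "this would hang".

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
typedef enum { WORKER_NOT_YET_STARTED, WORKER_STARTED, WORKER_STOPPED } WorkerStatus;
typedef enum { QUEUE_SUCCESS, QUEUE_WOULD_BLOCK, QUEUE_DETACHED } QueueResult;

typedef struct
{
    WorkerStatus status;    /* what the postmaster would report for the worker */
    bool         attached;  /* did the worker ever attach to its queue? */
    int          pending;   /* tuples still sitting in the queue */
} SimWorker;

#define MAX_WORKERS 16
#define MAX_ITERATIONS 1000     /* stand-in for "would wait forever" */

/* Model of a non-blocking queue read with a counterparty-gone check. */
static QueueResult
sim_receive(SimWorker *w)
{
    if (w->pending > 0)
    {
        w->pending--;
        return QUEUE_SUCCESS;
    }
    /* Queue is empty: report detached only if the worker is known stopped. */
    if (w->status == WORKER_STOPPED)
        return QUEUE_DETACHED;
    return QUEUE_WOULD_BLOCK;
}

/*
 * Read from every worker queue until all report detached.  Returns the
 * number of workers that stopped without ever attaching (startup
 * failures), or -1 if the loop hit the iteration cap, i.e. a stopped
 * worker was never detected and the leader would have hung.
 */
static int
leader_wait(SimWorker *workers, int n, bool leader_participates,
            int leader_tuples, int *tuples_read)
{
    bool finished[MAX_WORKERS] = {false};
    int  done = 0;
    int  failed = 0;

    if (n > MAX_WORKERS)
        return -1;

    /* The leader's own contribution does not affect failure detection. */
    *tuples_read = leader_participates ? leader_tuples : 0;

    for (int iter = 0; iter < MAX_ITERATIONS && done < n; iter++)
    {
        for (int i = 0; i < n; i++)
        {
            QueueResult r;

            if (finished[i])
                continue;
            r = sim_receive(&workers[i]);
            if (r == QUEUE_SUCCESS)
                (*tuples_read)++;
            else if (r == QUEUE_DETACHED)
            {
                finished[i] = true;
                done++;
                if (!workers[i].attached)
                    failed++;   /* stopped before attaching */
            }
        }
    }
    return (done == n) ? failed : -1;
}
```

In this model, a worker whose fork failed is reported as stopped, so
the read on its queue returns "detached" and the loop ends whether or
not the leader participates.  The loop only fails to terminate if a
worker is never reported as stopped, which is exactly the detection
condition above.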



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com