Re: Parallel worker hangs while handling errors. - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: Parallel worker hangs while handling errors. |
Date | |
Msg-id | CALj2ACVBLox7LvCKpWd7k-kubwrmeRFjZRTrnGHN6RZQgG4cxw@mail.gmail.com Whole thread Raw |
In response to | Parallel worker hangs while handling errors. (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Parallel worker hangs while handling errors.
|
List | pgsql-hackers |
> > Leader process then identifies that there are some messages that need > to be processed, it copies the messages and sets the latch so that the > worker process can copy the remaining message from the below function: > shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not > able to receive any signal at this point of time & hangs infinitely > Worker hangs in this case because when the worker is started the > signals will be masked using sigprocmask. Unblocking of signals is > done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. > Now due to error handling the worker has jumped to setjmp in > StartBackgroundWorker function. Here the signals are in a blocked > state, hence the signal is not received by the worker process. > Your analysis looks fine to me. Adding some more info: The worker uses SIGUSR1 (with a special shared memory flag PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from mq_putmessage) and for parallel worker shutdown(from ParallelWorkerShutdown). And yes, the postmaster blocks SIGUSR1 before forking bgworker(in PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker receives the same blocked signal mask for itself and enters StartBackgroundWorker(), uses sigsetjmp for error handling, and then goes to ParallelWorkerMain() where few of the signal handers are set and then BackgroundWorkerUnblockSignals() is called to not block any signals. But observe when we did sigsetjmp the state of the signal mask is that of we received from postmaster which is all signals blocked. So, now in error cases when the control is jumped to sigsetjmp we still have the signals blocked(along with SIGUSR1) mask and in the code path of EmitErrorReport, we do send SIGUSR1 with flag PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch to be set, this happens only if the worker is able to receive back SIGUSR1 from the leader/backend. In this reported issue, since SIGUSR1 is blocked at sigsetjmp in StartBackgroundWorker(), there is no way that the worker process receiving it from the leader and the latch cannot be set and hence the hang occurs. The same hang issue can occur(though I'm not able to back it up with a use case), in the cases from wherever the EmitErrorReport() is called from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and postgres.c. > > One of the fixes could be to call BackgroundWorkerUnblockSignals just > after sigsetjmp. I'm not sure if this is the best solution. > Robert & myself had a discussion about the problem yesterday. We felt > this is a genuine problem with the parallel worker error handling and > need to be fixed. > Note that, in all sigsetjmp blocks, we intentionally HOLD_INTERRUPTS(), to not cause any issues while performing error handling, I'm concerned here that now, if we directly call BackgroundWorkerUnblockSignals() which will open up all the signals and our main intention of holding interrupts or block signals may go away. Since the main problem for this hang issue is because of blocking SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1, instead of unblocking all signals? I tried this with parallel copy hang, the issue is resolved. Something like below - if (sigsetjmp(local_sigjmp_buf, 1) != 0) { sigset_t x; sigemptyset (&x); sigaddset(&x, SIGUSR1); sigprocmask(SIG_UNBLOCK, &x, NULL); /* Since not using PG_TRY, must reset error stack by hand */ error_context_stack = NULL; /* Prevent interrupts while cleaning up */ HOLD_INTERRUPTS(); If okay, with the above approach, we can put the above sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to unblock any given signal) in a macro similar to PG_SETMASK() and use that in all the places wherever EmitErrorReport() is called from sigsetjmp. We should mind the portability of sigprocmask as well. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: