Thread: delay starting WAL receiver
I discussed this a bit in a different thread [0], but I thought it deserved its own thread. After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed that the recovery tests consistently take much longer. Upon further inspection, it looks like a similar race condition to the one described in e5d494d's commit message. With some added debug logs, I see that all of the callers of MaybeStartWalReceiver() complete before SIGCHLD is processed, so ServerLoop() waits for a minute before starting the WAL receiver. The attached patch fixes this by adjusting DetermineSleepTime() to limit the sleep to at most 100ms when WalReceiverRequested is set, similar to how the sleep is limited when background workers must be restarted. [0] https://postgr.es/m/20221215224721.GA694065%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Jan 11, 2023 at 2:08 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > I discussed this a bit in a different thread [0], but I thought it deserved > its own thread. > > After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed > that the recovery tests consistently take much longer. Upon further > inspection, it looks like a similar race condition to the one described in > e5d494d's commit message. With some added debug logs, I see that all of > the callers of MaybeStartWalReceiver() complete before SIGCHLD is > processed, so ServerLoop() waits for a minute before starting the WAL > receiver. > > The attached patch fixes this by adjusting DetermineSleepTime() to limit > the sleep to at most 100ms when WalReceiverRequested is set, similar to how > the sleep is limited when background workers must be restarted. Is the problem here that SIGCHLD is processed ... PG_SETMASK(&UnBlockSig); <--- here? selres = select(nSockets, &rmask, NULL, NULL, &timeout); Meanwhile the SIGCHLD handler code says: * Was it the wal receiver? If exit status is zero (normal) or one * (FATAL exit), we assume everything is all right just like normal * backends. (If we need a new wal receiver, we'll start one at the * next iteration of the postmaster's main loop.) ... which is true, but that won't be reached for a while in this case if the timeout has already been set to 60s. Your patch makes that 100ms, in that case, a time delay that by now attracts my attention like a red rag to a bull (I don't know why you didn't make it 0). I'm not sure, but if I got that right, then I think the whole problem might automatically go away with CF #4032. The SIGCHLD processing code will run not when signals are unblocked before select() (that is gone), but instead *after* the event loop wakes up with WL_LATCH_SET, and runs the handler code in the regular user context before dropping through to the rest of the main loop.
On Wed, Jan 11, 2023 at 5:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: > (I don't know why you didn't make it 0) (Oh, I see why it had to be non-zero to avoiding burning CPU, ignore that part.)
On Wed, Jan 11, 2023 at 05:20:38PM +1300, Thomas Munro wrote: > Is the problem here that SIGCHLD is processed ... > > PG_SETMASK(&UnBlockSig); <--- here? > > selres = select(nSockets, &rmask, NULL, NULL, &timeout); > > Meanwhile the SIGCHLD handler code says: > > * Was it the wal receiver? If exit status is zero (normal) or one > * (FATAL exit), we assume everything is all right just like normal > * backends. (If we need a new wal receiver, we'll start one at the > * next iteration of the postmaster's main loop.) > > ... which is true, but that won't be reached for a while in this case > if the timeout has already been set to 60s. Your patch makes that > 100ms, in that case, a time delay that by now attracts my attention > like a red rag to a bull (I don't know why you didn't make it 0). I think this is right. At the very least, it seems consistent with my observations. > I'm not sure, but if I got that right, then I think the whole problem > might automatically go away with CF #4032. The SIGCHLD processing > code will run not when signals are unblocked before select() (that is > gone), but instead *after* the event loop wakes up with WL_LATCH_SET, > and runs the handler code in the regular user context before dropping > through to the rest of the main loop. Yeah, with those patches, the problem goes away. IIUC the key part is that the postmaster's latch gets set when SIGCHLD is received, so even if SIGUSR1 and SIGCHLD are processed out of order, WalReceiverPID gets cleared and we try to restart it shortly thereafter. I find this much easier to reason about. I'll go ahead and withdraw this patch from the commitfest. Thanks for chiming in. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com