Thread: low wal_retrieve_retry_interval causes missed signals on Windows
I discussed this elsewhere [0], but I thought it deserved its own thread. After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed that some of the archiving tests began consistently failing on Windows. I believe the problem is that WaitForWALToBecomeAvailable() depends on the call to WaitLatch() for wal_retrieve_retry_interval to ensure that signals are dispatched (i.e., pgwin32_dispatch_queued_signals()). With a low retry interval, WaitForWALToBecomeAvailable() might skip the call to WaitLatch(), and the signals are never processed. The attached patch fixes this by always calling WaitLatch(), even if wal_retrieve_retry_interval milliseconds have already elapsed and the timeout is 0. [0] https://postgr.es/m/20221231235019.GA1223171%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 2023-01-10 22:11:16 -0800, Nathan Bossart wrote: > The attached patch fixes this by always calling WaitLatch(), even if > wal_retrieve_retry_interval milliseconds have already elapsed and the > timeout is 0. It doesn't seem right to call WaitLatch() just for that purpose - nor necessarily a complete fix. Given that we check for interrupts in other parts of recovery with HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the actual bug that HandleStartupProcInterrupt() doesn't contain the same black magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff: #ifndef WIN32 ... #else #define INTERRUPTS_PENDING_CONDITION() \ (unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \ unlikely(InterruptPending)) #endif /* Service interrupt, if one is pending and it's safe to service it now */ #define CHECK_FOR_INTERRUPTS() \ do { \ if (INTERRUPTS_PENDING_CONDITION()) \ ProcessInterrupts(); \ } while(0) Looks like we have that bug in quite a few places... Some are "protected" by unconditional WaitLatch() calls, but at least pgarch.c, checkpointer.c via CheckpointWriteDelay() seem borked. Greetings, Andres Freund
On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote: > Given that we check for interrupts in other parts of recovery with > HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the > actual bug that HandleStartupProcInterrupt() doesn't contain the same black > magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff: Yeah, this seems like a more comprehensive fix. I've attached a patch that adds this Windows signaling stuff to the HandleXXXInterrupts() functions in the files you listed. Is this roughly what you had in mind? If so, I'll look around for anywhere else it is needed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 2023-01-11 15:26:45 -0800, Nathan Bossart wrote: > On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote: > > Given that we check for interrupts in other parts of recovery with > > HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the > > actual bug that HandleStartupProcInterrupt() doesn't contain the same black > > magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff: > > Yeah, this seems like a more comprehensive fix. I've attached a patch that > adds this Windows signaling stuff to the HandleXXXInterrupts() functions in > the files you listed. Is this roughly what you had in mind? If so, I'll > look around for anywhere else it is needed. Yes, that's what I roughly was thinking of. Although seeing the diff, I think it might be worth introducing a helper function that'd containing at least pgwin32_dispatch_queued_signals() and ProcessProcSignalBarrier(). It's a bit complicated by ProcessProcSignalBarrier() only being applicable to shared memory connected processes - excluding e.g. pgarch. Greetings, Andres Freund
On Wed, Jan 11, 2023 at 04:40:14PM -0800, Andres Freund wrote: > On 2023-01-11 15:26:45 -0800, Nathan Bossart wrote: >> On Wed, Jan 11, 2023 at 12:48:36PM -0800, Andres Freund wrote: >> > Given that we check for interrupts in other parts of recovery with >> > HandleStartupProcInterrupt(), which doesn't interact with latches, isn't the >> > actual bug that HandleStartupProcInterrupt() doesn't contain the same black >> > magic that CHECK_FOR_INTERRUPTS() contains on windows? Namely this stuff: >> >> Yeah, this seems like a more comprehensive fix. I've attached a patch that >> adds this Windows signaling stuff to the HandleXXXInterrupts() functions in >> the files you listed. Is this roughly what you had in mind? If so, I'll >> look around for anywhere else it is needed. > > Yes, that's what I roughly was thinking of. Although seeing the diff, I think > it might be worth introducing a helper function that'd containing at least > pgwin32_dispatch_queued_signals() and ProcessProcSignalBarrier(). It's a bit > complicated by ProcessProcSignalBarrier() only being applicable to shared > memory connected processes - excluding e.g. pgarch. As of d75288f, the archiver should be connected to shared memory, so we might be in luck. I guess we'd need to watch out for this if we want to back-patch it beyond v14. I'll work on a patch... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com