Re: failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers
From | Andres Freund
Subject | Re: failures in t/031_recovery_conflict.pl on CI
Date |
Msg-id | 20220409220054.fqn5arvbeesmxdg5@alap3.anarazel.de
In response to | Re: failures in t/031_recovery_conflict.pl on CI (Andres Freund <andres@anarazel.de>)
Responses | Re: failures in t/031_recovery_conflict.pl on CI
List | pgsql-hackers
Hi,

On 2022-04-08 22:05:01 -0700, Andres Freund wrote:
> On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> > on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> > interesting, because I ran it there dozens if not hundreds of times before
> > commit, with - I think - only cosmetic changes.
>
> Scratch that part - I found an instance of the freebsd failure earlier, just
> didn't notice because that run failed for other reasons as well. So this might
> just have uncovered an older bug around recovery conflict handling,
> potentially platform dependent.
>
> I guess I'll try to reproduce it on freebsd...

That failed. Adding a bunch of debug statements, I think I might have found
at least some problems.

The code in LockBufferForCleanup() that actually interacts with other
backends is:

    /* Publish the bufid that Startup process waits on */
    SetStartupBufferPinWaitBufId(buffer - 1);
    /* Set alarm and then wait to be signaled by UnpinBuffer() */
    ResolveRecoveryConflictWithBufferPin();
    /* Reset the published bufid */
    SetStartupBufferPinWaitBufId(-1);

where ResolveRecoveryConflictWithBufferPin() sends procsignals to all other
backends and then waits with ProcWaitForSignal():

    void
    ProcWaitForSignal(uint32 wait_event_info)
    {
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                         wait_event_info);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

One problem is that we pretty much immediately get a SIGALRM whenever we're
in that WaitLatch(). The signal handler does a SetLatch(), interrupting the
WaitLatch(). The startup process then proceeds to
SetStartupBufferPinWaitBufId(-1).

In the unlucky cases, the backend holding the pin only processes the
interrupt (in RecoveryConflictInterrupt()) after the
SetStartupBufferPinWaitBufId(-1). The backend then sees that
HoldingBufferPinThatDelaysRecovery() returns false, and happily continues.

But that's not the whole story, I think. That race leads to conflicts being
handled more slowly, but on its own it should not result in timeouts.

Here's debugging output from a failing run, where I added a few debugging
statements:
https://api.cirrus-ci.com/v1/artifact/task/6179111512047616/log/src/test/recovery/tmp_check/log/000_0recovery_conflict_standby.log
https://github.com/anarazel/postgres/commit/212268e753093861aa22a51657c6598c65eeb81b

Curiously, there's only

    20644: received interrupt 11

which is PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, not
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN. I guess we've never gotten past the
standby timeout.
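The bufid race described above is easy to model outside PostgreSQL. Below is
a minimal standalone sketch (C with pthreads; every name in it is invented
for illustration and does not exist in PostgreSQL) of the unlucky ordering:
the "startup" thread publishes a wait id and signals the "backend", its
simulated timeout wakes it almost immediately, and it clears the id again
before the "backend" thread gets around to checking it.

    /* race_sketch.c - illustrative model only; not PostgreSQL code */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    /* models the shared "buffer id the startup process waits on" slot */
    static atomic_int startup_wait_buf_id = -1;
    /* models the conflict signal being delivered but not yet processed */
    static atomic_int conflict_signal_pending = 0;

    static void *startup_thread(void *arg)
    {
        (void) arg;
        /* publish the bufid, then signal the backend */
        atomic_store(&startup_wait_buf_id, 42);
        atomic_store(&conflict_signal_pending, 1);

        /* the simulated timeout fires almost immediately and wakes us,
         * long before the backend has processed the signal */
        usleep(1000);

        /* reset the published bufid again */
        atomic_store(&startup_wait_buf_id, -1);
        return NULL;
    }

    static void *backend_thread(void *arg)
    {
        (void) arg;
        /* the backend only gets around to its interrupt handling later */
        while (!atomic_load(&conflict_signal_pending))
            usleep(100);
        usleep(10000);          /* processed late: after the reset above */

        /* models the HoldingBufferPinThatDelaysRecovery() check: compare
         * our pinned buffer (42) against the already-cleared wait slot */
        if (atomic_load(&startup_wait_buf_id) == 42)
            printf("backend: blocking recovery, must release the pin\n");
        else
            printf("backend: not blocking recovery (apparently), continuing\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t s, b;

        pthread_create(&s, NULL, startup_thread, NULL);
        pthread_create(&b, NULL, backend_thread, NULL);
        pthread_join(s, NULL);
        pthread_join(b, NULL);
        return 0;
    }

With the delays as written, the backend nearly always takes the "not
blocking recovery" branch, mirroring how the real backend sees
HoldingBufferPinThatDelaysRecovery() return false after the reset. Back to
the failing run's standby log: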
    20644: received interrupt 11
    2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: one cycle of LockBufferForCleanup() iterating in HS
    2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
    2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: setting timeout() in 0 10000
    2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
    2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting latch()
    2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
    2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting timeout() in 0 3481
    2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
    20644: received interrupt 11
    2022-04-09 21:23:47.975 UTC [20631][walreceiver] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

So we sent a conflict interrupt, and then waited. And nothing happened.

What are we expecting to wake the startup process up, once it does
SendRecoveryConflictWithBufferPin()?

It's likely not the problem here, because we never seem to have even reached
that path, but afaics, once we've called disable_all_timeouts() at the bottom
of ResolveRecoveryConflictWithBufferPin(), then re-entered
ResolveRecoveryConflictWithBufferPin() and gone into the "We're already
behind, so clear a path as quickly as possible." path, there's no guarantee
that any timeout is still pending, is there? If there's either no backend
that we're still conflicting with (an entirely possible race condition), or
if there's e.g. a snapshot or database conflict, there's afaics nobody
setting the startup process's latch.

Greetings,

Andres Freund