Thread: failures in t/031_recovery_conflict.pl on CI
Hi, on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is interesting, because I ran it there dozens if not hundreds of times before commit, with - I think - only cosmetic changes. I've reproduced it in a private branch, with more logging. And the results are sure interesting. https://cirrus-ci.com/task/6448492666159104 https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log The primary is waiting for 0/343A000 to be applied, which requires a recovery conflict to be detected and resolved. On the standby there's the following sequence (some omitted): prerequisite for recovery conflict: 2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: BEGIN; 2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursorCURSOR FOR SELECT b FROM test_recovery_conflict_table1; 2022-04-09 04:05:31.293 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: FETCH FORWARD FROMtest_recovery_conflict_cursor; detecting the conflict: 2022-04-09 04:05:31.382 UTC [35038][startup] LOG: recovery still waiting after 28.821 ms: recovery conflict on buffer pin 2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 and then nothing until the timeout: 2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG: received immediate shutdown request 2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG: sending signal 3 to process 35071 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes 2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: server process (PID 35071) exited with exit code 2 Afaics that has to mean something is broken around sending, receiving or processing of recovery conflict interrupts. All the failures so far were on freebsd, from what I can see. There were other failures in other tests, but I think for reverted or fixed things. Except for not previously triggering while the shmstats patch was in development, it's hard to tell whether it's a regression or just a longstanding bug - we never had tests for recovery conflicts... I don't really see how recovery prefetching could play a role here, clearly we've been trying to replay the record. So we're elsewhere... Greetings, Andres Freund https://cirrus-ci.com/github/postgres/postgres/master
Hi,

On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> interesting, because I ran it there dozens if not hundreds of times before
> commit, with - I think - only cosmetic changes.

Scratch that part - I found an instance of the freebsd failure earlier, just didn't notice because that run failed for other reasons as well. So this might just have uncovered an older bug around recovery conflict handling, potentially platform dependent.

I guess I'll try to reproduce it on freebsd...

Greetings,

Andres Freund
Hi,

On 2022-04-08 22:05:01 -0700, Andres Freund wrote:
> On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> > on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> > interesting, because I ran it there dozens if not hundreds of times before
> > commit, with - I think - only cosmetic changes.
>
> Scratch that part - I found an instance of the freebsd failure earlier, just
> didn't notice because that run failed for other reasons as well. So this might
> just have uncovered an older bug around recovery conflict handling,
> potentially platform dependent.
>
> I guess I'll try to reproduce it on freebsd...

That failed. Adding a bunch of debug statements, I think I might have found at least some problems.

The code in LockBufferForCleanup() that actually interacts with other backends is:

		/* Publish the bufid that Startup process waits on */
		SetStartupBufferPinWaitBufId(buffer - 1);

		/* Set alarm and then wait to be signaled by UnpinBuffer() */
		ResolveRecoveryConflictWithBufferPin();

		/* Reset the published bufid */
		SetStartupBufferPinWaitBufId(-1);

where ResolveRecoveryConflictWithBufferPin() sends procsignals to all other backends and then waits with ProcWaitForSignal():

	void
	ProcWaitForSignal(uint32 wait_event_info)
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
						 wait_event_info);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}

One problem is that we pretty much immediately get a SIGALRM whenever we're in that WaitLatch(). Which does a SetLatch(), interrupting the WaitLatch(). The startup process then proceeds to SetStartupBufferPinWaitBufId(-1).

In the unlucky cases, the backend holding the pin only processes the interrupt (in RecoveryConflictInterrupt()) after the SetStartupBufferPinWaitBufId(-1). The backend then sees that HoldingBufferPinThatDelaysRecovery() returns false, and happily continues.

But that's not the whole story, I think. It's a problem leading to conflicts being handled more slowly, but it shouldn't eventually result in timeouts.

Here's debugging output from a failing run, where I added a few debugging statements:

https://api.cirrus-ci.com/v1/artifact/task/6179111512047616/log/src/test/recovery/tmp_check/log/000_0recovery_conflict_standby.log
https://github.com/anarazel/postgres/commit/212268e753093861aa22a51657c6598c65eeb81b

Curiously, there's only

	20644: received interrupt 11

Which is PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, not PROCSIG_RECOVERY_CONFLICT_BUFFERPIN. I guess we've never gotten past the standby timeout.

20644: received interrupt 11
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: one cycle of LockBufferForCleanup() iterating in HS
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: setting timeout() in 0 10000
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting latch()
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting timeout() in 0 3481
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
20644: received interrupt 11
2022-04-09 21:23:47.975 UTC [20631][walreceiver] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

So we sent a conflict interrupt, and then waited. And nothing happened.

What are we expecting to wake the startup process up, once it does SendRecoveryConflictWithBufferPin()?

It's likely not the problem here, because we never seem to have even reached that path, but afaics once we've called disable_all_timeouts() at the bottom of ResolveRecoveryConflictWithBufferPin() and then re-entered ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind, so clear a path as quickly as possible." path, there's no guarantee for any timeout to be pending anymore?

If there's either no backend that we're still conflicting with (an entirely possible race condition), or if there's e.g. a snapshot or database conflict, there's afaics nobody setting the startup process's latch.

Greetings,

Andres Freund
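To make the race concrete, here is a rough sketch of the backend-side handling described above - paraphrased from the PROCSIG_RECOVERY_CONFLICT_BUFFERPIN case in RecoveryConflictInterrupt(), not the exact tree code, which differs in detail across branches:

	case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
		/*
		 * HoldingBufferPinThatDelaysRecovery() compares our pinned buffers
		 * against the bufid the startup process published via
		 * SetStartupBufferPinWaitBufId().  This interrupt can be serviced
		 * an arbitrarily long time after it was sent; if the startup
		 * process has meanwhile reset the published bufid to -1 (because
		 * the SIGALRM-driven SetLatch() ended its WaitLatch() early), the
		 * check returns false and the conflict is silently dropped: the
		 * backend keeps its pin and recovery has to detect the conflict
		 * all over again.
		 */
		if (!HoldingBufferPinThatDelaysRecovery())
			return;
		MyProc->recoveryConflictPending = true;
		/* ... fall through to request query cancel at the next CFI ... */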
Hi,

On 2022-04-09 15:00:54 -0700, Andres Freund wrote:
> What are we expecting to wake the startup process up, once it does
> SendRecoveryConflictWithBufferPin()?
>
> It's likely not the problem here, because we never seem to have even reached
> that path, but afaics once we've called disable_all_timeouts() at the bottom
> of ResolveRecoveryConflictWithBufferPin() and then re-entered
> ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
> so clear a path as quickly as possible." path, there's no guarantee for any
> timeout to be pending anymore?
>
> If there's either no backend that we're still conflicting with (an entirely
> possible race condition), or if there's e.g. a snapshot or database conflict,
> there's afaics nobody setting the startup process's latch.

It's not that (although I still suspect it's a problem). It's a self-deadlock, because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin() *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls, if the deadlock timeout is reached.

To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:

(gdb) bt
#0  __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:87
#1  0x00007fd4ce5a215b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fd4ce59e44f in do_futex_wait (sem=sem@entry=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#3  0x00007fd4ce59e4e8 in __new_sem_wait_slow64 (sem=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#4  0x00007fd4cf20d823 in PGSemaphoreLock (sema=0x7fd4cb001138) at pg_sema.c:327
#5  0x00007fd4cf2dd675 in LWLockAcquire (lock=0x7fd4cb001600, mode=LW_EXCLUSIVE) at /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1324
#6  0x00007fd4cf2c36e7 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3638
#7  0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#8  0x00007fd4cf2cc69d in StandbyTimeoutHandler () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:911
#9  0x00007fd4cf4e68d7 in handle_sig_alarm (postgres_signal_arg=14) at /home/andres/src/postgresql/src/backend/utils/misc/timeout.c:421
#10 <signal handler called>
#11 0x00007fd4cddfffc4 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e5561c0) at ../sysdeps/unix/sysv/linux/select.c:71
#12 0x00007fd4cf52ea2a in pg_usleep (microsec=10000) at /home/andres/src/postgresql/src/port/pgsleep.c:56
#13 0x00007fd4cf2c36f1 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, conflictPending=false) at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3640
#14 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK) at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#15 0x00007fd4cf2cc50f in ResolveRecoveryConflictWithBufferPin () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:820
#16 0x00007fd4cf2a996f in LockBufferForCleanup (buffer=43) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4336
#17 0x00007fd4ceec911c in XLogReadBufferForRedoExtended (record=0x7fd4d106a618, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true, buf=0x7ffc6e556394) at /home/andres/src/postgresql/src/backend/access/transam/xlogutils.c:427
#18 0x00007fd4cee1aa41 in heap_xlog_prune (record=0x7fd4d106a618) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:8634

It's reproducible on Linux.

I'm lacking words I dare to put in an email to describe how bad an idea it is to call CancelDBBackends() from within a timeout function, particularly when the function enabling the timeout also calls that function itself. Before disabling timers.

I ...

Greetings,

Andres Freund
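Reduced to a skeleton, the recursion visible in that backtrace looks like this (names taken from the trace; bodies elided and heavily simplified):

	/* mainline: reached from ResolveRecoveryConflictWithBufferPin()
	 * when the deadlock timeout fires (frames #15..#13) */
	void
	CancelDBBackends(Oid databaseid, ProcSignalReason sigmode,
					 bool conflictPending)
	{
		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

		/* ... SIGALRM can arrive anywhere in here; the pg_usleep(10000)
		 * added above merely widens the window ... */

		LWLockRelease(ProcArrayLock);
	}

	/* standby-delay timeout handler, run from handle_sig_alarm (frame #9) */
	static void
	StandbyTimeoutHandler(void)
	{
		/*
		 * Re-enters CancelDBBackends() via SendRecoveryConflictWithBufferPin()
		 * (frames #8..#6).  The interrupted mainline already holds
		 * ProcArrayLock, so the nested LWLockAcquire() blocks forever,
		 * waiting on a lock held by the very process stuck in this handler.
		 */
		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
	}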
Hi,

On 2022-04-09 16:10:02 -0700, Andres Freund wrote:
> It's not that (although I still suspect it's a problem). It's a self-deadlock,
> because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
> *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does
> CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls,
> if the deadlock timeout is reached.
>
> To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:
>
> [... backtrace ... ]
>
> It's reproducible on Linux.
>
> I'm lacking words I dare to put in an email to describe how bad an idea it is
> to call CancelDBBackends() from within a timeout function, particularly when
> the function enabling the timeout also calls that function itself. Before
> disabling timers.

It's been broken in different ways all the way back to 9.0, from what I can see, but I didn't check every single version.

Afaics the fix is to nuke the idea of doing anything substantial in the signal handler from orbit, and instead just set a flag in the handler. Then check whether the timeout happened after ProcWaitForSignal() and call SendRecoveryConflictWithBufferPin().

Also worth noting that the disable_all_timeouts() calls appear to break STARTUP_PROGRESS_TIMEOUT.

Greetings,

Andres Freund
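The proposed shape would be roughly the following - a minimal sketch only, with illustrative flag names, not the committed patch:

	/* set from the SIGALRM handlers; nothing else may happen there */
	static volatile sig_atomic_t got_standby_delay_timeout = false;
	static volatile sig_atomic_t got_standby_deadlock_timeout = false;

	static void
	StandbyTimeoutHandler(void)
	{
		/* handle_sig_alarm() sets MyLatch, which wakes ProcWaitForSignal() */
		got_standby_delay_timeout = true;
	}

	static void
	StandbyDeadLockHandler(void)
	{
		got_standby_deadlock_timeout = true;
	}

	void
	ResolveRecoveryConflictWithBufferPin(void)
	{
		/* ... enable the deadlock and standby-delay timeouts as before ... */

		/* woken by UnpinBuffer()'s SetLatch() or by a SIGALRM */
		ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

		/*
		 * Back in straight-line code, where it's safe to take LWLocks
		 * and signal other backends.
		 */
		if (got_standby_delay_timeout)
			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
		else if (got_standby_deadlock_timeout)
			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

		got_standby_delay_timeout = false;
		got_standby_deadlock_timeout = false;

		/* ... disable timeouts ... */
	}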
Andres Freund <andres@anarazel.de> writes:
> It's been broken in different ways all the way back to 9.0, from what I can
> see, but I didn't check every single version.

> Afaics the fix is to nuke the idea of doing anything substantial in the signal
> handler from orbit, and instead just set a flag in the handler.

+1. This is probably more feasible given the latch infrastructure than it was when that code was first written.

			regards, tom lane
Hi,

On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > It's been broken in different ways all the way back to 9.0, from what I can
> > see, but I didn't check every single version.
>
> > Afaics the fix is to nuke the idea of doing anything substantial in the signal
> > handler from orbit, and instead just set a flag in the handler.
>
> +1. This is probably more feasible given the latch infrastructure
> than it was when that code was first written.

What do you think about just reordering the disable_all_timeouts() to be before the got_standby_deadlock_timeout check in the back branches? I think that should close at least the most obvious hole. And fix it properly in HEAD?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
>> +1. This is probably more feasible given the latch infrastructure
>> than it was when that code was first written.

> What do you think about just reordering the disable_all_timeouts() to be
> before the got_standby_deadlock_timeout check in the back branches? I think
> that should close at least the most obvious hole. And fix it properly in
> HEAD?

I don't have much faith in that, and I don't see why we can't fix it properly. Don't we just need to have the signal handler set MyLatch, and then do the unsafe stuff back in the "if (got_standby_deadlock_timeout)" stanza in mainline?

			regards, tom lane
Hi,

On 2022-04-12 15:05:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> >> +1. This is probably more feasible given the latch infrastructure
> >> than it was when that code was first written.
>
> > What do you think about just reordering the disable_all_timeouts() to be
> > before the got_standby_deadlock_timeout check in the back branches? I think
> > that should close at least the most obvious hole. And fix it properly in
> > HEAD?
>
> I don't have much faith in that, and I don't see why we can't fix it
> properly.

It's not too hard, agreed. It's somewhat hard to test that it works, though. The new recovery conflict tests test all recovery conflicts except for deadlock ones, because they're not easy to trigger... But I think I nearly got it working reliably. It's probably worth backpatching the tests, after stripping them of the stats checks?

Three questions:

- For HEAD we have to replace the disable_all_timeouts() calls, it breaks the replay progress reporting. Is there a reason to keep them in the backbranches? Hard to see how an extension or something could rely on it, but ...?

- I named the variable set by the signal handler got_standby_delay_timeout, because just got_standby_timeout looked weird beside the _deadlock_ one. Which makes me think that we should rename STANDBY_TIMEOUT to STANDBY_DELAY_TIMEOUT too?

- There's the following comment in ResolveRecoveryConflictWithBufferPin():

  "We assume that only UnpinBuffer() and the timeout requests established
  above can wake us up here."

  That's bogus afaict? There's plenty of other things that can cause MyProc->latch to be set. Is it worth doing something about this at the same time? Right now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid succession initially.

Greetings,

Andres Freund
Hi,

Attached are patches for this issue. They add a test case for deadlock conflicts to make sure that case isn't broken. I also tested the recovery conflict tests in the back branches, and they work there with a reasonably small set of changes.

Questions:

- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though preceding numbers are unused. It seems way more problematic to use a different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the backbranches. For now I've just inlined the implementation, but I guess we could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
  - older branches don't have allow_in_place_tablespaces - I think just skipping tablespace conflicts is fine, they're comparatively simple. Eventually it might make sense to backpatch allow_in_place_tablespaces, our test coverage in the area is quite poor.
  - the stats tests can't easily be made reliable in the backbranches - which is ok, as the conflict itself is verified via the log
  - some branches don't have log_recovery_conflict_waits; since it's not critical to the test, it's ok to just not include it there

I played with the idea of handling the differences using version comparisons in the code, and having the test be identical across branches. Since it's something we don't do so far, I'm leaning against it, but ...

> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
>   replay progress reporting. Is there a reason to keep them in the
>   backbranches? Hard to see how an extension or something could rely on it,
>   but ...?

I've left it as is for now, will start a separate thread.

> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
>   "We assume that only UnpinBuffer() and the timeout requests established
>   above can wake us up here."
>
>   That's bogus afaict? There's plenty of other things that can cause MyProc->latch
>   to be set. Is it worth doing something about this at the same time? Right
>   now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
>   succession initially.

The comment is more recent than I had realized. I raised this separately in https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de

pgindent uses some crazy formatting nearby:

	SendRecoveryConflictWithBufferPin(
									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one SendRecoveryConflictWithBufferPin() call instead of two, storing the type of conflict in a local variable? Doesn't look entirely pretty either, but ...

I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(), btw. But that'd be a non-backpatchable cleanup, I think:

	 * The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
	 * (when not InHotStandby) is performed here, for code clarity.

Greetings,

Andres Freund
Attachment
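To illustrate the cleanup being contemplated - one SendRecoveryConflictWithBufferPin() call with the conflict type in a local variable - a rough sketch (variable name illustrative, not the actual patch):

	ProcSignalReason reason;

	if (got_standby_delay_timeout)
		reason = PROCSIG_RECOVERY_CONFLICT_BUFFERPIN;
	else
		reason = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK;
	SendRecoveryConflictWithBufferPin(reason);

This keeps each call on one line, at the cost of an extra local variable.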
Andres Freund <andres@anarazel.de> writes:
> Questions:
> - I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
>   preceding numbers are unused. It seems way more problematic to use a
>   different number in the backbranches than have gaps?

+1

> - The test uses pump_until() and wait_for_log(), which don't exist in the
>   backbranches. For now I've just inlined the implementation, but I guess we
>   could also backpatch their introduction?

I'd backpatch --- seems unlikely this will be the last need for 'em.

> pgindent uses some crazy formatting nearby:
> 	SendRecoveryConflictWithBufferPin(
> 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I do not believe that that line break is pgindent's fault. If you just fold it into one line it should stay that way.

			regards, tom lane
Hi,

On 2022-04-29 19:26:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > - The test uses pump_until() and wait_for_log(), which don't exist in the
> >   backbranches. For now I've just inlined the implementation, but I guess we
> >   could also backpatch their introduction?
>
> I'd backpatch --- seems unlikely this will be the last need for 'em.

Done.

I ended up committing the extension of the test first, before the fix. I think that's the cause of the failure on longfin and serinus. Let's hope the situation improves with the now also committed (and backpatched) fix.

> > pgindent uses some crazy formatting nearby:
> > 	SendRecoveryConflictWithBufferPin(
> > 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
>
> I do not believe that that line break is pgindent's fault.

Oh - I'm fairly certain I've seen pgindent do that in the past. But you're right, it's not. Perhaps it was an older version of pgindent?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> I ended up committing the extension of the test first, before the fix. I think
> that's the cause of the failure on longfin and serinus. Let's hope the
> situation improves with the now also committed (and backpatched) fix.

longfin's definitely not very happy: four out of six tries have ended with

psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
                      ^
DETAIL: User was holding shared buffer pin for too long.
timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.

I can poke into that tomorrow, but are you sure that that isn't an expectable result?

			regards, tom lane
Hi,

On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I ended up committing the extension of the test first, before the fix. I think
> > that's the cause of the failure on longfin and serinus. Let's hope the
> > situation improves with the now also committed (and backpatched) fix.
>
> longfin's definitely not very happy: four out of six tries have ended with

Too bad :(

> psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
> LINE 1: SELECT * FROM test_recovery_conflict_table2;
>                       ^
> DETAIL: User was holding shared buffer pin for too long.
> timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.
>
> I can poke into that tomorrow, but are you sure that that isn't an
> expectable result?

It's not expected. But I think I might see what the problem is:

	$psql_standby{stdin} .= qq[
	    BEGIN;
	    -- hold pin
	    DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
	    FETCH FORWARD FROM $cursor1;
	    -- wait for lock held by prepared transaction
	    SELECT * FROM $table2;
	    ];
	ok( pump_until(
	        $psql_standby{run},     $psql_timeout,
	        \$psql_standby{stdout}, qr/^1$/m,),
	    "$sect: cursor holding conflicting pin, also waiting for lock, established"
	);

We wait for the FETCH (and thus the buffer pin to be acquired). But that doesn't guarantee that the lock has been acquired. We can't check that with pump_until() afaics, because there'll not be any output. But a query_until() checking pg_locks should do the trick?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
>> I can poke into that tomorrow, but are you sure that that isn't an
>> expectable result?

> It's not expected. But I think I might see what the problem is:
> We wait for the FETCH (and thus the buffer pin to be acquired). But that
> doesn't guarantee that the lock has been acquired. We can't check that with
> pump_until() afaics, because there'll not be any output. But a query_until()
> checking pg_locks should do the trick?

Irritatingly, it doesn't reproduce (at least not easily) in a manual build on the same box. So it's almost surely a timing issue, and your theory here seems plausible.

			regards, tom lane
On 2022-May-02, Andres Freund wrote:

> > > pgindent uses some crazy formatting nearby:
> > > 	SendRecoveryConflictWithBufferPin(
> > > 									  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
> >
> > I do not believe that that line break is pgindent's fault.
>
> Oh - I'm fairly certain I've seen pgindent do that in the past. But you're
> right, it's not. Perhaps it was an older version of pgindent?

No, it's never done that. We used to fold lines that way manually, because pgindent used to push the argument to the left, but that changed at commit 382ceffdf7f6.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Greatness is a transitory experience. It is never consistent. It depends
in large part on the myth-making imagination of humankind." (Irulan)
Hi,

On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> >> I can poke into that tomorrow, but are you sure that that isn't an
> >> expectable result?
>
> > It's not expected. But I think I might see what the problem is:
> > We wait for the FETCH (and thus the buffer pin to be acquired). But that
> > doesn't guarantee that the lock has been acquired. We can't check that with
> > pump_until() afaics, because there'll not be any output. But a query_until()
> > checking pg_locks should do the trick?
>
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.

Odd, given how readily it seems to reproduce on the bf. I assume you built with

> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

> So it's almost surely a timing issue, and your theory here seems plausible.

Unfortunately I don't think my theory holds, because I actually had added a defense against this into the test that I forgot about momentarily...

	# just to make sure we're waiting for lock already
	ok( $node_standby->poll_query_until(
	        'postgres', qq[
	SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
	], 'waiting'),
	    "$sect: lock acquisition is waiting");

and on longfin that step completes successfully.

I think what happens is that we get a buffer pin conflict, because these days we can actually process buffer pin conflicts while waiting for a lock. The easiest way to get around that is to increase the replay timeout for that test, I think? I think we need a restart, not a reload, because reloads aren't guaranteed to be processed at any certain point in time :/.

Testing a fix in a variety of timing circumstances now...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
>> Irritatingly, it doesn't reproduce (at least not easily) in a manual
>> build on the same box.

> Odd, given how readily it seems to reproduce on the bf. I assume you built with
>> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Yeah, I copied all that stuff ...

>> So it's almost surely a timing issue, and your theory here seems plausible.

> Unfortunately I don't think my theory holds, because I actually had added a
> defense against this into the test that I forgot about momentarily...

Oh, hm. I can try harder to repro it.

			regards, tom lane
Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> So it's almost surely a timing issue, and your theory here seems plausible.
>
> > Unfortunately I don't think my theory holds, because I actually had added a
> > defense against this into the test that I forgot about momentarily...
>
> Oh, hm. I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's probably not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem. Couldn't repro it with it applied, under both rr and valgrind.

My current problem is that I'm running into some IPC::Run issues (bug?). I get "ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after "-- wait for lock held by prepared transaction". It doesn't happen without that debugging thing, but it makes me worried that it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the failure to happen inside pump_nb(), which seems like it shouldn't error out just because the child process exited... I *think* it might not happen without the sleep. But I'm not at all confident.

In general I'm kinda worried about how much effectively unmaintained Perl stuff we're depending on :(

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> Attached is a fix for the test that I think should avoid the problem. Couldn't
> repro it with it applied, under both rr and valgrind.

May I ask where we're at on this? Next week's back-branch release is getting uncomfortably close, and I'm still seeing various buildfarm animals erratically failing on 031_recovery_conflict.pl. Should we just remove that test from the back branches for now?

Also, it appears that the back-patch of pump_until failed to remove some pre-existing copies, eg check-world in v14 now reports

Subroutine pump_until redefined at t/013_crash_restart.pl line 248.
Subroutine pump_until redefined at t/022_crash_temp_files.pl line 272.

I didn't check whether these are exact duplicates.

			regards, tom lane
Hi,

On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Attached is a fix for the test that I think should avoid the problem. Couldn't
> > repro it with it applied, under both rr and valgrind.
>
> May I ask where we're at on this? Next week's back-branch release is
> getting uncomfortably close, and I'm still seeing various buildfarm
> animals erratically failing on 031_recovery_conflict.pl.

Yea, sorry. I had crappy internet connectivity / device access for the last few days, making it hard to make progress.

Looks like the problems are gone on HEAD at least.

> Should we just remove that test from the back branches for now?

That might be the best course, marking the test as TODO perhaps? Unfortunately a pg_ctl reload isn't processed reliably by the time the next test steps execute (saw that once when running in a loop), and a restart causes other problems (throws stats away).

> Also, it appears that the back-patch of pump_until failed to remove
> some pre-existing copies, eg check-world in v14 now reports
> Subroutine pump_until redefined at t/013_crash_restart.pl line 248.
> Subroutine pump_until redefined at t/022_crash_temp_files.pl line 272.
>
> I didn't check whether these are exact duplicates.

They're not quite identical copies, which is why I left them in place. But the warnings clearly make that a bad idea. I somehow mis-extrapolated from Cluster.pm, where it's not a problem (accessed via object). I'll remove them.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
>> May I ask where we're at on this? Next week's back-branch release is
>> getting uncomfortably close, and I'm still seeing various buildfarm
>> animals erratically failing on 031_recovery_conflict.pl.

> Looks like the problems are gone on HEAD at least.

It does look that way, although the number of successes is not large yet.

>> Should we just remove that test from the back branches for now?

> That might be the best course, marking the test as TODO perhaps?

I poked closer and saw that you reverted 5136967f1 et al because (I suppose) adjust_conf is not there in the back branches. While I'd certainly support back-patching that functionality, I think we need to have a discussion about how to do it. I wonder whether we shouldn't drop src/test/perl/PostgreSQL/... into the back branches in toto and make the old test APIs into a wrapper around the new ones instead of vice versa. But that's definitely not a task to undertake three days before a release deadline.

So I reluctantly vote for removing 031_recovery_conflict.pl in the back branches for now, with the expectation that we'll fix the infrastructure and put it back after the current release round is done.

			regards, tom lane
Hi,

On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 22:07:40 -0400, Tom Lane wrote:
> >> May I ask where we're at on this? Next week's back-branch release is
> >> getting uncomfortably close, and I'm still seeing various buildfarm
> >> animals erratically failing on 031_recovery_conflict.pl.
>
> > Looks like the problems are gone on HEAD at least.
>
> It does look that way, although the number of successes is not large yet.
>
> >> Should we just remove that test from the back branches for now?
>
> > That might be the best course, marking the test as TODO perhaps?
>
> I poked closer and saw that you reverted 5136967f1 et al because
> (I suppose) adjust_conf is not there in the back branches.

Yea. That one was a stupid mistake, working outside my usual environment. It's easy enough to work around adjust_conf not existing (just appending works), but then there's subsequent test failures...

> So I reluctantly vote for removing 031_recovery_conflict.pl in the
> back branches for now, with the expectation that we'll fix the
> infrastructure and put it back after the current release round
> is done.

What about instead marking the flapping test TODO? That'd still give us most of the coverage...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
>> So I reluctantly vote for removing 031_recovery_conflict.pl in the
>> back branches for now, with the expectation that we'll fix the
>> infrastructure and put it back after the current release round
>> is done.

> What about instead marking the flapping test TODO? That'd still give us most
> of the coverage...

Are you sure there's just one test that's failing? I haven't checked the buildfarm history close enough to be sure of that. But if it's true, disabling just that one would be fine (again, as a stopgap measure).

			regards, tom lane
Hi,

On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 23:36:22 -0400, Tom Lane wrote:
> >> So I reluctantly vote for removing 031_recovery_conflict.pl in the
> >> back branches for now, with the expectation that we'll fix the
> >> infrastructure and put it back after the current release round
> >> is done.
>
> > What about instead marking the flapping test TODO? That'd still give us most
> > of the coverage...
>
> Are you sure there's just one test that's failing? I haven't checked
> the buildfarm history close enough to be sure of that. But if it's
> true, disabling just that one would be fine (again, as a stopgap
> measure).

I looked through all the failures I found and it's two kinds of failures, both related to the deadlock test. So I'm thinking of skipping just that test as in the attached.

Working on committing / backpatching that, unless somebody suggests changes quickly...

Greetings,

Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes:
> I looked through all the failures I found and it's two kinds of failures, both
> related to the deadlock test. So I'm thinking of skipping just that test as in
> the attached.

> Working on committing / backpatching that, unless somebody suggests changes
> quickly...

WFM.

			regards, tom lane
On 2022-05-06 12:12:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I looked through all the failures I found and it's two kinds of failures, both
> > related to the deadlock test. So I'm thinking of skipping just that test as in
> > the attached.
>
> > Working on committing / backpatching that, unless somebody suggests changes
> > quickly...
>
> WFM.

Done. Perhaps you could trigger a run on longfin, that seems to have been the most reliably failing animal?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Done. Perhaps you could trigger a run on longfin, that seems to have been the
> most reliably failing animal?

No need, its cron job launched already.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
>> Are you sure there's just one test that's failing? I haven't checked
>> the buildfarm history close enough to be sure of that. But if it's
>> true, disabling just that one would be fine (again, as a stopgap
>> measure).

> I looked through all the failures I found and it's two kinds of failures, both
> related to the deadlock test. So I'm thinking of skipping just that test as in
> the attached.

Per lapwing's latest results [1], this wasn't enough. I'm again thinking we should pull the whole test from the back branches.

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2022-05-07%2016%3A40%3A04
Hi,

On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-05 23:57:28 -0400, Tom Lane wrote:
> >> Are you sure there's just one test that's failing? I haven't checked
> >> the buildfarm history close enough to be sure of that. But if it's
> >> true, disabling just that one would be fine (again, as a stopgap
> >> measure).
>
> > I looked through all the failures I found and it's two kinds of failures, both
> > related to the deadlock test. So I'm thinking of skipping just that test as in
> > the attached.
>
> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
> we should pull the whole test from the back branches.

That failure is different from the earlier failures though. I don't think it's a timing issue in the test like the deadlock check one. I rather suspect it's indicative of further problems in this area. Potentially the known problem with RecoveryConflictInterrupt() running in the signal handler? I think Thomas has a patch for that...

One failure in ~20 runs, on one animal doesn't seem worth disabling the test for.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
>> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
>> we should pull the whole test from the back branches.

> That failure is different from the earlier failures though. I don't think it's
> a timing issue in the test like the deadlock check one. I rather suspect it's
> indicative of further problems in this area.

Yeah, that was my guess too.

> Potentially the known problem
> with RecoveryConflictInterrupt() running in the signal handler? I think Thomas
> has a patch for that...

Maybe; or given that it's on v10, it could be telling us about some yet-other problem we perhaps solved since then without realizing it needed to be back-patched.

> One failure in ~20 runs, on one animal doesn't seem worth disabling the test
> for.

No one is going to thank us for shipping a known-unstable test case. It does nothing to fix the problem; all it will lead to is possible failures during package builds. I have no idea whether any packagers use "make check-world" rather than just "make check" while building. But if they do, even fairly low-probability failures can be problematic.

(I still carry the scars I acquired while working at Red Hat and being responsible for packaging mysql: at least back then, their test suite was full of cases that mostly worked fine, except when getting stressed in Red Hat's build farm. Dealing with a test suite that fails 50% of the time under load, while trying to push out an urgent security fix, is NOT a pleasant situation.)

I'm happy to have this test in the stable branches once we have committed fixes that address all known problems. Until then, it will just be a nuisance for anyone who is not a developer working on those problems.

			regards, tom lane
Hi,

On 2022-05-08 13:59:09 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-08 11:28:34 -0400, Tom Lane wrote:
> >> Per lapwing's latest results [1], this wasn't enough. I'm again thinking
> >> we should pull the whole test from the back branches.
>
> > That failure is different from the earlier failures though. I don't think it's
> > a timing issue in the test like the deadlock check one. I rather suspect it's
> > indicative of further problems in this area.
>
> Yeah, that was my guess too.
>
> > Potentially the known problem
> > with RecoveryConflictInterrupt() running in the signal handler? I think Thomas
> > has a patch for that...
>
> Maybe; or given that it's on v10, it could be telling us about some
> yet-other problem we perhaps solved since then without realizing
> it needed to be back-patched.
>
> > One failure in ~20 runs, on one animal doesn't seem worth disabling the test
> > for.
>
> No one is going to thank us for shipping a known-unstable test case.

IDK, hiding failures indicating bugs isn't really better, at least if it doesn't look like a bug in the test. But you seem to have a stronger opinion on this than me, so I'll skip the entire test for now :/

Greetings,

Andres Freund
On 2022-05-08 15:11:39 -0700, Andres Freund wrote:
> But you seem to have a stronger opinion on this than me, so I'll skip the
> entire test for now :/

And done.
Andres Freund <andres@anarazel.de> writes:
> On 2022-05-08 15:11:39 -0700, Andres Freund wrote:
>> But you seem to have a stronger opinion on this than me, so I'll skip the
>> entire test for now :/

> And done.

Thanks, I appreciate that.

			regards, tom lane
On 2022-May-08, Andres Freund wrote:

> On 2022-05-08 13:59:09 -0400, Tom Lane wrote:
> > No one is going to thank us for shipping a known-unstable test case.
>
> IDK, hiding failures indicating bugs isn't really better, at least if it
> doesn't look like a bug in the test. But you seem to have a stronger opinion
> on this than me, so I'll skip the entire test for now :/

Hey, I just noticed that these tests are still disabled. The next minors are coming soon; should we wait until *those* are done and then re-enable; or re-enable them now to see how they fare and then re-disable before the next minors if there's still problems we don't find fixes for?

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."  (ncm, http://lwn.net/Articles/174769/)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hey, I just noticed that these tests are still disabled. The next
> minors are coming soon; should we wait until *those* are done and then
> re-enable; or re-enable them now to see how they fare and then
> re-disable before the next minors if there's still problems we don't
> find fixes for?

Maybe I missed it, but I don't think anything's been done to fix the test's problems in the back branches.

			regards, tom lane
Hi,

On 2022-07-26 12:47:38 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Hey, I just noticed that these tests are still disabled. The next
> > minors are coming soon; should we wait until *those* are done and then
> > re-enable; or re-enable them now to see how they fare and then
> > re-disable before the next minors if there's still problems we don't
> > find fixes for?
>
> Maybe I missed it, but I don't think anything's been done to fix the
> test's problems in the back branches.

Yea, I don't think so either. What's worse, there are several actual problems in the recovery conflict code in all branches - which cause occasional failures of the test in HEAD as well.

There's one big set of fixes in
https://www.postgresql.org/message-id/CA%2BhUKGKrLKx7Ky1T_FHk-Y729K0oie-gOXKCbxCXyjbPDJAOOw%40mail.gmail.com
and I suspect it'll be hard to have the test be fully reliable without addressing
https://postgr.es/m/20220715172938.t7uivghdx5vj36cn%40awork3.anarazel.de

Greetings,

Andres Freund
Hello.

Even after applying the patch, we are still facing an "ack Broken pipe" problem. It occurs on the arm64 platform, presumably under high load. Here is a log snippet from buildfarm:

...
[19:08:12.150](0.394s) ok 13 - startup deadlock: cursor holding conflicting pin, also waiting for lock, established
[19:08:12.208](0.057s) ok 14 - startup deadlock: lock acquisition is waiting
Waiting for replication conn standby's replay_lsn to pass 0/33F5FE0 on primary
done
psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
                      ^
DETAIL: User transaction caused buffer deadlock with recovery.
[19:08:12.319](0.112s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
ack Broken pipe: write( 13, '\\q ' ) at /usr/share/perl5/IPC/Run/IO.pm line 550.
### Stopping node "primary" using mode immediate
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/recovery/tmp_check/t_031_recovery_conflict_primary_data/pgdata -m immediate stop
waiting for server to shut down... done
server stopped
# No postmaster PID for node "primary"
### Stopping node "standby" using mode immediate
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/REL_15_STABLE/pgsql.build/src/test/recovery/tmp_check/t_031_recovery_conflict_standby_data/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby"
[19:08:12.450](0.131s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:08:12.450](0.000s) # Looks like your test exited with 32 just after 15.
...

Below is a test report:

...
[20:46:35] t/030_stats_cleanup_replica.pl ....... ok     9956 ms ( 0.01 usr 0.00 sys + 3.49 cusr 2.49 csys = 5.99 CPU)
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 32 just after 15.
[20:46:43] t/031_recovery_conflict.pl ........... Dubious, test returned 32 (wstat 8192, 0x2000)
All 15 subtests passed
[20:46:56] t/032_relfilenode_reuse.pl ........... ok    13625 ms ( 0.01 usr 0.00 sys + 3.53 cusr 2.39 csys = 5.93 CPU)
[20:47:03] t/110_wal_sender_check_crc.pl ........ ok     6421 ms ( 0.00 usr 0.00 sys + 3.20 cusr 1.87 csys = 5.07 CPU)
[20:47:03]

Test Summary Report
-------------------
t/031_recovery_conflict.pl         (Wstat: 8192 Tests: 15 Failed: 0)
  Non-zero exit status: 32
  Parse errors: No plan found in TAP output

-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Tuesday, May 3, 2022 11:13 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Robert Haas <robertmhaas@gmail.com>; pgsql-hackers@postgresql.org; Thomas Munro <thomas.munro@gmail.com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> So it's almost surely a timing issue, and your theory here seems plausible.
>
> > Unfortunately I don't think my theory holds, because I actually had
> > added a defense against this into the test that I forgot about momentarily...
>
> Oh, hm. I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's probably not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem. Couldn't repro it with it applied, under both rr and valgrind.

My current problem is that I'm running into some IPC::Run issues (bug?). I get "ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after "-- wait for lock held by prepared transaction". It doesn't happen without that debugging thing, but it makes me worried that it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the failure to happen inside pump_nb(), which seems like it shouldn't error out just because the child process exited... I *think* it might not happen without the sleep. But I'm not at all confident.

In general I'm kinda worried about how much effectively unmaintained Perl stuff we're depending on :(

Greetings,

Andres Freund