Thread: Unstable tests for recovery conflict handling
[ starting a new thread cuz the shared-stats one is way too long ]

Andres Freund <andres@anarazel.de> writes:
> Add minimal tests for recovery conflict handling.

It's been kind of hidden by other buildfarm noise, but
031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].

Three of those failures look like

[11:08:46.806](105.129s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/33EF190 on primary
[12:01:49.614](3182.807s) # poll_query_until timed out executing this query:
# SELECT '0/33EF190' <= replay_lsn AND state = 'streaming'
#        FROM pg_catalog.pg_stat_replication
#        WHERE application_name IN ('standby', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at t/031_recovery_conflict.pl line 123.

In each of these examples we can see in the standby's log that it
detected the expected buffer pin conflict:

2022-04-27 11:08:46.353 UTC [1961604][client backend][2/2:0] LOG:  statement: BEGIN;
2022-04-27 11:08:46.416 UTC [1961604][client backend][2/2:0] LOG:  statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
2022-04-27 11:08:46.730 UTC [1961604][client backend][2/2:0] LOG:  statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] LOG:  recovery still waiting after 13.367 ms: recovery conflict on buffer pin
2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] CONTEXT:  WAL redo at 0/33E6E80 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

but then nothing happens until the script times out and kills the test.
I think this is showing us a real bug, ie we sometimes fail to cancel
the conflicting query.

The other example [3] looks different:

[01:02:43.582](2.357s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/342C000 on primary
done
[01:02:43.747](0.165s) ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict
[01:02:43.804](0.057s) not ok 3 - buffer pin conflict: stats show conflict on standby
[01:02:43.805](0.000s)
[01:02:43.805](0.000s) #   Failed test 'buffer pin conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 295.
[01:02:43.805](0.000s) #          got: '0'
#     expected: '1'

Not sure what to make of that --- could there be a race condition in
the reporting of the conflict?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-27%2007%3A16%3A51
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-04-21%2021%3A20%3A15
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-04-13%2022%3A45%3A30
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-11%2005%3A40%3A41
> On Apr 27, 2022, at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> [ starting a new thread cuz the shared-stats one is way too long ]
>
> Andres Freund <andres@anarazel.de> writes:
>> Add minimal tests for recovery conflict handling.
>
> It's been kind of hidden by other buildfarm noise, but
> 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].
>
> Three of those failures look like

Interesting, I have been getting failures on REL_14_STABLE:

t/012_subtransactions.pl ............. 11/12
#   Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby'
#   at t/012_subtransactions.pl line 211.
#          got: '3'
#     expected: '0'
t/012_subtransactions.pl ............. 12/12 # Looks like you failed 1 test of 12.
t/012_subtransactions.pl ............. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/12 subtests

And the logs, tmp_check/log/regress_log_012_subtransactions, showing:

### Enabling streaming replication for node "primary"
### Starting node "primary"
# Running: pg_ctl -D /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/t_012_subtransactions_primary_data/pgdata -l /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/log/012_subtransactions_primary.log -o --cluster-name=primary start
waiting for server to start.... done
server started
# Postmaster PID for node "primary" is 46270
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_012_1" does not exist
not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby

#   Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby'
#   at t/012_subtransactions.pl line 211.
#          got: '3'
#     expected: '0'

This is quite consistent for me, but only when I configure with
--enable-coverage and --enable-dtrace.  (I haven't yet tried one of
those without the other.)  I wasn't going to report this yet, having
not yet completely narrowed this down, but I wonder if anybody else is
seeing this?

I'll try again on master....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Apr 27, 2022, at 10:11 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I'll try again on master....

Still with coverage and dtrace enabled, I get the same thing, except
that master formats the logs a bit differently:

# Postmaster PID for node "primary" is 19797
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_012_1" does not exist
[10:26:16.314](1.215s) not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby
[10:26:16.314](0.000s)
[10:26:16.314](0.000s) #   Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby'
[10:26:16.314](0.000s) #   at t/012_subtransactions.pl line 208.
[10:26:16.314](0.000s) #          got: '3'
#     expected: '0'

With coverage but not dtrace enabled, I still get the error, though the
log leading up to the error now has a bunch of coverage noise lines like:

profiling: /Users/mark.dilger/recovery_test/postgresql/src/backend/utils/sort/tuplesort.gcda: cannot merge previous GCDA file: corrupt arc tag

The error itself looks the same except the timing numbers differ a little.

With neither enabled, all tests pass.

I'm inclined to think that either the recovery code or the test has a
race condition, and that enabling coverage causes the race to come out
differently.  I'll keep poking....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
> It's been kind of hidden by other buildfarm noise, but
> 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].
> ...
> I think this is showing us a real bug, ie we sometimes fail to cancel
> the conflicting query.

After digging around in the code, I think this is almost certainly
some manifestation of the previously-complained-of problem [1] that
RecoveryConflictInterrupt is not safe to call in a signal handler,
leading the conflicting backend to sometimes decide that it's not
the problem.  That squares with the observation that skink is more
prone to show this than other animals: you'd have to get the SIGUSR1
while the target backend isn't idle, so a very slow machine ought to
show it more.

We don't seem to have that issue on the open items list, but I'll
go add it.

Not sure if the 'buffer pin conflict: stats show conflict on standby'
failure could trace to a similar cause.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
Hi,

On 2022-04-27 10:11:53 -0700, Mark Dilger wrote:
> > On Apr 27, 2022, at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > [ starting a new thread cuz the shared-stats one is way too long ]
> >
> > Andres Freund <andres@anarazel.de> writes:
> >> Add minimal tests for recovery conflict handling.
> >
> > It's been kind of hidden by other buildfarm noise, but
> > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].
> >
> > Three of those failures look like
>
> Interesting, I have been getting failures on REL_14_STABLE:
>
> t/012_subtransactions.pl ............. 11/12
> #   Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby'
> #   at t/012_subtransactions.pl line 211.
> #          got: '3'
> #     expected: '0'
> t/012_subtransactions.pl ............. 12/12 # Looks like you failed 1 test of 12.
> t/012_subtransactions.pl ............. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/12 subtests
>
> And the logs, tmp_check/log/regress_log_012_subtransactions, showing:

I'm a bit confused - what's the relation of that failure to this thread / the
tests / this commit?

Greetings,

Andres Freund
Hi,

On 2022-04-27 14:08:45 -0400, Tom Lane wrote:
> I wrote:
> > It's been kind of hidden by other buildfarm noise, but
> > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].
> > ...
> > I think this is showing us a real bug, ie we sometimes fail to cancel
> > the conflicting query.
>
> After digging around in the code, I think this is almost certainly
> some manifestation of the previously-complained-of problem [1] that
> RecoveryConflictInterrupt is not safe to call in a signal handler,
> leading the conflicting backend to sometimes decide that it's not
> the problem.

I think at least some may actually be because of
https://postgr.es/m/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
rather than RecoveryConflictInterrupt itself.

I'll go and finish up the comment bits that I still need to clean up in
my bugfix and commit that...

Greetings,

Andres Freund
> On Apr 27, 2022, at 6:26 PM, Andres Freund <andres@anarazel.de> wrote:
>
> I'm a bit confused - what's the relation of that failure to this thread / the
> tests / this commit?

None, upon further reflection.  It turned out to be unrelated.  Sorry
for the noise.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 28, 2022 at 5:50 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_012_1" does not exist
> [10:26:16.314](1.215s) not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby
> [10:26:16.314](0.000s)
> [10:26:16.314](0.000s) #   Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby'
> [10:26:16.314](0.000s) #   at t/012_subtransactions.pl line 208.
> [10:26:16.314](0.000s) #          got: '3'
> #     expected: '0'

FWIW I see that on my FBSD/clang system when I build with
-fsanitize=undefined -fno-sanitize-recover=all.  It's something to do
with our stack depth detection and tricks being used by -fsanitize,
because there's a stack depth exceeded message in the log.
I wrote:
>> It's been kind of hidden by other buildfarm noise, but
>> 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].

> After digging around in the code, I think this is almost certainly
> some manifestation of the previously-complained-of problem [1] that
> RecoveryConflictInterrupt is not safe to call in a signal handler,
> leading the conflicting backend to sometimes decide that it's not
> the problem.

I happened to notice that while skink continues to fail off-and-on
in 031_recovery_conflict.pl, the symptoms have changed!  What
we're getting now typically looks like [1]:

[10:45:11.475](0.023s) ok 14 - startup deadlock: lock acquisition is waiting
Waiting for replication conn standby's replay_lsn to pass 0/33FB8B0 on primary
done
timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 367.

where absolutely nothing happens in the standby log, until we time out:

2022-07-24 10:45:11.452 UTC [1468367][client backend][2/4:0] LOG:  statement: SELECT * FROM test_recovery_conflict_table2;
2022-07-24 10:45:11.472 UTC [1468547][client backend][3/2:0] LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2022-07-24 10:48:15.860 UTC [1468362][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
the startup process hasn't detected the buffer conflict in the first
place.

Don't know what to make of that, but I vaguely suspect a test timing
problem.  gull has shown this once as well, although at a different
step in the script [2].

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-07-24%2007%3A00%3A29
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2022-07-23%2009%3A34%3A54
Hi,

On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
> I happened to notice that while skink continues to fail off-and-on
> in 031_recovery_conflict.pl, the symptoms have changed!  What
> we're getting now typically looks like [1]:
>
> [10:45:11.475](0.023s) ok 14 - startup deadlock: lock acquisition is waiting
> Waiting for replication conn standby's replay_lsn to pass 0/33FB8B0 on primary
> done
> timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 367.
>
> where absolutely nothing happens in the standby log, until we time out:
>
> 2022-07-24 10:45:11.452 UTC [1468367][client backend][2/4:0] LOG:  statement: SELECT * FROM test_recovery_conflict_table2;
> 2022-07-24 10:45:11.472 UTC [1468547][client backend][3/2:0] LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
> 2022-07-24 10:48:15.860 UTC [1468362][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
>
> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
> the startup process hasn't detected the buffer conflict in the first
> place.

I wonder if this, at least partially, could be due to the elog thing I
was complaining about nearby.  I.e. we decide to FATAL as part of a
recovery conflict interrupt, and then during that ERROR out as part of
another recovery conflict interrupt (because nothing holds interrupts
as part of FATAL).

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
>> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
>> the startup process hasn't detected the buffer conflict in the first
>> place.

> I wonder if this, at least partially, could be due to the elog thing
> I was complaining about nearby.  I.e. we decide to FATAL as part of a
> recovery conflict interrupt, and then during that ERROR out as part of
> another recovery conflict interrupt (because nothing holds interrupts as
> part of FATAL).

There are all sorts of things one could imagine going wrong in the
backend receiving the recovery conflict interrupt, but AFAICS in these
failures, the startup process hasn't sent a recovery conflict interrupt.
It certainly hasn't logged anything suggesting it noticed a conflict.

            regards, tom lane
Hi,

On 2022-07-26 14:30:30 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
> >> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
> >> the startup process hasn't detected the buffer conflict in the first
> >> place.
>
> > I wonder if this, at least partially, could be due to the elog thing
> > I was complaining about nearby.  I.e. we decide to FATAL as part of a
> > recovery conflict interrupt, and then during that ERROR out as part of
> > another recovery conflict interrupt (because nothing holds interrupts as
> > part of FATAL).
>
> There are all sorts of things one could imagine going wrong in the
> backend receiving the recovery conflict interrupt, but AFAICS in these
> failures, the startup process hasn't sent a recovery conflict interrupt.
> It certainly hasn't logged anything suggesting it noticed a conflict.

I don't think we reliably emit a log message before the recovery
conflict is resolved.

I've wondered a couple times now about making tap test timeouts somehow
trigger a core dump of all processes.  Certainly would make it easier to
debug some of these kinds of issues.

Greetings,

Andres Freund
Hi,

On 2022-07-26 13:03:54 -0700, Andres Freund wrote:
> On 2022-07-26 14:30:30 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
> > >> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
> > >> the startup process hasn't detected the buffer conflict in the first
> > >> place.
> >
> > > I wonder if this, at least partially, could be due to the elog thing
> > > I was complaining about nearby.  I.e. we decide to FATAL as part of a
> > > recovery conflict interrupt, and then during that ERROR out as part of
> > > another recovery conflict interrupt (because nothing holds interrupts as
> > > part of FATAL).
> >
> > There are all sorts of things one could imagine going wrong in the
> > backend receiving the recovery conflict interrupt, but AFAICS in these
> > failures, the startup process hasn't sent a recovery conflict interrupt.
> > It certainly hasn't logged anything suggesting it noticed a conflict.
>
> I don't think we reliably emit a log message before the recovery
> conflict is resolved.

I played around trying to reproduce this kind of issue.

One way to quickly run into trouble on a slow system is that
ResolveRecoveryConflictWithVirtualXIDs() can end up sending signals more
frequently than the target can process them.  The next signal can arrive
by the time SIGUSR1 processing has finished, which, at least on Linux,
causes the queued signal to be processed immediately, without "normal"
postgres code gaining control.

The reason nothing might get logged in some cases is that
e.g. ResolveRecoveryConflictWithLock() tells
ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:

	/*
	 * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
	 * "waiting" in PS display by disabling its argument report_waiting
	 * because the caller, WaitOnLock(), has already reported that.
	 */

so ResolveRecoveryConflictWithLock() can end up looping indefinitely
without logging anything.

Another question I have about ResolveRecoveryConflictWithLock() is
whether it's ok that we don't check for deadlocks around the
ResolveRecoveryConflictWithVirtualXIDs() call?  It might be ok, because
we'd only block if there's a recovery conflict, in which case killing
the process ought to succeed?

I think there also might be a problem with the wait loop in ProcSleep()
wrt recovery conflicts: We rely on interrupts to be processed to throw
recovery conflict errors, but ProcSleep() is called in a bunch of places
with interrupts held.  An Assert(INTERRUPTS_CAN_BE_PROCESSED()) after
releasing the partition lock triggers a bunch.

It's possible that these aren't problematic cases for recovery
conflicts, because they're all around extension locks:

#2  0x0000562032f1968d in ExceptionalCondition (conditionName=0x56203310623a "INTERRUPTS_CAN_BE_PROCESSED()", errorType=0x562033105f6c "FailedAssertion", fileName=0x562033105f30 "/home/andres/src/postgresql/src/backend/storage/lmgr/proc.c", lineNumber=1208) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3  0x0000562032d50f41 in ProcSleep (locallock=0x562034cafaf0, lockMethodTable=0x562033281740 <default_lockmethod>) at /home/andres/src/postgresql/src/backend/storage/lmgr/proc.c:1208
#4  0x0000562032d3e2ce in WaitOnLock (locallock=0x562034cafaf0, owner=0x562034d12c58) at /home/andres/src/postgresql/src/backend/storage/lmgr/lock.c:1859
#5  0x0000562032d3cd0a in LockAcquireExtended (locktag=0x7ffc7b4d0810, lockmode=7, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0) at /home/andres/src/postgresql/src/backend/storage/lmgr/lock.c:1101
#6  0x0000562032d3c1c4 in LockAcquire (locktag=0x7ffc7b4d0810, lockmode=7, sessionLock=false, dontWait=false) at /home/andres/src/postgresql/src/backend/storage/lmgr/lock.c:752
#7  0x0000562032d3a696 in LockRelationForExtension (relation=0x7f54646b1dd8, lockmode=7) at /home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:439
#8  0x0000562032894276 in _bt_getbuf (rel=0x7f54646b1dd8, blkno=4294967295, access=2) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:975
#9  0x000056203288f1cb in _bt_split (rel=0x7f54646b1dd8, itup_key=0x562034ea7428, buf=770, cbuf=0, newitemoff=408, newitemsz=16, newitem=0x562034ea3fc8, orignewitem=0x0, nposting=0x0, postingoff=0) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtinsert.c:1715
#10 0x000056203288e4bb in _bt_insertonpg (rel=0x7f54646b1dd8, itup_key=0x562034ea7428, buf=770, cbuf=0, stack=0x562034ea1fb8, itup=0x562034ea3fc8, itemsz=16, newitemoff=408, postingoff=0, split_only_page=false) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtinsert.c:1212
#11 0x000056203288caf9 in _bt_doinsert (rel=0x7f54646b1dd8, itup=0x562034ea3fc8, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7f546823dde0) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtinsert.c:258
#12 0x000056203289851f in btinsert (rel=0x7f54646b1dd8, values=0x7ffc7b4d0c50, isnull=0x7ffc7b4d0c30, ht_ctid=0x562034dd083c, heapRel=0x7f546823dde0, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x562034ea71c0) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtree.c:200
#13 0x000056203288710b in index_insert (indexRelation=0x7f54646b1dd8, values=0x7ffc7b4d0c50, isnull=0x7ffc7b4d0c30, heap_t_ctid=0x562034dd083c, heapRelation=0x7f546823dde0, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x562034ea71c0) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:193
#14 0x000056203292e9da in CatalogIndexInsert (indstate=0x562034dd02b0, heapTuple=0x562034dd0838)

(gdb) p num_held_lwlocks
$14 = 1
(gdb) p held_lwlocks[0]
$15 = {lock = 0x7f1a0d18d2e4, mode = LW_EXCLUSIVE}
(gdb) p held_lwlocks[0].lock->tranche
$16 = 56
(gdb) p BuiltinTrancheNames[held_lwlocks[0].lock->tranche - NUM_INDIVIDUAL_LWLOCKS]
$17 = 0x558ce5710ede "BufferContent"

Independent of recovery conflicts, isn't it dangerous that we acquire
the relation extension lock with a buffer content lock held?  I *guess*
it might be ok because BufferAlloc(P_NEW) only acquires buffer content
locks in a conditional way.

Greetings,

Andres Freund
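The signal-pacing behaviour described in the message above can be seen outside Postgres with a small standalone C program.  This is purely an illustration of the OS-level mechanism, not PostgreSQL code; the 20 ms of "handler work", the 5 ms signalling interval, and the 100-signal storm are made-up numbers chosen only so the sender outruns the handler:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t handled = 0;

static void
sigusr1_handler(int signo)
{
	/* stand-in for interrupt processing that takes a while (20ms) */
	struct timespec work = {0, 20 * 1000 * 1000};

	(void) signo;
	nanosleep(&work, NULL);
	handled++;
}

static double
elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000.0 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

int
main(void)
{
	struct sigaction sa;
	struct timespec start, last, now;
	double		max_gap = 0.0;
	pid_t		child;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigusr1_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGUSR1, &sa, NULL);

	child = fork();
	if (child == 0)
	{
		/* sender role: signal the parent every 5ms, 100 times */
		struct timespec interval = {0, 5 * 1000 * 1000};

		for (int i = 0; i < 100; i++)
		{
			kill(getppid(), SIGUSR1);
			nanosleep(&interval, NULL);
		}
		_exit(0);
	}

	/* "mainline" role: watch for stretches during which we never run */
	clock_gettime(CLOCK_MONOTONIC, &start);
	last = start;
	for (;;)
	{
		double		gap;

		clock_gettime(CLOCK_MONOTONIC, &now);
		gap = elapsed_ms(&last, &now);
		if (gap > max_gap)
			max_gap = gap;
		last = now;
		if (elapsed_ms(&start, &now) > 1000.0)
			break;
	}

	waitpid(child, NULL, 0);
	printf("signals handled: %d of 100 sent, longest stretch without mainline control: %.0f ms\n",
		   (int) handled, max_gap);
	return 0;
}

On a Linux box this typically reports far fewer signals handled than were sent, and a single stretch of roughly half a second during which the main loop never ran: each return from the handler finds another SIGUSR1 already pending and delivers it immediately, which is the "without normal code gaining control" behaviour described above.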
On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
> The reason nothing might get logged in some cases is that
> e.g. ResolveRecoveryConflictWithLock() tells
> ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> 	/*
> 	 * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> 	 * "waiting" in PS display by disabling its argument report_waiting
> 	 * because the caller, WaitOnLock(), has already reported that.
> 	 */
>
> so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> logging anything.

I understand why we need to avoid adding "waiting" to the PS status
when we've already done that, but it doesn't seem like that should
imply skipping ereport() of log messages.

I think we could redesign the way the ps display works to make things
a whole lot simpler.  Let's have a function set_ps_display() and
another function set_ps_display_suffix().  What gets reported to the
OS is the concatenation of the two.  Calling set_ps_display()
implicitly resets the suffix to empty.

AFAICS, that'd let us get rid of this tricky logic, and some other
tricky logic as well.  Here, we'd just say set_ps_display_suffix("
waiting") and not worry about whether the caller might have already
done something similar.

> Another question I have about ResolveRecoveryConflictWithLock() is whether
> it's ok that we don't check for deadlocks around the
> ResolveRecoveryConflictWithVirtualXIDs() call?  It might be ok, because we'd
> only block if there's a recovery conflict, in which case killing the process
> ought to succeed?

The startup process is supposed to always "win" in any deadlock
situation, so I'm not sure what you think is a problem here.  We get
the conflicting lockers.  We kill them.  If they don't die, that's a
bug, but killing ourselves doesn't really help anything; if we die,
the whole system goes down, which seems undesirable.

> I think there also might be a problem with the wait loop in ProcSleep() wrt
> recovery conflicts: We rely on interrupts to be processed to throw recovery
> conflict errors, but ProcSleep() is called in a bunch of places with
> interrupts held.  An Assert(INTERRUPTS_CAN_BE_PROCESSED()) after releasing the
> partition lock triggers a bunch.  It's possible that these aren't problematic
> cases for recovery conflicts, because they're all around extension locks:
> [...]
> Independent of recovery conflicts, isn't it dangerous that we acquire the
> relation extension lock with a buffer content lock held?  I *guess* it might be
> ok because BufferAlloc(P_NEW) only acquires buffer content locks in a
> conditional way.

These things both seem a bit sketchy, but it's not 100% clear to me
that anything is actually broken.  Now it's also not clear to me that
nothing is broken ...

--
Robert Haas
EDB: http://www.enterprisedb.com
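A minimal sketch of the interface proposed in the message above, based only on what that message describes.  The function names come from the proposal; the buffer sizes, the printf standing in for the platform-specific title update, and the main() usage are illustrative assumptions, not the actual ps_status.c implementation:

#include <stdio.h>
#include <string.h>

static char ps_activity[128];	/* base string set by set_ps_display() */
static char ps_suffix[64];		/* optional suffix such as " waiting" */

static void
flush_ps_display(void)
{
	/* real code would rewrite argv[] / call setproctitle(); just print here */
	printf("ps title: \"%s%s\"\n", ps_activity, ps_suffix);
}

/* Set the base activity; any previously set suffix is discarded. */
static void
set_ps_display(const char *activity)
{
	snprintf(ps_activity, sizeof(ps_activity), "%s", activity);
	ps_suffix[0] = '\0';
	flush_ps_display();
}

/* Set or replace the suffix without touching the base activity. */
static void
set_ps_display_suffix(const char *suffix)
{
	snprintf(ps_suffix, sizeof(ps_suffix), "%s", suffix);
	flush_ps_display();
}

int
main(void)
{
	set_ps_display("SELECT");			/* ps title: "SELECT" */
	set_ps_display_suffix(" waiting");	/* ps title: "SELECT waiting" */
	set_ps_display_suffix(" waiting");	/* still "SELECT waiting", no duplication */
	set_ps_display("idle");				/* suffix implicitly cleared: "idle" */
	return 0;
}

The implicit reset in set_ps_display() is what would let a caller append " waiting" unconditionally without knowing whether an earlier caller already did, which is the simplification the proposal is after.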
Hi,

On 2022-08-03 16:33:46 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
> > The reason nothing might get logged in some cases is that
> > e.g. ResolveRecoveryConflictWithLock() tells
> > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> > 	/*
> > 	 * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> > 	 * "waiting" in PS display by disabling its argument report_waiting
> > 	 * because the caller, WaitOnLock(), has already reported that.
> > 	 */
> >
> > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> > logging anything.
>
> I understand why we need to avoid adding "waiting" to the PS status
> when we've already done that, but it doesn't seem like that should
> imply skipping ereport() of log messages.
>
> I think we could redesign the way the ps display works to make things
> a whole lot simpler.  Let's have a function set_ps_display() and
> another function set_ps_display_suffix().  What gets reported to the
> OS is the concatenation of the two.  Calling set_ps_display()
> implicitly resets the suffix to empty.
>
> AFAICS, that'd let us get rid of this tricky logic, and some other
> tricky logic as well.  Here, we'd just say set_ps_display_suffix("
> waiting") and not worry about whether the caller might have already
> done something similar.

That sounds like it'd be an improvement.

Of course we still need to fix that we can signal at a rate not
allowing the other side to handle the conflict, but at least that'd be
easier to identify...

> > Another question I have about ResolveRecoveryConflictWithLock() is whether
> > it's ok that we don't check for deadlocks around the
> > ResolveRecoveryConflictWithVirtualXIDs() call?  It might be ok, because we'd
> > only block if there's a recovery conflict, in which case killing the process
> > ought to succeed?
>
> The startup process is supposed to always "win" in any deadlock
> situation, so I'm not sure what you think is a problem here.  We get
> the conflicting lockers.  We kill them.  If they don't die, that's a
> bug, but killing ourselves doesn't really help anything; if we die,
> the whole system goes down, which seems undesirable.

The way deadlock timeout for the startup process works is that we wait
for it to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK
to the backends.  So it's not that the startup process would die.

The question is basically whether there are cases where
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but
PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't.  It seems plausible that there
isn't, but it's also not obvious enough that I'd fully trust it.

Greetings,

Andres Freund