Re: Unstable tests for recovery conflict handling - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Unstable tests for recovery conflict handling |
Date | |
Msg-id | CA+TgmoaPNnKCd-01pCuJ6Jm2xEHReq3c-yyD=jS9SgMgj1WPOw@mail.gmail.com Whole thread Raw |
In response to | Unstable tests for recovery conflict handling (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Unstable tests for recovery conflict handling
|
List | pgsql-hackers |
On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote: > The reason nothing might get logged in some cases is that > e.g. ResolveRecoveryConflictWithLock() tells > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting: > /* > * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting > * "waiting" in PS display by disabling its argument report_waiting > * because the caller, WaitOnLock(), has already reported that. > */ > > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without > logging anything. I understand why we need to avoid adding "waiting" to the PS status when we've already done that, but it doesn't seem like that should imply skipping ereport() of log messages. I think we could redesign the way the ps display works to make things a whole lot simpler. Let's have a function set_ps_display() and another function set_ps_display_suffix(). What gets reported to the OS is the concatenation of the two. Calling set_ps_display() implicitly resets the suffix to empty. AFAICS, that'd let us get rid of this tricky logic, and some other tricky logic as well. Here, we'd just say set_ps_display_suffix(" waiting") and not worry about whether the caller might have already done something similar. > Another question I have about ResolveRecoveryConflictWithLock() is whether > it's ok that we don't check deadlocks around the > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd > only block if there's a recovery conflict, in which killing the process ought > to succeed? The startup process is supposed to always "win" in any deadlock situation, so I'm not sure what you think is a problem here. We get the conflicting lockers. We kill them. If they don't die, that's a bug, but killing ourselves doesn't really help anything; if we die, the whole system goes down, which seems undesirable. > I think there's also might be a problem with the wait loop in ProcSleep() wrt > recovery conflicts: We rely on interrupts to be processed to throw recovery > conflict errors, but ProcSleep() is called in a bunch of places with > interrupts held. An Assert(INTERRUPTS_CAN_BE_PROCESSED()) after releasing the > partition lock triggers a bunch. It's possible that these aren't problematic > cases for recovery conflicts, because they're all around extension locks: > [...] > Independent of recovery conflicts, isn't it dangerous that we acquire the > relation extension lock with a buffer content lock held? I *guess* it might be > ok because BufferAlloc(P_NEW) only acquires buffer content locks in a > conditional way. These things both seem a bit sketchy but it's not 100% clear to me that anything is actually broken. Now it's also not clear to me that nothing is broken ... -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: