Re: Unstable tests for recovery conflict handling - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Unstable tests for recovery conflict handling
Date
Msg-id CA+TgmoaPNnKCd-01pCuJ6Jm2xEHReq3c-yyD=jS9SgMgj1WPOw@mail.gmail.com
Whole thread Raw
In response to Unstable tests for recovery conflict handling  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unstable tests for recovery conflict handling
List pgsql-hackers
On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
> The reason nothing might get logged in some cases is that
> e.g. ResolveRecoveryConflictWithLock() tells
> ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
>                 /*
>                  * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
>                  * "waiting" in PS display by disabling its argument report_waiting
>                  * because the caller, WaitOnLock(), has already reported that.
>                  */
>
> so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> logging anything.

I understand why we need to avoid adding "waiting" to the PS status
when we've already done that, but it doesn't seem like that should
imply skipping ereport() of log messages.

I think we could redesign the way the ps display works to make things
a whole lot simpler. Let's have a function set_ps_display() and
another function set_ps_display_suffix(). What gets reported to the OS
is the concatenation of the two. Calling set_ps_display() implicitly
resets the suffix to empty.

AFAICS, that'd let us get rid of this tricky logic, and some other
tricky logic as well. Here, we'd just say set_ps_display_suffix("
waiting") and not worry about whether the caller might have already
done something similar.

> Another question I have about ResolveRecoveryConflictWithLock() is whether
> it's ok that we don't check deadlocks around the
> ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> only block if there's a recovery conflict, in which killing the process ought
> to succeed?

The startup process is supposed to always "win" in any deadlock
situation, so I'm not sure what you think is a problem here. We get
the conflicting lockers. We kill them. If they don't die, that's a
bug, but killing ourselves doesn't really help anything; if we die,
the whole system goes down, which seems undesirable.

> I think there's also might be a problem with the wait loop in ProcSleep() wrt
> recovery conflicts: We rely on interrupts to be processed to throw recovery
> conflict errors, but ProcSleep() is called in a bunch of places with
> interrupts held. An Assert(INTERRUPTS_CAN_BE_PROCESSED()) after releasing the
> partition lock triggers a bunch.  It's possible that these aren't problematic
> cases for recovery conflicts, because they're all around extension locks:
> [...]
> Independent of recovery conflicts, isn't it dangerous that we acquire the
> relation extension lock with a buffer content lock held? I *guess* it might be
> ok because BufferAlloc(P_NEW) only acquires buffer content locks in a
> conditional way.

These things both seem a bit sketchy but it's not 100% clear to me
that anything is actually broken. Now it's also not clear to me that
nothing is broken ...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: pg15b2: large objects lost on upgrade
Next
From: Tom Lane
Date:
Subject: Re: pg15b2: large objects lost on upgrade