Re: Unstable tests for recovery conflict handling - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Unstable tests for recovery conflict handling
Date
Msg-id 20220803220756.be7lljtuakaenjzu@awork3.anarazel.de
Whole thread Raw
In response to Re: Unstable tests for recovery conflict handling  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2022-08-03 16:33:46 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
> > The reason nothing might get logged in some cases is that
> > e.g. ResolveRecoveryConflictWithLock() tells
> > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> >                 /*
> >                  * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> >                  * "waiting" in PS display by disabling its argument report_waiting
> >                  * because the caller, WaitOnLock(), has already reported that.
> >                  */
> >
> > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> > logging anything.
> 
> I understand why we need to avoid adding "waiting" to the PS status
> when we've already done that, but it doesn't seem like that should
> imply skipping ereport() of log messages.
> 
> I think we could redesign the way the ps display works to make things
> a whole lot simpler. Let's have a function set_ps_display() and
> another function set_ps_display_suffix(). What gets reported to the OS
> is the concatenation of the two. Calling set_ps_display() implicitly
> resets the suffix to empty.
> 
> AFAICS, that'd let us get rid of this tricky logic, and some other
> tricky logic as well. Here, we'd just say set_ps_display_suffix("
> waiting") and not worry about whether the caller might have already
> done something similar.

That sounds like it'd be an improvement.  Of course we still need to fix that
we can signal at a rate not allowing the other side to handle the conflict,
but at least that'd be easier to identify...


> > Another question I have about ResolveRecoveryConflictWithLock() is whether
> > it's ok that we don't check deadlocks around the
> > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> > only block if there's a recovery conflict, in which killing the process ought
> > to succeed?
> 
> The startup process is supposed to always "win" in any deadlock
> situation, so I'm not sure what you think is a problem here. We get
> the conflicting lockers. We kill them. If they don't die, that's a
> bug, but killing ourselves doesn't really help anything; if we die,
> the whole system goes down, which seems undesirable.

The way deadlock timeout for the startup process works is that we wait for it
to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK to the
backends. So it's not that the startup process would die.

The question is basically whether there are cases were
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but
PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't. It seems plausible that there isn't,
but it's also not obvious enough that I'd fully trust it.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: postgres_fdw: batch inserts vs. before row triggers
Next
From: Matthias van de Meent
Date:
Subject: Re: postgres_fdw: batch inserts vs. before row triggers