Re: Assertion failure in WaitForWALToBecomeAvailable state machine - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date
Msg-id YyACvP++zgDphlcm@paquier.xyz
Whole thread Raw
In response to Re: Assertion failure in WaitForWALToBecomeAvailable state machine  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Mon, Sep 12, 2022 at 08:22:56PM -0700, Noah Misch wrote:
> I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
> the code location being changed.  Today, in the unlikely event of
> !WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
> proceeds without waiting for WALRCV_STOPPED.  I think WALRCV_STOPPING can't
> happen there.  Even if WALRCV_WAITING also can't happen, it's a simplicity win
> to remove the need to think about that.  Dilip Kumar and Nathan Bossart also
> stated support for that design.

I did not notice this one.  Sounds pretty much right.

> If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
> perhaps should back-patch the change to released versions.  Does anyone know
> whether one or both can happen?

Hmm.  I agree that this could be a good thing in the long-term.

> The startup process maintains the invariant that archive recovery and
> streaming recovery don't overlap in time; one stops before the other starts.
> As long as it achieves that, I'm not aware of a cause to care whether the flag
> reset happens when streaming ends vs. when archive recovery begins.  If the
> startup process develops a defect such that it ceases to maintain that
> invariant, "when archive recovery begins" does have the
> advantage-or-disadvantage of causing a failure in non-assert builds.  The
> other way gets only an assertion failure.  Michael Paquier or Bharath
> Rupireddy, did you envision benefits other than that one?

I think that you are forgetting one case here though: a standby where
standby.signal is set without restore_command and with
primary_conninfo.  It could move on with recovery even if the WAL
receiver is unstable when something external pushes more WAL segments
to the standby's pg_wal/.  So my point of upthread is that we should
make sure that standby.signal+primary_conninfo resets the flag state
when the WAL receiver stops streaming in this case as well.

The proposed v1-0001-Do-not-skip-calling-XLogShutdownWalRcv.patch
would achieve that, because it does not change the state of
InstallXLogFileSegmentActive depending on the source being
XLOG_FROM_ARCHIVE.  So I'm fine with what you want to do.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Zhang Mingli
Date:
Subject: Removed unused param isSlice of function transformAssignmentSubscripts
Next
From: Michael Paquier
Date:
Subject: Re: Background writer and checkpointer in crash recovery