Re: Assertion failure in WaitForWALToBecomeAvailable state machine - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date
Msg-id 20220913032256.GB1438180@rfd.leadboat.com
Whole thread Raw
In response to Re: Assertion failure in WaitForWALToBecomeAvailable state machine  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Re: Assertion failure in WaitForWALToBecomeAvailable state machine
List pgsql-hackers
On Mon, Sep 12, 2022 at 05:58:08PM +0530, Bharath Rupireddy wrote:
> On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote:
> > > Just for the records - there's another report of the assertion failure
> > > at [1], many thanks to Kyotaro-san for providing consistent
> > > reproducible steps.
> > >
> > > [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com

I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
the code location being changed.  Today, in the unlikely event of
!WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
proceeds without waiting for WALRCV_STOPPED.  I think WALRCV_STOPPING can't
happen there.  Even if WALRCV_WAITING also can't happen, it's a simplicity win
to remove the need to think about that.  Dilip Kumar and Nathan Bossart also
stated support for that design.

If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
perhaps should back-patch the change to released versions.  Does anyone know
whether one or both can happen?

> > FWIW, I
> > find better the approach taken by Horiguchi-san in [1] to reset the
> > state before we attempt to read WAL from the archives *or* pg_wal,
> > after we know that the last source has failed, where we know that we
> > are not streaming yet (but recovery may be requested soon).
> >
> > [1]: https://www.postgresql.org/message-id/20220214.171428.735280610520122187.horikyota.ntt@gmail.com
> 
> I think the fix at [1] is wrong. It basically resets
> InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and
> XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation
> and recycling just for archive. Moreover, if we were to reset just for
> archive there, it's too aggressive, meaning for every WAL record, we
> take ControlFileLock in exclusive mode to reset it.
> 
> IMO, doing it once when we switch the source to archive is the correct
> way because it avoids frequent ControlFileLock acquisitions and makes
> the fix more intuitive. And we have a comment saying why we reset
> InstallXLogFileSegmentActive, if required we can point to the commit
> cc2c7d6 in the comments.

The startup process maintains the invariant that archive recovery and
streaming recovery don't overlap in time; one stops before the other starts.
As long as it achieves that, I'm not aware of a cause to care whether the flag
reset happens when streaming ends vs. when archive recovery begins.  If the
startup process develops a defect such that it ceases to maintain that
invariant, "when archive recovery begins" does have the
advantage-or-disadvantage of causing a failure in non-assert builds.  The
other way gets only an assertion failure.  Michael Paquier or Bharath
Rupireddy, did you envision benefits other than that one?



pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: cataloguing NOT NULL constraints
Next
From: Zhang Mingli
Date:
Subject: Removed unused param isSlice of function transformAssignmentSubscripts