Re: Assertion failure in WaitForWALToBecomeAvailable state machine - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Assertion failure in WaitForWALToBecomeAvailable state machine |
Date | |
Msg-id | 20220913032256.GB1438180@rfd.leadboat.com Whole thread Raw |
In response to | Re: Assertion failure in WaitForWALToBecomeAvailable state machine (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Re: Assertion failure in WaitForWALToBecomeAvailable state machine |
List | pgsql-hackers |
On Mon, Sep 12, 2022 at 05:58:08PM +0530, Bharath Rupireddy wrote: > On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote: > > > Just for the records - there's another report of the assertion failure > > > at [1], many thanks to Kyotaro-san for providing consistent > > > reproducible steps. > > > > > > [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com I plan to use that message's patch, because it guarantees WALRCV_STOPPED at the code location being changed. Today, in the unlikely event of !WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code proceeds without waiting for WALRCV_STOPPED. I think WALRCV_STOPPING can't happen there. Even if WALRCV_WAITING also can't happen, it's a simplicity win to remove the need to think about that. Dilip Kumar and Nathan Bossart also stated support for that design. If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I perhaps should back-patch the change to released versions. Does anyone know whether one or both can happen? > > FWIW, I > > find better the approach taken by Horiguchi-san in [1] to reset the > > state before we attempt to read WAL from the archives *or* pg_wal, > > after we know that the last source has failed, where we know that we > > are not streaming yet (but recovery may be requested soon). > > > > [1]: https://www.postgresql.org/message-id/20220214.171428.735280610520122187.horikyota.ntt@gmail.com > > I think the fix at [1] is wrong. It basically resets > InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and > XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation > and recycling just for archive. Moreover, if we were to reset just for > archive there, it's too aggressive, meaning for every WAL record, we > take ControlFileLock in exclusive mode to reset it. > > IMO, doing it once when we switch the source to archive is the correct > way because it avoids frequent ControlFileLock acquisitions and makes > the fix more intuitive. And we have a comment saying why we reset > InstallXLogFileSegmentActive, if required we can point to the commit > cc2c7d6 in the comments. The startup process maintains the invariant that archive recovery and streaming recovery don't overlap in time; one stops before the other starts. As long as it achieves that, I'm not aware of a cause to care whether the flag reset happens when streaming ends vs. when archive recovery begins. If the startup process develops a defect such that it ceases to maintain that invariant, "when archive recovery begins" does have the advantage-or-disadvantage of causing a failure in non-assert builds. The other way gets only an assertion failure. Michael Paquier or Bharath Rupireddy, did you envision benefits other than that one?
pgsql-hackers by date: