Re: Assertion failure in WaitForWALToBecomeAvailable state machine - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date
Msg-id CALj2ACUoBWbaFo_t0gew+x6n0V+mpvB_23HLvsVD9abgCShV5A@mail.gmail.com
Whole thread Raw
In response to Assertion failure in WaitForWALToBecomeAvailable state machine  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Assertion failure in WaitForWALToBecomeAvailable state machine
List pgsql-hackers
On Fri, Feb 11, 2022 at 3:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Hi,
>
> The problem is that whenever we are going for streaming we always set
> XLogCtl->InstallXLogFileSegmentActive to true, but while switching
> from streaming to archive we do not always reset it so it hits
> assertion in some cases. Basically we reset it inside
> XLogShutdownWalRcv() but while switching from the streaming mode we
> only call it conditionally if WalRcvStreaming().  But it is very much
> possible that even before we call WalRcvStreaming() the walreceiver
> might have set alrcv->walRcvState to WALRCV_STOPPED.  So now
> WalRcvStreaming() will return false.  So I agree now we do not want to
> really shut down the walreceiver but who will reset the flag?
>
> I just ran some tests on primary and attached the walreceiver to gdb
> and waited for it to exit with timeout and then the recovery process
> hit the assertion.
>
> 2022-02-11 14:33:56.976 IST [60978] FATAL:  terminating walreceiver
> due to timeout
> cp: cannot stat
> ‘/home/dilipkumar/work/PG/install/bin/wal_archive/00000002.history’:
> No such file or directory
> 2022-02-11 14:33:57.002 IST [60973] LOG:  restored log file
> "000000010000000000000003" from archive
> TRAP: FailedAssertion("!XLogCtl->InstallXLogFileSegmentActive", File:
> "xlog.c", Line: 3823, PID: 60973)
>
> I have just applied a quick fix and that solved the issue, basically
> if the last failed source was streaming and the WalRcvStreaming() is
> false then just reset this flag.

IIUC, the issue can happen while the walreceiver failed to get WAL
from primary for whatever reasons and its status is not
WALRCV_STOPPING or WALRCV_STOPPED, and the startup process moved ahead
in WaitForWALToBecomeAvailable for reading from archive which ends up
in this assertion failure. ITSM, a rare scenario and it depends on
what walreceiver does between failure to get WAL from primary and
updating status to WALRCV_STOPPING or WALRCV_STOPPED.

If the above race condition is a serious problem, if one thinks at
least it is a problem at all, that needs to be fixed. I don't think
just making InstallXLogFileSegmentActive false is enough. By looking
at the comment [1], it doesn't make sense to move ahead for restoring
from the archive location without the WAL receiver fully stopped.
IMO, the real fix is to just remove WalRcvStreaming() and call
XLogShutdownWalRcv() unconditionally. Anyways, we have the
Assert(!WalRcvStreaming()); down below. I don't think it will create
any problem.

[1]
                    /*
                     * Before we leave XLOG_FROM_STREAM state, make sure that
                     * walreceiver is not active, so that it won't overwrite
                     * WAL that we restore from archive.
                     */
                    if (WalRcvStreaming())
                        XLogShutdownWalRcv();

Regards,
Bharath Rupireddy.



pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: Database-level collation version tracking
Next
From: Etsuro Fujita
Date:
Subject: Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit