Re: Assertion failure in WaitForWALToBecomeAvailable state machine - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date
Msg-id CALj2ACXPn_xePphnh88qmoQqqW+E2KEOdxGL+D-o9o7_XNGkkw@mail.gmail.com
Whole thread Raw
In response to Re: Assertion failure in WaitForWALToBecomeAvailable state machine  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Assertion failure in WaitForWALToBecomeAvailable state machine
List pgsql-hackers
On Mon, Feb 14, 2022 at 1:44 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Fri, 11 Feb 2022 22:25:49 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > I don't think
> > > > just making InstallXLogFileSegmentActive false is enough. By looking
> > > > at the comment [1], it doesn't make sense to move ahead for restoring
> > > > from the archive location without the WAL receiver fully stopped.
> > > > IMO, the real fix is to just remove WalRcvStreaming() and call
> > > > XLogShutdownWalRcv() unconditionally. Anyways, we have the
> > > > Assert(!WalRcvStreaming()); down below. I don't think it will create
> > > > any problem.
> > >
> > > If WalRcvStreaming() is returning false that means walreceiver is
> > > already stopped so we don't need to shutdown it externally.  I think
> > > like we are setting this flag outside start streaming we can reset it
> > > also outside XLogShutdownWalRcv.  Or I am fine even if we call
> > > XLogShutdownWalRcv() because if walreceiver is stopped it will just
> > > reset the flag we want it to reset and it will do nothing else.
> >
> > As I said, I'm okay with just calling XLogShutdownWalRcv()
> > unconditionally as it just returns in case walreceiver has already
> > stopped and updates the InstallXLogFileSegmentActive flag to false.
> >
> > Let's also hear what other hackers have to say about this.
>
> Firstly, good catch:)  And the direction seems right.
>
> It seems like an overlook of cc2c7d65fc. We cannot install new wal
> segments only while we're in archive recovery.  Conversely, we must
> turn off it when entering archive recovery (not exiting streaming
> recovery).  So, *I* feel like to do that at the beginning of
> XLOG_FROM_ARCHIVE/PG_WAL rather than the end of XLOG_FROM_STREAM.
>
> (And I would like to remove XLogShutDownWalRcv() and turn off the flag
>  in StartupXLOG explicitly, but it would be overdone.)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -12800,6 +12800,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>                                  */
>                                 Assert(!WalRcvStreaming());
>
> +                               /*
> +                                * WAL segment installation conflicts with archive
> +                                * recovery. Make sure it is turned off.  XLogShutdownWalRcv()
> +                                * does that but it is not done when the process has voluntary
> +                                * exited for example for replication timeout.
> +                                */
> +                               LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +                               XLogCtl->InstallXLogFileSegmentActive = false;
> +                               LWLockRelease(ControlFileLock);
> +
>                                 /* Close any old file we might have open. */
>                                 if (readFile >= 0)

Today I encountered the assertion failure [2] twice while working on
another patch [1]. The pattern seems to be that the walreceiver got
killed or crashed and set it's state to WALRCV_STOPPING or
WALRCV_STOPPED by the team the WAL state machine moves to archive and
hence the following XLogShutdownWalRcv() code will not get hit:

                    /*
                     * Before we leave XLOG_FROM_STREAM state, make sure that
                     * walreceiver is not active, so that it won't overwrite
                     * WAL that we restore from archive.
                     */
                    if (WalRcvStreaming())
                        ShutdownWalRcv();

I agree with Kyotaro-san to reset InstallXLogFileSegmentActive before
entering into the archive mode. Hence I tweaked the code introduced by
the following commit a bit, the result v1 patch is attached herewith.
Please review it.

commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
Author: Noah Misch <noah@leadboat.com>
Date:   Mon Jun 28 18:34:56 2021 -0700

    Skip WAL recycling and preallocation during archive recovery.

[1] https://www.postgresql.org/message-id/CALj2ACUYz1z6QPduGn5gguCkfd-ko44j4hKcOMtp6fzv9xEWgw%40mail.gmail.com
[2]
2022-08-11 05:10:29.051 UTC [1487086] FATAL:  terminating walreceiver
due to timeout
cp: cannot stat '/home/ubuntu/archived_wal/00000002.history': No such
file or directory
2022-08-11 05:10:29.054 UTC [1487075] LOG:  switched WAL source from
stream to archive after failure
2022-08-11 05:10:29.067 UTC [1487075] LOG:  restored log file
"000000010000000000000034" from archive
TRAP: FailedAssertion("!IsInstallXLogFileSegmentActive()", File:
"xlogrecovery.c", Line: 4149, PID: 1487075)
postgres: startup waiting for
000000010000000000000034(ExceptionalCondition+0xd6)[0x559d20e6deb1]
postgres: startup waiting for
000000010000000000000034(+0x1e4e14)[0x559d20813e14]
postgres: startup waiting for
000000010000000000000034(+0x1e50c0)[0x559d208140c0]
postgres: startup waiting for
000000010000000000000034(+0x1e4354)[0x559d20813354]
postgres: startup waiting for
000000010000000000000034(+0x1e3946)[0x559d20812946]
postgres: startup waiting for
000000010000000000000034(+0x1dbc58)[0x559d2080ac58]
postgres: startup waiting for
000000010000000000000034(+0x1db3ce)[0x559d2080a3ce]
postgres: startup waiting for
000000010000000000000034(XLogReadAhead+0x3d)[0x559d2080aa23]
postgres: startup waiting for
000000010000000000000034(+0x1d9713)[0x559d20808713]
postgres: startup waiting for
000000010000000000000034(+0x1d909c)[0x559d2080809c]
postgres: startup waiting for
000000010000000000000034(XLogPrefetcherReadRecord+0xf9)[0x559d208092a2]
postgres: startup waiting for
000000010000000000000034(+0x1e3427)[0x559d20812427]
postgres: startup waiting for
000000010000000000000034(PerformWalRecovery+0x465)[0x559d2080fd0f]
postgres: startup waiting for
000000010000000000000034(StartupXLOG+0x987)[0x559d207fc732]
postgres: startup waiting for
000000010000000000000034(StartupProcessMain+0xf8)[0x559d20bc3990]
postgres: startup waiting for
000000010000000000000034(AuxiliaryProcessMain+0x1ae)[0x559d20bb62cf]
postgres: startup waiting for
000000010000000000000034(+0x593020)[0x559d20bc2020]
postgres: startup waiting for
000000010000000000000034(+0x591ebb)[0x559d20bc0ebb]
postgres: startup waiting for
000000010000000000000034(+0x590907)[0x559d20bbf907]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe518b26520]
/lib/x86_64-linux-gnu/libc.so.6(__select+0xbd)[0x7fe518bff74d]
postgres: startup waiting for
000000010000000000000034(+0x58e1bd)[0x559d20bbd1bd]
postgres: startup waiting for
000000010000000000000034(PostmasterMain+0x14cc)[0x559d20bbcaa2]
postgres: startup waiting for
000000010000000000000034(+0x44a752)[0x559d20a79752]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe518b0dd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe518b0de40]
postgres: startup waiting for
000000010000000000000034(_start+0x25)[0x559d206fe5b5]
2022-08-11 05:10:29.161 UTC [1486413] LOG:  startup process (PID
1487075) was terminated by signal 6: Aborted
2022-08-11 05:10:29.161 UTC [1486413] LOG:  terminating any other
active server processes
2022-08-11 05:10:29.169 UTC [1486413] LOG:  shutting down due to
startup process failure
2022-08-11 05:10:29.181 UTC [1486413] LOG:  database system is shut down

2022-08-11 05:34:54.234 UTC [1509079] LOG:  switched WAL source from
stream to archive after failure
2022-08-11 05:34:54.234 UTC [1509079] LOG:  current WAL source is
archive and last WAL source is stream, standby mode 1
2022-08-11 05:34:54.248 UTC [1509079] LOG:  restored log file
"000000010000000000000003" from archive
TRAP: FailedAssertion("!IsInstallXLogFileSegmentActive()", File:
"xlogrecovery.c", Line: 4155, PID: 1509079)
postgres: startup waiting for
000000010000000000000003(ExceptionalCondition+0xd6)[0x55e9cb1cbf2a]
postgres: startup waiting for
000000010000000000000003(+0x1e4e8d)[0x55e9cab71e8d]
postgres: startup waiting for
000000010000000000000003(+0x1e5139)[0x55e9cab72139]
postgres: startup waiting for
000000010000000000000003(+0x1e43cd)[0x55e9cab713cd]
postgres: startup waiting for
000000010000000000000003(+0x1e3946)[0x55e9cab70946]
postgres: startup waiting for
000000010000000000000003(+0x1dbc58)[0x55e9cab68c58]
postgres: startup waiting for
000000010000000000000003(+0x1db0b7)[0x55e9cab680b7]
postgres: startup waiting for
000000010000000000000003(XLogReadAhead+0x3d)[0x55e9cab68a23]
postgres: startup waiting for
000000010000000000000003(+0x1d9713)[0x55e9cab66713]
postgres: startup waiting for
000000010000000000000003(+0x1d909c)[0x55e9cab6609c]
postgres: startup waiting for
000000010000000000000003(XLogPrefetcherReadRecord+0xf9)[0x55e9cab672a2]
postgres: startup waiting for
000000010000000000000003(+0x1e3427)[0x55e9cab70427]
postgres: startup waiting for
000000010000000000000003(+0x1e4911)[0x55e9cab71911]
postgres: startup waiting for
000000010000000000000003(InitWalRecovery+0xaf4)[0x55e9cab6bc4e]
postgres: startup waiting for
000000010000000000000003(StartupXLOG+0x4af)[0x55e9cab5a25a]
postgres: startup waiting for
000000010000000000000003(StartupProcessMain+0xf8)[0x55e9caf21a09]
postgres: startup waiting for
000000010000000000000003(AuxiliaryProcessMain+0x1ae)[0x55e9caf14348]
postgres: startup waiting for
000000010000000000000003(+0x593099)[0x55e9caf20099]
postgres: startup waiting for
000000010000000000000003(PostmasterMain+0x1476)[0x55e9caf1aac5]
postgres: startup waiting for
000000010000000000000003(+0x44a7cb)[0x55e9cadd77cb]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f754c287d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f754c287e40]
postgres: startup waiting for
000000010000000000000003(_start+0x25)[0x55e9caa5c5b5]
2022-08-11 05:34:54.336 UTC [1509076] LOG:  startup process (PID
1509079) was terminated by signal 6: Aborted
2022-08-11 05:34:54.336 UTC [1509076] LOG:  aborting startup due to
startup process failure
2022-08-11 05:34:54.337 UTC [1509076] LOG:  database system is shut down

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Use SetInstallXLogFileSegmentActive() for setting XLogCtl->InstallXLogFileSegmentActive
Next
From: vignesh C
Date:
Subject: Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock