Thread: Possible crash on standby
Hello.

While I was playing with a patch, I hit an assertion failure.

#2  0x0000000000b350e0 in ExceptionalCondition (
    conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()",
    errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c",
    lineNumber=4190) at assert.c:69
#3  0x0000000000586f9c in XLogFileRead (segno=61, emode=13, tli=1,
    source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190
#4  0x00000000005871d2 in XLogFileReadAnyTLI (segno=61, emode=13,
    source=XLOG_FROM_ANY) at xlogrecovery.c:4296
#5  0x000000000058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360,
    randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1,
    replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727

This is reproducible with the following steps.

1. Insert a sleep(1) in WaitForWALToBecomeAvailable().

>  		 * WAL that we restore from archive.
>  		 */
> +		sleep(1);
>  		if (WalRcvStreaming())
>  			XLogShutdownWalRcv();

2. Create a primary with archiving enabled.

3. Create a standby that recovers from the primary's archive and has an
   unconnectable primary_conninfo.

4. Start the primary.

5. Switch WAL on the primary.

6. Kaboom.

This is because WaitForWALToBecomeAvailable doesn't call
XLogShutdownWalRcv() when walreceiver has been stopped before we reach
the WalRcvStreaming() call cited above. But we need to set
InstallXLogFileSegmentActive to false even in that case, since no one
other than the startup process does that.

Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
need to correct the dependencies between the flag and walreceiver
state, but it is not mandatory because XLogShutdownWalRcv() is designed
so that it can be called even after walreceiver is stopped. I don't
have a clear memory of why we did it that way at the time, but the
recovery check runs successfully with this change.

This code was introduced in PG12.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
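The failure mode described in this message can be modeled outside PostgreSQL with a small standalone C sketch. Everything below is a simplified stand-in: the `Toy...` struct and `toy_...` functions are hypothetical names invented for illustration, not code from xlogrecovery.c, and the sketch assumes the flag is still set when walreceiver has already exited on its own. It only shows why a shutdown that is skipped when WalRcvStreaming() is false leaves the flag set, while an unconditional shutdown clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of state the report discusses. */
typedef struct ToyRecoveryState
{
    bool walrcv_streaming;        /* models the result of WalRcvStreaming() */
    bool install_segment_active;  /* models InstallXLogFileSegmentActive */
} ToyRecoveryState;

/* Models XLogShutdownWalRcv(): harmless if walreceiver is already stopped. */
void
toy_shutdown_walrcv(ToyRecoveryState *st)
{
    assert(st != NULL);
    st->walrcv_streaming = false;
    st->install_segment_active = false;
}

/* Behavior described in the report: shut down only while still streaming. */
void
toy_switch_to_archive_conditional(ToyRecoveryState *st)
{
    if (st->walrcv_streaming)
        toy_shutdown_walrcv(st);
}

/* Proposed fix: always call the shutdown routine. */
void
toy_switch_to_archive_unconditional(ToyRecoveryState *st)
{
    toy_shutdown_walrcv(st);
}

/*
 * Models the scenario from the reproduction steps: walreceiver has already
 * stopped, but the flag is (by assumption) still set.  Returns true when the
 * condition checked in XLogFileRead(), !IsInstallXLogFileSegmentActive(),
 * would hold afterwards.
 */
bool
toy_scenario_is_safe(bool unconditional)
{
    ToyRecoveryState st = {
        .walrcv_streaming = false,
        .install_segment_active = true
    };

    if (unconditional)
        toy_switch_to_archive_unconditional(&st);
    else
        toy_switch_to_archive_conditional(&st);

    return !st.install_segment_active;
}
```

Calling `toy_scenario_is_safe(false)` corresponds to the code path that trips the assertion, and `toy_scenario_is_safe(true)` to the path with the proposed change.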
On Fri, Sep 09, 2022 at 05:29:49PM +0900, Kyotaro Horiguchi wrote:
> This is because WaitForWALToBecomeAvailable doesn't call
> XLogShutdownWalRcv() when walreceiver has been stopped before we reach
> the WalRcvStreaming() call cited above. But we need to set
> InstallXLogFileSegmentActive to false even in that case, since no one
> other than the startup process does that.

Nice find.

> Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
> need to correct the dependencies between the flag and walreceiver
> state, but it is not mandatory because XLogShutdownWalRcv() is designed
> so that it can be called even after walreceiver is stopped. I don't
> have a clear memory of why we did it that way at the time, but the
> recovery check runs successfully with this change.

I suppose the alternative would be to set InstallXLogFileSegmentActive to
false in an 'else' block, but that doesn't seem necessary if
XLogShutdownWalRcv() is safe to call unconditionally. So, unless there is
a bigger problem that I'm not seeing, +1 for your patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
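For comparison, the 'else' alternative mentioned in this reply might look roughly like the following standalone sketch. The `ToyWalState` struct and the `toy_...` functions are hypothetical names used only for illustration; the real code manipulates shared state in PostgreSQL and is not reproduced here. The point is only that both branches end with the flag cleared, which is the same end state the unconditional call reaches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the walreceiver state and the flag. */
typedef struct ToyWalState
{
    bool walrcv_streaming;        /* models the result of WalRcvStreaming() */
    bool install_segment_active;  /* models InstallXLogFileSegmentActive */
} ToyWalState;

/* Models XLogShutdownWalRcv(): stops streaming and clears the flag. */
void
toy_shutdown(ToyWalState *st)
{
    assert(st != NULL);
    st->walrcv_streaming = false;
    st->install_segment_active = false;
}

/*
 * The 'else' variant: keep the conditional shutdown, but clear the flag
 * explicitly when walreceiver has already stopped.
 */
void
toy_switch_with_else(ToyWalState *st)
{
    assert(st != NULL);
    if (st->walrcv_streaming)
        toy_shutdown(st);
    else
        st->install_segment_active = false;
}
```

This variant needs a second place that clears the flag, which is the extra moving part the unconditional call avoids.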
On Fri, Sep 9, 2022 at 2:00 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> While I was playing with a patch, I hit an assertion failure.
>
> #2  0x0000000000b350e0 in ExceptionalCondition (
>     conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()",
>     errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c",
>     lineNumber=4190) at assert.c:69
> #3  0x0000000000586f9c in XLogFileRead (segno=61, emode=13, tli=1,
>     source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190
> #4  0x00000000005871d2 in XLogFileReadAnyTLI (segno=61, emode=13,
>     source=XLOG_FROM_ANY) at xlogrecovery.c:4296
> #5  0x000000000058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360,
>     randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1,
>     replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727
>
> This is reproducible with the following steps.
>
> 1. Insert a sleep(1) in WaitForWALToBecomeAvailable().
>
> >  		 * WAL that we restore from archive.
> >  		 */
> > +		sleep(1);
> >  		if (WalRcvStreaming())
> >  			XLogShutdownWalRcv();
>
> 2. Create a primary with archiving enabled.
>
> 3. Create a standby that recovers from the primary's archive and has an
>    unconnectable primary_conninfo.
>
> 4. Start the primary.
>
> 5. Switch WAL on the primary.
>
> 6. Kaboom.
>
> This is because WaitForWALToBecomeAvailable doesn't call
> XLogShutdownWalRcv() when walreceiver has been stopped before we reach
> the WalRcvStreaming() call cited above. But we need to set
> InstallXLogFileSegmentActive to false even in that case, since no one
> other than the startup process does that.
>
> Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
> need to correct the dependencies between the flag and walreceiver
> state, but it is not mandatory because XLogShutdownWalRcv() is designed
> so that it can be called even after walreceiver is stopped. I don't
> have a clear memory of why we did it that way at the time, but the
> recovery check runs successfully with this change.
>
> This code was introduced in PG12.

I think this is a duplicate of [1]. I have tested the above use case
with the patch at [1], and it fixes the issue.

[1] https://www.postgresql.org/message-id/CALj2ACXPn_xePphnh88qmoQqqW%2BE2KEOdxGL%2BD-o9o7_XNGkkw%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 09, 2022 at 10:51:10PM +0530, Bharath Rupireddy wrote:
> I think it is a duplicate of [1]. I have tested the above use-case
> with the patch at [1] and it fixes the issue.

I added this thread to the existing commitfest entry. Thanks for pointing
this out.

https://commitfest.postgresql.org/39/3814

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com