Thread: Walsender may fail to send wal to the end.
Hello, I happened to see a doubious behavior of walsender. On a replication set with wal_keep_size/(segments) = 0, running the following command on the primary causes walsender to fail to send up to the final shutdown checkpoint record to the standby. (create table t in advance) psql -c 'insert into t values(0); select pg_switch_wal();'; pg_ctl stop The primary complains like this: 2021-03-26 17:59:29.324 JST [checkpointer][140697] LOG: shutting down 2021-03-26 17:59:29.387 JST [walsender][140816] ERROR: requested WAL segment 000000010000000000000032 has already been removed 2021-03-26 17:59:29.387 JST [walsender][140816] STATEMENT: START_REPLICATION 0/32000000 TIMELINE 1 2021-03-26 17:59:29.394 JST [postmaster][140695] LOG: database system is shut down This is because XLogSendPhysical detects removal of the wal segment currently reading by shutdown checkpoint. However, there' no fear of overwriting of WAL segments at the time. So I think we can omit the call to CheckXLogRemoved() while MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after the shutdown checkpoint completes. Of course that doesn't help if walsender was running two segments behind. There still could be a small window for the failure. But it's a great help to save the case of just 1 segment behind. Is it worth fixing? regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 23baa4498a..4b1e0cf9c5 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2755,9 +2755,18 @@ retry: &errinfo)) WALReadRaiseError(&errinfo); - /* See logical_read_xlog_page(). */ - XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize); - CheckXLogRemoved(segno, xlogreader->seg.ws_tli); + /* + * See logical_read_xlog_page(). However, there's a case where we're + * reading the segment which is removed/recycled by shutdown checkpoint. + * We continue to read it in that case because 1) It's safe because no wal + * activity happens after shutdown checkpoint completes, 2) We need to do + * our best to send WAL up to the shutdown checkpoint record. + */ + if (MyWalSnd->state < WALSNDSTATE_STOPPING) + { + XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize); + CheckXLogRemoved(segno, xlogreader->seg.ws_tli); + } /* * During recovery, the currently-open WAL file might be replaced with the
Hi, On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote: > This is because XLogSendPhysical detects removal of the wal segment > currently reading by shutdown checkpoint. However, there' no fear of > overwriting of WAL segments at the time. > > So I think we can omit the call to CheckXLogRemoved() while > MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after > the shutdown checkpoint completes. > > Of course that doesn't help if walsender was running two segments > behind. There still could be a small window for the failure. But it's > a great help to save the case of just 1 segment behind. -1. This seems like a bandaid to make a broken configuration work a tiny bit better, without actually being meaningfully better. Greetings, Andres Freund
At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote: > > Hi, > > > > On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote: > > > This is because XLogSendPhysical detects removal of the wal segment > > > currently reading by shutdown checkpoint. However, there' no fear of > > > overwriting of WAL segments at the time. > > > > > > So I think we can omit the call to CheckXLogRemoved() while > > > MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after > > > the shutdown checkpoint completes. > > > > > > Of course that doesn't help if walsender was running two segments > > > behind. There still could be a small window for the failure. But it's > > > a great help to save the case of just 1 segment behind. > > > > -1. This seems like a bandaid to make a broken configuration work a tiny > > bit better, without actually being meaningfully better. > > Agreed. Still, wouldn't it be better to avoid such configurations and > protect a bit things with a check on the new value? The repro was a bit artificial but the symptom happened without pg_switch_wal() and no load. It caused just by shutting down of primary. If it is normal behavior for walsenders to fail to send the last shutdown record to standby while fast shutdown, we should refuse to startup at least wal sender if wal_keep_size = 0. I can guess two ways to do that. 1. refuse to start server if wal_keep_size = 0 when max_wal_senders > 0. 2. refuse to start wal sender if wal_keep_size= 0. 2 looks like broken. 1 is somewhat annoying.. However, since max_wal_senders already premises wal_level > minimal, we can accept that restriction? <start serer> FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical" <Mmm. wal_level, fixed, then retry starting server> FATAL: WAL streaming (max_wal_senders > 0) requires wal_keep_size to be at least 1MB <Oops!> Of couse we can list all incompatible parameters at once. FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical" and wal_keep_size to be at least 1MB regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 29 Mar 2021 11:41:32 -0400, Stephen Frost <sfrost@snowman.net> wrote in > Greetings, > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in > > > On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote: > > > > On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote: > > > > > This is because XLogSendPhysical detects removal of the wal segment > > > > > currently reading by shutdown checkpoint. However, there' no fear of > > > > > overwriting of WAL segments at the time. > > > > > > > > > > So I think we can omit the call to CheckXLogRemoved() while > > > > > MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after > > > > > the shutdown checkpoint completes. > > > > > > > > > > Of course that doesn't help if walsender was running two segments > > > > > behind. There still could be a small window for the failure. But it's > > > > > a great help to save the case of just 1 segment behind. > > > > > > > > -1. This seems like a bandaid to make a broken configuration work a tiny > > > > bit better, without actually being meaningfully better. > > > > > > Agreed. Still, wouldn't it be better to avoid such configurations and > > > protect a bit things with a check on the new value? > > I have a hard time agreeing that this is somehow a 'broken' > configuration, instead it looks like a race condition that wasn't > considered and should be addressed. If there's zero lag then we really > should allow the final WAL to get sent to the replica. My unstated point was switching primary/secondary roles in a replication set where both host have separate archives, by the steps "fast shutdown primary"->"promote standby"->"attach the old primary as new standby", wihtout a need of synchronizing old primary's archive to that of the new standby before starting the new standby. I thought that should work even if wal_keep_size = 0. > > The repro was a bit artificial but the symptom happened without > > pg_switch_wal() and no load. It caused just by shutting down of > > primary. If it is normal behavior for walsenders to fail to send the > > last shutdown record to standby while fast shutdown, we should refuse > > to startup at least wal sender if wal_keep_size = 0. > > > > I can guess two ways to do that. > > Both of which will break things for people, so this certainly isn't a > great approach, and besides, if archiving is happening with > archive_command and the replica has a restore command then it should be Right. > able to follow that just fine, no? So we'd have to also check if > archive_command has been set up and hope the admin has a restore Yeah, that sounds stupid (or kind of impossible). > command. Having to go through that dance instead of just making sure to > push out the last WAL to the replica seems a bit silly though. Sounds reasonable to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center