Thread: Walsender may fail to send wal to the end.

Walsender may fail to send wal to the end.

From
Kyotaro Horiguchi
Date:
Hello, I happened to see a doubious behavior of walsender.

On a replication set with wal_keep_size/(segments) = 0, running the
following command on the primary causes walsender to fail to send up
to the final shutdown checkpoint record to the standby.

(create table t in advance)

psql -c 'insert into t values(0); select pg_switch_wal();'; pg_ctl stop

The primary complains like this:

2021-03-26 17:59:29.324 JST [checkpointer][140697] LOG:  shutting down
2021-03-26 17:59:29.387 JST [walsender][140816] ERROR:  requested WAL segment 000000010000000000000032 has already been
removed
2021-03-26 17:59:29.387 JST [walsender][140816] STATEMENT:  START_REPLICATION 0/32000000 TIMELINE 1
2021-03-26 17:59:29.394 JST [postmaster][140695] LOG:  database system is shut down

This is because XLogSendPhysical detects removal of the wal segment
currently reading by shutdown checkpoint.  However, there' no fear of
overwriting of WAL segments at the time.

So I think we can omit the call to CheckXLogRemoved() while
MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
the shutdown checkpoint completes.

Of course that doesn't help if walsender was running two segments
behind. There still could be a small window for the failure.  But it's
a great help to save the case of just 1 segment behind.

Is it worth fixing?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa4498a..4b1e0cf9c5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2755,9 +2755,18 @@ retry:
                  &errinfo))
         WALReadRaiseError(&errinfo);
 
-    /* See logical_read_xlog_page(). */
-    XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
-    CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
+    /*
+     * See logical_read_xlog_page().  However, there's a case where we're
+     * reading the segment which is removed/recycled by shutdown checkpoint.
+     * We continue to read it in that case because 1) It's safe because no wal
+     * activity happens after shutdown checkpoint completes, 2) We need to do
+     * our best to send WAL up to the shutdown checkpoint record.
+     */
+    if (MyWalSnd->state < WALSNDSTATE_STOPPING)
+    {
+        XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
+        CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
+    }
 
     /*
      * During recovery, the currently-open WAL file might be replaced with the

Re: Walsender may fail to send wal to the end.

From
Andres Freund
Date:
Hi,

On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
> This is because XLogSendPhysical detects removal of the wal segment
> currently reading by shutdown checkpoint.  However, there' no fear of
> overwriting of WAL segments at the time.
>
> So I think we can omit the call to CheckXLogRemoved() while
> MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
> the shutdown checkpoint completes.
>
> Of course that doesn't help if walsender was running two segments
> behind. There still could be a small window for the failure.  But it's
> a great help to save the case of just 1 segment behind.

-1. This seems like a bandaid to make a broken configuration work a tiny
bit better, without actually being meaningfully better.

Greetings,

Andres Freund



Re: Walsender may fail to send wal to the end.

From
Kyotaro Horiguchi
Date:
At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
> > > This is because XLogSendPhysical detects removal of the wal segment
> > > currently reading by shutdown checkpoint.  However, there' no fear of
> > > overwriting of WAL segments at the time.
> > >
> > > So I think we can omit the call to CheckXLogRemoved() while
> > > MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
> > > the shutdown checkpoint completes.
> > >
> > > Of course that doesn't help if walsender was running two segments
> > > behind. There still could be a small window for the failure.  But it's
> > > a great help to save the case of just 1 segment behind.
> > 
> > -1. This seems like a bandaid to make a broken configuration work a tiny
> > bit better, without actually being meaningfully better.
> 
> Agreed.  Still, wouldn't it be better to avoid such configurations and
> protect a bit things with a check on the new value?

The repro was a bit artificial but the symptom happened without
pg_switch_wal() and no load.  It caused just by shutting down of
primary.  If it is normal behavior for walsenders to fail to send the
last shutdown record to standby while fast shutdown, we should refuse
to startup at least wal sender if wal_keep_size = 0.

I can guess two ways to do that.

1. refuse to start server if wal_keep_size = 0 when max_wal_senders > 0.

2. refuse to start wal sender if wal_keep_size= 0.

2 looks like broken.  1 is somewhat annoying.. However, since
max_wal_senders already premises wal_level > minimal, we can accept
that restriction?


<start serer>

FATAL:  WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical"

<Mmm. wal_level, fixed, then retry starting server>

FATAL:  WAL streaming (max_wal_senders > 0) requires wal_keep_size to be at least 1MB

<Oops!>

Of couse we can list all incompatible parameters at once.

FATAL:  WAL streaming (max_wal_senders > 0) requires wal_level "replica" or "logical" and wal_keep_size to be at least
1MB

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Walsender may fail to send wal to the end.

From
Kyotaro Horiguchi
Date:
At Mon, 29 Mar 2021 11:41:32 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > At Mon, 29 Mar 2021 14:47:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> > > On Fri, Mar 26, 2021 at 10:16:40AM -0700, Andres Freund wrote:
> > > > On 2021-03-26 18:20:14 +0900, Kyotaro Horiguchi wrote:
> > > > > This is because XLogSendPhysical detects removal of the wal segment
> > > > > currently reading by shutdown checkpoint.  However, there' no fear of
> > > > > overwriting of WAL segments at the time.
> > > > >
> > > > > So I think we can omit the call to CheckXLogRemoved() while
> > > > > MyWalSnd->state is WALSNDSTTE_STOPPING because the state comes after
> > > > > the shutdown checkpoint completes.
> > > > >
> > > > > Of course that doesn't help if walsender was running two segments
> > > > > behind. There still could be a small window for the failure.  But it's
> > > > > a great help to save the case of just 1 segment behind.
> > > > 
> > > > -1. This seems like a bandaid to make a broken configuration work a tiny
> > > > bit better, without actually being meaningfully better.
> > > 
> > > Agreed.  Still, wouldn't it be better to avoid such configurations and
> > > protect a bit things with a check on the new value?
> 
> I have a hard time agreeing that this is somehow a 'broken'
> configuration, instead it looks like a race condition that wasn't
> considered and should be addressed.  If there's zero lag then we really
> should allow the final WAL to get sent to the replica.

My unstated point was switching primary/secondary roles in a
replication set where both host have separate archives, by the steps
"fast shutdown primary"->"promote standby"->"attach the old primary as
new standby", wihtout a need of synchronizing old primary's archive to
that of the new standby before starting the new standby. I thought
that should work even if wal_keep_size = 0.

> > The repro was a bit artificial but the symptom happened without
> > pg_switch_wal() and no load.  It caused just by shutting down of
> > primary.  If it is normal behavior for walsenders to fail to send the
> > last shutdown record to standby while fast shutdown, we should refuse
> > to startup at least wal sender if wal_keep_size = 0.
> > 
> > I can guess two ways to do that.
> 
> Both of which will break things for people, so this certainly isn't a
> great approach, and besides, if archiving is happening with
> archive_command and the replica has a restore command then it should be

Right. 

> able to follow that just fine, no?  So we'd have to also check if
> archive_command has been set up and hope the admin has a restore

Yeah, that sounds stupid (or kind of impossible).

> command.  Having to go through that dance instead of just making sure to
> push out the last WAL to the replica seems a bit silly though.

Sounds reasonable to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center