Thread: walsender waiting_for_ping spuriously set

walsender waiting_for_ping spuriously set

From
Alvaro Herrera
Date:
Ashutosh Bapat noticed that WalSndWaitForWal() is setting
waiting_for_ping_response after sending a keepalive that does *not*
request a reply.  The bad consequence is that other callers that do
require a reply end up in not sending a keepalive, because they think it
was already sent previously.  So the whole thing gets stuck.

He found that commit 41d5f8ad734 failed to remove the setting of
waiting_for_ping_response after changing the "request" parameter
WalSndKeepalive from true to false; that seems to have been an omission
and it breaks the algorithm.  Thread at [1].

The simplest fix is just to remove the line that sets
waiting_for_ping_response, but I think it is less error-prone to have
WalSndKeepalive set the flag itself, instead of expecting its callers to
do it (and know when to).  Patch attached.  Also rewords some related
commentary.

[1] https://postgr.es/m/flat/BLU436-SMTP25712B7EF9FC2ADEB87C522DC040@phx.gbl

-- 
Álvaro Herrera       Valdivia, Chile

Attachment

Re: walsender waiting_for_ping spuriously set

From
Ashutosh Bapat
Date:
The patch looks good to me. Thanks for improving comments around that code. I like the change to set waiting_for_ping_response in WalSndKeepalive. Thanks.

On Fri, 7 Aug 2020 at 04:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Ashutosh Bapat noticed that WalSndWaitForWal() is setting
waiting_for_ping_response after sending a keepalive that does *not*
request a reply.  The bad consequence is that other callers that do
require a reply end up in not sending a keepalive, because they think it
was already sent previously.  So the whole thing gets stuck.

He found that commit 41d5f8ad734 failed to remove the setting of
waiting_for_ping_response after changing the "request" parameter
WalSndKeepalive from true to false; that seems to have been an omission
and it breaks the algorithm.  Thread at [1].

The simplest fix is just to remove the line that sets
waiting_for_ping_response, but I think it is less error-prone to have
WalSndKeepalive set the flag itself, instead of expecting its callers to
do it (and know when to).  Patch attached.  Also rewords some related
commentary.

[1] https://postgr.es/m/flat/BLU436-SMTP25712B7EF9FC2ADEB87C522DC040@phx.gbl

--
Álvaro Herrera       Valdivia, Chile


--
Best Wishes,
Ashutosh

Re: walsender waiting_for_ping spuriously set

From
Alvaro Herrera
Date:
I just noticed that part of this comment I'm modifying:

> @@ -1444,17 +1444,13 @@ WalSndWaitForWal(XLogRecPtr loc)
>           * We only send regular messages to the client for full decoded
>           * transactions, but a synchronous replication and walsender shutdown
>           * possibly are waiting for a later location. So, before sleeping, we
> -         * send a ping containing the flush location. If the receiver is
> -         * otherwise idle, this keepalive will trigger a reply. Processing the
> -         * reply will update these MyWalSnd locations.
> +         * send a ping containing the flush location. A reply from standby is
> +         * not needed and would be wasteful.

was added very recently, in f246ea3b2a5e ("In caught-up logical
walsender, sleep only in WalSndWaitForWal().").  Added Noah to CC.

I think the walreceiver will only send a reply if
wal_receiver_status_interval is set to a nonzero value.  I don't
understand what reason could there possibly be for setting this
parameter to zero, but it seems better to be explicit about it, as this
code is confusing enough.

I'm thinking in keeping the sentences that were added in that commit,
maybe like so:

>           * We only send regular messages to the client for full decoded
>           * transactions, but a synchronous replication and walsender shutdown
>           * possibly are waiting for a later location. So, before sleeping, we
> +         * send a ping containing the flush location. A reply from standby is
> +         * not needed and would be wasteful most of the time,
> +         * but if the receiver is otherwise idle and walreceiver status messages
> +         * are enabled, this keepalive will trigger a reply.  Processing the
> +         * reply will update these MyWalSnd locations.

(Also, the comment would be updated all the way back to 9.5, even if
f246ea3b2a5e itself was not.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: walsender waiting_for_ping spuriously set

From
Alvaro Herrera
Date:
On 2020-Aug-07, Alvaro Herrera wrote:

> I'm thinking in keeping the sentences that were added in that commit,
> maybe like so:
> 
> >           * We only send regular messages to the client for full decoded
> >           * transactions, but a synchronous replication and walsender shutdown
> >           * possibly are waiting for a later location. So, before sleeping, we
> > +         * send a ping containing the flush location. A reply from standby is
> > +         * not needed and would be wasteful most of the time,
> > +         * but if the receiver is otherwise idle and walreceiver status messages
> > +         * are enabled, this keepalive will trigger a reply.  Processing the
> > +         * reply will update these MyWalSnd locations.

After rereading this a few more times, I think it's OK as Noah had it,
so I'll just use his wording.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: walsender waiting_for_ping spuriously set

From
Alvaro Herrera
Date:
Pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: walsender waiting_for_ping spuriously set

From
Andres Freund
Date:
On 2020-08-06 18:55:58 -0400, Alvaro Herrera wrote:
> Ashutosh Bapat noticed that WalSndWaitForWal() is setting
> waiting_for_ping_response after sending a keepalive that does *not*
> request a reply.  The bad consequence is that other callers that do
> require a reply end up in not sending a keepalive, because they think it
> was already sent previously.  So the whole thing gets stuck.
> 
> He found that commit 41d5f8ad734 failed to remove the setting of
> waiting_for_ping_response after changing the "request" parameter
> WalSndKeepalive from true to false; that seems to have been an omission
> and it breaks the algorithm.  Thread at [1].
> 
> The simplest fix is just to remove the line that sets
> waiting_for_ping_response, but I think it is less error-prone to have
> WalSndKeepalive set the flag itself, instead of expecting its callers to
> do it (and know when to).  Patch attached.  Also rewords some related
> commentary.

Thanks for diagnosis and fix!

- Andres