Thread: walsender waiting_for_ping spuriously set
Ashutosh Bapat noticed that WalSndWaitForWal() is setting waiting_for_ping_response after sending a keepalive that does *not* request a reply. The bad consequence is that other callers that do require a reply end up in not sending a keepalive, because they think it was already sent previously. So the whole thing gets stuck. He found that commit 41d5f8ad734 failed to remove the setting of waiting_for_ping_response after changing the "request" parameter WalSndKeepalive from true to false; that seems to have been an omission and it breaks the algorithm. Thread at [1]. The simplest fix is just to remove the line that sets waiting_for_ping_response, but I think it is less error-prone to have WalSndKeepalive set the flag itself, instead of expecting its callers to do it (and know when to). Patch attached. Also rewords some related commentary. [1] https://postgr.es/m/flat/BLU436-SMTP25712B7EF9FC2ADEB87C522DC040@phx.gbl -- Álvaro Herrera Valdivia, Chile
Attachment
The patch looks good to me. Thanks for improving comments around that code. I like the change to set waiting_for_ping_response in WalSndKeepalive. Thanks.
On Fri, 7 Aug 2020 at 04:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Ashutosh Bapat noticed that WalSndWaitForWal() is setting
waiting_for_ping_response after sending a keepalive that does *not*
request a reply. The bad consequence is that other callers that do
require a reply end up in not sending a keepalive, because they think it
was already sent previously. So the whole thing gets stuck.
He found that commit 41d5f8ad734 failed to remove the setting of
waiting_for_ping_response after changing the "request" parameter
WalSndKeepalive from true to false; that seems to have been an omission
and it breaks the algorithm. Thread at [1].
The simplest fix is just to remove the line that sets
waiting_for_ping_response, but I think it is less error-prone to have
WalSndKeepalive set the flag itself, instead of expecting its callers to
do it (and know when to). Patch attached. Also rewords some related
commentary.
[1] https://postgr.es/m/flat/BLU436-SMTP25712B7EF9FC2ADEB87C522DC040@phx.gbl
--
Álvaro Herrera Valdivia, Chile
Best Wishes,
Ashutosh
I just noticed that part of this comment I'm modifying: > @@ -1444,17 +1444,13 @@ WalSndWaitForWal(XLogRecPtr loc) > * We only send regular messages to the client for full decoded > * transactions, but a synchronous replication and walsender shutdown > * possibly are waiting for a later location. So, before sleeping, we > - * send a ping containing the flush location. If the receiver is > - * otherwise idle, this keepalive will trigger a reply. Processing the > - * reply will update these MyWalSnd locations. > + * send a ping containing the flush location. A reply from standby is > + * not needed and would be wasteful. was added very recently, in f246ea3b2a5e ("In caught-up logical walsender, sleep only in WalSndWaitForWal()."). Added Noah to CC. I think the walreceiver will only send a reply if wal_receiver_status_interval is set to a nonzero value. I don't understand what reason could there possibly be for setting this parameter to zero, but it seems better to be explicit about it, as this code is confusing enough. I'm thinking in keeping the sentences that were added in that commit, maybe like so: > * We only send regular messages to the client for full decoded > * transactions, but a synchronous replication and walsender shutdown > * possibly are waiting for a later location. So, before sleeping, we > + * send a ping containing the flush location. A reply from standby is > + * not needed and would be wasteful most of the time, > + * but if the receiver is otherwise idle and walreceiver status messages > + * are enabled, this keepalive will trigger a reply. Processing the > + * reply will update these MyWalSnd locations. (Also, the comment would be updated all the way back to 9.5, even if f246ea3b2a5e itself was not.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-07, Alvaro Herrera wrote: > I'm thinking in keeping the sentences that were added in that commit, > maybe like so: > > > * We only send regular messages to the client for full decoded > > * transactions, but a synchronous replication and walsender shutdown > > * possibly are waiting for a later location. So, before sleeping, we > > + * send a ping containing the flush location. A reply from standby is > > + * not needed and would be wasteful most of the time, > > + * but if the receiver is otherwise idle and walreceiver status messages > > + * are enabled, this keepalive will trigger a reply. Processing the > > + * reply will update these MyWalSnd locations. After rereading this a few more times, I think it's OK as Noah had it, so I'll just use his wording. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Pushed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-08-06 18:55:58 -0400, Alvaro Herrera wrote: > Ashutosh Bapat noticed that WalSndWaitForWal() is setting > waiting_for_ping_response after sending a keepalive that does *not* > request a reply. The bad consequence is that other callers that do > require a reply end up in not sending a keepalive, because they think it > was already sent previously. So the whole thing gets stuck. > > He found that commit 41d5f8ad734 failed to remove the setting of > waiting_for_ping_response after changing the "request" parameter > WalSndKeepalive from true to false; that seems to have been an omission > and it breaks the algorithm. Thread at [1]. > > The simplest fix is just to remove the line that sets > waiting_for_ping_response, but I think it is less error-prone to have > WalSndKeepalive set the flag itself, instead of expecting its callers to > do it (and know when to). Patch attached. Also rewords some related > commentary. Thanks for diagnosis and fix! - Andres