wal_sender_timeout should ignore server-side latency - Mailing list pgsql-hackers

From Noah Misch
Subject wal_sender_timeout should ignore server-side latency
Date
Msg-id 20180826034600.GA1105084@rfd.leadboat.com
List pgsql-hackers
WalSndLoop() does this, simplifying considerably:

    for (;;)
    {
        /* does: last_reply_timestamp = GetCurrentTimestamp() */
        ProcessRepliesIfAny();  
        send_data();  /* e.g. XLogSendPhysical(), which calls XLogRead() */
        WalSndCheckTimeOut(GetCurrentTimestamp());
    }

A consequence is that any time spent in the send_data() callback counts
against the timeout.  In particular, if a single send_data() takes longer than
wal_sender_timeout, the client is powerless to prevent a timeout.  This
disagrees with the wal_sender_timeout documentation ("Terminate replication
connections that are inactive longer than the specified number of
milliseconds. This is useful for the sending server to detect a standby crash
or network outage").  I find it undesirable.

The fix, attached, is to interpret the timeout relative to a timestamp taken
just before ProcessRepliesIfAny() polls the socket.  If that timestamp is at
least wal_sender_timeout later than the last reply, we can terminate with
confidence, since the client had the whole interval to send something and the
poll still found nothing.  This adds one gettimeofday() per
ProcessRepliesIfAny() finding no replies, which feels cheap enough.
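
To make the difference concrete, here is a toy model of the loop, not the
attached patch.  It uses a fake millisecond clock, and every name in it
(simulate(), TIMEOUT_MS, SEND_MS, last_poll) is hypothetical rather than taken
from walsender.c.  It assumes each send_data() call takes 65s against a 60s
timeout, and that a responsive client always has a reply waiting when the
socket is polled.

```c
#include <stdbool.h>

typedef long long ms_t;

/* Hypothetical model parameters; not PostgreSQL settings or APIs. */
enum { TIMEOUT_MS = 60000, SEND_MS = 65000, ITERATIONS = 5 };

/*
 * Toy version of the sender loop on a simulated clock.
 *
 * use_poll_time:  false = compare "now" (after send_data) to the last reply;
 *                 true  = compare the pre-poll timestamp to the last reply.
 * client_replies: whether a reply is waiting each time the socket is polled.
 *
 * Returns true if the model would declare a timeout.
 */
static bool
simulate(bool use_poll_time, bool client_replies)
{
    ms_t now = 0;
    ms_t last_reply = 0;
    ms_t last_poll = 0;

    for (int i = 0; i < ITERATIONS; i++)
    {
        /* Models ProcessRepliesIfAny(): note the time, then poll. */
        last_poll = now;
        if (client_replies)
            last_reply = last_poll;

        /* Models send_data(): purely server-side latency. */
        now += SEND_MS;

        /* Models WalSndCheckTimeOut(). */
        ms_t reference = use_poll_time ? last_poll : now;
        if (reference - last_reply >= TIMEOUT_MS)
            return true;
    }
    return false;
}
```

In this model, simulate(false, true) reports a timeout on the first iteration
even though the client replied at every opportunity, while simulate(true, true)
never does.  A silent client is still caught: simulate(true, false) reports a
timeout at the second poll, once the pre-poll timestamp is 65s past the last
reply.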

We've seen a number of wal_sender_timeout buildfarm failures on systems with
I/O performance trouble:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-08-16%2020:55:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-06-30%2020:38:10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-12%2018:12:36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2018-01-13%2005:01:17
https://postgr.es/m/flat/20170604211229.GA1528911@rfd.leadboat.com

Fixing $SUBJECT won't necessarily cure that, because an I/O stall on the
client side can still cause a failure.  We'd need something like threads or
async I/O to avoid that.  I mention a less-important corner case in the
WalSndCheckTimeOut() header comment.  You can simulate slow XLogSendPhysical()
to explore these problems on any system:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2731,2 +2732,5 @@ XLogSendPhysical(void)
     enlargeStringInfo(&output_message, nbytes);
+    PG_SETMASK(&BlockSig);
+    pg_usleep(65 * 1000 * 1000);
+    PG_SETMASK(&UnBlockSig);
     XLogRead(&output_message.data[output_message.len], startptr, nbytes);
