On Fri, Jan 25, 2019 at 03:26:38PM +0100, Nick B wrote:
> On server we see this error firing: "terminating walsender process due to
> replication timeout"
> The problem occurs during a network or file system acting very slow. One
> example of such case looks like this (strace output for fsync calls):
>
> 0.033383 fsync(8) = 0 <20.265275>
> 20.265399 fsync(8) = 0 <0.000011>
> 0.022892 fsync(7) = 0 <48.350566>
> 48.350654 fsync(7) = 0 <0.000005>
> 0.000674 fsync(8) = 0 <0.851536>
> 0.851619 fsync(8) = 0 <0.000007>
> 0.000067 fsync(7) = 0 <0.000006>
> 0.000045 fsync(7) = 0 <0.000005>
> 0.031733 fsync(8) = 0 <0.826957>
> 0.827869 fsync(8) = 0 <0.000016>
> 0.005344 fsync(7) = 0 <1.437103>
> 1.446450 fsync(6) = 0 <0.063148>
> 0.063246 fsync(6) = 0 <0.000006>
> 0.000381 +++ exited with 1 +++
These are a bit irregular. Which files are taking that long to
complete while others are way faster? It may be something that we
could improve on the base backup side, as there is no actual point in
syncing segments while the backup is running, and we could delay that
until the end of the backup (if I recall that stuff correctly).
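
To narrow down which files are slow, something like the following rough
sketch could pull the slow calls out of the strace output, grouped by file
descriptor. It assumes the "strace -r -T" line format shown above; the
function name and the 1 second threshold are only illustrative. On Linux
the descriptors can then be mapped to files through /proc/<pid>/fd while
the process is still alive.

```python
import re
from collections import defaultdict

# Matches lines such as "0.033383 fsync(8) = 0 <20.265275>".
# Group 1 is the file descriptor, group 2 the time spent in the call.
FSYNC_RE = re.compile(r"fsync\((\d+)\)\s*=\s*-?\d+.*<(\d+\.\d+)>")


def slow_fsyncs(lines, threshold=1.0):
    """Return {fd: [durations]} for fsync calls taking >= threshold seconds.

    Hypothetical helper; the threshold is an arbitrary choice.
    """
    slow = defaultdict(list)
    for line in lines:
        match = FSYNC_RE.search(line)
        if match is None:
            continue  # not an fsync line, e.g. "+++ exited with 1 +++"
        fd, duration = int(match.group(1)), float(match.group(2))
        if duration >= threshold:
            slow[fd].append(duration)
    return dict(slow)


# A few lines copied from the output quoted above.
SAMPLE = """\
0.033383 fsync(8) = 0 <20.265275>
20.265399 fsync(8) = 0 <0.000011>
0.022892 fsync(7) = 0 <48.350566>
48.350654 fsync(7) = 0 <0.000005>
0.005344 fsync(7) = 0 <1.437103>
1.446450 fsync(6) = 0 <0.063148>
0.000381 +++ exited with 1 +++
"""

if __name__ == "__main__":
    for fd, durations in sorted(slow_fsyncs(SAMPLE.splitlines()).items()):
        print(f"fd {fd}: slow calls {durations}")
```
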
> This begs a question, why is the GUC handled the way it is? What would be
> the correct way to solve this? Shall we change the behaviour or do a better
> job explaining what are implications of wal_sender_timeout in the
> docs?
The following commit and thread are the ones you are looking for here:
https://www.postgresql.org/message-id/506972B9.6060104@vmware.com
commit: 6f60fdd7015b032bf49273c99f80913d57eac284
committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Thu, 11 Oct 2012 17:48:08 +0300

Improve replication connection timeouts.

Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval
to a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the
other end.

The docs could be improved to describe that better..
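
To make the rule from that commit message concrete, here is a simplified
sketch of the decision each side makes. This is not the actual PostgreSQL
code (which is in C); the function name and return values are made up for
illustration, and it leaves out details such as tracking whether a ping has
already been sent.

```python
def keepalive_action(now, last_reply, timeout):
    """Decide what one side of a replication connection should do.

    now, last_reply: times in seconds on the same clock; last_reply is
        when the last message from the other end was received.
    timeout: configured timeout in seconds; zero disables the mechanism.

    Returns "wait", "ping" or "terminate" (hypothetical labels).
    """
    if timeout <= 0:
        return "wait"  # timeout mechanism disabled
    silent_for = now - last_reply
    if silent_for >= timeout:
        return "terminate"  # nothing heard for the whole timeout
    if silent_for > timeout / 2:
        return "ping"  # more than half the timeout elapsed in silence
    return "wait"


if __name__ == "__main__":
    for elapsed in (10, 31, 60):
        print(elapsed, keepalive_action(elapsed, 0, 60))
```

With a timeout of 60 seconds, this would ask for a ping once more than 30
seconds have passed without any message, and give up at 60 seconds. So a
peer that is stuck for long enough (in a slow fsync, for instance) and
cannot answer in time would be treated as a broken connection.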
--
Michael