Re: pg_basebackup, walreceiver and wal_sender_timeout - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: pg_basebackup, walreceiver and wal_sender_timeout
Msg-id 20190126032327.GJ6459@paquier.xyz
In response to pg_basebackup, walreceiver and wal_sender_timeout  (Nick B <nbedxp@gmail.com>)
Responses Re: pg_basebackup, walreceiver and wal_sender_timeout
List pgsql-hackers
On Fri, Jan 25, 2019 at 03:26:38PM +0100, Nick B wrote:
> On server we see this error firing: "terminating walsender process due to
> replication timeout"
> The problem occurs during a network or file system acting very slow. One
> example of such case looks like this (strace output for fsync calls):
>
> 0.033383 fsync(8)                  = 0 <20.265275>
> 20.265399 fsync(8)                 = 0 <0.000011>
> 0.022892 fsync(7)                  = 0 <48.350566>
> 48.350654 fsync(7)                 = 0 <0.000005>
> 0.000674 fsync(8)                  = 0 <0.851536>
> 0.851619 fsync(8)                  = 0 <0.000007>
> 0.000067 fsync(7)                  = 0 <0.000006>
> 0.000045 fsync(7)                  = 0 <0.000005>
> 0.031733 fsync(8)                  = 0 <0.826957>
> 0.827869 fsync(8)                  = 0 <0.000016>
> 0.005344 fsync(7)                  = 0 <1.437103>
> 1.446450 fsync(6)                  = 0 <0.063148>
> 0.063246 fsync(6)                  = 0 <0.000006>
> 0.000381 +++ exited with 1 +++

These are a bit irregular.  Which files take that long to sync while
others complete much faster?  This may be something we could improve on
the base backup side: there is no real point in syncing segments while
the backup is running, so we could delay that until the end of the
backup (if I recall that stuff correctly).
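
To illustrate the idea of deferring the syncs, here is a minimal Python
sketch.  It is not pg_basebackup's actual code: the helper name
write_then_sync and its arguments are hypothetical, and the final
directory fsync assumes a POSIX system.  All files are written first,
and the fsync calls happen only in one pass at the end.

```python
import os


def write_then_sync(directory, segments):
    """Write every file first, then fsync them all in a final pass.

    `segments` maps a file name to its contents (bytes).
    Hypothetical helper for illustration only.
    """
    paths = []
    for name, data in segments.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)  # deliberately no fsync while data is streaming
        paths.append(path)

    # Sync phase: a slow fsync here no longer stalls the receiving loop.
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Also sync the directory so the new entries are durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return paths
```

The trade-off is that nothing is guaranteed durable until the final
pass completes, which is acceptable only if an interrupted backup is
discarded anyway.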

> This begs a question, why is the GUC handled the way it is? What would be
> the correct way to solve this? Shall we change the behaviour or do a better
> job explaining what are implications of wal_sender_timeout in the
> docs?

The following commit and thread are the ones you are looking for here:
https://www.postgresql.org/message-id/506972B9.6060104@vmware.com

commit: 6f60fdd7015b032bf49273c99f80913d57eac284
committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Thu, 11 Oct 2012 17:48:08 +0300
Improve replication connection timeouts.

Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval
to a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the
other end.
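
The rule described in that commit message can be sketched roughly as
follows.  This is a simplified Python model, not the walsender source:
the function name next_action and its parameters are made up for
illustration, and times are plain seconds.

```python
def next_action(now, last_message, timeout, ping_sent):
    """Decide what one side of the connection does at time `now`.

    last_message: time the last message from the other end arrived
    timeout:      configured timeout in seconds (0 disables the check)
    ping_sent:    whether a ping was already sent since last_message
    """
    if timeout <= 0:
        return "wait"  # timeout disabled
    elapsed = now - last_message
    if elapsed > timeout:
        return "terminate"  # other end stayed silent for the full timeout
    if elapsed > timeout / 2 and not ping_sent:
        return "ping"  # ask the other end to respond
    return "wait"
```

With a timeout of 60 seconds, a silent peer is pinged once more than
30 seconds have passed and the connection is dropped after 60.  If the
receiving side cannot answer while it is stuck in long fsync calls like
the ones in the strace output above, that limit could plausibly be hit.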

The docs could be improved to describe that better.
--
Michael

