Thread: pg_basebackup, walreceiver and wal_sender_timeout

pg_basebackup, walreceiver and wal_sender_timeout

From

Nick B

Date:

25 January 2019, 14:26:38

Hi, hackers.

When running pg_basebackup with -X s with network file system as target we would consistently get "could not receive data from WAL stream: server closed the connection unexpectedly".

On server we see this error firing: "terminating walsender process due to replication timeout"

The problem occurs during a network or file system acting very slow. One example of such case looks like this (strace output for fsync calls):

0.033383 fsync(8) = 0 <20.265275>

20.265399 fsync(8) = 0 <0.000011>

0.022892 fsync(7) = 0 <48.350566>

48.350654 fsync(7) = 0 <0.000005>

0.000674 fsync(8) = 0 <0.851536>

0.851619 fsync(8) = 0 <0.000007>

0.000067 fsync(7) = 0 <0.000006>

0.000045 fsync(7) = 0 <0.000005>

0.031733 fsync(8) = 0 <0.826957>

0.827869 fsync(8) = 0 <0.000016>

0.005344 fsync(7) = 0 <1.437103>

1.446450 fsync(6) = 0 <0.063148>

0.063246 fsync(6) = 0 <0.000006>

0.000381 +++ exited with 1 +++

So the longest fsync call took 48 seconds, but how would that result in a termination if wal_sender_timeout is (default) 60 seconds?

The problem is in the way wal_sender handles this timeout:

* If half of wal_sender_timeout has lapsed without receiving any reply

* from the standby, send a keep-alive message to the standby requesting

* an immediate reply.

Obviously the receiver cannot respond immediately while in a syscall.

This begs a question, why is the GUC handled the way it is? What would be the correct way to solve this? Shall we change the behaviour or do a better job explaining what are implications of wal_sender_timeout in the docs?

Regards,

Nick.

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Michael Paquier

Date:

26 January 2019, 03:23:27

On Fri, Jan 25, 2019 at 03:26:38PM +0100, Nick B wrote:
> On server we see this error firing: "terminating walsender process due to
> replication timeout"
> The problem occurs during a network or file system acting very slow. One
> example of such case looks like this (strace output for fsync calls):
>
> 0.033383 fsync(8)                  = 0 <20.265275>
> 20.265399 fsync(8)                 = 0 <0.000011>
> 0.022892 fsync(7)                  = 0 <48.350566>
> 48.350654 fsync(7)                 = 0 <0.000005>
> 0.000674 fsync(8)                  = 0 <0.851536>
> 0.851619 fsync(8)                  = 0 <0.000007>
> 0.000067 fsync(7)                  = 0 <0.000006>
> 0.000045 fsync(7)                  = 0 <0.000005>
> 0.031733 fsync(8)                  = 0 <0.826957>
> 0.827869 fsync(8)                  = 0 <0.000016>
> 0.005344 fsync(7)                  = 0 <1.437103>
> 1.446450 fsync(6)                  = 0 <0.063148>
> 0.063246 fsync(6)                  = 0 <0.000006>
> 0.000381 +++ exited with 1 +++

These are a bit unregular.  Which files are taking that long to
complete while others are way faster?  It may be something that we
could improve on the base backup side as there is no actual point in
syncing segments while the backup is running and we could delay that
at the end of the backup (if I recall that stuff correctly).

> This begs a question, why is the GUC handled the way it is? What would be
> the correct way to solve this? Shall we change the behaviour or do a better
> job explaining what are implications of wal_sender_timeout in the
> docs?

The following commit and thread are the ones you look for here:
https://www.postgresql.org/message-id/506972B9.6060104@vmware.com

commit: 6f60fdd7015b032bf49273c99f80913d57eac284
committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Thu, 11 Oct 2012 17:48:08 +0300
Improve replication connection timeouts.

Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval
to a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the
other end.

The docs could be improved to describe that better..
--
Michael

Attachment

signature.asc

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Nick B

Date:

26 January 2019, 12:34:49

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael@paquier.xyz> wrote:
> These are a bit unregular.  Which files are taking that long to
> complete while others are way faster?  It may be something that we
> could improve on the base backup side as there is no actual point in
> syncing segments while the backup is running and we could delay that
> at the end of the backup (if I recall that stuff correctly).

I don't have a good sample for these. One instance of this happening is below:
....
0.000125 fsync(7)                  = 0 <0.016677>
0.000039 fsync(7)                  = 0 <0.000005>
# 2048 writes for total of 16777216 bytes (16MB)
0.000618 write(7,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192 <0.000021>
0.000078 fsync(8)                  = 0 <57.609720>
57.609830 fsync(8)                  = 0 <0.000007>

Again, it is a problem with our network file system that we are still
investigating. I'm not sure this can be improved easily, since
pg_basebackup shares this code with walreceiver.

> The docs could be improved to describe that better..

I will look into that.

Regards,
Nick.

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Magnus Hagander

Date:

26 January 2019, 12:45:46

On Sat, Jan 26, 2019 at 1:35 PM Nick B <nbedxp@gmail.com> wrote:

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael@paquier.xyz> wrote:
> These are a bit unregular. Which files are taking that long to
> complete while others are way faster? It may be something that we
> could improve on the base backup side as there is no actual point in
> syncing segments while the backup is running and we could delay that
> at the end of the backup (if I recall that stuff correctly).

I don't have a good sample for these. One instance of this happening is below:
....
0.000125 fsync(7) = 0 <0.016677>
0.000039 fsync(7) = 0 <0.000005>
# 2048 writes for total of 16777216 bytes (16MB)
0.000618 write(7,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192 <0.000021>
0.000078 fsync(8) = 0 <57.609720>
57.609830 fsync(8) = 0 <0.000007>

Again, it is a problem with our network file system that we are still
investigating. I'm not sure this can be improved easily, since
pg_basebackup shares this code with walreceiver.

One workaround you could perhaps look at here is to run pg_basebackup with --no-sync. That way there will be no fsyncs issued while running. You will then of course have to take care of syncing all the files to disk after it's done, but a network filesystem might be happier in dealing with a large "batch-sync" like that rather than piece-by-piece sync.

(yes, I realize that wasn't your original question, just wanted to make sure it was a workaround you had considered)

//Magnus

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Michael Paquier

Date:

27 January 2019, 12:58:57

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:
> One workaround you could perhaps look at here is to run pg_basebackup
> with --no-sync. That way there will be no fsyncs issued while running. You
> will then of course have to take care of syncing all the files to disk
> after it's done, but a network filesystem might be happier in dealing with
> a large "batch-sync" like that rather than piece-by-piece sync.

Hm.  Aren't we actually wrong in letting the WAL receive method use
the value of do_sync depending on the command line arguments, with
true being the default for pg_basebackup?  In plain format, we flush
the full data directory anyway when the backup ends.  In tar format,
each individual tar file is flushed one-by-one after being received,
and we issue a final sync on the parent directory at the end.  So
what's missing is just to make sure that the fully generated
pg_wal.tar is synced once completed.  This would be way cheaper than
letting the stream process issue syncs for each segments, which does
not matter much in the event of a host crash because the base backup
may finish in an inconsistent state, and one should not use it.
--
Michael

Attachment

signature.asc

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Magnus Hagander

Date:

28 January 2019, 08:05:26

On Sun, Jan 27, 2019 at 1:59 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:
> One workaround you could perhaps look at here is to run pg_basebackup
> with --no-sync. That way there will be no fsyncs issued while running. You
> will then of course have to take care of syncing all the files to disk
> after it's done, but a network filesystem might be happier in dealing with
> a large "batch-sync" like that rather than piece-by-piece sync.

Hm. Aren't we actually wrong in letting the WAL receive method use
the value of do_sync depending on the command line arguments, with
true being the default for pg_basebackup? In plain format, we flush
the full data directory anyway when the backup ends. In tar format,
each individual tar file is flushed one-by-one after being received,
and we issue a final sync on the parent directory at the end. So
what's missing is just to make sure that the fully generated
pg_wal.tar is synced once completed. This would be way cheaper than
letting the stream process issue syncs for each segments, which does
not matter much in the event of a host crash because the base backup
may finish in an inconsistent state, and one should not use it.

Yeah, that could be done without giving up any of the guarantees -- we only give the guarantee at the end of the completed backup. I wouldn't necessarily say we're wrong now, but it could definitely be a nice performance improvement.

And for plain format, we'd do the same -- sync after each file segment, and then a final one of the directory when done, right?

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Michael Paquier

Date:

28 January 2019, 09:25:10

On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote:
> Yeah, that could be done without giving up any of the guarantees -- we only
> give the guarantee at the end of the completed backup. I wouldn't
> necessarily say we're wrong now, but it could definitely be a nice
> performance improvement.

The code ensures durability in its current shape, and does more than
it actually needs to.

> And for plain format, we'd do the same -- sync after each file segment, and
> then a final one of the directory when done, right?

Well, the code is doing a double amount of work in its current shape
as we call fsync_pgdata() for the plain format, which cascades to
pg_wal and all its files, so it seems to me that there is little point
in issuing a sync when each segment is finished streaming if that's
what you mean.
--
Michael

Attachment

signature.asc

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Alex Kliukin

Date:

28 January 2019, 13:00:59

On Mon, Jan 28, 2019, at 10:25, Michael Paquier wrote:
> On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote:

> > And for plain format, we'd do the same -- sync after each file segment, and
> > then a final one of the directory when done, right?
> 
> Well, the code is doing a double amount of work in its current shape
> as we call fsync_pgdata() for the plain format, which cascades to
> pg_wal and all its files, so it seems to me that there is little point
> in issuing a sync when each segment is finished streaming if that's
> what you mean.

Agreed.

While reading the doc page for the pg_basebackup, I've been confused by the fact that it says WAL files will be written
to.tarballs (either base.tar or pg_wal.tar) when pg_basebackup is instructed to stream WALs alongside the backup
itself.I think it makes sense to elaborate that it only happens when tar format is specified (doc patch is attached).

Cheers,
Oleksii

Attachment

pg_basebackup_wal_method_clarification_v1.patch

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Michael Paquier

Date:

29 January 2019, 05:19:09

On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:
> While reading the doc page for the pg_basebackup, I've been confused
> by the fact that it says WAL files will be written to .tarballs
> (either base.tar or pg_wal.tar) when pg_basebackup is instructed to
> stream WALs alongside the backup itself. I think it makes sense to
> elaborate that it only happens when tar format is specified (doc
> patch is attached).

Agreed.  The current wording can be confusing depending on the format,
and your suggestion looks right.  Any opinions from others?
--
Michael

Attachment

signature.asc

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Magnus Hagander

Date:

29 January 2019, 09:45:34

On Tue, Jan 29, 2019 at 6:19 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:
> While reading the doc page for the pg_basebackup, I've been confused
> by the fact that it says WAL files will be written to .tarballs
> (either base.tar or pg_wal.tar) when pg_basebackup is instructed to
> stream WALs alongside the backup itself. I think it makes sense to
> elaborate that it only happens when tar format is specified (doc
> patch is attached).

Agreed. The current wording can be confusing depending on the format,
and your suggestion looks right. Any opinions from others?

Agreed, definitely confusing.

Since you also agreed on it, I went ahead and pushed (with backpatch).

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Michael Paquier

Date:

29 January 2019, 12:36:13

On Tue, Jan 29, 2019 at 10:45:34AM +0100, Magnus Hagander wrote:
> Since you also agreed on it, I went ahead and pushed (with backpatch).

Thanks for taking care of it, Magnus.
--
Michael

Attachment

signature.asc

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Oleksii Kliukin

Date:

29 January 2019, 13:02:53

On 29. Jan 2019, at 10:45, Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Jan 29, 2019 at 6:19 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:
> While reading the doc page for the pg_basebackup, I've been confused
> by the fact that it says WAL files will be written to .tarballs
> (either base.tar or pg_wal.tar) when pg_basebackup is instructed to
> stream WALs alongside the backup itself. I think it makes sense to
> elaborate that it only happens when tar format is specified (doc
> patch is attached).

Agreed. The current wording can be confusing depending on the format,
and your suggestion looks right. Any opinions from others?

Agreed, definitely confusing.

Since you also agreed on it, I went ahead and pushed (with backpatch).

Thanks a lot (and to Michael as well for looking into it)!

Cheers,

Oleksii

Re: pg_basebackup, walreceiver and wal_sender_timeout

From

Nick B

Date:

29 January 2019, 16:11:30

Greetings,

I also would like to thank everyone for looking into this.

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:
> One workaround you could perhaps look at here is to run pg_basebackup
> with --no-sync. That way there will be no fsyncs issued while running. You
> will then of course have to take care of syncing all the files to disk
> after it's done, but a network filesystem might be happier in dealing with
> a large "batch-sync" like that rather than piece-by-piece sync.

Thanks for the pointer. I actually was not aware of the existence of this flag. I've ran two rounds of tests with --no-sync and backup failed at a much later point in time, which suggests that the bottleneck is in fact the metadata server of ceph. We're now looking into ways of improving this. (This is a 15TB cluster with a few hundred thousands tables which on average generates 4 WAL segments per second, so throttling transfer rate is not a good option either).

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> The docs could be improved to describe that better..

I had an off-list discussion of a possible documentation update with Stephen Frost and he voiced an opinion that the behaviour I was trying to describe sounds a lot like a bug and documenting that is not a good practice.

Upon further examination of WalSndKeepaliveIfNecessary I found out that the implementation of "requesting an immediate reply" is done by setting the socket into non-blocking mode and issuing a flush. I find it hard to believe there is a scenario where client can react to that keep-alive on time (unless of course I misunderstood something). So the question is, will we ever wait the actual wal_sender_timeout before terminating the connection?

Regards,

Nick.