Re: pg_basebackup, walreceiver and wal_sender_timeout - Mailing list pgsql-hackers

From Nick B
Subject Re: pg_basebackup, walreceiver and wal_sender_timeout
Msg-id CAPHA_mkS-70+FWku4tQiMR+NVJe826Y6oCEG69YaJtWi2C2Ebw@mail.gmail.com
In response to Re: pg_basebackup, walreceiver and wal_sender_timeout  (Oleksii Kliukin <alexk@hintbits.com>)
List pgsql-hackers
Greetings,
I would also like to thank everyone for looking into this.

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:
> One workaround you could perhaps look at here is to run pg_basebackup
> with --no-sync. That way there will be no fsyncs issued while running. You
> will then of course have to take care of syncing all the files to disk
> after it's done, but a network filesystem might be happier in dealing with
> a large "batch-sync" like that rather than piece-by-piece sync.

Thanks for the pointer. I was not aware that this flag existed. I ran two rounds of tests with --no-sync, and the backup failed at a much later point in time, which suggests that the bottleneck is in fact the Ceph metadata server. We are now looking into ways of improving this. (This is a 15TB cluster with a few hundred thousand tables that generates 4 WAL segments per second on average, so throttling the transfer rate is not a good option either.)
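
For reference, the "batch sync" step that has to follow a --no-sync run could be sketched roughly like this. It is only an illustration: the function names and the traversal order are my own assumptions, not what pg_basebackup does internally, and it assumes a POSIX system where a directory can be opened read-only and fsynced.

```python
import os


def fsync_path(path: str) -> None:
    """Open a file or directory read-only and flush it to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def sync_tree(root: str) -> int:
    """fsync every regular file under root, then each directory.

    Hypothetical helper for illustration only. Returns the number of
    files synced. Symlinks are skipped rather than followed.
    """
    synced = 0
    # Walk bottom-up so a directory is synced after its contents.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            fsync_path(full)
            synced += 1
        # Sync the directory itself so its entries are durable too.
        fsync_path(dirpath)
    return synced
```

Calling sync_tree() on the backup target directory once the backup has finished would issue all the fsyncs in one pass instead of interleaving them with the transfer.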

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> The docs could be improved to describe that better..

I had an off-list discussion with Stephen Frost about a possible documentation update. His opinion was that the behaviour I was trying to describe sounds a lot like a bug, and that documenting a bug is not good practice.

Upon further examination of WalSndKeepaliveIfNecessary, I found that "requesting an immediate reply" is implemented by setting the socket to non-blocking mode and issuing a flush. I find it hard to believe there is a scenario where the client can react to that keepalive in time (unless, of course, I have misunderstood something). So the question is: will we ever wait the full wal_sender_timeout before terminating the connection?
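
To make the concern concrete, here is a toy sketch of a non-blocking send on a local socket pair. It is not the walsender code: the function names and the one-byte payload are placeholders I made up, and it only shows that a non-blocking flush returns immediately, so a reply can only be observed after the peer has had a chance to read the message and answer.

```python
import socket
from typing import Optional


def send_keepalive_nonblocking(sock: socket.socket, payload: bytes) -> int:
    """Switch the socket to non-blocking mode and try to send right away.

    Returns the number of bytes handed to the kernel (0 if the send
    buffer is full). It never waits for the peer to answer.
    """
    sock.setblocking(False)
    try:
        return sock.send(payload)
    except BlockingIOError:
        return 0


def try_read_reply(sock: socket.socket) -> Optional[bytes]:
    """Return a pending reply, or None if nothing has arrived yet."""
    sock.setblocking(False)
    try:
        return sock.recv(4096)
    except BlockingIOError:
        return None


def demo() -> tuple:
    """Send a placeholder keepalive and check for a reply before and after
    the peer responds. Returns (bytes_sent, reply_before, reply_after)."""
    sender, receiver = socket.socketpair()
    try:
        sent = send_keepalive_nonblocking(sender, b"k")
        before = try_read_reply(sender)   # peer has not replied yet
        receiver.recv(1)                  # peer reads the keepalive
        receiver.send(b"r")               # ... and only now replies
        after = try_read_reply(sender)
        return sent, before, after
    finally:
        sender.close()
        receiver.close()
```

In this sketch the check made immediately after the send finds no reply. Whether the real code checks again later, and against which deadline, is exactly what I am asking about.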

Regards,
Nick.
