Thread: BUG #14321: pg_basebackup --xlog-method=stream fails

BUG #14321: pg_basebackup --xlog-method=stream fails

From

Jürgen Strobel

Date:

10 September 2016, 00:10:52

On 10 September 2016 at 00:09, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql@strobel.info> wrote:
> The filesystem backup continues successfully to its end, but it concludes
> without the necessary WAL files. I verified in pg_stat_replication that
> pg_basebackup is not trying to reconnect to the master.
>
> I understand how to repair this manually and it's not an end-of-the-world
> bug, but it would be nice if pg_basebackup would just reconnect the
> streaming WAL connection in the same way as pg_receivexlog does. Especially
> as that error happens in a long script run by cron and/or other people who
> do not have this insight.

Perhaps. The source server logs do show that pg_basebackup
is requesting missing WAL segments, right?

> I haven't had time to try 9.6's --slot option yet, but I suspect this won't
> be a full cure either unless it also changes the re-connect behavior.

If what you are seeing missing are the first WAL segments that your
backup needs, first the backup you took will be useless if you don't
have a WAL archive from where recovery could fetch those missing
segments. And in this case --slot will definitely help, but just be
sure that this does not bloat your pg_xlog partition if disk space is
a concern there.
--
Michael

First, I do have another WAL archive (usually).

But no, I only see the first WAL segments up to the point when the problem occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for backup and WAL streaming.

2. The VM's crappy IO system hiccups and stalls the whole VM for a surprisingly long time.

3. The server runs into wal_sender_timeout and closes the WAL streaming connection.

4. pg_basebackup prints the warning, and continues the filesystem copy, *but makes no effort to re-open the WAL streaming connection*. With ps I see a zombie child of the pg_basebackup process; I assume that's the one doing the WAL streaming.

5. pg_basebackup finishes up with the second half of pg_xlog missing, and the DB fails to start.
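To see where the WAL in such a backup stops, the segment file names in pg_xlog can be checked for continuity. The following is only a rough Python sketch under stated assumptions: it assumes the default 16 MB segment size (256 segments per log file, PostgreSQL 9.3 or later) and a single timeline, and the helper names `segment_number`, `segment_name` and `find_gaps` are made up for illustration.

```python
import re

# Assumption: default 16 MB WAL segments, so 256 segments per "log" id.
SEGMENTS_PER_LOG = 256
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def segment_number(name):
    """Split a 24-hex-digit WAL file name into (timeline, linear segment number)."""
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def segment_name(timeline, segno):
    """Inverse of segment_number(): build the WAL file name."""
    return "%08X%08X%08X" % (timeline, segno // SEGMENTS_PER_LOG, segno % SEGMENTS_PER_LOG)


def find_gaps(file_names):
    """Return (first, last, missing) segment numbers for the given file names.

    Files that are not plain WAL segments (e.g. *.partial, backup_label)
    are ignored. Timelines are ignored too, which is a simplification.
    """
    segs = sorted(segment_number(n)[1] for n in file_names if WAL_NAME.fullmatch(n))
    if not segs:
        return None, None, []
    present = set(segs)
    missing = [s for s in range(segs[0], segs[-1] + 1) if s not in present]
    return segs[0], segs[-1], missing
```

Feeding it `os.listdir()` of the backup's pg_xlog directory gives the last segment that was streamed and any holes before it. It cannot tell by itself whether the tail is missing; for that the last segment has to be compared with the backup's end location.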

In contrast, if the same problem occurs while running pg_receivexlog, it waits for 5 seconds and then reopens the connection. I think that pg_basebackup should show the same resilience.
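The kind of resilience meant here can be sketched as a plain retry loop with a fixed delay. This is a minimal Python illustration of the idea, not the actual pg_receivexlog code; `stream_once` and `is_done` are hypothetical callables standing in for "open a connection and stream until it drops" and "all needed WAL has arrived".

```python
import time


def stream_with_retry(stream_once, is_done, delay=5.0, max_attempts=None,
                      sleep=time.sleep):
    """Call stream_once() until is_done() is true, pausing between attempts.

    stream_once  -- hypothetical callable; may raise ConnectionError
                    when the server closes the connection
    is_done      -- hypothetical callable; True once nothing is missing
    delay        -- seconds to wait before reconnecting
    max_attempts -- give up with RuntimeError after this many attempts
                    (None means retry forever)
    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            stream_once()
        except ConnectionError:
            pass  # connection lost; decide below whether to retry
        if is_done():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("streaming incomplete after %d attempts" % attempts)
        sleep(delay)
```

The important part is the last branch: when the stream cannot be completed, the loop ends with an error instead of returning quietly.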

-Jürgen

Re: BUG #14321: pg_basebackup --xlog-method=stream fails

From

Michael Paquier

Date:

10 September 2016, 05:30:48

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:
> First, I do have another WAL archive (usually).
> But no, I only see the first WAL segments up to the point when the problem
> occurs, then nothing more.
>
> The timeline as far as I can tell is:
>
> 1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
> backup and WAL streaming.
> 2. The VM's crappy IO system hiccups and stalls the whole VM for a
> surprisingly long time.

I know that people can do fancy things here, believe me.

> 3. The server runs into wal_sender_timeout and closes the WAL streaming
> connection.
> 4. pg_basebackup prints the warning, and continues the filesystem copy, *but
> makes no effort to re-open the WAL streaming connection*. With ps I see a
> zombie child of the pg_basebackup process; I assume that's the one doing the
> WAL streaming.
> 5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
> DB fails to start.
>
> In contrast, if the same problem occurs while running pg_receivexlog, it waits
> for 5 seconds and then reopens the connection. I think that pg_basebackup should
> show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But it's hard to think of improvements in
this area as anything other than a new feature, rather than a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael

Re: BUG #14321: pg_basebackup --xlog-method=stream fails

From

Jürgen Strobel

Date:

10 September 2016, 19:28:14

On 10 September 2016 at 07:30, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:
> First, I do have another WAL archive (usually).
> But no, I only see the first WAL segments up to the point when the problem
> occurs, then nothing more.
>
> The timeline as far as I can tell is:
>
> 1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
> backup and WAL streaming.
> 2. The VM's crappy IO system hiccups and stalls the whole VM for a
> surprisingly long time.

I know that people can do fancy things here, believe me.

> 3. The server runs into wal_sender_timeout and closes the WAL streaming
> connection.
> 4. pg_basebackup prints the warning, and continues the filesystem copy, *but
> makes no effort to re-open the WAL streaming connection*. With ps I see a
> zombie child of the pg_basebackup process; I assume that's the one doing the
> WAL streaming.
> 5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
> DB fails to start.
>
> In contrast, if the same problem occurs while running pg_receivexlog, it waits
> for 5 seconds and then reopens the connection. I think that pg_basebackup should
> show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But it's hard to think of improvements in
this area as anything other than a new feature, rather than a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael

I do agree the VM is bad, but I have to work with what I've got for now.

I do not agree it's a pure feature request though. When this problem happens, pg_basebackup should either abort fully with a suitable error, or retry streaming WAL until it has everything it needs for a functional backup (or until streaming fails due to WAL cleanup on the server). The current behavior of finishing the filesystem backup with a mere warning is inconsistent and not user-friendly. If I use --xlog-method=stream, I expect to end up with all WAL in the end or to get a clear error. It took me quite some time to figure out what was happening. And of course this never happened in QA/staging systems, only in production.
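Until pg_basebackup behaves that way itself, a script run from cron can enforce the "all WAL or a clear error" rule from the outside. The sketch below is a workaround under assumptions, not a fix: it treats a non-zero exit status or any stderr output as a failed backup, and the names `run_strict` and `BackupFailed` are made up for this example. Options such as --verbose or --progress are expected to write to stderr as well, so it only suits quiet invocations.

```python
import subprocess


class BackupFailed(Exception):
    """Raised when the command exits non-zero or prints anything to stderr."""


def run_strict(argv):
    """Run a command and treat warnings on stderr as fatal.

    Returns the command's stdout as text if it exits with status 0 and
    writes nothing to stderr; raises BackupFailed otherwise.
    """
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    if proc.returncode != 0 or proc.stderr.strip():
        raise BackupFailed("exit status %d, stderr: %s"
                           % (proc.returncode, proc.stderr.strip()))
    return proc.stdout


# Hypothetical usage; the target directory is a placeholder:
# run_strict(["pg_basebackup", "-D", "/path/to/backup", "--xlog-method=stream"])
```

A cron job wrapped like this stops with an exception as soon as the streaming warning appears, instead of handing over a backup that will not start.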

I understand that this may not affect many people, and that it's not going to get immediate attention; classify it as you wish.

The replication slot feature might make it easier for me to recover from the problem using pg_receivexlog afterwards.

-Jürgen