
BUG #14321: pg_basebackup --xlog-method=stream fails - Mailing list pgsql-bugs

From	Jürgen Strobel
Subject	BUG #14321: pg_basebackup --xlog-method=stream fails
Date	September 10, 2016 03:10:52
Msg-id	CALWJi_eA9X5K5z_OS58F_3j+WmQ0-UKKy+Z0e8qxXCcNkPDhjQ@mail.gmail.com
Responses	Re: BUG #14321: pg_basebackup --xlog-method=stream fails (Michael Paquier <michael.paquier@gmail.com>)
List	pgsql-bugs


On 10 September 2016 at 00:09, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql@strobel.info> wrote:
> The filesystem backup continues successfully to its end, but it concludes
> without the necessary WAL files. I verified in pg_stat_replication that
> pg_basebackup is not trying to reconnect to the master.
>
> I understand how to repair this manually and it's not an end-of-the-world
> bug, but it would be nice if pg_basebackup would just reconnect the
> streaming WAL connection in the same way as pg_receivexlog does. Especially
> as that error happens in a long script run by cron and/or other people who
> do not have this insight.

Perhaps. The source server logs do prove that pg_basebackup
is requesting missing WAL segments, right?

> I haven't had time to try 9.6's --slot option yet, but I suspect this won't
> be a full cure either unless it also changes the re-connect behavior.

If what you are seeing missing are the first WAL segments that your
backup needs, first the backup you took will be useless if you don't
have a WAL archive from where recovery could fetch those missing
segments. And in this case --slot will definitely help, but just be
sure that this does not bloat your pg_xlog partition if disk space is
a concern there.
--
Michael

First, I do have another WAL archive (usually).

But no, I only see the first WAL segments up to the point when the problem occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for backup and WAL streaming.

2. The VM's crappy IO system hiccups and stalls the whole VM for a surprisingly long time.

3. The server runs into wal_sender_timeout and closes the WAL streaming connection.

4. pg_basebackup prints the warning and continues the filesystem copy, *but makes no effort to re-open the WAL streaming connection*. With ps I see a zombie child of the pg_basebackup process; I assume that's the one doing the WAL streaming.

5. pg_basebackup finishes with the second half of pg_xlog missing, and the DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it waits for 5 seconds and then reopens the connection. I think that pg_basebackup should show the same resilience.
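
To illustrate the behavior I have in mind, here is a rough Python sketch of a reconnect loop with a fixed delay. It is not taken from the pg_receivexlog or pg_basebackup sources; `stream_once` and `stream_with_reconnect` are hypothetical names, and the only detail from my observation is the 5-second wait before reopening the connection.

```python
import time

# Delay observed with pg_receivexlog before it reopens the connection.
RECONNECT_DELAY_SECONDS = 5


def stream_with_reconnect(stream_once, sleep=time.sleep,
                          delay=RECONNECT_DELAY_SECONDS, max_attempts=None):
    """Run stream_once() until it completes, reconnecting after failures.

    stream_once is a hypothetical callable standing in for "open the WAL
    streaming connection and stream until the end position". It returns
    normally on success and raises ConnectionError when the server closes
    the connection (for example after wal_sender_timeout).

    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            stream_once()
            return attempts
        except ConnectionError:
            # Give up only if the caller set a limit; otherwise keep trying.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(delay)
```

With something like this around the WAL streaming child, a dropped connection in step 3 would lead to a retry instead of a backup that silently ends up incomplete. Whether the real fix should retry forever or give up after some limit is an open question; the `max_attempts` parameter is only there to show both options.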

-Jürgen

