Thread: pg_basebackup from cascading standby after timeline switch

From
Heikki Linnakangas
Date:
pg_basebackup -x is supposed to include all the required WAL files in 
the backup, so that you have everything needed to restore a consistent 
database. However, it's not including the timeline history files. 
Usually that's not a problem because normally you don't need to follow 
any old timelines when restoring, but there is one scenario where it 
causes a failure to restore:

Create a master, a standby, and a cascading standby. Kill the master 
server, promote the standby to become new master, bumping the timeline. 
After the cascading standby has followed the timeline switch (either 
through the archive, which also works on 9.2, or directly via streaming 
replication which only works on 9.3devel), take a base backup from the 
cascading standby using pg_basebackup -x. When you try to start the 
server from the new backup (without setting up a restore_command or 
streaming replication), you get an error about "unexpected timeline ID 1 
in log segment ..."

C 2012-12-17 15:55:25.732 EET 534 LOG:  database system was interrupted 
while in recovery at log time 2012-12-17 15:55:15 EET
C 2012-12-17 15:55:25.732 EET 534 HINT:  If this has occurred more than 
once some data might be corrupted and you might need to choose an 
earlier recovery target.
C 2012-12-17 15:55:25.732 EET 534 LOG:  creating missing WAL directory 
"pg_xlog/archive_status"
C 2012-12-17 15:55:25.732 EET 534 LOG:  unexpected timeline ID 1 in log 
segment 000000020000000000000003, offset 0
C 2012-12-17 15:55:25.732 EET 534 LOG:  invalid checkpoint record
C 2012-12-17 15:55:25.733 EET 534 FATAL:  could not locate required 
checkpoint record
C 2012-12-17 15:55:25.733 EET 534 HINT:  If you are not restoring from a 
backup, try removing the file 
"/home/heikki/pgsql.master/data-standbyC/backup_label".
C 2012-12-17 15:55:25.733 EET 533 LOG:  startup process (PID 534) exited 
with exit code 1
C 2012-12-17 15:55:25.733 EET 533 LOG:  aborting startup due to startup 
process failure

The timeline was bumped within the log segment 000000020000000000000003, 
so the beginning of the file uses timeline 1, up to the checkpoint 
record that changes the timeline. Normally, recovery accepts that 
because timeline 1 is an ancestor of timeline 2, but because the backup 
does not include the timeline history file, it does not know that.
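
To make the failure concrete, here is a minimal Python sketch of the two facts involved: the first 8 hex digits of a WAL segment file name are the timeline ID, and a page stamped with another timeline is only acceptable if that timeline is known to be an ancestor. The function names are hypothetical, and the check is a deliberately simplified model of what recovery does, not the server's actual code.

```python
import re

# A WAL segment file name is 24 hex digits: timeline, log id, segment.
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def parse_wal_segment_name(name):
    """Split a WAL segment file name into (timeline, log, seg) integers."""
    if not WAL_NAME.fullmatch(name):
        raise ValueError("not a WAL segment name: %r" % name)
    return tuple(int(name[i:i + 8], 16) for i in (0, 8, 16))


def page_timeline_acceptable(page_tli, known_tlis):
    """Simplified model: a page is accepted only if its timeline is known."""
    return page_tli in known_tlis


tli, log, seg = parse_wal_segment_name("000000020000000000000003")  # (2, 0, 3)

# Without 00000002.history, only the segment's own timeline is known,
# so the pages at the start of the file (timeline 1) are rejected.
rejected = not page_timeline_acceptable(1, {tli})

# With the history file present, timeline 1 is known to be an ancestor.
accepted = page_timeline_acceptable(1, {1, tli})
```

In this model the backup fails in the first case only because the set of known timelines is too small, which is exactly the information the history file would supply.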

This does not happen if you run pg_basebackup against the master server, 
because in the master it forces an xlog switch, which ensures that the 
new xlog file only contains pages with the latest timeline ID. There are 
even comments in pg_start_backup explaining that that's the reason for 
the xlog switch:

>     /*
>      * Force an XLOG file switch before the checkpoint, to ensure that the
>      * WAL segment the checkpoint is written to doesn't contain pages with
>      * old timeline IDs.  That would otherwise happen if you called
>      * pg_start_backup() right after restoring from a PITR archive: the
>      * first WAL segment containing the startup checkpoint has pages in
>      * the beginning with the old timeline ID.    That can cause trouble at
>      * recovery: we won't have a history file covering the old timeline if
>      * pg_xlog directory was not included in the base backup and the WAL
>      * archive was cleared too before starting the backup.
>      *
>      * This also ensures that we have emitted a WAL page header that has
>      * XLP_BKP_REMOVABLE off before we emit the checkpoint record.
>      * Therefore, if a WAL archiver (such as pglesslog) is trying to
>      * compress out removable backup blocks, it won't remove any that
>      * occur after this point.
>      *
>      * During recovery, we skip forcing XLOG file switch, which means that
>      * the backup taken during recovery is not available for the special
>      * recovery case described above.
>      */
>     if (!backup_started_in_recovery)
>         RequestXLogSwitch();

I'm not happy with the fact that we just ignore the problem in a backup 
taken from a standby, silently giving the user a backup that won't start 
up. Why not include the timeline history file in the backup? That seems 
like a good idea regardless of this issue. I also wonder if 
pg_basebackup should include *all* timeline history files in the backup, 
not just the latest one strictly required to restore. They're fairly 
small, so our approach has generally been to try to include them all in 
the archive, and not try to prune them, so the same might make sense here.

- Heikki



Re: pg_basebackup from cascading standby after timeline switch

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> I'm not happy with the fact that we just ignore the problem in a backup 
> taken from a standby, silently giving the user a backup that won't start 
> up. Why not include the timeline history file in the backup?

+1.  I was not aware that we weren't doing that --- it seems pretty
foolish, especially since as you say they're tiny.
        regards, tom lane



Re: pg_basebackup from cascading standby after timeline switch

From
Magnus Hagander
Date:
On Mon, Dec 17, 2012 at 5:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> I'm not happy with the fact that we just ignore the problem in a backup
>> taken from a standby, silently giving the user a backup that won't start
>> up. Why not include the timeline history file in the backup?
>
> +1.  I was not aware that we weren't doing that --- it seems pretty
> foolish, especially since as you say they're tiny.

Yeah, +1. That should probably have been a part of the whole
"basebackup from slave" patch, so it can probably be considered a
back-patchable bugfix in itself, no?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/



Re: pg_basebackup from cascading standby after timeline switch

From
Simon Riggs
Date:
On 17 December 2012 14:16, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> I'm not happy with the fact that we just ignore the problem in a backup
> taken from a standby, silently giving the user a backup that won't start up.

That's spooky. I just found a different issue with promotion during
backup on your other thread.

> Why not include the timeline history file in the backup? That seems like a
> good idea regardless of this issue.

Yeh

> I also wonder if pg_basebackup should
> include *all* timeline history files in the backup, not just the latest one
> strictly required to restore.

Why? the timeline history file includes the previous timelines already.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_basebackup from cascading standby after timeline switch

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 17 December 2012 14:16, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> I also wonder if pg_basebackup should
>> include *all* timeline history files in the backup, not just the latest one
>> strictly required to restore.

> Why? the timeline history file includes the previous timelines already.

The original intention was that the WAL archive might contain multiple
timeline files corresponding to various experimental recovery attempts;
furthermore, such files might be hand-annotated (that's why there's a
comment provision).  So they would be at least as valuable from an
archival standpoint as the WAL files proper.  I think we ought to just
copy all of them, period.  Anything less is penny-wise and
pound-foolish.
        regards, tom lane



Re: pg_basebackup from cascading standby after timeline switch

From
Simon Riggs
Date:
On 18 December 2012 00:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 17 December 2012 14:16, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> I also wonder if pg_basebackup should
>>> include *all* timeline history files in the backup, not just the latest one
>>> strictly required to restore.
>
>> Why? the timeline history file includes the previous timelines already.
>
> The original intention was that the WAL archive might contain multiple
> timeline files corresponding to various experimental recovery attempts;
> furthermore, such files might be hand-annotated (that's why there's a
> comment provision).  So they would be at least as valuable from an
> archival standpoint as the WAL files proper.  I think we ought to just
> copy all of them, period.  Anything less is penny-wise and
> pound-foolish.

What I'm saying is that the new history file is created from the old
one, so the latest file includes all the history as a direct copy. The
only thing new is one line of information.

Copying all files grows at O(N^2) with redundancy and will eventually
become a space problem and a performance issue for smaller systems.
There should be some limit to keep this sane, for example, the last 32
history files, or the last 1000 lines of history. Some sane limit.
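
The quadratic growth can be checked with a little arithmetic. The sketch below assumes a single linear chain of timelines in which each history file holds exactly one line per ancestor and no hand-written comments; `total_history_lines` and `closed_form` are names made up for this illustration.

```python
def total_history_lines(latest_tli):
    """Total ancestor lines across the history files of timelines 2..latest_tli."""
    # Timeline 1 has no history file; timeline k's file lists its k-1 ancestors.
    return sum(tli - 1 for tli in range(2, latest_tli + 1))


def closed_form(latest_tli):
    """Same quantity as N(N-1)/2, which is where the O(N^2) comes from."""
    return latest_tli * (latest_tli - 1) // 2
```

Under those assumptions, keeping every file up to timeline N stores N(N-1)/2 lines in total, while the latest file alone stores N-1.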

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_basebackup from cascading standby after timeline switch

From
Fujii Masao
Date:
On Tue, Dec 18, 2012 at 8:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 18 December 2012 00:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> On 17 December 2012 14:16, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>>> I also wonder if pg_basebackup should
>>>> include *all* timeline history files in the backup, not just the latest one
>>>> strictly required to restore.
>>
>>> Why? the timeline history file includes the previous timelines already.
>>
>> The original intention was that the WAL archive might contain multiple
>> timeline files corresponding to various experimental recovery attempts;
>> furthermore, such files might be hand-annotated (that's why there's a
>> comment provision).  So they would be at least as valuable from an
>> archival standpoint as the WAL files proper.  I think we ought to just
>> copy all of them, period.  Anything less is penny-wise and
>> pound-foolish.
>
> What I'm saying is that the new history file is created from the old
> one, so the latest file includes all the history as a direct copy. The
> only thing new is one line of information.

The timeline history file includes only ancestor timelines history. So
the latest one might not include all the history.
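
A toy model of that point, in Python. The `parent_of` mapping and the `ancestors` helper are hypothetical and only stand in for what the history files record: each timeline knows the chain it branched from, nothing more.

```python
def ancestors(tli, parent_of):
    """Return the chain of ancestor timelines of tli, nearest first."""
    chain = []
    while tli in parent_of:
        tli = parent_of[tli]
        chain.append(tli)
    return chain


# Timelines 2 and 3 are two separate recovery attempts that both branched
# from timeline 1; timeline 4 later branched from timeline 3.
parent_of = {2: 1, 3: 1, 4: 3}

history_of_latest = ancestors(4, parent_of)  # [3, 1]
```

Timeline 2 never appears in the history of timeline 4, so keeping only the latest history file would lose the record of that branch.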

Regards,

-- 
Fujii Masao



Re: pg_basebackup from cascading standby after timeline switch

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> On Tue, Dec 18, 2012 at 8:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I'm saying is that the new history file is created from the old
>> one, so the latest file includes all the history as a direct copy. The
>> only thing new is one line of information.

> The timeline history file includes only ancestor timelines history. So
> the latest one might not include all the history.

Indeed.  And even if there are a thousand of them, so what?  It'd still
be less space than a single WAL segment file.

Better to keep the data than wish we had it later.
        regards, tom lane



Re: pg_basebackup from cascading standby after timeline switch

From
Heikki Linnakangas
Date:
On 17.12.2012 18:58, Magnus Hagander wrote:
> On Mon, Dec 17, 2012 at 5:19 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>>> I'm not happy with the fact that we just ignore the problem in a backup
>>> taken from a standby, silently giving the user a backup that won't start
>>> up. Why not include the timeline history file in the backup?
>>
>> +1.  I was not aware that we weren't doing that --- it seems pretty
>> foolish, especially since as you say they're tiny.
>
> Yeah, +1. That should probably have been a part of the whole
> "basebackup from slave" patch, so it can probably be considered a
> back-patchable bugfix in itself, no?

Yes, this should be backpatched to 9.2. I came up with the attached.

However, thinking about this some more, there's another bug in the way
WAL files are included in the backup, when a timeline switch happens.
basebackup.c includes all the WAL files on ThisTimeLineID, but when the
backup is taken from a standby, the standby might've followed a timeline
switch. So it's possible that some of the WAL files should come from
timeline 1, while others should come from timeline 2. This leads to an
error like "requested WAL segment 00000001000000000000000C has already
been removed" in pg_basebackup.

Attached is a script to reproduce that bug, if someone wants to play
with it. It's a bit sensitive to timing, and needs tweaking the paths at
the top.

One solution to that would be to pay more attention to the timelines to
include WAL from. basebackup.c could read the timeline history file, to
see exactly where the timeline switches happened, and then construct the
filename of each WAL segment using the correct timeline id. Another
approach would be to do readdir() on pg_xlog, and include all WAL files,
regardless of timeline IDs, that fall in the right XLogRecPtr range. The
latter seems easier to backpatch.

- Heikki

Attachment

Re: pg_basebackup from cascading standby after timeline switch

From
Amit Kapila
Date:
On Friday, December 21, 2012 6:24 PM Heikki Linnakangas wrote:
> On 17.12.2012 18:58, Magnus Hagander wrote:
>> On Mon, Dec 17, 2012 at 5:19 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>>>> I'm not happy with the fact that we just ignore the problem in a backup
>>>> taken from a standby, silently giving the user a backup that won't start
>>>> up. Why not include the timeline history file in the backup?
>>>
>>> +1.  I was not aware that we weren't doing that --- it seems pretty
>>> foolish, especially since as you say they're tiny.
>>
>> Yeah, +1. That should probably have been a part of the whole
>> "basebackup from slave" patch, so it can probably be considered a
>> back-patchable bugfix in itself, no?
>
> Yes, this should be backpatched to 9.2. I came up with the attached.



> One solution to that would be to pay more attention to the timelines to
> include WAL from. basebackup.c could read the timeline history file, to
> see exactly where the timeline switches happened, and then construct the
> filename of each WAL segment using the correct timeline id. Another
> approach would be to do readdir() on pg_xlog, and include all WAL files,
> regardless of timeline IDs, that fall in the right XLogRecPtr range. The
> latter seems easier to backpatch.

I also think the approach you implemented is better.
One small point: shouldn't it check (walsender_shutdown_requested || walsender_ready_to_stop) during ReadDir of pg_xlog,
similar to what is done for ReadDir() in SendDir?

With Regards,
Amit Kapila.



Re: pg_basebackup from cascading standby after timeline switch

From
Fujii Masao
Date:
On Fri, Dec 21, 2012 at 9:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Yes, this should be backpatched to 9.2. I came up with the attached.

In this patch, if '-X stream' is specified in pg_basebackup, the timeline
history files are not backed up. Should we change the pg_basebackup background
process and walsender so that they also stream timeline history files,
for example by using the 'TIMELINE_HISTORY' replication command?
Or should basebackup.c send all timeline history files at the end of backup
even if '-X stream' is specified?

> However, thinking about this some more, there's another bug in the way WAL
> files are included in the backup, when a timeline switch happens.
> basebackup.c includes all the WAL files on ThisTimeLineID, but when the
> backup is taken from a standby, the standby might've followed a timeline
> switch. So it's possible that some of the WAL files should come from
> timeline 1, while others should come from timeline 2. This leads to an error
> like "requested WAL segment 00000001000000000000000C has already been
> removed" in pg_basebackup.
>
> Attached is a script to reproduce that bug, if someone wants to play with
> it. It's a bit sensitive to timing, and needs tweaking the paths at the top.
>
> One solution to that would be to pay more attention to the timelines to
> include WAL from. basebackup.c could read the timeline history file, to see
> exactly where the timeline switches happened, and then construct the
> filename of each WAL segment using the correct timeline id. Another approach
> would be to do readdir() on pg_xlog, and include all WAL files, regardless
> of timeline IDs, that fall in the right XLogRecPtr range. The latter seems
> easier to backpatch.

The latter sounds good to me. But how are all the WAL files with different
timelines shipped in pg_basebackup -X stream mode?

Regards,

-- 
Fujii Masao



Re: pg_basebackup from cascading standby after timeline switch

From
Heikki Linnakangas
Date:
On 23.12.2012 15:33, Fujii Masao wrote:
> On Fri, Dec 21, 2012 at 9:54 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> Yes, this should be backpatched to 9.2. I came up with the attached.
>
> In this patch, if '-X stream' is specified in pg_basebackup, the timeline
> history files are not backed up.

Good point.

> We should change pg_backup background
> process and walsender so that they stream also timeline history files,
> for example, by using 'TIMELINE_HISTORY' replication command?
> Or basebackup.c should send all timeline history files at the end of backup
> even if '-X stream' is specified?

Perhaps. We should enhance pg_receivexlog to follow timeline switches, 
anyway. I was thinking of leaving that as a todo item, but pg_basebackup 
-X stream shares the code, so we should implement that now to get that 
support into both.

In the problem you reported on the other thread 
(http://archives.postgresql.org/message-id/50DB5EA9.7010406@vmware.com), 
you also need the timeline history files, but that one didn't use "-X" 
at all. Even if we teach pg_basebackup to fetch the timeline history 
files in "-X stream" mode, that still leaves the problem on that other 
thread.

The simplest solution would be to always include all timeline history 
files in the backup, even if -X is not used. Currently, however, pg_xlog 
is backed up as an empty directory in that case, but that would no 
longer be the case if we start including timeline history files there. I 
wonder if that would confuse any existing backup scripts people are using.

- Heikki



Re: pg_basebackup from cascading standby after timeline switch

From
Heikki Linnakangas
Date:
On 27.12.2012 12:06, Heikki Linnakangas wrote:
> On 23.12.2012 15:33, Fujii Masao wrote:
>> On Fri, Dec 21, 2012 at 9:54 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> Yes, this should be backpatched to 9.2. I came up with the attached.
>>
>> In this patch, if '-X stream' is specified in pg_basebackup, the timeline
>> history files are not backed up.
>
> Good point.
>
>> We should change pg_backup background
>> process and walsender so that they stream also timeline history files,
>> for example, by using 'TIMELINE_HISTORY' replication command?
>> Or basebackup.c should send all timeline history files at the end of
>> backup
>> even if '-X stream' is specified?
>
> Perhaps. We should enhance pg_receivexlog to follow timeline switches,
> anyway. I was thinking of leaving that as a todo item, but pg_basebackup
> -X stream shares the code, so we should implement that now to get that
> support into both.
>
> In the problem you reported on the other thread
> (http://archives.postgresql.org/message-id/50DB5EA9.7010406@vmware.com),
> you also need the timeline history files, but that one didn't use "-X"
> at all. Even if we teach pg_basebackup to fetch the timeline history
> files in "-X stream" mode, that still leaves the problem on that other
> thread.
>
> The simplest solution would be to always include all timeline history
> files in the backup, even if -X is not used. Currently, however, pg_xlog
> is backed up as an empty directory in that case, but that would no
> longer be the case if we start including timeline history files there. I
> wonder if that would confuse any existing backup scripts people are using.

This thread has spread out a bit, so here's a summary of the remaining
issues and what I'm going to do about them:

9.2
---

If you take a backup with "pg_basebackup -X fetch", and the timeline
switches while the backup is taken, you currently get an error like
"requested WAL segment 00000001000000000000000C has already been
removed". To fix, let's change the server-side support of "-X fetch" to
include all WAL files between the backup start and end pointers,
regardless of timelines. I'm thinking of doing this by scanning pg_xlog
with readdir(), and sending over any files in that range. Another option
would be to read and parse the timeline history file to figure out the
exact filenames expected, but the readdir() approach seems simpler.

You also need the timeline history files. With "-X fetch", I think we
should always include them in the pg_xlog directory of the backup, along
with the WAL files themselves.
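
As a rough illustration of the readdir() idea, here is a Python sketch that picks files out of a directory listing. It is not the attached patch: `select_backup_files` and its arguments are hypothetical, and it assumes that the last 16 hex digits of a WAL file name (everything after the timeline ID) sort in WAL order when compared as fixed-width uppercase strings.

```python
import re

WAL_NAME = re.compile(r"[0-9A-F]{24}")
HISTORY_NAME = re.compile(r"[0-9A-F]{8}\.history")


def select_backup_files(dir_entries, first_segment, last_segment):
    """Pick WAL segments in [first, last] on any timeline, plus all history files."""
    # Drop the 8-digit timeline prefix so the comparison ignores timelines.
    lo, hi = first_segment[8:], last_segment[8:]
    wal = sorted(
        name for name in dir_entries
        if WAL_NAME.fullmatch(name) and lo <= name[8:] <= hi
    )
    history = sorted(name for name in dir_entries if HISTORY_NAME.fullmatch(name))
    return wal, history


entries = [
    "00000001000000000000000B",
    "00000001000000000000000C",
    "00000002000000000000000C",
    "00000002000000000000000D",
    "00000002000000000000000E",
    "00000002.history",
    "archive_status",
]
wal, history = select_backup_files(
    entries, "00000001000000000000000C", "00000002000000000000000D"
)
```

Here segment 0C is picked up on both timelines. The trade-off of this approach is that a file from any timeline in the range is included, whether or not it lies on the path to the backup's end point.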

"-X stream" has a similar problem. If timeline changes during backup,
the replication will stop at the timeline switch, and the backup fails.
There isn't much we can do about that, as you can't follow a timeline
switch via streaming replication in 9.2. At best, we could try to detect
the situation and give a better error message.

With plain pg_basebackup, without -X option, you usually need a WAL
archive to restore. The only exception is when you initialize a
streaming standby with plain "pg_basebackup", intending to connect it to
the master right after taking the backup, so that it can stream all the
required WAL at that point. We have a problem with that scenario,
because even if there was no timeline switch while the backup was taken
(if there was, you're screwed the same as with "-X stream"), because of
the issue mentioned in the first post in this thread: the beginning of
the first WAL file contains the old timeline ID. Even though that's not
replayed, because the checkpoint is in the later part of the segment,
recovery still complains if there is a timeline ID in the beginning of
the file that we don't recognize as our ancestor. This could be fixed if
we somehow got the timeline history files in the backup, but I'm worried
that might break people's backup scripts. At the moment, the pg_xlog
directory in the backup is empty when -X is not used; we even spell that
out explicitly in the manual. Including timeline history files would
change that. That might be an issue if you symlink pg_xlog to a
different drive, for example. To make things worse, there is no timeline
history file for timeline 1, so you would not notice when you test your
backup scripts in simple cases.

In summary, in 9.2 I think we should fix "-X fetch" to tolerate a
timeline switch, and include all the timeline history files. The
behavior of other modes would not be changed, and they will fail if a
timeline changes during or just before backup.

Master
------

In master, we can try harder for the "-X stream" case, because you can
replicate a timeline switch, and fetch timeline history files via a
replication connection. Let's teach pg_receivexlog, and "pg_basebackup
-X stream", to use those facilities, so that even if the timeline
changes while the backup is taken, all the history files and WAL files
are correctly included in the backup. I'll start working on a patch to
do that.

That leaves one case not covered: If you take a backup with plain
"pg_basebackup" from a standby, without -X, and the first WAL segment
contains a timeline switch (ie. you take the backup right after a
failover), and you try to recover from it without a WAL archive, it
doesn't work. This is the original issue that started this thread,
except that I used "-x" in my original test case. The problem here is
that even though streaming replication will fetch the timeline history
file when it connects, at the very beginning of recovery, we expect that
we already have the timeline history file corresponding to the initial
timeline available, either in pg_xlog or the WAL archive. By the time
streaming replication has connected and fetched the history file, we've
already initialized expectedTLEs to contain just the latest TLI. To fix
that, we should delay calling readTimeLineHistoryFile() until streaming
replication has connected and fetched the file.

Barring objections, I'll commit the attached two patches.
include-wal-files-from-all-timelines-in-base-backup-1.patch is for 9.2
and master, and it fixes the "pg_basebackup -X fetch" case.
delay-reading-timeline-history-file.patch is for master, and it changes
recovery so that if a timeline history file for the initial target timeline
is fetched over streaming replication, expectedTLEs is initialized with
the streamed file. That fixes the plain "pg_basebackup" without -X case
on master.

What remains is to teach "pg_receivexlog" and "pg_basebackup -X stream"
to cross timeline changes. I'll start working on a patch for that.

- Heikki

Attachment

Re: pg_basebackup from cascading standby after timeline switch

From
Greg Stark
Date:
On Wed, Jan 2, 2013 at 1:55 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> If you take a backup with "pg_basebackup -X fetch", and the timeline
> switches while the backup is taken, you currently get an error like
> "requested WAL segment 00000001000000000000000C has already been removed".
> To fix, let's change the server-side support of "-X fetch" to include all
> WAL files between the backup start and end pointers, regardless of
> timelines. I'm thinking of doing this by scanning pg_xlog with readdir(),
> and sending over any files in that range. Another option would be to read
> and parse the timeline history file to figure out the exact filenames
> expected, but the readdir() approach seems simpler.

I'm not clear what you mean by "any files in that range". There could
be other timelines in the archive that aren't relevant to the restore
at all. For example if the database you're requesting a backup from
has previously been restored from an old backup the archive could have
archives from the original timeline as well as the active timeline.

I'm trying to wrap my head around what other combinations are
possible. Is it possible there have been other false starts or
multiple timeline switches during the time the backup was being taken?
At first blush I think not, I think it's only possible for there to be
one timeline switch and it would be when a standby database was being
backed up and is activated while the backup was being taken.


-- 
greg