Re: v12 and TimeLine switches and backups/restores - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: v12 and TimeLine switches and backups/restores
Msg-id 20200701200218.GV3125@tamriel.snowman.net
In response to Re: v12 and TimeLine switches and backups/restores  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: v12 and TimeLine switches and backups/restores
List pgsql-hackers
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jul 1, 2020 at 12:12 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Among the changes made to PG's recovery in v12 was to set
> > recovery_target_timeline to be 'latest' by default.  That's handy when
> > you're flipping back and forth between replicas and want to have
> > everyone follow that game, but it's made doing some basic things like
> > restoring from a backup problematic.
> >
> > Specifically, if you take a backup off a primary and, while that backup
> > is going on, some replica is promoted and drops a .history file into the
> > WAL repo, that backup is no longer able to be restored with the new
> > recovery_target_timeline default.  What happens is that the restore
> > process will happily follow the timeline change- even though it happened
> > before we reached consistency, and then it'll never find the needed
> > end-of-backup WAL point that would allow us to reach consistency.
>
> Ouch. Should we revert that change rather than doing this? Seems like
> this might create a lot of problems for people, and they might be
> problems that happen rarely enough that it looks like it's working
> until it doesn't. What's the fix, if you hit the error? Add
> recovery_target_timeline=<the correct timeline> to
> postgresql.auto.conf?

I don't really think reverting the change that made following the latest
timeline the default would end up being terribly helpful- an awful lot of
systems are going to be running with that setting anyway for HA and such,
so it seems like something we just need to deal with.  As such, it seems
like this is also something that would need to be back-patched, though
I've not looked at how much effort that'll be (yet), since it probably
makes sense to first get agreement on whether this approach is the best
one.

There are two solutions, really- the first would be, as you suggest, to
configure PG to stay on the timeline that the backup was taken on, but I
suspect that's often *not* what the user actually wants- what they really
want is to restore an earlier backup (one taken before the TL switch) and
then have PG follow the timeline switch when it comes across it.  We're
looking at having pgbackrest automatically pick the correct backup to
make that happen when someone requests timeline-latest (it's pretty handy
having a repo full of backups that lets us pick the right one based on
the user's request).
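
To make the "pick the correct backup" idea concrete, here's a rough
sketch in Python.  It is only an illustration under assumptions, not how
pgbackrest implements it: the Backup class, pick_backup(), and the
switch_points mapping are all hypothetical names, and LSNs are modelled
as plain integers.  The rule it encodes is the one described above: a
backup is usable for a target timeline only if it was taken on that
timeline, or finished on an ancestor timeline before the switch away
from it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Backup:
    """Hypothetical backup record; not a pgbackrest structure."""
    label: str
    timeline: int
    stop_lsn: int  # end-of-backup WAL position, simplified to an integer


def pick_backup(backups: Iterable[Backup],
                target_timeline: int,
                switch_points: Dict[int, int]) -> Optional[Backup]:
    """Return the newest backup that can reach consistency on the way
    to target_timeline, or None if there is no such backup.

    switch_points maps each ancestor timeline of the target to the LSN
    at which that timeline was left (assumed to come from the .history
    file of the target timeline).
    """
    usable = []
    for b in backups:
        if b.timeline == target_timeline:
            # Taken on the target timeline itself: no switch to cross.
            usable.append(b)
        elif b.timeline in switch_points and b.stop_lsn <= switch_points[b.timeline]:
            # Taken on an ancestor and finished before the switch point.
            usable.append(b)
    # LSNs only grow along one timeline history, so the largest stop_lsn
    # among the usable backups is the most recent one.
    return max(usable, key=lambda b: b.stop_lsn, default=None)


backups = [
    Backup("before-switch", timeline=1, stop_lsn=100),
    Backup("spans-switch", timeline=1, stop_lsn=300),
]
# Timeline 2 branched off timeline 1 at LSN 250, while the second backup
# was still running, so only the first one is usable for timeline 2.
chosen = pick_backup(backups, target_timeline=2, switch_points={1: 250})
```

With the sample data, the backup that was still running at the switch is
skipped for timeline 2, but it remains the best choice if the user asks
to stay on timeline 1.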

There's another option here, though I rejected it, which is that we
could possibly force the restore to ignore a TL switch that happens
before reaching consistency.  If we do that then, sure, we'll finish the
restore, but we won't be on the TL that the user asked for, and we
wouldn't be able to follow a primary that's on that TL, so ultimately the
restore wouldn't actually be what the user wanted.  There's really no way
to do what the user wanted except to find an earlier backup to restore,
so that's why I'm proposing that if we hit this situation we just PANIC.
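
As a toy model of the proposed behaviour (not the actual patch, and not
PostgreSQL code), the check amounts to something like the Python sketch
below.  RecoveryPanic and replay() are hypothetical names, and the WAL
stream is reduced to a sequence of two kinds of events.

```python
class RecoveryPanic(Exception):
    """Hypothetical stand-in for a PANIC raised during recovery."""


def replay(events):
    """Walk a simplified recovery event stream in order.

    events is an iterable of strings, either "backup_end" (the
    end-of-backup point, where consistency is reached) or
    "timeline_switch".  Returns True if consistency was reached.
    """
    consistent = False
    for event in events:
        if event == "backup_end":
            consistent = True
        elif event == "timeline_switch" and not consistent:
            # Following the switch here would mean never seeing the
            # end-of-backup point, so fail loudly instead.
            raise RecoveryPanic(
                "timeline switch found before reaching consistency; "
                "restore an earlier backup"
            )
    return consistent
```

A switch seen after the end-of-backup point is followed as usual; one
seen before it stops the restore rather than leaving it unable to ever
reach consistency.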

> Typo: similairly.

Fixed locally.

Thanks!

Stephen
