From: Robert Haas
Subject: Re: trying again to get incremental backup
Msg-id: CA+TgmobFDWjEBCDYFJQjxiB=QEeMhGVagk7naeAVjLkvs2EhFA@mail.gmail.com
In response to: Re: trying again to get incremental backup (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List: pgsql-hackers
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> If that is still an area open for discussion: wouldn't it be better to
> just specify LSN as it would allow resyncing standby across major lag
> where the WAL to replay would be enormous? Given that we had
> primary->standby where standby would be stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.

TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.

I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This check
doesn't accomplish much on its own: if it ever changes the outcome,
then presumably the underlying algorithm is buggy and needs to be
fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.
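
Just to sketch the second idea for that first check: turned into a
hard sanity check, the per-file logic could hypothetically look
something like this -- the helper name is invented for illustration,
none of this is from the actual patch set:

    /*
     * Hypothetical sketch: rather than quietly falling back to a full
     * copy when a file is missing from the backup manifest, treat that
     * as evidence of a bug in the block-tracking machinery and fail
     * loudly.
     */
    if (!manifest_contains_file(manifest, filename))
        elog(ERROR, "file \"%s\" not found in backup manifest; "
             "incremental block tracking may be broken", filename);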

And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meantime the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.
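
In code form, the kind of check I'd want (but don't have) might look
roughly like this. XLogRecPtr, TimeLineID, pg_fatal, and
LSN_FORMAT_ARGS are real, but the variables are stand-ins for data
we'd somehow have to get out of the standby's control file and the
backup's backup_label, which is exactly the part no code exists for:

    /*
     * Hypothetical sketch: the checks that would make fast-forwarding
     * a standby safe.
     */
    if (backup_end_lsn < min_recovery_point)
        pg_fatal("backup ends at %X/%X, before the standby's minimum "
                 "recovery point %X/%X",
                 LSN_FORMAT_ARGS(backup_end_lsn),
                 LSN_FORMAT_ARGS(min_recovery_point));

    if (backup_tli != standby_tli)
        pg_fatal("backup is from timeline %u, but the standby is now "
                 "on timeline %u", backup_tli, standby_tli);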

Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.
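
To make the ambiguity concrete, the two variants might hypothetically
be spelled something like this -- neither option exists today, and
both spellings are invented:

    # what I had originally imagined: clobber "incr", the last input
    pg_combinebackup --in-place full incr

    # what the fast-forward case wants: clobber the *first* input,
    # i.e. the standby's existing data directory
    pg_combinebackup --in-place-first standby_pgdata incr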

One thing that I find quite scary about such a mode is that if you
crash mid-way through, you're in a lot of trouble. In the case that I
had previously contemplated -- overwrite the last incremental with the
reconstructed full backup -- you *might* be able to make it crash safe
by writing out the full files for each incremental file, fsyncing
everything, then removing all of the incremental files and fsyncing
again. The idea would be that if you crash midway through it's OK to
just repeat whatever you were trying to do before the crash and if it
succeeds the second time then all is well. If, for a given file, there
are both incremental and non-incremental versions, then the second
attempt should remove and recreate the non-incremental version from
the incremental version. If there's only a non-incremental version, it
could be that the previous attempt got far enough to remove the
incremental file, but in that case the full file that we now have
should be the same thing that we would produce if we did the operation
now. It all sounds a little scary, but maybe it's OK. And as long as
you don't remove the this-is-an-incremental-backup markers from the
backup_label file until you've done everything else, you can tell
whether you've ever successfully completed the reassembly or not. But
if you're using a hypothetical overwrite mode to overwrite the first
input rather than the last one, well, the target looks like a valid
data directory before you even start, and if you replace a bunch of
files and then crash, it still looks like one, even though it no
longer is. I'm not sure I've really
wrapped my head around all of the cases here, but it does feel like
there are some new ways to go wrong.
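
For what it's worth, the crash-safe ordering described above would
look roughly like this in outline; every function here is invented
for illustration:

    /*
     * Hypothetical sketch of idempotent in-place reassembly.  The
     * intended invariant: rerunning the whole sequence from the top
     * after a crash converges on the same final state.
     */
    static void
    reassemble_in_place(const char *dir, char **incr_files, int n_incr)
    {
        /* Pass 1: write out a full version of every incremental file. */
        for (int i = 0; i < n_incr; i++)
            write_reconstructed_file(dir, incr_files[i]);
        fsync_everything(dir);

        /* Pass 2: only now remove the incremental inputs. */
        for (int i = 0; i < n_incr; i++)
            remove_incremental_file(dir, incr_files[i]);
        fsync_everything(dir);

        /*
         * Strip the this-is-an-incremental-backup markers from
         * backup_label last of all, so that a crash anywhere above
         * leaves an unambiguous "not finished, run it again" state.
         */
        strip_incremental_markers(dir);
    }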

One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher wal_level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
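
Something as simple as this early exit in the summarizer might be
enough -- a sketch, not actual patch code:

    /*
     * Hypothetical sketch: with wal_level=minimal, WAL-skipping
     * optimizations would make the summaries untrustworthy, so refuse
     * to produce any.  The resulting gap in the summary files then
     * makes a later incremental backup attempt fail with an error
     * rather than silently producing a bad backup.
     */
    if (wal_level < WAL_LEVEL_REPLICA)
    {
        ereport(WARNING,
                (errmsg("WAL summarization is not possible when wal_level is \"minimal\"")));
        return;
    }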

--
Robert Haas
EDB: http://www.enterprisedb.com


