Re: trying again to get incremental backup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: trying again to get incremental backup
Msg-id CA+TgmoYUhrgcNin=qrY+J+S-f6kctf9DaUKoaD-e3cww_ox9vg@mail.gmail.com
In response to Re: trying again to get incremental backup  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: trying again to get incremental backup
List pgsql-hackers
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> While we are at it, maybe around the below in PrepareForIncrementalBackup()
>
>                 if (tlep[i] == NULL)
>                         ereport(ERROR,
>                                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                                  errmsg("timeline %u found in manifest, but not in this server's history",
>                                         range->tli)));
>
> we could add
>
>     errhint("You might need to start a new full backup instead of incremental one")
>
> ?

I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
full backup "just because," not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

The scenario I was really concerned about when I wrote this test was
(2), because that could lead to a corrupt restore. This test isn't
strong enough to prevent that completely, because two unrelated
standbys can branch onto the same new timelines at the same LSNs, and
then these checks can't tell that something bad has happened. However,
they can detect a useful subset of problem cases. And the solution is
not so much "take a new full backup" as "keep straight which server is
which." Likewise, in case (1), the relevant hint would be "don't
manually remove timeline history files, and if you must, then at least
don't nuke timelines that you actually still care about."
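
To make the limitation concrete, here is a minimal standalone sketch of
this kind of check. It is not the code in PrepareForIncrementalBackup():
the TimelineEntry struct, the function names, the start-LSN comparison,
and the sample data are all hypothetical simplifications chosen for
illustration. It only shows why a standby that never saw the backup's
timeline is caught, while one that branched onto the same timeline at the
same LSN is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, simplified stand-in; not PostgreSQL's actual structs. */
typedef struct
{
    uint32_t tli;       /* timeline ID */
    uint64_t start_lsn; /* LSN at which this timeline began */
} TimelineEntry;

/* Return the history entry for tli, or NULL if the server was never on it. */
const TimelineEntry *
find_timeline(const TimelineEntry *history, size_t nhistory, uint32_t tli)
{
    for (size_t i = 0; i < nhistory; i++)
        if (history[i].tli == tli)
            return &history[i];
    return NULL;
}

/*
 * Return true if every timeline named in the backup manifest also appears
 * in this server's history with the same start LSN (assumed rule).
 */
bool
manifest_matches_history(const TimelineEntry *manifest, size_t nmanifest,
                         const TimelineEntry *history, size_t nhistory)
{
    assert(manifest != NULL && history != NULL);

    for (size_t i = 0; i < nmanifest; i++)
    {
        const TimelineEntry *e =
            find_timeline(history, nhistory, manifest[i].tli);

        if (e == NULL)
            return false; /* timeline in manifest, not in server history */
        if (e->start_lsn != manifest[i].start_lsn)
            return false; /* same timeline ID, but it branched elsewhere */
    }
    return true;
}

/* Made-up data: full backup taken on standby A after promoting to TLI 2. */
const TimelineEntry backup_manifest[] = {{1, 0}, {2, 0x3000000}};

/* Standby B promoted onto TLI 3, so it never had TLI 2: detectable. */
const TimelineEntry server_b_history[] = {{1, 0}, {3, 0x3000000}};

/* Standby C promoted onto TLI 2 at the same LSN: looks identical to A. */
const TimelineEntry server_c_history[] = {{1, 0}, {2, 0x3000000}};
```

Under these assumptions, checking backup_manifest against
server_b_history fails, which is the useful subset of cases. Checking it
against server_c_history succeeds even though C is a different server,
which is the case that no comparison of timeline IDs and LSNs can catch.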

> > I have a fix for this locally, but I'm going to hold off on publishing
> > a new version until either there's a few more things I can address all
> > at once, or until Thomas commits the ubsan fix.
> >
>
> Great, I cannot get it to fail again today, it had to be some dirty
> state of the testing env. BTW: Thomas has pushed that ubsan fix.

Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.

--
Robert Haas
EDB: http://www.enterprisedb.com
