Re: Add notes to pg_combinebackup docs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Add notes to pg_combinebackup docs
Date
Msg-id CA+Tgmob45C8T4nPkq3bGMoxNWm2V53-J=SJRwfdrwc5xS+KSVQ@mail.gmail.com
In response to Re: Add notes to pg_combinebackup docs  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Add notes to pg_combinebackup docs
List pgsql-hackers
On Tue, Apr 9, 2024 at 5:46 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> What we could do is detect this in pg_combinebackup, and either just
> disable checksums with a warning and hint to maybe enable them again. Or
> maybe just print that the user needs to disable them.
>
> I was thinking maybe we could detect this while taking the backups, and
> force taking a full backup if checksums got enabled since the last
> backup. But we can't do that because we only have the manifest from the
> last backup, and the manifest does not include info about checksums.

So, as I see it, we have at least five options here:

1. Document that this doesn't work. By "this doesn't work" I think
what we mean is: (A) If any of the backups you'd need to combine were
taken with checksums off but the last one was taken with checksums on,
then you might get checksum failures on the cluster written by
pg_combinebackup. (B) Therefore, after enabling checksums, you should
take a new full backup and base all future incrementals on it. (C) But
if you don't, then you can (I presume) run recovery with
ignore_checksum_failure=true to bring the cluster to a consistent
state, stop the database, turn checksums off (e.g. with pg_checksums
--disable), optionally turn them on again, and restart the database.
Then presumably everything should be OK, except that you'll have wiped
out any real checksum failures, that is, ones that weren't artifacts
of the reconstruction process, along with the fake ones.

2. As (1), but make check_control_files() emit a warning message when
the problem case is detected.

3. As (2), but also add a command-line option to pg_combinebackup to
flip the checksum flag to false in the control file. Then, if you have
the problem case, instead of following the procedure described above,
you can just use this option, and enable checksums afterward if you
want. It still has the same disadvantage as the procedure described
above: any "real" checksum failures will be suppressed, too. From a
design perspective, this feels like kind of an ugly wart to me: hey,
in this one scenario, you have to add --do-something-random or it
doesn't work! But I see it's got some votes already, so maybe it's the
right answer.

4. Add the checksum state to the backup manifest. Then, if someone
tries to take an incremental backup with checksums on and the
precursor backup had checksums off, we could fail. A strength of this
proposal is that it actually stops the problem from happening at
backup time, which in general is a whole lot nicer than not noticing a
problem until restore time. A possible weakness is that it stops you
from doing something that is ... actually sort of OK. I mean, in the
strict sense, the incremental backup isn't valid, because it's going
to cause checksum failures after reconstruction, but it's valid apart
from the checksums, and those are fixable. I wonder whether users who
encounter this error message will say "oh, I'm glad PostgreSQL
prevented me from doing that" or "oh, I'm annoyed that PostgreSQL
prevented me from doing that."

5. At reconstruction time, notice which backups have checksums
enabled. If the final backup in the chain has them enabled, then
whenever we take a block from an earlier backup with checksums
disabled, re-checksum the block. Unlike any of the previous
options, this produces a fully correct result, so there's no real need
to document any restrictions on what you're allowed to do. We might
need to document the performance consequences, though: fast file copy
methods will have to be disabled, and we'll have to go block by block
while paying the cost of checksum calculation for each one. You might
be sad to find out that your reconstruction is a lot slower than you
were expecting.
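
To make the shape of (5) concrete, here is a minimal sketch in Python
rather than the C that pg_combinebackup is written in. Everything in
it is an assumption made for illustration: the function names are
hypothetical, the checksum is a CRC-based stand-in and NOT the
algorithm PostgreSQL uses for data pages, and the checksum field is
assumed to be stored little-endian. The only layout facts relied on
are the default 8192-byte block size and that pd_checksum is the
2-byte field following the 8-byte pd_lsn in the page header.

```python
import struct
import zlib

BLCKSZ = 8192            # PostgreSQL's default block size
PD_CHECKSUM_OFFSET = 8   # pd_checksum follows the 8-byte pd_lsn


def standin_checksum(page: bytes, blkno: int) -> int:
    """Placeholder 16-bit checksum. NOT PostgreSQL's page checksum."""
    # Zero the checksum field so the stored value never feeds into itself.
    body = (page[:PD_CHECKSUM_OFFSET] + b"\x00\x00"
            + page[PD_CHECKSUM_OFFSET + 2:])
    # Mix in the block number so the result depends on the page's position.
    return (zlib.crc32(body) ^ blkno) & 0xFFFF


def set_checksum(page: bytes, blkno: int) -> bytes:
    """Return a copy of the page with the checksum field filled in."""
    csum = standin_checksum(page, blkno)
    # "<H" (little-endian uint16) is an assumption about byte order.
    return (page[:PD_CHECKSUM_OFFSET] + struct.pack("<H", csum)
            + page[PD_CHECKSUM_OFFSET + 2:])


def reconstruct_block(page: bytes, blkno: int,
                      source_has_checksums: bool,
                      final_has_checksums: bool) -> bytes:
    """Re-checksum only blocks taken from a checksums-off backup when
    the final backup in the chain has checksums on."""
    if final_has_checksums and not source_has_checksums:
        return set_checksum(page, blkno)
    return page  # usable as-is
```

The sketch also shows where the cost comes from: any block that goes
through set_checksum() has to be read and rewritten individually, so
whole-file copy shortcuts no longer apply to files that contain such
blocks.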

While I'm probably willing to implement any of these, I have some
reservations about attempting (4) or especially (5) after feature
freeze. I think there's a pretty decent chance that those fixes will
turn out to have issues of their own which we'll then need to fix in
turn. We could perhaps consider doing (2) for now and (5) for a future
release, or something like that.
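
For reference, the "problem case" that (2) would warn about, and that
(4) would reject at backup time, boils down to a small predicate. The
following is only a sketch in Python with hypothetical function names;
it assumes each backup's checksum state is available as the
data_checksum_version value from its control file (0 meaning off), and
for the second function it assumes the prior backup's checksum state
had been recorded somewhere such as the manifest, which it currently
is not.

```python
from typing import List, Sequence


def find_checksum_mismatch(checksum_versions: Sequence[int]) -> List[int]:
    """checksum_versions holds one data_checksum_version per backup,
    oldest (the full backup) first; 0 means checksums were off.

    Returns the positions of earlier backups taken with checksums off
    when the final backup has checksums on, or [] if there is nothing
    to warn about."""
    if not checksum_versions or checksum_versions[-1] == 0:
        return []
    return [i for i, v in enumerate(checksum_versions[:-1]) if v == 0]


def may_take_incremental(prior_checksums_on: bool,
                         current_checksums_on: bool) -> bool:
    """Option (4): refuse an incremental backup only when checksums are
    on now but were off in the backup it would depend on."""
    return not (current_checksums_on and not prior_checksums_on)
```

With a chain of [0, 0, 1] the first function reports backups 0 and 1,
which is where a warning along the lines of (2) would fire; a chain of
[1, 1, 1] or [1, 0] reports nothing.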

--
Robert Haas
EDB: http://www.enterprisedb.com


