Re: pg_combinebackup does not detect missing files - Mailing list pgsql-hackers

From Robert Haas
Subject Re: pg_combinebackup does not detect missing files
Date
Msg-id CA+TgmobyLDcvqn4W1VdQp+27zHJuy31c+vCBCAOzZP8EzqJicg@mail.gmail.com
In response to Re: pg_combinebackup does not detect missing files  (David Steele <david@pgmasters.net>)
Responses Re: pg_combinebackup does not detect missing files
Re: pg_combinebackup does not detect missing files
List pgsql-hackers
On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
> However, I think allowing the user to optionally validate the input
> would be a good feature. Running pg_verifybackup as a separate step is
> going to be more expensive than verifying/copying at the same time.
> Even with storage tricks to copy ranges of data, pg_combinebackup is
> going to be aware of files that do not need to be verified for the current
> operation, e.g. old copies of free space maps.

In cases where pg_combinebackup reuses a checksum from the input
manifest rather than recomputing it, this could accomplish something.
However, for any file that's actually reconstructed, pg_combinebackup
computes the checksum as it's writing the output file. I don't see how
it's sensible to then turn around and verify that the checksum that we
just computed is the same one that we now get. It makes sense to run
pg_verifybackup on the output of pg_combinebackup at a later time,
because that can catch bits that have been flipped on disk in the
meantime. But running the equivalent of pg_verifybackup during
pg_combinebackup would amount to doing the exact same checksum
calculation twice and checking that it gets the same answer both
times.
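
To illustrate the argument above, here is a toy Python sketch. It is not
pg_combinebackup code; the function names, the use of SHA-256, and the
file layout are all assumptions made for illustration. It shows that
re-hashing the bytes that were just written can only agree with the
recorded checksum, while re-reading the file later can catch a bit
flipped on disk.

```python
import hashlib
import os
import tempfile


def write_with_checksum(path, chunks):
    """Write chunks to path, hashing the same bytes as they are written."""
    h = hashlib.sha256()
    with open(path, "wb") as f:
        for chunk in chunks:
            h.update(chunk)
            f.write(chunk)
    return h.hexdigest()


def checksum_file(path):
    """Re-read the file and hash what is actually stored there."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def flip_bit(path, offset):
    """Simulate on-disk corruption by flipping one bit at the given offset."""
    with open(path, "r+b") as f:
        f.seek(offset)
        byte = f.read(1)
        f.seek(offset)
        f.write(bytes([byte[0] ^ 0x01]))


def demo():
    chunks = [b"a" * 8192, b"b" * 8192]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "relfile")
        recorded = write_with_checksum(path, chunks)

        # "Verification" during the write: same bytes, same algorithm,
        # so this comparison cannot fail.
        h = hashlib.sha256()
        for chunk in chunks:
            h.update(chunk)
        same_process_ok = h.hexdigest() == recorded

        # Verification at a later time, from what is on disk.
        later_ok = checksum_file(path) == recorded
        flip_bit(path, 100)
        after_flip_ok = checksum_file(path) == recorded
    return same_process_ok, later_ok, after_flip_ok
```

Calling demo() returns (True, True, False): only the check that re-reads
the file after the simulated corruption reports a mismatch.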

> One more thing occurs to me -- if data checksums are enabled then a
> rough and ready output verification would be to test the checksums
> during combine. Data checksums aren't very good but something should be
> triggered if a bunch of pages go wrong, especially since the block
> offset is part of the checksum. This would be helpful for catching
> combine bugs.

I don't know, I'm not very enthused about this. I bet pg_combinebackup
has some bugs, and it's possible that one of them could involve
putting blocks in the wrong places, but it doesn't seem especially
likely. Even if it happens, it's more likely to be that
pg_combinebackup thinks it's putting them in the right places but is
actually writing them to the wrong offset in the file, in which case a
block-checksum calculation inside pg_combinebackup is going to think
everything's fine, but a standalone tool that isn't confused will be
able to spot the damage.
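
The scenario above can be sketched in Python. Everything here is an
assumption for illustration: the checksum is a toy (CRC-32 XORed with
the block number), not PostgreSQL's page checksum algorithm, and the
offset bug is invented. The only property borrowed from the discussion
is that the block number is part of the checksum.

```python
import io
import zlib

BLCKSZ = 8192  # block size used by this toy example


def toy_page_checksum(payload, blkno):
    """Toy checksum that mixes in the block number (NOT PostgreSQL's)."""
    return (zlib.crc32(payload) ^ blkno) & 0xFFFFFFFF


def make_page(blkno):
    """Build a page: 4-byte checksum followed by filler payload."""
    payload = bytes([blkno % 256]) * (BLCKSZ - 4)
    return toy_page_checksum(payload, blkno).to_bytes(4, "little") + payload


def page_ok(page, blkno):
    """Check a page against the block number the caller believes it has."""
    stored = int.from_bytes(page[:4], "little")
    return stored == toy_page_checksum(page[4:], blkno)


def buggy_offset(blkno):
    """Hypothetical bug: blocks 1 and 2 are written to each other's offset."""
    swapped = {1: 2, 2: 1}
    return swapped.get(blkno, blkno) * BLCKSZ


def demo():
    f = io.BytesIO()
    in_process = []
    for blkno in range(3):
        page = make_page(blkno)
        # In-process check: uses the block number the writer believes in.
        in_process.append(page_ok(page, blkno))
        f.seek(buggy_offset(blkno))
        f.write(page)

    # Standalone check: derives the block number from the file offset.
    data = f.getvalue()
    standalone = [
        page_ok(data[i * BLCKSZ:(i + 1) * BLCKSZ], i)
        for i in range(len(data) // BLCKSZ)
    ]
    return in_process, standalone
```

Here demo() returns ([True, True, True], [True, False, False]): the
confused writer sees nothing wrong, while the reader that trusts only
the file offsets flags blocks 1 and 2.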

It's frustrating that we can't do better verification of these things,
but to fix that I think we need better infrastructure elsewhere. For
instance, if we made pg_basebackup copy blocks from shared_buffers
rather than the filesystem, or at least copy them when they weren't
being concurrently written to the filesystem, then we'd not have the
risk of torn pages producing spurious bad checksums. If we could
somehow render a backup consistent when taking it instead of when
restoring it, we could verify tons of stuff. If we had some useful
markers of how long files were supposed to be and which ones were
supposed to be present, we could check a lot of things that are
uncheckable today. pg_combinebackup does the best it can -- or the
best I could make it do -- but there's a disappointing number of
situations where it's like "hmm, in this situation, either something
bad happened or it's just the somewhat unusual case where this happens
in the normal course of events, and we have no way to tell which it
is."
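
As a rough illustration of the kind of check that such markers would
allow, here is a Python sketch. The "markers" are a hypothetical
mapping from relative path to expected length; this is not the
backup_manifest format and not something pg_combinebackup does today.

```python
import os
import tempfile


def check_against_markers(root, expected):
    """Compare files under root with a hypothetical {path: length} map."""
    problems = []
    for rel, size in sorted(expected.items()):
        path = os.path.join(root, rel)
        if not os.path.isfile(path):
            problems.append(f"missing: {rel}")
        else:
            found = os.path.getsize(path)
            if found != size:
                problems.append(
                    f"wrong length: {rel} (expected {size}, found {found})"
                )
    return problems


def demo():
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "a"), "wb") as f:
            f.write(b"x" * 16)
        with open(os.path.join(root, "b"), "wb") as f:
            f.write(b"x" * 8)  # shorter than the marker says
        # "c" is expected but never created.
        expected = {"a": 16, "b": 16, "c": 4}
        return check_against_markers(root, expected)
```

With markers like these, a short file and a missing file are reported
as errors rather than being indistinguishable from normal conditions.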

--
Robert Haas
EDB: http://www.enterprisedb.com


