Re: pg_combinebackup does not detect missing files - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: pg_combinebackup does not detect missing files |
Date | |
Msg-id | CA+TgmobyLDcvqn4W1VdQp+27zHJuy31c+vCBCAOzZP8EzqJicg@mail.gmail.com |
In response to | Re: pg_combinebackup does not detect missing files (David Steele <david@pgmasters.net>) |
Responses |
Re: pg_combinebackup does not detect missing files
Re: pg_combinebackup does not detect missing files |
List | pgsql-hackers |
On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
> However, I think allowing the user to optionally validate the input
> would be a good feature. Running pg_verifybackup as a separate step is
> going to be more expensive than verifying/copying at the same time.
> Even with storage tricks to copy ranges of data, pg_combinebackup is
> going to be aware of files that do not need to be verified for the current
> operation, e.g. old copies of free space maps.

In cases where pg_combinebackup reuses a checksum from the input manifest rather than recomputing it, this could accomplish something. However, for any file that's actually reconstructed, pg_combinebackup computes the checksum as it's writing the output file. I don't see how it's sensible to then turn around and verify that the checksum that we just computed is the same one that we now get. It makes sense to run pg_verifybackup on the output of pg_combinebackup at a later time, because that can catch bits that have been flipped on disk in the meanwhile. But running the equivalent of pg_verifybackup during pg_combinebackup would amount to doing the exact same checksum calculation twice and checking that it gets the same answer both times.

> One more thing occurs to me -- if data checksums are enabled then a
> rough and ready output verification would be to test the checksums
> during combine. Data checksums aren't very good but something should be
> triggered if a bunch of pages go wrong, especially since the block
> offset is part of the checksum. This would be helpful for catching
> combine bugs.

I don't know, I'm not very enthused about this. I bet pg_combinebackup has some bugs, and it's possible that one of them could involve putting blocks in the wrong places, but it doesn't seem especially likely. Even if it happens, it's more likely to be that pg_combinebackup thinks it's putting them in the right places but is actually writing them to the wrong offset in the file, in which case a block-checksum calculation inside pg_combinebackup is going to think everything's fine, but a standalone tool that isn't confused will be able to spot the damage.

It's frustrating that we can't do better verification of these things, but to fix that I think we need better infrastructure elsewhere. For instance, if we made pg_basebackup copy blocks from shared_buffers rather than the filesystem, or at least copy them when they weren't being concurrently written to the filesystem, then we'd not have the risk of torn pages producing spurious bad checksums. If we could somehow render a backup consistent when taking it instead of when restoring it, we could verify tons of stuff. If we had some useful markers of how long files were supposed to be and which ones were supposed to be present, we could check a lot of things that are uncheckable today. pg_combinebackup does the best it can -- or the best I could make it do -- but there's a disappointing number of situations where it's like "hmm, in this situation, either something bad happened or it's just the somewhat unusual case where this happens in the normal course of events, and we have no way to tell which it is."

--
Robert Haas
EDB: http://www.enterprisedb.com
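The wrong-offset case described above can be shown with a toy model. This is only a sketch under stated assumptions: `toy_checksum` uses CRC-32 as a stand-in and is not PostgreSQL's data-checksum algorithm, the 16-byte block size is chosen for readability, and `buggy_write` and `standalone_verify` are hypothetical names that do not correspond to any pg_combinebackup or pg_verifybackup code. The only property borrowed from the discussion is that the block number is mixed into the page checksum.

```python
import zlib

BLOCK_SIZE = 16  # toy size for readability; PostgreSQL's default is 8192 bytes


def toy_checksum(payload: bytes, blkno: int) -> int:
    """Toy stand-in for a page checksum that mixes in the block number."""
    return zlib.crc32(payload + blkno.to_bytes(4, "little"))


def make_page(payload: bytes, blkno: int) -> bytes:
    """Build a page: 4-byte checksum header followed by the payload."""
    assert len(payload) == BLOCK_SIZE - 4
    return toy_checksum(payload, blkno).to_bytes(4, "little") + payload


def page_ok(page: bytes, blkno: int) -> bool:
    """Check a page's stored checksum against the given block number."""
    stored = int.from_bytes(page[:4], "little")
    return stored == toy_checksum(page[4:], blkno)


def buggy_write(out: bytearray, page: bytes, intended: int, actual: int) -> bool:
    """Hypothetical buggy writer: it checks the page against the block
    number it *believes* it is writing, then writes it somewhere else."""
    ok = page_ok(page, intended)
    start = actual * BLOCK_SIZE
    out[start:start + BLOCK_SIZE] = page
    return ok


def standalone_verify(data: bytes) -> list:
    """Independent check: derive each block number from its file position
    and return the block numbers whose checksum does not match."""
    bad = []
    for blkno in range(len(data) // BLOCK_SIZE):
        page = data[blkno * BLOCK_SIZE:(blkno + 1) * BLOCK_SIZE]
        if not page_ok(page, blkno):
            bad.append(blkno)
    return bad


def demo():
    nblocks = 4
    pages = [make_page(bytes([65 + i]) * (BLOCK_SIZE - 4), i)
             for i in range(nblocks)]
    out = bytearray(nblocks * BLOCK_SIZE)
    in_tool_results = []
    for i, page in enumerate(pages):
        # Simulated bug: blocks 2 and 3 land at each other's offsets.
        actual = {2: 3, 3: 2}.get(i, i)
        in_tool_results.append(buggy_write(out, page, i, actual))
    return all(in_tool_results), standalone_verify(bytes(out))
```

In this model, the check performed inside the writer passes for every block, because it shares the writer's mistaken idea of where the block is going. The position-based check run afterwards flags blocks 2 and 3.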