Re: Add notes to pg_combinebackup docs - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Add notes to pg_combinebackup docs |
Date | |
Msg-id | CA+Tgmob45C8T4nPkq3bGMoxNWm2V53-J=SJRwfdrwc5xS+KSVQ@mail.gmail.com |
In response to | Re: Add notes to pg_combinebackup docs (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses | Re: Add notes to pg_combinebackup docs |
List | pgsql-hackers |
On Tue, Apr 9, 2024 at 5:46 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> What we could do is detect this in pg_combinebackup, and either just
> disable checksums with a warning and hint to maybe enable them again. Or
> maybe just print that the user needs to disable them.
>
> I was thinking maybe we could detect this while taking the backups, and
> force taking a full backup if checksums got enabled since the last
> backup. But we can't do that because we only have the manifest from the
> last backup, and the manifest does not include info about checksums.

So, as I see it, we have at least five options here:

1. Document that this doesn't work. By "this doesn't work" I think what we mean is: (A) If any of the backups you'd need to combine were taken with checksums off but the last one was taken with checksums on, then you might get checksum failures on the cluster written by pg_combinebackup. (B) Therefore, after enabling checksums, you should take a new full backup and make future incrementals dependent thereupon. (C) But if you don't, then you can (I presume) run recovery with ignore_checksum_failure=true to bring the cluster to a consistent state, stop the database, shut checksums off, optionally also turn them on again, and restart the database. Then presumably everything should be OK, except that you'll have wiped out any real checksum failures that weren't artifacts of the reconstruction process along with the fake ones.

2. As (1), but make check_control_files() emit a warning message when the problem case is detected.

3. As (2), but also add a command-line option to pg_combinebackup to flip the checksum flag to false in the control file. Then, if you have the problem case, instead of following the procedure described above, you can just use this option, and enable checksums afterward if you want. It still has the same disadvantage as the procedure described above: any "real" checksum failures will be suppressed, too.
From a design perspective, this feels like kind of an ugly wart to me: hey, in this one scenario, you have to add --do-something-random or it doesn't work! But I see it's got some votes already, so maybe it's the right answer.

4. Add the checksum state to the backup manifest. Then, if someone tries to take an incremental backup with checksums on and the precursor backup had checksums off, we could fail. A strength of this proposal is that it actually stops the problem from happening at backup time, which in general is a whole lot nicer than not noticing a problem until restore time. A possible weakness is that it stops you from doing something that is ... actually sort of OK. I mean, in the strict sense, the incremental backup isn't valid, because it's going to cause checksum failures after reconstruction, but it's valid apart from the checksums, and those are fixable. I wonder whether users who encounter this error message will say "oh, I'm glad PostgreSQL prevented me from doing that" or "oh, I'm annoyed that PostgreSQL prevented me from doing that."

5. At reconstruction time, notice which backups have checksums enabled. If the final backup in the chain has them enabled, then whenever we take a block from an earlier backup with checksums disabled, re-checksum the block. As opposed to any of the previous options, this creates a fully correct result, so there's no real need to document any restrictions on what you're allowed to do. We might need to document the performance consequences, though: fast file copy methods will have to be disabled, and we'll have to go block by block while paying the cost of checksum calculation for each one. You might be sad to find out that your reconstruction is a lot slower than you were expecting.

While I'm probably willing to implement any of these, I have some reservations about attempting (4) or especially (5) after feature freeze.
I think there's a pretty decent chance that those fixes will turn out to have issues of their own which we'll then need to fix in turn. We could perhaps consider doing (2) for now and (5) for a future release, or something like that.

--
Robert Haas
EDB: http://www.enterprisedb.com
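
As a rough illustration of the problem case in (1A) and of the block selection that (5) would need, here is a minimal Python sketch. It is not pg_combinebackup code: the `Backup` class and both function names are hypothetical, and it assumes the checksum state of every backup in the chain is already known (for example, read from each backup's control file, since the manifest does not record it).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backup:
    """Hypothetical per-backup summary; not a real PostgreSQL structure."""
    name: str
    checksums_enabled: bool


def is_problem_chain(chain: List[Backup]) -> bool:
    """Return True for the case in (1A): the final backup has checksums on
    but at least one earlier backup in the chain had them off."""
    if not chain:
        return False
    *earlier, final = chain
    return final.checksums_enabled and any(
        not b.checksums_enabled for b in earlier
    )


def backups_needing_rechecksum(chain: List[Backup]) -> List[str]:
    """Under option (5), blocks taken from these backups would have to be
    re-checksummed during reconstruction."""
    if not chain or not chain[-1].checksums_enabled:
        return []
    return [b.name for b in chain[:-1] if not b.checksums_enabled]


# Chain is ordered oldest first: a full backup followed by incrementals.
chain = [
    Backup("full", checksums_enabled=False),
    Backup("incr1", checksums_enabled=False),
    Backup("incr2", checksums_enabled=True),
]
print(is_problem_chain(chain))            # True
print(backups_needing_rechecksum(chain))  # ['full', 'incr1']
```

In this sketch, options (2) and (3) would act on `is_problem_chain` at reconstruction time, option (4) would apply the same comparison at backup time between the precursor and the new incremental, and option (5) would use the second function to decide where the block-by-block path is required.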