Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: [PATCH] Verify Checksums during Basebackups |
Date | |
Msg-id | 20180304172914.GU2416@tamriel.snowman.net Whole thread Raw |
In response to | Re: [PATCH] Verify Checksums during Basebackups (Magnus Hagander <magnus@hagander.net>) |
List | pgsql-hackers |
Greetings Magnus, all, * Magnus Hagander (magnus@hagander.net) wrote: > On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfrost@snowman.net> wrote: > > * Magnus Hagander (magnus@hagander.net) wrote: > > > I think it would also be a good idea to have this a three-mode setting, > > > with "no check", "check and warning", "check and error". Where "check and > > > error" should be the default, but you could turn off that in "save > > whatever > > > is left mode". But I think it's better if pg_basebackup simply fails on a > > > checksum error, because that will make it glaringly obvious that there > > is a > > > problem -- which is the main point of checksums in the first place. And > > > then an option to turn it off completely in cases where performance is > > the > > > thing. > > > > When we implemented page-level checksum checking in pgBackRest, David > > and I had a good long discussion about exactly this question of "warn" > > vs. "error" and came to a different conclusion- you want a backup to > > always back up as much as it can even in the face of corruption. If the > > user has set up their backups in such a way that they don't see the > > warnings being thrown, it's a good bet they won't see failed backups > > happening either, in which case they might go from having "mostly" good > > backups to not having any. Note that I *do* think a checksum failure > > should result in an non-zero exit-code result from pg_basebackup, > > indicating that there was something which went wrong. > > I would argue that the likelihood of them seeing an error vs a warning is > orders of magnitude higher. > > That said, if we want to exit pg_basebacukp with an exit code but still > complete the backup, that would also be a workable way I guess. But I > strongly feel that we should make pg_basebackup scream at the user and > actually exit with an error -- it's the exit with error that will cause > their cronjobs to fail, and thus somebody notice it. Yes, we need to have it exit with a non-zero exit code, I definitely agree with that. Any indication that the backup may not be valid should do that, imv. I don't believe we should just abort the backup and throw away whatever effort has gone into getting the data thus far and then leave an incomplete backup in place- someone might think it's not incomplete.. I certainly hope you weren't thinking that pg_basebackup would then go through and remove the backup that it had been running when it reached the checksum failure- that would be a dangerous and rarely tested code path, after all. > > One difference is that with pgBackRest, we manage the backups and a > > backup with page-level checksums isn't considered "valid", so we won't > > expire old backups if a new backup has a checksum failure, but I'm not > > sure that's really enough to change my mind on if pg_basebackup should > > outright fail on a checksum error or if it should throw big warnings but > > still try to perform the backup. If the admin sets things up in a way > > that a warning and error-exit code from pg_basebackup is ignored and > > they still expire out their old backups, then even having an actual > > error result wouldn't change that. > > There is another important difference as well. In pgBackRest somebody will > have to explicitly enable checksum verification -- which already there > means that they are much more likely to actually check the logs from it. That's actually not correct- we automatically check page-level checksums when the C library is available (and it's now required as part of 2.0) and the database has checksums enabled (that's required of both methods, of course...), so I don't see the difference you're suggesting here. pgBackRest does have an option to *require* checksum-checking be done, and one to disable checksumming, but by default it's enabled. See: https://pgbackrest.org/command.html#command-backup/category-command/option-checksum-page > > As an admin, the first thing I would want in a checksum failure scenario > > is a backup of everything, even the blocks which failed (and then a > > report of which blocks failed...). I'd rather we think about that > > use-case than the use-case where the admin sets up backups in such a way > > that they don't see warnings being thrown from the backup. > > I agree. But I absolutely don't want my backup to be listed as successful, > because then I might expire old ones. > > So sure, if we go with WARNING + exit with an errorcode, that is perhaps > the best combination of the two. Right, that's what we settled on for pgBackRest also and definitely makes the most sense to me. > That said, it probably still makes sense to implement all modes. Or at > least to implement a "don't bother verifying the checksums" mode. This > mainly controls what the default would be. Yes, I'm fine with a "don't verify checksums" option, but I believe the default should be to verify checksums when the database is configured with them and, on a checksum failure, throw warnings and exit with an exit-code that's non-zero and means "page-level checksum verification failed." Thanks! Stephen
Attachment
pgsql-hackers by date: