Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: [PATCH] Verify Checksums during Basebackups
Date
Msg-id CABUevEy4VbVz4pF9oLPOd77VLxVOF6_fzmYHMmW6sB8fOT2Scg@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Verify Checksums during Basebackups  (Stephen Frost <sfrost@snowman.net>)
Responses Re: [PATCH] Verify Checksums during Basebackups  (Stephen Frost <sfrost@snowman.net>)
Re: [PATCH] Verify Checksums during Basebackups  (Michael Banck <michael.banck@credativ.de>)
List pgsql-hackers


On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings Magnus, all,

* Magnus Hagander (magnus@hagander.net) wrote:
> I think it would also be a good idea to have this a three-mode setting,
> with "no check", "check and warning", "check and error". Where "check and
> error" should be the default, but you could turn off that in "save whatever
> is left mode". But I think it's better if pg_basebackup simply fails on a
> checksum error, because that will make it glaringly obvious that there is a
> problem -- which is the main point of checksums in the first place. And
> then an option to turn it off completely in cases where performance is the
> thing.

When we implemented page-level checksum checking in pgBackRest, David
and I had a good long discussion about exactly this question of "warn"
vs. "error" and came to a different conclusion- you want a backup to
always back up as much as it can even in the face of corruption.  If the
user has set up their backups in such a way that they don't see the
warnings being thrown, it's a good bet they won't see failed backups
happening either, in which case they might go from having "mostly" good
backups to not having any.  Note that I *do* think a checksum failure
should result in an non-zero exit-code result from pg_basebackup,
indicating that there was something which went wrong.

I would argue that the likelihood of them seeing an error vs a warning is orders of magnitude higher.

That said, if we want to exit pg_basebacukp with an exit code but still complete the backup, that would also be a workable way I guess. But I strongly feel that we should make pg_basebackup scream at the user and actually exit with an error -- it's the exit with error that will cause their cronjobs to fail, and thus somebody notice it.


 
One difference is that with pgBackRest, we manage the backups and a
backup with page-level checksums isn't considered "valid", so we won't
expire old backups if a new backup has a checksum failure, but I'm not
sure that's really enough to change my mind on if pg_basebackup should
outright fail on a checksum error or if it should throw big warnings but
still try to perform the backup.  If the admin sets things up in a way
that a warning and error-exit code from pg_basebackup is ignored and
they still expire out their old backups, then even having an actual
error result wouldn't change that.

There is another important difference as well. In pgBackRest somebody will have to explicitly enable checksum verification -- which already there means that they are much more likely to actually check the logs from it.


As an admin, the first thing I would want in a checksum failure scenario
is a backup of everything, even the blocks which failed (and then a
report of which blocks failed...).  I'd rather we think about that
use-case than the use-case where the admin sets up backups in such a way
that they don't see warnings being thrown from the backup.

I agree. But I absolutely don't want my backup to be listed as successful, because then I might expire old ones.

So sure, if we go with WARNING + exit with an errorcode, that is perhaps the best combination of the two.

That said, it probably still makes sense to implement all modes. Or at least to implement a "don't bother verifying the checksums" mode. This mainly controls what the default would be.

--

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type
Next
From: Anton Dignös
Date:
Subject: Re: IndexJoin memory problem using spgist and boxes