Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: [PATCH] Verify Checksums during Basebackups
Msg-id 20180304144918.GQ2416@tamriel.snowman.net
In response to Re: [PATCH] Verify Checksums during Basebackups  (Magnus Hagander <magnus@hagander.net>)
Responses Re: [PATCH] Verify Checksums during Basebackups  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers
Greetings Magnus, all,

* Magnus Hagander (magnus@hagander.net) wrote:
> I think it would also be a good idea to have this a three-mode setting,
> with "no check", "check and warning", "check and error". Where "check and
> error" should be the default, but you could turn off that in "save whatever
> is left mode". But I think it's better if pg_basebackup simply fails on a
> checksum error, because that will make it glaringly obvious that there is a
> problem -- which is the main point of checksums in the first place. And
> then an option to turn it off completely in cases where performance is the
> thing.

When we implemented page-level checksum checking in pgBackRest, David
and I had a good long discussion about exactly this question of "warn"
vs. "error" and came to a different conclusion: you want a backup to
always back up as much as it can even in the face of corruption.  If the
user has set up their backups in such a way that they don't see the
warnings being thrown, it's a good bet they won't see failed backups
happening either, in which case they might go from having "mostly" good
backups to not having any.  Note that I *do* think a checksum failure
should result in a non-zero exit code from pg_basebackup, indicating
that something went wrong.
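
To make the "warn, keep going, exit non-zero" behaviour concrete, here is
a minimal sketch.  Everything in it is hypothetical: the function names,
the (block number, payload, stored checksum) tuple, and the use of
zlib.crc32 as a stand-in checksum.  It is not PostgreSQL's page checksum
algorithm and not how pg_basebackup or pgBackRest are implemented.

```python
import sys
import zlib
from typing import Iterable, List, Tuple

# Hypothetical block representation: (block number, payload, stored checksum).
Block = Tuple[int, bytes, int]


def backup_blocks(blocks: Iterable[Block], out: List[bytes]) -> List[int]:
    """Copy every block to `out`; return the block numbers that failed verification."""
    failed: List[int] = []
    for blkno, payload, stored in blocks:
        # zlib.crc32 is only a stand-in for a real page checksum.
        if zlib.crc32(payload) != stored:
            print(f"WARNING: checksum mismatch in block {blkno}", file=sys.stderr)
            failed.append(blkno)
        # The block is copied even when verification failed.
        out.append(payload)
    return failed


def exit_code(failed: List[int]) -> int:
    """Non-zero when any block failed, so callers can detect the problem."""
    return 1 if failed else 0
```

The backup still contains every block, the warnings name the blocks that
failed, and a wrapper script that checks the exit status can tell that
something went wrong.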

One difference is that with pgBackRest, we manage the backups and a
backup with page-level checksum failures isn't considered "valid", so we
won't expire old backups if a new backup has a checksum failure, but I'm
not sure that's really enough to change my mind on whether pg_basebackup
should outright fail on a checksum error or throw big warnings but
still try to perform the backup.  If the admin sets things up in a way
that a warning and an error exit code from pg_basebackup are ignored and
they still expire out their old backups, then even having an actual
error result wouldn't change that.
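
As a rough illustration of the retention idea (old backups are only
expired once a newer *valid* backup exists), here is a small sketch.  The
Backup class, the keep_valid parameter, and backups_to_expire are all
made up for this example, and keep_valid is assumed to be at least 1.
This is not pgBackRest's actual expiration logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backup:
    label: str
    checksum_failures: int  # number of pages that failed verification

    @property
    def valid(self) -> bool:
        return self.checksum_failures == 0


def backups_to_expire(history: List[Backup], keep_valid: int) -> List[Backup]:
    """Return the backups that may be expired; `history` is ordered oldest first.

    The newest `keep_valid` valid backups are retained, along with every
    backup newer than the oldest of them.  If fewer than `keep_valid`
    valid backups exist, nothing is expired.
    """
    valid_seen = 0
    # Walk from newest to oldest, counting only valid backups.
    for i in range(len(history) - 1, -1, -1):
        if history[i].valid:
            valid_seen += 1
            if valid_seen == keep_valid:
                return history[:i]
    return []
```

With keep_valid set to 1 and a newest backup that has checksum failures,
the previous valid backup is retained rather than expired.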

As an admin, the first thing I would want in a checksum failure scenario
is a backup of everything, even the blocks which failed (and then a
report of which blocks failed...).  I'd rather we think about that
use-case than the use-case where the admin sets up backups in such a way
that they don't see warnings being thrown from the backup.

Thanks!

Stephen
