Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: [PATCH] Verify Checksums during Basebackups
Date
Msg-id 20180304172914.GU2416@tamriel.snowman.net
Whole thread Raw
In response to Re: [PATCH] Verify Checksums during Basebackups  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers
Greetings Magnus, all,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Magnus Hagander (magnus@hagander.net) wrote:
> > > I think it would also be a good idea to have this a three-mode setting,
> > > with "no check", "check and warning", "check and error". Where "check and
> > > error" should be the default, but you could turn off that in "save
> > whatever
> > > is left mode". But I think it's better if pg_basebackup simply fails on a
> > > checksum error, because that will make it glaringly obvious that there
> > is a
> > > problem -- which is the main point of checksums in the first place. And
> > > then an option to turn it off completely in cases where performance is
> > the
> > > thing.
> >
> > When we implemented page-level checksum checking in pgBackRest, David
> > and I had a good long discussion about exactly this question of "warn"
> > vs. "error" and came to a different conclusion- you want a backup to
> > always back up as much as it can even in the face of corruption.  If the
> > user has set up their backups in such a way that they don't see the
> > warnings being thrown, it's a good bet they won't see failed backups
> > happening either, in which case they might go from having "mostly" good
> > backups to not having any.  Note that I *do* think a checksum failure
> > should result in an non-zero exit-code result from pg_basebackup,
> > indicating that there was something which went wrong.
>
> I would argue that the likelihood of them seeing an error vs a warning is
> orders of magnitude higher.
>
> That said, if we want to exit pg_basebacukp with an exit code but still
> complete the backup, that would also be a workable way I guess. But I
> strongly feel that we should make pg_basebackup scream at the user and
> actually exit with an error -- it's the exit with error that will cause
> their cronjobs to fail, and thus somebody notice it.

Yes, we need to have it exit with a non-zero exit code, I definitely
agree with that.  Any indication that the backup may not be valid should
do that, imv.  I don't believe we should just abort the backup and throw
away whatever effort has gone into getting the data thus far and then
leave an incomplete backup in place- someone might think it's not
incomplete..  I certainly hope you weren't thinking that pg_basebackup
would then go through and remove the backup that it had been running
when it reached the checksum failure- that would be a dangerous and
rarely tested code path, after all.

> > One difference is that with pgBackRest, we manage the backups and a
> > backup with page-level checksums isn't considered "valid", so we won't
> > expire old backups if a new backup has a checksum failure, but I'm not
> > sure that's really enough to change my mind on if pg_basebackup should
> > outright fail on a checksum error or if it should throw big warnings but
> > still try to perform the backup.  If the admin sets things up in a way
> > that a warning and error-exit code from pg_basebackup is ignored and
> > they still expire out their old backups, then even having an actual
> > error result wouldn't change that.
>
> There is another important difference as well. In pgBackRest somebody will
> have to explicitly enable checksum verification -- which already there
> means that they are much more likely to actually check the logs from it.

That's actually not correct- we automatically check page-level checksums
when the C library is available (and it's now required as part of 2.0)
and the database has checksums enabled (that's required of both methods,
of course...), so I don't see the difference you're suggesting here.
pgBackRest does have an option to *require* checksum-checking be done,
and one to disable checksumming, but by default it's enabled.  See:

https://pgbackrest.org/command.html#command-backup/category-command/option-checksum-page

> > As an admin, the first thing I would want in a checksum failure scenario
> > is a backup of everything, even the blocks which failed (and then a
> > report of which blocks failed...).  I'd rather we think about that
> > use-case than the use-case where the admin sets up backups in such a way
> > that they don't see warnings being thrown from the backup.
>
> I agree. But I absolutely don't want my backup to be listed as successful,
> because then I might expire old ones.
>
> So sure, if we go with WARNING + exit with an errorcode, that is perhaps
> the best combination of the two.

Right, that's what we settled on for pgBackRest also and definitely
makes the most sense to me.

> That said, it probably still makes sense to implement all modes. Or at
> least to implement a "don't bother verifying the checksums" mode. This
> mainly controls what the default would be.

Yes, I'm fine with a "don't verify checksums" option, but I believe the
default should be to verify checksums when the database is configured
with them and, on a checksum failure, throw warnings and exit with an
exit-code that's non-zero and means "page-level checksum verification
failed."

Thanks!

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Anton Dignös
Date:
Subject: Re: IndexJoin memory problem using spgist and boxes
Next
From: Pavel Stehule
Date:
Subject: Re: [HACKERS] plpgsql - additional extra checks