Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Date
Msg-id 20190329153014.GL6197@tamriel.snowman.net
In response to Re: Online verification of checksums  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
>
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.

This really seems like it depends on which of the options we're talking
about.  Connecting to the server and asking what the current insert
point is, so we can check that the LSN isn't completely insane, seems
reasonable.  But at least one option being discussed was to have
pg_basebackup actually *lock the page* (even if just for I/O) and then
re-read it, and having an external tool do that instead of the
backend seems like a whole different level to me.  That would involve
having an SQL function for "lock this page against I/O" and then another
for "unlock this page", wouldn't it?
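
To make the "is the LSN completely insane" check concrete, here is a
minimal sketch in Python of how an external tool might treat a page whose
checksum failed.  It is an illustration only, not the code from the patch:
the function names, the "skip"/"report" results and the little-endian
assumption are all hypothetical.  It assumes the tool recorded an LSN when
its scan started and has since asked the server for the current insert
point, for example with SELECT pg_current_wal_insert_lsn().

```python
import struct


def parse_lsn(text):
    """Convert the textual LSN form 'X/Y' (two hex numbers) to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page, byteorder="<"):
    """Read pd_lsn from the first 8 bytes of a page.

    pd_lsn is stored as two 32-bit integers (xlogid, xrecoff) in the
    server's native byte order; little-endian is assumed here.
    """
    xlogid, xrecoff = struct.unpack_from(byteorder + "II", page, 0)
    return (xlogid << 32) | xrecoff


def classify_failed_page(page, start_lsn, insert_lsn):
    """Decide what to do with a page whose checksum did NOT verify.

    Hypothetical policy: if the page LSN is newer than the LSN recorded
    when the scan started, but not beyond the server's current insert
    point, the page was plausibly being written while we read it, so
    skip it.  Any other LSN (older than the scan start, or beyond the
    insert point) gives no such excuse, so report the failure.
    """
    lsn = page_lsn(page)
    if start_lsn < lsn <= insert_lsn:
        return "skip"
    return "report"
```

A tool built this way only needs a read-only query against the server; it
never has to lock a page, which is the distinction drawn above.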

> > FWIW I've initially argued against that, believing that we can address
> > those issues in some other way, and I'd love if that was possible. But
> > considering we're still trying to make that work reliably I think the
> > reasonable conclusion is that Andres was right communicating with the
> > server is necessary.

As part of a backup, you could check against the pages written out into
the WAL as a cross-check and be able to be confident that at least
everything which was backed up had been checked.  That doesn't cover
things like unlogged tables though.

For my part, at least, adding additional checks around the LSN seems
like a good solution (though we can't allow those checks to turn into
false positives...) and would seriously reduce the risk that we have
false negatives (we can *not* eliminate false negatives entirely; we
could possibly get to a point where at least we don't have any more
false negatives than PG itself has, but that looks like an awful lot of
work and ends up adding its own risks...).

As I've said before, I'd certainly support a background worker which
performs ongoing checksum validation of pages and that would be able to
use the same approach as what we do with pg_basebackup, but having an
external tool locking pages seems really unlikely to be reasonable.

Thanks!

Stephen
