Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Date
Msg-id CAOuzzgpJ1X7oXZXGy+Rkv52M=ZiRRWUF5gkD99=cCswwt5idRA@mail.gmail.com
In response to Re: Online verification of checksums  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
>> Yes, and this is a tricky part.  Until you explained it in your latest
>> message, I wasn't sure how we could distinguish a concurrent update from a
>> page header corruption.  Now I agree that if the page LSN was updated and
>> increased between rereads, it is safe enough to conclude that we have some
>> concurrent load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two.  While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.

FWIW, I am not much of a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, but it still does not
rule out the possibility of incorrect results, confusing users as well
as people looking at such reports.

Let’s be clear about this: our checksums are, themselves, far from bulletproof, regardless of all our other efforts.  They do not catch every possible corruption, and they are certainly nowhere near the guarantees you’d expect from, say, cryptographic integrity checks.  We cannot say with certainty that a page which passes checksum validation isn’t corrupted in some way.  A page which doesn’t pass checksum validation may be corrupted or may be torn, and we aren’t 100% sure of that either, but we can work to try and make a sensible call about which it is.
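(For scale: the page checksum stored in pd_checksum is only 16 bits, so even a page overwritten with completely random bytes would still pass validation roughly 1 time in 65,536.)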

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that an LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.  Still, what's the point in doing an LSN check anyway if we
know that the checksum is valid?  Then, on a retry if the first attempt
failed, you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial checksum check on the page failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

I agree that the LSN check isn’t interesting if the page passes the checksum validation.  I do think we can look at the LSN and make reasonable inferences based on it even if the checksum doesn’t validate.  In particular, in my experience at least, the result of a read, without any intervening write, is very likely to be the same if performed multiple times quickly, even if there is latent storage corruption, due to caching if nothing else.  What’s interesting about the LSN check is that we are specifically looking to see if it *changed* in a reasonable and predictable manner, and that it was replaced with a new yet plausible value.  The chances of that happening due to latent storage corruption are vanishingly small.
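To make that concrete, here is a minimal sketch of the test I have in mind.  The type and function names (and the LSN values) are made up for illustration; this is not actual pgbackrest or PostgreSQL code:

/*
 * Minimal sketch of the reread decision -- names invented for
 * illustration only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t lsn_t;            /* stand-in for XLogRecPtr */

/*
 * After a page fails checksum validation and has been reread, decide
 * whether the failure looks like a concurrent write (which WAL replay
 * will fix up) rather than latent storage corruption.
 */
static bool
looks_like_concurrent_write(lsn_t first_read_lsn,
                            lsn_t reread_lsn,
                            lsn_t backup_start_lsn,
                            lsn_t current_lsn)
{
    /* The LSN must have *changed* between the two reads ... */
    if (reread_lsn == first_read_lsn)
        return false;

    /* ... and the new value must fall in the only range a legitimate
     * write happening during this backup could have produced. */
    return reread_lsn >= backup_start_lsn && reread_lsn <= current_lsn;
}

int
main(void)
{
    lsn_t backup_start = 0x10000000;
    lsn_t current      = 0x10200000;

    /* LSN advanced into the backup's range: almost certainly a hot page. */
    printf("%d\n", looks_like_concurrent_write(0x0FFF0000, 0x10010000,
                                               backup_start, current));

    /* LSN unchanged on reread: treat as possible corruption. */
    printf("%d\n", looks_like_concurrent_write(0x0FFF0000, 0x0FFF0000,
                                               backup_start, current));
    return 0;
}

The point is that both conditions have to hold: the value must have moved, and it must have moved into the only window a real write during this backup could have produced.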

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.

That would reduce the likelihood of facing torn pages, still you
cannot fully discard the problem either, as the same page may get changed
again once you loop over, no?  And what if a corruption has updated
pd_lsn on-disk?  Unlikely, sure, still possible.

We surely don’t care about a page which has been changed multiple times by PG during the backup, since all those changes will be, by definition, in the WAL, no?  Therefore, one loop to see that the value of the LSN *changed*, meaning something wrote something new there, with a cross-check to see that the LSN was in the expected range, goes an awfully long way toward assuring that this isn’t a case of latent storage corruption.  If there is an attacker who is not the PG process but who is modifying files then, yes, that’s a risk, and it won’t be picked up by this, but why would they create an invalid checksum in the first place?

We aren’t attempting to protect against a sophisticated attack here; we are trying to detect latent storage corruption.
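Roughly, the higher-level loop I described would look something like the following.  Again, the names are invented and an in-memory array stands in for the relation file, so this is just the shape of the idea rather than pgbackrest’s implementation:

/*
 * Sketch of the two-pass loop: pass one copies the file and remembers
 * which blocks failed verification, pass two rereads only those blocks
 * and applies the LSN test.  All names and values are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 4

typedef uint64_t lsn_t;

typedef struct
{
    lsn_t lsn;          /* pd_lsn as seen when the block is read */
    bool  checksum_ok;  /* result of checksum verification on that read */
} PageRead;

/* What the first pass over the file happened to see (made-up data). */
static PageRead first_pass[NBLOCKS] = {
    {0x10010000, true},     /* fine */
    {0x0FFF0000, false},    /* failed: hot page, changes before pass two */
    {0x0A000000, false},    /* failed: genuinely corrupted, never changes */
    {0x100F0000, true},     /* fine */
};

/* What a reread of the same blocks sees later (made-up data). */
static PageRead second_pass[NBLOCKS] = {
    {0x10010000, true},
    {0x10150000, false},    /* LSN moved into the backup's range */
    {0x0A000000, false},    /* unchanged */
    {0x100F0000, true},
};

int
main(void)
{
    lsn_t backup_start = 0x10000000;
    lsn_t current      = 0x10200000;
    int   suspect[NBLOCKS];
    int   nsuspect = 0;

    /* Pass one: compress/encrypt/copy as usual, just remember which
     * blocks failed verification instead of stalling on them. */
    for (int blk = 0; blk < NBLOCKS; blk++)
        if (!first_pass[blk].checksum_ok)
            suspect[nsuspect++] = blk;

    /* Pass two: reread only the suspects.  A reread that now passes its
     * checksum is simply accepted; for the rest, apply the LSN test. */
    for (int i = 0; i < nsuspect; i++)
    {
        int   blk = suspect[i];
        lsn_t oldlsn = first_pass[blk].lsn;
        lsn_t newlsn = second_pass[blk].lsn;

        if (second_pass[blk].checksum_ok)
            continue;           /* validated on the reread */

        if (newlsn != oldlsn && newlsn >= backup_start && newlsn <= current)
            printf("block %d: concurrent write, covered by WAL\n", blk);
        else
            printf("block %d: possible storage corruption\n", blk);
    }
    return 0;
}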

I would also ask for clarification as to whether you feel that checking the WAL for the page would be insufficient somehow, since I mentioned that as also being on the roadmap.  If there’s some reason that checking the WAL for the page wouldn’t be sufficient, I am anxious to understand that reasoning.

Thanks,

Stephen
