Re: Online verification of checksums - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Online verification of checksums
Date
Msg-id CABUevEwS7o01gBDPya88Vf4RMjmEcTv66wN+vqJ=jK-amqrnEA@mail.gmail.com
In response to Re: Online verification of checksums  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on, how do you make sure that the LSN is fine?  A broken
> checksum could perfectly well mean that the LSN is actually *not*
> fine, if the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> >     a) If the page is now valid then it is OK.
> >     b) If the page is not valid but the LSN has increased from the LSN
>
> As per the previous point, the LSN is a value that we cannot rely on.

We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know the page is corrupt. If
the LSN *doesn't change* we know it's corrupt. But if the LSN
*increases* AND the new page now has a correct checksum, it's most
likely correct. You could perhaps even put a cap on it, saying "the
LSN increased, but by less than <n>", where <n> is a sufficiently
high number that it's entirely unreasonable for the LSN to advance
that far between two reads of the same block. But it has to have a
very high margin in that case.
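To make that concrete, here is a rough sketch (in C, but not actual
PostgreSQL or pgBackRest code) of the kind of plausibility check I
mean. XLogRecPtr is simplified to a plain 64-bit integer, and
MAX_PLAUSIBLE_LSN_ADVANCE is an arbitrary stand-in for <n>:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified: a WAL position */

/* Arbitrary stand-in for <n>: how far an LSN could plausibly
 * advance between two reads of the same block. */
#define MAX_PLAUSIBLE_LSN_ADVANCE ((XLogRecPtr) 16 * 1024 * 1024 * 1024)

typedef enum
{
    PAGE_OK,        /* accept the page */
    PAGE_CORRUPT,   /* report corruption */
    PAGE_RETRY      /* plausibly torn again; read once more */
} PageVerdict;

static PageVerdict
judge_reread(XLogRecPtr first_lsn, XLogRecPtr new_lsn, bool checksum_ok)
{
    if (checksum_ok)
        return PAGE_OK;         /* valid on re-read: first read was torn */

    if (new_lsn < first_lsn)
        return PAGE_CORRUPT;    /* LSN went backwards: not a new write */

    if (new_lsn == first_lsn)
        return PAGE_CORRUPT;    /* nothing was written, still invalid */

    if (new_lsn - first_lsn > MAX_PLAUSIBLE_LSN_ADVANCE)
        return PAGE_CORRUPT;    /* implausibly large jump in the header */

    return PAGE_RETRY;          /* LSN advanced plausibly: maybe torn again */
}

The exact verdicts are of course debatable; the point is only that
both the direction and the distance of the LSN movement carry
information we can use.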


> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing.  How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted?  Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those.  There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables
are large :)

The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly insert a short delay before trying again if it gets that
far; that could happen if the process is too quick. It could also be
that you are unlucky and hit a *new* write, such that both times a
write happened to land exactly as you were reading the page. I'm not
sure the chance of that happening is even big enough that we have to
care about it, though?
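For illustration, here is a sketch of what that re-read loop could
look like. Again this is not real code: read_block() is a trivial
pread() wrapper, page_checksum_ok() stands in for the actual
verification (pg_checksum_page() in core), and the retry count and
delay are arbitrary:

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ      8192    /* PostgreSQL block size */
#define MAX_RETRIES 3       /* arbitrary retry budget */

/* Stand-in for the real checksum verification. */
extern bool page_checksum_ok(const char *buf, unsigned blkno);

static bool
read_block(int fd, unsigned blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ;
}

/* Re-read a block that failed verification on the first pass;
 * "first_read" holds the block image from that first pass. */
static bool
reread_suspect_block(int fd, unsigned blkno, char *first_read)
{
    char    again[BLCKSZ];

    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
    {
        if (!read_block(fd, blkno, again))
            return false;       /* short read: treat as a failure */

        if (page_checksum_ok(again, blkno))
            return true;        /* valid now: the first read was torn */

        if (memcmp(again, first_read, BLCKSZ) == 0)
        {
            /* Identical bytes: real corruption, or a write still in
             * flight; back off briefly before the next attempt. */
            usleep(10 * 1000);  /* 10ms, arbitrary */
            continue;
        }

        /* Page changed but is still invalid: likely a new concurrent
         * write; remember the new image and go around again. */
        memcpy(first_read, again, BLCKSZ);
    }

    return false;               /* still failing: report the block */
}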

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


