Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Date
Msg-id CAOuzzgrhZJ6kBaK7v+Xra=q7_XFbtXyQCBFrvcrsCqZZ20WFTw@mail.gmail.com
In response to Re: Online verification of checksums  (Andres Freund <andres@anarazel.de>)
Responses Re: Online verification of checksums
List pgsql-hackers
Greetings,

On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > It's torn pages that I am concerned about - the server is writing and
> > > we are reading, and we get a mix of old and new content.  We have been
> > > quite diligent about protecting ourselves from such risks elsewhere,
> > > and checksum verification should not be held to any lesser standard.
> >
> > If we see a checksum failure on an otherwise correctly read block in
> > online mode, we retry the block on the theory that we might have read a
> > torn page.  If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if it's newer.  This
> > way, a torn page should not cause a false positive either way, I
> > think.
>
> False positives, no. But there's plenty of potential for false
> negatives. In plenty of clusters a large fraction of the pages are
> going to be touched in most checkpoints.

How is it a false negative?  The page was in the middle of being written. If we crash, the page won’t be used, because it will be overwritten during WAL replay from the checkpoint; if we don’t crash, it also won’t be used until it has been written out completely.  I don’t agree that this is in any way a false negative: it’s simply a page that happens to be in the middle of a file and that we can skip, because it isn’t going to be used. It’s not as though the backend will see a checksum failure when it reads that page.
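A minimal sketch of the retry-then-compare-LSN scheme described above, with invented names (`Block`, `read_block`, `compute_checksum`) standing in for the real pg_checksums internals, might look like:

```python
# Hypothetical sketch of the online verification logic discussed above;
# the names here are illustrative and do not mirror the actual
# pg_checksums source.

def compute_checksum(payload: bytes) -> int:
    # Stand-in for PostgreSQL's real page checksum algorithm.
    return sum(payload) & 0xFFFF

class Block:
    def __init__(self, payload: bytes, lsn: int):
        self.payload = payload
        self.lsn = lsn                        # LSN stamped on the page
        self.checksum = compute_checksum(payload)

def block_is_corrupt(read_block, checkpoint_lsn: int) -> bool:
    """Return True only for failures worth reporting."""
    blk = read_block()
    if compute_checksum(blk.payload) == blk.checksum:
        return False
    # Re-read once: the first read may have caught a torn page
    # (a write by the server racing with our read).
    blk = read_block()
    if compute_checksum(blk.payload) == blk.checksum:
        return False
    # Still failing: only report pages older than the current
    # checkpoint; newer pages may legitimately be in flight and
    # would be replayed from WAL after a crash anyway.
    return blk.lsn <= checkpoint_lsn
```

Under this scheme a page that is torn on the first read but intact on the retry is never reported, and a persistently failing page is reported only when its LSN predates the checkpoint.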

Not only that, but checksum failures are much more likely to happen on long-dormant data, not on data that’s actively being written out and is therefore still in the Linux filesystem cache, quite possibly without having reached actual storage yet at all.

> >  If it is a genuine storage failure we will see it in the next
> > pg_checksums run as its LSN will be older than the checkpoint.
>
> Well, but also, by that time it might be too late to recover things. Or
> it might be a backup that you just made, that you later want to recover
> from, ...

If it’s a backup you just made, then that page is going to be in the WAL, and the torn page on disk isn’t going to be used, so how is this an issue?  This is why we have WAL: to deal with torn pages.
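The WAL point can be illustrated with a toy model (the names here are invented for illustration and have nothing to do with PostgreSQL’s actual recovery code): the first modification of a page after a checkpoint records a full-page image in WAL, so a torn on-disk copy is simply discarded during replay.

```python
# Toy model of full-page-image recovery; not PostgreSQL's real code.

def recover(disk_page: bytes, wal_full_page_image):
    # On crash recovery, a full-page image recorded in WAL replaces
    # whatever (possibly torn) content is on disk.
    if wal_full_page_image is not None:
        return wal_full_page_image
    return disk_page

new_page = b"NEW" * 4
old_page = b"OLD" * 4
# A torn write leaves a mix of old and new content on disk...
torn_on_disk = new_page[:6] + old_page[6:]
# ...but replay installs the full-page image, so the torn copy is
# never used.
assert recover(torn_on_disk, new_page) == new_page
```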

> > The basebackup checksum verification works in the same way.
>
> Shouldn't have been merged that way.

I have a hard time not finding this offensive.  These issues were considered, discussed, and well thought out, with the result being committed after agreement.

Do you have any example cases where the code in pg_basebackup has resulted in either a false positive or a false negative?  Any case which can be shown to result in either?

If not, then I think we need to stop this line of argument: if we can’t trust that a torn page won’t actually be used in its torn state, then it seems likely that our entire WAL system is broken, that we can’t trust the way we do backups either, and that we would have to rewrite all of that to lock pages while a backup is taken.

Thanks!

Stephen
