Re: Online verification of checksums - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Online verification of checksums
Msg-id ce2984fe-1c0f-41e1-588f-a9c061e7bbf2@2ndquadrant.com
In response to Re: Online verification of checksums  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Online verification of checksums
List pgsql-hackers
On 3/3/19 12:48 AM, Michael Paquier wrote:
> On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote:
>> I gotta say, my conclusion from this debate is that it's simply a
>> mistake to do this without involvement of the server that can use
>> locking to prevent these kind of issues.  It seems pretty absurd to me
>> to have hacky workarounds around partial writes of a live server, around
>> truncation, etc, even though the server has ways to deal with that.
> 
> I agree with Andres on this one.  We are never going to make this
> stuff safe if we don't handle page reads with the proper locks because
> of torn pages.  What I think we should do is provide a SQL function
> which reads a page in shared mode, and then checks its checksum if its
> LSN is older than the previous redo point.  This discards cases with
> rather hot pages, but if the page is hot enough then the backend
> re-reading the page would just do the same by verifying the page
> checksum by itself.

Handling torn pages is not difficult, and the patch already does that
(it reads the LSN of the last checkpoint from the control file, and uses
it the same way basebackup does). That has been working since (at least)
September, so I don't see how the SQL function would help with this.
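
To make the torn-page rule concrete, here is a rough sketch in Python
(for brevity; this is not the patch code). It assumes the default 8kB
block size, a little-endian server, and that pd_lsn sits in the first 8
bytes of the page header as two 32-bit halves. The names page_lsn,
classify_page and checksum_ok are made up for illustration, and the
exact comparison against the checkpoint LSN is an assumption too.

```python
import struct

BLCKSZ = 8192  # assumed: default PostgreSQL block size


def page_lsn(page: bytes) -> int:
    """Return pd_lsn, stored as (high 32 bits, low 32 bits).

    Assumes a little-endian server; the on-disk value uses the
    server's native byte order.
    """
    hi, lo = struct.unpack_from("<II", page, 0)
    return (hi << 32) | lo


def classify_page(page: bytes, checkpoint_lsn: int, checksum_ok) -> str:
    """Decide what to do with one fully read page.

    checksum_ok is a hypothetical callable that verifies the page
    checksum; the real algorithm is not reproduced here.
    """
    # A page modified after the last checkpoint may be torn by a
    # concurrent write, so a checksum failure on it proves nothing.
    if page_lsn(page) >= checkpoint_lsn:
        return "skipped"
    return "ok" if checksum_ok(page) else "corrupt"
```

A skipped page is not lost coverage in any meaningful sense: it was
written after the last checkpoint, so the server touched it recently.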

The other issue (raised recently) is partial reads, where we read only a
fraction of the page. Basebackup simply ignores such pages, likely on
the assumption that it's either concurrent extension or truncation (in
which case it's newer than the last checkpoint LSN anyway). So maybe we
should do the same thing here. As I mentioned before, we can't reliably
detect incomplete segments anyway (at least I believe that's the case).
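
Ignoring partial reads the way described above could look roughly like
this (again a Python sketch under assumptions, not what basebackup or
the patch literally does; read_blocks is a made-up name and the 8kB
block size is assumed):

```python
import io

BLCKSZ = 8192  # assumed: default PostgreSQL block size


def read_blocks(stream, blcksz=BLCKSZ):
    """Read a segment block by block.

    Returns (full_pages, skipped), where skipped counts reads that
    returned fewer than blcksz bytes. Such reads are ignored rather
    than reported, on the assumption that they come from concurrent
    extension or truncation.
    """
    full_pages = []
    skipped = 0
    while True:
        chunk = stream.read(blcksz)
        if not chunk:
            break  # end of file
        if len(chunk) < blcksz:
            skipped += 1  # partial page: do not verify, do not fail
            continue
        full_pages.append(chunk)
    return full_pages, skipped
```

For example, a stream holding two full pages plus 100 trailing bytes
yields two pages to verify and one skipped read.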

You and Andres may be right that trying to verify checksums online
without close interaction with the server is ultimately futile (or at
least overly complex). But I'm not sure those issues (torn pages and
partial reads) are very good arguments, considering basebackup has to
deal with them too. Not sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

