Re: Enabling Checksums - Mailing list pgsql-hackers

From: Greg Smith
Subject: Re: Enabling Checksums
Date:
Msg-id: 50A08137.5090200@2ndQuadrant.com
In response to: Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Enabling Checksums  (Jesper Krogh <jesper@krogh.cc>)
           Re: Enabling Checksums  (Markus Wanner <markus@bluegap.ch>)
           Re: Enabling Checksums  (Alvaro Herrera <alvherre@2ndquadrant.com>)
           Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
On 11/11/12 2:56 PM, Jeff Davis wrote:
> We could have a separate utility, pg_checksums, that can
> alter the state and/or do an offline verification. And initdb would take
> an option that would start everything out fully protected with
> checksums.

Adding an initdb option to start out with everything checksummed seems 
an uncontroversial good first thing to have available.  It seems like a 
proper 9.3 target to aim at even if per-table upgrading gets bogged down 
in details.  I have an argument below that the area between initdb and 
per-table upgrades is fundamentally uncertain and therefore not worth 
chasing after, based on reasons you already started to outline.  There's 
not much useful middle ground there.

Won't a pg_checksums program just grow until it looks like a limited 
version of vacuum though?  It's going to iterate over most of the table; 
it needs the same cost controls as autovacuum (and to respect the load 
of concurrent autovacuum work) to keep I/O under control; and those cost 
control values might change if there's a SIGHUP to reload parameters. 
It looks so much like vacuum that I think there needs to be a really 
compelling reason to split it into something new.  Why can't this be yet 
another autovacuum worker that does its thing?
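
For reference, the vacuum-style cost accounting I'm alluding to works
roughly like this toy Python sketch--the numbers are made up, and the
real thing is C code driven by the vacuum_cost_limit and
vacuum_cost_delay settings, which is exactly why a SIGHUP reload has to
be noticed mid-scan:

import time

cost_limit = 200       # made-up stand-in for vacuum_cost_limit
cost_delay_s = 0.02    # made-up stand-in for vacuum_cost_delay
cost_balance = 0

def charge(cost: int) -> None:
    """Charge the cost of one page access, napping once over budget."""
    global cost_balance
    cost_balance += cost
    if cost_balance >= cost_limit:
        time.sleep(cost_delay_s)   # yield I/O bandwidth to other work
        cost_balance = 0
        # A real worker would re-read its settings around here so that
        # a SIGHUP-driven change to the limits takes effect mid-scan.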
> In order to get to the fully-protected state, you still need to
> somehow make sure that all of the old data is checksummed.
>
> And the "fully protected" state is important in my opinion, because
> otherwise we aren't protected against corrupt page headers that say
> they have no checksum (even when it really should have a checksum).

I think it's useful to step back for a minute and consider the larger 
uncertainty that hangs over an existing relation, which amplifies just 
how ugly this situation is.  The best guarantee I think online 
checksumming can offer
is to tell the user "after transaction id X, all new data in relation R 
is known to be checksummed".  Unless you do this at initdb time, any 
conversion case is going to have the possibility that a page is 
corrupted before you get to it--whether you're adding the checksum as 
part of a "let's add them while we're writing anyway" page update or the 
conversion tool is hitting it.

That's why I don't think anyone will find online conversion really 
useful until they've done a full sweep updating the old pages.  And if 
you accept that, a flexible checksum upgrade utility--one that shares 
cost controls with concurrent autovacuum activity--becomes a must.

One of the really common cases I was expecting here is that conversions 
are done by kicking off a slow background VACUUM CHECKSUM job that might 
run in pieces.  I was thinking of an approach like this:

-Initialize a last_checked_block value for each table
-Loop:
--Grab the next block after the last checked one
--When on the last block of the relation, grab an exclusive lock to 
protect against race conditions with extension
--If it's marked as checksummed and the checksum matches, skip it
---Otherwise, add a checksum and write it out
--When that succeeds, update last_checked_block
--If that was the last block, save some state saying the whole table is 
checksummed
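
To make that loop concrete, here's a rough sketch in Python.  The page 
format is made up for illustration (byte 0 as a "has checksum" flag, 
bytes 1-2 as a 16-bit checksum), not PostgreSQL's real page layout, and 
names like convert_relation and last_checked_block are placeholders; a 
real version would be C code in a worker process, with the extension 
lock handling described above.  I've also read the "otherwise" step as 
applying only to pages that were never flagged--a flagged page with a 
bad checksum gets reported instead of silently rewritten:

import zlib

PAGE_SIZE = 8192

def page_checksum(page: bytearray) -> int:
    # The checksum covers everything except the flag byte and the
    # stored checksum itself (bytes 0-2 in this toy layout).
    return zlib.crc32(bytes(page[3:])) & 0xFFFF

def convert_relation(pages: list[bytearray], last_checked_block: int) -> int:
    """One resumable sweep starting at last_checked_block; returns the
    updated pointer so an interrupted run can pick up where it left off."""
    block = last_checked_block
    while block < len(pages):
        page = pages[block]
        flagged = page[0] == 1
        stored = int.from_bytes(page[1:3], "big")
        if flagged and stored == page_checksum(page):
            pass                       # already converted; skip it
        elif flagged:
            # Flag set but checksum wrong: report rather than rewrite,
            # since re-checksumming would hide the damage.
            print(f"block {block}: checksum mismatch, page is damaged")
        else:
            # Old-format page: add the checksum and "write it out".
            page[0] = 1
            page[1:3] = page_checksum(page).to_bytes(2, "big")
        block += 1    # persist the new last_checked_block here
    return block      # == len(pages): the whole table is checksummed

if __name__ == "__main__":
    relation = [bytearray(PAGE_SIZE) for _ in range(4)]   # unconverted
    done = convert_relation(relation, last_checked_block=0)
    assert done == len(relation) and all(p[0] == 1 for p in relation)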

With that logic, there is at least a forward moving pointer that removes 
the uncertainty around whether pages have been updated or not.  It will 
keep going usefully if interrupted too.  One obvious way this can 
fail is if:

1) A late page in the relation is updated and a checksummed page written
2) The page is corrupted such that the "is this checksummed?" bits are 
not consistent anymore, along with other damage to it
3) The conversion process gets to this page eventually
4) The corruption of (2) isn't detected

But I think this possibility--that a page might get quietly corrupted 
after being checked once, while the rest of the relation is still being 
checked--is both impossible to remove and a red herring.  How do we
know that this page of the relation wasn't corrupted on disk before we 
even started?  We don't, and we can't.

The only guarantee I see that we can give for online upgrades is that 
after a VACUUM CHECKSUM sweep is done, and every page is known to both 
have a valid checksum on it and have its checksum bits set, *then* any 
page that doesn't have both set bits and a matching checksum is 
garbage.  Until reaching that point, any old data is suspect.  The idea 
of operating in a "we'll convert on write but never convert old pages" 
mode can't offer any useful guarantee about data integrity that I can 
see.  As you say, on that path you never gain the ability to tell pages 
that were checksummed but have since been corrupted from ones that were 
corrupt all along.
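
To spell out that distinction, here's a hedged continuation of the 
earlier Python sketch (reusing its page_checksum and toy page format; 
sweep_complete and last_checked_block stand in for whatever saved state 
a real system would keep per relation):

def classify_page(page: bytearray, block: int,
                  sweep_complete: bool, last_checked_block: int) -> str:
    flagged = page[0] == 1
    valid = flagged and int.from_bytes(page[1:3], "big") == page_checksum(page)
    if valid:
        return "ok"
    if sweep_complete or block < last_checked_block:
        # The sweep definitely visited this page, so a missing flag or
        # mismatched checksum can only mean corruption since then.
        return "garbage"
    # The sweep hasn't reached this page yet; it may simply be
    # unconverted old data, so nothing stronger than "suspect" holds.
    return "suspect"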

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


