Re: Checksums, state of play - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Checksums, state of play
Date
Msg-id CA+Tgmob10n=FOCSDFBtD_8Xv8k9rN_xbOh0--e5K4p-a-Lrr7w@mail.gmail.com
Whole thread Raw
In response to Re: Checksums, state of play  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Tue, Mar 6, 2012 at 10:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 4. Checksums are being removed, but some blocks may still have them.
>> Thus, it's not an error for a block to have no checksum, but any
>> still-remaining checksums should be correct (though possibly we ought
>> not to complain if they aren't, to provide a recovery path for users
>> who are turning checksums off because they're getting errors they
>> don't want).  Any block that's written is written without checksums.
>
> I agree its advantageous to have a means of removing pagesums from
> data blocks as a 4th state.
>
> I think we need a 5th state - pagesums disabled. Which allows an
> emergency disabling of the feature without rewriting the blocks.
> Obviously if the database is damaged and I/O devices going bad, trying
> to rescan database is likely to cause further problems.

Maybe we should consider making this a separate flag that controls
what we do when we detect a checksum failure -- ERROR, LOG, or just
count it in the stats -- rather than an extra state.

>> I think we need to be clear about how the system transitions between
>> these states.  In the current patch, AIUI, you can effectively go from
>> 1->2 or 4->2 by setting page_checksums=on and from 2->4 by setting
>> page_checksums=off, but there's no easy way to ensure that you've
>> reached state 3 or that you've gotten back to state 1.  Some variant
>> of VACUUM seems like a good way of doing that, but it doesn't make
>> sense for example to have page_checksums=off and do VACUUM (CHECKSUMS
>> ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF).  I
>> guess we could just reject those as error cases, but it would be
>> nicer, I think, to have an interface with a higher degree of
>> orthogonality.
>
> Right, a misunderstanding I think.
>
> If we have states set at database level then we'd not have a GUC as well.

OK.  If the flag is set at database level rather than table or cluster
level, then shared catalogs become a bit of a wart.  Maybe managably
so, but it needs thought, at the least.

>> There's probably more than one way to do that, but my personal
>> preference, as previously noted, is to make this a table-level option,
>> rather than a GUC.  Then, VACUUM (CHECKSUMS ON) can first change the
>> pg_class entry to indicate that checksums are enabling-in-progress
>> (i.e. 1->2), then scan the table, adding checksums, and then mark
>> checksums as fully enabled (i.e. 2->3).  VACUUM (CHECKSUMS OFF) can
>> proceed in symmetric fashion, marking checksums as
>> disabling-in-progress (3->4), then scanning the table and getting rid
>> of them, and then marking them fully disabled (4->1).  If a crash
>> happens in the middle somewhere, the state of the table can get left
>> as enabling-in-progress or disabling-in-progress, but a new VACUUM
>> (CHECKSUMS X) can be used to finish the process, and we always know
>> exactly where we're at.
>
> Any in-progress state needs to have checksums removed first, then re-added.

I don't think that's necessarily true.  If the user begins disabling
checksums (i.e. they are in state #4) and they change their mind, they
should be able to switch back to enabling mode.

> I'll keep an open mind for now about database/table level. I'm not
> sure how possible/desirable each is.

I think that per-table is clearly more desirable, but I haven't
studied how invasive the code changes are.  I suspect it is manageable
but I might be building castles in the air.

>> I generally agree with this outline, though I think that in lieu of a
>> version number we could simply set a new pd_flags bit indicating that
>> checksums are enabled.  If we haven't fully enabled checksums yet,
>> then the fact that this bit isn't set is not an error; but if
>> checksums are fully enabled, then every page must have that bit set,
>> and any page that doesn't is ipso facto corrupt.
>
> Whether to have it or not, if a corruption occurs during
> checksum-enabling then we could get a false reading.

That is true, and I think unavoidable.  While checksums are being
enabled, we have to assume that a not-checksummed page just hasn't
been updated yet.  There's no way to distinguish those cases.

> If we have a bit
> then the bit can be set wrong, so we could either make a check when it
> wasn't due, or skip a check we should have made. If we don't have a
> bit and so skip checksum checking during enabling process then we can
> get an error that isn't spotted by the checksum process.
>
> Given it can happen both ways, we should have a bit/ or not depending
> upon which is the least likely to be wrong. I would say having the bit
> provides the least likely way to get false readings.

I agree that having a bit is good.  It's not really necessary, but it
makes it easier to understand the logic, and allows us to catch and
distinguish error cases better.  Once checksums are fully enabled, all
checksums must match, bit or no bit.  But it's easier to understand
what the intermediate states look like with the bit, so I think it's
right to have it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Checksums, state of play
Next
From: Tom Lane
Date:
Subject: Re: [9.2] Confusion over CacheRegisterSyscacheCallback