Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id 518BF8C3.2040807@nasby.net
In response to Re: corrupt pages detected by enabling checksums  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: corrupt pages detected by enabling checksums  (Simon Riggs <simon@2ndQuadrant.com>)
Re: corrupt pages detected by enabling checksums  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 5/8/13 7:34 PM, Jeff Davis wrote:
> On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:
>> Apologies if this is a stupid question, but is this mostly an issue
>> due to torn pages? IOW, if we had a way to ensure we never see torn
>> pages, would that mean an invalid CRC on a WAL page indicated there
>> really was corruption on that page?
>>
>> Maybe it's worth putting (yet more) thought into the torn page
>> issue... :/
>
> Sort of. For data, a page is the logically-atomic unit that is expected
> to be intact. For WAL, a record is the logically-atomic unit that is
> expected to be intact.
>
> So it might be better to say that the issue for the WAL is "torn
> records". A record might be larger than a page (it can hold up to three
> full-page images in one record), but is often much smaller.
>
> We use a CRC to validate that the WAL record is fully intact. The
> concern is that, if it fails the CRC check, we *assume* that it's
> because it wasn't completely flushed yet (i.e. a "torn record"). Based
> on that assumption, neither that record nor any later record contains
> committed transactions, so we can safely consider that the end of the
> WAL (as of the crash) and bring the system up.
>
> The problem is that the assumption is not always true: a CRC failure
> could also indicate real corruption of WAL records that have been
> previously flushed successfully, and may contain committed transactions.
> That can mean we bring the system up way too early, corrupting the
> database.
>
> Unfortunately, it seems that doing any kind of validation to determine
> that we have a valid end-of-the-WAL inherently requires some kind of
> separate durable write somewhere. It would be a tiny amount of data (an
> LSN and maybe some extra crosscheck information), so I could imagine
> that would be just fine given the right hardware; but if we just write
> to disk that would be pretty bad. Ideas welcome.
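
A toy illustration of the failure mode described in the quoted text, under loud assumptions: the record layout (length, CRC-32, payload) and the names `append_record` and `replay` are made up for this sketch and are not PostgreSQL's WAL format or code. It only shows that a reader which treats the first CRC failure as end-of-WAL cannot tell a torn record from damage to an already-flushed record.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # toy header: payload length, CRC-32 of payload

def append_record(wal: bytearray, payload: bytes) -> None:
    """Append one toy record (header + payload) to the in-memory log."""
    wal += HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def replay(wal: bytes) -> list:
    """Return payloads up to the first record that fails validation."""
    out, pos = [], 0
    while pos + HEADER.size <= len(wal):
        length, crc = HEADER.unpack_from(wal, pos)
        start = pos + HEADER.size
        payload = wal[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # assumed to be a torn record, i.e. the end of the WAL
        out.append(bytes(payload))
        pos = start + length
    return out

wal = bytearray()
for p in (b"commit 1", b"commit 2", b"commit 3"):
    append_record(wal, p)

# Flip one byte in the payload of the second, already-written record.
damaged = bytearray(wal)
damaged[2 * HEADER.size + len(b"commit 1")] ^= 0xFF

print(replay(bytes(wal)))      # all three payloads
print(replay(bytes(damaged)))  # only the first: records 2 and 3 are silently dropped
```

In the damaged case the third record is still intact on "disk", but the reader never reaches it.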

What about moving some critical data from the beginning of the WAL record to the end? That would make it easier to detect that we don't have a complete record. It wouldn't necessarily replace the CRC though, so maybe that's not good enough.

Actually, what if we *duplicated* some of the WAL header info at the end of the record? Given a reasonable amount of data, that would damn-near ensure that a torn record was detected, because the odds of having the exact same sequence of random bytes would be so low. Potentially even just duplicating the LSN would suffice.
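
A minimal sketch of the duplicated-LSN idea, assuming a hypothetical record layout (header with LSN, length and CRC-32, then payload, then the LSN repeated as a trailer). The layout, the names `encode` and `classify`, and the three-way result are illustrative assumptions, not an existing PostgreSQL structure. The reading it takes is that a trailer mismatch suggests the tail of the record never reached disk, while a matching trailer with a bad CRC suggests the record was complete and damaged afterwards.

```python
import struct
import zlib

HEAD = struct.Struct("<QII")  # hypothetical header: LSN, payload length, CRC-32
TAIL = struct.Struct("<Q")    # hypothetical trailer: the same LSN again

def encode(lsn: int, payload: bytes) -> bytes:
    """Build one record with the LSN stored at both ends."""
    return HEAD.pack(lsn, len(payload), zlib.crc32(payload)) + payload + TAIL.pack(lsn)

def classify(buf: bytes) -> str:
    """Return 'ok', 'torn', or 'corrupt' for the record at the start of buf."""
    if len(buf) < HEAD.size:
        return "torn"
    lsn, length, crc = HEAD.unpack_from(buf, 0)
    end = HEAD.size + length
    if len(buf) < end + TAIL.size:
        return "torn"
    (tail_lsn,) = TAIL.unpack_from(buf, end)
    if tail_lsn != lsn:
        return "torn"      # trailer was never written (or holds stale bytes)
    if zlib.crc32(buf[HEAD.size:end]) != crc:
        return "corrupt"   # both ends agree, yet the contents changed
    return "ok"

rec = encode(0x16B3748, b"x" * 64)       # arbitrary example LSN
torn = rec[:40] + bytes(len(rec) - 40)   # only the first 40 bytes reached disk
flipped = bytearray(rec)
flipped[HEAD.size + 3] ^= 0x01           # one bit of payload damaged later

print(classify(rec), classify(torn), classify(bytes(flipped)))
```

The sketch trusts the length field in the header; if the header itself is damaged, the trailer is looked up in the wrong place and the record is reported as torn, so this would complement the CRC rather than replace it.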
 

On the separate write idea, if that could be controlled by a GUC I think it'd be worth doing. Anyone who needs to worry about this corner case probably has hardware that would support that.
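
A rough sketch of what the separate durable write could look like, assuming a small side file that records the last flushed LSN plus a CRC-32 as crosscheck. The file layout and the names `record_flushed_lsn`, `read_flushed_lsn` and `end_of_wal_is_plausible` are hypothetical; no such GUC or file exists in the source. Recovery would compare the point where replay stopped against the recorded LSN: stopping short of it means WAL that was known to be flushed could not be read.

```python
import os
import struct
import tempfile
import zlib

MARK = struct.Struct("<QI")  # hypothetical marker: flushed LSN, CRC-32 of that LSN

def record_flushed_lsn(path: str, lsn: int) -> None:
    """Durably record that WAL up to lsn has been flushed."""
    body = struct.pack("<Q", lsn)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(body + struct.pack("<I", zlib.crc32(body)))
        f.flush()
        os.fsync(f.fileno())   # the extra synchronous write discussed above
    os.replace(tmp, path)      # directory fsync omitted for brevity

def read_flushed_lsn(path: str):
    """Return the recorded LSN, or None if the marker is missing or invalid."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None
    if len(raw) != MARK.size:
        return None
    lsn, crc = MARK.unpack(raw)
    if zlib.crc32(struct.pack("<Q", lsn)) != crc:
        return None
    return lsn

def end_of_wal_is_plausible(replay_stopped_at: int, marker_path: str) -> bool:
    """False if replay stopped before WAL that was known to be flushed."""
    flushed = read_flushed_lsn(marker_path)
    # With no usable marker, fall back to trusting the CRC-based end of WAL.
    return flushed is None or replay_stopped_at >= flushed

with tempfile.TemporaryDirectory() as d:
    marker = os.path.join(d, "flushed_lsn")
    record_flushed_lsn(marker, 5000)
    stopped_early = end_of_wal_is_plausible(4200, marker)  # False
    stopped_late = end_of_wal_is_plausible(5000, marker)   # True
```

The cost is one more fsync per recorded flush, which is why it would presumably be off by default and only enabled where the hardware makes such small synchronous writes cheap.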
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


