Re: Page Checksums - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Page Checksums
Msg-id CA+TgmoZhSKAP-TN6N2ahe-+zfZn_L-T_ykVOekyuCU_Z2Kh+=Q@mail.gmail.com
In response to Re: Page Checksums  (David Fetter <david@fetter.org>)
Responses Re: Page Checksums
List pgsql-hackers
On Mon, Dec 19, 2011 at 12:07 PM, David Fetter <david@fetter.org> wrote:
> On Mon, Dec 19, 2011 at 09:34:51AM -0500, Robert Haas wrote:
>> On Mon, Dec 19, 2011 at 9:14 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> > * Aidan Van Dyk (aidan@highrise.ca) wrote:
>> >> But the scary part is you don't know how long *ago* the crash was.
>> >> Because a hint-bit-only change w/ a torn-page is a "non event" in
>> >> PostgreSQL *DESIGN*, on crash recovery, it doesn't do anything to try
>> >> and "scrub" every page in the database.
>> >
>> > Fair enough, but, could we distinguish these two cases?  In other words,
>> > would it be possible to detect if a page was torn due to a 'traditional'
>> > crash and not complain in that case, but complain if there's a CRC
>> > failure and it *doesn't* look like a torn page?
>>
>> No.
>
> Would you be so kind as to elucidate this a bit?

Well, basically, Stephen's proposal was pure hand-waving.  :-)

I don't know of any magic trick that would allow us to know whether a
CRC failure "looks like a torn page".  The only information we're
going to get is the knowledge of whether the CRC matches or not.  If
it doesn't, it's fundamentally impossible for us to know why.  We know
the page contents are not as expected - that's it!

It's been proposed before that we could examine the page, consider all
the unset hint bits that could be set, and try all combinations of
setting and clearing them to see whether any of them produce a valid
CRC.  But, as Tom has pointed out previously, that has a really quite
large chance of making a page that's *actually* been corrupted look
OK.  If you have 30 or so unset hint bits, odds are very good that
some combination will produce the 32-bit CRC you're expecting.
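
To make the size of that risk concrete, here is a rough sketch that is not
from the original thread. It assumes a standard CRC-32 (Python's zlib.crc32
as a stand-in), an 8192-byte page, and made-up hint bit positions; the
helper names are hypothetical. It relies on the fact that, for fixed-length
input, flipping a given bit changes the CRC by the same XOR delta no matter
what else is on the page, so the CRCs reachable by toggling k bits number
2^rank, where rank is the GF(2) rank of those deltas.

```python
import zlib

PAGE_SIZE = 8192  # assumption: PostgreSQL's default block size


def flip_bits(page: bytes, bit_positions) -> bytes:
    """Return a copy of page with the given bit offsets toggled."""
    buf = bytearray(page)
    for pos in bit_positions:
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


def crc_deltas(page_len: int, bit_positions) -> list:
    """XOR change in the CRC caused by flipping each bit on its own."""
    zero = bytes(page_len)
    base = zlib.crc32(zero)
    return [zlib.crc32(flip_bits(zero, [pos])) ^ base for pos in bit_positions]


def gf2_rank(vectors) -> int:
    """Rank over GF(2) of 32-bit integers, by elimination on the top bit."""
    pivots = {}  # highest set bit -> basis vector
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)


def masking_probability(page_len: int, hint_bit_positions) -> float:
    """Chance that some hint-bit combination matches an unrelated 32-bit CRC."""
    rank = gf2_rank(crc_deltas(page_len, hint_bit_positions))
    return 2.0 ** (rank - 32)


# Hypothetical positions for 30 unset hint bits scattered across the page.
hint_bits = [i * 257 + 3 for i in range(30)]
print(masking_probability(PAGE_SIZE, hint_bits))
```

Under this idealized model, 30 independent bits reach at most a quarter of
all 32-bit values, and 32 or more independent bits can reach every one of
them, so finding a matching combination says very little about whether the
page is actually intact.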

To put this another way, we currently WAL-log just about everything.
We get away with NOT WAL-logging some things when we don't care about
whether they make it to disk.  Hint bits, killed index tuple pointers,
etc. cause no harm if they don't get written out, even if some other
portion of the same page does get written out.  But as soon as you CRC
the whole page, now absolutely every single bit on that page becomes
critical data which CANNOT be lost.  IOW, it now requires the same
sort of protection that we already need for our other critical updates
- i.e. WAL logging.  Or you could introduce some completely new
mechanism that serves the same purpose, like MySQL's double-write
buffer.
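
As an illustration of why an unlogged hint-bit update becomes dangerous once
the whole page is checksummed, here is a minimal sketch. It is not
PostgreSQL's actual page layout: it assumes a 4-byte CRC stored at the start
of an 8192-byte page and 512-byte sectors that each reach disk atomically,
and stamp, verify, and torn_write are hypothetical helper names.

```python
import zlib

PAGE_SIZE = 8192      # assumption: default block size
SECTOR_SIZE = 512     # assumption: unit the disk writes atomically
CHECKSUM_BYTES = 4    # hypothetical layout: CRC kept in the first 4 bytes


def stamp(page: bytearray) -> None:
    """Compute the CRC of everything after the checksum field and store it."""
    crc = zlib.crc32(bytes(page[CHECKSUM_BYTES:]))
    page[:CHECKSUM_BYTES] = crc.to_bytes(CHECKSUM_BYTES, "little")


def verify(page: bytearray) -> bool:
    """True if the stored CRC matches the page contents."""
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "little")
    return stored == zlib.crc32(bytes(page[CHECKSUM_BYTES:]))


def torn_write(on_disk: bytearray, new_page: bytearray, sectors_written) -> bytearray:
    """Simulate a crash where only some sectors of new_page reached the disk."""
    result = bytearray(on_disk)
    for s in sectors_written:
        lo = s * SECTOR_SIZE
        result[lo:lo + SECTOR_SIZE] = new_page[lo:lo + SECTOR_SIZE]
    return result


old = bytearray(PAGE_SIZE)
stamp(old)

new = bytearray(old)
new[5000] |= 0x01     # set one "hint bit"; byte 5000 lies in sector 9
stamp(new)            # the checksum in sector 0 changes too

header_only = torn_write(old, new, sectors_written=[0])  # new CRC, old data
data_only = torn_write(old, new, sectors_written=[9])    # old CRC, new data

print(verify(old), verify(new), verify(header_only), verify(data_only))
```

Both the old and the new page verify, but either partial write fails the
check even though no tuple data was damaged. Without a checksum, the same
torn write would have been harmless, which is why the hint-bit change would
need WAL (or something like a double-write buffer) to protect it.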

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

