Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id CAMkU=1xaJYGO+8Wp_Df+f9Qc-HOFn+WSwepoCSxyUC=9iqzy4Q@mail.gmail.com
In response to Re: corrupt pages detected by enabling checksums  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.

Right, so nothing else we were doing was wrong; that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL, because
the checksum on the WAL record is broken.

This brings up a pretty frightening possibility to me, unrelated to data checksums.  If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early at the failed WAL checksum with an innocuous-looking message.  The system will start up but will be invisibly inconsistent, and it will then overwrite that portion of the WAL file which contains the old data (real data that would have been necessary for reconstruction once the corruption is finally noticed) with an end-of-recovery checkpoint record, and continue to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.  Changing timelines upon ordinary crash recovery, not just media recovery, seems excessive but also seems to be exactly what timelines were invented for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still think this would cause incorrect results, but...".  I didn't realize the significance of that until now.  It does produce incorrect query results; I was just bailing out before detecting them.  Once I specify ignore_checksum_failure=1, my test harness complains bitterly about the data not being consistent with what the Perl program knows it is supposed to be.


I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.


I've tested your patch and it seems to do the job.  It has survived far longer than unpatched ever did, with neither checksum warnings nor complaints of inconsistent data.  (I haven't analyzed the code much, just the results, and will leave the discussion of the best approach to others.)


Thanks,

Jeff
