Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id CAMkU=1zX8vL8_HmJPa61XBp5uTQwEBaKoz93O1zM98x4g4rKTw@mail.gmail.com
In response to Re: corrupt pages detected by enabling checksums  (Greg Stark <stark@mit.edu>)
Responses Re: corrupt pages detected by enabling checksums
Re: corrupt pages detected by enabling checksums
List pgsql-hackers
On Fri, May 10, 2013 at 9:54 AM, Greg Stark <stark@mit.edu> wrote:
> On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> > In the case where one block is missing, how can it even reach to next record
> > to check "prev" pointer.
> > I think it can be possible when one of the record is corrupt and following
> > are okay which I think is the
> > case in which it can proceed with warning as suggested by Simon.
>
> A single WAL record can be over 24kB. The checksum covers the entire
> WAL record and if it reports corruption it can be because a chunk in
> the middle wasn't flushed to disk before the system crashed. The
> beginning of the WAL record with the length and checksum and the
> entire following record with its prev pointer might have been flushed
> but the missing block in the middle of this record means it can't be
> replayed. This would be a normal situation in case of a system crash.
>
> If you replayed the following record but not this record you would
> have an inconsistent database.
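
The torn-record case described in the quoted text can be illustrated with a toy log. This is only a sketch under stated assumptions: the record layout (length, CRC, prev offset, payload), the 8192-byte block size, the use of zlib's CRC-32, and the names make_record and read_record are all invented for illustration and do not match PostgreSQL's actual WAL format or code.

```python
import struct
import zlib

BLOCK_SIZE = 8192              # assumed disk block size for this sketch
HEADER = struct.Struct("<II")  # toy header: total length, CRC-32 of the body
PREV = struct.Struct("<Q")     # toy "prev" pointer: offset of the previous record


def make_record(prev_offset, payload):
    """Build one toy record; the CRC covers the prev pointer and the payload."""
    body = PREV.pack(prev_offset) + payload
    return HEADER.pack(HEADER.size + len(body), zlib.crc32(body)) + body


def read_record(log, offset):
    """Return (prev_offset, payload, next_offset), or None if the record is invalid."""
    if offset + HEADER.size > len(log):
        return None
    total_len, crc = HEADER.unpack_from(log, offset)
    if total_len < HEADER.size + PREV.size or offset + total_len > len(log):
        return None
    body = log[offset + HEADER.size:offset + total_len]
    if zlib.crc32(body) != crc:
        return None
    (prev_offset,) = PREV.unpack_from(body)
    return prev_offset, body[PREV.size:], offset + total_len


# Three records: a small one, one larger than 24 kB, and another small one.
rec_a = make_record(0, b"a" * 100)
off_b = len(rec_a)
rec_b = make_record(0, b"\x42" * 25000)
off_c = off_b + len(rec_b)
rec_c = make_record(off_b, b"c" * 100)
log = rec_a + rec_b + rec_c

# Simulate a crash in which the second disk block never reached disk.
# That block lies entirely inside the large record.
torn = bytearray(log)
torn[BLOCK_SIZE:2 * BLOCK_SIZE] = bytes(BLOCK_SIZE)
torn = bytes(torn)

print(read_record(torn, off_b))               # the large record fails its checksum
print(read_record(torn, off_c)[0] == off_b)   # the following record still validates
```

In this toy log the header of the large record and the whole of the following record survive, yet the large record cannot be replayed, which is the situation where replaying only the later record would leave the data inconsistent.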

I don't think we would ever want to *skip* the record and play the next one.  But if it looks like the next record is valid, we might not want to automatically open the database in a possibly inconsistent state and, in the process, overwrite the only existing copy of the WAL records that would be needed to make it consistent.  Instead, could we present the DBA with an explicit choice: open the database; try to reconstruct the corrupted record via forensic inspection so that it can be played through (I have no idea how likely it is that such an attempt would succeed); or copy the database for later inspection and then open it?
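
A minimal sketch of that decision rule, to make the proposal concrete. It is not anything PostgreSQL implements; the Action enum, the next_step function, and its three boolean inputs are hypothetical names chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay this record"
    END_OF_WAL = "treat as end of WAL and open normally"
    ASK_DBA = "stop and ask the DBA to choose"


def next_step(record_valid, following_valid, following_prev_matches):
    """Decide what recovery does at the current record (hypothetical policy).

    record_valid: the current record passes its checksum.
    following_valid: the record after it passes its own checksum.
    following_prev_matches: that record's prev pointer points at the current one.
    """
    if record_valid:
        return Action.REPLAY
    # A valid successor that points back here suggests real WAL continues
    # past the bad record, so opening the database now could lose it.
    if following_valid and following_prev_matches:
        return Action.ASK_DBA
    # Otherwise whatever follows looks like leftover data (for example from a
    # recycled file), so this is the ordinary end of WAL.
    return Action.END_OF_WAL


print(next_step(False, True, True))
print(next_step(False, True, False))
```

There is deliberately no action that skips the bad record and replays the next one.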

But based on your description, perhaps refusing to automatically restart and forcing an explicit decision would happen a lot more often, during normal crashes with no corruption, than I was thinking it would.

Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it is perfectly clear that all the following WAL belongs to the recycled file and not the current file.  They would also have to turn off any startup scripts in init.d, to make sure a rebooting server doesn't do recovery automatically and destroy evidence that way.


Cheers,

Jeff
