Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Greg Stark
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id CAM-w4HP4OEnArGMaC8hahAAK16dub9MBC7rgW+y+H9hzL9RhVg@mail.gmail.com
In response to Re: corrupt pages detected by enabling checksums  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: corrupt pages detected by enabling checksums  (Jeff Davis <pgsql@j-davis.com>)
Re: corrupt pages detected by enabling checksums  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> If the current WAL record is corrupt and the next WAL record is in
>>> every way valid, we can potentially continue.
>>
>> That seems like a seriously bad idea.
>
> I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I think it's not possible to protect 100% against this without giving
up the checksum optimization, which would mean doing two fsyncs per
commit instead of one.

However, it is possible to reduce the window. Every time the
transaction log is synced, a different file can be updated with a
known minimum transaction log recovery point. Even if it's not synced
consistently on every transaction commit or WAL sync, it would serve as
a low water mark. Recovering to that point is not sufficient, but it is
necessary for a consistent recovery. That file could be synced lazily,
say every 10s or so, which would guarantee that any WAL corruption is
caught except in the last 10s or so of WAL traffic.
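
To make the low water mark idea concrete, here is a rough sketch in
Python rather than anything resembling the real xlog code. The Record
type, the function names, and the use of zlib.crc32 as the record
checksum are all made up for illustration. Replay stops at the first
record whose checksum fails, as it does today, and the only new step
is comparing the stopping point against the mark:

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for a WAL record (not the real on-disk format)."""
    lsn: int        # position of the record in the log
    payload: bytes
    crc: int        # checksum stored alongside the payload


def make_record(lsn: int, payload: bytes) -> Record:
    return Record(lsn, payload, zlib.crc32(payload))


def is_valid(rec: Record) -> bool:
    return zlib.crc32(rec.payload) == rec.crc


def replay_end(records: List[Record]) -> int:
    """Return the LSN of the last record replayed (0 if none).

    Replay stops at the first record that fails its checksum, even if
    later records look valid.
    """
    last = 0
    for rec in records:
        if not is_valid(rec):
            break
        last = rec.lsn
    return last


def check_recovery(records: List[Record], low_water_mark: int) -> str:
    """Classify where replay stopped relative to the lazily synced mark.

    Stopping short of the mark proves corruption. Reaching it proves
    nothing: the mark is necessary, not sufficient.
    """
    if replay_end(records) < low_water_mark:
        return "corruption"
    return "plausible-end-of-wal"


# Five records, with record 3 damaged after its checksum was computed.
wal = [make_record(i, b"rec-%d" % i) for i in range(1, 6)]
wal[2] = Record(3, b"garbage", wal[2].crc)

print(check_recovery(wal, low_water_mark=4))   # mark is past the bad record
print(check_recovery(wal, low_water_mark=2))   # mark is before the bad record
```

With the mark at 4, replay stops at 2, so the bad record cannot be the
real end of WAL. With the mark at 2, the same bad record falls inside
the unprotected window and is indistinguishable from a normal end of
WAL.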

If you're only interested in database consistency and not in lost
commits, then that file could be synced whenever a buffer write forces
an xlog flush (making a painful case even more painful). Off the top of
my head, that would be sufficient to guarantee that a corrupt xlog that
would create an inconsistent database would not be missed. I may be
missing cases involving checkpoints or the like, though.


-- 
greg


