Re: Block-level CRC checks - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Block-level CRC checks |
Date | |
Msg-id | 603c8f070912010806n4ee9528fsdf89665016dd5b30@mail.gmail.com |
In response to | Re: Block-level CRC checks (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-hackers |
On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
>
>> It's not hard to imagine that when a hardware glitch happens
>> causing corruption, it also causes the system to crash. Recalculating
>> the CRCs after crash would mask the corruption.
>
> They are already masked from us, so continuing to mask those errors
> would not put us in a worse position.
>
> If we are saying that 99% of page corruptions are caused at crash time
> because of torn pages on hint bits, then only WAL logging can help us
> find the 1%. I'm not convinced that is an accurate or safe assumption
> and I'd at least like to see LOG entries showing what happened.

It may or may not be true that most page corruptions happen at crash time, but it's certainly false that they are caused at crash time *because of torn pages on hint bits*. If only part of a block is written to disk and the unwritten parts contain hint-bit changes - that's not corruption. That's design behavior. Any CRC system needs to avoid complaining about errors when that happens because otherwise people will think that their database is corrupted and their hardware is faulty when in reality it is not.

If we could find a way to put the hint bits in the same 512-byte block as the CRC, that might do it, but I'm not sure whether that is possible.

Ignoring CRC errors after a crash until we've re-CRC'd the entire database will certainly eliminate the bogus error reports, but it seems likely to mask a large percentage of legitimate errors. For example, suppose that I write 1MB of data out to disk and then don't access it for a year. During that time the data is corrupted. Then the system crashes. Upon recovery, there's no way of knowing whether hint bits on those pages were being updated at the time of the crash, so the system re-CRC's the corrupted data and declares it known good. Six months later, I try to access the data and find out that it's bad. Sucks to be me.

Now consider the following alternative scenario: I write the block to disk. Five minutes later, without an intervening crash, I read it back in and it's bad. Yeah, the system detects it.

Which is more likely? I'm not an expert on disk failure modes, but my intuition is that the first one will happen often enough to make us look silly. Is it 10%? 20%? 50%? I don't know. But ISTM that a CRC system that has no ability to determine whether a system is still "ok" post-crash is not a compelling proposition, even though it might still be able to detect some problems.

...Robert
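
To make the two failure modes above concrete, here is a rough toy model in Python. It is only a sketch and not PostgreSQL code: the 8192-byte page, the 4-byte CRC at offset 0, the assumption that a 512-byte sector is written atomically, and every function name are hypothetical choices made for illustration. It shows (a) a torn write of a hint-bit-only change producing a CRC mismatch with no hardware fault, (b) the mismatch disappearing if the hint bit sits in the same sector as the CRC, and (c) a post-crash re-CRC blessing silently corrupted data.

```python
import zlib

SECTOR = 512   # assumed atomic write unit
PAGE = 8192    # assumed page size
CRC_LEN = 4    # hypothetical layout: CRC stored in bytes 0..3 (sector 0)


def page_crc(page):
    """CRC over the page contents, excluding the CRC field itself."""
    return zlib.crc32(bytes(page[CRC_LEN:]))


def stamp(page):
    """Recompute the CRC and store it in the page header."""
    page[0:CRC_LEN] = page_crc(page).to_bytes(CRC_LEN, "little")


def verify(page):
    """Return True if the stored CRC matches the page contents."""
    return int.from_bytes(page[0:CRC_LEN], "little") == page_crc(page)


def fresh_page():
    """A stamped page filled with a repeating byte pattern."""
    page = bytearray(bytes(range(256)) * (PAGE // 256))
    stamp(page)
    return page


def torn_write(on_disk, new, sectors_written):
    """Simulate a crash mid-write: only some sectors of `new` reach disk."""
    out = bytearray(on_disk)
    for s in sectors_written:
        out[s * SECTOR:(s + 1) * SECTOR] = new[s * SECTOR:(s + 1) * SECTOR]
    return out


def demo_torn_hint(hint_offset):
    """Set one hint bit, then check the CRC after three crash outcomes:
    only the CRC sector written, only the hint sector written, nothing written.
    hint_offset should be even so the chosen bit starts out clear."""
    old = fresh_page()
    new = bytearray(old)
    new[hint_offset] |= 0x01          # hint-bit-only change
    stamp(new)
    hint_sector = hint_offset // SECTOR
    outcomes = [{0}, {hint_sector}, set()]
    return [verify(torn_write(old, new, written)) for written in outcomes]


def demo_masked_corruption():
    """Corrupt a page at rest, then re-CRC it as a post-crash pass would.
    Returns (detected before re-CRC, detected after re-CRC)."""
    disk = fresh_page()
    disk[3000] ^= 0xFF                # silent corruption while idle
    detected_before = not verify(disk)
    stamp(disk)                       # recovery re-CRCs everything
    detected_after = not verify(disk)
    return detected_before, detected_after


if __name__ == "__main__":
    print("hint bit far from CRC:", demo_torn_hint(5000))
    print("hint bit in CRC sector:", demo_torn_hint(100))
    print("corruption detected (before, after re-CRC):", demo_masked_corruption())
```

In this model the hint bit at offset 5000 lives in a different sector than the CRC, so either partial write leaves a page that fails verification even though nothing is actually wrong. With the hint bit at offset 100, the CRC and the hint change land in the same sector and the page verifies in every outcome, which is the idea floated above, subject to the same doubt about whether it is feasible in a real page layout.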