Re: Block-level CRC checks - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Block-level CRC checks
Msg-id 1259657244.13774.12112.camel@ebony
In response to Re: Block-level CRC checks  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Block-level CRC checks
List pgsql-hackers
On Tue, 2009-12-01 at 10:04 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > There is no "creation" of corruption events. This scheme detects
> > corruption events that *have* occurred. Now I understand that we
> > previously would have recovered seamlessly from such events, but they
> > were corruption events nonetheless and I think they need to be reported.
> > (For why, see Conclusion #2, below).
> 
> No, you're still missing the point. The point is *not* random bit errors
> affecting hint bits, but the torn page problem. Today, a torn page is a
> completely valid and expected behavior from the OS and storage
> subsystem. We handle it with full_page_writes, and by relying on the
> fact that it's OK for a hint bit set to get lost. With your scheme, a
> torn page would become a corrupt page.

Well, it's easy to keep going on about how much you think I
misunderstand. But I think that's just misdirection.

The way we handle torn page corruptions *hides* actual corruptions from
us. The frequency of true positives and false positives is important
here. If the false positive ratio is very small, then reporting them is
not a problem because of the benefit we get from having spotted the true
positives. Some convicted murderers didn't do it, but that is not an
argument for letting them all go free (without knowing the details). So
we need to know what the false positive ratio is before we evaluate the
benefit of either reporting or non-reporting possible corruption events.
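
To make the false-positive case concrete, here is a rough sketch of how a
torn write that loses only a hint bit would trip a block-level CRC. It is an
illustration under stated assumptions, not anything taken from a patch: the
4-byte CRC prefix, the 512-byte sector size, the hint bit's position, and the
helper names (with_crc, crc_ok, torn_write) are all hypothetical.

```python
import zlib

PAGE_SIZE = 8192     # PostgreSQL's default block size
SECTOR_SIZE = 512    # assumed unit that the storage writes atomically
CRC_BYTES = 4        # hypothetical layout: CRC stored at the start of the page


def with_crc(body: bytes) -> bytes:
    """Prefix the page body with a CRC-32 computed over that body."""
    return zlib.crc32(body).to_bytes(CRC_BYTES, "little") + body


def crc_ok(page: bytes) -> bool:
    """Check the stored CRC against the body actually found on disk."""
    stored = int.from_bytes(page[:CRC_BYTES], "little")
    return stored == zlib.crc32(page[CRC_BYTES:])


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """On-disk image if only the first `sectors_written` sectors of `new` landed."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


# Old page image, as it sits on disk before the write.
body = bytearray(PAGE_SIZE - CRC_BYTES)
old_page = with_crc(bytes(body))

# New image: one hint bit set in the last sector. This change is not WAL-logged.
body[-1] |= 0x01
new_page = with_crc(bytes(body))

# Crash mid-write: the first sector (carrying the new CRC) reaches disk,
# the last sector (carrying the hint bit) does not.
torn = torn_write(old_page, new_page, sectors_written=1)

print(crc_ok(old_page), crc_ok(new_page), crc_ok(torn))
```

In this sketch the torn image carries exactly the old page body, which is
perfectly usable data, yet the CRC check fails. That is the false positive
in question; how often it occurs relative to real corruption is what the
sketch cannot tell us.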

When do you think torn pages happen? Only at crash, or other times also?
Do they always happen at crash? Are there ways to re-check a block that
has suffered a hint-related torn page issue? Are there ways to isolate
and minimise the reporting of false positives? Those are important
questions and this is not black and white.

If the *only* answer really is we-must-WAL-log everything, then that is
the answer, as an option. I suspect that there is a less strict
possibility, if we question our assumptions and look at the frequencies.

We know that I have no time to work on this; I am just trying to hold
open the door to a few possibilities that we have not fully considered
in a balanced way. And I myself am guilty of having slammed the door
previously. I encourage development of a way forward based upon a
balance of utility.

-- Simon Riggs           www.2ndQuadrant.com


