Re: Block-level CRC checks - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Block-level CRC checks
Msg-id 1259668683.13774.12903.camel@ebony
In response to Re: Block-level CRC checks  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Block-level CRC checks
List pgsql-hackers
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > The way we handle torn page corruptions *hides* actual corruptions from
> > us. The frequency of true positives and false positives is important
> > here. If the false positive ratio is very small, then reporting them is
> > not a problem because of the benefit we get from having spotted the true
> > positives. Some convicted murderers didn't do it, but that is not an
> > argument for letting them all go free (without knowing the details). So
> > we need to know what the false positive ratio is before we evaluate the
> > benefit of either reporting or non-reporting possible corruption events.
> > 
> > When do you think torn pages happen? Only at crash, or other times also?
> > Do they always happen at crash? Are there ways to re-check a block that
> > has suffered a hint-related torn page issue? Are there ways to isolate
> > and minimise the reporting of false positives? Those are important
> > questions and this is not black and white.
> > 
> > If the *only* answer really is we-must-WAL-log everything, then that is
> > the answer, as an option. I suspect that there is a less strict
> > possibility, if we question our assumptions and look at the frequencies.
> > 
> > We know that I have no time to work on this; I am just trying to hold
> > open the door to a few possibilities that we have not fully considered
> > in a balanced way. And I myself am guilty of having slammed the door
> > previously. I encourage development of a way forward based upon a
> > balance of utility.
> 
> I think the problem boils down to what the user response should be to a
> corruption report.  If it is a torn page, it would be corrected and the
> user doesn't have to do anything.  If it is something that is not
> correctable, then the user has corruption and/or bad hardware. 

> I think
> the problem is that the existing proposal can't distinguish between
> these two cases so the user has no idea how to respond to the report.

If 99.5% of cases are real corruption then there is little need to
distinguish between the cases, nor much value in doing so. The
prevalence of the different error types is critical to understanding how
to respond.
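To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name and the report count of 1000 are made up for illustration, and the 50% and 5% fractions are hypothetical comparison points, not measurements; only the 99.5% figure comes from the paragraph above.

```python
def split_reports(reports: int, real_fraction: float) -> tuple[int, int]:
    """Split a count of corruption reports into (real, false_alarm).

    real_fraction is the assumed share of reports that reflect real
    corruption; it is an input to the argument, not a measured value.
    """
    real = round(reports * real_fraction)
    false_alarm = reports - real
    return real, false_alarm


if __name__ == "__main__":
    # 0.995 is the figure used in the text; the others are hypothetical.
    for fraction in (0.995, 0.5, 0.05):
        real, false_alarm = split_reports(1000, fraction)
        print(f"real fraction {fraction}: {real} real, {false_alarm} false alarms")
```

Under the 99.5% assumption, only 5 reports in 1000 would be false alarms, so treating every report as real costs little. The lower the real fraction, the more it matters to be able to tell the two cases apart.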

If a man pulls a gun on you, your first thought isn't "some people
remove guns from their jacket to polish them, so perhaps he intends to
polish it now", because when you are faced by someone with a gun the
prevalence of shootings is high, and so is the risk of dying. You make a
judgement based upon the prevalence and the risk.

That is all I am asking us to do here: make a balanced call. These
recent comments are a change in my own position, based upon evaluating
the prevalence and the risk. I ask others to consider the same line of
thought rather than a black/white assessment.

All useful detection mechanisms have non-zero false positives because we
would rather sometimes ring the bell for no reason than let bad
things through silently, as we do now.
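For anyone who wants to see how a hint-related torn page can ring the bell for no reason, here is a deliberately simplified sketch. It assumes an 8192-byte page, a checksum kept outside the checksummed body, and a write that tears exactly at the halfway point. The layout, the names and the use of zlib.crc32 are all illustrative assumptions and not PostgreSQL's actual page format or any proposed patch.

```python
import zlib

PAGE_SIZE = 8192          # assumed page size for this sketch
HALF = PAGE_SIZE // 2     # assumed point at which the write tears


def checksum(body: bytes) -> int:
    """CRC-32 of the page body (stand-in for a block-level CRC)."""
    return zlib.crc32(body)


# Page as it was last written to disk.
old_body = bytes(PAGE_SIZE)
old_crc = checksum(old_body)

# In memory, one hint bit is set in the second half of the page.
# No user data changes, and in this model the change is not WAL-logged.
modified = bytearray(old_body)
modified[HALF + 100] |= 0x01
new_body = bytes(modified)
new_crc = checksum(new_body)

# Crash mid-write: the new checksum and the first half reach disk,
# while the second half still holds the old contents.
torn_body = new_body[:HALF] + old_body[HALF:]
stored_crc = new_crc

print("checksum matches after crash:", checksum(torn_body) == stored_crc)
print("user data identical to old page:", torn_body == old_body)
```

In this model the check fails even though the page on disk is byte-for-byte the old, valid page. That is the false positive case; how often it happens relative to real corruption is the open question.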

-- Simon Riggs           www.2ndQuadrant.com


