Re: Block-level CRC checks - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Block-level CRC checks
Date
Msg-id 200912011135.nB1BZgs15378@momjian.us
In response to Re: Block-level CRC checks  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Block-level CRC checks  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Simon Riggs wrote:
> The way we handle torn page corruptions *hides* actual corruptions from
> us. The frequency of true positives and false positives is important
> here. If the false positive ratio is very small, then reporting them is
> not a problem because of the benefit we get from having spotted the true
> positives. Some convicted murderers didn't do it, but that is not an
> argument for letting them all go free (without knowing the details). So
> we need to know what the false positive ratio is before we evaluate the
> benefit of either reporting or non-reporting possible corruption events.
> 
> When do you think torn pages happen? Only at crash, or other times also?
> Do they always happen at crash? Are there ways to re-check a block that
> has suffered a hint-related torn page issue? Are there ways to isolate
> and minimise the reporting of false positives? Those are important
> questions and this is not black and white.
> 
> If the *only* answer really is we-must-WAL-log everything, then that is
> the answer, as an option. I suspect that there is a less strict
> possibility, if we question our assumptions and look at the frequencies.
> 
> We know that I have no time to work on this; I am just trying to hold
> open the door to a few possibilities that we have not fully considered
> in a balanced way. And I myself am guilty of having slammed the door
> previously. I encourage development of a way forward based upon a
> balance of utility.

I think the problem boils down to what the user's response should be to
a corruption report.  If it is a torn page, it would be corrected and the
user doesn't have to do anything.  If it is something that is not
correctable, then the user has corruption and/or bad hardware.  I think
the problem is that the existing proposal can't distinguish between
these two cases, so the user has no idea how to respond to the report.
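
To illustrate the ambiguity, here is a minimal sketch, not taken from any
patch in this thread.  It assumes an 8192-byte block written as 512-byte
sectors and a CRC-32 kept alongside the block; the helper names
(torn_write, flip_bit, verify) are hypothetical.  A torn write and a
flipped bit both produce the same bare "mismatch" result, which is all a
CRC check by itself can report.

```python
import zlib

PAGE_SIZE = 8192     # assumed block size
SECTOR_SIZE = 512    # assumed atomic write unit of the disk


def checksum(page: bytes) -> int:
    """CRC-32 over the whole block (illustrative, not PostgreSQL's format)."""
    return zlib.crc32(page)


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a crash mid-write: leading sectors are new, the rest are old."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


def flip_bit(page: bytes, offset: int) -> bytes:
    """Simulate media or memory corruption by flipping one bit."""
    damaged = bytearray(page)
    damaged[offset] ^= 0x01
    return bytes(damaged)


def verify(page: bytes, stored_crc: int) -> bool:
    """Return True if the block matches the CRC recorded for it."""
    return checksum(page) == stored_crc


# The old block is all zeros; the new version changes one byte in the
# first sector and one byte in the last sector.
old_page = bytes(PAGE_SIZE)
updated = bytearray(old_page)
updated[100] = 0xFF
updated[8000] = 0xFF
new_page = bytes(updated)
stored_crc = checksum(new_page)

torn = torn_write(old_page, new_page, sectors_written=8)  # half the block
rotted = flip_bit(new_page, 4096)

torn_ok = verify(torn, stored_crc)
rotted_ok = verify(rotted, stored_crc)
print("torn page passes check:  ", torn_ok)
print("bit-flipped page passes: ", rotted_ok)
```

Both checks fail, and nothing in the result says which failure is the
harmless one.  Telling them apart would need extra information beyond
the CRC itself.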

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

