Re: Block-level CRC checks - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Block-level CRC checks |
Date | |
Msg-id | 1259668683.13774.12903.camel@ebony |
In response to | Re: Block-level CRC checks (Bruce Momjian <bruce@momjian.us>) |
Responses | Re: Block-level CRC checks |
List | pgsql-hackers |
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > The way we handle torn page corruptions *hides* actual corruptions from
> > us. The frequency of true positives and false positives is important
> > here. If the false positive ratio is very small, then reporting them is
> > not a problem because of the benefit we get from having spotted the true
> > positives. Some convicted murderers didn't do it, but that is not an
> > argument for letting them all go free (without knowing the details). So
> > we need to know what the false positive ratio is before we evaluate the
> > benefit of either reporting or non-reporting possible corruption events.
> >
> > When do you think torn pages happen? Only at crash, or other times also?
> > Do they always happen at crash? Are there ways to re-check a block that
> > has suffered a hint-related torn page issue? Are there ways to isolate
> > and minimise the reporting of false positives? Those are important
> > questions and this is not black and white.
> >
> > If the *only* answer really is we-must-WAL-log everything, then that is
> > the answer, as an option. I suspect that there is a less strict
> > possibility, if we question our assumptions and look at the frequencies.
> >
> > We know that I have no time to work on this; I am just trying to hold
> > open the door to a few possibilities that we have not fully considered
> > in a balanced way. And I myself am guilty of having slammed the door
> > previously. I encourage development of a way forward based upon a
> > balance of utility.
>
> I think the problem boils down to what the user response should be to a
> corruption report. If it is a torn page, it would be corrected and the
> user doesn't have to do anything. If it is something that is not
> correctable, then the user has corruption and/or bad hardware. I think
> the problem is that the existing proposal can't distinguish between
> these two cases so the user has no idea how to respond to the report.

If 99.5% of cases are real corruption then there is little need to
distinguish between the cases, nor much value in doing so. The
prevalence of the different error types is critical to understanding how
to respond.

If a man pulls a gun on you, your first thought isn't "some people
remove guns from their jacket to polish them, so perhaps he intends to
polish it now" because the prevalence of shootings is high, when faced
by people with guns, and the risk of dying is also high. You make a
judgement based upon the prevalence and the risk.

That is all I am asking for us to do here, make a balanced call. These
recent comments are a change in my own position, based upon evaluating
the prevalence and the risk. I ask others to consider the same line of
thought rather than a black/white assessment.

All useful detection mechanisms have non-zero false positives because we
would rather sometimes ring the bell for no reason than to let bad
things through silently, as we do now.

--
 Simon Riggs www.2ndQuadrant.com