Re: Checksums, state of play - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Checksums, state of play
Msg-id 20120306175024.GA1347@momjian.us
In response to Re: Checksums, state of play  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Checksums, state of play  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Checksums, state of play  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Mar 06, 2012 at 09:25:17AM -0500, Robert Haas wrote:
> > 2. Turning checksums on/off/on/off in rapid succession can cause false
> > positive reports of checksum failure if crashes occur and are ignored.
> > That may lead to the feature and PostgreSQL being held in disrepute.
> 
> This I do think is a problem, although not for precisely the reason
> stated here.  In my experience, in data corruption situations, the
> first thing customers do is blame PostgreSQL: they don't believe it's
> the hardware; they accuse us of having bugs in our code.  Having a
> checksum feature would be valuable, because, first, we'd perhaps
> detect problems sooner and, second, people understand what checksums
> are and that checksum failures really shouldn't happen unless the
> hardware is bad.  More generally, one of the purposes of checksums is
> to distinguish hardware failure from other possible causes of data
> corruption problems.  If there are code paths where checksum failures
> can happen despite the hardware being good, I think that the patch will
> fail to accomplish its goal of giving us confidence that the hardware
> is bad.

I think the "turning checksums on/off/on/off" issue is really a killer
problem, and obviously many of the actions needed to make it safe also
make the checksum feature itself less useful.

One crazy idea would be to have a checksum _version_ number somewhere on
the page and in pg_controldata.  When you turn on checksums, you
increment that value, and every page written with a checksum gets that
checksum version;  if you turn off checksums, we just don't check them
anymore, but the stored checksums might become incorrect due to a hint
bit write and a crash.  When you turn on checksums again, you increment
the checksum version again, and only check pages having the _new_
checksum version.
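
A minimal sketch of that idea, just to make the rule concrete.  It is
not a proposed implementation:  ControlData, Page, stamp_on_write and
should_verify are hypothetical names, and treating version 0 as "never
checksummed" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ControlData:
    """Hypothetical cluster-wide state, standing in for pg_controldata."""
    checksum_version: int = 0       # assumed: 0 means checksums were never on
    checksums_enabled: bool = False

    def enable_checksums(self) -> None:
        # Each enable starts a new generation of checksums.
        self.checksum_version += 1
        self.checksums_enabled = True

    def disable_checksums(self) -> None:
        # The version is not touched; pages just stop being verified.
        self.checksums_enabled = False


@dataclass
class Page:
    """Hypothetical page header field holding the checksum version."""
    checksum_version: int = 0


def stamp_on_write(control: ControlData, page: Page) -> None:
    # A page written while checksums are on picks up the current version.
    if control.checksums_enabled:
        page.checksum_version = control.checksum_version


def should_verify(control: ControlData, page: Page) -> bool:
    # Only pages stamped with the current version are checked.
    return (control.checksums_enabled
            and page.checksum_version == control.checksum_version)


control = ControlData()
control.enable_checksums()           # version 1
old_page = Page()
stamp_on_write(control, old_page)    # old_page carries version 1
control.disable_checksums()          # old_page's checksum may now go stale
control.enable_checksums()           # version 2
new_page = Page()
stamp_on_write(control, new_page)    # new_page carries version 2
```

Here old_page is skipped by should_verify until it is rewritten under
version 2, so a checksum that went stale while the feature was off
cannot be reported as a failure.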

Yes, this does increase the storage needed for the checksum, but I
don't see another clean option.  If you can spare one byte, that lets
you turn on checksums 255 times;  after that, you have to dump/reload
to use the checksum feature.
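
The one-byte budget can be sketched as below, assuming value 0 is
reserved for "checksums never enabled", which leaves 255 usable
versions.  next_checksum_version and MAX_CHECKSUM_VERSION are
hypothetical names.

```python
MAX_CHECKSUM_VERSION = 0xFF  # largest value that fits in one byte


def next_checksum_version(current: int) -> int:
    """Return the version for the next enable, or fail once the byte is used up."""
    if current >= MAX_CHECKSUM_VERSION:
        # No versions left; the only way forward is a dump/reload.
        raise OverflowError("checksum version exhausted; dump/reload required")
    return current + 1


def count_possible_enables() -> int:
    # Count how many times checksums can be turned on starting from 0.
    version = 0
    enables = 0
    while True:
        try:
            version = next_checksum_version(version)
        except OverflowError:
            return enables
        enables += 1
```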

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +

