Re: [HACKERS] Checksums by default? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Checksums by default?
Date
Msg-id CAM3SWZTc+4QjysO-Op4ui8hZJro3QG0fN1MFj1NtMuVHQY1sew@mail.gmail.com
In response to Re: [HACKERS] Checksums by default?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Checksums by default?  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List pgsql-hackers
On Sat, Jan 21, 2017 at 9:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea.  I could be convinced otherwise by suitable
> evidence.  What I'm objecting to is turning them on without making
> any effort to collect such evidence.

+1

One insight Jim Gray has in the classic paper "Why Do Computers Stop
and What Can Be Done About It?" [1] is that fault-tolerant hardware is
table stakes, and so most failures are related to operator error, and
to a lesser extent software bugs. The paper is about 30 years old.

I don't recall ever seeing a checksum failure on a Heroku Postgres
database, even though they were enabled as soon as the feature became
available. I have seen a few corruption problems brought to light by
amcheck, though, all of which were due to bugs in software.
Apparently, before I joined Heroku there were real reliability
problems with the storage subsystem that Heroku Postgres runs on (it's
a pluggable storage service from a popular cloud provider -- the
"pluggable" functionality would have made it fairly novel at the
time). These problems were something that the Heroku Postgres team
dealt with about 6 years ago. However, anecdotal evidence suggests
that the reliability of the same storage system *vastly* improved
roughly a year or two later. We still occasionally lose drives, but
they seem to fail fast, in a way that lets us recover easily without
data loss. In practice, Postgres checksums do *not* seem to
catch problems. That's been my experience, at least.

Obviously every additional check helps, and it may be something we can
do without any appreciable downside. I'd like to see a benchmark.

[1] http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
-- 
Peter Geoghegan


