Re: Enabling Checksums - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Enabling Checksums
Msg-id 51477984.5010008@2ndQuadrant.com
In response to Re: Enabling Checksums  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Enabling Checksums  (Daniel Farina <daniel@heroku.com>)
Re: Enabling Checksums  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 3/18/13 10:52 AM, Bruce Momjian wrote:
> With a potential 10-20% overhead, I am unclear who would enable this at
> initdb time.

If you survey people who are running PostgreSQL on "cloud" hardware, be 
it Amazon's EC2 or similar options from other vendors, you will find a 
high percentage of them would pay quite a bit of performance to make 
their storage more reliable.  To pick one common measurement for 
popularity, a Google search on "ebs corruption" returns 17 million 
hits.  To quote one of those, Baron Schwartz of Percona talking about 
MySQL on EC2:

"BTW, I have seen data corruption on EBS volumes. It’s not clear whether 
it was InnoDB’s fault (extremely unlikely IMO), the operating system’s 
fault, EBS’s fault, or something else."

http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is 
coming from.  People deploy into the cloud, their data gets corrupted, 
and no one can tell them what/why/how it happened.  And that means they 
don't even know what to change to make it better.  The only people I see 
really doing something about this problem all seem years off, and I'm 
not sure they are going to help--especially since some of them are 
targeting "enterprise" storage rather than the cloud-style installations.

> I assume a user would wait until they suspected corruption to turn it
> on, and because it is only initdb-enabled, they would have to
> dump/reload their cluster.  The open question is whether this is a
> usable feature as written, or whether we should wait until 9.4.

The reliability issues of both physical and virtual hardware are so 
widely known that many people will deploy with this on as their default 
configuration.
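
For anyone following along, turning this on is a one-time initdb 
switch; with the patch applied it looks something like this (the data 
directory path is just an example):

   initdb -D /srv/pgdata --data-checksums    # short form: -k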

If you don't trust your existing data, you can't retroactively check 
it.  A checksum of an already corrupt block is useless.  Accordingly, there 
is no use case for converting an installation with real or even 
suspected problems to a checksummed one.  If you wait until you suspect 
corruption to care about checksums, it's really too late.  There is only 
one available next step:  you must do a dump to figure out what's 
readable.  That is the spot that all of the incoming data recovery 
customers we see at 2ndQuadrant are already in when we're called.  The 
cluster is suspicious, sometimes they can get data out of it with a 
dump, and if we hack up their install we can usually recover a bit more 
than they could.

After the data from a partially corrupted database is dumped, someone 
who has just been through that pain might decide they should turn 
checksums on when they restore the dump.  When it's on, they can detect 
future damage easily at the block level when it happens, and possibly 
repair it without doing a full dump/reload.  What's implemented in the 
feature we're talking about has a good enough UI to handle this entire 
cycle I see damaged installations go through.
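
To make that concrete, the cycle I have in mind is roughly this 
(ports, paths, and file names are made up for illustration):

   pg_dumpall -p 5432 > /backup/salvaged.sql        # salvage what is still readable
   initdb -D /srv/pgdata_new --data-checksums       # new cluster, checksums on
   pg_ctl -D /srv/pgdata_new -o '-p 5433' start
   psql -p 5433 -d postgres -f /backup/salvaged.sql # reload into the checksummed cluster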

Good questions, Bruce; I don't think the reasons behind the demand for 
this feature have been highlighted very well before.  I try not to spook the 
world by talking regularly about how many corrupt PostgreSQL databases 
I've seen, but they do happen.  Right now we have two states:  "believed 
good" and "believed corrupted"--and the transitions between them are 
really nasty.  Just being able to quantify corruption would be a huge 
improvement.

Related aside, most of my regular ranting on crappy SSDs that lie about 
writes comes from a TB scale PostgreSQL install that got corrupted due 
to the write-cache flaws of the early Intel SSDs--twice.  They would 
have happily lost even the worst-case 20% of regular performance to 
avoid going down for two days each time they saw corruption, where we 
had to dump/reload to get them going again.  If the install had 
checksums, I could have figured out which blocks were damaged and 
manually fixed them, basically going on a hunt for torn pages and the 
last known good copy via full-page writes.  Without checksums, there really 
was nowhere to go with them except dump/reload.
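
As a rough sketch of what that hunt looks like--not the exact 
procedure we ran, and every path and number below is invented--once a 
checksum failure points at block N of a relation file, a known-good 
8kB copy of that page (say, from a base backup that has had the 
full-page image from WAL applied) can be copied back into place while 
the server is stopped:

   BLOCK=123456   # block number reported as damaged
   dd if=/backup/base/16384/16385 of=$PGDATA/base/16384/16385 \
      bs=8192 count=1 skip=$BLOCK seek=$BLOCK conv=notrunc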

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


