Re: Fwd: Re: SSDD reliability - Mailing list pgsql-general

From Greg Smith
Subject Re: Fwd: Re: SSDD reliability
Date
Msg-id 4DC3173A.4080600@2ndQuadrant.com
In response to Re: Fwd: Re: SSDD reliability  (David Boreham <david_list@boreham.org>)
List pgsql-general
On 05/04/2011 08:31 PM, David Boreham wrote:
> Here's my best theory at present : the failures ARE caused by cell
> wear-out, but the SSD firmware is buggy in so far as it fails to boot
> up and respond to host commands due to the wear-out state. So rather
> than the expected outcome (SSD responds but has read-only behavior),
> it appears to be (and is) dead. At least to my mind, this is a more
> plausible explanation for the reported failures vs. the alternative
> (SSD vendors are uniquely clueless at making basic electronics
> subassemblies), especially considering the difficulty in testing the
> firmware under all possible wear-out conditions.
>
> One question worth asking is : in the cases you were involved in, was
> manufacturer failure analysis performed (and if so what was the
> failure cause reported?).

Unfortunately not.  Many of the people I deal with, particularly the
ones with budgets to be early SSD adopters, are not the sort to return
things that have failed to the vendor.  In some of these shops, if the
data can't be securely erased first, it doesn't leave the place.  The
idea that some trivial fix at the hardware level might bring the drive
back to life, data intact, is terrifying to many businesses when drives
fail hard.

Your bigger point, that this could just as easily be software failures
due to unexpected corner cases rather than hardware issues, is both a
fair one to raise and even more scary.

>> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for
>> mechanical drives is around 2% during their first year, spiking to 5%
>> afterwards.  I suspect that Intel's numbers are actually much better
>> than the other manufacturers here, so a SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>>
> Hmm, this is speculation I don't support (non-intel vendors have a 10x
> worse early failure rate). The entire industry uses very similar
> processes (often the same factories). One rogue vendor with a bad
> process...sure, but all of them ??
>

I was postulating that a vendor only has to be 4X as bad as Intel to
reach a 2.4% AFR, which would already make it worse than a mechanical
drive for early failures.  If you look at
http://labs.google.com/papers/disk_failures.pdf you can see there's a
5:1 ratio in first-year AFR just between light and heavy usage on the
drive.  So a 4:1 ratio between the best and worst SSD manufacturers
seemed possible.  Plenty of us have seen particular hard drive models
that were much more than 4X as bad as the average ones.
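
To make the arithmetic concrete, here is a rough sketch in Python.  It
uses only the two figures quoted in this thread (Intel's claimed 0.6%
AFR and the roughly 2% first-year AFR for mechanical drives).  The
function and variable names are made up for illustration, and the
multipliers are hypothetical, not measurements of any real vendor.

```python
# Figures quoted in this thread; names are illustrative only.
INTEL_SSD_AFR = 0.006       # Intel's claimed AFR in IT deployments (0.6%)
HDD_FIRST_YEAR_AFR = 0.02   # typical first-year AFR for mechanical drives (~2%)


def ssd_afr(multiplier, baseline=INTEL_SSD_AFR):
    """AFR of a hypothetical vendor that is `multiplier` times as bad as the baseline."""
    return multiplier * baseline


def break_even_multiplier(baseline=INTEL_SSD_AFR, hdd=HDD_FIRST_YEAR_AFR):
    """How many times worse than the baseline a vendor must be to match the hard drive AFR."""
    return hdd / baseline


# Hypothetical multipliers, chosen only to show where the crossover falls.
for m in (1, 2, 4, 10):
    afr = ssd_afr(m)
    verdict = "worse" if afr > HDD_FIRST_YEAR_AFR else "not worse"
    print(f"{m:>2}X Intel: {afr:.1%} AFR, {verdict} than a first-year mechanical drive")

print(f"Break-even multiplier: {break_even_multiplier():.1f}X")
```

Under those assumed inputs the crossover sits at about 3.3X, so the 4X
case lands just past it while a 2X vendor would still beat a first-year
mechanical drive.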

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

