Re: Fwd: Re: SSDD reliability - Mailing list pgsql-general

From David Boreham
Subject Re: Fwd: Re: SSDD reliability
Msg-id 4DC1EFFB.6040901@boreham.org
In response to Re: Fwd: Re: SSDD reliability  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Fwd: Re: SSDD reliability  (Scott Marlowe <scott.marlowe@gmail.com>)
Re: Fwd: Re: SSDD reliability  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-general
On 5/4/2011 6:02 PM, Greg Smith wrote:
> On 05/04/2011 03:24 PM, David Boreham wrote:
>> So if someone says that SSDs have "failed", I'll assume that they
>> suffered from Flash cell
>> wear-out unless there is compelling proof to the contrary.
>
> I've been involved in four recovery situations similar to the one
> described in that coding horror article, and zero of them were flash
> wear-out issues.  The telling sign is that the device should fail to
> read-only mode if it wears out.  That's not what I've seen happen
> though; what reports from the field are saying is that sudden,
> complete failures are the more likely event.

Sorry to harp on this (last time I promise), but I somewhat do know what
I'm talking about, and I'm quite motivated to get to the bottom of this
"SSDs fail, but not for the reason you'd suspect" syndrome (because we
want to deploy SSDs in production soon).

Here's my best theory at present: the failures ARE caused by cell
wear-out, but the SSD firmware is buggy insofar as it fails to boot up
and respond to host commands due to the wear-out state. So rather than
the expected outcome (SSD responds but is read-only), it
appears to be (and effectively is) dead. At least to my mind, this is a more
plausible explanation for the reported failures than the alternative (SSD
vendors are uniquely clueless at making basic electronics
subassemblies), especially considering the difficulty of testing the
firmware under all possible wear-out conditions.

One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so, what failure
cause was reported)?

>
> The environment inside a PC of any sort, desktop or particularly
> portable, is not a predictable environment.  Just because the drives
> should be less prone to heat and vibration issues doesn't mean
> individual components can't slide out of spec because of them.  And
> hard drive manufacturers have a giant head start at working out
> reliability bugs in that area.  You can't design that sort of issue
> out of a new product in advance; all you can do is analyze returns
> from the field, see what you screwed up, and do another design rev to
> address it.

That's not really how it works (I was the guy responsible for this
for 10 years in a prior career, so I feel somewhat qualified to argue
about it). The technology and manufacturing processes are common
across many different types of product. They either all work, or they
all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on
the exact same production lines as regular disk drives, DRAM modules,
and so on (manufacturing tends to be contracted to high-volume factories
that make all kinds of things on the same lines). The only thing that
differs between SSDs and any other electronics you'd come across is the
Flash devices themselves. However, those are used in extraordinarily high
volumes all over the place, and if there were a failure mode with the
incidence suggested by these stories, I suspect we'd be reading about it
on the front page of the WSJ.

>
> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%.  Typical measured AFR for
> mechanical drives is around 2% during their first year, spiking to 5%
> afterwards.  I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.
>
Hmm, this is speculation I don't support (that non-Intel vendors have a
10x worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them?
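
To put rough numbers on the figures quoted above, here is a minimal
Python sketch. It assumes (my assumption, not something stated in the
quoted figures) that AFR can be treated as an independent per-drive
probability of failure within one year. The function names and the
1000-drive fleet are purely illustrative.

```python
# AFR figures as quoted in this thread.
INTEL_SSD_AFR = 0.006        # 0.6%, Intel's claimed figure for IT deployments
HDD_FIRST_YEAR_AFR = 0.02    # ~2%, mechanical drives in their first year
HDD_LATER_AFR = 0.05         # ~5%, mechanical drives afterwards


def expected_failures(fleet_size: int, afr: float) -> float:
    """Expected number of failed drives in one year for a fleet."""
    return fleet_size * afr


def prob_at_least_one_failure(fleet_size: int, afr: float) -> float:
    """Chance that at least one drive in the fleet fails within a year.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - afr) ** fleet_size


def breakeven_multiplier(ssd_afr: float, hdd_afr: float) -> float:
    """How many times worse than ssd_afr a drive must be to match hdd_afr."""
    return hdd_afr / ssd_afr


if __name__ == "__main__":
    fleet = 1000  # hypothetical fleet size
    for label, afr in [
        ("Intel SSD (claimed)", INTEL_SSD_AFR),
        ("HDD, first year", HDD_FIRST_YEAR_AFR),
        ("HDD, later years", HDD_LATER_AFR),
    ]:
        print(f"{label}: ~{expected_failures(fleet, afr):.0f} failures per {fleet} drives")
    print(f"vs. first-year HDD: {breakeven_multiplier(INTEL_SSD_AFR, HDD_FIRST_YEAR_AFR):.1f}x")
    print(f"vs. later-year HDD: {breakeven_multiplier(INTEL_SSD_AFR, HDD_LATER_AFR):.1f}x")
```

On those quoted figures, another vendor's SSD would need an AFR of about
3.3 times Intel's claimed 0.6% to match the 2% first-year mechanical
figure, and about 8.3 times to match the 5% figure.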

For the benefit of anyone reading this who may have a failed SSD: all
the tier 1 manufacturers have departments dedicated to the analysis of
products that fail in the field. With some persistence, you can usually
get them to take a failed unit and put it through the FA process (and
tell you why it failed). For example, here's a job posting for someone
who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345
I'd encourage you to at least try to get your failed devices into the
failure analysis pile. If units are not returned, the manufacturer never
finds out what broke, and therefore can't fix the problem.





