Thread: SSDD reliability

SSDD reliability

From
Scott Ribe
Date:
Yeah, on that subject, anybody else see this:

<>

Absolutely pathetic.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: SSDD reliability

From
Scott Ribe
Date:
On May 4, 2011, at 10:50 AM, Greg Smith wrote:

> Your link didn't show up on this.

Sigh... Step 2: paste link in ;-)

<http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>


--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: SSDD reliability

From
David Boreham
Date:
On 5/4/2011 11:15 AM, Scott Ribe wrote:
>
> Sigh... Step 2: paste link in ;-)
>
> <http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
>
>
To be honest, like the article author, I'd be happy with 300+ days to
failure, IF the drives provide an accurate predictor of impending doom.
That is, if I can be notified "this drive will probably quit working in
30 days", then I'd arrange to cycle in a new drive.
The performance benefits vs rotating drives are for me worth this hassle.

OTOH if the drive says it is just fine and happy, then suddenly quits
working, that's bad.

Given the physical characteristics of the cell wear-out mechanism, I
think it should be possible to provide a reasonably accurate remaining
lifetime estimate, but so far my attempts to read this information via
SMART have failed, for the drives we have in use here.
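As a rough illustration of the kind of check described above, here is a
minimal sketch in Python. It assumes smartmontools is installed and that the
drive reports a normalized wearout value under Intel's attribute ID 233
(Media_Wearout_Indicator, which comes up later in this thread); attribute IDs
and meanings vary by vendor, and the device path and threshold are
placeholders:

#!/usr/bin/env python3
# Minimal sketch: read the SMART attribute table via smartctl and warn when
# the normalized media-wearout value drops below a chosen threshold.
# Assumes Intel-style attribute 233 (Media_Wearout_Indicator); other vendors
# use different IDs, and some drives expose nothing useful at all.
import re
import subprocess
import sys

DEVICE = "/dev/sda"   # drive to check (placeholder)
WEAROUT_ID = 233      # Intel's Media_Wearout_Indicator
WARN_BELOW = 20       # normalized value starts at 100 and counts down

def read_attributes(device):
    """Return {attribute_id: normalized_value} parsed from `smartctl -A`."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    attrs = {}
    for line in out.splitlines():
        # Attribute rows look like:
        # "233 Media_Wearout_Indicator 0x0002   100   000   000   Old_age ..."
        m = re.match(r"\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]+\s+(\d+)", line)
        if m:
            attrs[int(m.group(1))] = int(m.group(2))
    return attrs

if __name__ == "__main__":
    wear = read_attributes(DEVICE).get(WEAROUT_ID)
    if wear is None:
        sys.exit(f"{DEVICE}: no attribute {WEAROUT_ID}; vendor probably uses a different ID")
    status = "plan a replacement" if wear < WARN_BELOW else "ok"
    print(f"{DEVICE}: wearout indicator at {wear} -- {status}")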

FWIW I have a server with 481 days uptime, and 31 months operating, that
has an el-cheapo SSD for its boot/OS drive.



Re: SSDD reliability

From
Scott Ribe
Date:
On May 4, 2011, at 11:31 AM, David Boreham wrote:

> To be honest, like the article author, I'd be happy with 300+ days to failure, IF the drives provide an accurate
predictor of impending doom.

No problem with that, for a first step. ***BUT*** the failures in this article and many others I've read about are not
in high-write db workloads, so they're not write wear; they're just crappy electronics failing.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: SSDD reliability

From
Toby Corkindale
Date:
On 05/05/11 03:31, David Boreham wrote:
> On 5/4/2011 11:15 AM, Scott Ribe wrote:
>>
>> Sigh... Step 2: paste link in ;-)
>>
>> <http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
>>
> To be honest, like the article author, I'd be happy with 300+ days to
> failure, IF the drives provide an accurate predictor of impending doom.
> That is, if I can be notified "this drive will probably quit working in
> 30 days", then I'd arrange to cycle in a new drive.
> The performance benefits vs rotating drives are for me worth this hassle.
>
> OTOH if the drive says it is just fine and happy, then suddenly quits
> working, that's bad.
>
> Given the physical characteristics of the cell wear-out mechanism, I
> think it should be possible to provide a reasonably accurate remaining
> lifetime estimate, but so far my attempts to read this information via
> SMART have failed, for the drives we have in use here.

In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin graph
them.)
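
For anyone not running Munin, a sketch of the same idea in Python, run
periodically: read the normalized wearout value with smartctl and append a
timestamped reading to a log file so it can be graphed or trended later. The
attribute ID (233), device path, and log path are assumptions:

#!/usr/bin/env python3
# Minimal sketch: append a timestamped wearout reading to a CSV so the value
# can be graphed (as Munin does) or fed to a trend estimator later.
# Attribute 233, the device path, and the log path are placeholders.
import subprocess
import time

DEVICE = "/dev/sda"
LOGFILE = "/var/log/ssd_wearout.csv"   # lines of "unix_timestamp,value"

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout
for line in out.splitlines():
    fields = line.split()
    # Columns: ID, name, flag, VALUE, WORST, THRESH, type, updated, when, raw
    if fields and fields[0] == "233":
        with open(LOGFILE, "a") as f:
            f.write(f"{int(time.time())},{fields[3]}\n")
        break

Something like this would be run hourly or daily from cron, as root, since
smartctl needs raw access to the device.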

> FWIW I have a server with 481 days uptime, and 31 months operating that
> has an el-cheapo SSD for its boot/OS drive.

Likewise, I have a server with a first-gen SSD (Kingston 60GB) that has
been running constantly for over a year, without any hiccups. It runs a
few small websites, a few email lists, all of which interact with
PostgreSQL databases. Lifetime writes to the disk are close to
three-quarters of a terabyte, and despite its lack of TRIM support, the
performance is still pretty good.

I'm pretty happy!

I note that the comments on that blog post above include:

"I have shipped literally hundreds of Intel G1 and G2 SSDs to my
customers and never had a single in the field failure (save for one
drive in a laptop where the drive itself functioned fine but one of the
contacts on the SATA connector was actually flaky, probably from
vibrational damage from a lot of airplane flights, and one DOA drive). I
think you just got unlucky there."

I do have to wonder if this Portman Wills guy was somehow Doing It Wrong
to get a 100% failure rate over eight disks..

Re: SSDD reliability

From
David Boreham
Date:
On 5/4/2011 11:50 PM, Toby Corkindale wrote:
>
> In what way has the SMART read failed?
> (I get the relevant values out successfully myself, and have Munin
> graph them.)
Mis-parse :) It was my _attempts_ to read SMART that failed.
Specifically, I was able to read a table of numbers from the drive, but
none of the numbers looked particularly useful or likely to be a "time
to live" number. Similar to traditional drives, where you get this table
of numbers that are either zero or random, that you look at saying
"Huh?", all of which are flagged as "failing". Perhaps I'm using the
wrong SMART-grokking tools?

>
>
> I do have to wonder if this Portman Wills guy was somehow Doing It
> Wrong to get a 100% failure rate over eight disks..
>
There are people out there who are especially highly charged.
So if he didn't wear out the drives, the next most likely cause I'd
suspect is that he ESD zapped them.



SMART attributes for SSD (was: SSDD reliability)

From
Toby Corkindale
Date:
On 05/05/11 22:50, David Boreham wrote:
> On 5/4/2011 11:50 PM, Toby Corkindale wrote:
>>
>> In what way has the SMART read failed?
>> (I get the relevant values out successfully myself, and have Munin
>> graph them.)

> Mis-parse :) It was my _attempts_ to read SMART that failed.
> Specifically, I was able to read a table of numbers from the drive, but
> none of the numbers looked particularly useful or likely to be a "time
> to live" number. Similar to traditional drives, where you get this table
> of numbers that are either zero or random, that you look at saying
> "Huh?", all of which are flagged as "failing". Perhaps I'm using the
> wrong SMART-grokking tools?

I run:
sudo smartctl -a /dev/sda

And amongst the usual values, I also get:
232 Available_Reservd_Space 0x0002   100   048   000    Old_age   Always       -       9011683733561
233 Media_Wearout_Indicator 0x0002   100   000   000    Old_age   Always       -       0

The media wearout indicator is the useful one.

Plus some unknown attributes:
229 Unknown_Attribute       0x0002   100   000   000    Old_age   Always       -       21941823264152
234 Unknown_Attribute       0x0002   100   000   000    Old_age   Always       -       953583437830
235 Unknown_Attribute       0x0002   100   000   000    Old_age   Always       -       1476591679


I found some suggested definitions for those attributes, but they didn't
seem to match up with my values once I decoded them, so mine must be
proprietary.

-Toby
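
Once readings like the ones above are being logged over time, a rough
remaining-lifetime estimate of the sort discussed earlier in the thread can
be made by fitting a line to the normalized wearout value as it counts down
from 100. A minimal sketch, assuming the "timestamp,value" log format from
the earlier sketch; the result is only a planning number, since wear depends
on future write volume:

#!/usr/bin/env python3
# Minimal sketch: estimate days until the normalized wearout value reaches a
# floor by fitting a least-squares line to the logged readings. The log
# format ("unix_timestamp,value" per line) and the floor are assumptions.

LOGFILE = "/var/log/ssd_wearout.csv"   # placeholder path
FLOOR = 10                             # treat this normalized value as end of life

def days_remaining(logfile, floor):
    points = []
    with open(logfile) as f:
        for raw in f:
            raw = raw.strip()
            if raw:
                ts, value = raw.split(",")
                points.append((int(ts), int(value)))
    if len(points) < 2:
        return None
    # Least-squares slope of value against time; value counts down from 100.
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    if den == 0 or num >= 0:           # no time spread, or not actually decreasing
        return None
    slope = num / den                  # value units per second (negative)
    seconds_left = (floor - points[-1][1]) / slope
    return seconds_left / 86400.0

if __name__ == "__main__":
    est = days_remaining(LOGFILE, FLOOR)
    if est is None:
        print("not enough data (or no downward trend) to estimate")
    else:
        print(f"roughly {est:.0f} days until the wearout indicator reaches {FLOOR}")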