Thread: SSDD reliability
Yeah, on that subject, anybody else see this: <>

Absolutely pathetic.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On May 4, 2011, at 10:50 AM, Greg Smith wrote:

> Your link didn't show up on this.

Sigh... Step 2: paste link in ;-)

<http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>

-- Scott Ribe
On 5/4/2011 11:15 AM, Scott Ribe wrote:

> Sigh... Step 2: paste link in ;-)
>
> <http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>

To be honest, like the article author, I'd be happy with 300+ days to failure, IF the drives provide an accurate predictor of impending doom. That is, if I can be notified "this drive will probably quit working in 30 days", then I'd arrange to cycle in a new drive. The performance benefits vs rotating drives are for me worth this hassle.

OTOH if the drive says it is just fine and happy, then suddenly quits working, that's bad.

Given the physical characteristics of the cell wear-out mechanism, I think it should be possible to provide a reasonably accurate remaining-lifetime estimate, but so far my attempts to read this information via SMART have failed for the drives we have in use here.

FWIW I have a server with 481 days uptime, and 31 months operating, that has an el-cheapo SSD for its boot/OS drive.
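[Editorial aside: the "cycle the drive in 30 days" idea above amounts to a linear extrapolation of observed wear. This is a minimal illustrative sketch, not code from the thread; the function name and the sample figures are hypothetical.]

```python
# Hypothetical sketch: estimate remaining SSD life from a wear percentage
# observed over a known service period, assuming roughly linear wear.
# The numbers below are illustrative, not from any real drive.

def days_remaining(wear_used_pct: float, days_in_service: float) -> float:
    """Linear extrapolation: if `wear_used_pct` of rated endurance was
    consumed in `days_in_service` days, estimate days until 100% wear."""
    if wear_used_pct <= 0 or days_in_service <= 0:
        raise ValueError("need positive wear and service-time figures")
    rate_per_day = wear_used_pct / days_in_service
    return (100.0 - wear_used_pct) / rate_per_day

# Example: 20% of rated endurance consumed over 300 days.
print(round(days_remaining(20.0, 300.0)))  # 1200
```

A cron job comparing such an estimate against a 30-day threshold would give exactly the kind of advance warning discussed above, though, as the thread notes, it only covers write wear, not sudden electronics failure.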
On May 4, 2011, at 11:31 AM, David Boreham wrote:

> To be honest, like the article author, I'd be happy with 300+ days to failure, IF the drives provide an accurate predictor of impending doom.

No problem with that, for a first step. ***BUT*** the failures in this article, and many others I've read about, are not in high-write db workloads, so they're not write wear; they're just crappy electronics failing.

-- Scott Ribe
On 05/05/11 03:31, David Boreham wrote:
> On 5/4/2011 11:15 AM, Scott Ribe wrote:
>>
>> Sigh... Step 2: paste link in ;-)
>>
>> <http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
>>
> To be honest, like the article author, I'd be happy with 300+ days to
> failure, IF the drives provide an accurate predictor of impending doom.
> That is, if I can be notified "this drive will probably quit working in
> 30 days", then I'd arrange to cycle in a new drive.
> The performance benefits vs rotating drives are for me worth this hassle.
>
> OTOH if the drive says it is just fine and happy, then suddenly quits
> working, that's bad.
>
> Given the physical characteristics of the cell wear-out mechanism, I
> think it should be possible to provide a reasonably accurate remaining
> lifetime estimate, but so far my attempts to read this information via
> SMART have failed, for the drives we have in use here.

In what way has the SMART read failed? (I get the relevant values out successfully myself, and have Munin graph them.)

> FWIW I have a server with 481 days uptime, and 31 months operating that
> has an el-cheapo SSD for its boot/OS drive.

Likewise, I have a server with a first-gen SSD (Kingston 60GB) that has been running constantly for over a year, without any hiccups. It runs a few small websites and a few email lists, all of which interact with PostgreSQL databases. Lifetime writes to the disk are close to three-quarters of a terabyte, and despite its lack of TRIM support, the performance is still pretty good. I'm pretty happy!

I note that the comments of that blog post above include: "I have shipped literally hundreds of Intel G1 and G2 SSDs to my customers and never had a single in-the-field failure (save for one drive in a laptop where the drive itself functioned fine but one of the contacts on the SATA connector was actually flaky, probably from vibrational damage from a lot of airplane flights, and one DOA drive). I think you just got unlucky there."

I do have to wonder if this Portman Wills guy was somehow Doing It Wrong to get a 100% failure rate over eight disks..
On 5/4/2011 11:50 PM, Toby Corkindale wrote:

> In what way has the SMART read failed?
> (I get the relevant values out successfully myself, and have Munin
> graph them.)

Mis-parse :) It was my _attempts_ to read SMART that failed. Specifically, I was able to read a table of numbers from the drive, but none of the numbers looked particularly useful or likely to be a "time to live" number. Similar to traditional drives, where you get a table of numbers that are either zero or random, that you look at saying "Huh?", all of which are flagged as "failing". Perhaps I'm using the wrong SMART-grokking tools?

> I do have to wonder if this Portman Wills guy was somehow Doing It
> Wrong to get a 100% failure rate over eight disks..

There are people out there who are especially highly charged. So if he didn't wear out the drives, the next most likely cause I'd suspect is that he ESD-zapped them.
On 05/05/11 22:50, David Boreham wrote:
> On 5/4/2011 11:50 PM, Toby Corkindale wrote:
>>
>> In what way has the SMART read failed?
>> (I get the relevant values out successfully myself, and have Munin
>> graph them.)
> Mis-parse :) It was my _attempts_ to read SMART that failed.
> Specifically, I was able to read a table of numbers from the drive, but
> none of the numbers looked particularly useful or likely to be a "time
> to live" number. Similar to traditional drives, where you get this table
> of numbers that are either zero or random, that you look at saying
> "Huh?", all of which are flagged as "failing". Perhaps I'm using the
> wrong SMART groking tools ?

I run:

  sudo smartctl -a /dev/sda

And amongst the usual values, I also get:

  232 Available_Reservd_Space 0x0002 100 048 000 Old_age Always - 9011683733561
  233 Media_Wearout_Indicator 0x0002 100 000 000 Old_age Always - 0

The media wearout indicator is the useful one.

Plus some unknown attributes:

  229 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 21941823264152
  234 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 953583437830
  235 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 1476591679

I found some suggested definitions for those attributes, but they didn't seem to match up with my values once I decoded them, so mine must be proprietary.

-Toby
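[Editorial aside: for anyone wanting to graph these values as Toby does with Munin, here is a minimal sketch of parsing the `smartctl` attribute table shown above. The assumed column layout (ID, name, flags, normalized value, worst, threshold, type, updated, when-failed, raw value) matches the smartmontools output in the message, but vendors vary, so treat this as illustrative.]

```python
# Hedged sketch: extract normalized SMART attribute values from
# `smartctl -a` / `smartctl -A` style output. Assumes the standard
# smartmontools 10-column attribute table, as seen in the thread above.

def parse_smart_attributes(output: str) -> dict:
    """Map attribute name -> (normalized value, raw value string)."""
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have >= 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), fields[9])
    return attrs

sample = """\
232 Available_Reservd_Space 0x0002 100 048 000 Old_age Always - 9011683733561
233 Media_Wearout_Indicator 0x0002 100 000 000 Old_age Always - 0
"""
attrs = parse_smart_attributes(sample)
# On Intel drives the normalized Media_Wearout_Indicator starts at 100
# and counts down toward 1 as the NAND wears.
print(attrs["Media_Wearout_Indicator"][0])  # 100
```

In practice one would feed this from `subprocess.run(["smartctl", "-A", "/dev/sda"], ...)` and alert when the normalized wearout value drops below a chosen threshold.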