Thread: Fwd: Re: SSDD reliability
> No problem with that, for a first step. ***BUT*** the failures in this
> article and many others I've read about are not in high-write db
> workloads, so they're not write wear, they're just crappy electronics
> failing.

As a (lapsed) electronics design engineer, I'm suspicious of the notion that a subassembly consisting of solid-state devices surface-mounted on an FR4 substrate will fail except in very rare circumstances (and ones of great interest to the manufacturer). And I'm especially suspicious that one product category (SSDs) happens to have a much higher failure rate than all others.

Consider that an SSD is much simpler (just considering the electronics) than a traditional disk drive, and subject to less vibration and heat. One should therefore expect to see disk drives failing at the same (or a higher) rate. Even if the owner is highly statically charged, you'd expect them to destroy all categories of electronics at roughly the same rate, rather than just SSDs.

So if someone says that SSDs have "failed", I'll assume that they suffered from Flash cell wear-out unless there is compelling proof to the contrary.
On 05/04/2011 03:24 PM, David Boreham wrote:
> So if someone says that SSDs have "failed", I'll assume that they
> suffered from Flash cell wear-out unless there is compelling proof
> to the contrary.

I've been involved in four recovery situations similar to the one described in that Coding Horror article, and zero of them were flash wear-out issues. The telling sign is that the device should fail to read-only mode if it wears out. That's not what I've seen happen, though; what reports from the field are saying is that sudden, complete failures are the more likely event.

The environment inside a PC of any sort, desktop or particularly portable, is not a predictable one. Just because the drives should be less prone to heat and vibration issues doesn't mean individual components can't slide out of spec because of them. And hard drive manufacturers have a giant head start at working out reliability bugs in that area. You can't design that sort of issue out of a new product in advance; all you can do is analyze returns from the field, see what you screwed up, and do another design rev to address it. The idea that these new devices, which are extremely complicated and based on hardware that hasn't been manufactured in volume before, should be expected to have high reliability is an odd claim. I assume that any new electronics gadget has an extremely high failure rate during its first few years of volume production, particularly from a new manufacturer of that product.

Intel claims the Annual Failure Rate (AFR) on their SSDs in IT deployments (not OEM ones) is 0.6%. Typical measured AFR rates for mechanical drives are around 2% during their first year, spiking to 5% afterwards. I suspect that Intel's numbers are actually much better than the other manufacturers' here, so an SSD from anyone else can easily be less reliable than a regular hard drive still.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 5/4/2011 6:02 PM, Greg Smith wrote:
> On 05/04/2011 03:24 PM, David Boreham wrote:
>> So if someone says that SSDs have "failed", I'll assume that they
>> suffered from Flash cell wear-out unless there is compelling proof
>> to the contrary.
>
> I've been involved in four recovery situations similar to the one
> described in that Coding Horror article, and zero of them were flash
> wear-out issues. The telling sign is that the device should fail to
> read-only mode if it wears out. That's not what I've seen happen,
> though; what reports from the field are saying is that sudden,
> complete failures are the more likely event.

Sorry to harp on this (last time, I promise), but I somewhat do know what I'm talking about, and I'm quite motivated to get to the bottom of this "SSDs fail, but not for the reason you'd suspect" syndrome (because we want to deploy SSDs in production soon).

Here's my best theory at present: the failures ARE caused by cell wear-out, but the SSD firmware is buggy insofar as it fails to boot up and respond to host commands due to the wear-out state. So rather than the expected outcome (the SSD responds but has read-only behavior), it appears to be (and is) dead. At least to my mind, this is a more plausible explanation for the reported failures than the alternative (SSD vendors are uniquely clueless at making basic electronics subassemblies), especially considering the difficulty of testing the firmware under all possible wear-out conditions.

One question worth asking is: in the cases you were involved in, was manufacturer failure analysis performed (and if so, what was the failure cause reported)?

> The environment inside a PC of any sort, desktop or particularly
> portable, is not a predictable one. Just because the drives should be
> less prone to heat and vibration issues doesn't mean individual
> components can't slide out of spec because of them. And hard drive
> manufacturers have a giant head start at working out reliability bugs
> in that area. You can't design that sort of issue out of a new product
> in advance; all you can do is analyze returns from the field, see what
> you screwed up, and do another design rev to address it.

That's not really how it works (I was the guy responsible for this for 10 years in a prior career, so I feel somewhat qualified to argue about it). The technology and manufacturing processes are common across many different types of product. They either all work, or they all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on the exact same production lines as regular disk drives, DRAM modules, and so on (manufacturing tends to be contracted to high-volume factories that make all kinds of things on the same lines). The only different thing about SSDs vs. any other electronics you'd come across is the Flash devices themselves. However, those are used in extraordinarily high volumes all over the place, and if there were a failure mode with the incidence suggested by these stories, I suspect we'd be reading about it on the front page of the WSJ.

> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards. I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

Hmm, this is speculation I don't support (that non-Intel vendors have a 10x worse early failure rate). The entire industry uses very similar processes (often the same factories). One rogue vendor with a bad process...sure, but all of them??

For the benefit of anyone reading this who may have a failed SSD: all the tier 1 manufacturers have departments dedicated to the analysis of product that fails in the field. With some persistence, you can usually get them to take a failed unit and put it through the FA process (and tell you why it failed). For example, here's a job posting for someone who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345

I'd encourage you to at least try to get your failed devices into the failure analysis pile. If units are not returned, the manufacturer never finds out what broke, and therefore can't fix the problem.
On Wed, May 4, 2011 at 6:31 PM, David Boreham <david_list@boreham.org> wrote:
> The technology and manufacturing processes are common across many
> different types of product. They either all work, or they all fail.

Most of it is. But certain parts are fairly new, i.e. the controllers. It is quite possible that all these various failing drives share some long-term (~1 year) degradation issue, like the 3Gb/s SATA ports on the early Sandy Bridge chipsets. If that's the case, then the just plain up and dying thing makes some sense.
On 5/4/2011 9:06 PM, Scott Marlowe wrote:
> Most of it is. But certain parts are fairly new, i.e. the controllers.
> It is quite possible that all these various failing drives share some
> long-term (~1 year) degradation issue, like the 3Gb/s SATA ports on
> the early Sandy Bridge chipsets. If that's the case, then the just
> plain up and dying thing makes some sense.

That Intel SATA port circuit issue was an extraordinarily rare screwup.

So ok, yeah...I said that chips don't just keel over and die mid-life and you came up with the one counterexample in the history of the industry :) When I worked in the business in the '80s and '90s we had a few things like this happen, but they're very rare and typically don't escape into the wild (as Intel's pretty much didn't). If a similar problem affected SSDs, they would have been recalled and lawsuits would be underway.

SSDs are just not that different from anything else. No special voodoo technology (besides the Flash devices themselves).
On Wed, May 4, 2011 at 9:34 PM, David Boreham <david_list@boreham.org> wrote:
> That Intel SATA port circuit issue was an extraordinarily rare screwup.
>
> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry :) When I worked in the business in the '80s and '90s
> we had a few things like this happen, but they're very rare and
> typically don't escape into the wild (as Intel's pretty much didn't).
> If a similar problem affected SSDs, they would have been recalled
> and lawsuits would be underway.

Not necessarily. If there's a chip that has a 15% failure rate instead of the predicted <1%, it might not fail often enough for people to have noticed, since a user with a typically small sample might think he just got a bit unlucky, etc. Nvidia made GPUs that overheated and died by the thousands, but took 1 to 2 years to die. There WAS a lawsuit, and now, to settle it, they're offering to buy everybody who got stuck with the broken GPUs a nice single-core $279 Compaq computer, even if they bought a $4,000 workstation with one of those dodgy GPUs.

There are a lot of possibilities as to why some folks are seeing high failure rates; it'd be nice to know the cause. But we can't assume it's not an inherent problem with some part in them any more than we can assume that it is.
* Greg Smith:

> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards. I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

I'm a bit concerned with usage-dependent failures. Presumably, two SSDs in a RAID-1 configuration are worn down in the same way, and it would be rather inconvenient if they failed at the same point. With hard disks, this doesn't seem to happen; even bad batches fail pretty much randomly.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On 5/5/2011 2:36 AM, Florian Weimer wrote:
> I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
> in a RAID-1 configuration are worn down in the same way, and it would
> be rather inconvenient if they failed at the same point. With hard
> disks, this doesn't seem to happen; even bad batches fail pretty much
> randomly.

FWIW, this _can_ happen with traditional drives: we had a bunch of WD 300G VelociRaptor drives with a firmware bug related to a 32-bit counter roll-over. It hit all the drives in a machine at exactly the same time (because the counter counted from power-up). Needless to say, this was quite frustrating!
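A rough illustration of how a shared counter roll-over lines up across drives (the post doesn't say what the WD counter measured; a 32-bit milliseconds-since-power-up counter is assumed here purely for the arithmetic):

```python
# Hypothetical: assumes a 32-bit milliseconds-since-power-up counter; the
# thread does not state the actual units or resolution of the WD firmware bug.
MS_PER_DAY = 24 * 60 * 60 * 1000

rollover_days = 2 ** 32 / MS_PER_DAY
print(f"Counter wraps after {rollover_days:.1f} days of uptime")  # ~49.7 days
# Drives powered up together reach the wrap at the same instant, which is
# how every disk in a machine can misbehave at exactly the same time.
```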
On May 4, 2011, at 9:34 PM, David Boreham wrote:
> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry

Actually, any of us who really tried could probably come up with a dozen examples--more if we've been around for a while. Original design cutting corners on power regulation; final manufacturers cutting corners on specs; component manufacturers cutting corners on specs or selling outright counterfeit parts...

-- 
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 5/5/2011 8:04 AM, Scott Ribe wrote:
> Actually, any of us who really tried could probably come up with a
> dozen examples--more if we've been around for a while. Original design
> cutting corners on power regulation; final manufacturers cutting
> corners on specs; component manufacturers cutting corners on specs or
> selling outright counterfeit parts...

These are excellent examples of failure causes for electronics, but they are not counter-examples. They're unrelated to the discussion about SSD early-lifetime hard failures.
On 05/05/2011 10:35 AM, David Boreham wrote:
> On 5/5/2011 8:04 AM, Scott Ribe wrote:
>> Actually, any of us who really tried could probably come up with a
>> dozen examples--more if we've been around for a while. Original design
>> cutting corners on power regulation; final manufacturers cutting
>> corners on specs; component manufacturers cutting corners on specs or
>> selling outright counterfeit parts...
>
> These are excellent examples of failure causes for electronics, but
> they are not counter-examples. They're unrelated to the discussion
> about SSD early-lifetime hard failures.

That's really optimistic. For all we know, these problems are the latest incarnation of something like the bulging-capacitor plague of about five years ago: some part unique to SSDs, other than the flash cells, that there's a giant bad batch of.

I think your faith in PC component manufacturing is out of touch with the actual field failure rates for this stuff, which is produced with enormous cost-cutting pressure driving tolerances to the bleeding edge in many cases. The equipment of the '80s and '90s you were referring to ran slower and was more expensive, so better-quality components could be justified. The quality trend at the board and component level has been toward cheap over good in almost every case for a long time now.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, May 5, 2011 at 1:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I think your faith in PC component manufacturing is out of touch with
> the actual field failure rates for this stuff, which is produced with
> enormous cost-cutting pressure driving tolerances to the bleeding edge
> in many cases. The equipment of the '80s and '90s you were referring
> to ran slower and was more expensive, so better-quality components
> could be justified. The quality trend at the board and component level
> has been toward cheap over good in almost every case for a long time
> now.

Modern CAD tools make this more and more of an issue. You can be in a circuit design program, right-click on a component, pick from a dozen other components with lower tolerances, and get a SPICE simulation that says initial production-line failure rates will go from 0.01% to 0.02%. Multiply that by 100 components and it seems like a small change. But all it takes is one misstep and you've got a board with a theoretical production-line failure rate of 0.05% that's really 0.08%, the first-year failure rate goes from 0.5% to 2 or 3%, and the $2.00 you saved on components, times 1M units, goes right out the window.

BTW, the common term we used for things that fail due to weird and unforeseen circumstances was "P.O.M. dependent" (phase of the moon), because the failures would often cluster around operating conditions that were unobvious until you collected and collated a large enough data set. Like hard drives that have abnormally high failure rates at altitudes above 4500 ft: they seem fine until you order 1,000 for your Denver data center and they all start failing. It could be anything like that. SSDs that operate fine until they're in an environment with constant humidity below 15%, and boom, they start failing like mad. It's impossible to test for all conditions in the field, and it's quite possible that environmental factors affect some of the SSDs we've heard about. More research is necessary to say why someone would see such clustering, though.
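A minimal sketch of the yield arithmetic Scott describes, using the per-component figures from his example (the 100-part count and the assumption of independent failures are illustrative):

```python
# Board-level initial failure rate from a per-component rate, assuming
# independent failures; figures follow the example in the post above.
def board_failure_rate(per_component_rate: float, n_components: int) -> float:
    """Probability that at least one of n components fails."""
    return 1.0 - (1.0 - per_component_rate) ** n_components

for rate in (0.0001, 0.0002):  # 0.01% vs. 0.02% per component
    print(f"{rate:.2%} per part x 100 parts -> "
          f"{board_failure_rate(rate, 100):.2%} per board")
# 0.01% per part x 100 parts -> 1.00% per board
# 0.02% per part x 100 parts -> 1.98% per board
```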
On 05/04/2011 08:31 PM, David Boreham wrote:
> Here's my best theory at present: the failures ARE caused by cell
> wear-out, but the SSD firmware is buggy insofar as it fails to boot
> up and respond to host commands due to the wear-out state. So rather
> than the expected outcome (the SSD responds but has read-only
> behavior), it appears to be (and is) dead. At least to my mind, this
> is a more plausible explanation for the reported failures than the
> alternative (SSD vendors are uniquely clueless at making basic
> electronics subassemblies), especially considering the difficulty of
> testing the firmware under all possible wear-out conditions.
>
> One question worth asking is: in the cases you were involved in, was
> manufacturer failure analysis performed (and if so, what was the
> failure cause reported)?

Unfortunately not. Many of the people I deal with, particularly the ones with budgets to be early SSD adopters, are not the sort to return things that have failed to the vendor. In some of these shops, if the data can't be securely erased first, it doesn't leave the place. The idea that some trivial fix at the hardware level might bring the drive back to life, data intact, is terrifying to many businesses when drives fail hard. Your bigger point, that these could just as easily be software failures due to unexpected corner cases rather than hardware issues, is both fair to raise and even more scary.

>> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
>> mechanical drives are around 2% during their first year, spiking to 5%
>> afterwards. I suspect that Intel's numbers are actually much better
>> than the other manufacturers' here, so an SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>
> Hmm, this is speculation I don't support (that non-Intel vendors have
> a 10x worse early failure rate). The entire industry uses very similar
> processes (often the same factories). One rogue vendor with a bad
> process...sure, but all of them??

I was postulating that you only have to be 4X as bad as Intel to reach 2.4%, and then be worse than a mechanical drive for early failures. If you look at http://labs.google.com/papers/disk_failures.pdf you can see there's a 5:1 ratio in first-year AFR just between light and heavy usage of the drive. So a 4:1 ratio between the best and worst SSD manufacturers seemed possible. Plenty of us have seen particular drive models that were much more than 4X as bad as average among regular hard drives.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
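For reference, the arithmetic behind the "4X as bad as Intel" point, using only the AFR figures quoted in this thread:

```python
# AFR figures quoted earlier in the thread.
intel_ssd_afr = 0.006      # Intel's claimed SSD AFR (0.6%)
hdd_first_year_afr = 0.02  # typical mechanical drive, first year (2%)

print(f"4x Intel's AFR: {4 * intel_ssd_afr:.1%}")                    # 2.4%
print(f"Multiple needed just to match a first-year mechanical drive: "
      f"{hdd_first_year_afr / intel_ssd_afr:.1f}x")                  # 3.3x
```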
On 05/05/11 18:36, Florian Weimer wrote:
> * Greg Smith:
>
>> Intel claims the Annual Failure Rate (AFR) on their SSDs in IT
>> deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
>> mechanical drives are around 2% during their first year, spiking to 5%
>> afterwards. I suspect that Intel's numbers are actually much better
>> than the other manufacturers' here, so an SSD from anyone else can
>> easily be less reliable than a regular hard drive still.
>
> I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
> in a RAID-1 configuration are worn down in the same way, and it would
> be rather inconvenient if they failed at the same point. With hard
> disks, this doesn't seem to happen; even bad batches fail pretty much
> randomly.

Actually, I think it'll be the same as with hard disks: a batch of drives with sequential serial numbers will have a fairly similar average lifetime, but they won't pop their clogs all on the same day. (Unless there is an outside influence - see note 1.)

The wearing-out of SSDs is not as exact as people seem to think. If the drive is rated for 10,000 erase cycles, that is meant to be a MINIMUM. So most blocks will last more than that many cycles, and maybe a small number will die before reaching it. I guess it's a probability curve, engineered so that 95% or some other high percentage will outlast that count. (And SSDs have reserved blocks which are brought in to take over from failing blocks, invisibly to the end user, since the drive can always still read from a failing-to-erase block.)

Note 1:
I have seen an array that was powered on continuously for about six years, which killed half the disks when it was finally powered down, left to cool for a few hours, then started up again.
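A back-of-the-envelope sketch of the erase-cycle wear model described above. Only the 10,000-cycle rating comes from the thread; the capacity, write rate, and write-amplification figures are assumptions for illustration:

```python
# Rough wear-out lifetime estimate, assuming wear leveling spreads erases
# evenly across all blocks. Only the 10,000-cycle rating is from the thread;
# the other numbers are assumed for illustration.
capacity_gb = 160            # assumed drive capacity
rated_erase_cycles = 10_000  # MLC minimum rating mentioned above
write_amplification = 2.0    # assumed controller overhead
daily_host_writes_gb = 500   # assumed heavy database write load

total_host_writes_gb = capacity_gb * rated_erase_cycles / write_amplification
lifetime_years = total_host_writes_gb / daily_host_writes_gb / 365
print(f"~{total_host_writes_gb / 1000:.0f} TB of host writes, "
      f"about {lifetime_years:.1f} years at {daily_host_writes_gb} GB/day")
# ~800 TB of host writes, about 4.4 years at 500 GB/day
```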
BTW, I saw a news article today about a brand of SSD that claims to have the price effectiveness of MLC-type chips, but with a lifetime of 4TB of writes per day over 5 years:
http://www.storagereview.com/anobit_unveils_genesis_mlc_enterprise_ssds

which also links to:
http://www.storagereview.com/sandforce_and_ibm_promote_virtues_mlcbased_ssds_enterprise

which is similar tech - much improved erase-cycle counts on MLC. No doubt this'll be common in all SSDs in a year or so, then!
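For scale, a quick check of what that endurance claim adds up to over the rated life:

```python
# Total writes implied by the "4TB/day over 5 years" claim.
tb_per_day = 4
years = 5
total_tb = tb_per_day * 365 * years
print(f"{total_tb} TB total, i.e. roughly {total_tb / 1000:.1f} PB of writes")
# 7300 TB total, i.e. roughly 7.3 PB of writes
```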
> Note 1:
> I have seen an array that was powered on continuously for about six
> years, which killed half the disks when it was finally powered down,
> left to cool for a few hours, then started up again.

Recently we rebooted about 6 machines that had uptimes of 950+ days. The last time fsck had run on the file systems was 2006.

When stuff gets that old and has been online and under heavy load all that time, you actually get paranoid about reboots. In my newly reaffirmed opinion, at that stage reboots are at best a crap shoot. We lost several hours more to that gamble than we had budgeted for. HP is getting more of their gear back than in a usual month.

Maybe that is just life with HP.

-M
What is this talk about replicating your primary database to secondary nodes in the cloud... or is cloud computing still marketing hype?
Martin
On 19/05/11 10:50, mark wrote:
>> Note 1:
>> I have seen an array that was powered on continuously for about six
>> years, which killed half the disks when it was finally powered down,
>> left to cool for a few hours, then started up again.
>
> Recently we rebooted about 6 machines that had uptimes of 950+ days.
> The last time fsck had run on the file systems was 2006.
>
> When stuff gets that old and has been online and under heavy load all
> that time, you actually get paranoid about reboots. In my newly
> reaffirmed opinion, at that stage reboots are at best a crap shoot.
> We lost several hours more to that gamble than we had budgeted for.
> HP is getting more of their gear back than in a usual month.

I worked at one place, years ago, which had an odd policy: they had automated hard resets hit all their servers on a Friday night, every week. I thought they were mad at the time! But it does mean that people design and test their systems so that they can survive unattended resets reliably. (No one wants to get a support call at 11pm on a Friday because their server didn't come back up.)

It still seems a bit messed up, though - even if Friday night is a low-use period, it still means causing a small amount of disruption to customers, especially if a developer or sysadmin messed up and a server *doesn't* come back up.
On 05/19/2011 08:57 AM, Martin Gainty wrote:
> What is this talk about replicating your primary database to secondary
> nodes in the cloud...

Slow. You'd have to do async replication with unbounded slave lag. It'd also be very easy to get to the point where the load on the master meant that the slave could never, ever catch up, because there just wasn't enough bandwidth.

-- 
Craig Ringer
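The "never catch up" scenario is just a rate comparison; here is a minimal sketch with made-up numbers for the master's WAL generation rate and the link to the cloud replica:

```python
# If the master generates WAL faster than the link can ship it, replica lag
# grows without bound. Both rates are assumed, illustrative values.
wal_generation_mb_per_s = 20.0  # assumed write-heavy master
link_bandwidth_mb_per_s = 5.0   # assumed usable bandwidth to the cloud node

deficit = wal_generation_mb_per_s - link_bandwidth_mb_per_s
if deficit > 0:
    backlog_gb_per_hour = deficit * 3600 / 1024
    print(f"Replica falls ~{backlog_gb_per_hour:.0f} GB further behind per hour")
else:
    print("Link keeps up; replication lag stays bounded")
```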