Thread: ECC RAM really needed?
We're thinking of building some new servers. We bought some a while back that have ECC (error correcting) RAM, which is absurdly expensive compared to the same amount of non-ECC RAM. Does anyone have any real-life data about the error rate of non-ECC RAM, and whether it matters or not? In my long career, I've never once had a computer that corrupted memory, or at least I never knew if it did. ECC sounds like a good idea, but is it solving a non-problem? Thanks, Craig
On Fri, May 25, 2007 at 18:45:15 -0700, Craig James <craig_james@emolecules.com> wrote:
> We're thinking of building some new servers. We bought some a while back
> that have ECC (error correcting) RAM, which is absurdly expensive compared
> to the same amount of non-ECC RAM. Does anyone have any real-life data
> about the error rate of non-ECC RAM, and whether it matters or not? In my
> long career, I've never once had a computer that corrupted memory, or at
> least I never knew if it did. ECC sounds like a good idea, but is it
> solving a non-problem?

In the past when I purchased ECC RAM it wasn't that much more expensive than non-ECC RAM. Wikipedia suggests a rule of thumb of one error per month per gigabyte, though it suggests error rates vary widely. They reference a paper that should provide you with more background.
On Fri, 25 May 2007, Bruno Wolff III wrote:
> Wikipedia suggests a rule of thumb of one error per month per gigabyte,
> though it suggests error rates vary widely. They reference a paper that
> should provide you with more background.

The paper I would recommend is http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf which is a summary of many other people's papers, and quite informative. I know I had no idea before reading it how much error rates go up with increasing altitude.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> The paper I would recommend is
> http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf
> which is a summary of many other people's papers, and quite informative.
> I know I had no idea before reading it how much error rates go up with
> increasing altitude.

Not real surprising if you figure the problem is mostly cosmic rays. Anyway, this paper says

> Even using a relatively conservative error rate (500 FIT/Mbit), a
> system with 1 GByte of RAM can expect an error every two weeks;

which should pretty much cure any idea that you want to run a server with non-ECC memory.

regards, tom lane
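For anyone who wants to check the quoted figure, here is a rough sketch of the arithmetic. It assumes the usual definition of FIT (one failure per 10^9 device-hours) and takes 1 GByte as 1024 x 8 Mbit; the function and constant names are made up for illustration and are not from the paper.

```python
# FIT = failures in time, conventionally failures per 10**9 device-hours.
HOURS_PER_FIT_UNIT = 1e9

# Assumption: 1 GByte = 1024 MByte = 8192 Mbit.
ONE_GBYTE_IN_MBIT = 1024 * 8


def hours_between_errors(fit_per_mbit: float, ram_mbit: float) -> float:
    """Mean hours between soft errors for a FIT/Mbit rate and a RAM size in Mbit."""
    total_fit = fit_per_mbit * ram_mbit  # failures per 10**9 hours for all the RAM
    return HOURS_PER_FIT_UNIT / total_fit


hours = hours_between_errors(500, ONE_GBYTE_IN_MBIT)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days between errors")
```

Under those assumptions this comes out at roughly ten days between errors, which is the same order as the "every two weeks" in the quote. The expected interval shrinks in proportion to the amount of RAM, so a box with several gigabytes would see errors correspondingly more often at the same rate.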
On Fri, May 25, 2007 at 06:45:15PM -0700, Craig James wrote:
>We're thinking of building some new servers. We bought some a while back
>that have ECC (error correcting) RAM, which is absurdly expensive compared
>to the same amount of non-ECC RAM. Does anyone have any real-life data
>about the error rate of non-ECC RAM, and whether it matters or not? In my
>long career, I've never once had a computer that corrupted memory, or at
>least I never knew if it did.

...because ECC RAM will correct single bit errors. FWIW, I've seen *a lot* of single bit errors over the years. Some systems are much better about reporting than others, but any system will have occasional errors. Also, if a stick starts to go bad you'll generally be told about it with ECC memory, rather than having the system just start to flake out.

Mike Stone
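To illustrate what "correct single bit errors" means, here is a toy sketch using the textbook Hamming(7,4) code. This is an assumption chosen for brevity, not what any particular memory controller does: real ECC modules typically use a wider code over 64-bit words with extra check bits, and also detect double-bit errors. The function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, error position 1..7, or 0 if no error seen)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return c, position


def hamming74_decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


word = hamming74_encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1  # simulate a single flipped bit in memory
fixed, position = hamming74_correct(damaged)
print(position, hamming74_decode(fixed))
```

The nonzero error position is also the sort of information a system can log, which is how a stick that is going bad gets noticed before it causes visible trouble.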
On Sat, May 26, 2007 at 08:43:15AM -0400, Michael Stone wrote:
> On Fri, May 25, 2007 at 06:45:15PM -0700, Craig James wrote:
> >We're thinking of building some new servers. We bought some a while back
> >that have ECC (error correcting) RAM, which is absurdly expensive compared
> >to the same amount of non-ECC RAM. Does anyone have any real-life data
> >about the error rate of non-ECC RAM, and whether it matters or not? In my
> >long career, I've never once had a computer that corrupted memory, or at
> >least I never knew if it did.
> ...because ECC RAM will correct single bit errors. FWIW, I've seen *a
> lot* of single bit errors over the years. Some systems are much better
> about reporting than others, but any system will have occasional errors.
> Also, if a stick starts to go bad you'll generally be told about it with
> ECC memory, rather than having the system just start to flake out.

First: I would use ECC RAM for a server. The memory is not significantly more expensive.

Now that this is out of the way - I found this thread interesting because although it talked about RAM bit errors, I haven't seen any reference to the significance of RAM bit errors.

Quite a bit of memory is only rarely used (sent out to swap or flushed before it is accessed), or used in a read-only capacity in a limited form. For example, if searching table rows - as long as the row is not selected, and the bit error is in a field that isn't involved in the selection criteria, who cares if it is wrong?

So, the question then becomes, what percentage of memory is required to be correct all of the time? I believe the estimates for bit error are high estimates with regard to actual effect. Stating that a bit may be wrong once every two weeks does not describe the effect. In my opinion, software defects have a similar potential to cause damage.

In the last 10 years - the only problems with memory I have ever successfully diagnosed were with cheap hardware running in a poor environment, where the problem became quickly obvious, to the point that the system would be unusable or the BIOS would refuse to boot with the broken memory stick. (This paragraph represents the primary state of many of my father's machines :-) ) Replacing the memory stick made the problems go away.

In any case - the word 'cheap' is significant in the above paragraph. Non-ECC RAM should be considered 'cheap' memory. It will work fine most of the time and most people will never notice a problem. Do you want to be the one person who does notice a problem? :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Neighbourhood Coder, Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them...

http://mark.mielke.cc/
On Sat, May 26, 2007 at 10:52:14AM -0400, mark@mark.mielke.cc wrote:
> Do you want to be the one person who does notice a problem? :-)

Right, and note that the time you notice the problem _may not_ be the time it happens. The problem with errors in memory (or on disk controllers, another place not to skimp in your hardware budget for database machines) is that the unnoticed failure could well write corrupted data out. It's only some time later that you notice you have the problem, when you go to look at the data and discover you have garbage.

If your data is worth storing, it's worth storing correctly, and so doing things to improve the chances of correct storage is a good idea.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.
--Jane Jacobs