Thread: ECC RAM really needed?

ECC RAM really needed?

From
Craig James
Date:
We're thinking of building some new servers.  We bought some a while back that have ECC (error correcting) RAM, which
is absurdly expensive compared to the same amount of non-ECC RAM.  Does anyone have any real-life data about the error
rate of non-ECC RAM, and whether it matters or not?  In my long career, I've never once had a computer that corrupted
memory, or at least I never knew if it did.  ECC sounds like a good idea, but is it solving a non-problem?

Thanks,
Craig

Re: ECC RAM really needed?

From
Bruno Wolff III
Date:
On Fri, May 25, 2007 at 18:45:15 -0700,
  Craig James <craig_james@emolecules.com> wrote:
> We're thinking of building some new servers.  We bought some a while back
> that have ECC (error correcting) RAM, which is absurdly expensive compared
> to the same amount of non-ECC RAM.  Does anyone have any real-life data
> about the error rate of non-ECC RAM, and whether it matters or not?  In my
> long career, I've never once had a computer that corrupted memory, or at
> least I never knew if it did.  ECC sounds like a good idea, but is it
> solving a non-problem?

In the past when I purchased ECC RAM it wasn't that much more expensive
than non-ECC RAM.

Wikipedia suggests a rule of thumb of one error per month per gigabyte,
though suggests error rates vary widely. They reference a paper that should
provide you with more background.

Re: ECC RAM really needed?

From
Greg Smith
Date:
On Fri, 25 May 2007, Bruno Wolff III wrote:

> Wikipedia suggests a rule of thumb of one error per month per gigabyte,
> though suggests error rates vary widely. They reference a paper that
> should provide you with more background.

The paper I would recommend is

http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf

which is a summary of many other people's papers, and quite informative.
I know I had no idea before reading it how much error rates go up with
increasing altitude.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: ECC RAM really needed?

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> The paper I would recommend is
> http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf
> which is a summary of many other people's papers, and quite informative.
> I know I had no idea before reading it how much error rates go up with
> increasing altitude.

Not real surprising if you figure the problem is mostly cosmic rays.

Anyway, this paper says

> Even using a relatively conservative error rate (500 FIT/Mbit), a
> system with 1 GByte of RAM can expect an error every two weeks;

which should pretty much cure any idea that you want to run a server
with non-ECC memory.
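
The arithmetic behind the quoted figure can be checked directly. This is a minimal sketch, assuming FIT means failures per 10^9 device-hours and that 1 GByte is 8192 Mbit (1024 MB x 8); the function name is made up for illustration and does not come from the paper.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT = failures per 10^9 device-hours


def hours_between_errors(fit_per_mbit, mbits):
    """Mean hours between soft errors for a memory of `mbits` megabits."""
    total_fit = fit_per_mbit * mbits
    return HOURS_PER_FIT_UNIT / total_fit


mbits_per_gbyte = 1024 * 8  # assumption: binary gigabyte
hours = hours_between_errors(500, mbits_per_gbyte)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days between errors")
```

Under these assumptions the interval comes out at roughly ten days, which is the same order as the paper's "every two weeks".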

            regards, tom lane

Re: ECC RAM really needed?

From
Michael Stone
Date:
On Fri, May 25, 2007 at 06:45:15PM -0700, Craig James wrote:
>We're thinking of building some new servers.  We bought some a while back
>that have ECC (error correcting) RAM, which is absurdly expensive compared
>to the same amount of non-ECC RAM.  Does anyone have any real-life data
>about the error rate of non-ECC RAM, and whether it matters or not?  In my
>long career, I've never once had a computer that corrupted memory, or at
>least I never knew if it did.

...because ECC RAM will correct single bit errors. FWIW, I've seen *a
lot* of single bit errors over the years. Some systems are much better
about reporting than others, but any system will have occasional errors.
Also, if a stick starts to go bad you'll generally be told about it with
ECC memory, rather than having the system just start to flake out.
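
To illustrate how a single flipped bit can be located and repaired, here is a toy Hamming(7,4) code. It is only a sketch of the idea: real ECC modules use wider codes (typically 72 stored bits per 64 data bits, which also detect double-bit errors), and the function names here are invented for the example.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]  # positions 1..7


def correct(word):
    """Return (data bits after correction, 1-based error position or 0)."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    pos = s1 + 2 * s2 + 4 * s3  # syndrome equals the flipped position
    if pos:
        w[pos - 1] ^= 1  # flip the bad bit back
    return [w[2], w[4], w[5], w[6]], pos


codeword = encode([1, 0, 1, 1])
codeword[4] ^= 1  # simulate a single-bit soft error at position 5
data, flipped = correct(codeword)
print(data, flipped)
```

The syndrome both repairs the word and says which bit was wrong, which is what lets a system log corrected errors and flag a stick that is starting to fail.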

Mike Stone

Re: ECC RAM really needed?

From
mark@mark.mielke.cc
Date:
On Sat, May 26, 2007 at 08:43:15AM -0400, Michael Stone wrote:
> On Fri, May 25, 2007 at 06:45:15PM -0700, Craig James wrote:
> >We're thinking of building some new servers.  We bought some a while back
> >that have ECC (error correcting) RAM, which is absurdly expensive compared
> >to the same amount of non-ECC RAM.  Does anyone have any real-life data
> >about the error rate of non-ECC RAM, and whether it matters or not?  In my
> >long career, I've never once had a computer that corrupted memory, or at
> >least I never knew if it did.
> ...because ECC RAM will correct single bit errors. FWIW, I've seen *a
> lot* of single bit errors over the years. Some systems are much better
> about reporting than others, but any system will have occasional errors.
> Also, if a stick starts to go bad you'll generally be told about it with
> ECC memory, rather than having the system just start to flake out.

First: I would use ECC RAM for a server. The memory is not
significantly more expensive.

Now that this is out of the way - I found this thread interesting because
although it talked about RAM bit errors, I haven't seen reference to the
significance of RAM bit errors.

Quite a bit of memory is only rarely used (sent out to swap or flushed
before it is accessed), or used in a read-only capacity in a limited form.
For example, if searching table rows - as long as the row is not selected,
and the bit error is in a field that isn't involved in the selection
criteria, who cares if it is wrong?

So, the question then becomes, what percentage of memory is required
to be correct all of the time? I believe the bit error estimates
overstate the actual effect. Stating that a bit may be wrong once
every two weeks says nothing about the consequences. In my opinion,
software defects have a similar potential to cause damage.
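
One rough way to frame that question is to scale the raw error rate by the fraction of memory whose contents actually matter. The sketch below assumes errors land uniformly across memory. The 244-hour interval is what 500 FIT/Mbit implies for 1 GByte (taking 1 GByte as 8192 Mbit), and the fractions are purely illustrative, not measurements.

```python
def consequential_errors_per_year(hours_between_errors, critical_fraction):
    """Expected yearly bit errors that hit memory which must be correct."""
    hours_per_year = 24 * 365
    raw_errors = hours_per_year / hours_between_errors
    return raw_errors * critical_fraction


# Hypothetical fractions of memory that must be correct at all times.
for fraction in (1.0, 0.5, 0.1):
    n = consequential_errors_per_year(244, fraction)
    print(f"critical fraction {fraction:.0%}: about {n:.1f} errors per year")
```

Even with only a tenth of memory treated as critical, this model still predicts a few consequential errors per year rather than none.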

In the last 10 years - the only problems with memory I have ever
successfully diagnosed were with cheap hardware running in a poor
environment, where the problem became quickly obvious, to the point
that the system would be unusable or the BIOS would refuse to boot
with the broken memory stick. (This paragraph represents the primary
state of many of my father's machines :-) ) Replacing the memory
stick made the problems go away.

In any case - the word 'cheap' is significant in the above paragraph.
Non-ECC RAM should be considered 'cheap' memory. It will work fine
most of the time and most people will never notice a problem.

Do you want to be the one person who does notice a problem? :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: ECC RAM really needed?

From
Andrew Sullivan
Date:
On Sat, May 26, 2007 at 10:52:14AM -0400, mark@mark.mielke.cc wrote:
> Do you want to be the one person who does notice a problem? :-)

Right, and note that the moment you notice the problem _may not_ be
the moment it happens.  The problem with errors in memory (or on disk
controllers, another place not to skimp in your hardware budget for
database machines) is that the unnoticed failure could well write
corrupted data out.  It's some time later that you notice you have
the problem, when you go to look at the data and discover you have
garbage.

If your data is worth storing, it's worth storing correctly, and so
doing things to improve the chances of correct storage is a good
idea.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.
        --Jane Jacobs