Re: ECC RAM really needed? - Mailing list pgsql-performance

From mark@mark.mielke.cc
Subject Re: ECC RAM really needed?
Msg-id 20070526145214.GA21290@mark.mielke.cc
In response to Re: ECC RAM really needed?  (Michael Stone <mstone+postgres@mathom.us>)
Responses Re: ECC RAM really needed?
List pgsql-performance
On Sat, May 26, 2007 at 08:43:15AM -0400, Michael Stone wrote:
> On Fri, May 25, 2007 at 06:45:15PM -0700, Craig James wrote:
> >We're thinking of building some new servers.  We bought some a while back
> >that have ECC (error correcting) RAM, which is absurdly expensive compared
> >to the same amount of non-ECC RAM.  Does anyone have any real-life data
> >about the error rate of non-ECC RAM, and whether it matters or not?  In my
> >long career, I've never once had a computer that corrupted memory, or at
> >least I never knew if it did.
> ...because ECC RAM will correct single bit errors. FWIW, I've seen *a
> lot* of single bit errors over the years. Some systems are much better
> about reporting than others, but any system will have occasional errors.
> Also, if a stick starts to go bad you'll generally be told about it with
> ECC memory, rather than having the system just start to flake out.

First: I would use ECC RAM for a server. The memory is not
significantly more expensive.

Now that this is out of the way - I found this thread interesting because
although it talked about RAM bit errors, I haven't seen any reference to
how significant those bit errors actually are.

Quite a bit of memory is only rarely used (sent out to swap or flushed
before it is accessed), or is only read, and then only in limited ways.
For example, if searching table rows - as long as the row is not selected,
and the bit error is in a field that isn't involved in the selection
criteria, who cares if it is wrong?

So, the question then becomes, what percentage of memory is required
to be correct all of the time? I believe the bit error estimates
overstate the actual effect. Stating that a bit may be wrong once
every two weeks does not describe the effect. In my opinion, software
defects have a similar potential to cause damage.
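
To put rough numbers on that, here is a back-of-the-envelope sketch. The
"once every two weeks" rate is the figure mentioned above; the fraction of
memory that actually has to be correct is a made-up value for illustration,
not a measurement, and the function name is just something I picked.

```python
# Rough sketch: how often does a raw bit flip actually matter?
# All inputs are illustrative assumptions, not measurements.

def effective_errors_per_year(days_between_flips, fraction_that_matters):
    """Expected bit flips per year that land in memory which must be
    correct (for example, live data that is read again before it is
    overwritten or discarded)."""
    raw_flips_per_year = 365.0 / days_between_flips
    return raw_flips_per_year * fraction_that_matters

# Worst case: every flip is assumed to cause harm.
raw = effective_errors_per_year(14, 1.0)

# Hypothetical: only 10% of memory has to be correct at any given time.
guess = effective_errors_per_year(14, 0.1)

print(f"raw flips/year:     {raw:.1f}")
print(f"harmful flips/year: {guess:.1f}")
```

The point is only that the harmful rate scales with the fraction of memory
that matters, so the raw flip rate is an upper bound. The real fraction
depends entirely on the workload.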

In the last 10 years - the only problems with memory I have ever
successfully diagnosed were with cheap hardware running in a poor
environment, where the problem became quickly obvious, to the point
that the system would be unusable or the BIOS would refuse to boot
with the broken memory stick. (This paragraph represents the primary
state of many of my father's machines :-) ) Replacing the memory
stick made the problems go away.

In any case - the word 'cheap' is significant in the above paragraph.
Non-ECC RAM should be considered 'cheap' memory. It will work fine
most of the time and most people will never notice a problem.

Do you want to be the one person who does notice a problem? :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/

