On Wed, 2006-09-20 at 15:14 +0200, Matthias.Pitzl@izb.de wrote:
> Hello Scott!
>
> Thank you for the quick answer. I'll try to check our hardware, which is a
> Compaq DL380 G4 with a battery-buffered write cache on our RAID controller.
> As the system is otherwise running stable, I think it's not the CPU or memory.
> At the moment I lean more towards a bad disk or SCSI controller, but even with
> that I don't get any messages in my logs...
> Any ideas how I could check the hardware?
Keep in mind, a single bad memory location is all it takes to cause data
corruption, so it could well be memory. CPU is less likely if the
machine is otherwise running stable.
The standard tool on x86 hardware is memtest86 (www.memtest86.com).
You'd have to schedule a maintenance window for the test, since you
basically have to take the machine down and run nothing but memtest86.
I think a few live Linux distros have it built in (FC has a memtest
label in some versions, I think).
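To give a rough idea of what a memory tester does, here is a minimal
sketch in Python. It is only an illustration of the write-a-pattern,
read-it-back idea; the function names and the choice of patterns are my
own assumptions, and it is not how memtest86 is implemented. A
user-space script like this only touches whatever memory the OS happens
to hand it, so it is no substitute for booting memtest86.

```python
def find_mismatches(buf, pattern):
    """Return (offset, value) for every byte in buf that differs from pattern."""
    return [(offset, value) for offset, value in enumerate(buf) if value != pattern]


def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern in turn, read it back, collect mismatches.

    The patterns are illustrative: all zeros, all ones, and two
    alternating-bit values.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern across the whole buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read it back and record any byte that did not hold its value.
        for offset, value in find_mismatches(buf, pattern):
            errors.append((offset, pattern, value))
    return errors


if __name__ == "__main__":
    bad = pattern_test(1024 * 1024)
    print("mismatches found:", len(bad))
```

On healthy memory this should report no mismatches. A real tester runs
many more patterns over all of physical memory, for many passes.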
My first suspicion is always memory. We ordered a batch of memory from
a very off-brand supplier, and over 75% of it tested bad. It took more
than 24 hours of testing to find some of the bad memory.
Good luck with your testing, and let us know how it goes.