Server testing. - Mailing list pgsql-general
From | scott.marlowe
Subject | Server testing.
Date |
Msg-id | Pine.LNX.4.33.0212161653190.24055-100000@css120.ihs.com
In response to | Re: Total crash of my db-server (Kevin Brown <kevin@sysexperts.com>)
Responses | Re: Server testing., Re: Server testing.
List | pgsql-general
This recent thread about a server crashing got me to thinking about server acceptance testing. When you are faced with the daunting task of testing a server, you should be trying to break it. Honestly, the most common mistake I see is folks ordering a new server and simply assuming there are no problems with it. Assume all hardware is bad until you've proven to yourself otherwise. Know at what point your hardware will be brought to its knees (or worse) before your users can do that to you.

Here are a few good tests for bad hardware that I've found; if anyone else has any, please chip in. Note that not all failures are deterministic and repeatable. Some show up very seldom, or only when the server room is above 70 degrees F. It's easy to know when you've got a big problem with your hardware, but often hard to see the little ones.

The first thing I test with is compiling the Linux kernel and/or compiling PostgreSQL. Both are complex projects that stress the system fairly well. Toss in a '-j 8' setting and watch the machine chew up memory and CPU time. It's easy to write a script that basically does a make clean; make over several iterations and stores the md5sum of the build output (a rough sketch of what I mean is at the end of this message). The sums should all be the same. Set the box up to compile the Linux kernel 1000 times over the weekend, then check the md5s and see if any differ. I've seen boxes with bad memory compile the Linux kernel 10 or 20 times before generating an error. Most of the time a bad memory module is obvious, sometimes not.

memtest86 is pretty good, but it too can miss a bad memory location if the memory is right MOST of the time and only flakes out on you occasionally, so you may need to run it multiple times.

Copy HUGE files across your drive arrays and md5sum them at the beginning and the end (again, see the sketch at the end of this message). The md5sums should always match; if they don't, even just once out of hundreds of copies, your machine has a problem.

Make sure your machine can operate reliably at the temperatures it may have to experience. I've seen plenty of servers that run fine in a nice cold room (say 60 degrees F or less) but fail when the temp rises 5 or 10 degrees. A server that consistently fails at 72 degrees F is too heat sensitive to be reliable over the long haul. Remember that collecting dust and age make electronics more susceptible to heat failure, so a new server that fails at 72 might fail at 70 next year, and at 68 the year after that.

I know I'm missing lots, so feel free to join in. The two most important concepts for server acceptance testing:

1: Assume it is broken.
2: Try to prove it is broken.

That way, when it DOES work, you'll be pleasantly surprised, which is way better than assuming it works and finding out during production that your new server has issues.

An aside: many newer users get upset when they're told they must have bad hardware, because PostgreSQL just doesn't act like that. But it's true, PostgreSQL doesn't just act flaky. This reminds me of my favorite saying: "When you hear hoofbeats, don't think Zebra!" Loosely translated: when your PostgreSQL box starts acting up, don't think it's PostgreSQL's "fault", because it almost never is.
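
Here's roughly the kind of compile-loop script I'm talking about. It's just an untested sketch: it assumes you run it from the top of a Linux kernel (or PostgreSQL) source tree, and the iteration count, the -j value, and which timestamp-carrying objects to skip are all things you'll want to adjust for your own box:

#!/bin/sh
# Compile-loop burn-in sketch.  Run from the top of the source tree.
ITERATIONS=1000
for i in `seq 1 $ITERATIONS`
do
    make clean > /dev/null 2>&1
    if ! make -j 8 > make.log 2>&1 ; then
        echo "iteration $i: build FAILED" >> results.log
        continue
    fi
    # One checksum per iteration, computed over all the object files.
    # Skip objects that embed a build timestamp (e.g. the kernel's
    # init/version.o), or the sums will differ for harmless reasons.
    SUM=`find . -name '*.o' | grep -v version.o | sort | xargs md5sum | md5sum`
    echo "iteration $i: $SUM" >> results.log
done
# When it finishes, every checksum in results.log should be identical:
#   awk '{print $3}' results.log | sort | uniq -c

On a healthy box the only variation you should ever see is from a change you made yourself; a random one-off difference means it's time to start swapping parts.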
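
And a sketch of the big-file copy test. The paths are placeholders (point SRC and DST at two different drive arrays), and the file should be a good bit bigger than RAM so the copies actually hit the disks instead of the cache:

#!/bin/sh
# Big-file copy burn-in sketch.  SRC and DST are placeholders; put them
# on the arrays you actually want to exercise.
SRC=/array1/bigfile
DST=/array2/bigfile.copy
COPIES=500

ORIG=`md5sum < $SRC`
i=1
while [ $i -le $COPIES ]
do
    cp $SRC $DST
    NEW=`md5sum < $DST`
    if [ "$NEW" != "$ORIG" ]; then
        echo "copy $i: md5 mismatch -- bad hardware somewhere"
    fi
    rm -f $DST
    i=`expr $i + 1`
done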