Server testing. - Mailing list pgsql-general

From scott.marlowe
Subject Server testing.
Date
Msg-id Pine.LNX.4.33.0212161653190.24055-100000@css120.ihs.com
In response to Re: Total crash of my db-server  (Kevin Brown <kevin@sysexperts.com>)
List pgsql-general
This recent thread about a server crashing got me to thinking of server
acceptance testing.

When you are faced with the daunting task of testing a server, you should
be trying to break it.  Honestly, the most common mistake I see is folks
ordering a new server and simply assuming there are no problems with it.
Assume all hardware is bad until you've proven to yourself otherwise.  Know
at what point your hardware will be brought to its knees (or worse) before
your users can do that to you.

Here are a few good tests for bad hardware that I've found.  If anyone else
has any, please chip in.  Note that not all failures are deterministic and
repeatable.  Some show up very seldom, or only when the server room is
above 70 degrees F.  It's easy to know when you've got a big problem with
your hardware, but often hard to see the little ones.

The first thing I test with is compiling the Linux kernel AND / OR
compiling PostgreSQL.  Both are complex projects that stress the system
fairly well.  Toss in a '-j 8' setting and watch the machine chew up
memory and CPU time.

It's easy to write a script that basically does a make clean;make over
several iterations and stores the md5sum of the build output each time.
They should all be the same.  Set the box up to compile the Linux kernel
1000 times over the weekend.  Check the md5s and see if any differ.
I've seen boxes with bad memory compile the Linux kernel 10 or 20 times
before generating an error.  Most of the time a bad memory module is
obvious, but sometimes not.
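
For anyone who wants a starting point, here is a rough sketch of that
build loop in Python.  It is only one way to do it, and it makes some
assumptions: the names (file_md5, build_once, odd_ones_out, run_burn_in)
are made up for this example, and it hashes a single built file rather
than make's console output, since the order of '-j 8' output can vary
from run to run.  Some builds also embed timestamps, so pick an artifact
you have confirmed comes out identical across runs on known-good
hardware.

```python
import hashlib
import os
import subprocess
from collections import Counter


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_once(source_dir, artifact, jobs=8):
    """Run 'make clean' then 'make -j N'; return the artifact's MD5.

    'artifact' is a path relative to source_dir (an assumption of this
    sketch).  A failed build is recorded as a marker string instead of a
    digest, because a compiler crash is itself a sign of trouble.
    """
    try:
        subprocess.run(["make", "clean"], cwd=source_dir, check=True)
        subprocess.run(["make", "-j", str(jobs)], cwd=source_dir, check=True)
    except subprocess.CalledProcessError as err:
        return f"build-failed:{err.returncode}"
    return file_md5(os.path.join(source_dir, artifact))


def odd_ones_out(digests):
    """Return (iteration, digest) pairs that differ from the most common digest."""
    if not digests:
        return []
    expected, _ = Counter(digests).most_common(1)[0]
    return [(i, d) for i, d in enumerate(digests, 1) if d != expected]


def run_burn_in(source_dir, artifact, iterations):
    """Build repeatedly and report every iteration whose result stands out."""
    digests = [build_once(source_dir, artifact) for _ in range(iterations)]
    for i, d in odd_ones_out(digests):
        print(f"iteration {i}: unexpected result {d}")
    return digests
```

Call run_burn_in() with your source directory, the artifact path and an
iteration count, and leave it running over the weekend.  Anything it
prints is worth chasing down.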

memtest86 is pretty good.  It, too, can miss a bad memory location if the
memory is right MOST of the time but sometimes flakes out on you, so you
may need to run it multiple times.

Copy HUGE files across your drive arrays, and md5sum them at the beginning
and end.  The md5sums should always match.  If they don't match, even just
once out of hundreds of copies, your machine has a problem.
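
The copy test can be scripted the same way.  This is a minimal sketch,
not a finished tool: copy_and_verify and the copy-NNNN.bin file names are
invented for the example.  Use a source file larger than RAM if you can,
so that re-reading each copy actually comes off the disks instead of the
OS cache.

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, scratch_dir, copies=100):
    """Copy 'source' into 'scratch_dir' repeatedly and check every copy.

    Returns the paths whose MD5 differs from the digest taken at the
    start.  Good copies are deleted to save space; bad ones are kept so
    they can be compared against the original.
    """
    expected = file_md5(source)
    bad = []
    for i in range(copies):
        target = os.path.join(scratch_dir, f"copy-{i:04d}.bin")
        shutil.copyfile(source, target)
        if file_md5(target) != expected:
            bad.append(target)
        else:
            os.remove(target)
    # Hash the source once more at the end: it should not have changed.
    if file_md5(source) != expected:
        bad.append(source)
    return bad
```

An empty list back from copy_and_verify() is what you want to see.  Even
one entry out of hundreds of copies means something is wrong.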

Make sure your machine can operate reliably at the temperatures it may have
to experience.  I've seen plenty of servers that run fine in a nice cold
room (say 60 degrees F or less) but failed when the temp rose 5 or 10
degrees.  A server that fails at 72 degrees F consistently is too heat
sensitive to be reliable over the long haul.  Remember that collecting
dust and age make electronics more susceptible to heat failure, so a
new server that fails at 72 might fail at 70 next year, and 68 the year
after that.

I know I'm missing lots, so feel free to join in.

The two most important concepts for server acceptance testing:

1:  Assume it is broken.
2:  Try to prove it is broken.

That way, when it DOES work, you'll be pleasantly surprised, which is way
better than assuming it works and finding out during production that your
new server has issues.

An aside:  Many newer users get upset when they get told they must have
bad hardware, because PostgreSQL just doesn't act like that.  But it's
true, PostgreSQL doesn't just act flaky.

This reminds me of my favorite saying:  "When you hear hoofbeats, don't
think Zebra!"  Loosely translated, when your PostgreSQL box starts acting
up, don't think it's PostgreSQL's "fault" because it almost never is.

