On Fri, Jun 12, 2009 at 04:40:12PM -0400, Alan McKay wrote:
> Any pointers for good reading material here? Other tips?
The manuals and/or source code for your software? Stories, case studies, and
reports from others who have hit problems in similar situations?
Monitoring's job is to avert crises by letting you know things are going south
before they die completely. So you probably want to figure out the ways in
which your setup is most likely to die, make sure the critical points in those
failure scenarios are well-monitored, and make sure you understand what the
monitoring is telling you. Provided you stick with it long enough, you'll
inevitably encounter a breakdown of some kind or other, which will help you
refine your idea of which points are critical.
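
To make the "going south before they die completely" idea a bit more concrete,
here's a rough Python sketch of a two-level threshold check. It isn't tied to
any particular monitoring tool; the function name, the status strings, and the
thresholds are all made up for illustration, and it assumes a metric where
higher readings are worse.

```python
def classify(value, warn_at, crit_at):
    """Map a metric reading to a status string.

    Assumes higher readings are worse (e.g. percent of disk used).
    """
    if warn_at >= crit_at:
        raise ValueError("warn_at must be below crit_at")
    if value >= crit_at:
        return "CRITICAL"
    if value >= warn_at:
        return "WARNING"
    return "OK"


# Hypothetical example: disk usage in percent, thresholds picked arbitrarily.
status = classify(87.0, warn_at=80.0, crit_at=95.0)
```

The gap between the warning and critical levels is the time you get to react,
so where to put them depends on how fast that particular thing tends to fail.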
Apart from that, I find it's helpful to read about statistics and formal
testing, so you have some idea how confident you can be that the monitors are
accurate, that your decisions are justified, etc. But that's not everyone's
cup of tea...
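
As one small example of the kind of thing I mean: if you go back and count how
many of a check's alerts turned out to be false alarms, you can put a rough
confidence interval on its false-alarm rate. The sketch below uses the Wilson
score interval, which is just one common choice; the function name and the
counts are invented, and it assumes each alert can be treated as an
independent yes/no outcome, which real alerts often aren't.

```python
import math


def wilson_interval(hits, trials, z=1.96):
    """Approximate confidence interval for a proportion (Wilson score).

    hits   -- number of outcomes of interest (e.g. false alarms)
    trials -- total number of outcomes observed (e.g. all alerts)
    z      -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= hits <= trials:
        raise ValueError("hits must be between 0 and trials")
    p = hits / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 3 false alarms out of 200 alerts.
low, high = wilson_interval(3, 200)
```

With only a handful of observations the interval comes out wide, which is a
useful reminder not to read too much into a few weeks of data.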
- Josh / eggyknap