Re: Application monitoring - Mailing list pgsql-admin

From Steve Crawford
Subject Re: Application monitoring
Msg-id 200311031824.12260.scrawford@pinpointresearch.com
In response to Application monitoring  (Steve Lane <slane@moyergroup.com>)
List pgsql-admin
On Monday 03 November 2003 5:19 pm, Steve Lane wrote:
> Hi all:
>
> We maintain a number of web-based applications that use postgres as
> the back end....
>
> The client responded that surely this problem of monitoring a
> database-backed web app was a known, solved problem, and wanted to
> know what other people did to solve the problem.

Do your best to anticipate holes and then plug them as found. Here is
a sad story (different OS/DB/server but same idea - this tale involves
NT/IIS/ColdFusion/MSSQL).

Back in the day at a dot-com the colo would "monitor the servers". By
monitor they meant "ping". Of course IIS had a habit of playing dead
and pings worked fine.

So the colo added a port 80 monitor to the servers. This would alarm
if a connection to port 80 was refused. Turned out, however, that IIS
could die leaving a Dr. Watson message on the screen. It would
continue accepting connections to port 80 but do nothing with them -
at least until the Dr. Watson warning was clicked at which time the
alarm would go off. Useless.

So we switched to regex testing. The monitoring system would fetch a
special page - something like /test.html - and make sure the correct
text, "IIS running", was returned.
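
A rough sketch of that kind of content check, in Python rather than
whatever the colo actually ran. The function name, the timeout and the
injectable fetch argument are my own inventions for illustration; only
the page name and the expected text come from the story above.

```python
import re
import urllib.request


def page_contains(url, pattern, timeout=10, fetch=None):
    """Return True only if the page loads AND its body matches pattern.

    `fetch` is a hypothetical hook (not part of any library) so the
    check can be exercised without a network; by default the page is
    fetched with urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
    try:
        body = fetch(url)
    except Exception:
        # Refused connection, timeout, HTTP error: all count as "down".
        return False
    # An accepted connection is not enough; the expected text must be there.
    return re.search(pattern, body) is not None
```

The point is the last line: a server that accepts the connection but
returns nothing useful fails the check, which a plain port 80 monitor
would miss.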

But it turned out that ColdFusion could die on its own. So we changed
the test page to look at /test.cf (or whatever ColdFusion used as an
extension - I don't care to remember). That page concatenated a
couple of strings and returned the result. Cool, we were much better
at trapping events.

But what about the database? We changed the ColdFusion page to run a
very simple query - something like "select 0" (see a thread from a
couple months back regarding "what's the fastest query" which had to
do with PG server monitoring) and if it got the correct result it
would return something like "db running".
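
For what it's worth, the same idea sketched in Python instead of
ColdFusion. This assumes a DB-API style connection; the function name
and the "db down" text are made up for the example, and sqlite3 is
used in the usage line only so it runs anywhere (for PG you would
pass in whatever connect call your driver provides).

```python
import sqlite3


def db_status(connect):
    """Run a trivial query and return the text the monitor looks for.

    `connect` is any zero-argument callable returning a DB-API
    connection (hypothetical parameter, chosen for this sketch).
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("select 0")
            row = cur.fetchone()
        finally:
            conn.close()
    except Exception:
        # Could not connect or could not run the query.
        return "db down"
    # Only report success if the query returned the expected value.
    if row is not None and row[0] == 0:
        return "db running"
    return "db down"


# Usage with an in-memory SQLite database standing in for the real one:
status = db_status(lambda: sqlite3.connect(":memory:"))
```

The monitor then only has to match "db running" in the page body, the
same way it matched the static page.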

We were happy with this arrangement till we discovered that _parts_ of
ColdFusion could die and the rest could run fine?!? The tests worked
most of the time but when CF "half-died" one page of the site that
pulled data from another web site would not work.

So we switched everything to a Java based app server and were able to
handle twice the load with 1/7th the machines and crashing became a
thing of the past - but I digress.

We used the same basic tests on the new server. We had a static page
served by the front-end, a simple page served by the app server, and
one that checked the database server. The colo monitors checked the
database testing page, and the others allowed for some quick-n-dirty
remote diagnosis (for example, "front end and app servers are running
but db isn't responding to the app server" could be determined in 30
seconds from any browser).
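
The diagnosis we did by eye from a browser amounts to something like
the following. This is only a sketch; the function name and the
wording of the messages are mine, and the three booleans stand for
whether each of the three test pages returned its expected text.

```python
def diagnose(front_ok, app_ok, db_ok):
    """Map the three test-page results to the first failing layer."""
    if not front_ok:
        # Static page failed: nothing behind it can be judged.
        return "front end not responding"
    if not app_ok:
        return "front end up, app server not responding"
    if not db_ok:
        return "front end and app server up, db not responding"
    return "all layers responding"
```

Checking the layers in order matters: the database page goes through
the front end and the app server, so its failure only points at the
database when the two simpler pages still work.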

In addition we automatically checked the pages from our office and I
checked from a server at home. The checks ran once per minute and 3
consecutive fails would trigger a page. I'm sure there are many
things that could have fooled us but they are rare enough that we
never saw them - the monitoring worked like a charm.
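
The "3 consecutive fails" rule is simple enough to sketch. The class
and method names below are hypothetical, and paging only once (on the
third failure, not on every failure after it) is my assumption rather
than a description of what our scripts did.

```python
class ConsecutiveFailureAlarm:
    """Signal a page after `threshold` failed checks in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Record one check result; return True when a page should go out."""
        if ok:
            # Any success resets the streak, so isolated blips never page.
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.failures == self.threshold


# Usage: feed it one result per minute.
alarm = ConsecutiveFailureAlarm(threshold=3)
results = [alarm.record(ok) for ok in
           [False, False, True, False, False, False, False]]
# results == [False, False, False, False, False, True, False]
```

With checks once per minute this means roughly three minutes of
continuous failure before anyone is woken up, which filters out the
occasional dropped request.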

Don't forget that you need to make sure the monitoring is happening.
It's easy to lose track of a well-written monitoring app when there
are no failures and only find that someone turned the monitor program
off when a real failure happens. We figured that the combination of
our monitoring along with the colo monitoring offered enough
redundancy.
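
One cheap way to notice a monitor that has been switched off is to
have it record a timestamp on every run and have something else check
that the timestamp is fresh. We relied on redundancy rather than this,
so treat the following purely as a sketch; the function name and the
180 second limit are assumptions.

```python
import time


def monitor_is_stale(last_heartbeat, max_age_seconds=180, now=None):
    """Return True if the monitor has not reported in for too long.

    `last_heartbeat` is the Unix time of the monitor's last run;
    `now` can be passed in for testing, otherwise the clock is used.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) > max_age_seconds
```

The staleness check itself should run somewhere other than the
machine doing the monitoring, otherwise both go quiet together.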

Obviously it's best if at least some of the monitoring comes from
off-site, and you should never trust a machine to monitor itself.

BTW, some of these server test pages can be used by a load balancer to
fail a server in a cluster so they are very handy for more than just
testing.

Oh, to answer your other question - the problem has been "solved". You
can pay for very expensive monitoring from a variety of third
parties.

Cheers,
Steve

