Re: Application monitoring - Mailing list pgsql-admin
From | Steve Crawford |
---|---|
Subject | Re: Application monitoring |
Date | |
Msg-id | 200311031824.12260.scrawford@pinpointresearch.com |
In response to | Application monitoring (Steve Lane <slane@moyergroup.com>) |
List | pgsql-admin |
On Monday 03 November 2003 5:19 pm, Steve Lane wrote:
> Hi all:
>
> We maintain a number of web-based applications that use postgres as
> the back end....
>
> The client responded that surely this problem of monitoring a
> database-backed web app was a known, solved problem, and wanted to
> know what other people did to solve the problem.

Do your best to anticipate and then plug holes as found.

A sad story (different OS/DB/server but same idea - this tale involves NT/IIS/ColdFusion/MSSQL).

Back in the day at a dot-com the colo would "monitor the servers". By monitor they meant "ping". Of course IIS had a habit of playing dead while pings worked fine.

So the colo added a port 80 monitor to the servers. This would alarm if a connection to port 80 was refused. It turned out, however, that IIS could die leaving a Dr. Watson message on the screen. It would continue accepting connections on port 80 but do nothing with them - at least until the Dr. Watson warning was clicked, at which time the alarm would go off. Useless.

So we switched to regex testing. The monitoring system would request a special page - something like /test.html - and make sure the correct text, "IIS running", was returned.

But it turned out that ColdFusion could die on its own. So we changed the test page to /test.cf (or whatever ColdFusion used as an extension - I don't care to remember). That page concatenated a couple of strings and returned the result. Cool, we were much better at trapping events.

But what about the database? We changed the ColdFusion page to run a very simple query - something like "select 0" (see a thread from a couple of months back regarding "what's the fastest query", which had to do with PG server monitoring) - and if it got the correct result it would return something like "db running".

We were happy with this arrangement till we discovered that _parts_ of ColdFusion could die and the rest could run fine?!?
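A minimal sketch of the content-matching check described above, in Python rather than whatever the colo's monitor actually used. The function names, the `fetch` parameter, and the example URL are hypothetical; the idea is only that a check passes when the page both loads and contains the expected text, so a server that accepts connections but returns nothing useful still counts as down.

```python
import http.client
import urllib.request


def http_get(url, timeout):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def page_contains(url, expected, timeout=10.0, fetch=http_get):
    """Return True only if the page loads AND its body contains `expected`.

    A refused connection, a timeout, an HTTP error status, or a page with
    the wrong text all count as a failed check.
    """
    try:
        body = fetch(url, timeout)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
    return expected in body
```

Something like `page_contains("http://www.example.com/test.html", "IIS running")` would cover the static page, and the same call with the text "db running" against the app server's test page would exercise the web server, the app server, and the database in one request.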
The tests worked most of the time, but when CF "half-died" one page of the site that pulled data from another web site would not work. So we switched everything to a Java-based app server and were able to handle twice the load with 1/7th the machines, and crashing became a thing of the past - but I digress.

We used the same basic tests on the new server. We had a static page served by the front end, a simple page served by the app server, and one that checked the database server. The colo monitors checked the database testing page, and the others allowed for some quick-n-dirty remote diagnosis (a conclusion like "front end and app servers are running but the db isn't responding to the app server" could be reached in 30 seconds from any browser).

In addition we automatically checked the pages from our office, and I checked from a server at home. The checks ran once per minute and 3 consecutive failures would trigger a page.

I'm sure there are many things that could have fooled us, but they are rare enough that we never saw them - the monitoring worked like a charm.

Don't forget that you need to make sure the monitoring is happening. It's easy to lose track of a well-written monitoring app when there are no failures, and only find out that someone turned the monitor program off when a real failure happens. We figured that the combination of our monitoring along with the colo monitoring offered enough redundancy. Obviously it's best if at least some of the monitoring comes from off-site, and never trust a machine to monitor itself.

BTW, some of these server test pages can be used by a load balancer to fail a server in a cluster, so they are very handy for more than just testing.

Oh, to answer your other question - the problem has been "solved". You can pay for very expensive monitoring from a variety of third parties.

Cheers,
Steve
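For illustration, the "3 consecutive failures trigger a page" rule and the three-page diagnosis from the message above might be sketched as follows. This is an assumption-laden sketch, not the setup actually used: the class and function names are hypothetical, and firing the alert only once when the threshold is first reached (rather than on every later failure) is a design choice made here, not something stated in the message.

```python
class FailureTracker:
    """Count consecutive failed checks and signal when to page someone."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # 3 consecutive failures, as in the message
        self.consecutive = 0

    def record(self, ok):
        """Record one check result; return True exactly when the alert fires."""
        if ok:
            self.consecutive = 0  # any success resets the streak
            return False
        self.consecutive += 1
        # Fire once, at the moment the streak reaches the threshold.
        return self.consecutive == self.threshold


def diagnose(front_ok, app_ok, db_ok):
    """Turn the three test-page results into a rough diagnosis."""
    if not front_ok:
        return "front end not responding"
    if not app_ok:
        return "front end up, app server not responding"
    if not db_ok:
        return "front end and app server up, database not responding"
    return "all tiers responding"
```

A scheduler (cron, or a loop that sleeps 60 seconds) would call `record()` once per minute with the result of the database test page and send the page when it returns True; `diagnose()` mirrors the manual "check the three pages from any browser" step.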