Re: Infrastructure monitoring - Mailing list pgsql-www
From: Magnus Hagander
Subject: Re: Infrastructure monitoring
Msg-id: 6BCB9D8A16AC4241919521715F4D8BCE92E9BA@algol.sollentuna.se
In response to: Infrastructure monitoring ("Jim C. Nasby" <jnasby@pervasive.com>)
List: pgsql-www
> >> Search has been down for at least 2 days now, and this certainly
> >> isn't the first time it's happened. There's also been cases of
> >> archives getting stuck, and probably other outages besides those
> >> that went on until someone email'd about it.
> >>
> >> Would it be difficult to setup something to monitor these various
> >> services? I know there's at least one OSS tool to do it, though I
> >> have no idea how hard it would be to tie that into the current
> >> infrastructure.
> >
> > We have an open offer of Hyperic licenses, and they support
> > FreeBSD now.
>
> Not to discount the offer ... but, what exactly would that
> provide us? We already monitor the *servers*, its what is
> inside of the servers that needs better monitoring ...
> knowing nothing about Hyperic, does that provide something for that?

I assume you are talking about the Nagios monitoring? Or are there
perhaps even now multiple sets of monitoring? (Dave has a Nagios
installation up, at least.)

We could easily extend that to monitor in much more detail. It's just
that someone has to define what we need to monitor. And in either case,
I see no reason we should require commercial software to do it - that's
still going to need the definition of what has to be monitored. Let's
stick to open source when we can...

BTW, we already do content monitoring on the actual website mirrors. If
a mirror does not answer, *or* does not update properly, it will
automatically be removed from the DNS record, and thus get out of
"public view" after 10-30 minutes.

> In the case of the archives, for instance, the problem was a
> perl process that for some unknown reason got stuck randomly
> ... removed that in favor of an awk script, and it hasn't
> done it since ... i also redirected cron's email to
> scrappy@postgresql.org, so that any errors show up in my
> mailbox instead of roots, so I get an hourly reminder that
> things are running well ...

Right. What we could do to easily enhance this is to have the update
script update a timestamp file somewhere on the system when it's done,
and then monitor that file using existing tools (the file should be
accessible through http://archives.postgresql.org/ the same way it is
for the general website). Then you can just define "can get <nn>
minutes out of sync before we scream". (A rough sketch of such a check
follows below.)

> In the case of search ... John would be better at answering
> that, but when he and I talked this past week, he mentioned
> that he was moving it all over to two new servers, which I
> changed the DNS for on Wednesday ...

What I think would be good in cases like this is just information -
AFAIK nobody on the web team knew the servers were being moved. (I may
be wrong here - I know I didn't know, and I also spoke to Dave about
it, but those are the only ones I polled. Anyway, -www should know.)
That would also make it possible to do the standard fiddling with DNS
TTLs to make the problem much smaller.

//Magnus
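
To illustrate the timestamp-file idea above, here is a minimal sketch of
what such a check could look like, written as a Nagios-style plugin in
Python. The URL, the file format (epoch seconds written by the update
script), and the threshold are assumptions for the example only - none
of this exists in the current infrastructure.

    #!/usr/bin/env python
    # Sketch: complain if the archives "last run" timestamp is too old.
    # Assumptions: the update script writes the current epoch time to a
    # file published at LASTRUN_URL (hypothetical path), and this runs
    # from cron or as a Nagios plugin (exit 0 = OK, exit 2 = CRITICAL).
    import sys
    import time
    import urllib.request

    LASTRUN_URL = "http://archives.postgresql.org/lastrun.txt"  # hypothetical
    MAX_AGE_MINUTES = 90  # "can get <nn> minutes out of sync before we scream"

    try:
        with urllib.request.urlopen(LASTRUN_URL, timeout=30) as resp:
            last_run = float(resp.read().decode().strip())
    except Exception as e:
        print("CRITICAL: could not fetch %s: %s" % (LASTRUN_URL, e))
        sys.exit(2)

    age_minutes = (time.time() - last_run) / 60
    if age_minutes > MAX_AGE_MINUTES:
        print("CRITICAL: archives last updated %.0f minutes ago" % age_minutes)
        sys.exit(2)

    print("OK: archives updated %.0f minutes ago" % age_minutes)
    sys.exit(0)

The same pattern would work for any of the other services: publish a
cheap "proof of life" file over HTTP and let whatever monitoring is
already in place (Nagios or otherwise) watch its age.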