Thread: pg_standby observation
I think it would be useful if pg_standby (in version 8.3 contrib) could be observed in some way. Right now I use my own standby script, because every time it runs, it touches a file in a known location. That allows me to monitor that file, and if it is too stale, I know something must have gone wrong (I have an archive_timeout set), and I can send an SNMP trap. Would it be useful to add something similar to pg_standby? Is there a better way to detect a problem with a standby system, or a more appropriate place? The postgres logs do report this also, but it requires more care to properly intercept the "restored log file ... from archive" messages. Regards, Jeff Davis
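The touch-a-file approach described above can be checked with very little code. The following is a minimal sketch, not anything shipped with pg_standby: the function names and the idea of passing in `now` for testing are assumptions made for illustration, and the heartbeat path and threshold would be whatever the site's own standby script and archive_timeout imply.

```python
import os
import time


def heartbeat_age_seconds(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def is_stale(path, max_age_seconds, now=None):
    """True if the heartbeat file is missing or older than max_age_seconds.

    A missing file is treated as stale, on the assumption that the standby
    script has never run successfully.
    """
    age = heartbeat_age_seconds(path, now)
    return age is None or age > max_age_seconds
```

A monitoring job could call `is_stale()` with a threshold somewhat larger than archive_timeout and raise its SNMP trap when the result is true. As noted later in the thread, a fresh file only shows that the script ran, not that replay succeeded.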
On Sep 13, 2007, at 1:38 PM, Jeff Davis wrote:
> I think it would be useful if pg_standby (in version 8.3 contrib) could be observed in some way.
>
> Right now I use my own standby script, because every time it runs, it touches a file in a known location. That allows me to monitor that file, and if it is too stale, I know something must have gone wrong (I have an archive_timeout set), and I can send an SNMP trap.
>
> Would it be useful to add something similar to pg_standby? Is there a better way to detect a problem with a standby system, or a more appropriate place?
>
> The postgres logs do report this also, but it requires more care to properly intercept the "restored log file ... from archive" messages.
>
> Regards,
> Jeff Davis

If you include the -d option, pg_standby will emit logging info on stderr, so you can tack on something like 2>> logpath/standby.log. What it is lacking, however, is timestamps in the output when it successfully recovers a WAL file. Was there something more you were looking for?

Erik Jones
On Thu, 2007-09-13 at 14:05 -0500, Erik Jones wrote:
> If you include the -d option pg_standby will emit logging info on stderr so you can tack on something like 2>> logpath/standby.log. What it is lacking, however, is timestamps in the output when it successfully recovers a WAL file. Was there something more you were looking for?

I don't think the timestamps will be a problem; I can always pipe it through something else.

I think this will work, but it would be nice to have something that's a little more well-defined and standardized to determine whether some kind of error happens during replay.

Ultimately, what I'm trying to do is make it so that pgsnmpd can monitor this, and trap if a problem occurs. In order for pgsnmpd to do this in a way that works for a large number of people, it can't make too many assumptions about logging options, etc.

Regards,
Jeff Davis
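The "pipe it through something else" idea for adding timestamps could look roughly like the filter below. This is only a sketch: `stamp_lines`, `main`, and the injectable `clock` parameter are names invented here, and the timestamp format is an arbitrary choice.

```python
import sys
from datetime import datetime


def stamp_lines(lines, clock=datetime.now):
    """Yield each input line prefixed with the time at which it was read."""
    for line in lines:
        text = line.rstrip("\n")
        stamp = clock().isoformat(sep=" ", timespec="seconds")
        yield f"{stamp} {text}\n"


def main():
    """Copy stdin to stdout, one timestamped line at a time."""
    for stamped in stamp_lines(sys.stdin):
        sys.stdout.write(stamped)
        sys.stdout.flush()  # keep the log current while pg_standby is waiting

# When saved as a standalone script, finish the file with a call to main().
```

One caveat when wiring such a filter into a restore_command: in a plain shell pipeline the exit status is that of the last command, so pg_standby's own exit status would need to be preserved separately.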
On Sep 13, 2007, at 3:02 PM, Jeff Davis wrote:
> I don't think the timestamps will be a problem, I can always pipe it through something else.
>
> I think this will work, but it would be nice to have something that's a little more well-defined and standardized to determine whether some kind of error happens during replay.

Right. The problem there is that there really isn't anything standardized about pg_standby, yet. Or, if it is, it hasn't been documented, yet. Perhaps you could ask Simon about the possible outputs on error conditions so that you'll have a definite list to work with?

> Ultimately, what I'm trying to do is make it so that pgsnmpd can monitor this, and trap if a problem occurs. In order for pgsnmpd to do this in a way that works for a large number of people, it can't make too many assumptions about logging options, etc.

Erik Jones
On Thu, 2007-09-13 at 15:13 -0500, Erik Jones wrote:
> Right. The problem there is that there really isn't anything standardized about pg_standby, yet. Or, if it is, it hasn't been documented, yet. Perhaps you could ask Simon about the possible outputs on error conditions so that you'll have a definite list to work with?

There are a few different kinds of errors pg_standby can generate, though much of its behaviour depends upon the command line switches. I wasn't planning on documenting all possible failure states. We don't do that anywhere else in the docs.

Happy to consider any requests for change.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Thu, 2007-09-13 at 11:38 -0700, Jeff Davis wrote:
> I think it would be useful if pg_standby (in version 8.3 contrib) could be observed in some way.
>
> Right now I use my own standby script, because every time it runs, it touches a file in a known location. That allows me to monitor that file, and if it is too stale, I know something must have gone wrong (I have an archive_timeout set), and I can send an SNMP trap.
>
> Would it be useful to add something similar to pg_standby? Is there a better way to detect a problem with a standby system, or a more appropriate place?
>
> The postgres logs do report this also, but it requires more care to properly intercept the "restored log file ... from archive" messages.

Well, the definition of it working correctly is that a "restored log file..." message occurs. Even with archive_timeout set there could be various delays before that happens. We have two servers and a network involved, so the time might spike occasionally.

Touching a file doesn't really prove it's working either.

Not sure what to suggest otherwise.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Sun, 2007-09-16 at 09:25 +0100, Simon Riggs wrote:
> Well, the definition of it working correctly is that a "restored log file..." message occurs. Even with archive_timeout set there could be various delays before that happens. We have two servers and a network involved, so the time might spike occasionally.

The problem is, a "restored log file" message might appear in a different language or with a different prefix, depending on the settings. That makes it hard to come up with a general solution, so everyone has to use their own scripts that work with their logging configuration.

In my particular case, I want to know if those logs aren't being replayed, regardless of whether it's a network problem or a postgres problem. It would be nice if there were a more standardized way to see when postgres replays a log successfully.

> Touching a file doesn't really prove it's working either.

Right. It's the best I have now, however, and should detect "most" error conditions.

Regards,
Jeff Davis
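To make the log-scraping difficulty concrete, here is a rough sketch of what each site ends up writing for itself. It assumes English server messages and relies on the `restored log file "..." from archive` wording quoted in this thread; the function name and pattern are illustrative only.

```python
import re

# Assumption: server messages are in English. With translated messages the
# wording differs and this pattern will simply never match.
RESTORED = re.compile(r'restored log file "(?P<name>[^"]+)" from archive')


def restored_files(log_lines):
    """Return the file names from 'restored log file ... from archive' lines.

    search() is used rather than match() so that whatever prefix the logging
    configuration puts in front of the message is ignored.
    """
    names = []
    for line in log_lines:
        found = RESTORED.search(line)
        if found:
            names.append(found.group("name"))
    return names
```

Searching anywhere in the line sidesteps the prefix problem, but the language dependence remains, which is why a generic tool such as pgsnmpd cannot rely on this approach alone.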
On Sun, 2007-09-16 at 11:18 -0700, Jeff Davis wrote:
> The problem is, a "restored log file" message might appear in a different language or with a different prefix, depending on the settings. That makes it hard to come up with a general solution, so everyone has to use their own scripts that work with their logging configuration.
>
> In my particular case, I want to know if those logs aren't being replayed, regardless of whether it's a network problem or a postgres problem.

Currently pg_standby just sits there waiting. If you can specify the events you wish to monitor and what action to take when each event happens, I can make it do this.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk