Thread: pg_standby observation

From
Jeff Davis
Date:
I think it would be useful if pg_standby (in version 8.3 contrib) could
be observed in some way.

Right now I use my own standby script, because every time it runs, it
touches a file in a known location. That allows me to monitor that file,
and if it is too stale, I know something must have gone wrong (I have an
archive_timeout set), and I can send an SNMP trap.
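
A minimal sketch of the staleness check described above, assuming the standby script touches a heartbeat file on every run. The function names, the file path, and the threshold are hypothetical and not part of pg_standby; sending the SNMP trap is left to the caller.

```python
import os
import time


def heartbeat_age_seconds(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or older than the limit."""
    age = heartbeat_age_seconds(path, now)
    return age is None or age > max_age_seconds
```

A monitor could call is_stale("/path/to/heartbeat", limit) on a schedule, with the limit set comfortably above archive_timeout to allow for transfer delays between the two servers.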

Would it be useful to add something similar to pg_standby? Is there a
better way to detect a problem with a standby system, or a more
appropriate place?

The postgres logs do report this also, but it requires more care to
properly intercept the "restored log file ... from archive" messages.

Regards,
    Jeff Davis


Re: pg_standby observation

From
Erik Jones
Date:
On Sep 13, 2007, at 1:38 PM, Jeff Davis wrote:

> I think it would be useful if pg_standby (in version 8.3 contrib)
> could
> be observed in some way.
>
> Right now I use my own standby script, because every time it runs, it
> touches a file in a known location. That allows me to monitor that
> file,
> and if it is too stale, I know something must have gone wrong (I
> have an
> archive_timeout set), and I can send an SNMP trap.
>
> Would it be useful to add something similar to pg_standby? Is there a
> better way to detect a problem with a standby system, or a more
> appropriate place?
>
> The postgres logs do report this also, but it requires more care to
> properly intercept the "restored log file ... from archive" messages.
>
> Regards,
>     Jeff Davis

If you include the -d option, pg_standby will emit logging info on
stderr, so you can tack on something like 2>> logpath/standby.log.
What it lacks, however, is timestamps in the output when it
successfully recovers a WAL file.  Was there something more you were
looking for?

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: pg_standby observation

From
Jeff Davis
Date:
On Thu, 2007-09-13 at 14:05 -0500, Erik Jones wrote:
> If you include the -d option pg_standby will emit logging info on
> stderr so you can tack on something like 2>> logpath/standby.log.
> What it is lacking, however, is timestamps in the output when it
> successfully recovers a WAL file.  Was there something more you were
> looking for?

I don't think the timestamps will be a problem, I can always pipe it
through something else.
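
As a rough illustration of piping the output through something else, here is a sketch of a filter that prefixes each line it reads with a UTC timestamp. The names stamp_lines and main are hypothetical, and the sketch assumes only that the pg_standby -d output arrives line by line on the filter's standard input.

```python
import sys
from datetime import datetime, timezone


def stamp_lines(lines, clock=None):
    """Yield each input line prefixed with an ISO 8601 UTC timestamp."""
    if clock is None:
        clock = lambda: datetime.now(timezone.utc)
    for line in lines:
        text = line.rstrip("\n")
        yield f"{clock().isoformat()} {text}\n"


def main():
    """Copy stdin to stdout, stamping each line as it arrives."""
    for stamped in stamp_lines(sys.stdin):
        sys.stdout.write(stamped)
        sys.stdout.flush()  # keep the log current for anything watching it
```

Calling main() from a small script turns this into a filter; the clock parameter exists only so the stamping can be checked with a fixed time.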

I think this will work, but it would be nice to have something that's a
little more well-defined and standardized to determine whether some kind
of error happens during replay.

Ultimately, what I'm trying to do is make it so that pgsnmpd can monitor
this, and trap if a problem occurs. In order for pgsnmpd to do this in a
way that works for a large number of people, it can't make too many
assumptions about logging options, etc.

Regards,
    Jeff Davis


Re: pg_standby observation

From
Erik Jones
Date:
On Sep 13, 2007, at 3:02 PM, Jeff Davis wrote:

> On Thu, 2007-09-13 at 14:05 -0500, Erik Jones wrote:
>> If you include the -d option pg_standby will emit logging info on
>> stderr so you can tack on something like 2>> logpath/standby.log.
>> What it is lacking, however, is timestamps in the output when it
>> successfully recovers a WAL file.  Was there something more you were
>> looking for?
>
> I don't think the timestamps will be a problem, I can always pipe it
> through something else.
>
> I think this will work, but it would be nice to have something
> that's a
> little more well-defined and standardized to determine whether some
> kind
> of error happens during replay.

Right.  The problem there is that there really isn't anything
standardized about pg_standby, yet.  Or, if it is, it hasn't been
documented, yet.  Perhaps you could ask Simon about the possible
outputs on error conditions so that you'll have a definite list to
work with?

> Ultimately, what I'm trying to do is make it so that pgsnmpd can
> monitor
> this, and trap if a problem occurs. In order for pgsnmpd to do this
> in a
> way that works for a large number of people, it can't make too many
> assumptions about logging options, etc.


Erik Jones




Re: pg_standby observation

From
Simon Riggs
Date:
On Thu, 2007-09-13 at 15:13 -0500, Erik Jones wrote:
> On Sep 13, 2007, at 3:02 PM, Jeff Davis wrote:
>
> > On Thu, 2007-09-13 at 14:05 -0500, Erik Jones wrote:
> >> If you include the -d option pg_standby will emit logging info on
> >> stderr so you can tack on something like 2>> logpath/standby.log.
> >> What it is lacking, however, is timestamps in the output when it
> >> successfully recovers a WAL file.  Was there something more you were
> >> looking for?
> >
> > I don't think the timestamps will be a problem, I can always pipe it
> > through something else.
> >
> > I think this will work, but it would be nice to have something
> > that's a
> > little more well-defined and standardized to determine whether some
> > kind
> > of error happens during replay.
>
> Right.  The problem there is that there really isn't anything
> standardized about pg_standby, yet.  Or, if it is, it hasn't been
> documented, yet.  Perhaps you could ask Simon about the possible
> outputs on error conditions so that you'll have a definite list to
> work with?

There are a few different kinds of errors pg_standby can generate, though
much of its behaviour depends upon the command line switches.

I wasn't planning on documenting all possible failure states. We don't
do that anywhere else in the docs.

Happy to consider any requests for change.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: pg_standby observation

From
Simon Riggs
Date:
On Thu, 2007-09-13 at 11:38 -0700, Jeff Davis wrote:
> I think it would be useful if pg_standby (in version 8.3 contrib) could
> be observed in some way.
>
> Right now I use my own standby script, because every time it runs, it
> touches a file in a known location. That allows me to monitor that file,
> and if it is too stale, I know something must have gone wrong (I have an
> archive_timeout set), and I can send an SNMP trap.
>
> Would it be useful to add something similar to pg_standby? Is there a
> better way to detect a problem with a standby system, or a more
> appropriate place?
>
> The postgres logs do report this also, but it requires more care to
> properly intercept the "restored log file ... from archive" messages.

Well, the definition of it working correctly is that a "restored log
file..." message occurs. Even with archive_timeout set there could be
various delays before that happens. We have two servers and a network
involved, so the time might spike occasionally.

Touching a file doesn't really prove it's working either.

Not sure what to suggest otherwise.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: pg_standby observation

From
Jeff Davis
Date:
On Sun, 2007-09-16 at 09:25 +0100, Simon Riggs wrote:
> Well, the definition of it working correctly is that a "restored log
> file..." message occurs. Even with archive_timeout set there could be
> various delays before that happens. We have two servers and a network
> involved, so the time might spike occasionally.
>

The problem is, a "restored log file" message might appear in a
different language or with a different prefix, depending on the
settings. That makes it hard to come up with a general solution, so
everyone has to use their own scripts that work with their logging
configuration.

In my particular case, I want to know if those logs aren't being
replayed, regardless of whether it's a network problem or a postgres
problem.

It would be nice if there was a more standardized way to see when
postgres replays a log successfully.
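
To illustrate why log scraping ends up site-specific, here is a sketch of a scanner whose pattern has to be supplied per installation. The default pattern assumes English messages and matches anywhere in the line, so it tolerates an arbitrary prefix; a different message language would need a different pattern. The names DEFAULT_PATTERN and last_restored_segment are hypothetical.

```python
import re

# Assumes English server messages; other locales need their own pattern.
DEFAULT_PATTERN = r'restored log file "([0-9A-F]{24})" from archive'


def last_restored_segment(log_lines, pattern=DEFAULT_PATTERN):
    """Return the WAL segment name from the last matching line, or None."""
    regex = re.compile(pattern)
    last = None
    for line in log_lines:
        match = regex.search(line)
        if match:
            last = match.group(1)
    return last
```

A monitor could compare the returned name between polls and raise an alarm when it stops changing for longer than expected.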

> Touching a file doesn't really prove it's working either.
>

Right. It's the best I have now, however, and should detect "most" error
conditions.

Regards,
    Jeff Davis


Re: pg_standby observation

From
Simon Riggs
Date:
On Sun, 2007-09-16 at 11:18 -0700, Jeff Davis wrote:
> On Sun, 2007-09-16 at 09:25 +0100, Simon Riggs wrote:
> > Well, the definition of it working correctly is that a "restored log
> > file..." message occurs. Even with archive_timeout set there could be
> > various delays before that happens. We have two servers and a network
> > involved, so the time might spike occasionally.
> >
>
> The problem is, a "restored log file" message might appear in a
> different language or with a different prefix, depending on the
> settings. That makes it hard to come up with a general solution, so
> everyone has to use their own scripts that work with their logging
> configuration.
>
> In my particular case, I want to know if those logs aren't being
> replayed, regardless of whether it's a network problem or a postgres
> problem.

Currently pg_standby just sits there waiting. If you can specify the
events you wish to monitor and what action to take when that event
happens, I can make it do this.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk