Thread: monitoring warm standby lag in 8.4?

monitoring warm standby lag in 8.4?

From

Josh Kupershmidt

Date:

09 December 2010, 15:30:38

Hi all,

I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1] has a link[2] to a script which
parses the output of pg_controldata, looking for a line like this:

  Time of latest checkpoint:            Thu 09 Dec 2010 01:35:46 PM EST

But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?

I haven't come across any other monitoring suggestions for warm
standby on 8.4. I've seen suggestions for hot standby slaves to use:
  SELECT pg_last_xlog_receive_location();
but this won't work on an 8.4 warm standby of course. I've searched
around and haven't found[3] any other tips on how to monitor my
standby.

The manual mentions[4] using pg_xlogfile_name_offset() in the context
of implementing record-based log shipping. Would this be useful for
monitoring standby lag? Any other ideas?

Thanks,
Josh


--
[1] http://wiki.postgresql.org/wiki/Warm_Standby
[2] http://www.kennygorman.com/wordpress/?p=249
[3] I was hoping this page would have some relevant info:
http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
it's down now :(
[4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD

Re: monitoring warm standby lag in 8.4?

From

"Greg Sabino Mullane"

Date:

10 December 2010, 12:27:58

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> I'm wondering if there's an accepted way to monitor a warm standby
> machine's lag in 8.4. The wiki[1] has a link[2] to a script which
> parses the output of pg_controldata, looking for a line like this:
>
> Time of latest checkpoint:            Thu 09 Dec 2010 01:35:46 PM EST
>
> But I'm not sure whether this timestamp is to be trusted as an
> indicator of how far behind the standby is in its recovery -- this
> timestamp just tells us when the standby last performed a checkpoint,
> regardless of how far behind in the WAL stream it is, right?

Correct. But since we cannot connect to a database in recovery mode,
there are very few options to determine how far 'behind' it is. The
pg_controldata is what the check_postgres program uses. This offers a
rough check which is usually sufficient unless you have a very
inactive database or need very fine grained checking.

A better system would perhaps connect to both ends and examine which
specific WALs were being shipped and which one was last played, but
there are no tools I know of that do that. I suspect the reason for
this is that the pg_controldata check is "good enough". Certainly,
that's what we are using for many clients via check_postgres, and
it's been very good at detecting when the replica has problems. Good
enough that I've never worried about writing a different method,
anyway. :)

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201012101126
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk0CVN4ACgkQvJuQZxSWSshqIgCZASW1Tsf+8/Mk2qYPIzYUoYZz
+CwAmwQbwzv8ED1QRskG8DavSr89NG/d
=qwtc
-----END PGP SIGNATURE-----

Re: monitoring warm standby lag in 8.4?

From

Scott Mead

Date:

10 December 2010, 13:21:39

Yeah, my website is busted.  I'll fix it for you.


On Thu, Dec 9, 2010 at 2:30 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> Hi all,
>
> I'm wondering if there's an accepted way to monitor a warm standby
> machine's lag in 8.4. The wiki[1] has a link[2] to a script which
> parses the output of pg_controldata, looking for a line like this:
>
>  Time of latest checkpoint:            Thu 09 Dec 2010 01:35:46 PM EST
>
> But I'm not sure whether this timestamp is to be trusted as an
> indicator of how far behind the standby is in its recovery -- this
> timestamp just tells us when the standby last performed a checkpoint,
> regardless of how far behind in the WAL stream it is, right?
>
> I haven't come across any other monitoring suggestions for warm
> standby on 8.4. I've seen suggestions for hot standby slaves to use:
>  SELECT pg_last_xlog_receive_location();
> but this won't work on an 8.4 warm standby of course. I've searched
> around and haven't found[3] any other tips on how to monitor my
> standby.
>
> The manual mentions[4] using pg_xlogfile_name_offset() in the context
> of implementing record-based log shipping. Would this be useful for
> monitoring standby lag? Any other ideas?
>
> Thanks,
> Josh
>
>
> --
> [1] http://wiki.postgresql.org/wiki/Warm_Standby
> [2] http://www.kennygorman.com/wordpress/?p=249
> [3] I was hoping this page would have some relevant info:
> http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
> it's down now :(
> [4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Re: monitoring warm standby lag in 8.4?

From

Josh Kupershmidt

Date:

10 December 2010, 15:13:36

On Fri, Dec 10, 2010 at 11:27 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
> Correct. But since we cannot connect to a database in recovery mode,
> there are very few options to determine how far 'behind' it is. The
> pg_controldata is what the check_postgres program uses. This offers a
> rough check which is usually sufficient unless you have a very
> inactive database or need very fine grained checking.
>
> A better system would perhaps connect to both ends and examine which
> specific WALs were being shipped and which one was last played, but
> there are no tools I know of that do that. I suspect the reason for
> this is that the pg_controldata check is "good enough". Certainly,
> that's what we are using for many clients via check_postgres, and
> it's been very good at detecting when the replica has problems. Good
> enough that I've never worried about writing a different method,
> anyway. :)

Thanks for the reply.

One simple piece I added in to my monitoring script which wasn't here:
  http://www.kennygorman.com/wordpress/?p=249
(or in check_postgres.pl, from a quick look at check_checkpoint() in
check_postgres.pl) is a verification that the standby slave is
actually 'in archive recovery' mode, from looking at the 'Database
cluster state:' output of pg_controldata.

I was mulling over some ways to add in a reasonable check that the
standby was keeping up with the WAL stream. Comparing WAL file names
on master vs. standby would probably work, but I was also thinking
that a simple directory-size check on the standby's WAL archive
directory would show whether we were receiving WAL files faster than
we could process them.

Josh