Thread: monitoring warm standby lag in 8.4?
Hi all, I'm wondering if there's an accepted way to monitor a warm standby machine's lag in 8.4. The wiki[1] has a link[2] to a script which parses the output of pg_controldata, looking for a line like this: Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST But I'm not sure whether this timestamp is to be trusted as an indicator of how far behind the standby is in its recovery -- this timestamp just tells us when the standby last performed a checkpoint, regardless of how far behind in the WAL stream it is, right? I haven't come across any other monitoring suggestions for warm standby on 8.4. I've seen suggestions for hot standby slaves to use: SELECT pg_last_xlog_receive_location(); but this won't work on an 8.4 warm standby of course. I've searched around and haven't found[3] any other tips on how to monitor my standby. The manual mentions[4] using pg_xlogfile_name_offset() in the context of implementing record-based log shipping. Would this be useful for monitoring standby lag? Any other ideas? Thanks, Josh -- [1] http://wiki.postgresql.org/wiki/Warm_Standby [2] http://www.kennygorman.com/wordpress/?p=249 [3] I was hoping this page would have some relevant info: http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but it's down now :( [4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 > I'm wondering if there's an accepted way to monitor a warm standby > machine's lag in 8.4. The wiki[1] has a link[2] to a script which > parses the output of pg_controldata, looking for a line like this: > > Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST > > But I'm not sure whether this timestamp is to be trusted as an > indicator of how far behind the standby is in its recovery -- this > timestamp just tells us when the standby last performed a checkpoint, > regardless of how far behind in the WAL stream it is, right? Correct. But since we cannot connect to a database in recovery mode, there are very few options to determine how far 'behind' it is. The pg_controldata is what the check_postgres program uses. This offers a rough check which is usually sufficient unless you have a very inactive database or need very fine grained checking. A better system would perhaps connect to both ends and examine which specific WALs were being shipped and which one was last played, but there are no tools I know of that do that. I suspect the reason for this is that the pg_controldata check is "good enough". Certainly, that's what we are using for many clients via check_postgres, and it's been very good at detecting when the replica has problems. Good enough that I've never worried about writing a different method, anyway. :) - -- Greg Sabino Mullane greg@turnstep.com End Point Corporation http://www.endpoint.com/ PGP Key: 0x14964AC8 201012101126 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAk0CVN4ACgkQvJuQZxSWSshqIgCZASW1Tsf+8/Mk2qYPIzYUoYZz +CwAmwQbwzv8ED1QRskG8DavSr89NG/d =qwtc -----END PGP SIGNATURE-----
Yeah, my website is busted. I'll fix it for you. On Thu, Dec 9, 2010 at 2:30 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote: > Hi all, > > I'm wondering if there's an accepted way to monitor a warm standby > machine's lag in 8.4. The wiki[1] has a link[2] to a script which > parses the output of pg_controldata, looking for a line like this: > > Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST > > But I'm not sure whether this timestamp is to be trusted as an > indicator of how far behind the standby is in its recovery -- this > timestamp just tells us when the standby last performed a checkpoint, > regardless of how far behind in the WAL stream it is, right? > > I haven't come across any other monitoring suggestions for warm > standby on 8.4. I've seen suggestions for hot standby slaves to use: > SELECT pg_last_xlog_receive_location(); > but this won't work on an 8.4 warm standby of course. I've searched > around and haven't found[3] any other tips on how to monitor my > standby. > > The manual mentions[4] using pg_xlogfile_name_offset() in the context > of implementing record-based log shipping. Would this be useful for > monitoring standby lag? Any other ideas? > > Thanks, > Josh > > > -- > [1] http://wiki.postgresql.org/wiki/Warm_Standby > [2] http://www.kennygorman.com/wordpress/?p=249 > [3] I was hoping this page would have some relevant info: > http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but > it's down now :( > [4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
On Fri, Dec 10, 2010 at 11:27 AM, Greg Sabino Mullane <greg@turnstep.com> wrote: > Correct. But since we cannot connect to a database in recovery mode, > there are very few options to determine how far 'behind' it is. The > pg_controldata is what the check_postgres program uses. This offers a > rough check which is usually sufficient unless you have a very > inactive database or need very fine grained checking. > > A better system would perhaps connect to both ends and examine which > specific WALs were being shipped and which one was last played, but > there are no tools I know of that do that. I suspect the reason for > this is that the pg_controldata check is "good enough". Certainly, > that's what we are using for many clients via check_postgres, and > it's been very good at detecting when the replica has problems. Good > enough that I've never worried about writing a different method, > anyway. :) Thanks for the reply. One simple piece I added in to my monitoring script which wasn't here: http://www.kennygorman.com/wordpress/?p=249 (or in check_postgres.pl, from a quick look at check_checkpoint() in check_postgres.pl) is a verification that the standby slave is actually 'in archive recovery' mode, from looking at the 'Database cluster state:' output of pg_controldata. I was mulling over some ways to add in a reasonable check that the standby was keeping up with the WAL stream. Comparing WAL file names on master vs. standby would probably work, but I was also thinking that a simple directory-size check on the standby's WAL archive directory would show whether we were receiving WAL files faster than we could process them. Josh