Thread: Calculating Replication Lag - units
Howdy, When calculating Replication lag, I know that we have to compare the pg_current_xlog_location to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are the units that I'm left with after the calculation. (i.e., does the xlog_location imply some time value?) Here's the output of the (slightly modified script) Master: 5003964876715 Receive: 5003964876715 Replay: 5003964765203 receive.value 0 apply.value 111512 111512 isn't inherently useful to me on its own. Any tips? Thanks Dave
On 06/25/2012 01:17 PM, David Kerr wrote: > Howdy, > > When calculating Replication lag, I know that we have to compare the pg_current_xlog_location > to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are > the units that I'm left with after the calculation. > > (i.e., does the xlog_location imply some time value?) > > Here's the output of the (slightly modified script) > Master: 5003964876715 > Receive: 5003964876715 > Replay: 5003964765203 > > receive.value 0 > apply.value 111512 > > 111512 isn't inherently useful to me on its own. > > Any tips? > How about now()-pg_last_xact_replay_timestamp() (however this can be a large number if there have not been any recent transactions on the master). I suppose you could do something like: case when pg_last_xlog_receive_location() = pg_last_xlog_replay_location() then '0 seconds'::interval else now()-pg_last_xact_replay_timestamp() end as log_delay; But I'm wrapping my head around some replication issues myself so others may have better ideas or corrections. Cheers, Steve
On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: - On 06/25/2012 01:17 PM, David Kerr wrote: - >Howdy, - > - >When calculating Replication lag, I know that we have to compare the - >pg_current_xlog_location - >to pg_last_xlog_receive_location, etc. but what I'm trying to figure out - >is what are - >the units that I'm left with after the calculation. - > - >(i.e., does the xlog_location imply some time value?) - > - >Here's the output of the (slightly modified script) - >Master: 5003964876715 - >Receive: 5003964876715 - >Replay: 5003964765203 - > - >receive.value 0 - >apply.value 111512 - > - >111512 isn't inherently useful to me on its own. - > - >Any tips? - > - How about now()-pg_last_xact_replay_timestamp() (however this can be a - large number if there have not been any recent transactions on the - master). I suppose you could do something like: - - case when pg_last_xlog_receive_location() = - pg_last_xlog_replay_location() then '0 seconds'::interval - else now()-pg_last_xact_replay_timestamp() end as log_delay; i don't know for sure that 111512 is a time value.. that's kind of what i'm wondering. If i knew that it was like miliseconds or something that would be helpful. - But I'm wrapping my head around some replication issues myself so others - may have better ideas or corrections. I've been fairly successful with replication so I'm happy to help there. Just trying to shore up my monitoring now!
On Tue, Jun 26, 2012 at 1:47 AM, David Kerr <dmk@mr-paradox.net> wrote:
Howdy,
When calculating Replication lag, I know that we have to compare the pg_current_xlog_location
to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are
the units that I'm left with after the calculation.
(i.e., does the xlog_location imply some time value?)
Here's the output of the (slightly modified script)
Master: 5003964876715
Receive: 5003964876715
Replay: 5003964765203
receive.value 0
apply.value 111512
111512 isn't inherently useful to me on its own.
Any tips?
A common method I did in Oracle, I followed the same, I may be wrong in calculating exactly. Someone would have better solution on lag calculation.
My checking goes like this, Since its streaming replication, every DML should be replicated as fast it could to slave.
1. Create table on master as Lagcheck(lagtime timestamp) and insert one row with now() (current_time of server).
2. every minute update the same row with latest time by putting entry in cronjob
3. Step 2 will be replicated to Slave (i.e., SR box).
4. Now on slave calculate the lag by now() - lagcheck.lagtime(column which has value of Master time).
Here you get the time how much slave is behind from master.
Note: Special attention required on Timezones.
---
Regards,
Raghavendra
EnterpriseDB Corporation
On 6/25/2012 9:55 PM, Raghavendra wrote: > > On Tue, Jun 26, 2012 at 1:47 AM, David Kerr <dmk@mr-paradox.net > <mailto:dmk@mr-paradox.net>> wrote: > > Howdy, > > When calculating Replication lag, I know that we have to compare the > pg_current_xlog_location > to pg_last_xlog_receive_location, etc. but what I'm trying to figure > out is what are > the units that I'm left with after the calculation. > > (i.e., does the xlog_location imply some time value?) > > Here's the output of the (slightly modified script) > Master: 5003964876715 > Receive: 5003964876715 > Replay: 5003964765203 > > receive.value 0 > apply.value 111512 > > 111512 isn't inherently useful to me on its own. > > Any tips? > > A common method I did in Oracle, I followed the same, I may be wrong in > calculating exactly. Someone would have better solution on lag calculation. > > My checking goes like this, Since its streaming replication, every DML > should be replicated as fast it could to slave. > > 1. Create table on master as Lagcheck(lagtime timestamp) and insert one > row with now() (current_time of server). > 2. every minute update the same row with latest time by putting entry in > cronjob > 3. Step 2 will be replicated to Slave (i.e., SR box). > 4. Now on slave calculate the lag by now() - lagcheck.lagtime(column > which has value of Master time). > > Here you get the time how much slave is behind from master. > Note: Special attention required on Timezones. Well, I think the way I'm doing it is the correct way, it's even that way in Simon's book. I'm just not sure what the # is.. is it miliseconds since last update on master. or just some arbitrary number.
On Tue, Jun 26, 2012 at 6:21 AM, David Kerr <dmk@mr-paradox.net> wrote: > On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: > - On 06/25/2012 01:17 PM, David Kerr wrote: > - >Howdy, > - > > - >When calculating Replication lag, I know that we have to compare the > - >pg_current_xlog_location > - >to pg_last_xlog_receive_location, etc. but what I'm trying to figure out > - >is what are > - >the units that I'm left with after the calculation. > - > > - >(i.e., does the xlog_location imply some time value?) > - > > - >Here's the output of the (slightly modified script) > - >Master: 5003964876715 > - >Receive: 5003964876715 > - >Replay: 5003964765203 > - > > - >receive.value 0 > - >apply.value 111512 > - > > - >111512 isn't inherently useful to me on its own. > - > > - >Any tips? > - > > - How about now()-pg_last_xact_replay_timestamp() (however this can be a > - large number if there have not been any recent transactions on the > - master). I suppose you could do something like: > - > - case when pg_last_xlog_receive_location() = > - pg_last_xlog_replay_location() then '0 seconds'::interval > - else now()-pg_last_xact_replay_timestamp() end as log_delay; > > i don't know for sure that 111512 is a time value.. that's kind of > what i'm wondering. If i knew that it was like miliseconds or something > that would be helpful. On the hot standby: SELECT now()-pg_last_xact_replay_timestamp() AS lag; This gives you the lag time as a PostgreSQL interval. (It also might give you a value if you run it on a database that is not a hot standby if it started in recovery mode). It seems difficult or impossible to calculate this on the master. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
On 06/26/2012 05:11 AM, Stuart Bishop wrote: > On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net> wrote: >> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: >> - On 06/25/2012 01:17 PM, David Kerr wrote: >> ->Howdy, >> -> >> ->When calculating Replication lag, I know that we have to compare the >> ->pg_current_xlog_location >> ->to pg_last_xlog_receive_location, etc. but what I'm trying to figure out >> ->is what are >> ->the units that I'm left with after the calculation. >> -> >> ->(i.e., does the xlog_location imply some time value?) >> -> >> ->Here's the output of the (slightly modified script) >> ->Master: 5003964876715 >> ->Receive: 5003964876715 >> ->Replay: 5003964765203 >> -> >> ->receive.value 0 >> ->apply.value 111512 >> -> >> ->111512 isn't inherently useful to me on its own. >> -> >> ->Any tips? >> -> >> - How about now()-pg_last_xact_replay_timestamp() (however this can be a >> - large number if there have not been any recent transactions on the >> - master). I suppose you could do something like: >> - >> - case when pg_last_xlog_receive_location() = >> - pg_last_xlog_replay_location() then '0 seconds'::interval >> - else now()-pg_last_xact_replay_timestamp() end as log_delay; >> >> i don't know for sure that 111512 is a time value.. that's kind of >> what i'm wondering. If i knew that it was like miliseconds or something >> that would be helpful. > > On the hot standby: > > SELECT now()-pg_last_xact_replay_timestamp() AS lag; > > This gives you the lag time as a PostgreSQL interval. > > (It also might give you a value if you run it on a database that is > not a hot standby if it started in recovery mode). > > It seems difficult or impossible to calculate this on the master. > > Ah, awesome. I don't need to calculate it on the master so that's perfect. Thanks!
On 06/26/2012 05:11 AM, Stuart Bishop wrote: > On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net> wrote: >> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: >> - On 06/25/2012 01:17 PM, David Kerr wrote: >> ->Howdy, >> -> >> ->When calculating Replication lag, I know that we have to compare the >> ->pg_current_xlog_location >> ->to pg_last_xlog_receive_location, etc. but what I'm trying to figure out >> ->is what are >> ->the units that I'm left with after the calculation. >> -> >> ->(i.e., does the xlog_location imply some time value?) >> -> >> ->Here's the output of the (slightly modified script) >> ->Master: 5003964876715 >> ->Receive: 5003964876715 >> ->Replay: 5003964765203 >> -> >> ->receive.value 0 >> ->apply.value 111512 >> -> >> ->111512 isn't inherently useful to me on its own. >> -> >> ->Any tips? >> -> >> - How about now()-pg_last_xact_replay_timestamp() (however this can be a >> - large number if there have not been any recent transactions on the >> - master). I suppose you could do something like: >> - >> - case when pg_last_xlog_receive_location() = >> - pg_last_xlog_replay_location() then '0 seconds'::interval >> - else now()-pg_last_xact_replay_timestamp() end as log_delay; >> >> i don't know for sure that 111512 is a time value.. that's kind of >> what i'm wondering. If i knew that it was like miliseconds or something >> that would be helpful. > > On the hot standby: > > SELECT now()-pg_last_xact_replay_timestamp() AS lag; > > This gives you the lag time as a PostgreSQL interval. > > (It also might give you a value if you run it on a database that is > not a hot standby if it started in recovery mode). > > It seems difficult or impossible to calculate this on the master. > > d'oh. looking closer at the script I was using, it says in there that's it's kb xlog, so not time but size. That explains a lot. I like the the replay_timestamp method better though fits in better to my management's question of "how much data would we lose?". Thanks all.
On 06/26/2012 08:16 AM, David Kerr wrote: > On 06/26/2012 05:11 AM, Stuart Bishop wrote: >> On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net> wrote: >>> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: >>> - On 06/25/2012 01:17 PM, David Kerr wrote: >>> ->Howdy, >>> -> >>> ->When calculating Replication lag, I know that we have to compare the >>> ->pg_current_xlog_location >>> ->to pg_last_xlog_receive_location, etc. but what I'm trying to >>> figure out >>> ->is what are >>> ->the units that I'm left with after the calculation. >>> -> >>> ->(i.e., does the xlog_location imply some time value?) >>> -> >>> ->Here's the output of the (slightly modified script) >>> ->Master: 5003964876715 >>> ->Receive: 5003964876715 >>> ->Replay: 5003964765203 >>> -> >>> ->receive.value 0 >>> ->apply.value 111512 >>> -> >>> ->111512 isn't inherently useful to me on its own. >>> -> >>> ->Any tips? >>> -> >>> - How about now()-pg_last_xact_replay_timestamp() (however this can >>> be a >>> - large number if there have not been any recent transactions on the >>> - master). I suppose you could do something like: >>> - >>> - case when pg_last_xlog_receive_location() = >>> - pg_last_xlog_replay_location() then '0 seconds'::interval >>> - else now()-pg_last_xact_replay_timestamp() end as log_delay; >>> >>> i don't know for sure that 111512 is a time value.. that's kind of >>> what i'm wondering. If i knew that it was like miliseconds or something >>> that would be helpful. >> >> On the hot standby: >> >> SELECT now()-pg_last_xact_replay_timestamp() AS lag; >> >> This gives you the lag time as a PostgreSQL interval. >> >> (It also might give you a value if you run it on a database that is >> not a hot standby if it started in recovery mode). >> >> It seems difficult or impossible to calculate this on the master. >> >> > > Ah, awesome. I don't need to calculate it on the master so that's > perfect. > > Thanks! > > But beware. If your master server has any idle time you may be misled by the simple calculation. I was running a pgbench test on a replicated pair of machines. It finished this morning sometime so the lag delay shows 02:31:11.651118, a value that might set off alarm bells. That's why I used the case statement to force the interval to 0 if the replay is up-to-date. I think it is still worthwhile to check pg_stat_replication on the master to make sure that it is still *sending* logs and perhaps cross-checking the current log position on the master with the replayed log location on the standby to see if they are reasonably close. An additional verification check I've toyed with is to have a cron script on the master update a one-row one-column table that holds a timestamp and checking that timestamp on the standby(s) to double-check that it is not too-far out-of-date. (This would also force regular data delivery to the standby so that pg_last_xact_replay_timestamp() should not lag far behind on an otherwise idle server.) Cheers, Steve
On Tue, Jun 26, 2012 at 09:13:44AM -0700, Steve Crawford wrote: - On 06/26/2012 08:16 AM, David Kerr wrote: - >On 06/26/2012 05:11 AM, Stuart Bishop wrote: - >>On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net> wrote: - >>>On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote: - >>>- On 06/25/2012 01:17 PM, David Kerr wrote: - >>>->Howdy, - >>>-> - >>>->When calculating Replication lag, I know that we have to compare the - >>>->pg_current_xlog_location - >>>->to pg_last_xlog_receive_location, etc. but what I'm trying to - >>>figure out - >>>->is what are - >>>->the units that I'm left with after the calculation. - >>>-> - >>>->(i.e., does the xlog_location imply some time value?) - >>>-> - >>>->Here's the output of the (slightly modified script) - >>>->Master: 5003964876715 - >>>->Receive: 5003964876715 - >>>->Replay: 5003964765203 - >>>-> - >>>->receive.value 0 - >>>->apply.value 111512 - >>>-> - >>>->111512 isn't inherently useful to me on its own. - >>>-> - >>>->Any tips? - >>>-> - >>>- How about now()-pg_last_xact_replay_timestamp() (however this can - >>>be a - >>>- large number if there have not been any recent transactions on the - >>>- master). I suppose you could do something like: - >>>- - >>>- case when pg_last_xlog_receive_location() = - >>>- pg_last_xlog_replay_location() then '0 seconds'::interval - >>>- else now()-pg_last_xact_replay_timestamp() end as log_delay; - >>> - >>>i don't know for sure that 111512 is a time value.. that's kind of - >>>what i'm wondering. If i knew that it was like miliseconds or something - >>>that would be helpful. - >> - >>On the hot standby: - >> - >> SELECT now()-pg_last_xact_replay_timestamp() AS lag; - >> - >>This gives you the lag time as a PostgreSQL interval. - >> - >>(It also might give you a value if you run it on a database that is - >>not a hot standby if it started in recovery mode). - >> - >>It seems difficult or impossible to calculate this on the master. - >> - >> - > - >Ah, awesome. I don't need to calculate it on the master so that's - >perfect. - > - >Thanks! - > - > - But beware. If your master server has any idle time you may be misled by - the simple calculation. I was running a pgbench test on a replicated - pair of machines. It finished this morning sometime so the lag delay - shows 02:31:11.651118, a value that might set off alarm bells. That's - why I used the case statement to force the interval to 0 if the replay - is up-to-date. - - I think it is still worthwhile to check pg_stat_replication on the - master to make sure that it is still *sending* logs and perhaps - cross-checking the current log position on the master with the replayed - log location on the standby to see if they are reasonably close. - - An additional verification check I've toyed with is to have a cron - script on the master update a one-row one-column table that holds a - timestamp and checking that timestamp on the standby(s) to double-check - that it is not too-far out-of-date. (This would also force regular data - delivery to the standby so that pg_last_xact_replay_timestamp() should - not lag far behind on an otherwise idle server.) - - Cheers, - Steve - I see, thanks. I was completely missing that you were using different functions than me and that was throwing me off. =)