Thread: Calculating Replication Lag - units

Calculating Replication Lag - units

From
David Kerr
Date:
Howdy,

When calculating Replication lag, I know that we have to compare the pg_current_xlog_location
to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are
the units that I'm left with after the calculation.

(i.e., does the xlog_location imply some time value?)

Here's the output of the (slightly modified script)
Master: 5003964876715
Receive: 5003964876715
Replay: 5003964765203

receive.value 0
apply.value 111512

111512 isn't inherently useful to me on its own.

Any tips?

Thanks

Dave

Re: Calculating Replication Lag - units

From
Steve Crawford
Date:
On 06/25/2012 01:17 PM, David Kerr wrote:
> Howdy,
>
> When calculating Replication lag, I know that we have to compare the pg_current_xlog_location
> to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are
> the units that I'm left with after the calculation.
>
> (i.e., does the xlog_location imply some time value?)
>
> Here's the output of the (slightly modified script)
> Master: 5003964876715
> Receive: 5003964876715
> Replay: 5003964765203
>
> receive.value 0
> apply.value 111512
>
> 111512 isn't inherently useful to me on its own.
>
> Any tips?
>
How about now()-pg_last_xact_replay_timestamp() (however this can be a
large number if there have not been any recent transactions on the
master). I suppose you could do something like:

case when pg_last_xlog_receive_location() =
pg_last_xlog_replay_location() then '0 seconds'::interval
else now()-pg_last_xact_replay_timestamp() end as log_delay;

But I'm wrapping my head around some replication issues myself so others
may have better ideas or corrections.

Cheers,
Steve


Re: Calculating Replication Lag - units

From
David Kerr
Date:
On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
- On 06/25/2012 01:17 PM, David Kerr wrote:
- >Howdy,
- >
- >When calculating Replication lag, I know that we have to compare the
- >pg_current_xlog_location
- >to pg_last_xlog_receive_location, etc. but what I'm trying to figure out
- >is what are
- >the units that I'm left with after the calculation.
- >
- >(i.e., does the xlog_location imply some time value?)
- >
- >Here's the output of the (slightly modified script)
- >Master: 5003964876715
- >Receive: 5003964876715
- >Replay: 5003964765203
- >
- >receive.value 0
- >apply.value 111512
- >
- >111512 isn't inherently useful to me on its own.
- >
- >Any tips?
- >
- How about now()-pg_last_xact_replay_timestamp() (however this can be a
- large number if there have not been any recent transactions on the
- master). I suppose you could do something like:
-
- case when pg_last_xlog_receive_location() =
- pg_last_xlog_replay_location() then '0 seconds'::interval
- else now()-pg_last_xact_replay_timestamp() end as log_delay;

i don't know for sure that 111512 is a time value.. that's kind of
what i'm wondering. If i knew that it was like miliseconds or something
that would be helpful.

- But I'm wrapping my head around some replication issues myself so others
- may have better ideas or corrections.

I've been fairly successful with replication so I'm happy to help there.
Just trying to shore up my monitoring now!

Re: Calculating Replication Lag - units

From
Raghavendra
Date:

On Tue, Jun 26, 2012 at 1:47 AM, David Kerr <dmk@mr-paradox.net> wrote:
Howdy,

When calculating Replication lag, I know that we have to compare the pg_current_xlog_location
to pg_last_xlog_receive_location, etc. but what I'm trying to figure out is what are
the units that I'm left with after the calculation.

(i.e., does the xlog_location imply some time value?)

Here's the output of the (slightly modified script)
Master: 5003964876715
Receive: 5003964876715
Replay: 5003964765203

receive.value 0
apply.value 111512

111512 isn't inherently useful to me on its own.

Any tips?

A common method I did in Oracle, I followed the same, I may be wrong in calculating exactly. Someone would have better solution on lag calculation.

My checking goes like this, Since its streaming replication, every DML should be replicated as fast it could to slave.

1. Create table on master as Lagcheck(lagtime timestamp) and insert one row with now() (current_time of server).
2. every minute update the same row with latest time by putting entry in cronjob
3. Step 2 will be replicated to Slave (i.e., SR box).
4. Now on slave calculate the lag by now() - lagcheck.lagtime(column which has value of Master time).

Here you get the time how much slave is behind from master.
Note: Special attention required on Timezones.

---
Regards,
Raghavendra
EnterpriseDB Corporation

Re: Calculating Replication Lag - units

From
David Kerr
Date:
On 6/25/2012 9:55 PM, Raghavendra wrote:
>
> On Tue, Jun 26, 2012 at 1:47 AM, David Kerr <dmk@mr-paradox.net
> <mailto:dmk@mr-paradox.net>> wrote:
>
>     Howdy,
>
>     When calculating Replication lag, I know that we have to compare the
>     pg_current_xlog_location
>     to pg_last_xlog_receive_location, etc. but what I'm trying to figure
>     out is what are
>     the units that I'm left with after the calculation.
>
>     (i.e., does the xlog_location imply some time value?)
>
>     Here's the output of the (slightly modified script)
>     Master: 5003964876715
>     Receive: 5003964876715
>     Replay: 5003964765203
>
>     receive.value 0
>     apply.value 111512
>
>     111512 isn't inherently useful to me on its own.
>
>     Any tips?
>
> A common method I did in Oracle, I followed the same, I may be wrong in
> calculating exactly. Someone would have better solution on lag calculation.
>
> My checking goes like this, Since its streaming replication, every DML
> should be replicated as fast it could to slave.
>
> 1. Create table on master as Lagcheck(lagtime timestamp) and insert one
> row with now() (current_time of server).
> 2. every minute update the same row with latest time by putting entry in
> cronjob
> 3. Step 2 will be replicated to Slave (i.e., SR box).
> 4. Now on slave calculate the lag by now() - lagcheck.lagtime(column
> which has value of Master time).
>
> Here you get the time how much slave is behind from master.
> Note: Special attention required on Timezones.

Well, I think the way I'm doing it is the correct way, it's even that
way in Simon's book. I'm just not sure what the # is.. is it miliseconds
since last update on master. or just some arbitrary number.

Re: Calculating Replication Lag - units

From
Stuart Bishop
Date:
On Tue, Jun 26, 2012 at 6:21 AM, David Kerr <dmk@mr-paradox.net> wrote:
> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
> - On 06/25/2012 01:17 PM, David Kerr wrote:
> - >Howdy,
> - >
> - >When calculating Replication lag, I know that we have to compare the
> - >pg_current_xlog_location
> - >to pg_last_xlog_receive_location, etc. but what I'm trying to figure out
> - >is what are
> - >the units that I'm left with after the calculation.
> - >
> - >(i.e., does the xlog_location imply some time value?)
> - >
> - >Here's the output of the (slightly modified script)
> - >Master: 5003964876715
> - >Receive: 5003964876715
> - >Replay: 5003964765203
> - >
> - >receive.value 0
> - >apply.value 111512
> - >
> - >111512 isn't inherently useful to me on its own.
> - >
> - >Any tips?
> - >
> - How about now()-pg_last_xact_replay_timestamp() (however this can be a
> - large number if there have not been any recent transactions on the
> - master). I suppose you could do something like:
> -
> - case when pg_last_xlog_receive_location() =
> - pg_last_xlog_replay_location() then '0 seconds'::interval
> - else now()-pg_last_xact_replay_timestamp() end as log_delay;
>
> i don't know for sure that 111512 is a time value.. that's kind of
> what i'm wondering. If i knew that it was like miliseconds or something
> that would be helpful.

On the hot standby:

    SELECT now()-pg_last_xact_replay_timestamp() AS lag;

This gives you the lag time as a PostgreSQL interval.

(It also might give you a value if you run it on a database that is
not a hot standby if it started in recovery mode).

It seems difficult or impossible to calculate this on the master.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: Calculating Replication Lag - units

From
David Kerr
Date:
On 06/26/2012 05:11 AM, Stuart Bishop wrote:
> On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net>  wrote:
>> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
>> - On 06/25/2012 01:17 PM, David Kerr wrote:
>> ->Howdy,
>> ->
>> ->When calculating Replication lag, I know that we have to compare the
>> ->pg_current_xlog_location
>> ->to pg_last_xlog_receive_location, etc. but what I'm trying to figure out
>> ->is what are
>> ->the units that I'm left with after the calculation.
>> ->
>> ->(i.e., does the xlog_location imply some time value?)
>> ->
>> ->Here's the output of the (slightly modified script)
>> ->Master: 5003964876715
>> ->Receive: 5003964876715
>> ->Replay: 5003964765203
>> ->
>> ->receive.value 0
>> ->apply.value 111512
>> ->
>> ->111512 isn't inherently useful to me on its own.
>> ->
>> ->Any tips?
>> ->
>> - How about now()-pg_last_xact_replay_timestamp() (however this can be a
>> - large number if there have not been any recent transactions on the
>> - master). I suppose you could do something like:
>> -
>> - case when pg_last_xlog_receive_location() =
>> - pg_last_xlog_replay_location() then '0 seconds'::interval
>> - else now()-pg_last_xact_replay_timestamp() end as log_delay;
>>
>> i don't know for sure that 111512 is a time value.. that's kind of
>> what i'm wondering. If i knew that it was like miliseconds or something
>> that would be helpful.
>
> On the hot standby:
>
>      SELECT now()-pg_last_xact_replay_timestamp() AS lag;
>
> This gives you the lag time as a PostgreSQL interval.
>
> (It also might give you a value if you run it on a database that is
> not a hot standby if it started in recovery mode).
>
> It seems difficult or impossible to calculate this on the master.
>
>

Ah, awesome. I don't need to calculate it on the master so that's perfect.

Thanks!

Re: Calculating Replication Lag - units

From
David Kerr
Date:
On 06/26/2012 05:11 AM, Stuart Bishop wrote:
> On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net>  wrote:
>> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
>> - On 06/25/2012 01:17 PM, David Kerr wrote:
>> ->Howdy,
>> ->
>> ->When calculating Replication lag, I know that we have to compare the
>> ->pg_current_xlog_location
>> ->to pg_last_xlog_receive_location, etc. but what I'm trying to figure out
>> ->is what are
>> ->the units that I'm left with after the calculation.
>> ->
>> ->(i.e., does the xlog_location imply some time value?)
>> ->
>> ->Here's the output of the (slightly modified script)
>> ->Master: 5003964876715
>> ->Receive: 5003964876715
>> ->Replay: 5003964765203
>> ->
>> ->receive.value 0
>> ->apply.value 111512
>> ->
>> ->111512 isn't inherently useful to me on its own.
>> ->
>> ->Any tips?
>> ->
>> - How about now()-pg_last_xact_replay_timestamp() (however this can be a
>> - large number if there have not been any recent transactions on the
>> - master). I suppose you could do something like:
>> -
>> - case when pg_last_xlog_receive_location() =
>> - pg_last_xlog_replay_location() then '0 seconds'::interval
>> - else now()-pg_last_xact_replay_timestamp() end as log_delay;
>>
>> i don't know for sure that 111512 is a time value.. that's kind of
>> what i'm wondering. If i knew that it was like miliseconds or something
>> that would be helpful.
>
> On the hot standby:
>
>      SELECT now()-pg_last_xact_replay_timestamp() AS lag;
>
> This gives you the lag time as a PostgreSQL interval.
>
> (It also might give you a value if you run it on a database that is
> not a hot standby if it started in recovery mode).
>
> It seems difficult or impossible to calculate this on the master.
>
>

d'oh.

looking closer at the script I was using, it says in there that's it's
kb xlog, so not time but size. That explains a lot.

I like the the replay_timestamp method better though fits in better to
my management's question of "how much data would we lose?".

Thanks all.


Re: Calculating Replication Lag - units

From
Steve Crawford
Date:
On 06/26/2012 08:16 AM, David Kerr wrote:
> On 06/26/2012 05:11 AM, Stuart Bishop wrote:
>> On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net>  wrote:
>>> On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
>>> - On 06/25/2012 01:17 PM, David Kerr wrote:
>>> ->Howdy,
>>> ->
>>> ->When calculating Replication lag, I know that we have to compare the
>>> ->pg_current_xlog_location
>>> ->to pg_last_xlog_receive_location, etc. but what I'm trying to
>>> figure out
>>> ->is what are
>>> ->the units that I'm left with after the calculation.
>>> ->
>>> ->(i.e., does the xlog_location imply some time value?)
>>> ->
>>> ->Here's the output of the (slightly modified script)
>>> ->Master: 5003964876715
>>> ->Receive: 5003964876715
>>> ->Replay: 5003964765203
>>> ->
>>> ->receive.value 0
>>> ->apply.value 111512
>>> ->
>>> ->111512 isn't inherently useful to me on its own.
>>> ->
>>> ->Any tips?
>>> ->
>>> - How about now()-pg_last_xact_replay_timestamp() (however this can
>>> be a
>>> - large number if there have not been any recent transactions on the
>>> - master). I suppose you could do something like:
>>> -
>>> - case when pg_last_xlog_receive_location() =
>>> - pg_last_xlog_replay_location() then '0 seconds'::interval
>>> - else now()-pg_last_xact_replay_timestamp() end as log_delay;
>>>
>>> i don't know for sure that 111512 is a time value.. that's kind of
>>> what i'm wondering. If i knew that it was like miliseconds or something
>>> that would be helpful.
>>
>> On the hot standby:
>>
>>      SELECT now()-pg_last_xact_replay_timestamp() AS lag;
>>
>> This gives you the lag time as a PostgreSQL interval.
>>
>> (It also might give you a value if you run it on a database that is
>> not a hot standby if it started in recovery mode).
>>
>> It seems difficult or impossible to calculate this on the master.
>>
>>
>
> Ah, awesome. I don't need to calculate it on the master so that's
> perfect.
>
> Thanks!
>
>
But beware. If your master server has any idle time you may be misled by
the simple calculation. I was running a pgbench test on a replicated
pair of machines. It finished this morning sometime so the lag delay
shows 02:31:11.651118, a value that might set off alarm bells. That's
why I used the case statement to force the interval to 0 if the replay
is up-to-date.

I think it is still worthwhile to check pg_stat_replication on the
master to make sure that it is still *sending* logs and perhaps
cross-checking the current log position on the master with the replayed
log location on the standby to see if they are reasonably close.

An additional verification check I've toyed with is to have a cron
script on the master update a one-row one-column table that holds a
timestamp and checking that timestamp on the standby(s) to double-check
that it is not too-far out-of-date. (This would also force regular data
delivery to the standby so that pg_last_xact_replay_timestamp() should
not lag far behind on an otherwise idle server.)

Cheers,
Steve


Re: Calculating Replication Lag - units

From
David Kerr
Date:
On Tue, Jun 26, 2012 at 09:13:44AM -0700, Steve Crawford wrote:
- On 06/26/2012 08:16 AM, David Kerr wrote:
- >On 06/26/2012 05:11 AM, Stuart Bishop wrote:
- >>On Tue, Jun 26, 2012 at 6:21 AM, David Kerr<dmk@mr-paradox.net>  wrote:
- >>>On Mon, Jun 25, 2012 at 02:17:22PM -0700, Steve Crawford wrote:
- >>>- On 06/25/2012 01:17 PM, David Kerr wrote:
- >>>->Howdy,
- >>>->
- >>>->When calculating Replication lag, I know that we have to compare the
- >>>->pg_current_xlog_location
- >>>->to pg_last_xlog_receive_location, etc. but what I'm trying to
- >>>figure out
- >>>->is what are
- >>>->the units that I'm left with after the calculation.
- >>>->
- >>>->(i.e., does the xlog_location imply some time value?)
- >>>->
- >>>->Here's the output of the (slightly modified script)
- >>>->Master: 5003964876715
- >>>->Receive: 5003964876715
- >>>->Replay: 5003964765203
- >>>->
- >>>->receive.value 0
- >>>->apply.value 111512
- >>>->
- >>>->111512 isn't inherently useful to me on its own.
- >>>->
- >>>->Any tips?
- >>>->
- >>>- How about now()-pg_last_xact_replay_timestamp() (however this can
- >>>be a
- >>>- large number if there have not been any recent transactions on the
- >>>- master). I suppose you could do something like:
- >>>-
- >>>- case when pg_last_xlog_receive_location() =
- >>>- pg_last_xlog_replay_location() then '0 seconds'::interval
- >>>- else now()-pg_last_xact_replay_timestamp() end as log_delay;
- >>>
- >>>i don't know for sure that 111512 is a time value.. that's kind of
- >>>what i'm wondering. If i knew that it was like miliseconds or something
- >>>that would be helpful.
- >>
- >>On the hot standby:
- >>
- >>     SELECT now()-pg_last_xact_replay_timestamp() AS lag;
- >>
- >>This gives you the lag time as a PostgreSQL interval.
- >>
- >>(It also might give you a value if you run it on a database that is
- >>not a hot standby if it started in recovery mode).
- >>
- >>It seems difficult or impossible to calculate this on the master.
- >>
- >>
- >
- >Ah, awesome. I don't need to calculate it on the master so that's
- >perfect.
- >
- >Thanks!
- >
- >
- But beware. If your master server has any idle time you may be misled by
- the simple calculation. I was running a pgbench test on a replicated
- pair of machines. It finished this morning sometime so the lag delay
- shows 02:31:11.651118, a value that might set off alarm bells. That's
- why I used the case statement to force the interval to 0 if the replay
- is up-to-date.
-
- I think it is still worthwhile to check pg_stat_replication on the
- master to make sure that it is still *sending* logs and perhaps
- cross-checking the current log position on the master with the replayed
- log location on the standby to see if they are reasonably close.
-
- An additional verification check I've toyed with is to have a cron
- script on the master update a one-row one-column table that holds a
- timestamp and checking that timestamp on the standby(s) to double-check
- that it is not too-far out-of-date. (This would also force regular data
- delivery to the standby so that pg_last_xact_replay_timestamp() should
- not lag far behind on an otherwise idle server.)
-
- Cheers,
- Steve
-

I see, thanks. I was completely missing that you were using different functions
than me and that was throwing me off. =)