Thread: Negative replication lag?

Negative replication lag?

From
Quentin Hartman
Date:
I'm using this script to check my replication lag on my streaming replication pairs with Nagios:

https://gist.github.com/jacobian/743942

It generally works fine, but it will occasionally return a negative lag value (-37kb, for example), which of course causes it to throw an alarm but is total nonsense. I've been working on the assumption that it is some sort of bug in the script, but on a quick look nothing jumps out at me.

Is there something in Postgres itself that could cause this to happen once in a while? Is it something to be concerned about? Is there a better way to monitor this state?

Thanks!

QH

Re: Negative replication lag?

From
Andres Freund
Date:
On 2013-04-22 16:36:38 -0600, Quentin Hartman wrote:
> I'm using this script to check my replication lag on my streaming
> replication pairs with Nagios:
>
> https://gist.github.com/jacobian/743942
>
> It generally works fine, but will occasionally return a negative lag value
> (-37kb for example) which of course causes it to throw an alarm, but is
> total nonsense. I've been working on the assumption that it is some sort of
> bug in the script, but in taking a quick look at it nothing jumps out at me.
>
> Is there something in Postgres itself that could cause this to happen once
> in a while? Is it something to be concerned about? Is there a better way to
> monitor this state?

Well, between the time pg_current_xlog_location() is run on the primary
and pg_last_xlog_replay_location() on the standby, some time passes, so
it's not all that unlikely that WAL has been generated, streamed *and*
applied in that time. The standby's replay position can then be ahead of
the position sampled earlier on the primary, which gives a negative
difference. Given the short timeframe, it only happens every now and then.
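
To make the timing issue concrete, here is a minimal Python sketch of the subtraction a check like this performs. The helper names `parse_location` and `lag_bytes` are made up for illustration, and the sample values are invented. The arithmetic assumes the `X/Y` location maps to a byte position as `(X << 32) + Y`, which is the layout in PostgreSQL 9.3 and later; older releases skipped part of each log file, so a script for those versions would need different arithmetic.

```python
def parse_location(location: str) -> int:
    """Convert an 'X/Y' WAL location string into an absolute byte position.

    Assumes the 9.3+ layout: X is the high 32 bits, Y the low 32 bits.
    """
    high, low = location.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_location: str, standby_location: str) -> int:
    """Primary write position minus standby replay position, in bytes."""
    return parse_location(primary_location) - parse_location(standby_location)


# Hypothetical samples: the primary is read first, then more WAL is generated
# and replayed before the standby is read, so the standby value is larger.
primary_sample = "16/B3000000"  # read at time t0 on the primary
standby_sample = "16/B3009400"  # read slightly later on the standby

print(lag_bytes(primary_sample, standby_sample))  # prints -37888
```

A result like this does not mean the standby is ahead of the primary; it only means the two samples were not taken at the same instant.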

Did you check the pg_stat_replication view on the primary?
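
For reference, a rough sketch of reading the lag from pg_stat_replication on the primary. Both positions come from one row read in a single query, so the gap between two separate connections does not come into play. The names in the query are the PostgreSQL 10+ spellings (`pg_wal_lsn_diff`, `sent_lsn`, `replay_lsn`); on 9.2 through 9.6 the equivalents are `pg_xlog_location_diff`, `sent_location` and `replay_location`. `fetch_replay_lag` is a hypothetical helper that expects any DB-API cursor (one from psycopg2, for instance) connected to the primary.

```python
# PostgreSQL 10+ names; for 9.2-9.6 substitute pg_xlog_location_diff(),
# sent_location and replay_location.
LAG_QUERY = """
SELECT application_name,
       client_addr,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""


def fetch_replay_lag(cursor):
    """Return (application_name, client_addr, lag_in_bytes) per standby.

    `cursor` is assumed to be a DB-API cursor on the primary. The lag is
    None when a standby has not reported a replay position yet.
    """
    cursor.execute(LAG_QUERY)
    return [
        (name, addr, int(lag) if lag is not None else None)
        for name, addr, lag in cursor.fetchall()
    ]
```

A Nagios check could then alarm on the largest non-None value in the returned list.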

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Negative replication lag?

From
Quentin Hartman
Date:
Ah, that makes sense. I think I'll add some logic to the script that has it get new data points if it comes up with a negative value.
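
A minimal sketch of that retry logic, assuming the script can be split into two caller-supplied functions that each return a server's WAL position as a byte offset. `sample_lag`, `read_primary`, `read_standby` and `max_attempts` are hypothetical names, not taken from the gist.

```python
from typing import Callable


def sample_lag(
    read_primary: Callable[[], int],
    read_standby: Callable[[], int],
    max_attempts: int = 3,
) -> int:
    """Return replication lag in bytes, resampling when the result is negative.

    read_primary and read_standby are caller-supplied functions that return
    the current WAL position of each server as a byte offset.
    """
    lag = 0
    for _ in range(max_attempts):
        lag = read_primary() - read_standby()
        if lag >= 0:
            return lag
    # Still negative after every attempt: the standby replayed past each
    # primary sample, so report zero lag rather than raising an alarm.
    return max(lag, 0)
```

Another option with a similar effect would be to read the standby first and the primary second. WAL generated between the two reads then inflates the lag slightly instead of pushing it below zero.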

Thanks for the insight.

QH


On Mon, Apr 22, 2013 at 5:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Well, between the time pg_current_xlog_location() is run on the primary
> and pg_last_xlog_replay_location() on the standby, some time passes, so
> it's not all that unlikely that WAL has been generated, streamed *and*
> applied in that time. Given the short timeframe, it only happens every
> now and then.
>
> Did you check the pg_stat_replication view on the primary?