Re: standby apply lag on inactive servers - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: standby apply lag on inactive servers
Msg-id 20200127203419.GA15216@alvherre.pgsql
In response to standby apply lag on inactive servers  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: standby apply lag on inactive servers  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 2020-Jan-10, Alvaro Herrera wrote:

> A customer of ours complained that if you have an inactive primary,
> monitoring the apply lag on a standby reports monotonically increasing
> lag.  The reason for this is that the apply lag is only updated on
> COMMIT records, which of course don't occur in inactive servers.
> But CHECKPOINT records do occur, so the WAL insert pointer continues to
> move forward, which is what causes the spurious lag.
> 
> (I think newer releases are protected from this problem because they
> don't emit checkpoints during periods of inactivity.  I didn't verify
> this.)
> 
> This patch fixes the problem by using the checkpoint timestamp to update
> the lag tracker in the standby.  This requires a little change in where
> this update is invoked, because previously it was done only for the XACT
> rmgr; this makes the patch a little bigger than it should.

Here's a version of the patch that applies to current master.  It does
fix the problem that CHECKPOINT WAL records are not considered when
determining time-of-latest-record.

However, it does *not* fix the monitoring problem I mentioned (which
relied on comparing pg_last_xact_replay_timestamp() to 'now()') ...
because commit 6ef2eba3f57f (pg10) made an idle server not emit
checkpoint records anymore.  That is, my parenthetical remark was
completely wrong: not only are the newer versions not "protected", but
this fix doesn't help them either.  Luckily, the way to fix monitoring
for servers running version 10 or later is to use the new replay_lag
(etc) columns in pg_stat_replication, commit 6912acc04f0b (also pg10).
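
To illustrate the reasoning above, here is a toy model in Python (not
PostgreSQL code).  It assumes the standby replays every record instantly
and that monitoring computes lag as "now minus timestamp of the latest
record that counts".  The names naive_lag, COMMIT and CHECKPOINT are
made up for this sketch, and the 300-second checkpoint spacing is an
arbitrary choice.

```python
from typing import List, Optional, Tuple

# Hypothetical record kinds for this toy model; not PostgreSQL identifiers.
COMMIT = "commit"
CHECKPOINT = "checkpoint"

Record = Tuple[int, str]  # (timestamp in seconds, record kind)


def naive_lag(
    records: List[Record],
    sample_times: List[int],
    count_checkpoints: bool,
) -> List[Optional[int]]:
    """Lag reported at each sample time, computed as now minus the
    timestamp of the latest replayed record that is allowed to update
    the "last replay timestamp".

    records must be sorted by timestamp; replay is assumed instantaneous.
    """
    lags: List[Optional[int]] = []
    for now in sample_times:
        last = None
        for ts, kind in records:
            if ts > now:
                break  # not generated yet at this sample time
            if kind == COMMIT or (count_checkpoints and kind == CHECKPOINT):
                last = ts
        lags.append(None if last is None else now - last)
    return lags


samples = [100, 400, 700, 1000]

# Idle pre-10 primary: one commit, then only periodic checkpoints.
pre10_idle = [(0, COMMIT), (300, CHECKPOINT), (600, CHECKPOINT), (900, CHECKPOINT)]
# Idle primary on 10 or later: no checkpoint records while idle.
pg10_idle = [(0, COMMIT)]

print(naive_lag(pre10_idle, samples, count_checkpoints=False))  # [100, 400, 700, 1000]
print(naive_lag(pre10_idle, samples, count_checkpoints=True))   # [100, 100, 100, 100]
print(naive_lag(pg10_idle, samples, count_checkpoints=True))    # [100, 400, 700, 1000]
```

In this model, counting checkpoint timestamps bounds the reported lag
only when the idle primary still writes checkpoint records.  With no
records at all during idle periods, a "now() minus last timestamp"
calculation keeps growing either way, which is why the replay_lag
columns are the better source on 10 and later.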

I am inclined to apply this to all branches unless there are strong
objections, because the current code seems pretty arbitrary anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

