Home > mailing lists

Re: Better way to monitor for failed replication? - Mailing list pgsql-admin

From	Ron Johnson
Subject	Re: Better way to monitor for failed replication?
Date	January 9 20:53:51
Msg-id	CANzqJaBwJJj_-hoN1ZA21HLQi4KDdBYvO6FErCK7o61_MPmReg@mail.gmail.com Whole thread
In response to	Re: Better way to monitor for failed replication? (Keith <keith@keithf4.com>)
List	pgsql-admin

Tree view

On Fri, Jan 9, 2026 at 12:42 PM Keith <keith@keithf4.com> wrote:

On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
Currently, in a bash script, I run this SELECT statement against the Primary server which is supposed to replicate to multiple servers. If active == f, I send an alter email.

postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
from pg_replication_slots rs
left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
slot_name | active | client_hostname
--------------+--------+-----------------
pgstandby1 | t | BBOPITCPGS302B
replicate_dr | f |
(2 rows)

Is there a better way to check for replication that's supposed to be happening, but isn't (like PG on the replica was stopped for some reason)?

Your example only takes into account if you are using replication slots, correct? If you're always using those, this is definitely a good metric to have since the slot going down means WAL buildup, so I'd definitely keep it.

Yes, just replication slots.

As for general replication monitoring, these have been the two queries I use

On the Primary:

SELECT client_addr AS replica
, client_hostname AS replica_hostname
, client_port AS replica_port
, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
FROM pg_catalog.pg_stat_replication;

This checks for byte-lag for all active streaming replicas, physical or logical. A count of zero or NULL from this metric means all replicas are down. Can monitor a specific count if you have a known number of replicas.

On any Replica:

SELECT
CASE
WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, CASE
WHEN pg_is_in_recovery() = false THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS received_time;

This monitors the lag in seconds from the replica. Technically it monitors the last time a WAL file was received (received_time) and the last time WAL was actually replayed (replay_time). The reason for both is that the received time can be a false positive when there is no write activity on the primary. If there's always supposed to be write activity, this can be a another good metric to indicate that something is very wrong. The replay_time metric avoids the false positive by only being considered when receive is different than replay.

I'll integrate this into the lag report.

Death to <Redacted>, and butter sauce.

Don't boil me, I'm still alive.

<Redacted> lobster!

pgsql-admin by date:

From: Keith
Date: 09 January, 20:41:57
Subject: Re: Better way to monitor for failed replication?

From: Gabriel Guillem Barceló Soteras
Date: 13 January, 11:19:02
Subject: VACUUM/ANALYZE Strategy for Low-Activity PostgreSQL 15 Instance

Re: Better way to monitor for failed replication? - Mailing list pgsql-admin

Previous

Next