Home > mailing lists

Re: Better way to monitor for failed replication? - Mailing list pgsql-admin

From	Keith
Subject	Re: Better way to monitor for failed replication?
Date	January 9 20:41:57
Msg-id	CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com Whole thread Raw
In response to	Better way to monitor for failed replication? (Ron Johnson <ronljohnsonjr@gmail.com>)
Responses	Re: Better way to monitor for failed replication?
List	pgsql-admin

Tree view

On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

Currently, in a bash script, I run this SELECT statement against the Primary server which is supposed to replicate to multiple servers. If active == f, I send an alter email.

postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
from pg_replication_slots rs
left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
slot_name | active | client_hostname
--------------+--------+-----------------
pgstandby1 | t | BBOPITCPGS302B
replicate_dr | f |
(2 rows)

Is there a better way to check for replication that's supposed to be happening, but isn't (like PG on the replica was stopped for some reason)?

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Your example only takes into account if you are using replication slots, correct? If you're always using those, this is definitely a good metric to have since the slot going down means WAL buildup, so I'd definitely keep it. As for general replication monitoring, these have been the two queries I use

On the Primary:

SELECT client_addr AS replica
, client_hostname AS replica_hostname
, client_port AS replica_port
, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
FROM pg_catalog.pg_stat_replication;

This checks for byte-lag for all active streaming replicas, physical or logical. A count of zero or NULL from this metric means all replicas are down. Can monitor a specific count if you have a known number of replicas.

On any Replica:

SELECT
CASE
WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, CASE
WHEN pg_is_in_recovery() = false THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
END
AS received_time;

This monitors the lag in seconds from the replica. Technically it monitors the last time a WAL file was received (received_time) and the last time WAL was actually replayed (replay_time). The reason for both is that the received time can be a false positive when there is no write activity on the primary. If there's always supposed to be write activity, this can be a another good metric to indicate that something is very wrong. The replay_time metric avoids the false positive by only being considered when receive is different than replay. This metric also works when you're doing WAL-replay replication instead of streaming.

pgsql-admin by date:

From: Ron Johnson
Date: 09 January, 18:50:28
Subject: Better way to monitor for failed replication?

From: Ron Johnson
Date: 09 January, 20:53:51
Subject: Re: Better way to monitor for failed replication?

Re: Better way to monitor for failed replication? - Mailing list pgsql-admin

Previous

Next