Re: Better way to monitor for failed replication? - Mailing list pgsql-admin

From Ron Johnson
Subject Re: Better way to monitor for failed replication?
Date
Msg-id CANzqJaBwJJj_-hoN1ZA21HLQi4KDdBYvO6FErCK7o61_MPmReg@mail.gmail.com
Whole thread Raw
In response to Re: Better way to monitor for failed replication?  (Keith <keith@keithf4.com>)
List pgsql-admin
On Fri, Jan 9, 2026 at 12:42 PM Keith <keith@keithf4.com> wrote:


On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
Currently, in a bash script, I run this SELECT statement against the Primary server which is supposed to replicate to multiple servers.  If active == f, I send an alter email.

postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
from pg_replication_slots rs
    left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
  slot_name   | active | client_hostname
--------------+--------+-----------------
 pgstandby1   | t      | BBOPITCPGS302B
 replicate_dr | f      |
(2 rows)

Is there a better way to check for replication that's supposed to be happening, but isn't (like PG on the replica was stopped for some reason)?
 

Your example only takes into account if you are using replication slots, correct? If you're always using those, this is definitely a good metric to have since the slot going down means WAL buildup, so I'd definitely keep it.

Yes, just replication slots.
 
As for general replication monitoring, these have been the two queries I use

On the Primary:

SELECT client_addr AS replica
        , client_hostname AS replica_hostname
        , client_port AS replica_port
        , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
        FROM pg_catalog.pg_stat_replication; 

This checks for byte-lag for all active streaming replicas, physical or logical. A count of zero or NULL from this metric means all replicas are down. Can monitor a specific count if you have a known number of replicas.

On any Replica:

SELECT
       CASE
       WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
       END
    AS replay_time
    ,  CASE
       WHEN pg_is_in_recovery() = false THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
       END
    AS received_time;

This monitors the lag in seconds from the replica. Technically it monitors the last time a WAL file was received (received_time) and the last time WAL was actually replayed (replay_time). The reason for both is that the received time can be a false positive when there is no write activity on the primary. If there's always supposed to be write activity, this can be a another good metric to indicate that something is very wrong. The replay_time metric avoids the false positive by only being considered when receive is different than replay.

I'll integrate this into the lag report. 


--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

pgsql-admin by date:

Previous
From: Keith
Date:
Subject: Re: Better way to monitor for failed replication?
Next
From: Gabriel Guillem Barceló Soteras
Date:
Subject: VACUUM/ANALYZE Strategy for Low-Activity PostgreSQL 15 Instance