Currently, in a bash script, I run this SELECT statement against the Primary server which is supposed to replicate to multiple servers. If active == f, I send an alter email.
from pg_replication_slots rs left outer join pg_stat_replication sr on rs.active_pid = sr.pid; slot_name | active | client_hostname --------------+--------+----------------- pgstandby1 | t | BBOPITCPGS302B replicate_dr | f | (2 rows)
Is there a better way to check for replication that's supposed to be happening, but isn't (like PG on the replica was stopped for some reason)?
Your example only takes into account if you are using replication slots, correct? If you're always using those, this is definitely a good metric to have since the slot going down means WAL buildup, so I'd definitely keep it.
Yes, just replication slots.
As for general replication monitoring, these have been the two queries I use
On the Primary:
SELECT client_addr AS replica , client_hostname AS replica_hostname , client_port AS replica_port , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes FROM pg_catalog.pg_stat_replication;
This checks for byte-lag for all active streaming replicas, physical or logical. A count of zero or NULL from this metric means all replicas are down. Can monitor a specific count if you have a known number of replicas.
On any Replica:
SELECT CASE WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0 ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER END AS replay_time , CASE WHEN pg_is_in_recovery() = false THEN 0 ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER END AS received_time;
This monitors the lag in seconds from the replica. Technically it monitors the last time a WAL file was received (received_time) and the last time WAL was actually replayed (replay_time). The reason for both is that the received time can be a false positive when there is no write activity on the primary. If there's always supposed to be write activity, this can be a another good metric to indicate that something is very wrong. The replay_time metric avoids the false positive by only being considered when receive is different than replay.