Thread: Logical replication lag in seconds
Hello! I currently use Slony for replication and want to switch to logical replication. With Slony I was aware of the replication lag of each slave in seconds. This info was available regardless if a slave was online or offline. For my application I need to know the "delay" of all replicas, wether they are currently active or not. I.e. a replica may be on the other side of the world, serving queries there, but has problems to connect to the master. Even in this case I need to know the replication status of the replica. With logical replication, it seems the only available data, eve in case a replica is offline, is in the pg_replication_slots table, for example: -[ RECORD 1 ]-------+--------------- slot_name | test_6 plugin | pgoutput slot_type | logical datoid | 16402 database | mydns temporary | f active | f active_pid | xmin | catalog_xmin | 1116191193 restart_lsn | 2EA/E61E7FA8 confirmed_flush_lsn | 2EA/E626F0B0 So I guess, the replication status of this replica is the "confirmed_flush_lsn", ie: 2EA/E626F0B0 But how to I get from 2EA/E626F0B0 to a timestamp when this transaction was added to the WAL files? I would appreciate any hints, or other methods to get the delay in some time format. thanks Klaus
I am very interested in this discussion. We settled a table with a single timestamp field that a script updates every minute with NOW() so that we can check the timestamp of that table on the replica, assuming the clocks are synced, then we will be able to compute the lag.
Hi Michael! Am 21.02.2020 um 21:24 schrieb Michael Lewis: > I am very interested in this discussion. We settled a table with a > single timestamp field that a script updates every minute with NOW() so > that we can check the timestamp of that table on the replica, assuming > the clocks are synced, then we will be able to compute the lag. I have a similar workaround at the moment. But it is more a hack than a nice solution, ie. I also have to store the last value locally to have the status also available if a replica is temporarily not reachable. Hence it would be great if the information could be retrieved from WAL/replication internals. regards Klaus
For the records - with a simple script I hacked a solution which is purely based on the server. 1. Create a table to track the timestamp of an lsn: CREATE TABLE lsn2date( lsn pg_lsn PRIMARY KEY, seen timestamp NOT NULL DEFAULT NOW() ); CREATE ROLE replication_lag_user WITH LOGIN PASSWORD 'xxx'; GRANT ALL ON TABLE lsn2date TO replication_lag_user; 2. Create a script which populates the table: # cat /etc/systemd/system/calculate_logical_replication_lag.service [Unit] Description=Start and auto restart service [Install] WantedBy=multi-user.target [Service] ExecStart=/usr/bin/php /path/to/calculate_logical_replication_lag.php Restart=always RestartSec=10 # cat calculate_logical_replication_lag.php <?php $dbuser="replication_lag_user"; $dbpass="xxx"; if (!$dbconn = pg_pconnect('host=127.0.0.1 dbname=mydb user='.$dbuser.' password='.$dbpass)) { print "Sorry, database connection failed"; exit; } $accuracy = 10; // in seconds while (1) { $dbq = pg_query("INSERT INTO lsn2date (lsn) VALUES (pg_current_wal_lsn())"); if ($dbq === FALSE) { mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n"); exit(1); } $dbq = pg_query("DELETE FROM lsn2date WHERE lsn < (". "SELECT lsn FROM lsn2date WHERE lsn < (". "SELECT confirmed_flush_lsn FROM pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1". ") ORDER BY lsn DESC LIMIT 1". ")" ); if ($dbq === FALSE) { mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n"); exit(1); } sleep($accuracy); } 3. Get the lag, using a function which compares the lsn of the replication_slots with the lsn/timestamp in the lsn2date table: CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE (subscriber name, lag bigint) AS $BODY$ DECLARE subscriber name; BEGIN FOR subscriber IN SELECT slot_name FROM pg_replication_slots LOOP RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM NOW()-seen)::bigint lag from lsn2date,pg_replication_slots WHERE slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC LIMIT 1; END LOOP; RETURN; END $BODY$ LANGUAGE plpgsql; # SELECT * FROM get_replication_lag() ; subscriber | lag ------------+----- reg_sjc1 | 0 reg_ffm1 | 0 reg_tst2 | 0 reg_mia1 | 0 reg_jbg1 | 0 reg_ams1 | 0 reg_syy1 | 0 reg_wie1 | 0 reg_hkg1 | 0 reg_gnf1 | 0 reg_tor1 | 0 reg_sea1 | 0 reg_chi1 | 0 reg_dfw1 | 0 reg_sgp1 | 0 reg_lhr1 | 0 regards Klaus