Re: Monitoring logical replication - Mailing list pgsql-general
From | Klaus Darilion |
---|---|
Subject | Re: Monitoring logical replication |
Date | |
Msg-id | fd211d7c53452d0c8ae2168039e6d55e@pernau.at Whole thread Raw |
In response to | Monitoring logical replication (Shaheed Haque <shaheedhaque@gmail.com>) |
Responses |
Re: Monitoring logical replication
|
List | pgsql-general |
Hi Saheed! I monitor our replication this way: 1. Every 10 seconds i fetch the current LSN and write it into a table, next with the current timestamp. Further I fetch confirmend LSNs from the replication slots and delete old entries in lsn2data table. calculate_logical_replication_lag.php: <?php $path = realpath(dirname(__FILE__) . "/../inc"); set_include_path($path . PATH_SEPARATOR . get_include_path()); require_once('config.php'); $config_int['syslogprefix'] = basename(__FILE__); require_once('logging.php'); $dbuser="replication_lag_user"; $dbpass="XXXXXXXXXXXXXXXXXXXX"; if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].' dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) { print "Sorry, database connection failed"; exit; } $accuracy = 10; // in seconds // // Preparations: // // CREATE TABLE lsn2data( // lsn pg_lsn PRIMARY KEY, // seen timestamp NOT NULL DEFAULT NOW() // ); // CREATE ROLE replication_lag_user WITH LOGIN PASSWORD 'XXXXXXXXXXXXXXXXXXX'; // GRANT ALL ON TABLE lsn2data TO replication_lag_user; // // CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE (subscriber name, lag bigint) AS // $BODY$ // DECLARE // subscriber name; // BEGIN // FOR subscriber IN // SELECT slot_name FROM pg_replication_slots // LOOP // RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC LIMIT 1; // END LOOP; // RETURN; // END // $BODY$ // LANGUAGE plpgsql; // while (1) { $dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES (pg_current_wal_lsn())"); if ($dbq === FALSE) { mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n"); exit(1); } $dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (". "SELECT lsn FROM lsn2data WHERE lsn < (". "SELECT confirmed_flush_lsn FROM pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1". ") ORDER BY lsn DESC LIMIT 1". ")" ); if ($dbq === FALSE) { mylog(LOG_ERROR, "SQL query error: ".pg_last_error()."\n"); exit(1); } sleep($accuracy); } 2. I graph the replications lags (converted from LSN to seconds) in my check_mk monitoring: #!/bin/bash # # Managed by Puppet: modules/base/files/monitoring/check_logical_replication_lag.sh # # Check the logical replication lag and export performance data for each subscriber # # exit on error #set -e #Make sure this script only runs one at a time ( ME=$0 MEBASE=`basename $0` mylog () { echo "$MEBASE: $1" logger -t "$MEBASE" "$1" } flock -x -w 1 200 if [ $? != "0" ]; then #echo "ERROR: $0 is already running ... exit" logger -t "$MEBASE" "ERROR: $0 is already running ... exit" exit 1 fi # Do stuff # Variablen fuer Monitoring CMK_SPOOLDIR=/var/lib/check_mk_agent/spool CMK_NAME=$MEBASE CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt CMK_HEADER="<<<local>>>" TMP_FILE="/tmp/logical_replication_lag.csv" # Schwellwerte warn=300 crit=600 final_output="$CMK_HEADER\nP $CMK_NAME " # move to a directory where user postgresl may reside (sudo) cd /tmp # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid. # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause. rm -f "$TMP_FILE" sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC) TO '$TMP_FILE' With CSV" 2>&1> /dev/null LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM get_replication_lag();" | tr -d ' ') if [ $LC == "0" ]; then echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE exit 0; fi grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null if [ $? != "0" ]; then echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not seem valid. Please check script $ME and output in $TMP_FILE" > $CMK_SPOOLDIR/$CMK_SPOOLFILE exit 1; fi # CSV in Array einlesen IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE" # Auswerten maxlag=0 for i in "${input_file[@]}"; do node=`echo $i | awk -F "," '{print $1}' | tr -- _ -` lag=`echo $i | awk -F "," '{print $2}'` final_output="$final_output$node=$lag;$warn;$crit|" # https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers maxlag=$(( maxlag > lag ? maxlag : lag )) done final_output="${final_output}max-lag=$maxlag;$warn;$crit" # Letztes Pipe Zeichen rausschneiden #final_output=`echo $final_output | rev | cut -c 2- | rev` # Spool File schreiben echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE logger -t "$MEBASE" "$final_output" ) 200>/tmp/`basename $0`.exclusivelock 3. During initial sync I check the status on the subscriber. Once it has synced all tables of the publication, it will send me an email. # # Managed by Puppet: modules/pdns/templates/check_pglogical_subscription.sh.erb # # # This script checks and eventually creates the subscription, and wait until the initial sync is finished # PUB=regdns2020_pub SLEEP=5 PREFIX=check_pglogical_subscription.sh NUMTAB=175 SECONDS=0 date while true; do echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t regdns | grep -q $PUB if [ $? -eq 0 ]; then echo "OK: Host is subscribed to '$PUB'. Checking for table count ..." break fi echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to master ..." logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'. Subscribing to master ..." echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION regdns2020_pub;" | sudo -u postgres psql regdns && touch /etc/regdns.schema_subscription.created echo "Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "Re-Checking in $SLEEP seconds ..." sleep $SLEEP done while true; do COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo -u postgres psql -t regdns | head -1 | xargs) if [ $COUNT -eq $NUMTAB ]; then echo "OK: Subscription '$PUB' contains $NUMTAB tables - that is OK. Checking for initial-sync status ..." logger -t $PREFIX "OK: Subscription '$PUB' contains $NUMTAB tables - that is OK. Checking for initial-sync status ..." break fi echo "ERROR: Subscription '$PUB' contains $COUNT tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." sleep $SLEEP done while true; do COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs) if [ $COUNTFIN -eq $NUMTAB ]; then echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to subscribe" -- root break fi echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables finished. Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables finished. Re-Checking in $SLEEP seconds ..." sleep $SLEEP done regards Klaus Am 2023-10-07 17:31, schrieb Shaheed Haque: > Hi, > > I've been playing with logical replication (currently on PG14), > specifically in an AWS RDS Postgres context, but NOT using AWS' own > replication tooling. I'm generally familiar with the challenges of > distributed systems (such causality, time synchronisation etc), but > not especially familiar with PG. > > In looking at how to tell how a given subscriber has caught up with > its publisher, there is plenty of advice around the Web, for example > https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn. > Like this example, much advice ends up talking about using separate > queries on the publisher and the subscriber to compare LSNs. First, (I > think) I understand the core difficulty that comparing LSNs is > inherently racy, but given that, I'm a bit unclear as to why a single > query on the publisher is not enough...IIUC: > > * Changes sent from the publisher to the subscriber are identified by > LSN. > * The publisher knows it's own current latest LSN > (pg_current_wal_lsn()), but this seems not to be exposed at the > subscriber. > * The subscriber knows what it has applied locally and even tells the > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem > to be exposed at the publisher. > > Have I missed something? Is there a way to track the LSN delta (given > that this is known to be racy) just by querying one end? > > Second, how do folk "know" when replication is "done". For example, if > the two LSNs continued to match for 1 * replication lag? Or N * > replication lag? What would be a plausible N? > > Third, as we know when logical replication is started, the initial > table state is captured in a snapshot, and sent across using COPY > TABLE under the covers. Now, let's say that the publisher is idle > (i.e. no SQL writes to the user's schema...obviously pg_catalog might > change as replication is configured and enabled) and that the > replication starts with the publisher as LSN_start. How could one know > when the copying is done: > > * I initially assumed that the publisher's LSN would not change from > LSN_start, but as the copying proceeds, I see that it DOES change > (presumably because there are updates happening to pg_catalog, such as > the temporary slots coming and going). > * Is there some kind of singleton state on either publisher or > subscriber that could be checked to know? (At the moment, I am > counting the records in all copied tables). > > I realise that the knowledge that the publisher is "idle" is a special > case, but right now, my test for being "done" is: > > * Number of records in copied tables matches AND the publisher's > pg_stat_subscription matches the subscriber's > pg_stat_subscription.latest_end_lsn. > > Plus or minus the bit about replication lag, is there a better way? > > Thanks, Shaheed
pgsql-general by date: