Nagios plugin to check slony replication - Mailing list pgsql-general
From | John Sidney-Woollett |
---|---|
Subject | Nagios plugin to check slony replication |
Date | |
Msg-id | 4221EE0D.5010502@wardbrook.com |
Responses | Re: [Slony1-general] Nagios plugin to check slony replication |
List | pgsql-general |
I've finally got around to writing the two Nagios plugins that I am using to check our Slony cluster (on our Linux servers). I'm posting them in case anyone else wants them, or wants to use them as a basis for something else. They are based on Christopher Browne's scripts that ship with Slony.

The two scripts perform different tasks.

check_slon checks that the slon daemon is in the process list and optionally checks whether the last entry in the slon log file is a warning or fatal message. It is called with two or three parameters: the clustername, the dbname and (optionally) the location of the log file. This script is to be executed on each node in the cluster (both master and slaves).

check_sloncluster checks that active receiver nodes are confirming syncs within 10 seconds of the master. I'm not entirely sure that this is the best strategy, and if you know otherwise, I'd love to hear. It requires two parameters: the clustername and the dbname. This script is executed on the master database only.

These scripts are designed to run on the host that they are checking. With a little modification, they could check remote servers on the network.

They are quite simplistic and may not be suitable for your environment. You are free to modify the code to suit your own needs.
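As a rough illustration of the remote-checking modification mentioned above, the sketch below builds psql arguments with an optional host. It would only cover the SQL-based check, since check_sloncluster talks to the database through psql; checking the slon process on another machine would need a different mechanism. The function name psql_remote_args and the DBHOST parameter are assumptions made for this example and are not part of the posted scripts. -h, -U and --tuples-only are standard psql options.

```shell
#!/bin/sh
# Sketch only: psql_remote_args and DBHOST are hypothetical names,
# not part of the original plugins.

# Print the psql arguments for a local or a remote check.
#   $1 = DBNAME - name of the database to query
#   $2 = DBHOST - (optional) host to connect to; empty means local
psql_remote_args() {
    dbname=$1
    dbhost=$2
    if [ -n "$dbhost" ]
    then
        # -h makes psql connect to the named host over the network
        echo "--tuples-only -U postgres -h $dbhost $dbname"
    else
        echo "--tuples-only -U postgres $dbname"
    fi
}
```

A check could then run something like psql $(psql_remote_args "$DBNAME" "$DBHOST") -c "$SQL", relying on word splitting, which assumes that neither name contains whitespace.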
John Sidney-Woollett

check_slon
==========

#!/bin/sh
# nagios plugin that checks whether the slon daemon is running
# if the 3rd parameter (LOGFILE) is specified then the log file is
# checked to see if the last entry is a WARN or FATAL message
#
# three possible exit statuses:
#  0 = OK
#  1 = Warning (warning in slon log file)
#  2 = Fatal Error (slon not running, or error in log file)
#
# script requires two or three parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of database being replicated
# LOGFILE - (optional) location of the slon log file
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid (fewer than 2 or more than 3 is an error)
if [ $# -lt 2 ] || [ $# -gt 3 ]
then
    echo "Invalid parameters need CLUSTERNAME DBNAME [LOGFILE]"
    exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2
LOGFILE=$3

# check to see whether the slon daemon is running
# ([s]lon stops the grep process itself from matching)
SLONPROCESS=`ps auxww | grep -E "[s]lon $CLUSTERNAME" | grep -E "dbname=$DBNAME" | awk '{print $2}'`

if [ -z "$SLONPROCESS" ]
then
    echo "no slon process active"
    exit 2
fi

# if the logfile is specified, check it exists
# and check for the word FATAL or WARN at the start of the last line
if [ -n "$LOGFILE" ]
then
    # check for log file
    if [ -f "$LOGFILE" ]
    then
        LOGLINE=`tail -n 1 "$LOGFILE"`
        LOGSTATUS=`echo "$LOGLINE" | awk '{print $1}'`
        if [ "$LOGSTATUS" = "FATAL" ]
        then
            echo "$LOGLINE"
            exit 2
        elif [ "$LOGSTATUS" = "WARN" ]
        then
            echo "$LOGLINE"
            exit 1
        fi
    else
        echo "$LOGFILE not found"
        exit 2
    fi
fi

# otherwise all looks to be OK
echo "OK - slon process $SLONPROCESS"
exit 0

check_sloncluster
=================

#!/bin/sh
# nagios plugin that checks whether the slave nodes in a slony cluster
# are being updated from the master
#
# possible exit statuses:
#  0 = OK
#  2 = Error, one or more slave nodes are not sync'ing with the master
#
# script requires two parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of master database
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid
if [ $# -ne 2 ]
then
    echo "Invalid parameters need CLUSTERNAME DBNAME"
    exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2

# setup the query to check the replication status
SQL="select case
  when ttlcount = okcount
    then 'OK - '||okcount||' nodes in sync'
  else 'ERROR - '||(ttlcount-okcount)||' of '||ttlcount||' nodes not in sync'
end as syncstatus
from (
  -- determine total active receivers
  select (select count(distinct sub_receiver)
          from _$CLUSTERNAME.sl_subscribe
          where sub_active = true) as ttlcount,
  (
    -- determine active nodes syncing within 10 seconds
    select count(*) from (
      select st_received, st_last_received_ts - st_last_event_ts as cfmdelay
      from _$CLUSTERNAME.sl_status
      where st_received in (
        select distinct sub_receiver
        from _$CLUSTERNAME.sl_subscribe
        where sub_active = true
      )
    ) as t1
    where cfmdelay < interval '10 secs') as okcount
) as t2"

# query the master database
CHECK=`/usr/local/pgsql/bin/psql -c "$SQL" --tuples-only -U postgres "$DBNAME"`

if [ -z "$CHECK" ]
then
    echo "ERROR querying $DBNAME"
    exit 2
fi

# echo the result of the query
# (left unquoted so the padding psql adds is collapsed)
echo $CHECK

# and check the return status
STATUS=`echo $CHECK | awk '{print $1}'`
if [ "$STATUS" = "OK" ]
then
    exit 0
else
    exit 2
fi
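
The log check in check_slon can be pulled out into a small function, which makes it possible to try the logic against sample log files without a running slon daemon. The sketch below restates that logic under assumptions and is not the posted script: the function name slon_log_status is hypothetical, and it keeps the original's assumption that the first field of the last log line is the message level.

```shell
#!/bin/sh
# Sketch only: slon_log_status is a hypothetical helper name.

# Inspect the last line of a slon log file and return a Nagios status:
#   0 = last line is neither WARN nor FATAL
#   1 = last line starts with WARN  (the line is printed)
#   2 = last line starts with FATAL (the line is printed),
#       or the log file does not exist
slon_log_status() {
    logfile=$1
    if [ ! -f "$logfile" ]
    then
        echo "$logfile not found"
        return 2
    fi

    logline=`tail -n 1 "$logfile"`
    # assumption carried over from check_slon: field 1 is the level
    logstatus=`printf '%s\n' "$logline" | awk '{print $1}'`

    case "$logstatus" in
        FATAL) printf '%s\n' "$logline"; return 2 ;;
        WARN)  printf '%s\n' "$logline"; return 1 ;;
    esac
    return 0
}
```

Inside a plugin the result could be passed straight on, for example slon_log_status "$LOGFILE" || exit $?, so that the script only continues to its final OK message when the log looks clean.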