I've finally got around to writing the two nagios plugins which I am
using to check our slony cluster (on our linux servers). I'm posting
them in case anyone else wants them or to use them as a basis for
something else. These are based on Christopher Browne's scripts that
ship with slony.
The two scripts perform different tasks.
check_slon checks that the slon daemon is in the process list and
optionally checks whether the last line of the slon log file is a WARN
or FATAL message. It is called using two or three parameters: the
clustername, the dbname and (optionally) the location of the log file.
This script is to be executed on each node in the cluster (both master
and slaves).
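The log check boils down to mapping the first word of the last log line
to a Nagios exit status. A minimal sketch of that mapping, pulled out
into a function (classify_log_line is a name made up for this
illustration and is not part of the script itself):

```shell
#!/bin/sh
# classify_log_line is a hypothetical helper, for illustration only.
# It prints the Nagios status for a single slon log line:
#   0 = OK, 1 = Warning, 2 = Fatal Error
classify_log_line() {
    # the first whitespace-separated word decides the status
    first_word=`printf '%s\n' "$1" | awk '{print $1}'`
    case "$first_word" in
        FATAL) echo 2 ;;
        WARN)  echo 1 ;;
        *)     echo 0 ;;
    esac
}
```

Calling it with the output of tail -n 1 on the log file gives the
status to exit with. An empty line falls through to 0.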
check_sloncluster checks that active receiver nodes are confirming syncs
within 10 seconds of the master. I'm not entirely sure that this is the
best strategy, and if you know otherwise, I'd love to hear. It requires
two parameters: the clustername and the dbname. This script is executed
on the master database only.
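The query reduces to comparing two counts: the number of active
receivers, and the number of those confirming within the threshold. The
same decision written in shell, as a sketch only (sync_status and its
two arguments are invented for this example; the script itself does the
comparison inside the SQL):

```shell
#!/bin/sh
# sync_status is a hypothetical helper, for illustration only.
# $1 = total active receivers, $2 = receivers confirming in time
# Prints the status line and returns the matching Nagios exit status.
sync_status() {
    total=$1
    ok=$2
    if [ "$total" -eq "$ok" ]
    then
        echo "OK - $ok nodes in sync"
        return 0
    fi
    echo "ERROR - $((total - ok)) of $total nodes not in sync"
    return 2
}
```

Note there is no warning state here: any lagging receiver is reported
as an error, which matches the two exit statuses the script uses.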
These scripts are designed to run on the host on which they are
checking. With a little modification, they could check remote servers on
the network. They are quite simplistic and may not be suitable for your
environment. You are free to modify the code to suit your own needs.
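For check_sloncluster, which only runs a query, the modification for
remote checks could be as small as passing a host to psql with -h. A
sketch under that assumption (DBHOST and build_psql_command are names
invented here; check_slon looks at the local process list and log file,
so it would need a different approach, such as running it through NRPE
or ssh):

```shell
#!/bin/sh
# build_psql_command is a hypothetical helper, for illustration only.
# $1 = database name
# $2 = DBHOST, an assumed optional parameter the posted script lacks
# Prints the psql command line that would be used for the check.
build_psql_command() {
    DBNAME=$1
    DBHOST=$2
    if [ -n "$DBHOST" ]
    then
        # remote check: connect over the network to DBHOST
        echo "/usr/local/pgsql/bin/psql -h $DBHOST --tuples-only -U postgres $DBNAME"
    else
        # local check: same command line as the posted script
        echo "/usr/local/pgsql/bin/psql --tuples-only -U postgres $DBNAME"
    fi
}
```

The remote server would also have to accept connections from the Nagios
host for the postgres user.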
John Sidney-Woollett
check_slon
==========
#!/bin/sh
# nagios plugin that checks whether the slon daemon is running
# if the 3rd parameter (LOGFILE) is specified then the log file is
# checked to see if the last entry is a WARN or FATAL message
#
# three possible exit statuses:
# 0 = OK
# 1 = Warning (warning in slon log file)
# 2 = Fatal Error (slon not running, or error in log file)
#
# script requires two or three parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of database being replicated
# LOGFILE - (optional) location of the slon log file
#
# Author: John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005
# check parameters are valid
if [ $# -lt 2 ] || [ $# -gt 3 ]
then
echo "Invalid parameters need CLUSTERNAME DBNAME [LOGFILE]"
exit 2
fi
# assign parameters
CLUSTERNAME=$1
DBNAME=$2
LOGFILE=$3
# check to see whether the slon daemon is running
SLONPROCESS=`ps -auxww | egrep "[s]lon $CLUSTERNAME" | egrep "dbname=$DBNAME" | awk '{print $2}'`
if [ ! -n "$SLONPROCESS" ]
then
echo "no slon process active"
exit 2
fi
# if the logfile is specified, check it exists
# and check for the word FATAL or WARN at the start of the last line
if [ -n "$LOGFILE" ]
then
# check for log file
if [ -f "$LOGFILE" ]
then
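# read the last line once, then take its first word as the status
LOGLINE=`tail -n 1 "$LOGFILE"`
LOGSTATUS=`echo "$LOGLINE" | awk '{print $1}'`
# quote LOGSTATUS so an empty log line does not break the test
if [ "$LOGSTATUS" = "FATAL" ]
then
echo "$LOGLINE"
exit 2
elif [ "$LOGSTATUS" = "WARN" ]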
then
echo "$LOGLINE"
exit 1
fi
else
echo "$LOGFILE not found"
exit 2
fi
fi
# otherwise all looks to be OK
echo "OK - slon process $SLONPROCESS"
exit 0
check_sloncluster
=================
#!/bin/sh
# nagios plugin that checks whether the slave nodes in a slony cluster
# are being updated from the master
#
# possible exit statuses:
# 0 = OK
# 2 = Error, one or more slave nodes are not sync'ing with the master
#
# script requires two parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of master database
#
# Author: John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005
# check parameters are valid
if [ $# -ne 2 ]
then
echo "Invalid parameters need CLUSTERNAME DBNAME"
exit 2
fi
# assign parameters
CLUSTERNAME=$1
DBNAME=$2
# setup the query to check the replication status
SQL="select case
when ttlcount = okcount then 'OK - '||okcount||' nodes in sync'
else 'ERROR - '||ttlcount-okcount||' of '||ttlcount||' nodes not in sync'
end as syncstatus
from (
-- determine total active receivers
select (select count(distinct sub_receiver)
from _$CLUSTERNAME.sl_subscribe
where sub_active = true) as ttlcount,
(
-- determine active nodes syncing within 10 seconds
select count(*) from (
select st_received, st_last_received_ts - st_last_event_ts as cfmdelay
from _$CLUSTERNAME.sl_status
where st_received in (
select distinct sub_receiver
from _$CLUSTERNAME.sl_subscribe
where sub_active = true
)
) as t1
where cfmdelay < interval '10 secs') as okcount
) as t2"
# query the master database
CHECK=`/usr/local/pgsql/bin/psql -c "$SQL" --tuples-only -U postgres "$DBNAME"`
if [ ! -n "$CHECK" ]
then
echo "ERROR querying $DBNAME"
exit 2
fi
# echo the result of the query
echo $CHECK
# and check the return status
STATUS=`echo $CHECK | awk '{print $1}'`
if [ "$STATUS" = "OK" ]
then
exit 0
else
exit 2
fi