I have a monitoring setup that has been working well for a while now. It uses
Mon, which is very simple and has been around for ages. For others who want to
use this (I can do a wiki article eventually), here's a brief howto. In my
case, every box is both a server and client so I check for several things, but
this can easily be changed.
The monitor checks three things. First, that the glusterfsd init script reports
"running". Second, that I can connect to the server port. Third, that every
gluster mount listed in /etc/fstab is currently mounted. If any of these checks
fails, Mon calls an "alert" which tries to fix things. The alert restarts
glusterfsd, reloads the fuse module, runs an ls on each mountpoint in /etc/fstab
and tries to unmount / remount any that respond with "Transport endpoint is not
connected" or are not mounted to begin with.
To set this up, here is what you need:
Install Mon on every server that you want to check. In your mon.cf file you need
to define a hostgroup that references localhost. The relevant part of my mon.cf
file is as follows:
###################################
### global options
cfbasedir = /cluster/mon ## note: yours will differ, this is a custom thing
pidfile = /var/run/mon.pid
statedir = /var/lib/mon/state.d
logdir = /var/lib/mon/log.d
dtlogfile = /var/lib/mon/log.d/downtime.log
alertdir = /usr/lib/mon/alert.d
mondir = /usr/lib/mon/mon.d
maxprocs = 20
histlength = 100
randstart = 60s

hostgroup localhost localhost

watch localhost
    service gluster
        interval 30s
        randskew 5s
        monitor gluster.monitor
        period hr {12am-11pm}
            alert fix_gluster.alert
            alert mail.alert -S "GlusterFS monitor is reporting failures" person@yourdomain.com
            numalerts 3
############################
This defines a group of hosts that includes only localhost, a service called
gluster that will run the monitor named gluster.monitor, and which will call the
alert named fix_gluster.alert if the monitor finds a problem.
Now in the monitors directory (you may have to look for it, some distros keep it
in a different place; look for all the "*.monitor" files), create this
monitor script named gluster.monitor and make it executable:
############################################
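#!/bin/bash
## gluster.monitor
MONITORS="/wherever/the/mon/monitors/are"
## Check the server
## reports status "running" ?
status=$(/etc/init.d/glusterfsd status)
## note: some init scripts print "not running" when stopped, which would
## also match a plain grep for "running", so reject that case explicitly
if ! echo "$status" | grep -q running || echo "$status" | grep -q 'not running'; then
    exit 1
fi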
## can connect to server port?
gluster_server_port=$(grep 'option transport.socket.listen-port' /etc/glusterfs/glusterfsd.vol | awk '{print $3}')
if ! "$MONITORS"/tcp.monitor -p "$gluster_server_port" localhost; then
    exit 1
fi
## Check the client mounts
for i in $(grep glusterfs /etc/fstab | awk '{print $2}')
do
if ! mount | grep '^glusterfs' | grep -q "$i"; then
    #this mount isn't there
    exit 1
fi
## a more detailed check could go here, like an ls or attempted write
## we are only checking for the mount at this point
done
exit 0
####################################
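The placeholder comment in gluster.monitor mentions a more detailed check, like
an ls on the mount. A minimal sketch of what that could look like follows. The
function names gluster_mounts and mount_responds are made up for illustration
and are not part of Mon or GlusterFS. It assumes the coreutils timeout command
is available and that gluster entries in /etc/fstab use "glusterfs" as the
filesystem type in the third field, so commented-out lines are skipped.

```shell
#!/bin/sh
# Hypothetical helpers for a deeper client-side check (not part of Mon).

# Print the mount point of every glusterfs entry in an fstab-style file.
# Lines whose first field starts with '#' are ignored.
# $1: path to the fstab file (defaults to /etc/fstab)
gluster_mounts() {
    awk '$1 !~ /^#/ && $3 == "glusterfs" { print $2 }' "${1:-/etc/fstab}"
}

# Return 0 if the directory can be listed within the time limit,
# non-zero if ls fails or hangs (a dead mount can block ls for a long time).
# $1: directory to test, $2: timeout in seconds (defaults to 10)
mount_responds() {
    timeout "${2:-10}" ls "$1" > /dev/null 2>&1
}
```

Inside the monitor's loop, a line such as `mount_responds "$i" || exit 1` at the
placeholder comment would then fail the check for a mount that is present but
not answering.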
Now in the alerts directory (look for all the "*.alert" files, usually
adjacent to the monitors dir), create this alert script named fix_gluster.alert
and make it executable:
####################################
#!/bin/bash
# fix_gluster.alert
logger -t fix_gluster.alert "Attempting glusterfs repair:"
/etc/init.d/glusterfsd stop > /dev/null 2>&1
/etc/init.d/glusterfsd start > /dev/null 2>&1
modprobe fuse
mount -a
## get all gluster mount points listed in /etc/fstab
for i in `cat /etc/fstab | grep glusterfs | awk '{print $2}'`
do
logger -t fix_gluster.alert "Checking mountpoint $i..."
ls -l $i > /dev/null 2>&1
if [[ ! `mount | grep ^glusterfs | grep -c $i` -gt 0 ]]; then
logger -t fix_gluster.alert "$i not mounted, attempt remount"
mount -a
fi
## the error message is printed on stderr, so redirect it into the pipe
if ls -l "$i" 2>&1 | grep -q 'Transport endpoint is not connected'; then
logger -t fix_gluster.alert "Mountpoint $i reports 'Transport endpoint is not connected'"
umount $i > /dev/null 2>&1
service glusterfsd stop > /dev/null 2>&1
service glusterfsd start > /dev/null 2>&1
mount -a
fi
ls -l $i > /dev/null 2>&1
if [[ $? == 0 ]]; then
logger -t fix_gluster.alert "Mountpoint $i repaired!"
fi
done
exit 0
#####################################
Now start Mon. Every 30 seconds it will check your glusterfs server process and
mount points and try to fix them if they error out or disappear. It will also
email you when this happens (at most three times per failure, given the
numalerts 3 setting) and log the repair attempt to syslog. Since most distros
have Mon in their repositories, it should be easy to install and set up, and
once you have created these config files it is a simple matter to copy them to
any server you install gluster on.
I personally use Mon for checking all kinds of things. It is so easy to modify
the scripts that you have almost unlimited flexibility in what you can do. The
basic idea in Mon is that a monitor does some kind of check on some group of
nodes and exits either 0 or 1. If it exits 1, any specified alerts run next. If
it exits 0, Mon waits until the next interval and runs the monitor again.
Writing monitors and alerts is very easy, as you can see from the scripts above.
Bottom line, though: this has worked very well for me in terms of restarting /
remounting glusterfs whenever the need arises.
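To illustrate the exit-code contract described above, here is a rough skeleton
for a new monitor. It is only a sketch: check_host and run_monitor are names
invented for this example, and the getent lookup is a stand-in for whatever
test you actually care about. It relies on Mon passing the members of the
hostgroup as trailing arguments to the monitor and treating the first line of
output as the failure summary.

```shell
#!/bin/sh
# Hypothetical monitor skeleton (names are illustrative, not from Mon).

# Placeholder check: succeeds if the host name resolves.
# Replace the body with the real test for your service.
check_host() {
    getent hosts "$1" > /dev/null 2>&1
}

# Run check_host against every host given as an argument.
# Prints the failed hosts on one line and returns 1 if any check failed,
# otherwise prints nothing and returns 0.
run_monitor() {
    failed=""
    for host in "$@"; do
        if ! check_host "$host"; then
            failed="$failed $host"
        fi
    done
    if [ -n "$failed" ]; then
        echo "${failed# }"
        return 1
    fi
    return 0
}

# A real monitor script would end with:
#   run_monitor "$@"
#   exit $?
```

Running the finished script by hand with a host name and then looking at `$?`
is a quick way to see what Mon will see before wiring it into mon.cf.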
Feel free to email / post any questions and I will try to help anyone who wants
to implement this.
Chris