I have a monitoring setup that has been working well for a while now. It uses Mon, which is very simple and has been around for ages. For others who want to use this (I can do a wiki article eventually), here's a brief howto.

In my case, every box is both a server and a client, so I check for several things, but this can easily be changed. First, that the glusterfsd process reports "running". Second, that I can connect to the server port. Third, that every gluster mount listed in /etc/fstab is currently mounted.

If any of these checks fails, Mon calls an "alert" which tries to fix things. The alert will restart glusterfsd, reload the fuse module, run an ls on each mountpoint in /etc/fstab, and try to unmount / remount any that respond with "Transport endpoint is not connected" or are not mounted to begin with.

To set this up, here is what you need:

Install Mon on every server that you want to check. In your mon.cf file, define a hostgroup that references localhost. The relevant part of my mon.cf file is as follows:

###################################
### global options
## note: your cfbasedir will differ, this is a custom thing
cfbasedir   = /cluster/mon
pidfile     = /var/run/mon.pid
statedir    = /var/lib/mon/state.d
logdir      = /var/lib/mon/log.d
dtlogfile   = /var/lib/mon/log.d/downtime.log
alertdir    = /usr/lib/mon/alert.d
mondir      = /usr/lib/mon/mon.d
maxprocs    = 20
histlength  = 100
randstart   = 60s

hostgroup localhost localhost

watch localhost
    service gluster
        interval 30s
        randskew 5s
        monitor gluster.monitor
        period hr {12am-11pm}
            alert fix_gluster.alert
            alert mail.alert -S "GlusterFS monitor is reporting failures" person@yourdomain.com
            numalerts 3
############################

This defines a group of hosts that includes only localhost, and a service called gluster that runs the monitor named gluster.monitor every 30 seconds. If the monitor finds a problem, Mon calls the alert named fix_gluster.alert and sends mail through mail.alert.
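The third check described above (every gluster mount in /etc/fstab is actually mounted) can also be tried on its own before wiring it into Mon. Here is a minimal sketch, not the script I run in production: missing_gluster_mounts is a made-up function name, FSTAB and MOUNTS are hypothetical overrides so it can be pointed at sample files, and it assumes the fstab entries use "glusterfs" as the filesystem type. It only checks that something is mounted at each path.

```shell
#!/bin/bash
# Sketch only: print every glusterfs mountpoint from fstab that is not mounted.
# FSTAB and MOUNTS are hypothetical overrides for trying this on sample files;
# by default the function reads /etc/fstab and /proc/mounts.

missing_gluster_mounts() {
    local fstab="${FSTAB:-/etc/fstab}"
    local mounts="${MOUNTS:-/proc/mounts}"
    local mountpoint

    # In fstab, field 2 is the mountpoint and field 3 the filesystem type.
    # Comment lines are skipped so a commented-out entry is not required.
    awk '$1 !~ /^#/ && $3 == "glusterfs" {print $2}' "$fstab" |
    while read -r mountpoint; do
        # Exact match on field 2 of the mount table, so /mnt/a does not
        # accidentally match /mnt/ab.
        if ! awk -v m="$mountpoint" '$2 == m {found=1} END {exit !found}' "$mounts"; then
            echo "$mountpoint"
        fi
    done
}

# Possible use in a monitor: fail if anything is missing.
#   [ -z "$(missing_gluster_mounts)" ] || exit 1
```

An empty result means every listed mountpoint is present; anything printed is a mountpoint that would need a remount.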
Now in the monitors directory (you may have to look for it, some distros keep it in a different place; look for all the "*.monitor" files), create this monitor script named gluster.monitor (the contents between the two rows of #) and make it executable:

############################################
#!/bin/bash
## gluster.monitor
## exit 0 if everything is healthy, exit 1 to make Mon run the alerts

MONITORS="/wherever/the/mon/monitors/are"

## Check the server
## reports status "running"? (a "not running" message must not count)
status=$(/etc/init.d/glusterfsd status)
if ! echo "$status" | grep -v 'not running' | grep -q running; then
    exit 1
fi

## can connect to server port?
gluster_server_port=$(awk '/option transport.socket.listen-port/ {print $3}' /etc/glusterfs/glusterfsd.vol)
if ! "$MONITORS/tcp.monitor" -p "$gluster_server_port" localhost; then
    exit 1
fi

## Check the client mounts
## (fstab field 3 is the filesystem type; commented-out lines are skipped)
for i in $(awk '$1 !~ /^#/ && $3 == "glusterfs" {print $2}' /etc/fstab)
do
    if ! mount | grep '^glusterfs' | grep -q " on $i "; then
        # this mount isn't there
        exit 1
    fi
    ## a more detailed check could go here, like an ls or attempted write
    ## we are only checking for the mount at this point
done

exit 0
####################################

Now in the alerts directory (look for all the "*.alert" files, usually adjacent to the monitors dir), create this alert script named fix_gluster.alert and make it executable:

####################################
#!/bin/bash
# fix_gluster.alert

logger -t fix_gluster.alert "Attempting glusterfs repair:"
/etc/init.d/glusterfsd stop > /dev/null 2>&1
/etc/init.d/glusterfsd start > /dev/null 2>&1
modprobe fuse
mount -a

## get all gluster mount points listed in /etc/fstab
for i in $(awk '$1 !~ /^#/ && $3 == "glusterfs" {print $2}' /etc/fstab)
do
    logger -t fix_gluster.alert "Checking mountpoint $i..."
    ls -l "$i" > /dev/null 2>&1

    if ! mount | grep '^glusterfs' | grep -q " on $i "; then
        logger -t fix_gluster.alert "$i not mounted, attempt remount"
        mount -a
    fi

    ## ls prints this error on stderr, so merge it into stdout for grep
    if ls -l "$i" 2>&1 | grep -q 'Transport endpoint is not connected'; then
        logger -t fix_gluster.alert "Mountpoint $i reports 'Transport endpoint is not connected'"
        umount "$i" > /dev/null 2>&1
        service glusterfsd stop > /dev/null 2>&1
        service glusterfsd start > /dev/null 2>&1
        mount -a
    fi

    if ls -l "$i" > /dev/null 2>&1; then
        logger -t fix_gluster.alert "Mountpoint $i repaired!"
    fi
done

exit 0
#####################################

Now start Mon, and every 30 seconds it will check your glusterfs server process and mount points and try to fix them if they error out / disappear. It will also email you when this happens, and log it to syslog.

Since most distros have Mon in their repositories, this should be easy to install and set up. And once you have created these config files, it is a simple matter to copy them to any server you install gluster on.

I personally use Mon for checking all kinds of things. It is so easy to modify the scripts that you have unlimited flexibility in what you can do. The basic idea in Mon is that you have a monitor that does some kind of check on some group of nodes, and it can either exit 0 or 1. If it exits 1, any specified alerts will run next. If it exits 0, Mon waits until the next interval and runs it again. Writing monitors and alerts is very easy, as you can see from the scripts above.

Bottom line though, this has worked very well for me in terms of restarting / remounting glusterfs whenever the need arises. Feel free to email / post any questions and I will try to help anyone who wants to implement this.

Chris
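
For the "more detailed check" placeholder in gluster.monitor, here is a minimal sketch of what an ls-based test might look like. It is not part of my running setup: mount_is_healthy is a hypothetical helper name, the 10 second limit is an arbitrary assumption, and it assumes the timeout command from GNU coreutils is installed.

```shell
#!/bin/bash
# Sketch only: a deeper per-mountpoint health check.
# mount_is_healthy is a hypothetical helper, not part of Mon or GlusterFS.

mount_is_healthy() {
    local dir="$1"
    local limit="${2:-10}"   # seconds to wait before giving up (assumed default)

    # A dead FUSE mount can make ls fail ("Transport endpoint is not
    # connected") or hang; timeout tries to turn a hang into a non-zero
    # exit status instead of blocking the monitor.
    timeout "$limit" ls "$dir" > /dev/null 2>&1
}

# Possible use inside the monitor's loop over mountpoints:
#   mount_is_healthy "$i" || exit 1
```

Since Mon only looks at the exit status, any check that returns non-zero on failure can be dropped into the monitor this way.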