Mark Nipper
2012-Jun-28  21:06 UTC
[Gluster-users] necessary improvements in documentation and monitoring
We are currently dipping our toes in Gluster using the
3.3 GA RPM packages.  So far most everything has been working
acceptably.  However, based on what we've run across so far, I
do have some suggested improvements you might consider to help
the project out long term.
	The first problem we ran into has to do with monitoring.
We're moving from a DRBD HA cluster, which exported LVM logical
volumes on top of DRBD over iSCSI, to Gluster, where our KVM
images are stored in qcow2 files.  Needless to say,
Gluster is a joy in comparison to working with the complexity of
the DRBD/HA/iSCSI stack.  But, Gluster seems to lack terribly in
the monitoring department.
	It's quite trivial to get the status of your DRBD volumes
looking at the output of /proc/drbd, even as an unprivileged
user.  To get something even resembling this in Gluster requires,
at the very least, root permissions.
	I'm attaching a script I wrote which I think (3.3 is our
first real attempt using Gluster, so I'm probably making some
gross assumptions about the health of the volumes by using this
script) is giving us at least partial visibility into the status
of our Gluster volumes via Nagios.  Hopefully the list allows the
shell script through.  If not, I'll post a URL to it elsewhere.
Please suggest any changes you think might make that script more
robust or useful.  Like I said, it's just the first pass at
trying to get something workable into Nagios.
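	As a minimal sketch of the peer check, the per-peer state
can be pulled out of a single run of "gluster peer status" rather
than one run per peer.  This assumes the 3.3 output format of
"Hostname:" and "State: ... (Connected)" lines that the attached
script also relies on.  The function name peer_states is
hypothetical.

```shell
#!/bin/bash
# Read `gluster peer status` output on stdin and print one "host/state"
# pair per peer, using a single awk pass instead of one CLI call per peer.
peer_states() {
	awk '
		/^Hostname: / { host = $2 }
		/^State: / {
			# the connection state is the text inside the trailing parentheses
			state = $0
			sub(/^.*\(/, "", state)
			sub(/\).*$/, "", state)
			print host "/" state
		}
	'
}
```

	With the sudo rule from the script in place, it could be fed
with: sudo /usr/sbin/gluster peer status | peer_states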
	Secondly, coming from a DRBD background, there are a few
things that seem like obvious omissions in the Gluster
Administration Guide.
	The first relates to the upgrade process.  Is there a
supported process for moving between minor point releases of
Gluster, and if so, what is it?  Can you upgrade one brick in, say, a
replicated volume to the next point release, reboot, wait for
synchronization or healing to occur, and then rinse and repeat
across the other bricks?  This isn't really spelled out anywhere
and that's quite distressing given that one of the major features
of Gluster is the always on capability.  The DRBD documentation,
as an example, points out that upgrades between minor point
releases are fully supported (no protocol breaking
incompatibilities for example) and clearly illustrates the
process of upgrading each node at a time.  I understand that this
would only be highly useful on volumes which are in fact
replicated somehow in Gluster.  Any other configuration and you'd
have to expect some kind of downtime obviously.
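	To make the "wait for synchronization or healing" step
concrete, here is a rough sketch of how one might block until a
volume reports no pending heal entries before moving on to the
next server.  It assumes, as the attached script does, that the
"Number of entries:" lines from "gluster volume heal VOL info" are
a usable indicator.  It says nothing about whether mixed point
releases are actually supported, and the function names are
hypothetical.

```shell
#!/bin/bash
# Sum the "Number of entries: N" lines from `gluster volume heal VOL info`
# output read on stdin; prints 0 when there are no such lines.
pending_heal_entries() {
	awk '/^Number of entries: / { total += $4 } END { print total + 0 }'
}

# Block until the named volume reports no pending heal entries.
# Assumes it runs as root on a server where the gluster CLI is installed.
# Caveat: a failed gluster call produces no output and therefore reads as 0,
# so a real version should also check the command's exit status.
wait_for_heal() {
	local vol=$1
	while [ "$(gluster volume heal "$vol" info | pending_heal_entries)" -gt 0 ]; do
		sleep 30
	done
}
```

	The idea would be to call wait_for_heal with the volume name
after each upgraded server comes back, and only then touch the
next one.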
	The second omission has to do with what might be simply a
design limitation currently.  In attempting to upgrade using a
rolling release approach as just discussed in the last paragraph,
there doesn't seem to be a clean way to shut down any given brick
in a volume.  Currently we're simply typing reboot on one of the
bricks, and then we seemingly get hit with network.ping-timeout:
the other server brick and all our KVM guests hang for around 50
seconds.  This was rather unexpected behavior
to say the least (especially with the other server brick hanging
completely as well; we're accustomed to the KVM guests themselves
hanging and waiting indefinitely for their iSCSI backed LVM
volume to come back during a DRBD transition for example).
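	For what it's worth, network.ping-timeout is a per-volume
option that can be changed with "gluster volume set".  The sketch
below only builds and prints the command so it can be reviewed
before being run as root.  The volume name vmstore and the value
10 are illustrative assumptions, not recommendations, and the
helper name ping_timeout_cmd is hypothetical.

```shell
#!/bin/bash
# Print (do not run) the command that would change network.ping-timeout
# on a volume.  The volume name and the timeout are supplied by the caller.
ping_timeout_cmd() {
	local vol=$1 secs=$2
	# accept only a positive integer number of seconds
	if ! [[ $secs =~ ^[1-9][0-9]*$ ]]; then
		echo "invalid timeout: $secs" >&2
		return 1
	fi
	echo "gluster volume set $vol network.ping-timeout $secs"
}

# Example: ping_timeout_cmd vmstore 10
```

	A shorter timeout shortens the hang but does not answer the
underlying question of how to take a brick down cleanly.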
	I see the detach option, but there isn't a reattach.  Do
you just re-add the brick into the volume?  Does it incur the
same amount of synchronization overhead as simply rebooting one
of the bricks or does it resynchronize the entire volume from
scratch?  Does it avoid the network.ping-timeout problem
completely?
	Again, this seems like an area where things could be more
clearly spelled out as it seems like something an administrator
would commonly be affected by in the routine maintenance of
servers.
	Those are the biggest issues we've seen for now.  We have
one KVM guest which keeps ending up with a read-only root file
system.  But since none of the other guests are doing it so far,
and I don't see any chatter in any of the Gluster logs about that
qcow2 file, I'm assuming for the time being that the disk image
was somehow already corrupt coming from the DRBD backed LVM
volume it was on before I used qemu-img to convert it to qcow2 on
top of this Gluster volume.  I'm going to recreate the VM from
scratch to verify that it's a lower level disk image problem and
not a Gluster problem at this point as the other VMs have been
behaving okay.
	Thanks for reading!
-- 
Mark Nipper
nipsy at bitgnome.net (XMPP)
+1 979 575 3193
-
"All existence is conditioned."
 -- Shakyamuni Buddha
-------------- next part --------------
#!/bin/bash
# This Nagios script was written against version 3.3 of Gluster.  Older
# versions will most likely not work at all with this monitoring script.
#
# Gluster currently requires elevated permissions to do anything.  In order to
# accommodate this, you need to allow your Nagios user some additional
# permissions via sudo.  The line you want to add will look something like the
# following in /etc/sudoers (or something equivalent):
#
# Defaults:nagios !requiretty
# nagios ALL=(root) NOPASSWD:/usr/sbin/gluster peer status,/usr/sbin/gluster volume list,/usr/sbin/gluster volume heal [[\:graph\:]]* info
#
# That should give us all the access we need to check the status of any
# currently defined peers and volumes.
# define some variables
ME=$(basename -- "$0")
SUDO="/usr/bin/sudo"
PIDOF="/sbin/pidof"
GLUSTER="/usr/sbin/gluster"
PEERSTATUS="peer status"
VOLLIST="volume list"
VOLHEAL1="volume heal"
VOLHEAL2="info"
peererror=
volerror=
# check for commands
for cmd in $SUDO $PIDOF $GLUSTER; do
	if [ ! -x "$cmd" ]; then
		echo "$ME UNKNOWN - $cmd not found"
		exit 3
	fi
done
# check for glusterd (management daemon)
if ! $PIDOF glusterd &>/dev/null; then
	echo "$ME CRITICAL - glusterd management daemon not running"
	exit 2
fi
# check for glusterfsd (brick daemon)
if ! $PIDOF glusterfsd &>/dev/null; then
	echo "$ME CRITICAL - glusterfsd brick daemon not running"
	exit 2
fi
# get peer status
peerstatus="peers: "
for peer in $($SUDO $GLUSTER $PEERSTATUS | grep '^Hostname: ' | awk '{print $2}'); do
	state=
	state=$($SUDO $GLUSTER $PEERSTATUS | grep -A 2 "^Hostname: $peer$" | grep '^State: ' | sed -nre 's/.* \(([[:graph:]]+)\)$/\1/p')
	if [ "$state" != "Connected" ]; then
		peererror=1
	fi
	peerstatus+="$peer/$state "
done
# get volume status
volstatus="volumes: "
for vol in $($SUDO $GLUSTER $VOLLIST); do
	thisvolerror=0
	entries=
	for entries in $($SUDO $GLUSTER $VOLHEAL1 $vol $VOLHEAL2 | grep '^Number of entries: ' | awk '{print $4}'); do
		if [ "$entries" -gt 0 ]; then
			volerror=1
			thisvolerror=$((thisvolerror + entries))
		fi
	done
	volstatus+="$vol/$thisvolerror unsynchronized entries "
done
# drop extra space
peerstatus=${peerstatus:0:${#peerstatus}-1}
volstatus=${volstatus:0:${#volstatus}-1}
# set status according to whether any errors occurred
if [ "$peererror" ] || [ "$volerror" ]; then
	status="CRITICAL"
else
	status="OK"
fi
# actual Nagios output
echo "$ME $status $peerstatus $volstatus"
# exit with appropriate value
if [ "$peererror" ] || [ "$volerror" ]; then
	exit 2
else
	exit 0
fi