Joe Landman
2012-Jan-18 18:14 UTC
[Gluster-users] A "cool" failure mode which gives no real useful information at debug level ... and how to fix
Ok, so there you are with a gluster file system that just had a RAID issue on the backing store. You fixed the backing store, rebooted, and are now trying to bring the gluster daemon up. And it doesn't come up.

Ok, no problem. Run it by hand, turn on debugging, turn off backgrounding, and capture all the output:

strace /opt/glusterfs/3.2.5/sbin/glusterd --no-daemon \
    --log-level=DEBUG > out 2>&1 &

Then, looking at the out file, near the end we see ...

writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"(", 1}, {"glusterfs_volumes_init", 22}, {"+0x", 3}, {"18b", 3}, {")", 1}, {"[0x", 3}, {"4045ab", 6}, {"]\n", 2}], 9) = 75
writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"(", 1}, {"main", 4}, {"+0x", 3}, {"448", 3}, {")", 1}, {"[0x", 3}, {"405658", 6}, {"]\n", 2}], 9) = 57
writev(4, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, {"+0x", 3}, {"f4", 2}, {")", 1}, {"[0x", 3}, {"3f6961d994", 10}, {"]\n", 2}], 9) = 55
writev(4, [{"/opt/glusterfs/3.2.5/sbin/gluste"..., 34}, {"[0x", 3}, {"403739", 6}, {"]\n", 2}], 4) = 45
write(4, "---------\n", 10) = 10
rt_sigaction(SIGSEGV, {SIG_DFL, [SEGV], SA_RESTORER|SA_RESTART, 0x3f696302d0}, {0x7fb75cf66de0, [SEGV], SA_RESTORER|SA_RESTART, 0x3f696302d0}, 8) = 0
tgkill(6353, 6353, SIGSEGV) = 0
rt_sigreturn(0x18d1) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++

... er ... uh ... ok ... Not so helpful. All we know is that we have a SEGV, which usually happens when a program starts stamping on memory (or similar things) it's really not supposed to touch. I sanity-checked the binary: same md5 signature as the working binaries on its peers.

Ok ... short of running a debug build and getting a stack trace at the SEGV, I opted for the slightly simpler approach, one that didn't force me to recompile anything. I assumed that for some reason, even though there had been no crash, the rootfs had somehow been corrupted. Ok, this was a WAG. Bear with me. It turns out to have been right.

Did a

tune2fs -c 20 /dev/md0
tune2fs -C 21 /dev/md0

and rebooted. This forces a file system check (fsck) at the next boot on the file system attached to /dev/md0 ("/" in this case): with the current mount count (21) set above the maximum mount count (20), fsck has to run before the next mount. Basically I was trying to take all the variables off the table with this. Sure enough, on restart, glusterd fired up correctly with no problem.

From this I surmise that glusterd was trying to work with an area of the fs that was broken. Not gluster itself being broken, but the root file system that gluster uses for pipes/files/etc. The failure cascaded, and glusterd not lighting up was merely a symptom of something else.

Just an FYI for people working on debugging this sort of thing. For future reference, I think we may tweak the configure process to make sure we build glusterfs with all debugging options on, just in case I need to run it in a debugger. We will see if this materially impacts performance.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
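A side note on reading that trace: the writev() calls are glusterd writing its own backtrace into the log file descriptor, and the bracketed hex strings (0x4045ab, 0x405658, 0x403739) are the frame addresses. If the binary carries symbols, they can be resolved to functions and source lines with addr2line. A quick sketch, using the addresses from the trace above:

# Resolve the backtrace frame addresses from the strace output above.
# -f prints the function name along with file:line; useful output
# requires a binary built with debug symbols (-g).
addr2line -f -e /opt/glusterfs/3.2.5/sbin/glusterd 0x4045ab 0x405658 0x403739

With a stripped or non-debug build, most of these come back as "??", which is part of the argument for the debug build mentioned above.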
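On the "all debugging options on" point, a minimal sketch using generic autotools mechanisms (glusterfs builds with autotools; the prefix below just matches the install path used above, and the flags are one reasonable choice rather than any project-specific debug switch):

# Configure a symbols-on, optimization-off build for use under gdb.
./configure --prefix=/opt/glusterfs/3.2.5 CFLAGS="-ggdb -O0"
make && make install

-O0 keeps the compiler from optimizing away variables and inlining stack frames, at some runtime cost, which is presumably the performance impact to be measured.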
Daniel Taylor
2012-Jan-18 18:29 UTC
[Gluster-users] A "cool" failure mode which gives no real useful information at debug level ... and how to fix
Thanks. We saw something very similar with root filesystem damage on one of our nodes locking access to the clusters it was a member of. Better logging wouldn't have helped there, since the damage was clobbering the glusterd logfile itself, but it does make me wonder whether it's possible to get smarter error messages for host filesystem access issues?

--
Daniel Taylor
VP Operations
Vocal Laboratories, Inc
dtaylor at vocalabs.com
952-941-6580 x203
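One shape such a check could take: a hypothetical pre-flight test run before glusterd starts, probing the daemon's working directory on the host filesystem and failing loudly with a specific message instead of a bare SEGV or a lost log line. (The /etc/glusterd path below is an assumption based on the 3.2-era working directory; adjust for your install.)

#!/bin/sh
# Hypothetical pre-flight check before starting glusterd: verify that
# the daemon's working directory on the host filesystem is usable.
WORKDIR=/etc/glusterd    # assumption: 3.2-era working directory
probe="$WORKDIR/.fs_probe.$$"
if ! touch "$probe" 2>/dev/null || ! rm -f "$probe" 2>/dev/null; then
    echo "glusterd pre-flight: cannot write to $WORKDIR;" \
         "check (fsck) the host filesystem before starting" >&2
    exit 1
fi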