Hi,
I have 2 machines running a simple replicate volume to provide highly
available storage for KVM virtual machines.
As soon as auto-healing starts, GlusterFS blocks the VM's storage
access (writes appear to be the trigger), leaving the whole virtual
machine hanging.
I can reproduce this bug on both ext3 and ext4 filesystems, on physical
machines as well as on VMs.
Any help would be appreciated; because of this problem we currently
have to run the VMs without GlusterFS :-(
More on my config:
* Ubuntu 10.04 Server 64-bit
* Kernel 2.6.32-21-server
* FUSE 2.8.1
* GlusterFS v3.0.2
How to reproduce (a rough command-line sketch follows the list):
* 2 nodes running GlusterFS replicate
* Start a KVM virtual machine with its disk file on GlusterFS
* Stop glusterfsd on one node
* Make changes to the disk file
* Bring glusterfsd back online; auto-healing starts (replicate: no
missing files - /image.raw. proceeding to metadata check)
* As soon as the VM starts writing data, it is blocked until
auto-healing finishes, making it completely unresponsive
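Roughly, the commands involved (a sketch only; the VM options and the
way glusterfsd is stopped/started are just examples from my setup, and
I assume the server volfile lives at /etc/glusterfs/glusterfsd.vol):
On node 1:
  # start the VM with its disk image on the gluster mount
  kvm -hda /mnt/glusterfs/image.raw -m 1024 &
On node 2:
  # take the second replica offline while the VM keeps writing
  killall glusterfsd
  # ...let the copies diverge for a while...
  # bring the brick back; self-heal starts on the next access to the file
  glusterfsd -f /etc/glusterfs/glusterfsd.vol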
Message from the kernel (printed several times while healing):
INFO: task kvm:7774 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kvm D 00000000ffffffff 0 7774 1 0x00000000
ffff8801adcd9e48 0000000000000082 0000000000015bc0 0000000000015bc0
ffff880308d9df80 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9dbc0
0000000000015bc0 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9df80
Call Trace:
[<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
[<ffffffff8153f75b>] mutex_lock+0x2b/0x50
[<ffffffff8123a1d1>] fuse_file_llseek+0x41/0xe0
[<ffffffff8114238a>] vfs_llseek+0x3a/0x40
[<ffffffff81142fd6>] sys_lseek+0x66/0x80
[<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
Gluster Configuration:
### glusterfsd.vol ###
volume posix
  type storage/posix
  option directory /data/export
end-volume
volume locks
  type features/locks
  subvolumes posix
end-volume
volume brick
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume
volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.nodelay on
  option transport.socket.bind-address 192.168.158.141
  option auth.addr.brick.allow 192.168.158.*
  subvolumes brick
end-volume
### glusterfs.vol ###
volume gluster1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.141
  option remote-subvolume brick
end-volume
volume gluster2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.142
  option remote-subvolume brick
end-volume
volume replicate
  type cluster/replicate
  subvolumes gluster1 gluster2
end-volume
### fstab ###
/etc/glusterfs/glusterfs.vol /mnt/glusterfs glusterfs log-level=DEBUG,direct-io-mode=disable 0 0
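For reference, this should be equivalent to mounting by hand with
something like (assuming the mount.glusterfs helper shipped with the
3.0.x packages):
  mount -t glusterfs -o log-level=DEBUG,direct-io-mode=disable \
    /etc/glusterfs/glusterfs.vol /mnt/glusterfs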
I read that you wanted users to send the glusterfs process a kill -11
for more debug info, so here it is.
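What I ran on the client node was simply (assuming a single glusterfs
client process on the box):
  kill -11 $(pidof glusterfs)
The dump from the client log: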
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: v3.0.2
signal received: 11
time of crash: 2010-09-28 11:14:31
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.2
/lib/libc.so.6(+0x33af0)[0x7f0c6bf0eaf0]
/lib/libc.so.6(epoll_wait+0x33)[0x7f0c6bfc1c93]
/usr/lib/libglusterfs.so.0(+0x2e261)[0x7f0c6c6ac261]
glusterfs(main+0x852)[0x4044f2]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0c6bef9c4d]
glusterfs[0x402ab9]
---------