Simon Detheridge
2009-May-13 14:20 UTC
[Gluster-users] Glusterfs-2 locks/hangs on EC2/vtun setup
Hi all,

I'm trying to get a glusterfs cluster working inside Amazon's EC2, using the official Ubuntu 8.10 images. I've compiled glusterfs-2, but I'm using the in-kernel fuse module, as the instances run 2.6.27-3 and the fuse module from glusterfs won't compile against something that recent.

For my test setup I'm trying to get AFR working across two servers, with a third server acting as a client. All 3 servers have the glusterfs volume mounted. After a few hours, the mounted volume locks up on all three servers: ls'ing a directory on the volume, or typing "df -h", hangs and can't even be killed with kill -9. I have to "umount --force" the glusterfs volume to get ls or df to terminate.

The instances communicate with each other over vtun-based tunnels, which I've set up to provide predictable IP addressing between the nodes (the IPs assigned by Amazon are random).

The logs don't show anything useful. The last thing they mention is the handshake that took place a few hours ago. I had disabled the performance translators on the clients but forgot to do so on the server, so I'm currently running the test again with io-threads disabled on the server, and with "mount -o log-level=DEBUG" on the client.

The volumes are not under heavy load at all when they fail. All that's happening is that a script runs every 30 seconds on the client that isn't acting as a storage node, and does the following (see the sketch at the end of this mail):

* Writes a random value to a randomly-named file on the locally mounted volume
* Connects via SSH to one of the storage nodes and reads the file from the volume mounted there
* Complains if the contents of the file are different
* Removes the file
* Repeats for the other node

In order to remount the volume after a failure, I have to "umount --force" and then manually kill the glusterfs process; otherwise the connection just hangs again as soon as I remount.

On each storage node, my glusterfs-client.vol looks like this:

#------------------
volume web_remote_1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.10
  option remote-subvolume web_brick
end-volume

volume web_remote_2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.11
  option remote-subvolume web_brick
end-volume

volume web_replicate
  type cluster/replicate
  subvolumes web_remote_1 web_remote_2
end-volume
#------------------

On the servers, my glusterfs-server.vol looks like this:

#------------------
volume web
  type storage/posix
  option directory /var/glusterfs/web
end-volume

volume web_locks
  type features/locks
  subvolumes web
end-volume

volume web_brick
  type performance/io-threads
  option autoscaling on
  subvolumes web_locks
end-volume

volume web_server
  type protocol/server
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client-web.vol
  subvolumes web_brick
  option auth.addr.web_brick.allow *
end-volume
#------------------

Does anyone have any ideas why this happens?

Thanks,
Simon

--
Simon Detheridge - CTO, Widgit Software
26 Queen Street, Cubbington, CV32 7NA - Tel: +44 (0)1926 333680
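
P.S. For reference, the check script is roughly equivalent to the sketch below. The mount point (/mnt/web) and the use of the tunnel IPs for SSH are placeholders for illustration, not the exact values in use:

#------------------
#!/usr/bin/env python
# Sketch of the 30-second consistency check described above.
# Assumes the volume is mounted at the same path on every node and
# that passwordless SSH to the storage nodes is set up.
import os
import random
import string
import subprocess
import sys

MOUNT = "/mnt/web"                              # locally mounted glusterfs volume (assumed path)
NODES = ["192.168.172.10", "192.168.172.11"]    # storage nodes, reached over the vtun tunnels

def check(node):
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(12))
    value = "".join(random.choice(string.ascii_lowercase) for _ in range(32))
    path = os.path.join(MOUNT, name)

    # Write a random value to a randomly-named file on the local mount
    with open(path, "w") as f:
        f.write(value)

    # Read the same file back through the storage node's local mount via SSH
    remote = subprocess.check_output(["ssh", node, "cat", path]).decode().strip()

    # Complain if the contents differ, then remove the file
    if remote != value:
        sys.stderr.write("mismatch on %s for %s\n" % (node, name))
    os.remove(path)

if __name__ == "__main__":
    for node in NODES:
        check(node)
#------------------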