Jonathan Demers
2011-Mar-25 19:35 UTC
[Gluster-users] GlusterFS replication hangs (deadlock?) when 2 nodes attempts to create/delete the same file at the same time
Hi guys,

We have set up GlusterFS replication (mirror) with 2 nodes. Each node has the server process and the client process running. We have stripped down the configuration to the minimum.

Client configuration (same for both nodes):

    volume remote1
      type protocol/client
      option transport-type tcp
      option remote-host glusterfs1
      option remote-subvolume brick
    end-volume

    volume remote2
      type protocol/client
      option transport-type tcp
      option remote-host glusterfs2
      option remote-subvolume brick
    end-volume

    volume replicate
      type cluster/replicate
      subvolumes remote1 remote2
    end-volume

Server configuration (same for both nodes):

    volume storage
      type storage/posix
      option directory /storage
    end-volume

    volume brick
      type features/locks
      subvolumes storage
    end-volume

    volume server
      type protocol/server
      option transport-type tcp
      option auth.addr.brick.allow XXX.*
      subvolumes brick
    end-volume

We start everything up, with the GlusterFS client mounted on /mnt/gluster. We can see that replication works fine: we can create a file in /mnt/gluster on one node and we see it appear in /mnt/gluster on the other node. We also see the file appear in /storage on both nodes.

However, if we go on *both* nodes and run the following script in /mnt/gluster:

    while true; do touch foo; rm foo; done

GlusterFS just hangs. Every call on /mnt/gluster hangs as well, on both nodes. Even "ls -l /mnt/gluster" hangs. However, the storage filesystem is just fine: we can do "ls -l /storage" and we see the file "foo". "ps" shows that the script on each node is stuck in "rm foo". We cannot stop the script, even with Ctrl-C or "kill -9".

After 30 minutes (the default frame-timeout), GlusterFS unlocks, but the cluster is broken after that: file sharing does not even work (we create a file on one node and cannot see it on the other node). We can manually restart the GlusterFS servers and clients, and everything is fine after that.

We can reproduce the problem very easily with this simple script.
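For anyone who wants to try the reproduction without leaving an endless loop running, here is a minimal sketch of a bounded variant of the same touch/rm loop. The function name `churn` and its arguments are made up for illustration and are not part of GlusterFS. It also uses `rm -f` instead of plain `rm`, an assumption made so that losing the race to the other node (the file is already gone) is not counted as a failure.

```shell
#!/bin/sh
# churn DIR COUNT [NAME]: create and delete NAME inside DIR, COUNT times.
# Hypothetical helper; NAME defaults to "foo" as in the original loop.
churn() {
    dir=$1
    count=$2
    name=${3:-foo}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop at the first operation that reports an error.
        touch "$dir/$name" || return 1
        # -f: do not fail if the other node already removed the file.
        rm -f "$dir/$name" || return 1
        i=$((i + 1))
    done
    echo "completed $i iterations in $dir"
}
```

Running something like `churn /mnt/gluster 1000` on both nodes at the same time should exercise the same create/delete race. If the mount hangs as described, the loop will still block inside `touch` or `rm`; the bound only guarantees that it ends on a healthy mount.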
GlusterFS looked very promising and we planned to use it in our new HA architecture, but the fact that a simple sequence of standard commands can lock up the whole system is a big showstopper for us.

Has anyone experienced this problem before? Is there a way to fix it (with configuration or otherwise)?

Many thanks,
Jonathan
Jonathan Demers
2011-Mar-28 17:35 UTC
[Gluster-users] GlusterFS replication hangs (deadlock?) when 2 nodes attempts to create/delete the same file at the same time
Hi guys,

We have set up GlusterFS replication (mirror) with 2 nodes (latest version 3.1.3). Each node has the server process and the client process running. We have stripped down the configuration to the minimum.

Client configuration (same for both nodes):

    volume remote1
      type protocol/client
      option transport-type tcp
      option remote-host glusterfs1
      option remote-subvolume brick
    end-volume

    volume remote2
      type protocol/client
      option transport-type tcp
      option remote-host glusterfs2
      option remote-subvolume brick
    end-volume

    volume replicate
      type cluster/replicate
      subvolumes remote1 remote2
    end-volume

Server configuration (same for both nodes):

    volume storage
      type storage/posix
      option directory /storage
    end-volume

    volume brick
      type features/locks
      subvolumes storage
    end-volume

    volume server
      type protocol/server
      option transport-type tcp
      option auth.addr.brick.allow XXX.*
      subvolumes brick
    end-volume

We start everything up, with the GlusterFS client mounted on /mnt/gluster. We can see that replication works fine: we can create a file in /mnt/gluster on one node and we see it appear in /mnt/gluster on the other node. We also see the file appear in /storage on both nodes.

However, if we go on *both* nodes and run the following script in /mnt/gluster:

    while true; do touch foo; rm foo; done

GlusterFS just hangs. Every call on /mnt/gluster hangs as well, on both nodes. Even "ls -l /mnt/gluster" hangs. However, the storage filesystem is just fine: we can do "ls -l /storage" and we see the file "foo". "ps" shows that the script on each node is stuck in "rm foo". We cannot stop the script, even with Ctrl-C or "kill -9".

After 30 minutes (the default frame-timeout), GlusterFS unlocks, but the cluster is broken after that: file sharing does not even work (we create a file on one node and cannot see it on the other node). We can manually restart the GlusterFS servers and clients, and everything is fine after that.
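Since every call on the mount hangs and cannot be killed, checking the mount from an interactive shell ties up that shell too. A minimal sketch of a non-blocking check follows, assuming a POSIX shell. The function name `probe` and the 5-second default are made up for illustration. It starts `ls -l` in the background and polls with `kill -0` rather than waiting on the process, so the caller gets an answer even if the listing never returns.

```shell
#!/bin/sh
# probe DIR [SECONDS]: print "ok" if listing DIR finishes within SECONDS
# (default 5), otherwise print "hung" and return 1. Hypothetical helper.
probe() {
    dir=$1
    limit=${2:-5}
    # Run the listing in the background so a stuck call cannot block us.
    ls -l "$dir" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "hung"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "ok"
}
```

For example, `probe /mnt/gluster 10` could be run next to `probe /storage 10` to compare the mount with the backing directory. On a hung mount the background `ls` is left behind and stays stuck until GlusterFS releases it.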
We can reproduce the problem very easily with this simple script.

GlusterFS looked very promising and we planned to use it in our new HA architecture, but the fact that a simple sequence of standard commands can lock up the whole system is a big showstopper for us.

Has anyone experienced this problem before? Is there a way to fix it (with configuration or otherwise)?

Many thanks,
Jonathan