I do not mean to be argumentative, but I have to admit a little frustration with Gluster. I know an enormous amount of effort has gone into this product, and I just can't believe that with all the effort behind it and so many people using it, it could be so fragile.

So here goes. Perhaps someone here can point to the error of my ways. I really want this to work because it would be ideal for our environment, but ...

Please note that all of the nodes below are OpenVZ nodes with nfs/nfsd/fuse modules loaded on the hosts.

After spending months trying to get 3.2.5 and 3.2.6 working in a production environment, I gave up on Gluster and went with a Linux-HA/NFS cluster which just works. The problems I had with gluster were strange lock-ups, split brains, and too many instances where the whole cluster was off-line until I reloaded the data.

So with the release of 3.3, I decided to give it another try. I created one replicated volume on my two NFS servers.

I then mounted the volume on a client as follows:

    10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0

I threw some data at it:

    find / -mount -print | cpio -pvdum /pub2/test

Within 10 seconds it locked up solid. No error messages on any of the servers, the client was unresponsive, and load on the client was 15+. I restarted glusterd on both of my NFS servers, and the client remained locked. Finally I killed the cpio process on the client. When I started another cpio, it ran further than before, but now the logs on my NFS/Gluster server say:

    [2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 0-pub2-replicate-0: No sources for dir of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry self-heal, continuing with the rest of the self-heals
    [2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split brain found, aborting selfheal of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
    [2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0: background data gfid self-heal failed on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure

This still seems to be an INCREDIBLY fragile system. Why would it lock solid while copying a large file? Why no errors in the logs?

Am I the only one seeing this kind of behavior?

sean

--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
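As a first diagnostic, the split-brain state reported in those logs can be listed with the heal CLI that shipped in 3.3. A minimal sketch, assuming the volume is named pub2 as in the fstab line above:

    # Run on either server (volume name pub2 taken from the fstab line):
    gluster volume heal pub2 info                # entries pending self-heal
    gluster volume heal pub2 info split-brain    # entries flagged split-brain
    gluster volume heal pub2 info heal-failed    # entries where self-heal failed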
A little more information on this, which makes it more puzzling:

1) The split-brain message is strange because there are only two server nodes and 1 client node which has mounted the volume via NFS on a floating IP. This was done to guarantee that only one node gets written to at any point in time, so there is zero chance that two nodes were updated simultaneously.

2) I re-ran the cpio command to put a file tree on the gluster volume, then made a tar of the file tree. The tree was not being written to by anyone, and yet every 5 to 10 files tar would report "file changed as I read it." At first I thought there was some sort of healing operation going on, but since I was only writing to one node at a time, then making the backup, I don't see how this was possible.

I've checked the network, resources, etc., and there are no issues there: no packet loss, and all machines share the same time via NTP. The OS is SL 6.1.

So this is all very strange behavior.

sean

On 06/16/2012 01:48 PM, Sean Fulton wrote:
> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has gone
> into this product, and I just can't believe that with all the effort
> behind it and so many people using it, it could be so fragile. [...]
--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
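One way to tell whether healing really was in flight during that tar is to look at the AFR changelog xattrs directly on each brick. A minimal sketch, with /export/pub2 as a placeholder for the actual brick path on each server:

    # On each server, dump the AFR changelog xattrs for a file tar
    # complained about (/export/pub2 is a placeholder brick path):
    getfattr -d -m trusted.afr -e hex /export/pub2/test/log/secure

    # Non-zero trusted.afr.pub2-client-* values on one brick mean writes
    # are pending heal toward the other replica; non-zero on both bricks,
    # each blaming the other, is the classic split-brain signature.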
Was there anything in dmesg on the servers?

If you are able to reproduce the hang, can you get the output of 'gluster volume status <name> callpool' and 'gluster volume status <name> nfs callpool'?

How big is the 'log/secure' file? Is it so large that the client was just busy writing it for a very long time? Are there any signs of disconnections or ping timeouts in the logs?

Avati

On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <sean at gcnpublishing.com> wrote:
> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has gone
> into this product, and I just can't believe that with all the effort
> behind it and so many people using it, it could be so fragile. [...]
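For reference, gathering the diagnostics Avati asks for during a hang would look roughly like this (volume name pub2 assumed from the original post):

    # Run on a server while the client is hung:
    dmesg | tail -n 50                       # kernel-side errors on the server
    gluster volume status pub2 callpool      # pending call frames on the bricks
    gluster volume status pub2 nfs callpool  # pending call frames in the NFS server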
On Sat, Jun 16, 2012 at 04:47:51PM -0400, Sean Fulton wrote:
> 1) The split-brain message is strange because there are only two
> server nodes and 1 client node which has mounted the volume via NFS
> on a floating IP. This was done to guarantee that only one node gets
> written to at any point in time, so there is zero chance that two
> nodes were updated simultaneously.

Are you using a distributed volume, or a replicated volume? Writes to a replicated volume go to both nodes.

> [586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
> [586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [586898.273295] flush-0:45 D ffff8806037592d0 0 633954 20 0x00000000
> [586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
> [586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
> [586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
> [586898.273326] Call Trace:
> [586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
> [586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
> [586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20

Are you using XFS by any chance? I started with XFS, because that was what the gluster docs recommend, but eventually gave up on it. I can replicate that sort of kernel lockup on a 24-disk MD array within a short space of time, without gluster, just by throwing four bonnie++ processes at it. The same tests run with either ext4 or btrfs do not hang, at least not during two days of continuous testing.

Of course, any kernel problem cannot be the fault of glusterfs, since glusterfs runs entirely in userland.

Regards,

Brian.
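For what it's worth, a minimal sketch of that kind of reproduction, with /mnt/md0 as a placeholder for the array's mount point (the directory must be writable by the user passed to -u):

    # Four parallel bonnie++ runs against the filesystem, no gluster involved
    # (/mnt/md0 is a placeholder; -u sets the user bonnie++ runs as):
    for i in 1 2 3 4; do
        bonnie++ -d /mnt/md0 -u nobody &
    done
    wait
    # Watch dmesg for "blocked for more than 120 seconds" hung-task
    # warnings like the trace quoted above.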
BTW, the thing which is unusual about your configuration is the HA setup. Are you completely sure that the HA IP has not been moving between the nodes? What if you point the NFS client at one server's fixed IP address instead of the HA address? I can imagine that the HA IP moving would cause the split-brain situations you describe.
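A quick way to run that test is to reuse the fstab line from the original post with one server's fixed address substituted for the floating IP (10.10.10.5 below is a placeholder for a real server address):

    # Same mount options as before, but pointed at a fixed server IP
    # instead of the HA address:
    10.10.10.5:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0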
On Sat, 2012-06-16 at 13:48 -0400, Sean Fulton wrote:
> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has gone
> into this product, and I just can't believe that with all the effort
> behind it and so many people using it, it could be so fragile.

Often it's not individual pieces that are fragile but combinations of pieces. For example, two possible interactions might be involved for you:

(1) There are known problems with the interaction between FUSE and transparent hugepages (https://bugzilla.redhat.com/show_bug.cgi?id=764964). This could cause one or more of your server processes to lock up.

(2) There are known problems with OpenVZ and our use of the "trusted" extended-attribute namespace (http://forum.openvz.org/index.php?t=msg&goto=35230&). This should result in a clean failure, but it's possible that it's leading to problems tracking which replicas need updates instead.

If you're still having problems with the workarounds for those two issues, please let us know.
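For (1), a minimal sketch of the transparent-hugepage workaround, assuming a RHEL/SL 6-era system (the sysfs path differs between mainline and Red Hat kernels, so check which one exists on your hosts):

    # Mainline kernels:
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # RHEL/SL 6 kernels use a different path:
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
    # To persist across reboots, add transparent_hugepage=never to the
    # kernel command line in grub.conf.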