John Gardeniers
2015-Jul-23 23:45 UTC
[Gluster-users] Heal-failed - what does it really tell us?
We have a replica 2 volume where the second node was freshly added about a week ago and, as far as I can tell, is fully replicated. This is storage for a RHEV cluster and the total space currently in use is about 3.5TB. When I run "gluster v heal gluster-rhev info heal-failed" it currently lists 866 files on the original node and 1 file on the recently added node. What I find most interesting is that the single file listed on the second node is a lease file belonging to a VM template.

Some obvious questions come to mind: What is that output supposed to mean? Does it in fact have a useful meaning at all? How can the files be in a heal-failed condition and not also be in a split-brain condition?

My interpretation of "heal-failed" is that the listed files are not yet fully in sync across nodes (and are therefore, by definition, in a split-brain condition), but that doesn't match the output of the command. That can't be the gluster interpretation either, because how could a template file which has received no reads or writes possibly be in a heal-failed condition a week after the initial volume heal?
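For anyone wanting to dig into this, a minimal way to compare the heal-failed list against gluster's own split-brain view, and to check the copies directly on the bricks, would be something like the following (the brick path is only an example, substitute your own):

    # entries gluster still wants to heal, and entries it considers split-brain
    gluster volume heal gluster-rhev info
    gluster volume heal gluster-rhev info split-brain

    # on each brick, inspect the AFR changelog xattrs for a suspect file
    getfattr -d -m . -e hex /bricks/gluster-rhev/path/to/template.lease

If the trusted.afr.* counters are all zeros on both bricks, the two copies agree and the heal-failed entry is just stale history; if both bricks show non-zero counters blaming each other, it is a genuine split-brain.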
prmarino1 at gmail.com
2015-Jul-24 00:51 UTC
[Gluster-users] Heal-failed - what does it really tell us?
You had a split-brain at one point. RHEV adds an interesting dimension to this. I have run into it before; it probably happened during an update to the gluster servers or a sequential restart of the gluster processes or servers.

First, there is a nasty cron daily job, created by a package included in the Red Hat base group, that runs a yum update every day. This is one of the many reasons why my production kickstarts are always nobase installs.

The big reason this happens with RHEV is that a node is rebooted, or the gluster server processes are restarted, and another node in a 2-brick cluster has the same thing happen too quickly afterwards. Essentially, while a self-heal operation is in progress, the second node, which is the master source, goes offline, and instead of fencing the volume the client fails over to the incomplete copy. The result is actually a split-brain, but the funny thing when you add RHEV to the mix is that everything keeps working, so unless you are using a tool like Splunk or a properly configured logwatch cron job on your syslog server, you never know anything is wrong until you restart gluster on one of the servers. So you did have a split-brain; you just didn't know it.

The easiest way to prevent this is to have a 3-replica brick structure on your volumes and to have tighter controls on when reboots, process restarts, and updates happen.
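As a rough sketch of that advice (the package name, hostname, and brick path below are assumptions, adjust them to your environment), the steps would look something like:

    # stop the automatic daily yum update if the yum-cron package is installed
    yum remove yum-cron            # or: chkconfig yum-cron off

    # enforce client quorum so a lone brick cannot keep accepting writes
    gluster volume set gluster-rhev cluster.quorum-type auto

    # grow the volume from replica 2 to replica 3 with a third brick,
    # then trigger a full heal
    gluster volume add-brick gluster-rhev replica 3 server3:/bricks/gluster-rhev
    gluster volume heal gluster-rhev full

With three replicas and quorum enabled, a single brick going away in the middle of a heal can no longer leave the clients writing to an incomplete copy.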