Hi,

I have a volume made of 12 bricks with 3x replication (no striping). We had to take one server down for maintenance (2 bricks per server, but the bricks are ordered so that the first brick of every server comes before the second brick of every server, so no server should appear more than once in any replica group). The server was down for 40 minutes, and after it came back up "gluster volume heal home0 info" listed some files. I started healing, but after 3 days the list is still the same. Today I also enabled quorum enforcement to make sure we don't get into split brain in the future; as we have 3 replicas, 2 should make quorum.

Anyway, the healing information is attached to this e-mail, gathered with:

    [root@se1 ~]# for i in "" heal-failed split-brain; do gluster volume heal home0 info $i > home-heal-$i.txt 2>&1; done

(Attachments scrubbed by the list archive:
    home-heal-.txt             <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20121126/9c98d512/attachment.txt>
    home-heal-heal-failed.txt  <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20121126/9c98d512/attachment-0001.txt>
    home-heal-split-brain.txt  <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20121126/9c98d512/attachment-0002.txt>)

Any ideas how to fix this?

Mario Kadastik, PhD
Researcher

---
"Physics is like sex, sure it may have practical reasons, but that's not why we do it"
  -- Richard P. Feynman
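For reference, a rough sketch of the commands behind the steps described above (enabling quorum, triggering a heal, and gathering the heal reports). The exact quorum options used were not included in the mail, so the "cluster.quorum-type auto" setting shown here is an assumption, though it is the usual choice for a replica-3 volume:

    # Assumed quorum setting: "auto" requires a majority (2 of 3) of each
    # replica set to be up before writes are allowed on that set.
    gluster volume set home0 cluster.quorum-type auto

    # Trigger self-heal on files that need it, or a full sweep of the volume.
    gluster volume heal home0
    gluster volume heal home0 full

    # Gather the three heal reports, as in the command above.
    for i in "" heal-failed split-brain; do
        gluster volume heal home0 info $i > home-heal-$i.txt 2>&1
    done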
On 11/26/2012 05:26 AM, Mario Kadastik wrote:
> Hi,
>
> I have a volume made of 12 bricks with 3x replication (no striping). We had to take one server down for maintenance (2 bricks per server, but the bricks are ordered so that the first brick of every server comes before the second brick of every server, so no server should appear more than once in any replica group). The server was down for 40 minutes, and after it came back up "gluster volume heal home0 info" listed some files. I started healing, but after 3 days the list is still the same. Today I also enabled quorum enforcement to make sure we don't get into split brain in the future; as we have 3 replicas, 2 should make quorum.
>
> Anyway, the healing information is attached to this e-mail, gathered with:
>
>     [root@se1 ~]# for i in "" heal-failed split-brain; do gluster volume heal home0 info $i > home-heal-$i.txt 2>&1; done

For some of the files where healing failed, check the extended attributes on each replica. For example:

    getfattr -d -e hex -m . .../res/out_files_485.tgz

Also, check the logs in /var/log/glusterfs to see if they give any indication of why self-heal is failing. In my experience, the most common cause of such failures is a GFID mismatch, which is really a form of split brain but is not recognized or handled as such (which is why it doesn't show up in the split-brain report). These can occur, for example, if a file is created separately on two bricks because of a network partition, or because two servers were down at different times.
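A minimal sketch of how such a GFID mismatch could be checked and cleared by hand, assuming the brick is mounted at /bricks/brick1 on each server; the brick path and the GFID value below are placeholders, not taken from the attached reports:

    # On every server holding a replica, compare the trusted.gfid and the
    # trusted.afr.* changelog attributes of the affected file:
    getfattr -d -e hex -m . /bricks/brick1/res/out_files_485.tgz

    # If trusted.gfid differs between replicas, pick the copy to discard and
    # remove both the file and its hard link under .glusterfs on that brick.
    # The hard link lives at .glusterfs/<first byte>/<second byte>/<full GFID>,
    # e.g. for a (hypothetical) GFID starting with d0f7:
    rm /bricks/brick1/res/out_files_485.tgz
    rm /bricks/brick1/.glusterfs/d0/f7/d0f7xxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

    # Then trigger self-heal so the file is recreated from a good replica:
    gluster volume heal home0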