Lindsay Mathieson
2016-Apr-24 01:12 UTC
[Gluster-users] File replicas out of sync/How to force heal on a file
I have two problems here. I have a replica 3 sharded volume, gluster 3.7.11, used for VM images. Yesterday I stopped the volume and ran md5sum on all the shards to compare the 3 replicas. All 15 VM images were identical except for one (vm-307). It has 2048 shards, of which 8 differed.

"volume heal info" lists *no* files needing to be healed.

Two things concern me:

1. How did this happen? Trust in gluster either keeping replicas in sync, or knowing when they are not, is crucial.

2. How do I force a heal of an individual file? I can find no documentation on this process, or even whether it is possible.

I do have one possible solution - delete the VM image and restore from backup. Not ideal.

Notes:

- I did have a hard disk failure on a brick while testing. ZFS recovered it with no errors.

- My testing was reasonably severe - server reboots and killing of the gluster processes. All things that will happen in a cluster's lifetime. I was pleased with how well gluster handled them.

--
Lindsay Mathieson
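The checksum comparison described above can be sketched as a small script. This is a hypothetical sketch, not a gluster tool: it assumes the per-brick .shard directories are all reachable from one node (e.g. via network mounts, with the volume stopped), and the paths and function names are illustrative.

```python
import hashlib
from pathlib import Path

def md5_of(path):
    """md5sum of one file, read in 1 MiB chunks so large shards don't fill RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def mismatched_shards(bricks):
    """Compare every shard on the first brick against the other replicas.

    Returns {shard_name: {brick_path: checksum}} for shards whose
    replicas do not all carry the same checksum.
    """
    bad = {}
    for shard in sorted(bricks[0].iterdir()):
        sums = {str(b): md5_of(b / shard.name)
                for b in bricks if (b / shard.name).exists()}
        if len(set(sums.values())) > 1:
            bad[shard.name] = sums
    return bad

# Hypothetical brick layout - one .shard directory per replica:
# bricks = [Path("/mnt/vna/.shard"), Path("/mnt/vnb/.shard"), Path("/mnt/vng/.shard")]
# for name, sums in mismatched_shards(bricks).items():
#     print(name, sums)
```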
Lindsay Mathieson
2016-Apr-24 06:13 UTC
[Gluster-users] File replicas out of sync/How to force heal on a file
On 24/04/2016 11:12 AM, Lindsay Mathieson wrote:
> Yesterday I stopped the volume and ran md5sum on all the shards to
> compare the 3 replicas. All 15 VM images were identical except for one
> (vm-307). It has 2048 shards, of which 8 differed.
>
> volume heal info lists *no* files needing to be healed.
> [...]

Duplicating from a separate msg how I resolved the immediate issue:

I used diff3 to compare the checksums of the shards, and it revealed that seven of the shards were the same on two bricks (vna & vng) and one of the shards was the same on two other bricks (vna & vnb). Fortunately none were different on all 3 bricks :)

Using the checksums as a quorum, I deleted all the singleton shards (7 on vnb, 1 on vng), touched the file owner, and issued a "heal full". All 8 shards were restored with checksums matching the other two bricks. A recheck of the entire set of shards for the VM showed all 3 copies as identical, and the VM itself is functioning normally.

It's one way to manually heal shard mismatches which gluster hasn't detected, if somewhat tedious. It's a method which lends itself to automation though.

--
Lindsay Mathieson
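The quorum-by-checksum step above - treat the checksum held by two of the three bricks as correct and flag the odd one out for deletion - can be sketched as follows. The brick names come from the message; the function name and shapes are mine, a sketch rather than anything gluster ships.

```python
from collections import Counter

def find_outliers(checksums):
    """Given a mapping of brick name -> checksum for one shard, return
    the bricks whose checksum disagrees with the majority (quorum) value.

    Returns [] when all replicas agree, and None when no two bricks
    agree at all - that case needs manual inspection, not deletion.
    """
    counts = Counter(checksums.values())
    majority, votes = counts.most_common(1)[0]
    if votes < 2:
        return None  # no quorum among the replicas
    return [brick for brick, c in checksums.items() if c != majority]

# Matching the scenario above: vna and vng agree, so vnb holds the
# singleton shard that should be deleted before running "heal full".
print(find_outliers({"vna": "abc123", "vnb": "ff9900", "vng": "abc123"}))  # ['vnb']
```

The deliberate design choice is returning None when all three copies differ: with no quorum, automatically deleting anything would destroy the only chance of recovering the right data.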