Pavel Cernohorsky
2016-Nov-23 11:26 UTC
[Gluster-users] Files won't heal, although no obvious problem visible
Hello, thanks for your reply, answers are in the text.

On 11/23/2016 11:55 AM, Ravishankar N wrote:
> On 11/23/2016 03:56 PM, Pavel Cernohorsky wrote:
>> The "hot-client-21" is, based on the vol-file, the following of the
>> bricks:
>> option remote-subvolume /opt/data/hdd5/gluster
>> option remote-host 10.10.27.11
>>
>> I have self healing daemon disabled, but when I try to trigger
>> healing manually (gluster volume heal <volname>), I get: "Launching
>> heal operation to perform index self heal on volume <volname> has
>> been unsuccessful on bricks that are down. Please check if all brick
>> processes are running.", although all the bricks are online (gluster
>> volume status <volname>).
>
> Can you enable the self-heal daemon and try again ? `gluster volume
> heal <volname>` requires the shd to be enabled. The error message that
> you get is inappropriate and is being fixed.

When I enabled the self-heal daemon, I was able to start healing, and the files were actually healed. What does the self-heal daemon do in addition to the automated healing that happens when you read the file?

The original reason for disabling the self-heal daemon was to be able to control the amount of resources used by healing, because "cluster.background-self-heal-count: 1" did not help very much and the amount of both network and disk I/O consumed was just extreme.

And I am also pretty sure we saw a similar problem (not sure about the attributes) before we disabled the shd.

>
>>
>> When I try to just md5sum the file, to trigger automated healing on
>> file manipulation, I get the result, but the file is not healed
>> anyway. This usually works when I do not get 3 entries for the same
>> file in the heal info.
>
> Is the file size for 99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4
> non-zero on the 2 data bricks (i.e. on 10.10.27.11 and 10.10.27.10)
> and do they match?
> Do the md5sums match with what you got on the mount when you calculate
> it directly on these bricks?

The file has non-zero size on both data bricks, and the md5 sum was the same on both of them before they were healed. After the healing (enabling the shd and starting the heal), the md5 did not change on either of the data bricks. The mount point reports the same md5 as all the other attempts made directly on the bricks. So what is actually happening there? Why was the file blamed (not unblamed after healing?)?

Thanks for your answers,
Pavel
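
The size and checksum comparison discussed above can be scripted. What follows is only a minimal sketch, not a Gluster tool: the helper names `md5_of` and `compare_copies` are made up for illustration, and it assumes every copy of the file (the one on the mount and the ones on the bricks) is reachable as a local path from wherever the script runs. With bricks on different hosts, it would have to be run per host and the printed values compared by hand.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(paths):
    """Compare size and MD5 of several copies of the same file.

    Returns (all_match, details), where details maps each path to a
    (size_in_bytes, md5_hex) tuple.
    """
    details = {path: (os.path.getsize(path), md5_of(path)) for path in paths}
    all_match = len(set(details.values())) == 1
    return all_match, details


if __name__ == "__main__":
    import sys

    # Paths are passed on the command line, e.g. one per brick plus the mount.
    match, info = compare_copies(sys.argv[1:])
    for path, (size, checksum) in info.items():
        print(f"{size:>12}  {checksum}  {path}")
    print("copies match" if match else "copies differ")
```

If all copies report the same non-zero size and the same digest, as in this thread, the data itself is consistent and the pending-heal state has to come from the metadata (the extended attributes) rather than from the file contents.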
Ravishankar N
2016-Nov-23 12:22 UTC
[Gluster-users] Files won't heal, although no obvious problem visible
On 11/23/2016 04:56 PM, Pavel Cernohorsky wrote:
> Hello, thanks for your reply, answers are in the text.
>
> On 11/23/2016 11:55 AM, Ravishankar N wrote:
>> On 11/23/2016 03:56 PM, Pavel Cernohorsky wrote:
>>> The "hot-client-21" is, based on the vol-file, the following of the
>>> bricks:
>>> option remote-subvolume /opt/data/hdd5/gluster
>>> option remote-host 10.10.27.11
>>>
>>> I have self healing daemon disabled, but when I try to trigger
>>> healing manually (gluster volume heal <volname>), I get: "Launching
>>> heal operation to perform index self heal on volume <volname> has
>>> been unsuccessful on bricks that are down. Please check if all brick
>>> processes are running.", although all the bricks are online (gluster
>>> volume status <volname>).
>>
>> Can you enable the self-heal daemon and try again ? `gluster volume
>> heal <volname>` requires the shd to be enabled. The error message
>> that you get is inappropriate and is being fixed.
>
> When I enabled the self heal daemon, I was able to start healing, and
> the files were actually healed. What does self-heal daemon do in
> addition to the automated healing when you read the file?

The lookup/read code path doesn't seem to consider a file with only the afr.dirty xattr non-zero as a candidate for heal (while the self-heal daemon code path does). I'm not sure at this point whether it should, because having just afr.dirty set on all bricks, without any trusted.afr.xxx-client-xxx being set, doesn't seem to be something that should be hit under normal circumstances. I'll need to think about this more.

>
> The original reason to disable self heal daemon was to be able to
> control the amount of resources used by the healing, because the
> "cluster.background-self-heal-count: 1" did not help very much and the
> amount of both network and disk io consumed was just extreme.
>
> And I am also pretty sure we have seen similar problem (not sure about
> the attributes) before we disabled the shd.
>
>>
>>>
>>> When I try to just md5sum the file, to trigger automated healing on
>>> file manipulation, I get the result, but the file is not healed
>>> anyway. This usually works when I do not get 3 entries for the same
>>> file in the heal info.
>>
>> Is the file size for 99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4
>> non-zero on the 2 data bricks (i.e. on 10.10.27.11 and 10.10.27.10)
>> and do they match?
>> Do the md5sums match with what you got on the mount when you
>> calculate it directly on these bricks?
>
> The file has non-zero size on both the data bricks, and the md5 sum
> was the same on both of them before they were healed, after the
> healing (enabling the shd and healing start) the md5 did not change on
> either of the data bricks. Mount point reports the same md5 as all the
> other attempts directly on the bricks. So what is actually happening
> there? Why was the file blamed (not unblamed after healing?)?

That means there was no real heal pending. But because the dirty xattr was set, the shd picked a brick as the source and did the heal anyway. We would need to find out how we ended up in the 'only afr.dirty xattr was set' state for the file.

-Ravi

>
> Thanks for your answers,
> Pavel
>
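
To spot the 'only afr.dirty xattr was set' state described above, the hex values shown by `getfattr -d -m . -e hex <file>` on a brick can be decoded by hand or with a small script. The following is a rough sketch and not the actual AFR code: it assumes each AFR xattr value is 12 bytes holding three big-endian 32-bit counters (data, metadata, entry), and the function names, the example key `trusted.afr.hot-client-21`, and the sample values are illustrative only.

```python
import struct


def decode_afr(hex_value):
    """Decode one AFR xattr value (e.g. '0x0000000100000000000000000')
    into its data, metadata and entry counters."""
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def classify(xattrs):
    """Classify the AFR xattrs of one brick copy of a file.

    xattrs maps xattr names to hex strings as printed by getfattr.
    Returns 'clean', 'dirty-only', 'pending' or 'dirty+pending'.
    """
    dirty = False
    pending = False
    for name, value in xattrs.items():
        if not name.startswith("trusted.afr."):
            continue
        nonzero = any(decode_afr(value).values())
        if name == "trusted.afr.dirty":
            dirty = dirty or nonzero
        elif "-client-" in name:
            pending = pending or nonzero
    if dirty and pending:
        return "dirty+pending"
    if dirty:
        return "dirty-only"
    if pending:
        return "pending"
    return "clean"


if __name__ == "__main__":
    # Hypothetical values: dirty counter set, no per-client pending counters.
    sample = {
        "trusted.afr.dirty": "0x000000010000000000000000",
        "trusted.afr.hot-client-21": "0x000000000000000000000000",
    }
    print(classify(sample))
```

A copy classified as `dirty-only` on every brick would correspond to the case in this thread: nothing blames a particular brick, so the read path finds no heal to do, while the shd still picks a source and heals.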