ABHISHEK PALIWAL
2016-Mar-03 05:44 UTC
[Gluster-users] [Gluster-devel] Query on healing process
Hi Ravi,

As I discussed earlier, I investigated this issue and found that healing is not
triggered because the "gluster volume heal c_glusterfs info split-brain"
command shows no entries in its output, even though the file is in split brain.

So, what I have done is manually delete the gfid entry of that file from the
.glusterfs directory and follow the instructions in the following link to do
the heal:

https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md

and this works fine for me.

But my question is why the split-brain command is not showing any file in its
output.

Here I am attaching all the logs which I got from the node for you, and also
the output of the commands from both of the boards.

In this tar file two directories are present:

000300 - logs for the board which is running continuously
002500 - logs for the board which was rebooted

I am waiting for your reply; please help me out with this issue.

Thanks in advance.

Regards,
Abhishek

On Fri, Feb 26, 2016 at 1:21 PM, ABHISHEK PALIWAL <abhishpaliwal at gmail.com> wrote:
> On Fri, Feb 26, 2016 at 10:28 AM, Ravishankar N <ravishankar at redhat.com> wrote:
>> On 02/26/2016 10:10 AM, ABHISHEK PALIWAL wrote:
>>> Yes correct
>>
>> Okay, so when you say the files are not in sync until some time, are you
>> getting stale data when accessing from the mount?
>> I'm not able to figure out why heal info shows zero when the files are
>> not in sync, despite all IO happening from the mounts. Could you provide
>> the output of getfattr -d -m . -e hex /brick/file-name from both bricks
>> when you hit this issue?
>
> I'll provide the logs once I get them. Here "delay" means we are powering
> on the second board after 10 minutes.
>
>> On Feb 26, 2016 9:57 AM, "Ravishankar N" <ravishankar at redhat.com> wrote:
>>> Hello,
>>>
>>> On 02/26/2016 08:29 AM, ABHISHEK PALIWAL wrote:
>>>> Hi Ravi,
>>>>
>>>> Thanks for the response.
>>>>
>>>> We are using GlusterFS 3.7.8.
>>>>
>>>> Here is the use case: we have a logging file which saves logs of the
>>>> events for every board of a node, and these files are kept in sync
>>>> using GlusterFS. The system is in replica 2 mode, meaning that when
>>>> one brick in a replicated volume goes offline, the glusterd daemons
>>>> on the other nodes keep track of all the files that are not
>>>> replicated to the offline brick. When the offline brick becomes
>>>> available again, the cluster initiates a healing process, replicating
>>>> the updated files to that brick. But in our case, we see that the log
>>>> file of one board is not in sync and its format is corrupted, i.e.
>>>> the files are not in sync.
>>>
>>> Just to understand you correctly: you have mounted the 2-node replica-2
>>> volume on both these nodes and are writing to a logging file from the
>>> mounts, right?
>>>
>>>> Even the output of "gluster volume heal c_glusterfs info" shows that
>>>> there are no pending heals.
>>>>
>>>> Also, the logging file which is updated is of fixed size, and new
>>>> entries wrap around, overwriting the old entries.
>>>>
>>>> This way we have seen that after a few restarts the contents of the
>>>> same file on the two bricks are different, but the volume heal info
>>>> shows zero entries.
>>>>
>>>> Solution: when we put a delay of more than 5 min before the healing,
>>>> everything works fine.
>>>>
>>>> Regards,
>>>> Abhishek
>>>>
>>>> On Fri, Feb 26, 2016 at 6:35 AM, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>> On 02/25/2016 06:01 PM, ABHISHEK PALIWAL wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Here I have one query regarding the time taken by the healing
>>>>>> process. In the current two-node setup, when we rebooted one node,
>>>>>> the self-healing process started in less than a 5-minute interval
>>>>>> on the board, resulting in the corruption of some files' data.
>>>>>
>>>>> Heal should start immediately after the brick process comes up. What
>>>>> version of gluster are you using? What do you mean by corruption of
>>>>> data? Also, how did you observe that the heal started after 5
>>>>> minutes?
>>>>> -Ravi
>>>>>
>>>>>> And to resolve it I searched on Google and found the following link:
>>>>>> https://support.rackspace.com/how-to/glusterfs-troubleshooting/
>>>>>>
>>>>>> It mentions that the healing process can take up to 10 minutes to
>>>>>> start. Here is the statement from the link:
>>>>>>
>>>>>> "Healing replicated volumes
>>>>>>
>>>>>> When any brick in a replicated volume goes offline, the glusterd
>>>>>> daemons on the remaining nodes keep track of all the files that are
>>>>>> not replicated to the offline brick. When the offline brick becomes
>>>>>> available again, the cluster initiates a healing process,
>>>>>> replicating the updated files to that brick. *The start of this
>>>>>> process can take up to 10 minutes, based on observation.*"
>>>>>>
>>>>>> After allowing more than 5 minutes, the file corruption problem was
>>>>>> resolved.
>>>>>>
>>>>>> So, here my question is: is there any way through which we can
>>>>>> reduce the time taken by the healing process to start?
>>>>>>
>>>>>> Regards,
>>>>>> Abhishek Paliwal
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel at gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel

-- 
Regards
Abhishek Paliwal

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160303/055d9e7c/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: HU37300_rep.tar.gz
Type: application/x-gzip
Size: 250104 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160303/055d9e7c/attachment-0001.gz>
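(As a side note on the manual fix described above: the ".glusterfs gfid entry"
that gets deleted is a hard link whose path is derived from the file's
trusted.gfid xattr. The helper below is a hypothetical illustration, not a
gluster tool; the gfid value is the one from the getfattr output in this
thread.)

```shell
# Hypothetical helper: given a trusted.gfid hex value as printed by
# "getfattr -e hex", print the relative path of the matching hard link
# under the brick's .glusterfs directory.
gfid_to_path() {
    hex="${1#0x}"                      # strip the 0x prefix -> 32 hex chars
    # Re-insert the dashes of the canonical UUID form (8-4-4-4-12)
    uuid="$(printf '%s-%s-%s-%s-%s' \
        "${hex:0:8}" "${hex:8:4}" "${hex:12:4}" "${hex:16:4}" "${hex:20:12}")"
    # The link lives under .glusterfs/<first 2 hex chars>/<next 2>/<uuid>
    printf '.glusterfs/%s/%s/%s\n' "${hex:0:2}" "${hex:2:2}" "$uuid"
}

gfid_to_path 0x9f5e354ecfda40149ddce7d5ffe760ae
# .glusterfs/9f/5e/9f5e354e-cfda-4014-9ddc-e7d5ffe760ae
```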
Ravishankar N
2016-Mar-03 10:40 UTC
[Gluster-users] [Gluster-devel] Query on healing process
Hi,

On 03/03/2016 11:14 AM, ABHISHEK PALIWAL wrote:
> Hi Ravi,
>
> As I discussed earlier, I investigated this issue and found that healing
> is not triggered because the "gluster volume heal c_glusterfs info
> split-brain" command shows no entries in its output, even though the
> file is in split brain.

A couple of observations from the 'commands_output' file. The afr xattrs do
not indicate that the file is in split brain:

getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
# file: opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dd1d000ec7a9
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000080000000000000000
trusted.afr.c_glusterfs-client-2=0x000000020000000000000000
trusted.afr.c_glusterfs-client-4=0x000000020000000000000000
trusted.afr.c_glusterfs-client-6=0x000000020000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dcb7000c87e7
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

1. There doesn't seem to be a split-brain, going by the trusted.afr* xattrs.

2. You seem to have re-used the bricks from another volume/setup. For
replica 2, only trusted.afr.c_glusterfs-client-0 and
trusted.afr.c_glusterfs-client-1 must be present, but I see 4 such xattrs:
client-0, 2, 4 and 6.

3. On the rebooted node, do you have ssl enabled by any chance? There is a
bug for "Not able to fetch volfile" when ssl is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=1258931

Btw, for data and metadata split-brains you can use the gluster CLI
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
instead of modifying the file from the back end.

-Ravi

> So, what I have done is manually delete the gfid entry of that file from
> the .glusterfs directory and follow the instructions in the following
> link to do the heal:
>
> https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md
>
> and this works fine for me.
>
> But my question is why the split-brain command is not showing any file
> in its output.
>
> Here I am attaching all the logs which I got from the node for you and
> also the output of the commands from both of the boards.
>
> In this tar file two directories are present:
>
> 000300 - logs for the board which is running continuously
> 002500 - logs for the board which was rebooted
>
> I am waiting for your reply; please help me out with this issue.
>
> Thanks in advance.
>
> Regards,
> Abhishek

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160303/34441e6b/attachment.html>
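(A note on reading the trusted.afr values quoted above: each value packs
three big-endian 32-bit counters, i.e. pending data, metadata and entry
operations that this brick records against the named client. The sketch
below assumes that standard AFR changelog layout; it is an illustration,
not a gluster utility.)

```python
# Decode a trusted.afr.<volume>-client-N changelog xattr as printed by
# "getfattr -e hex". The 24 hex digits split into three 32-bit counters:
# pending data, metadata and entry operations against that client/brick.
def decode_afr(value: str):
    h = value[2:] if value.startswith("0x") else value
    data, metadata, entry = (int(h[i:i + 8], 16) for i in (0, 8, 16))
    return data, metadata, entry

# Values from the rebooted board in this thread:
print(decode_afr("0x000000080000000000000000"))  # (8, 0, 0): 8 pending data ops
print(decode_afr("0x000000000000000000000000"))  # (0, 0, 0): clean
```

A data split-brain requires both copies to blame each other; here only one
brick's copy carries non-zero pending data counters while the other is all
zeros, which is consistent with "heal info split-brain" listing nothing.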