Anuradha Talur
2014-Nov-21 07:24 UTC
[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
----- Original Message -----
> From: "Joe Julian" <joe at julianfamily.org>
> To: "Anuradha Talur" <atalur at redhat.com>, "Vince Loschiavo" <vloschiavo at gmail.com>
> Cc: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> Sent: Friday, November 21, 2014 12:06:27 PM
> Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
>
> On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur at redhat.com> wrote:
> >
> >----- Original Message -----
> >> From: "Vince Loschiavo" <vloschiavo at gmail.com>
> >> To: "gluster-users at gluster.org" <Gluster-users at gluster.org>
> >> Sent: Wednesday, November 19, 2014 9:50:50 PM
> >> Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
> >>
> >> Hello Gluster Community,
> >>
> >> I have been using the Nagios monitoring scripts, mentioned in the below
> >> thread, on 3.5.2 with great success. The most useful of these is the self
> >> heal check.
> >>
> >> However, I've just upgraded to 3.6.1 in the lab and the self-heal daemon
> >> has become quite aggressive. I continually get alerts/warnings on 3.6.1
> >> that virt disk images need self heal, and then they clear. This is not
> >> the case on 3.5.2.
> >>
> >> Configuration:
> >> 2-node, 2-brick replicated volume with a 2x1GB LAG network between the
> >> peers, used as a QEMU/KVM virt image store through the FUSE mount on
> >> CentOS 6.5.
> >>
> >> Example:
> >> On 3.5.2:
> >> gluster volume heal volumename info: shows the bricks and the number of
> >> entries to be healed: 0
> >>
> >> On v3.5.2, during normal gluster operations, I can run this command over
> >> and over again, 2-4 times per second, and it will always show 0 entries
> >> to be healed. I've used this as an indicator that the bricks are
> >> synchronized.
> >>
> >> Last night I upgraded to 3.6.1 in the lab and I'm seeing different
> >> behavior. Running gluster volume heal volumename info during normal
> >> operations will show a file out of sync, seemingly between every block
> >> being written to disk and then synced to the peer. I can run the command
> >> over and over again, 2-4 times per second, and it will almost always show
> >> something out of sync. The individual files change, meaning:
> >>
> >> Example:
> >> 1st run: shows file1 out of sync
> >> 2nd run: shows file2 and file3 out of sync, but file1 is now in sync (not
> >> in the list)
> >> 3rd run: shows file3 and file4 out of sync, but file1 and file2 are in
> >> sync (not in the list)
> >> ...
> >> nth run: shows 0 files out of sync
> >> nth+1 run: shows file3 and file12 out of sync
> >>
> >> From looking at the virtual machines running off this gluster volume,
> >> it's obvious that gluster is working well. However, this obviously plays
> >> havoc with Nagios and alerts. Nagios will run the heal info command, get
> >> different and non-useful results each time, and send alerts.
> >>
> >> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune
> >> the settings or change the monitoring method to get better results into
> >> Nagios?
> >>
> >In 3.6.1, the way the heal info command works is different from that in
> >3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries that
> >might need healing. Currently, in 3.6.1, there is no way, while listing, to
> >distinguish between a file that is being healed and a file with ongoing
> >I/O. Hence files undergoing normal I/O are also listed in the output of the
> >heal info command.
>
> How did that regression pass?!

Test cases to check this condition were not written in the regression tests.

--
Thanks,
Anuradha.
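One way to keep Nagios quiet despite these transient listings is to treat only
entries that persist across several consecutive heal info runs as actionable.
Below is a minimal sketch of such a check; it is not the plugin referenced in
this thread, and the volume name, sampling parameters, and parsing of the heal
info output (skipping "Brick", "Status" and "Number of entries" lines and
treating the rest as entry paths) are assumptions.

#!/usr/bin/env python
# Minimal sketch of a debounced Nagios-style heal check -- illustrative only,
# not the monitoring plugin discussed in this thread. Assumes the output of
# "gluster volume heal <vol> info" lists entry paths under per-brick headers
# along with "Number of entries:" and "Status:" lines.
import subprocess
import sys
import time

VOLUME = "volumename"   # hypothetical; substitute the real volume name
SAMPLES = 3             # number of heal info samples to take
INTERVAL = 5            # seconds between samples

def heal_entries(volume):
    """Return the set of entry paths reported by a single heal info run."""
    out = subprocess.check_output(
        ["gluster", "volume", "heal", volume, "info"]).decode()
    entries = set()
    for line in out.splitlines():
        line = line.strip()
        # Skip headers and counters; keep lines that look like entries.
        if not line or line.startswith(("Brick ", "Number of entries", "Status")):
            continue
        entries.add(line)
    return entries

def main():
    try:
        persistent = heal_entries(VOLUME)
        for _ in range(SAMPLES - 1):
            time.sleep(INTERVAL)
            # Keep only entries reported in every sample; files that merely
            # had in-flight I/O at one instant drop out of the set.
            persistent &= heal_entries(VOLUME)
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN: heal info failed: %s" % exc)
        sys.exit(3)

    if persistent:
        print("WARNING: %d entries pending heal in every one of %d samples"
              % (len(persistent), SAMPLES))
        sys.exit(1)

    print("OK: no persistent heal entries on volume %s" % VOLUME)
    sys.exit(0)

if __name__ == "__main__":
    main()

Sampling a few seconds apart trades a little alert latency for stability: a
genuinely un-healed file keeps showing up and still triggers a warning, while
entries that only reflect in-flight writes fall out of the intersection.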
Vince Loschiavo
2014-Nov-22 16:42 UTC
[Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Thank you for that information. Are there plans to restore the previous
functionality in a later release of 3.6.x, or is this what we should expect
going forward?

On Thu, Nov 20, 2014 at 11:24 PM, Anuradha Talur <atalur at redhat.com> wrote:
> > >In 3.6.1, the way the heal info command works is different from that in
> > >3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries that
> > >might need healing. Currently, in 3.6.1, there is no way, while listing,
> > >to distinguish between a file that is being healed and a file with
> > >ongoing I/O. Hence files undergoing normal I/O are also listed in the
> > >output of the heal info command.
> >
> > How did that regression pass?!
>
> Test cases to check this condition were not written in the regression tests.
>
> --
> Thanks,
> Anuradha.

--
-Vince Loschiavo