Strahil
2019-Mar-06 16:51 UTC
[Gluster-users] Gluster : Improvements on "heal info" command
Hi,

This sounds nice. I would like to ask: does the scan start from the local node's bricks first? (I am talking about --brick=one)

Best Regards,
Strahil Nikolov

On Mar 5, 2019 10:51, Ashish Pandey <aspandey at redhat.com> wrote:
>
> Hi All,
>
> We have observed, and heard from gluster users, that the "heal info" command takes a long time. Even when all we want to know is whether a gluster volume is healthy or not, it has to list all the files from all the bricks before we can be sure whether the volume is healthy.
> Here, we have come up with some options for the "heal info" command which provide a report quickly and reliably.
>
> gluster v heal vol info --subvol=[number of the subvol] --brick=[one,all]
> --------
>
> Problem: The "gluster v heal <volname> info" command picks each subvolume and checks the .glusterfs/indices/xattrop folder of every brick of that subvolume to find out if there is any entry which needs to be healed. It picks each entry and takes a lock on it to check its xattrs and find out whether that entry actually needs heal or not. This LOCK->CHECK-XATTR->UNLOCK cycle takes a lot of time for each file.
>
> Let's consider the two most common cases in which we use "heal info" and try to understand the improvements.
>
> Case 1: Consider a 4+2 EC volume with all the bricks on 6 different nodes. One brick of the volume is down and a client has written 10000 files on a mount point of this volume. Entries for these 10K files will be created in ".glusterfs/indices/xattrop" on the remaining 5 bricks. Now the brick comes back UP, and when we use the "heal info" command for this volume, it goes to all the bricks, picks up these 10K file entries and goes through the LOCK->CHECK-XATTR->UNLOCK cycle for each of them. This happens on every brick, which means we check 50K files and perform the LOCK->CHECK-XATTR->UNLOCK cycle 50K times, while checking only 10K entries would have been sufficient. It is a very time-consuming operation. If IOs are happening on some new files, we check those files as well, which adds to the time. Yet all we wanted to know here is whether our volume has healed and is healthy.
>
> Solution: Whenever a brick goes down and comes back up and we use the "heal info" command, our *main intention* is to find out whether the volume is *healthy* or *unhealthy*. A volume is unhealthy even if a single file is not healthy. So we should scan the bricks one by one, and as soon as we find one brick that has entries requiring heal, we can stop, list those files and say the volume is not healthy. There is no need to scan the rest of the bricks. That is where the "--brick=[one,all]" option has been introduced.
>
> "gluster v heal vol info --brick=[one,all]"
> "one" - It will scan the bricks sequentially and, as soon as it finds any unhealthy entries, it will list them and stop scanning the other bricks.
> "all" - It will act just like the current behavior and list all the files from all the bricks. If this option is not provided, the default (current) behavior applies.
>
> Case 2: Consider a 24 x (4+2) EC volume. Let's say one brick from *only one* of the subvolumes has been replaced and a heal has been triggered. To know whether the volume is in a healthy state, we currently go to each brick of *each and every subvolume* and check whether there are any entries in the ".glusterfs/indices/xattrop" folder which need heal. If we know which subvolume participated in the brick replacement, we only need to check the health of that subvolume and not query/check the other subvolumes.
> If several clients are writing a number of files on this volume, an entry for each of these files will be created in .glusterfs/indices/xattrop, and the "heal info" command will go through the LOCK->CHECK-XATTR->UNLOCK cycle to find out whether these entries need heal, which takes a lot of time. In addition, a client will also see a performance drop, as it will have to release and take the lock again.
>
> Solution: Provide an option to specify the subvolume for which we want to check heal info.
>
> "gluster v heal vol info --subvol=<number of subvolume>"
> Here, --subvol is given the number of the subvolume we want to check.
> Example:
> "gluster v heal vol info --subvol=1"
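>
> To make the proposed behaviour concrete, here is a rough Python sketch of
> the scan logic. This is only an illustration of the idea, not the actual
> glusterfs code: list_index_entries(), check_entry() and the volume/brick
> data layout are hypothetical stand-ins for reading .glusterfs/indices/xattrop
> and for the real LOCK->CHECK-XATTR->UNLOCK cycle.
>
>     def list_index_entries(brick):
>         # Stand-in: in reality this lists .glusterfs/indices/xattrop on the brick.
>         return brick["index_entries"]
>
>     def check_entry(brick, entry):
>         # Stand-in for the expensive LOCK -> CHECK-XATTR -> UNLOCK cycle.
>         return entry.get("needs_heal", False)
>
>     def heal_info(volume, subvol=None, brick_mode="all"):
>         """Return (healthy, pending_entries).
>
>         subvol:     1-based subvolume number to restrict the scan to (--subvol=N).
>         brick_mode: "all" -> scan every brick and report everything (current behaviour);
>                     "one" -> stop at the first brick with unhealthy entries (--brick=one).
>         """
>         pending = []
>         subvols = volume["subvols"]
>         if subvol is not None:
>             subvols = [subvols[subvol - 1]]   # check only the requested subvolume
>         for sv in subvols:
>             for brick in sv["bricks"]:
>                 brick_pending = [e["path"] for e in list_index_entries(brick)
>                                  if check_entry(brick, e)]
>                 pending.extend(brick_pending)
>                 if brick_mode == "one" and brick_pending:
>                     # Volume is already known to be unhealthy;
>                     # no need to scan the remaining bricks.
>                     return False, pending
>         return len(pending) == 0, pending
>
> The early return is the whole point of --brick=one: a healthy volume still
> gets a full scan, but an unhealthy one is usually detected after the first
> brick that has pending entries.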
>
> ===================================
> Performance Data -
> A quick performance test done on a standalone system.
>
> Type: Distributed-Disperse
> Volume ID: ea40eb13-d42c-431c-9c89-0153e834e67e
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x (4 + 2) = 12
> Transport-type: tcp
> Bricks:
> Brick1: apandey:/home/apandey/bricks/gluster/vol-1
> Brick2: apandey:/home/apandey/bricks/gluster/vol-2
> Brick3: apandey:/home/apandey/bricks/gluster/vol-3
> Brick4: apandey:/home/apandey/bricks/gluster/vol-4
> Brick5: apandey:/home/apandey/bricks/gluster/vol-5
> Brick6: apandey:/home/apandey/bricks/gluster/vol-6
> Brick7: apandey:/home/apandey/bricks/gluster/new-1
> Brick8: apandey:/home/apandey/bricks/gluster/new-2
> Brick9: apandey:/home/apandey/bricks/gluster/new-3
> Brick10: apandey:/home/apandey/bricks/gluster/new-4
> Brick11: apandey:/home/apandey/bricks/gluster/new-5
> Brick12: apandey:/home/apandey/bricks/gluster/new-6
>
> I just disabled the shd (self-heal daemon) to get the data.
>
> Killed one brick from each of the two subvolumes and wrote 2000 files on the mount point:
> [root at apandey vol]# for i in {1..2000};do echo abc >> file-$i; done
>
> Then started the volume with the force option and got the heal info. Following is the data:
>
> [root at apandey glusterfs]# time gluster v heal vol info --brick=one >> /dev/null    <<<<<<<< This scans the bricks one by one and comes out as soon as we find the volume is unhealthy.
>
> real    0m8.316s
> user    0m2.241s
> sys     0m1.278s
> [root at apandey glusterfs]#
>
> [root at apandey glusterfs]# time gluster v heal vol info >> /dev/null    <<<<<<<< This is the current behavior.
>
> real    0m26.097s
> user    0m10.868s
> sys     0m6.198s
> [root at apandey glusterfs]#
> ===================================
>
> I would like your comments/suggestions on these improvements.
> Especially, I would like to hear your opinion on the new syntax of the command:
>
> gluster v heal vol info --subvol=[number of the subvol] --brick=[one,all]
>
> Note that if the new options are not provided, the command will behave just like it does right now.
> Also, this improvement is valid for both AFR and EC.
>
> ---
> Ashish
Ashish Pandey
2019-Mar-06 17:03 UTC
[Gluster-users] Gluster : Improvements on "heal info" command
No, it is not necessary that the first brick would be the local one, and I really don't think starting from the local node will make a difference. Most of the time is not spent in getting the list of entries from the .glusterfs/indices/xattrop folder; the LOCK->CHECK-XATTR->UNLOCK cycle is what takes most of the time, and that does not change even if the brick is local.

---
Ashish
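To relate this back to the sketch in the earlier mail: the expensive step there is check_entry(). A slightly less simplified stand-in (the helper names are hypothetical, not the real gluster code paths) would look like the following, which also shows why the brick order barely matters: only the index directory listing is local, while the lock and the xattr checks talk to every brick of the subvolume for every entry.

    def take_lock(bricks, entry):
        # Stand-in for the cluster-wide lock: a network round trip to all bricks.
        pass

    def release_lock(bricks, entry):
        # Stand-in for releasing that lock: another round trip to all bricks.
        pass

    def query_heal_xattrs(brick, entry):
        # Stand-in for fetching the heal-related xattrs of one entry from one brick.
        return brick["xattrs"].get(entry, {})

    def check_entry(subvol_bricks, entry):
        # The LOCK -> CHECK-XATTR -> UNLOCK cycle: the lock and the xattr queries
        # go to every brick of the subvolume over the network, regardless of which
        # brick's index directory the entry was read from.
        take_lock(subvol_bricks, entry)
        xattrs = [query_heal_xattrs(b, entry) for b in subvol_bricks]
        release_lock(subvol_bricks, entry)
        return any(x.get("dirty") for x in xattrs)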
"all" - It will act just like current behavior and provide all the files from all the bricks. If we do not provide this option, default (current) behavior will be applicable. Case -2 : Consider 24 X (4+2) EC volume. Let's say one brick from *only one* of the sub volume has been replaced and a heal has been triggered. To know if the volume is in healthy state, we go to each brick of *each and every sub volume* and check if there are any entries in ".glusterfs/indices/xattrop" folder which need heal or not. If we know which sub volume participated in brick replacement, we just need to check health of that sub volume and not query/check other sub volumes. If several clients are writing number of files on this volume, an entry for each of these files will be created in .glusterfs/indices/xattrop and "heal info' command will go through LOCK->CHECK-XATTR->UNLOCK cycle to find out if these entries need heal or not which takes lot of time. In addition to this a client will also see performance drop as it will have to release and take lock again. Solution: Provide an option to mention number of sub volume for which we want to check heal info. "gluster v heal vol info --subvol=<no of subvolume> " Here, --subvol will be given number of the subvolume we want to check. Example: "gluster v heal vol info --subvol=1 " =================================== Performance Data - A quick performance test done on standalone system. Type: Distributed-Disperse Volume ID: ea40eb13-d42c-431c-9c89-0153e834e67e Status: Started Snapshot Count: 0 Number of Bricks: 2 x (4 + 2) = 12 Transport-type: tcp Bricks: Brick1: apandey:/home/apandey/bricks/gluster/vol-1 Brick2: apandey:/home/apandey/bricks/gluster/vol-2 Brick3: apandey:/home/apandey/bricks/gluster/vol-3 Brick4: apandey:/home/apandey/bricks/gluster/vol-4 Brick5: apandey:/home/apandey/bricks/gluster/vol-5 Brick6: apandey:/home/apandey/bricks/gluster/vol-6 Brick7: apandey:/home/apandey/bricks/gluster/new-1 Brick8: apandey:/home/apandey/bricks/gluster/new-2 Brick9: apandey:/home/apandey/bricks/gluster/new-3 Brick10: apandey:/home/apandey/bricks/gluster/new-4 Brick11: apandey:/home/apandey/bricks/gluster/new-5 Brick12: apandey:/home/apandey/bricks/gluster/new-6 Just disabled the shd to get the data - Killed one brick each from two subvolumes and wrote 2000 files on mount point. [root at apandey vol]# for i in {1..2000};do echo abc >> file-$i; done Start the volume using force option and get the heal info. Following is the data - [root at apandey glusterfs]# time gluster v heal vol info --brick=one >> /dev/null <<<<<<<< This will scan brick one by one and come out as soon as we find volume is unhealthy. real 0m8.316s user 0m2.241s sys 0m1.278s [root at apandey glusterfs]# [root at apandey glusterfs]# time gluster v heal vol info >> /dev/null <<<<<<<< This is current behavior. real 0m26.097s user 0m10.868s sys 0m6.198s [root at apandey glusterfs]# =================================== I would like your comments/suggestions on this improvements. Specially, would like to hear on the new syntax of the command - gluster v heal vol info --subvol=[number of the subvol] --brick=[one,all] Note that if we do not provide new options, command will behave just like it does right now. Also, this improvement is valid for AFR and EC. --- Ashish _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190306/f5ea91fd/attachment.html>