Nithya Balachandran
2016-Nov-10 08:17 UTC
[Gluster-users] [Gluster-devel] Feedback on DHT option "cluster.readdir-optimize"
On 8 November 2016 at 20:21, Kyle Johnson <kjohnson at gnulnx.net> wrote:

> Hey there,
>
> We have a number of processes which daily walk our entire directory tree
> and perform operations on the found files.
>
> Pre-gluster, these processes were able to complete within 24 hours of
> starting. After outgrowing that single server and moving to a gluster
> setup (two bricks, two servers, distribute, 10gig uplink), the processes
> became unusable.
>
> After turning this option on, we were back to normal run times, with the
> processes completing within 24 hours.
>
> Our data is heavily nested in a large number of subfolders under /media/ftp.

Thanks for getting back to us - this is very good information. Can you
provide a few more details?

How deep is your directory tree and roughly how many directories do you have
at each level?
Are all your files in the lowest level dirs or do they exist on several
levels?
Would you be willing to provide the gluster volume info output for this
volume?

> A subset of our data:
>
> 15T of files in 48163 directories under /media/ftp/dig_dis.
>
> Without readdir-optimize:
>
> [root at colossus dig_dis]# time ls|wc -l
> 48163
>
> real 13m1.582s
> user 0m0.294s
> sys 0m0.205s
>
> With readdir-optimize:
>
> [root at colossus dig_dis]# time ls | wc -l
> 48163
>
> real 0m23.785s
> user 0m0.296s
> sys 0m0.108s
>
> Long story short - this option is super important to me as it resolved an
> issue that would have otherwise made me move my data off of gluster.
>
> Thank you for all of your work,
>
> Kyle
>
> On 11/07/2016 10:07 PM, Raghavendra Gowdappa wrote:
>
>> Hi all,
>>
>> We have an option called "cluster.readdir-optimize" which alters the
>> behavior of readdirp in DHT. Its value affects how storage/posix treats
>> dentries corresponding to directories (not files).
>>
>> When this value is on,
>> * DHT asks only one subvol/brick to return dentries corresponding to
>>   directories.
>> * Other subvols/bricks filter out dentries corresponding to directories
>>   and send only dentries corresponding to files.
>>
>> When this value is off (this is the default value),
>> * All subvols return all dentries stored on them. IOW, bricks don't
>>   filter any dentries.
>> * Since a directory has one dentry representing it on each subvol, dht
>>   (loaded on client) picks up the dentry only from the hashed subvol.
>>
>> Note that irrespective of the value of this option, _all_ subvols return
>> dentries corresponding to the files stored on them.
>>
>> This option was introduced to boost the performance of readdir: when it
>> is set to on, filtering of dentries happens on the bricks, and hence
>> there is a reduction in:
>> 1. network traffic (the redundant dentry information is filtered out
>>    before it is sent)
>> 2. the number of readdir calls between client and server for the same
>>    number of dentries returned to the application (if filtering happens
>>    on the client, each result holds fewer dentries and hence more readdir
>>    calls are needed. IOW, the result buffer is not filled to maximum
>>    capacity).
>>
>> We want to hear from you whether you've used this option and, if yes:
>> 1. Did it really boost readdir performance?
>> 2. Do you have any performance data showing the percentage of
>>    improvement (or deterioration)?
>> 3. What data set did you have (number of files, directories and
>>    organisation of directories)?
>>
>> If we find out that this option is really helping you, we can spend our
>> energies on fixing issues that arise when this option is set to on.
>> One common issue is that, when this option is set, some directories
>> might not show up in a directory listing [1]. The reason for this is
>> that:
>> 1. If a directory can be created on its hashed subvol, mkdir (the result
>>    returned to the application) will be successful, irrespective of the
>>    result of mkdir on the rest of the subvols.
>> 2. So, whichever subvol we pick to give us dentries corresponding to
>>    directories need not contain all the directories, and we might miss
>>    those directories in the listing.
>>
>> Your feedback is important for us and will help us to prioritize and
>> improve things.
>>
>> [1] https://www.gluster.org/pipermail/gluster-users/2016-October/028703.html
>>
>> regards,
>> Raghavendra
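
The behaviour Raghavendra describes in the quoted message can be illustrated with a toy model. The Python sketch below is not GlusterFS code: the `Brick` class, the `readdirp` function, the CRC32-based `hashed_subvol` placeholder, and the choice of the first subvol as the one that returns directory dentries are all assumptions made for illustration. It only counts how many dentries the bricks would send in each mode, and shows how a directory that exists only on its hashed subvol can drop out of the listing when the option is on.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Brick:
    """Toy stand-in for one DHT subvolume (hypothetical, not a GlusterFS type)."""
    dirs: set = field(default_factory=set)
    files: set = field(default_factory=set)


def hashed_subvol(name, n_bricks):
    # Assumption: CRC32 is only a deterministic placeholder for DHT's real hashing.
    return zlib.crc32(name.encode()) % n_bricks


def build_volume(n_bricks, n_dirs, n_files):
    """Every directory exists on every brick; each file lives on one brick."""
    bricks = [Brick() for _ in range(n_bricks)]
    for i in range(n_dirs):
        for brick in bricks:
            brick.dirs.add(f"dir{i}")
    for i in range(n_files):
        name = f"file{i}"
        bricks[hashed_subvol(name, n_bricks)].files.add(name)
    return bricks


def readdirp(bricks, optimize):
    """Return (listing seen by the application, dentries sent by the bricks)."""
    listing, sent = set(), 0
    for idx, brick in enumerate(bricks):
        if optimize:
            # Option on: bricks filter. Only one subvol (here, arbitrarily,
            # the first) returns directory dentries; the rest send files only.
            dirs_sent = brick.dirs if idx == 0 else set()
            dirs_kept = dirs_sent
        else:
            # Option off: every brick sends everything and the client keeps
            # a directory dentry only when it comes from the hashed subvol.
            dirs_sent = brick.dirs
            dirs_kept = {d for d in dirs_sent
                         if hashed_subvol(d, len(bricks)) == idx}
        sent += len(dirs_sent) + len(brick.files)
        listing |= dirs_kept | brick.files
    return listing, sent


bricks = build_volume(n_bricks=2, n_dirs=1000, n_files=50)
for mode in (False, True):
    listing, sent = readdirp(bricks, optimize=mode)
    print(f"optimize={mode}: {len(listing)} entries listed, {sent} dentries sent")

# Simulate a mkdir that succeeded only on the hashed subvol (brick 1 here).
orphan = next(f"new{i}" for i in range(1000)
              if hashed_subvol(f"new{i}", 2) == 1)
bricks[1].dirs.add(orphan)
print("listed with option off:", orphan in readdirp(bricks, False)[0])
print("listed with option on: ", orphan in readdirp(bricks, True)[0])
```

In this model the number of directory dentries sent with the option off grows with the number of subvols, which is consistent with the traffic reduction described in the quoted message. It does not model readdir buffer sizes or call counts.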
Vijay Bellur
2016-Nov-10 15:27 UTC
[Gluster-users] [Gluster-devel] Feedback on DHT option "cluster.readdir-optimize"
On Thu, Nov 10, 2016 at 3:17 AM, Nithya Balachandran <nbalacha at redhat.com> wrote:

> On 8 November 2016 at 20:21, Kyle Johnson <kjohnson at gnulnx.net> wrote:
>>
>> Hey there,
>>
>> We have a number of processes which daily walk our entire directory tree
>> and perform operations on the found files.
>>
>> Pre-gluster, these processes were able to complete within 24 hours of
>> starting. After outgrowing that single server and moving to a gluster setup
>> (two bricks, two servers, distribute, 10gig uplink), the processes became
>> unusable.
>>
>> After turning this option on, we were back to normal run times, with the
>> processes completing within 24 hours.
>>
>> Our data is heavily nested in a large number of subfolders under /media/ftp.
>
> Thanks for getting back to us - this is very good information. Can you
> provide a few more details?
>
> How deep is your directory tree and roughly how many directories do you have
> at each level?
> Are all your files in the lowest level dirs or do they exist on several
> levels?
> Would you be willing to provide the gluster volume info output for this
> volume?

I have seen a performance improvement with this option when the first level
below the root consisted of several thousand directories without any files.
IIRC, I was testing this in a 16 x 2 setup.

Regards,
Vijay
Kyle Johnson
2016-Nov-10 17:48 UTC
[Gluster-users] [Gluster-devel] Feedback on DHT option "cluster.readdir-optimize"
Sure, I'd be happy to supply some more details. See below.

On 11/10/2016 01:17 AM, Nithya Balachandran wrote:

> On 8 November 2016 at 20:21, Kyle Johnson <kjohnson at gnulnx.net> wrote:
>
>> Hey there,
>>
>> We have a number of processes which daily walk our entire directory
>> tree and perform operations on the found files.
>>
>> Pre-gluster, these processes were able to complete within 24 hours of
>> starting. After outgrowing that single server and moving to a
>> gluster setup (two bricks, two servers, distribute, 10gig uplink),
>> the processes became unusable.
>>
>> After turning this option on, we were back to normal run times, with
>> the processes completing within 24 hours.
>>
>> Our data is heavily nested in a large number of subfolders under
>> /media/ftp.
>
> Thanks for getting back to us - this is very good information. Can you
> provide a few more details?
>
> How deep is your directory tree and roughly how many directories do you
> have at each level?

It depends on the directory. For most of them, such as /media/ftp/dig_dis,
there is only one level of nesting, e.g. /media/ftp/dig_dis/4058765004173/
is one of the 48,000 directories.

With other directories, such as /media/ftp/believe_digital, there are two
levels of nesting, e.g. /media/ftp/believe_digital/20160225/3614597218815.
In this case, there are 463 top-level (date) directories, and then a huge
number of subdirs under them.

In both cases, once you get to the bottom of the directory tree, there are
generally no more than 20 files in the given directory, and they're
somewhat large (flac files).

While dig_dis is 15T of files, believe_digital is 26T of files.

Our processes operate on the individual subdirs under /media/ftp/, such as
/media/ftp/dig_dis. They don't start at /media/ftp.

> Are all your files in the lowest level dirs or do they exist on several
> levels?

They're all in the lowest level, though the number of nested directories
varies between 1 and 2, as seen above.

> Would you be willing to provide the gluster volume info output for this
> volume?

Sure.

[root at colossus dig_dis]# gluster volume info ftp

Volume Name: ftp
Type: Distribute
Volume ID: f3f2b222-575c-4c8d-92f1-e640fd7edfbb
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.110.1:/tank/bricks/ftp
Brick2: 192.168.110.2:/ftp/bricks/ftp
Options Reconfigured:
cluster.readdir-optimize: on
performance.client-io-threads: on
cluster.weighted-rebalance: off
performance.readdir-ahead: off
nfs.disable: on

With some additional info:

110.1 is a freebsd 10.3 box with zfs-backed bricks.
110.2 is a centos box on an older kernel (2.6.32) with zfs-backed bricks.
The .110 network is directly connected between the two hosts on 10GE NICs
with CAT6. Each host has 2 NICs and the NICs are LAGG'd (bonded) together
in mode 4.

weighted-rebalance is turned off because of
https://bugzilla.redhat.com/show_bug.cgi?id=1356076
readdir-ahead was turned off due to a tip in
https://bugzilla.redhat.com/show_bug.cgi?id=1369364
I don't specifically remember tweaking client-io-threads.

Hope this helps,
Kyle
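
For readers who want the `time ls | wc -l` figures reported in this thread as a single number, the short Python sketch below converts the two `real` values into seconds and divides them. The helper name `parse_time` is an assumption introduced for this example, and the result comes from one pair of measurements on one directory, so it is indicative rather than a benchmark.

```python
import re


def parse_time(value):
    """Convert a shell `time` value such as '13m1.582s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


ENTRIES = 48163                        # directories listed under dig_dis
without_opt = parse_time("13m1.582s")  # readdir-optimize off
with_opt = parse_time("0m23.785s")     # readdir-optimize on

speedup = without_opt / with_opt
print(f"off: {without_opt:.3f}s ({ENTRIES / without_opt:.0f} entries/s)")
print(f"on:  {with_opt:.3f}s ({ENTRIES / with_opt:.0f} entries/s)")
print(f"speedup: {speedup:.1f}x")
```

On these figures the listing is roughly 33 times faster with the option on.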