Nathan.Dauchy at noaa.gov
2011-Jan-07 00:42 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Greetings,

I am looking for more information regarding the "size on MDS" feature as it exists for lustre-1.8.x. Testing on our system (which started out as 1.6.6 and is now 1.8.x) indicates that there are many files which do not have the size information stored on the MDT. So, my basic question: under what conditions will the "size hint" attribute be updated? Is there any way to force the MDT to query the OSTs and update its information?

More details and background information...

We are currently running:
  Client: CentOS-5.5, kernel-2.6.18-194.11.4.el5, lustre-1.8.4
  Server: CentOS-5.3, kernel-2.6.18-164.15.1.el5, lustre-1.8.3.ddn4.1
with no immediate plans to upgrade to lustre-2.x.

I found the high level design here:
  http://wiki.lustre.org/images/9/94/Size_on_mds-hld.pdf
but that seems to be targeted at 2.0.x for full implementation.

I also discovered this old thread on the topic:
  http://lists.lustre.org/pipermail/lustre-discuss/2009-July/011149.html
in which Andreas wrote "In Lustre 1.6.7 the approximate file size started to be stored on the MDT inodes..." and "This size is not actively updated for pre-existing files...". (It seems as though this is still true for 1.8.x.)

The end goal of this is to facilitate efficient checks of disk usage on a per-directory basis (essentially we want "volume based quotas"). I'm hoping to run something once a day on the MDS like the following:

  lvcreate -s -p r -n mdt_snap /dev/mdt
  mount -t ldiskfs -o ro /dev/mdt_snap /mnt/snap
  cd /mnt/snap/ROOT
  du --apparent-size ./* > volume_usage.log
  cd /
  umount /mnt/snap
  lvremove /dev/mdt_snap

Since the data is going to be up to one day old anyway, I don't really mind that the file size is "approximate", but it does have to be reasonably close.

With the MDT LVM snapshot method I can check the whole 300TB file system in about 3 hours, whereas checking from a client takes weeks.

Here is why I am relatively certain that the size-on-MDS attributes are not updated (lightly edited):

  [root at mds0 ~]# ls -l /mnt/snap/ROOT/test/rollover/user_acct_file
  -rw-r--r-- 1 9999 9000 0 Mar 23 2010 /mnt/snap/ROOT/test/rollover/user_acct_file
  [root at mds0 ~]# du /mnt/snap/ROOT/test/rollover/user_acct_file
  0       /mnt/snap/ROOT/test/rollover/user_acct_file
  [root at mds0 ~]# du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file
  0       /mnt/snap/ROOT/test/rollover/user_acct_file

  [root at c448 ~]# ls -l /mnt/lfs0/test/rollover/user_acct_file
  -rw-r--r-- 1 user group 184435207 Mar 23 2010 /mnt/lfs0/test/rollover/user_acct_file
  [root at c448 ~]# du /mnt/lfs0/test/rollover/user_acct_file
  180120  /mnt/lfs0/test/rollover/user_acct_file
  [root at c448 ~]# du --apparent-size /mnt/lfs0/test/rollover/user_acct_file
  180113  /mnt/lfs0/test/rollover/user_acct_file

Thanks very much for any answers or suggestions you can provide!

-Nathan
Robin Humble
2011-Jan-13 00:45 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Hi Nathan,

On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
> I am looking for more information regarding the "size on MDS" feature as
> it exists for lustre-1.8.x. Testing on our system (which started out as
> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
> have the size information stored on the MDT. So, my basic question:
> under what conditions will the "size hint" attribute be updated? Is
> there any way to force the MDT to query the OSTs and update its
> information?

atime (and the MDT size hint) wasn't being updated for most of the 1.8 series due to this bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=23766
the atime fix is now in 1.8.5, but I'm not sure if anyone has verified whether or not the MDT size hint is now behaving as originally intended.

actually, it was never clear to me what (if anything?) ever accessed OBD_MD_FLSIZE... does someone have a hacked 'lfs find' or similar tool?

your approach of mounting and searching an MDT snapshot should be possible, but it would seem neater just to have a tool on a client send the right RPCs to the MDS and get the information that way.

like you, we are finding that the timescales for our filesystem trawling scripts are getting out of hand, mostly (we think) due to retrieving size information from very busy OSTs. a tool that only hit the MDT and found (filename, uid, gid, approx size) should help a lot. so +1 on this topic.

BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT size hints might be to read 4k from every file in the system. that should update atime and the size hint. please let us know if this works.

> The end goal of this is to facilitate efficient checks of disk usage on
> a per-directory basis (essentially we want "volume based quotas").

a possible approach for your situation would be to chgrp every file under a directory to the same gid, and then enable (un-enforcing) group quotas on your filesystem. then you wouldn't have to search any directories. you would still have to find and chgrp some files nightly, but 'lfs find' should make that relatively quick.

unfortunately we also need a breakdown of the uid information in each directory, so this approach isn't sufficient for us.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
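A minimal sketch of that read-4k hack, assuming GNU find and a client mount at /mnt/lfs0 (the mount point is just an example):

  # read the first 4k of every regular file from a client; each read should
  # prompt the client to send updated atime (and, if 1.8.5 behaves as hoped,
  # size-hint) attributes back to the MDS
  find /mnt/lfs0 -type f -print0 |
  while IFS= read -r -d '' f; do
      dd if="$f" of=/dev/null bs=4k count=1 2>/dev/null
  done

Expect this to take a long time on a large filesystem, since it touches every object.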
Nathan Dauchy
2011-Jan-13 18:38 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 01/12/2011 05:45 PM, Robin Humble wrote:
> On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
>> I am looking for more information regarding the "size on MDS" feature as
>> it exists for lustre-1.8.x. Testing on our system (which started out as
>> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
>> have the size information stored on the MDT. So, my basic question:
>> under what conditions will the "size hint" attribute be updated? Is
>> there any way to force the MDT to query the OSTs and update its
>> information?
>
> atime (and the MDT size hint) wasn't being updated for most of the 1.8
> series due to this bug:
> https://bugzilla.lustre.org/show_bug.cgi?id=23766
> the atime fix is now in 1.8.5, but I'm not sure if anyone has verified
> whether or not the MDT size hint is now behaving as originally intended.

Thanks for the pointer, Robin!

> your approach of mounting and searching an MDT snapshot should be
> possible, but it would seem neater just to have a tool on a client send
> the right RPCs to the MDS and get the information that way.

It would be great to have both options available. I was assuming the MDT snapshot would be easier, and potentially faster than waiting for network transactions too.

> like you, we are finding that the timescales for our filesystem
> trawling scripts are getting out of hand, mostly (we think) due to
> retrieving size information from very busy OSTs. a tool that only hit
> the MDT and found (filename, uid, gid, approx size) should help a lot.
> so +1 on this topic.

On a somewhat related note... we have recently discovered that the object caching added in 1.8 consumes all the memory on the OSS nodes, leaving insufficient block device cache for the inodes. This was making 'ls -l' and 'du' run 10-20x longer than when we were running lustre-1.6.7. If you are running 1.8 and want to try turning off the object caching, these are the settings you should look at:

  lctl conf_param fsname-OST00XX.ost.read_cache_enable=0
  lctl conf_param fsname-OST00XX.ost.writethrough_cache_enable=0

> BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT
> size hints might be to read 4k from every file in the system. that
> should update atime and the size hint. please let us know if this works.

Unfortunately, we aren't in a position to upgrade to 1.8.5 any time soon. If anyone can test this and see if it is possible to update the size hint AFTER initial object creation, it would be very much appreciated!

Regards,
Nathan
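For a temporary experiment (conf_param makes the change persistent via the MGS), the same caches can also be toggled at runtime on each OSS; the parameter names below are the 1.8 obdfilter proc entries, so verify them on your release before relying on this:

  lctl set_param obdfilter.*.read_cache_enable=0
  lctl set_param obdfilter.*.writethrough_cache_enable=0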
Andreas Dilger
2011-Jan-13 21:37 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-01-13, at 11:38, Nathan Dauchy wrote:
> On a somewhat related note... we have recently discovered that the
> object caching added in 1.8 consumes all the memory on the OSS nodes,
> leaving insufficient block device cache for the inodes. This was making
> 'ls -l' and 'du' run 10-20x longer than when we were running
> lustre-1.6.7. If you are running 1.8 and want to try turning off the
> object caching, these are the settings you should look at:
>
>   lctl conf_param fsname-OST00XX.ost.read_cache_enable=0
>   lctl conf_param fsname-OST00XX.ost.writethrough_cache_enable=0

It would probably be better to set:

  lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M

or similar, to limit the read cache to files 32MB in size or less (or whatever you consider "small" files at your site). That allows the read cache for config files and such, while not thrashing the cache while accessing large files.

We should probably change this to be the default, but at the time the read cache was introduced, we didn't know what should be considered a small vs. large file, and the amount of RAM and number of OSTs on an OSS, and the uses, vary so much that it is difficult to pick a single correct value.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
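For reference, the running value can be checked, and changed non-persistently, on an OSS through the obdfilter proc entries; treat the exact parameter names as a sketch to verify against your Lustre version:

  lctl get_param obdfilter.*.readcache_max_filesize
  lctl set_param obdfilter.*.readcache_max_filesize=32M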
Kit Westneat
2011-Jan-13 22:28 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
> It would probably be better to set:
>
>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>
> or similar, to limit the read cache to files 32MB in size or less (or
> whatever you consider "small" files at your site). That allows the read
> cache for config files and such, while not thrashing the cache while
> accessing large files.
>
> We should probably change this to be the default, but at the time the
> read cache was introduced, we didn't know what should be considered a
> small vs. large file, and the amount of RAM and number of OSTs on an
> OSS, and the uses, vary so much that it is difficult to pick a single
> correct value.

I was looking through the Linux vm settings and saw vfs_cache_pressure - has anyone tested performance with this parameter? Do you know if this would have any effect on file caching vs. ext4 metadata caching?

For us, Linux/Lustre would ideally push out data before the metadata, as the performance penalty for doing 4k reads on the s2a far outweighs any benefits of data caching.

Thanks,
Kit
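The knob Kit mentions is the standard Linux VM sysctl; for anyone wanting to experiment, values below 100 make the kernel reclaim inode/dentry caches less aggressively relative to the page cache (and 0 never reclaims them):

  sysctl vm.vfs_cache_pressure           # show the current value (default 100)
  sysctl -w vm.vfs_cache_pressure=50     # prefer keeping inodes/dentries cached
  # equivalent: echo 50 > /proc/sys/vm/vfs_cache_pressure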
Robin Humble
2011-Jan-28 07:34 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>> It would probably be better to set:
>>
>>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>
>> or similar, to limit the read cache to files 32MB in size or less (or
>> whatever you consider "small" files at your site). That allows the read
>> cache for config files and such, while not thrashing the cache while
>> accessing large files.
>>
>> We should probably change this to be the default, but at the time the
>> read cache was introduced, we didn't know what should be considered a
>> small vs. large file, and the amount of RAM and number of OSTs on an
>> OSS, and the uses, vary so much that it is difficult to pick a single
>> correct value.

limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise.

Nathan's approach of turning off the caches entirely is extreme, but if it gives us back some metadata performance then it might be worth it.

or is there a Lustre or VM setting to limit overall OSS cache size? I presume that Lustre's OSS caches are subject to normal Linux VM pagecache tweakables, but I don't think such a knob exists in Linux at the moment...

> I was looking through the Linux vm settings and saw vfs_cache_pressure -
> has anyone tested performance with this parameter? Do you know if this
> would have any effect on file caching vs. ext4 metadata caching?
>
> For us, Linux/Lustre would ideally push out data before the metadata, as
> the performance penalty for doing 4k reads on the s2a far outweighs any
> benefits of data caching.

good idea. if all inodes are always cached on OSS's then the fs should be far more responsive to stat loads... 4k/inode shouldn't use up too much of the OSS's ram (probably more like 1 or 2k/inode really).

anyway, following your idea, we tried vfs_cache_pressure=50 on our OSS's a week or so ago, but hit this within a couple of hours:
  https://bugzilla.lustre.org/show_bug.cgi?id=24401
it could have been a coincidence, I guess. did anyone else give it a try?

BTW, we recently had the opposite problem on a client that scans the filesystem - too many inodes were cached, leading to low memory problems on the client. we've had vfs_cache_pressure=150 set on that machine for the last month or so and it seems to help, although a more effective setting in this case was limiting ldlm locks, eg. from the Lustre manual:

  lctl set_param ldlm.namespaces.*osc*.lru_size=10000

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Jason Rappleye
2011-Jan-28 17:45 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
> On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>>> It would probably be better to set:
>>>
>>>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>>
>>> or similar, to limit the read cache to files 32MB in size or less (or
>>> whatever you consider "small" files at your site). That allows the read
>>> cache for config files and such, while not thrashing the cache while
>>> accessing large files.
>>>
>>> We should probably change this to be the default, but at the time the
>>> read cache was introduced, we didn't know what should be considered a
>>> small vs. large file, and the amount of RAM and number of OSTs on an
>>> OSS, and the uses, vary so much that it is difficult to pick a single
>>> correct value.
>
> limiting the total amount of OSS cache used in order to leave room for
> inodes/dentries might be more useful. the data cache will always fill
> up and push out inodes otherwise.

The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each is generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.

> Nathan's approach of turning off the caches entirely is extreme, but if
> it gives us back some metadata performance then it might be worth it.

We went to the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks.

The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more.

First, no attempt has been made to pre-read metadata from the MDT. The need to read in inode and directory blocks may slow things down quite a bit. I can't find the numbers in my notes at the moment, but I recall seeing 200-500 stats/second when the MDS needed to do I/O.

When memory runs low on a client, kswapd kicks in to try and free up pages. On the client I'm currently testing on, almost all of the memory used is in the slab. It looks like kswapd has a difficult time clearing things up, and the client can go several seconds before the current stat call is completed. Dropping caches will (temporarily) get the performance back to expected rates. I haven't dug into this one too much yet.

Sometimes the performance drop is worse, and we see just tens of stats/second (or fewer!). This is due to the fact that filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent directory of the object on the OST. Unlink or precreate operations whose critical sections (protected by this lock) take a long time to complete will slow down stat requests. I'm working on tracking down the cause of this; it may be journal related. BZ 22107 is probably relevant as well.

> or is there a Lustre or VM setting to limit overall OSS cache size?

No, but I think that would be really useful in this situation.

> I presume that Lustre's OSS caches are subject to normal Linux VM
> pagecache tweakables, but I don't think such a knob exists in Linux at
> the moment...

Correct on both counts.
A patch was proposed to do this, but I don't see any evidence of it making it into the kernel:
  http://lwn.net/Articles/218890/

I have a small set of perl, bash, and SystemTap scripts to read the inode and directory blocks from disk and monitor the performance of the relevant Lustre calls on the servers. I'll clean them up and send them to the list next week. A more elegant solution would be to get e2scan to do the job, but I haven't taken a hack at that yet.

Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 inode blocks/group), ~36% have at least one inode used. We pre-read those and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an average of 3891 directory blocks per OST.

In the absence of controls on the size of the page cache, or enough RAM to cache all of the inode and directory blocks in memory, another potential solution is to place the metadata on an SSD. One can generate a dm linear target table that carves up an ext3/ext4 filesystem such that the inode blocks go on one device and the data blocks go on another. Ideally the inode blocks would be placed on an SSD.

I've tried this with both ext3, and with ext4 using flex_bg to reduce the size of the dm table. IIRC the overhead is acceptable in both cases - 1us, on average.

Placing the inodes on separate storage is not sufficient, though. Slow directory block reads contribute to poor stat performance as well. Adding a feature to ext4 to reserve a number of fixed block groups for directory blocks, and always allocating them there, would help. Those block groups could then be placed on an SSD as well.

Even with the inode and directory blocks on fast storage, stat performance will still suffer when other operations that require a lock on the object's parent directory are going slow.

I've left out a few details and actual performance numbers from our production systems. I'll do a more detailed writeup after I take care of some other things at work, and finish recovering from 13.5 timezones worth of jet lag :-)

Jason
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
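Jason's perl/bash/SystemTap scripts are not included in the thread. Purely as an illustration of the pre-read idea, simplified to read every inode-table block rather than only the groups containing used inodes, something along these lines could warm the buffer cache for one ldiskfs OST (the device path is a placeholder):

  DEV=/dev/sdX    # example OST block device
  BS=$(dumpe2fs -h "$DEV" 2>/dev/null | awk '/Block size/ {print $3}')
  # walk the block groups and read each group's inode table into the page cache
  dumpe2fs "$DEV" 2>/dev/null | awk '/Inode table at/ {print $4}' |
  while read -r range; do
      start=${range%-*}; end=${range#*-}
      dd if="$DEV" of=/dev/null bs="$BS" skip="$start" count=$((end - start + 1)) 2>/dev/null
  done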
Andreas Dilger
2011-Jan-28 18:04 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of
> stats/second (or fewer!). This is due to the fact that
> filter_{fid2dentry,precreate,destroy} all need to take a lock on the
> parent directory of the object on the OST. Unlink or precreate
> operations whose critical sections (protected by this lock) take a long
> time to complete will slow down stat requests. I'm working on tracking
> down the cause of this; it may be journal related. BZ 22107 is probably
> relevant as well.

There is work underway to allow the locking of the ldiskfs directories to be multi-threaded. This should significantly improve performance in such cases.

> Our largest filesystem, in terms of inodes, has about 1.8M inodes per
> OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800
> block groups * 8 inode blocks/group), ~36% have at least one inode used.
> We pre-read those and ignore the empty inode blocks. Looking at the OSTs
> on one OSS, we have an average of 3891 directory blocks per OST.
>
> In the absence of controls on the size of the page cache, or enough RAM
> to cache all of the inode and directory blocks in memory, another
> potential solution is to place the metadata on an SSD. One can generate
> a dm linear target table that carves up an ext3/ext4 filesystem such
> that the inode blocks go on one device and the data blocks go on
> another. Ideally the inode blocks would be placed on an SSD.
>
> I've tried this with both ext3, and with ext4 using flex_bg to reduce
> the size of the dm table. IIRC the overhead is acceptable in both cases
> - 1us, on average.

I'd be quite interested to see the results of such testing.

> Placing the inodes on separate storage is not sufficient, though. Slow
> directory block reads contribute to poor stat performance as well.
> Adding a feature to ext4 to reserve a number of fixed block groups for
> directory blocks, and always allocating them there, would help. Those
> block groups could then be placed on an SSD as well.

I believe there is a heuristic that allocates directory blocks in the first group of a flex_bg, so if that entire group is on SSD it would potentially avoid this problem.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
Jason Rappleye
2011-Feb-01 18:38 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Jan 28, 2011, at 10:04 AM, Andreas Dilger wrote:
>> In the absence of controls on the size of the page cache, or enough RAM
>> to cache all of the inode and directory blocks in memory, another
>> potential solution is to place the metadata on an SSD. One can generate
>> a dm linear target table that carves up an ext3/ext4 filesystem such
>> that the inode blocks go on one device and the data blocks go on
>> another. Ideally the inode blocks would be placed on an SSD.
>>
>> I've tried this with both ext3, and with ext4 using flex_bg to reduce
>> the size of the dm table. IIRC the overhead is acceptable in both cases
>> - 1us, on average.
>
> I'd be quite interested to see the results of such testing.

I'm waiting for more hardware to show up so I can restart my testing. Hope to have some results to share in another 3-4 weeks.

>> Placing the inodes on separate storage is not sufficient, though. Slow
>> directory block reads contribute to poor stat performance as well.
>> Adding a feature to ext4 to reserve a number of fixed block groups for
>> directory blocks, and always allocating them there, would help. Those
>> block groups could then be placed on an SSD as well.
>
> I believe there is a heuristic that allocates directory blocks in the
> first group of a flex_bg, so if that entire group is on SSD it would
> potentially avoid this problem.

There is, though I haven't tested it yet. However, you'd need to have a relatively small number of flex_bgs for this to be cost-effective. I heard through the grapevine that you suggest not using "too few" flex_bgs on an ext4 filesystem. Can you elaborate on what might be a reasonable number, and why?

Thanks,
Jason
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
Andreas Dilger
2011-Feb-03 18:53 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-02-01, at 11:38, Jason Rappleye <jason.rappleye at nasa.gov> wrote:
> I heard through the grapevine that you suggest not using "too few"
> flex_bgs on an ext4 filesystem. Can you elaborate on what might be a
> reasonable number, and why?

My gut feeling is that a flex_bg factor of 256 may give the best tradeoff of performance and configurability for a hybrid storage device. That will allow bitmaps and itables to be multiples of 1MB in size, and with careful tuning they can also be aligned on 1MB boundaries.

It means that 1/256th of the filesystem would be allocated on SSD storage (4GB per TB), which I think is totally reasonable (64GB for a 16TB filesystem), while at the same time avoiding too complex an LV layout (512 SSD regions for a 16TB filesystem).

In the end we need to do the testing to know what the best tradeoff is.

Cheers, Andreas
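For concreteness, the flex_bg factor is chosen at format time; a sketch using mke2fs's standard -G (flex-group-size) option, with a placeholder device:

  mke2fs -t ext4 -O flex_bg -G 256 /dev/sdX
  # on a Lustre target the same option can be passed through, e.g.
  #   mkfs.lustre --ost --mkfsoptions="-G 256" <other options> /dev/sdX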
Robin Humble
2011-Feb-09 15:11 UTC
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
<rejoining this topic after a couple of weeks of experimentation>

Re: trying to improve metadata performance ->

we've been running with vfs_cache_pressure=0 on OSS's in production for over a week now and it's improved our metadata performance by a large factor.

- filesystem scans that didn't finish in ~30hrs now complete in a little over 3 hours. so >~10x speedup.

- a recursive ls -altrR of my home dir (on a random uncached client) now runs at 2000 to 4000 files/s whereas before it could be <100 files/s. so 20 to 40x speedup.

of course vfs_cache_pressure=0 can be a DANGEROUS setting because inodes/dentries will never be reclaimed, so OSS's could OOM.

however slabtop shows inodes are 0.89K and dentries 0.21K, ie. small, so I expect many sites can (like us) easily cache everything. for a given number of inodes per OST it's easily calculable whether there's enough OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab.

continued monitoring of the fs inode growth (== OSS slab size) over time is very important as fs's will inevitably accrue more files...

sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful at keeping stat rates high. sustained OSS cache memory pressure through the day dropped enough inodes that nightly scans weren't fast any more.

our current residual issue with vfs_cache_pressure=0 is unexpected: the number of OSS dentries appears to slowly grow over time :-/ it appears that some/many dentries for deleted files are not reclaimed without some memory pressure. any idea why that might be?

anyway, I've now added a few lines of code to create a different (non-zero) vfs_cache_pressure knob for dentries. we'll see how that goes...
an alternate (simpler) workaround would be to occasionally drop OSS inode/dentry caches, or to set vfs_cache_pressure=100 once in a while, and just live with a day of slow stats while the inode caches repopulate.

hopefully vfs_cache_pressure=0 also has a net small positive impact on regular i/o due to reduced iops to OSTs, but I haven't tried to measure that.
slab didn't steal much ram from our read and write_through caches (we have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the additional cached inodes/dentries) so OSS file caching should be almost unaffected.

On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:
> On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
>> limiting the total amount of OSS cache used in order to leave room for
>> inodes/dentries might be more useful. the data cache will always fill
>> up and push out inodes otherwise.

I disagree with myself now. I think mm/vmscan.c would probably still call shrink_slab, so shrinkers would get called and some cached inodes would get dropped.
> The inode and dentry objects in the slab cache aren't so much of an
> issue as having the disk blocks that each is generated from available in
> the buffer cache. Constructing the in-memory inode and dentry objects is
> cheap as long as the corresponding disk blocks are available. Doing the
> disk reads, depending on your hardware and some other factors, is not.

on a test cluster (with read and write_through caches still active and synthetic i/o load) I didn't see a big change in stat rate from dropping OSS page/buffer cache - at most a slowdown for a client 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is because there is almost zero persistent buffer cache due to the OSS buffer and page caches being punished by file i/o.

in the same testing, dropping OSS inode/dentry caches was a much larger effect (up to 60x slowdown with synthetic i/o) - which is why the vfs_cache_pressure setting works.
the synthetic i/o wasn't crazily intensive, but did have a working set >> OSS mem, which is likely true of our production machine.

however for your setup with OSS caches off, and from doing tests on our MDS, I agree that buffer caches can be a big effect.

dropping our MDS buffer cache slows down a client 'lfs find' by ~4x, but dropping inode/dentry caches doesn't slow it down at all, so buffers are definitely important there.
happily we're not under any memory pressure on our MDS's at the moment.

> We went to the extreme and disabled the OSS read cache (+ writethrough
> cache). In addition, on the OSSes we pre-read all of the inode blocks
> that contain at least one used inode, along with all of the directory
> blocks.
>
> The results have been promising so far. Firing off a du on an entire
> filesystem, 3000-6000 stats/second is typical. I've noted a few causes
> of slowdowns so far; there may be more.

we see about 2k files/s on the nightly sweeps now. that's with one lfs find running and piping to parallel stats. I think we can do better with more parallelism in the finds, but 2k is so much better than what it used to be that we're fairly happy for now.

2k isn't as good as your stat rates, but we still have OSS caches on, so the rest of our i/o should be benefiting from that.

> When memory runs low on a client, kswapd kicks in to try and free up
> pages. On the client I'm currently testing on, almost all of the memory
> used is in the slab. It looks like kswapd has a difficult time clearing
> things up, and the client can go several seconds before the current stat
> call is completed. Dropping caches will (temporarily) get the
> performance back to expected rates. I haven't dug into this one too much
> yet.

the last para of my prev email might help you. we found client slab is hard to reclaim without limiting ldlm locks. I haven't noticed a performance change from limiting ldlm lock counts.

cheers,
robin
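As a rough worked example of Robin's "easily calculable" point, using his ~0.89K/inode and ~0.21K/dentry slab figures (about 1.1KB per cached object) and a placeholder client mount point; the total is summed across all OSTs, so divide by the number of OSSs for a per-server estimate:

  lfs df -i /mnt/lfs0 | awk '/OST/ {used += $3}
      END {printf "%d used objects -> ~%.1f GB of slab across all OSSs\n", used, used*1126/2^30}'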
Nathan Dauchy
2011-Apr-13 20:33 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Revisiting this thread after 2 months...

On 01/12/2011 05:45 PM, Robin Humble wrote:
> On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
>> I am looking for more information regarding the "size on MDS" feature as
>> it exists for lustre-1.8.x. Testing on our system (which started out as
>> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
>> have the size information stored on the MDT. So, my basic question:
>> under what conditions will the "size hint" attribute be updated? Is
>> there any way to force the MDT to query the OSTs and update its
>> information?
>
> atime (and the MDT size hint) wasn't being updated for most of the 1.8
> series due to this bug:
> https://bugzilla.lustre.org/show_bug.cgi?id=23766
> the atime fix is now in 1.8.5, but I'm not sure if anyone has verified
> whether or not the MDT size hint is now behaving as originally intended.

I wanted to report back that I applied the patch in that bug and it appears as though the MDT size hint is working. This was tested with lustre-1.8.4.ddn2.2 and linux-2.6.18-194.32.1 on the MDS.

> BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT
> size hints might be to read 4k from every file in the system. that
> should update atime and the size hint. please let us know if this works.

Yes, though performing "ls -l" on the file from the client was not sufficient. The size hint was updated on the MDS after reading 1k from the file on the client. Running 'du --apparent-size' now reports the exact same value on the MDT snapshot as on the client.

-Nathan
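A compact version of that verification, reusing the example paths from earlier in the thread:

  # on a client: read a little of the file, then check the apparent size
  dd if=/mnt/lfs0/test/rollover/user_acct_file of=/dev/null bs=1k count=1
  du --apparent-size /mnt/lfs0/test/rollover/user_acct_file
  # on the MDS, against a fresh ldiskfs snapshot: should now report the same value
  du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file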
Andreas Dilger
2011-May-03 20:03 UTC
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
Just to follow up on this issue: we landed a patch for 2.1 that will reduce the default OST cache to objects 8MB or smaller. This can still be tuned via /proc, but is likely to provide better all-around performance by avoiding cache flushes for streaming read and write operations.

Robin, it would be great to know if tuning this would also solve your cache pressure woes without having to resort to disabling the VM cache pressure (which isn't something we can do by default for all users).

Cheers, Andreas

On 2011-02-09, at 8:11 AM, Robin Humble <robin.humble+lustre at anu.edu.au> wrote:
> <rejoining this topic after a couple of weeks of experimentation>
> [...]