Nathan.Dauchy at noaa.gov
2011-Jan-07 00:42 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Greetings,

I am looking for more information regarding the "size on MDS" feature as it exists for lustre-1.8.x. Testing on our system (which started out as 1.6.6 and is now 1.8.x) indicates that there are many files which do not have the size information stored on the MDT. So, my basic question: under what conditions will the "size hint" attribute be updated? Is there any way to force the MDT to query the OSTs and update its information?

More details and background information...

We are currently running:
  Client: CentOS-5.5, kernel-2.6.18-194.11.4.el5, lustre-1.8.4
  Server: CentOS-5.3, kernel-2.6.18-164.15.1.el5, lustre-1.8.3.ddn4.1
with no immediate plans to upgrade to lustre-2.x.

I found the high level design here:
  http://wiki.lustre.org/images/9/94/Size_on_mds-hld.pdf
but that seems to be targeted at 2.0.x for full implementation.

I also discovered this old thread on the topic:
  http://lists.lustre.org/pipermail/lustre-discuss/2009-July/011149.html
in which Andreas wrote "In Lustre 1.6.7 the approximate file size started to be stored on the MDT inodes..." and "This size is not actively updated for pre-existing files...". (It seems as though this is still true for 1.8.x.)

The end goal of this is to facilitate efficient checks of disk usage on a per-directory basis (essentially we want "volume based quotas"). I'm hoping to run something once a day on the MDS like the following:

  lvcreate -s -p r -n mdt_snap /dev/mdt
  mount -t ldiskfs -o ro /dev/mdt_snap /mnt/snap
  cd /mnt/snap/ROOT
  du --apparent-size ./* > volume_usage.log
  cd /
  umount /mnt/snap
  lvremove /dev/mdt_snap

Since the data is going to be up to one day old anyway, I don't really mind that the file size is "approximate", but it does have to be reasonably close.

With the MDT LVM snapshot method I can check the whole 300TB file system in about 3 hours, whereas checking from a client takes weeks.

Here is why I am relatively certain that the size-on-MDS attributes are not updated (lightly edited):

  [root at mds0 ~]# ls -l /mnt/snap/ROOT/test/rollover/user_acct_file
  -rw-r--r-- 1 9999 9000 0 Mar 23 2010 /mnt/snap/ROOT/test/rollover/user_acct_file
  [root at mds0 ~]# du /mnt/snap/ROOT/test/rollover/user_acct_file
  0       /mnt/snap/ROOT/test/rollover/user_acct_file
  [root at mds0 ~]# du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file
  0       /mnt/snap/ROOT/test/rollover/user_acct_file

  [root at c448 ~]# ls -l /mnt/lfs0/test/rollover/user_acct_file
  -rw-r--r-- 1 user group 184435207 Mar 23 2010 /mnt/lfs0/test/rollover/user_acct_file
  [root at c448 ~]# du /mnt/lfs0/test/rollover/user_acct_file
  180120  /mnt/lfs0/test/rollover/user_acct_file
  [root at c448 ~]# du --apparent-size /mnt/lfs0/test/rollover/user_acct_file
  180113  /mnt/lfs0/test/rollover/user_acct_file

Thanks very much for any answers or suggestions you can provide!

-Nathan
Robin Humble
2011-Jan-13 00:45 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Hi Nathan,

On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
> I am looking for more information regarding the "size on MDS" feature as
> it exists for lustre-1.8.x. Testing on our system (which started out as
> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
> have the size information stored on the MDT. So, my basic question:
> under what conditions will the "size hint" attribute be updated? Is
> there any way to force the MDT to query the OSTs and update its
> information?

atime (and the MDT size hint) wasn't being updated for most of the 1.8 series due to this bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=23766
the atime fix is now in 1.8.5, but I'm not sure if anyone has verified whether or not the MDT size hint is now behaving as originally intended.

actually, it was never clear to me what (if anything?) ever accessed OBD_MD_FLSIZE... does someone have a hacked 'lfs find' or similar tool?

your approach of mounting and searching an MDT snapshot should be possible, but it would seem neater just to have a tool on a client send the right RPCs to the MDS and get the information that way.

like you, we are finding that the timescales for our filesystem trawling scripts are getting out of hand, mostly (we think) due to retrieving size information from very busy OSTs. a tool that only hit the MDT and found (filename, uid, gid, approx size) should help a lot. so +1 on this topic.

BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT size hints might be to read 4k from every file in the system. that should update atime and the size hint. please let us know if this works.

> The end goal of this is to facilitate efficient checks of disk usage on
> a per-directory basis (essentially we want "volume based quotas").

a possible approach for your situation would be to chgrp every file under a directory to the same gid, and then enable (un-enforcing) group quotas on your filesystem. then you wouldn't have to search any directories. you would still have to find and chgrp some files nightly, but 'lfs find' should make that relatively quick.

unfortunately we also need a breakdown of the uid information in each directory, so this approach isn't sufficient for us.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
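A minimal sketch of that read-4k hack, assuming GNU find and a client mount at /mnt/lfs0 (the mount point is just an example):

  # read the first 4k of every regular file from a client; each read should
  # prompt the client to send updated atime (and, if 1.8.5 behaves as hoped,
  # size-hint) attributes back to the MDS
  find /mnt/lfs0 -type f -print0 |
  while IFS= read -r -d '' f; do
      dd if="$f" of=/dev/null bs=4k count=1 2>/dev/null
  done

Expect this to take a long time on a large filesystem, since it touches every object.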
Nathan Dauchy
2011-Jan-13 18:38 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 01/12/2011 05:45 PM, Robin Humble wrote:
> On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
>> I am looking for more information regarding the "size on MDS" feature as
>> it exists for lustre-1.8.x. Testing on our system (which started out as
>> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
>> have the size information stored on the MDT. So, my basic question:
>> under what conditions will the "size hint" attribute be updated? Is
>> there any way to force the MDT to query the OSTs and update its
>> information?
>
> atime (and the MDT size hint) wasn't being updated for most of the 1.8
> series due to this bug:
> https://bugzilla.lustre.org/show_bug.cgi?id=23766
> the atime fix is now in 1.8.5, but I'm not sure if anyone has verified
> whether or not the MDT size hint is now behaving as originally intended.

Thanks for the pointer, Robin!

> your approach of mounting and searching an MDT snapshot should be
> possible, but it would seem neater just to have a tool on a client send
> the right RPCs to the MDS and get the information that way.

It would be great to have both options available. I was assuming the MDT snapshot would be easier, and potentially faster than waiting for network transactions too.

> like you, we are finding that the timescales for our filesystem
> trawling scripts are getting out of hand, mostly (we think) due to
> retrieving size information from very busy OSTs. a tool that only hit
> the MDT and found (filename, uid, gid, approx size) should help a lot.
> so +1 on this topic.

On a somewhat related note... we have recently discovered that the object caching added in 1.8 consumes all the memory on the OSS nodes, leaving insufficient block device cache for the inodes. This was making 'ls -l' and 'du' run 10-20x longer than when we were running lustre-1.6.7. If you are running 1.8 and want to try turning off the object caching, these are the settings you should look at:

  lctl conf_param fsname-OST00XX.ost.read_cache_enable=0
  lctl conf_param fsname-OST00XX.ost.writethrough_cache_enable=0

> BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT
> size hints might be to read 4k from every file in the system. that
> should update atime and the size hint. please let us know if this works.

Unfortunately, we aren't in a position to upgrade to 1.8.5 any time soon. If anyone can test this and see if it is possible to update the size hint AFTER initial object creation, it would be very much appreciated!

Regards,
Nathan
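For a temporary experiment (conf_param makes the change persistent via the MGS), the same caches can also be toggled at runtime on each OSS; the parameter names below are the 1.8 obdfilter proc entries, so verify them on your release before relying on this:

  lctl set_param obdfilter.*.read_cache_enable=0
  lctl set_param obdfilter.*.writethrough_cache_enable=0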
Andreas Dilger
2011-Jan-13 21:37 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-01-13, at 11:38, Nathan Dauchy wrote:
> On a somewhat related note... we have recently discovered that the
> object caching added in 1.8 consumes all the memory on the OSS nodes,
> leaving insufficient block device cache for the inodes. This was making
> 'ls -l' and 'du' run 10-20x longer than when we were running
> lustre-1.6.7. If you are running 1.8 and want to try turning off the
> object caching, these are the settings you should look at:
>
>   lctl conf_param fsname-OST00XX.ost.read_cache_enable=0
>   lctl conf_param fsname-OST00XX.ost.writethrough_cache_enable=0

It would probably be better to set:

  lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M

or similar, to limit the read cache to files 32MB in size or less (or whatever you consider "small" files at your site). That allows the read cache for config files and such, while not thrashing the cache while accessing large files.

We should probably change this to be the default, but at the time the read cache was introduced, we didn't know what should be considered a small vs. large file, and the amount of RAM and number of OSTs on an OSS, and the uses, vary so much that it is difficult to pick a single correct value.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
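For reference, the running value can be checked, and changed non-persistently, on an OSS through the obdfilter proc entries; treat the exact parameter names as a sketch to verify against your Lustre version:

  lctl get_param obdfilter.*.readcache_max_filesize
  lctl set_param obdfilter.*.readcache_max_filesize=32M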
Kit Westneat
2011-Jan-13 22:28 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
> It would probably be better to set:
>
>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>
> or similar, to limit the read cache to files 32MB in size or less (or
> whatever you consider "small" files at your site). That allows the read
> cache for config files and such, while not thrashing the cache while
> accessing large files.
>
> We should probably change this to be the default, but at the time the
> read cache was introduced, we didn't know what should be considered a
> small vs. large file, and the amount of RAM and number of OSTs on an
> OSS, and the uses, vary so much that it is difficult to pick a single
> correct value.

I was looking through the Linux vm settings and saw vfs_cache_pressure - has anyone tested performance with this parameter? Do you know if this would have any effect on file caching vs. ext4 metadata caching?

For us, Linux/Lustre would ideally push out data before the metadata, as the performance penalty for doing 4k reads on the s2a far outweighs any benefits of data caching.

Thanks,
Kit
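The knob Kit mentions is the standard Linux VM sysctl; for anyone wanting to experiment, values below 100 make the kernel reclaim inode/dentry caches less aggressively relative to the page cache (and 0 never reclaims them):

  sysctl vm.vfs_cache_pressure           # show the current value (default 100)
  sysctl -w vm.vfs_cache_pressure=50     # prefer keeping inodes/dentries cached
  # equivalent: echo 50 > /proc/sys/vm/vfs_cache_pressure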
Robin Humble
2011-Jan-28 07:34 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>> It would probably be better to set:
>>
>>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>
>> or similar, to limit the read cache to files 32MB in size or less (or
>> whatever you consider "small" files at your site). That allows the read
>> cache for config files and such, while not thrashing the cache while
>> accessing large files.
>>
>> We should probably change this to be the default, but at the time the
>> read cache was introduced, we didn't know what should be considered a
>> small vs. large file, and the amount of RAM and number of OSTs on an
>> OSS, and the uses, vary so much that it is difficult to pick a single
>> correct value.

limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise.

Nathan's approach of turning off the caches entirely is extreme, but if it gives us back some metadata performance then it might be worth it.

or is there a Lustre or VM setting to limit overall OSS cache size? I presume that Lustre's OSS caches are subject to normal Linux VM pagecache tweakables, but I don't think such a knob exists in Linux at the moment...

> I was looking through the Linux vm settings and saw vfs_cache_pressure -
> has anyone tested performance with this parameter? Do you know if this
> would have any effect on file caching vs. ext4 metadata caching?
>
> For us, Linux/Lustre would ideally push out data before the metadata, as
> the performance penalty for doing 4k reads on the s2a far outweighs any
> benefits of data caching.

good idea. if all inodes are always cached on OSS's then the fs should be far more responsive to stat loads... 4k/inode shouldn't use up too much of the OSS's ram (probably more like 1 or 2k/inode really).

anyway, following your idea, we tried vfs_cache_pressure=50 on our OSS's a week or so ago, but hit this within a couple of hours:
  https://bugzilla.lustre.org/show_bug.cgi?id=24401
it could have been a coincidence, I guess. did anyone else give it a try?

BTW, we recently had the opposite problem on a client that scans the filesystem - too many inodes were cached, leading to low memory problems on the client. we've had vfs_cache_pressure=150 set on that machine for the last month or so and it seems to help, although a more effective setting in this case was limiting ldlm locks, eg. from the Lustre manual:

  lctl set_param ldlm.namespaces.*osc*.lru_size=10000

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Jason Rappleye
2011-Jan-28 17:45 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
> On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>>> It would probably be better to set:
>>>
>>>   lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>>
>>> or similar, to limit the read cache to files 32MB in size or less (or
>>> whatever you consider "small" files at your site). That allows the read
>>> cache for config files and such, while not thrashing the cache while
>>> accessing large files.
>>>
>>> We should probably change this to be the default, but at the time the
>>> read cache was introduced, we didn't know what should be considered a
>>> small vs. large file, and the amount of RAM and number of OSTs on an
>>> OSS, and the uses, vary so much that it is difficult to pick a single
>>> correct value.
>
> limiting the total amount of OSS cache used in order to leave room for
> inodes/dentries might be more useful. the data cache will always fill
> up and push out inodes otherwise.

The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each is generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.

> Nathan's approach of turning off the caches entirely is extreme, but if
> it gives us back some metadata performance then it might be worth it.

We went to the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks.

The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more.

First, no attempt has been made to pre-read metadata from the MDT. The need to read in inode and directory blocks may slow things down quite a bit. I can't find the numbers in my notes at the moment, but I recall seeing 200-500 stats/second when the MDS needed to do I/O.

When memory runs low on a client, kswapd kicks in to try and free up pages. On the client I'm currently testing on, almost all of the memory used is in the slab. It looks like kswapd has a difficult time clearing things up, and the client can go several seconds before the current stat call is completed. Dropping caches will (temporarily) get the performance back to expected rates. I haven't dug into this one too much yet.

Sometimes the performance drop is worse, and we see just tens of stats/second (or fewer!). This is due to the fact that filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent directory of the object on the OST. Unlink or precreate operations whose critical sections (protected by this lock) take a long time to complete will slow down stat requests. I'm working on tracking down the cause of this; it may be journal related. BZ 22107 is probably relevant as well.

> or is there a Lustre or VM setting to limit overall OSS cache size?

No, but I think that would be really useful in this situation.

> I presume that Lustre's OSS caches are subject to normal Linux VM
> pagecache tweakables, but I don't think such a knob exists in Linux at
> the moment...

Correct on both counts.
A patch was proposed to do this, but I don't see any evidence of it making it into the kernel:
  http://lwn.net/Articles/218890/

I have a small set of perl, bash, and SystemTap scripts to read the inode and directory blocks from disk and monitor the performance of the relevant Lustre calls on the servers. I'll clean them up and send them to the list next week. A more elegant solution would be to get e2scan to do the job, but I haven't taken a hack at that yet.

Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 inode blocks/group), ~36% have at least one inode used. We pre-read those and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an average of 3891 directory blocks per OST.

In the absence of controls on the size of the page cache, or enough RAM to cache all of the inode and directory blocks in memory, another potential solution is to place the metadata on an SSD. One can generate a dm linear target table that carves up an ext3/ext4 filesystem such that the inode blocks go on one device and the data blocks go on another. Ideally the inode blocks would be placed on an SSD.

I've tried this with both ext3, and with ext4 using flex_bg to reduce the size of the dm table. IIRC the overhead is acceptable in both cases - 1us, on average.

Placing the inodes on separate storage is not sufficient, though. Slow directory block reads contribute to poor stat performance as well. Adding a feature to ext4 to reserve a number of fixed block groups for directory blocks, and always allocating them there, would help. Those block groups could then be placed on an SSD as well.

Even with the inode and directory blocks on fast storage, stat performance will still suffer when other operations that require a lock on the object's parent directory are going slow.

I've left out a few details and actual performance numbers from our production systems. I'll do a more detailed writeup after I take care of some other things at work, and finish recovering from 13.5 timezones worth of jet lag :-)

Jason
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
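Jason's perl/bash/SystemTap scripts are not included in the thread. Purely as an illustration of the pre-read idea, simplified to read every inode-table block rather than only the groups containing used inodes, something along these lines could warm the buffer cache for one ldiskfs OST (the device path is a placeholder):

  DEV=/dev/sdX    # example OST block device
  BS=$(dumpe2fs -h "$DEV" 2>/dev/null | awk '/Block size/ {print $3}')
  # walk the block groups and read each group's inode table into the page cache
  dumpe2fs "$DEV" 2>/dev/null | awk '/Inode table at/ {print $4}' |
  while read -r range; do
      start=${range%-*}; end=${range#*-}
      dd if="$DEV" of=/dev/null bs="$BS" skip="$start" count=$((end - start + 1)) 2>/dev/null
  done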
Andreas Dilger
2011-Jan-28 18:04 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of
> stats/second (or fewer!). This is due to the fact that
> filter_{fid2dentry,precreate,destroy} all need to take a lock on the
> parent directory of the object on the OST. Unlink or precreate
> operations whose critical sections (protected by this lock) take a long
> time to complete will slow down stat requests. I'm working on tracking
> down the cause of this; it may be journal related. BZ 22107 is probably
> relevant as well.

There is work underway to allow the locking of the ldiskfs directories to be multi-threaded. This should significantly improve performance in such cases.

> Our largest filesystem, in terms of inodes, has about 1.8M inodes per
> OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800
> block groups * 8 inode blocks/group), ~36% have at least one inode used.
> We pre-read those and ignore the empty inode blocks. Looking at the OSTs
> on one OSS, we have an average of 3891 directory blocks per OST.
>
> In the absence of controls on the size of the page cache, or enough RAM
> to cache all of the inode and directory blocks in memory, another
> potential solution is to place the metadata on an SSD. One can generate
> a dm linear target table that carves up an ext3/ext4 filesystem such
> that the inode blocks go on one device and the data blocks go on
> another. Ideally the inode blocks would be placed on an SSD.
>
> I've tried this with both ext3, and with ext4 using flex_bg to reduce
> the size of the dm table. IIRC the overhead is acceptable in both cases
> - 1us, on average.

I'd be quite interested to see the results of such testing.

> Placing the inodes on separate storage is not sufficient, though. Slow
> directory block reads contribute to poor stat performance as well.
> Adding a feature to ext4 to reserve a number of fixed block groups for
> directory blocks, and always allocating them there, would help. Those
> block groups could then be placed on an SSD as well.

I believe there is a heuristic that allocates directory blocks in the first group of a flex_bg, so if that entire group is on SSD it would potentially avoid this problem.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
Jason Rappleye
2011-Feb-01 18:38 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Jan 28, 2011, at 10:04 AM, Andreas Dilger wrote:
>> In the absence of controls on the size of the page cache, or enough RAM
>> to cache all of the inode and directory blocks in memory, another
>> potential solution is to place the metadata on an SSD. One can generate
>> a dm linear target table that carves up an ext3/ext4 filesystem such
>> that the inode blocks go on one device and the data blocks go on
>> another. Ideally the inode blocks would be placed on an SSD.
>>
>> I've tried this with both ext3, and with ext4 using flex_bg to reduce
>> the size of the dm table. IIRC the overhead is acceptable in both cases
>> - 1us, on average.
>
> I'd be quite interested to see the results of such testing.

I'm waiting for more hardware to show up so I can restart my testing. Hope to have some results to share in another 3-4 weeks.

>> Placing the inodes on separate storage is not sufficient, though. Slow
>> directory block reads contribute to poor stat performance as well.
>> Adding a feature to ext4 to reserve a number of fixed block groups for
>> directory blocks, and always allocating them there, would help. Those
>> block groups could then be placed on an SSD as well.
>
> I believe there is a heuristic that allocates directory blocks in the
> first group of a flex_bg, so if that entire group is on SSD it would
> potentially avoid this problem.

There is, though I haven't tested it yet. However, you'd need to have a relatively small number of flex_bgs for this to be cost-effective. I heard through the grapevine that you suggest not using "too few" flex_bgs on an ext4 filesystem. Can you elaborate on what might be a reasonable number, and why?

Thanks,
Jason
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
Andreas Dilger
2011-Feb-03 18:53 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On 2011-02-01, at 11:38, Jason Rappleye <jason.rappleye at nasa.gov> wrote:
> I heard through the grapevine that you suggest not using "too few"
> flex_bgs on an ext4 filesystem. Can you elaborate on what might be a
> reasonable number, and why?

My gut feeling is that a flex_bg factor of 256 may give the best tradeoff of performance and configurability for a hybrid storage device. That will allow bitmaps and itables to be multiples of 1MB in size, and with careful tuning they can also be aligned on 1MB boundaries.

It means that 1/256th of the filesystem would be allocated on SSD storage (4GB per TB), which I think is totally reasonable (64GB for a 16TB filesystem), while at the same time avoiding too complex an LV layout (512 SSD regions for a 16TB filesystem).

In the end we need to do the testing to know what the best tradeoff is.

Cheers, Andreas
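For concreteness, the flex_bg factor is chosen at format time; a sketch using mke2fs's standard -G (flex-group-size) option, with a placeholder device:

  mke2fs -t ext4 -O flex_bg -G 256 /dev/sdX
  # on a Lustre target the same option can be passed through, e.g.
  #   mkfs.lustre --ost --mkfsoptions="-G 256" <other options> /dev/sdX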
Robin Humble
2011-Feb-09 15:11 UTC
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
<rejoining this topic after a couple of weeks of experimentation>

Re: trying to improve metadata performance ->

we've been running with vfs_cache_pressure=0 on OSS's in production for over a week now and it's improved our metadata performance by a large factor.

- filesystem scans that didn't finish in ~30hrs now complete in a little over 3 hours. so >~10x speedup.

- a recursive ls -altrR of my home dir (on a random uncached client) now runs at 2000 to 4000 files/s whereas before it could be <100 files/s. so 20 to 40x speedup.

of course vfs_cache_pressure=0 can be a DANGEROUS setting because inodes/dentries will never be reclaimed, so OSS's could OOM.

however slabtop shows inodes are 0.89K and dentries 0.21K, ie. small, so I expect many sites can (like us) easily cache everything. for a given number of inodes per OST it's easily calculable whether there's enough OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab.

continued monitoring of the fs inode growth (== OSS slab size) over time is very important as fs's will inevitably accrue more files...

sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful at keeping stat rates high. sustained OSS cache memory pressure through the day dropped enough inodes that nightly scans weren't fast any more.

our current residual issue with vfs_cache_pressure=0 is unexpected: the number of OSS dentries appears to slowly grow over time :-/ it appears that some/many dentries for deleted files are not reclaimed without some memory pressure. any idea why that might be?

anyway, I've now added a few lines of code to create a different (non-zero) vfs_cache_pressure knob for dentries. we'll see how that goes...
an alternate (simpler) workaround would be to occasionally drop OSS inode/dentry caches, or to set vfs_cache_pressure=100 once in a while, and just live with a day of slow stats while the inode caches repopulate.

hopefully vfs_cache_pressure=0 also has a net small positive impact on regular i/o due to reduced iops to OSTs, but I haven't tried to measure that.
slab didn't steal much ram from our read and write_through caches (we have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the additional cached inodes/dentries) so OSS file caching should be almost unaffected.

On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:
> On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
>> limiting the total amount of OSS cache used in order to leave room for
>> inodes/dentries might be more useful. the data cache will always fill
>> up and push out inodes otherwise.

I disagree with myself now. I think mm/vmscan.c would probably still call shrink_slab, so shrinkers would get called and some cached inodes would get dropped.
> The inode and dentry objects in the slab cache aren't so much of an
> issue as having the disk blocks that each is generated from available in
> the buffer cache. Constructing the in-memory inode and dentry objects is
> cheap as long as the corresponding disk blocks are available. Doing the
> disk reads, depending on your hardware and some other factors, is not.

on a test cluster (with read and write_through caches still active and synthetic i/o load) I didn't see a big change in stat rate from dropping OSS page/buffer cache - at most a slowdown for a client 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is because there is almost zero persistent buffer cache due to the OSS buffer and page caches being punished by file i/o.

in the same testing, dropping OSS inode/dentry caches was a much larger effect (up to 60x slowdown with synthetic i/o) - which is why the vfs_cache_pressure setting works.
the synthetic i/o wasn't crazily intensive, but did have a working set >> OSS mem, which is likely true of our production machine.

however for your setup with OSS caches off, and from doing tests on our MDS, I agree that buffer caches can be a big effect.

dropping our MDS buffer cache slows down a client 'lfs find' by ~4x, but dropping inode/dentry caches doesn't slow it down at all, so buffers are definitely important there.
happily we're not under any memory pressure on our MDS's at the moment.

> We went to the extreme and disabled the OSS read cache (+ writethrough
> cache). In addition, on the OSSes we pre-read all of the inode blocks
> that contain at least one used inode, along with all of the directory
> blocks.
>
> The results have been promising so far. Firing off a du on an entire
> filesystem, 3000-6000 stats/second is typical. I've noted a few causes
> of slowdowns so far; there may be more.

we see about 2k files/s on the nightly sweeps now. that's with one lfs find running and piping to parallel stats. I think we can do better with more parallelism in the finds, but 2k is so much better than what it used to be that we're fairly happy for now.

2k isn't as good as your stat rates, but we still have OSS caches on, so the rest of our i/o should be benefiting from that.

> When memory runs low on a client, kswapd kicks in to try and free up
> pages. On the client I'm currently testing on, almost all of the memory
> used is in the slab. It looks like kswapd has a difficult time clearing
> things up, and the client can go several seconds before the current stat
> call is completed. Dropping caches will (temporarily) get the
> performance back to expected rates. I haven't dug into this one too much
> yet.

the last para of my prev email might help you. we found client slab is hard to reclaim without limiting ldlm locks. I haven't noticed a performance change from limiting ldlm lock counts.

cheers,
robin
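As a rough worked example of Robin's "easily calculable" point, using his ~0.89K/inode and ~0.21K/dentry slab figures (about 1.1KB per cached object) and a placeholder client mount point; the total is summed across all OSTs, so divide by the number of OSSs for a per-server estimate:

  lfs df -i /mnt/lfs0 | awk '/OST/ {used += $3}
      END {printf "%d used objects -> ~%.1f GB of slab across all OSSs\n", used, used*1126/2^30}'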
Nathan Dauchy
2011-Apr-13 20:33 UTC
[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
Revisiting this thread after 2 months...

On 01/12/2011 05:45 PM, Robin Humble wrote:
> On Thu, Jan 06, 2011 at 05:42:24PM -0700, Nathan.Dauchy at noaa.gov wrote:
>> I am looking for more information regarding the "size on MDS" feature as
>> it exists for lustre-1.8.x. Testing on our system (which started out as
>> 1.6.6 and is now 1.8.x) indicates that there are many files which do not
>> have the size information stored on the MDT. So, my basic question:
>> under what conditions will the "size hint" attribute be updated? Is
>> there any way to force the MDT to query the OSTs and update its
>> information?
>
> atime (and the MDT size hint) wasn't being updated for most of the 1.8
> series due to this bug:
> https://bugzilla.lustre.org/show_bug.cgi?id=23766
> the atime fix is now in 1.8.5, but I'm not sure if anyone has verified
> whether or not the MDT size hint is now behaving as originally intended.

I wanted to report back that I applied the patch in that bug and it appears as though the MDT size hint is working. This was tested with lustre-1.8.4.ddn2.2 and linux-2.6.18-194.32.1 on the MDS.

> BTW, once you have 1.8.5 on the MDS, a hack to populate the MDT
> size hints might be to read 4k from every file in the system. that
> should update atime and the size hint. please let us know if this works.

Yes, though performing "ls -l" on the file from the client was not sufficient. The size hint was updated on the MDS after reading 1k from the file on the client. Running 'du --apparent-size' now reports the exact same value on the MDT snapshot as on the client.

-Nathan
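A compact version of that verification, reusing the example paths from earlier in the thread:

  # on a client: read a little of the file, then check the apparent size
  dd if=/mnt/lfs0/test/rollover/user_acct_file of=/dev/null bs=1k count=1
  du --apparent-size /mnt/lfs0/test/rollover/user_acct_file
  # on the MDS, against a fresh ldiskfs snapshot: should now report the same value
  du --apparent-size /mnt/snap/ROOT/test/rollover/user_acct_file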
Andreas Dilger
2011-May-03 20:03 UTC
[Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
Just to follow up on this issue: we landed a patch for 2.1 that will reduce the default OST cache to objects 8MB or smaller. This can still be tuned via /proc, but is likely to provide better all-around performance by avoiding cache flushes for streaming read and write operations.

Robin, it would be great to know if tuning this would also solve your cache pressure woes without having to resort to disabling the VM cache pressure (which isn't something we can do by default for all users).

Cheers, Andreas

On 2011-02-09, at 8:11 AM, Robin Humble <robin.humble+lustre at anu.edu.au> wrote:
> <rejoining this topic after a couple of weeks of experimentation>
> [...]