Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
On Dec 22, 2005 17:16 -0500, Brent A Nelson wrote:
> "ls -lR >/dev/null" (even with just a single client and no other activity)
> and a "cp -a /usr /lustre1/test1" (~3.5 minutes for a <350MB /usr) both
> perform more slowly than to an older Linux box running NFS
> I tried increasing the lru_size on everything, but that didn't seem to
> have any effect at all in this scenario (maybe it only matters when there
> are many more clients).

The single-client "ls -lR" case is one of the worst usage cases for Lustre.
We have plans to fix this, but haven't done so yet. If you are doing
repeat "ls -lR" on the same directory (and working set fits into LRU)
then the performance is greatly improved and you are still guaranteed
coherency, unlike NFS. That is why we recommend increased LRU sizes for
nodes that are being used interactively. HPC compute nodes rarely have a
large working set.

Similarly, if you have a large number of clients doing such operations,
the aggregate performance will be higher than that of NFS. In some usage
scenarios adding more clients improves Lustre performance instead of
hurting it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
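As a concrete starting point, here is a minimal sketch of how the client-side
lock LRUs might be inspected before raising them. The lru_size file is the one
discussed in this thread; the lock_count file is an assumption about the 1.4.x
/proc layout and may not exist on every version:

# List every MDC/OSC lock namespace on this client, with its configured
# LRU size and (where available) the current number of cached locks.
for ns in /proc/fs/lustre/ldlm/namespaces/*; do
    echo "namespace: $(basename $ns)"
    echo "  lru_size:   $(cat $ns/lru_size)"
    [ -f $ns/lock_count ] && echo "  lock_count: $(cat $ns/lock_count)"
done

If lru_size sits at the default of 100 while the working set is tens of
thousands of files, raising it (as shown later in the thread) is the first
thing to try.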
Brent A Nelson
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
Once again, thanks for the reply!

On Thu, 22 Dec 2005, Andreas Dilger wrote:
> On Dec 22, 2005 17:16 -0500, Brent A Nelson wrote:
>> "ls -lR >/dev/null" (even with just a single client and no other activity)
>> and a "cp -a /usr /lustre1/test1" (~3.5 minutes for a <350MB /usr) both
>> perform more slowly than to an older Linux box running NFS
>> I tried increasing the lru_size on everything, but that didn't seem to
>> have any effect at all in this scenario (maybe it only matters when there
>> are many more clients).
>
> The single-client "ls -lR" case is one of the worst usage cases for Lustre.
> We have plans to fix this, but haven't done so yet. If you are doing
> repeat "ls -lR" on the same directory (and working set fits into LRU)
> then the performance is greatly improved and you are still guaranteed
> coherency, unlike NFS. That is why we recommend increased LRU sizes for
> nodes that are being used interactively. HPC compute nodes rarely have a
> large working set.
>

Does this apply for the single-client copy case, as well? I imagine for a
/usr directory, with lots of small files, there would be a lot of metadata
activity.

I haven't noticed any performance improvement at all with the lru_size
change. Perhaps the directory metadata is small enough to fit in the
default lru_size. All I have in my test case is 2 ~350MB copies of /usr
and a single 4GB file (this totals to about 37000 files and 3000
directories).

Also, I don't notice any speed change between successive, identical ls
timings on a client. If I start up a new client and do the ls test twice
in a row I get almost identical timings (even when I bump up the
lru_size).

Is the caching on the metadata server, on the client, or both? If on the
server, perhaps the metadata has stayed in cache since the initial copies
and I've always been experiencing the cached performance. But I see no
evidence of a client-side caching effect on the ls tests.

Also, during the ls tests, I noticed there was some activity on the OSTs;
why is that? I would have thought this was purely a metadata operation.

Assuming nothing's wrong (and it may be a while before someone improves
the behavior of the code in this situation), what would I do to improve
performance in this case? What hardware improvement would be of most
benefit with the present code, faster processors, faster disks, lower
latency disks, lower latency connection between nodes, more OSS nodes,...?

Thanks,

Brent
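A simple way to test the caching question directly is to time the listing
twice in a row from the same client; only the second run can benefit from the
client lock LRU. This assumes the /lustre1 mount point used earlier in the
thread:

# First run is cold; the second shows whatever the client LRU buys you.
time ls -lR /lustre1 > /dev/null
time ls -lR /lustre1 > /dev/null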
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
On Dec 23, 2005 01:33 -0500, Brent A Nelson wrote:
> I haven't noticed any performance improvement at all with the lru_size
> change. Perhaps the directory metadata is small enough to fit in the
> default lru_size. All I have in my test case is 2 ~350MB copies of /usr
> and a single 4GB file (this totals to about 37000 files and 3000
> directories).

If your 40k files are more than the LRU size, then it is expected that the
performance would not be improved. The default LRU size is 100, and in
newer versions 100 * num_cpus. Also, if you are increasing the LRU size,
both the MDC and OSCs need to have larger LRUs. The OSC LRU size can be
$default_stripe_count/$num_osts of the MDC LRU, since you will be able to
cache more object locks for the larger number of OSTs.

> Is the caching on the metadata server, on the client, or both? If on the
> server, perhaps the metadata has stayed in cache since the initial copies
> and I've always been experiencing the cached performance. But I see no
> evidence of a client-side caching effect on the ls tests.

There is cache on both sides. The DLM locks (kept in the aforementioned
LRU) are client-side locks. The server also has the Linux filesystem
caches (dcache, icache, pagecache) to cache data on the server side.

> Also, during the ls tests, I noticed there was some activity on the OSTs;
> why is that? I would have thought this was purely a metadata operation.

The file size is stored on the OST, so this is normal.

> Assuming nothing's wrong (and it may be a while before someone improves
> the behavior of the code in this situation), what would I do to improve
> performance in this case? What hardware improvement would be of most
> benefit with the present code, faster processors, faster disks, lower
> latency disks, lower latency connection between nodes, more OSS nodes,...?

The biggest improvement in "ls" performance would likely come from a larger
MDS RAM size (and a 64-bit CPU to be able to use it effectively), so that
the MDS can cache this information in memory. Recent tests showed a 10x
performance improvement when the inode information was in cache on the MDS
vs. reading it from the disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
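As a rough check of whether the MDS is serving this information from memory,
the standard Linux interfaces can be watched on the MDS while a client runs
the ls test; the slab cache names below are assumptions and vary with the
kernel and backing filesystem:

# Run on the MDS while a client does "ls -lR".  Growing cache and slab
# figures between runs suggest the inode data is staying in RAM; if they
# stay small, the MDS is re-reading inodes from disk each time.
grep -E 'MemTotal|MemFree|^Cached' /proc/meminfo
grep -E 'inode_cache|dentry' /proc/slabinfo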
Hello!

On Fri, Dec 23, 2005 at 01:33:02AM -0500, Brent A Nelson wrote:
> > The single-client "ls -lR" case is one of the worst usage cases for Lustre.
> > We have plans to fix this, but haven't done so yet. If you are doing
> > repeat "ls -lR" on the same directory (and working set fits into LRU)
> > then the performance is greatly improved and you are still guaranteed
> > coherency, unlike NFS. That is why we recommend increased LRU sizes for
> > nodes that are being used interactively. HPC compute nodes rarely have a
> > large working set.
> Does this apply for the single-client copy case, as well? I imagine for a

Yes.

> /usr directory, with lots of small files, there would be a lot of metadata
> activity.

Right.

> I haven't noticed any performance improvement at all with the lru_size
> change. Perhaps the directory metadata is small enough to fit in the
> default lru_size. All I have in my test case is 2 ~350MB copies of /usr

It well might be so. For metadata it is unimportant how much space the
data takes.

> and a single 4GB file (this totals to about 37000 files and 3000
> directories).

So around 40000 entries. This means you need to set your lock LRU to at
least 40000 so that the entire "ls -lR" fits into it.

> Also, I don't notice any speed change between successive, identical ls
> timings on a client. If I start up a new client and do the ls test twice
> in a row I get almost identical timings (even when I bump up the
> lru_size).

How big was your increased LRU size?

> Is the caching on the metadata server, on the client, or both? If on the

Caching is on the client only.

> Also, during the ls tests, I noticed there was some activity on the OSTs;

That's right.

> why is that? I would have thought this was purely a metadata operation.

Part of the metadata is stored on the OSTs. This is the current file size
and mtime, for now. So you need to increase the LRU size for both the MDC
and all OSCs. You can do this with this sequence of commands (on every
client where you want this change to be present):

NEWLRUSIZE=41000   # set lru size to 41k

# set the MDC LRU
echo $NEWLRUSIZE > /proc/fs/lustre/ldlm/namespaces/MDC*/lru_size

# set all OSC LRUs
for d in /proc/fs/lustre/ldlm/namespaces/OSC*/lru_size; do
    echo $NEWLRUSIZE > $d
done

> Assuming nothing's wrong (and it may be a while before someone improves
> the behavior of the code in this situation), what would I do to improve
> performance in this case? What hardware improvement would be of most

An LRU size into which everything fits should help for the second-run ls.

> benefit with the present code, faster processors, faster disks, lower
> latency disks, lower latency connection between nodes, more OSS nodes,...?

A lower latency connection between nodes is what you want for fast metadata
operations, essentially. The more OSS nodes your files are striped over,
the slower stat(2) is, because every such node has to be queried for the
mtime and size of the stripe it might hold. (Note that if you have many OSS
nodes, but all files are striped over only one OST, then stat(2)
performance would be the same as if you had a single OST, with no other
load, of course.)

Bye,
    Oleg
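Related to the striping point above, the stripe layout of the test files can
be checked, and new files under a directory kept on a single OST, with the
lfs tool. The option-style syntax shown here is an assumption taken from
later lfs releases; 1.4.x uses positional arguments, so check "lfs help
setstripe" on the installed version. /lustre1/test1 is the path used earlier
in the thread:

# Show how an existing file or directory tree is striped across OSTs.
lfs getstripe /lustre1/test1

# Make new files created under this directory use a single stripe, so a
# stat(2) only has to query one OST for size/mtime.  (Older lfs versions
# take positional arguments instead of -c.)
lfs setstripe -c 1 /lustre1/test1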
Brent A Nelson
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
Well, thanks to several people on the list, I got Lustre 1.4.5 running on
my test setup (with Ubuntu Breezy, no less!), and it seems stable (no
problems so far that I didn't cause myself ;-)).

However, I've noticed that some things are performing rather subpar in
some limited testing with a single client. Large sequential reads and
writes seem quick (although perhaps not nearly as quick as would be
theoretically possible with this setup; it might make up for that when
multiple clients are running), but "ls -lR >/dev/null" (even with just a
single client and no other activity) and a "cp -a /usr /lustre1/test1"
(~3.5 minutes for a <350MB /usr) both perform more slowly than to an older
Linux box running NFS on fast ethernet (my Lustre servers have
channel-bonded gigabit, and dual PIII 1GHz processors rather than the NFS
server's 450MHz processors).

I tried increasing the lru_size on everything, but that didn't seem to
have any effect at all in this scenario (maybe it only matters when there
are many more clients). I also added mballoc and extents to the mount
options for the OSTs (small effect, if any). Setting the debug level to
zero helped significantly, but it's still much slower than NFS. The cp
takes maybe 50% longer than NFS and the ls takes about 300% longer.

The numbers are fairly similar whether I have 2 OSS servers serving 3 drbd
mirrors each, or the same servers just serving out a logical volume from
the system drive on each (although the more complex scenario is actually a
little faster for these tests, but still much slower than NFS). I
originally had the MDS on one of the OSS servers and tried moving it to a
third server, but the speed stayed the same.

Any ideas? Many thanks! I know I've been rather scant on details; just let
me know and I'll provide whatever info you need.

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida
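For reference, turning off debug message generation (the change mentioned
above as helping significantly) is usually just a /proc write on each node;
whether the control file lives under portals or lnet depends on the Lustre
release, so both are tried here as an assumption:

# Disable Lustre debug logging on this node.  Only one of these control
# files will exist, depending on whether the stack uses portals or LNET.
[ -w /proc/sys/portals/debug ] && echo 0 > /proc/sys/portals/debug
[ -w /proc/sys/lnet/debug ]    && echo 0 > /proc/sys/lnet/debug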