Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
On Dec 22, 2005 17:16 -0500, Brent A Nelson wrote:
> "ls -lR >/dev/null" (even with just a single client and no other activity)
> and a "cp -a /usr /lustre1/test1" (~3.5 minutes for a <350MB /usr) both
> perform more slowly than to an older Linux box running NFS
> I tried increasing the lru_size on everything, but that didn't seem to
> have any effect at all in this scenario (maybe it only matters when there
> are many more clients).

The single-client "ls -lR" case is one of the worst usage cases for Lustre.
We have plans to fix this, but haven't done so yet. If you are doing
repeat "ls -lR" on the same directory (and working set fits into LRU)
then the performance is greatly improved and you are still guaranteed
coherency, unlike NFS. That is why we recommend increased LRU sizes for
nodes that are being used interactively. HPC compute nodes rarely have a
large working set.

Similarly, if you have a large number of clients doing such operations,
the aggregate performance will be higher than that of NFS. In some usage
scenarios adding more clients improves Lustre performance instead of
hurting it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
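As a concrete starting point, here is a minimal sketch of how the client-side
lock LRUs might be inspected before raising them. The lru_size file is the one
discussed in this thread; the lock_count file is an assumption about the 1.4.x
/proc layout and may not exist on every version:

# List every MDC/OSC lock namespace on this client, with its configured
# LRU size and (where available) the current number of cached locks.
for ns in /proc/fs/lustre/ldlm/namespaces/*; do
    echo "namespace: $(basename $ns)"
    echo "  lru_size:   $(cat $ns/lru_size)"
    [ -f $ns/lock_count ] && echo "  lock_count: $(cat $ns/lock_count)"
done

If lru_size sits at the default of 100 while the working set is tens of
thousands of files, raising it (as shown later in the thread) is the first
thing to try.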
Brent A Nelson
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
Once again, thanks for the reply!

On Thu, 22 Dec 2005, Andreas Dilger wrote:
> On Dec 22, 2005 17:16 -0500, Brent A Nelson wrote:
>> "ls -lR >/dev/null" (even with just a single client and no other activity)
>> and a "cp -a /usr /lustre1/test1" (~3.5 minutes for a <350MB /usr) both
>> perform more slowly than to an older Linux box running NFS
>> I tried increasing the lru_size on everything, but that didn't seem to
>> have any effect at all in this scenario (maybe it only matters when there
>> are many more clients).
>
> The single-client "ls -lR" case is one of the worst usage cases for Lustre.
> We have plans to fix this, but haven't done so yet. If you are doing
> repeat "ls -lR" on the same directory (and working set fits into LRU)
> then the performance is greatly improved and you are still guaranteed
> coherency, unlike NFS. That is why we recommend increased LRU sizes for
> nodes that are being used interactively. HPC compute nodes rarely have a
> large working set.
>

Does this apply for the single-client copy case, as well? I imagine for a
/usr directory, with lots of small files, there would be a lot of metadata
activity.

I haven't noticed any performance improvement at all with the lru_size
change. Perhaps the directory metadata is small enough to fit in the
default lru_size. All I have in my test case is 2 ~350MB copies of /usr
and a single 4GB file (this totals to about 37000 files and 3000
directories).

Also, I don't notice any speed change between successive, identical ls
timings on a client. If I start up a new client and do the ls test twice
in a row I get almost identical timings (even when I bump up the
lru_size).

Is the caching on the metadata server, on the client, or both? If on the
server, perhaps the metadata has stayed in cache since the initial copies
and I've always been experiencing the cached performance. But I see no
evidence of a client-side caching effect on the ls tests.

Also, during the ls tests, I noticed there was some activity on the OSTs;
why is that? I would have thought this was purely a metadata operation.

Assuming nothing's wrong (and it may be a while before someone improves
the behavior of the code in this situation), what would I do to improve
performance in this case? What hardware improvement would be of most
benefit with the present code, faster processors, faster disks, lower
latency disks, lower latency connection between nodes, more OSS nodes,...?

Thanks,

Brent
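A simple way to test the caching question directly is to time the listing
twice in a row from the same client; only the second run can benefit from the
client lock LRU. This assumes the /lustre1 mount point used earlier in the
thread:

# First run is cold; the second shows whatever the client LRU buys you.
time ls -lR /lustre1 > /dev/null
time ls -lR /lustre1 > /dev/null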
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
On Dec 23, 2005 01:33 -0500, Brent A Nelson wrote:
> I haven't noticed any performance improvement at all with the lru_size
> change. Perhaps the directory metadata is small enough to fit in the
> default lru_size. All I have in my test case is 2 ~350MB copies of /usr
> and a single 4GB file (this totals to about 37000 files and 3000
> directories).

If your 40k files are more than the LRU size, then it is expected that the
performance would not be improved. The default LRU size is 100, and in
newer versions 100 * num_cpus. Also, if you are increasing the LRU size,
both the MDC and OSCs need to have larger LRUs. The OSC LRU size can be
$default_stripe_count/$num_osts of the MDC LRU, since you will be able to
cache more object locks for the larger number of OSTs.

> Is the caching on the metadata server, on the client, or both? If on the
> server, perhaps the metadata has stayed in cache since the initial copies
> and I've always been experiencing the cached performance. But I see no
> evidence of a client-side caching effect on the ls tests.

There is cache on both sides. The DLM locks (kept in the aforementioned
LRU) are client-side locks. The server also has the Linux filesystem
caches (dcache, icache, pagecache) to cache data on the server side.

> Also, during the ls tests, I noticed there was some activity on the OSTs;
> why is that? I would have thought this was purely a metadata operation.

The file size is stored on the OST, so this is normal.

> Assuming nothing's wrong (and it may be a while before someone improves
> the behavior of the code in this situation), what would I do to improve
> performance in this case? What hardware improvement would be of most
> benefit with the present code, faster processors, faster disks, lower
> latency disks, lower latency connection between nodes, more OSS nodes,...?

The biggest improvement in "ls" performance would likely come from a larger
MDS RAM size (and a 64-bit CPU to be able to use it effectively), so that
the MDS can cache this information in memory. Recent tests showed a 10x
performance improvement when the inode information was in cache on the MDS
vs. reading it from the disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
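As a rough check of whether the MDS is serving this information from memory,
the standard Linux interfaces can be watched on the MDS while a client runs
the ls test; the slab cache names below are assumptions and vary with the
kernel and backing filesystem:

# Run on the MDS while a client does "ls -lR".  Growing cache and slab
# figures between runs suggest the inode data is staying in RAM; if they
# stay small, the MDS is re-reading inodes from disk each time.
grep -E 'MemTotal|MemFree|^Cached' /proc/meminfo
grep -E 'inode_cache|dentry' /proc/slabinfo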
Hello!

On Fri, Dec 23, 2005 at 01:33:02AM -0500, Brent A Nelson wrote:
> > The single-client "ls -lR" case is one of the worst usage cases for Lustre.
> > We have plans to fix this, but haven't done so yet. If you are doing
> > repeat "ls -lR" on the same directory (and working set fits into LRU)
> > then the performance is greatly improved and you are still guaranteed
> > coherency, unlike NFS. That is why we recommend increased LRU sizes for
> > nodes that are being used interactively. HPC compute nodes rarely have a
> > large working set.
> Does this apply for the single-client copy case, as well? I imagine for a

Yes.

> /usr directory, with lots of small files, there would be a lot of metadata
> activity.

Right.

> I haven't noticed any performance improvement at all with the lru_size
> change. Perhaps the directory metadata is small enough to fit in the
> default lru_size. All I have in my test case is 2 ~350MB copies of /usr

It well might be so. For metadata it is unimportant how much space the
data takes.

> and a single 4GB file (this totals to about 37000 files and 3000
> directories).

So around 40000 entries. This means you need to set your lock LRU to at
least 40000 so that the entire "ls -lR" fits into it.

> Also, I don't notice any speed change between successive, identical ls
> timings on a client. If I start up a new client and do the ls test twice
> in a row I get almost identical timings (even when I bump up the
> lru_size).

How big was your increased LRU size?

> Is the caching on the metadata server, on the client, or both? If on the

Caching is on the client only.

> Also, during the ls tests, I noticed there was some activity on the OSTs;

That's right.

> why is that? I would have thought this was purely a metadata operation.

Part of the metadata is stored on the OSTs. This is the current file size
and mtime, for now. So you need to increase the LRU size for both the MDC
and all OSCs. You can do this with this sequence of commands (on every
client where you want this change to be present):

NEWLRUSIZE=41000   # set lru size to 41k

# set the MDC LRU
echo $NEWLRUSIZE > /proc/fs/lustre/ldlm/namespaces/MDC*/lru_size

# set all OSC LRUs
for d in /proc/fs/lustre/ldlm/namespaces/OSC*/lru_size; do
    echo $NEWLRUSIZE > $d
done

> Assuming nothing's wrong (and it may be a while before someone improves
> the behavior of the code in this situation), what would I do to improve
> performance in this case? What hardware improvement would be of most

An LRU size into which everything fits should help for the second-run ls.

> benefit with the present code, faster processors, faster disks, lower
> latency disks, lower latency connection between nodes, more OSS nodes,...?

A lower latency connection between nodes is what you want for fast metadata
operations, essentially. The more OSS nodes your files are striped over,
the slower stat(2) is, because every such node has to be queried for the
mtime and size of the stripe it might hold. (Note that if you have many OSS
nodes, but all files are striped over only one OST, then stat(2)
performance would be the same as if you had a single OST, with no other
load, of course.)

Bye,
    Oleg
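Related to the striping point above, the stripe layout of the test files can
be checked, and new files under a directory kept on a single OST, with the
lfs tool. The option-style syntax shown here is an assumption taken from
later lfs releases; 1.4.x uses positional arguments, so check "lfs help
setstripe" on the installed version. /lustre1/test1 is the path used earlier
in the thread:

# Show how an existing file or directory tree is striped across OSTs.
lfs getstripe /lustre1/test1

# Make new files created under this directory use a single stripe, so a
# stat(2) only has to query one OST for size/mtime.  (Older lfs versions
# take positional arguments instead of -c.)
lfs setstripe -c 1 /lustre1/test1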
Brent A Nelson
2006-May-19 07:36 UTC
[Lustre-discuss] Slow metadata, small file performance
Well, thanks to several people on the list, I got Lustre 1.4.5 running on
my test setup (with Ubuntu Breezy, no less!), and it seems stable (no
problems so far that I didn't cause myself ;-)).

However, I've noticed that some things are performing rather subpar in
some limited testing with a single client. Large sequential reads and
writes seem quick (although perhaps not nearly as quick as would be
theoretically possible with this setup; it might make up for that when
multiple clients are running), but "ls -lR >/dev/null" (even with just a
single client and no other activity) and a "cp -a /usr /lustre1/test1"
(~3.5 minutes for a <350MB /usr) both perform more slowly than to an older
Linux box running NFS on fast ethernet (my Lustre servers have
channel-bonded gigabit, and dual PIII 1GHz processors rather than the NFS
server's 450MHz processors).

I tried increasing the lru_size on everything, but that didn't seem to
have any effect at all in this scenario (maybe it only matters when there
are many more clients). I also added mballoc and extents to the mount
options for the OSTs (small effect, if any). Setting the debug level to
zero helped significantly, but it's still much slower than NFS. The cp
takes maybe 50% longer than NFS and the ls takes about 300% longer.

The numbers are fairly similar whether I have 2 OSS servers serving 3 drbd
mirrors each, or the same servers just serving out a logical volume from
the system drive on each (although the more complex scenario is actually a
little faster for these tests, but still much slower than NFS). I
originally had the MDS on one of the OSS servers and tried moving it to a
third server, but the speed stayed the same.

Any ideas? Many thanks! I know I've been rather scant on details; just let
me know and I'll provide whatever info you need.

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida
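For reference, turning off debug message generation (the change mentioned
above as helping significantly) is usually just a /proc write on each node;
whether the control file lives under portals or lnet depends on the Lustre
release, so both are tried here as an assumption:

# Disable Lustre debug logging on this node.  Only one of these control
# files will exist, depending on whether the stack uses portals or LNET.
[ -w /proc/sys/portals/debug ] && echo 0 > /proc/sys/portals/debug
[ -w /proc/sys/lnet/debug ]    && echo 0 > /proc/sys/lnet/debug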