Hi,

I was wondering if it is possible to have the client completely cache a recursive listing of a Lustre filesystem such that on a second run it doesn't have to talk to the MDT again. Taking the simplest case where I only have one client browsing a million-file tree (say), I would expect that once the ldlm has cached the locks (lru_size) a second recursive scan (find, ls -R) shouldn't need to talk to the MDT/OST again. But this is not the case, probably because a recursive scan needs to do an open() and getdents() on each directory it finds.

If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected, without the MDS chatter - e.g.

  find /mnt/lustre -type f > /tmp/files.txt
  cat /tmp/files.txt | xargs ls -l

Is there any way to improve the browsing speed and cache directory contents - especially for the case where I only have a single client accessing an entire tree? As an aside, I also noticed that an "ls -l" does a getxattr - does that get cached by the client too? I can imagine it might cause quite a bit of MDS chatter.

Cheers,

Daire
On 2010-07-29, at 04:47, Daire Byrne wrote:> I was wondering if it is possible to have the client completely cache > a recursive listing of a lustre filesystem such that on a second run > it doesn''t have to talk to the MDT again? Taking the simplest case > where I only have one client that is browsing a million file tree > (say), I would expect that once the ldlm has cached the locks > (lru_size) then a second recursive scan (find, ls -R) shouldn''t need > to talk to the MDT/OST again. But this is not the case probably > because a recursive scan needs to do a open() and getdents() on each > directory it finds.The getdents() calls can be returned from the client-side cache, it is only the open() that needs to go to the MDS. Lustre actually does support client-side open cache, but it is currently only used by NFS servers (which, sadly, opens and closes the file for every single write operation on a file). I know Oleg has at times discussed enabling the open cache on the client for regular filesystem access, but I don''t know the tweak needed for this offhand. I know in the past we didn''t do this because there was extra DLM locking overhead for cancelling the open lock, but with the DLM lock cancel batching that may not be as big a performance hit. It wouldn''t be a bad idea to start with a /proc tuneable or "-o openlock" mount option that selectively allows open cache per client mount, so that performance testing can be done. After that we can decide whether this is only good for specific workloads and bad for others, or if it is an improvement for most workloads and should be enabled by default.> If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected without the MDS chatter - e.g. > > find /mnt/lustre -type f > /tmp/files.txt > cat /tmp/files.txt | xargs ls -l > > Is there any way to improve the browsing speed and cache directory > contents - especially for the case where I only have a single client > accessing an entire tree? As an aside I also noticed that a "ls -l" > does a getxattr - does that get cached by the client too? I can > imagine it might cause quite a bit of MDS chatter.So far, Lustre doesn''t cache any xattr on the client beyond the file layout ("lustre.lov" xattr), but it is something I''ve been thinking about. The security.capability attribute is special-cased in the 1.8.4 client to not return any data, and beyond that there aren''t any attributes that I''m aware of that are widely used, so I don''t think there is a pressing demand for this, but if a case can be made for this we''ll definitely look at it more seriously. -- Cheers, Andreas
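To make the comparison concrete, the two access patterns discussed above look roughly like this (paths taken from the example earlier in the thread; a sketch only):

  # stat-only pass: a second run is served from the client cache
  find /mnt/lustre -type f > /tmp/files.txt
  cat /tmp/files.txt | xargs ls -l

  # recursive scan: every directory is re-opened, so each open() goes to the MDS
  ls -R /mnt/lustre > /dev/null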
On Thu, Jul 29, 2010 at 8:54 PM, Andreas Dilger <andreas.dilger at oracle.com> wrote:> On 2010-07-29, at 04:47, Daire Byrne wrote: >> I was wondering if it is possible to have the client completely cache >> a recursive listing of a lustre filesystem such that on a second run >> it doesn''t have to talk to the MDT again? Taking the simplest case >> where I only have one client that is browsing a million file tree >> (say), I would expect that once the ldlm has cached the locks >> (lru_size) then a second recursive scan (find, ls -R) shouldn''t need >> to talk to the MDT/OST again. But this is not the case probably >> because a recursive scan needs to do a open() and getdents() on each >> directory it finds. > > The getdents() calls can be returned from the client-side cache, it is only the open() that needs to go to the MDS. ?Lustre actually does support client-side open cache, but it is currently only used by NFS servers (which, sadly, opens and closes the file for every single write operation on a file).Ah yes... that makes sense. I recall the opencache gave a big boost in performance for NFS exporting but I wasn''t sure if it had become the default. I haven''t been keeping up to date with Lustre developments.> I know Oleg has at times discussed enabling the open cache on the client for regular filesystem access, but I don''t know the tweak needed for this offhand. ?I know in the past we didn''t do this because there was extra DLM locking overhead for cancelling the open lock, but with the DLM lock cancel batching that may not be as big a performance hit. > > It wouldn''t be a bad idea to start with a /proc tuneable or "-o openlock" mount option that selectively allows open cache per client mount, so that performance testing can be done. ?After that we can decide whether this is only good for specific workloads and bad for others, or if it is an improvement for most workloads and should be enabled by default.So just as a quick and dirty check of the opencache I did a "find" on a remote client directly using the lustre client and also through an NFS gateway client (in this case running on the MDS). find /mnt/lustre (not cached) = 51 secs find /mnt/lustre (cached = 22 secs find /mnt/nfs (not cached) = 127 secs find /mnt/nfs (cached) = 15 secs. So even with the metadata going over NFS the opencache in the client seems to make quite a difference (I''m not sure how much the NFS client caches though). As expected I see no mdt activity for the NFS export once cached. I think it would be really nice to be able to enable the opencache on any lustre client. A couple of potential workloads that I can think of that would benefit are WAN clients and clients that need to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest hardlink snapshot backups). For the WAN case I''d be quite interested to see what the overhead of the lock cancellation would be like for a busy filesystem. I suppose we can already test that by doing an NFS export. I don''t suppose you know if CEA''s "ganesha" userspace NFS server has access to the opencache? It can cache data to disk too which is also good for WAN applications.>> If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected without the MDS chatter - e.g. 
>> >> ?find /mnt/lustre -type f > /tmp/files.txt >> ?cat /tmp/files.txt | xargs ls -l >> >> Is there any way to improve the browsing speed and cache directory >> contents - especially for the case where I only have a single client >> accessing an entire tree? As an aside I also noticed that a "ls -l" >> does a getxattr - does that get cached by the client too? I can >> imagine it might cause quite a bit of MDS chatter. > > So far, Lustre doesn''t cache any xattr on the client beyond the file layout ("lustre.lov" xattr), but it is something I''ve been thinking about. ?The security.capability attribute is special-cased in the 1.8.4 client to not return any data, and beyond that there aren''t any attributes that I''m aware of that are widely used, so I don''t think there is a pressing demand for this, but if a case can be made for this we''ll definitely look at it more seriously.Yea I doubt it makes much difference I just noticed that "ls -l" did it and was wondering what Lustre made of it. Thanks for the insightful reply as usual Andreas, Daire
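The hardlink-snapshot backup workload mentioned above typically takes a form like the following (paths are invented for illustration); it is almost entirely stat() and link() metadata traffic, which is why both the open cache and link() speed matter for it:

  rsync -a --link-dest=/mnt/lustre/backup.yesterday /source/tree/ /mnt/lustre/backup.today/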
Hello! On Jul 30, 2010, at 7:20 AM, Daire Byrne wrote:> Ah yes... that makes sense. I recall the opencache gave a big boost in > performance for NFS exporting but I wasn''t sure if it had become the > default. I haven''t been keeping up to date with Lustre developments.It was default for NFS for quite some time.> So even with the metadata going over NFS the opencache in the client > seems to make quite a difference (I''m not sure how much the NFS client > caches though). As expected I see no mdt activity for the NFS export > once cached. I think it would be really nice to be able to enable the > opencache on any lustre client. A couple of potential workloads that IA simple workaround for you to enable opencache on a specific client would be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() or if you want it to be cluster wide, in the mds/mds_open.c:mds_open() make all conditions checking for MDS_OPEN_LOCK to be true. I guess we really need to have an option for this, but I am not sure if we want it on the client, server, or both.> can think of that would benefit are WAN clients and clients that need > to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest > hardlink snapshot backups). For the WAN case I''d be quite interestedOpen is very narrow metadata case, so if you do metadata but no opens you would get zero benefit from open cache. Also getting this extra lock puts some extra cpu load on MDS, but if we go this far, we can then somewhat simplify rep-ack and hold it for much shorter time in a lot of cases which would greatly help WAN workloads that happen to create files in same dir from many nodes, for example. (see bug 20373, first patch) Just be aware that testing with more than 16000 clients at ORNL clearly shows degradations at LAN latencies. Bye, Oleg
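For anyone wanting to try the workaround above, a minimal sketch of where the one-line change goes (assuming the usual 1.8 source tree layout; the file, function and flag names are the ones quoted above):

  # find the function, add "cr_flags |= MDS_OPEN_LOCK;" inside it,
  # then rebuild and reload the client modules
  grep -n mds_pack_open_flags lustre/mdc/mdc_lib.c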
Oleg, On Tue, Aug 3, 2010 at 5:21 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:>> So even with the metadata going over NFS the opencache in the client >> seems to make quite a difference (I''m not sure how much the NFS client >> caches though). As expected I see no mdt activity for the NFS export >> once cached. I think it would be really nice to be able to enable the >> opencache on any lustre client. A couple of potential workloads that I > > A simple workaround for you to enable opencache on a specific client would > be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags()Yea that works - cheers. FYI some comparisons with a simple find on a remote client (~33,000 files): find /mnt/lustre (not cached) = 41 secs find /mnt/lustre (cached) = 19 secs find /mnt/lustre (opencache) = 3 secs The "ls -lR" case is still having to query the MDS a lot (for getxattr) which becomes quite noticeable in the WAN case. Apparently the 1.8.4 client already addresses this (#15587?). I might try that patch too...> I guess we really need to have an option for this, but I am not sure > if we want it on the client, server, or both.Doing it client side with the minor modification you suggest is probably enough for our purposes for the time being. Thanks.>> can think of that would benefit are WAN clients and clients that need >> to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest >> hardlink snapshot backups). For the WAN case I''d be quite interested > > Open is very narrow metadata case, so if you do metadata but no opens you would > get zero benefit from open cache.I suppose the recursive scan case is a fairly low frequency operation but is also one that Lustre has always suffered noticeably worse performance when compared to something simpler like NFS. Slightly off topic (and I''ve kinda asked this before) but is there a good reason why link() speeds in Lustre are so slow compare to something like NFS? A quick comparison of doing a "cp -al" from a remote Lustre client and an NFS client (to a fast NFS server): cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec Is it just the extra depth of the lustre stack/code path? Is there anything we could do to speed this up if we know that no other client will touch these dirs while we hardlink them?> Also getting this extra lock puts some extra cpu load on MDS, but if we go this far, > we can then somewhat simplify rep-ack and hold it for much shorter time in > a lot of cases which would greatly help WAN workloads that happen to create > files in same dir from many nodes, for example. (see bug 20373, first patch) > Just be aware that testing with more than 16000 clients at ORNL clearly shows > degradations at LAN latencies.Understood. I think we are a long way off hitting those kinds of limits. The WAN case is interesting because it is the interactive speed of browsing the filesystem that is usually the most noticeable (and annoying) artefact of being many miles away from the server. Once you start accessing the files you want then you are reasonably happy to be limited by your connection''s overall bandwidth. Thanks for the feedback, Daire
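For reference, a files/sec figure like the ones quoted above can be produced with something as simple as the following (purely illustrative):

  t0=$(date +%s)
  cp -al /mnt/lustre/blah /mnt/lustre/blah2
  t1=$(date +%s)
  echo "$(find /mnt/lustre/blah | wc -l) files in $((t1 - t0))s"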
Hello! On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:>>> So even with the metadata going over NFS the opencache in the client >>> seems to make quite a difference (I''m not sure how much the NFS client >>> caches though). As expected I see no mdt activity for the NFS export >>> once cached. I think it would be really nice to be able to enable the >>> opencache on any lustre client. A couple of potential workloads that I >> A simple workaround for you to enable opencache on a specific client would >> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() > Yea that works - cheers. FYI some comparisons with a simple find on a > remote client (~33,000 files): > > find /mnt/lustre (not cached) = 41 secs > find /mnt/lustre (cached) = 19 secs > find /mnt/lustre (opencache) = 3 secsHm, initially I was going to say that find is not open-intensive so it should not benefit from opencache at all. But then I realized if you have a lot of dirs, then indeed there would be a positive impact on subsequent reruns. I assume that the opencache result is a second run and first run produces same 41 seconds? BTW, another unintended side-effect you might experience if you have mixed opencache enabled/disabled network is if you run something (or open for write) on an opencache-enabled client, you might have problems writing (or executing) that file from non-opencache enabled nodes as long as the file handle would remain cached on the client. This is because if open lock was not requested, we don''t try to invalidate current ones (expensive) and MDS would think the file is genuinely open for write/execution and disallow conflicting accesses with EBUSY.> performance when compared to something simpler like NFS. Slightly off > topic (and I''ve kinda asked this before) but is there a good reason > why link() speeds in Lustre are so slow compare to something like NFS? > A quick comparison of doing a "cp -al" from a remote Lustre client and > an NFS client (to a fast NFS server): > > cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec > cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec > > Is it just the extra depth of the lustre stack/code path? Is there > anything we could do to speed this up if we know that no other client > will touch these dirs while we hardlink them?Hm, this is a first complaint about this that I hear. I just looked into strace of cp -fal (which I guess you mant instead of just -fa that would just copy everything). 
so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 +1 RPC fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC (if no opencache) fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents(3, /* 4 entries */, 4096) = 96 getdents(3, /* 0 entries */, 4096) = 0 +1 RPC close(3) = 0 +1 RPC (if no opencache) lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 +1 RPC lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) Then we get to files: link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 +1 RPC futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {128085 6291, 0}}) = 0 +1 RPC then we start traversing the just created tree up and chowning it: chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 +1 RPC getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) +1 RPC stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (not sure why another stat here, we already did it on the way up. Should be cached) setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00 \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x0 5\x00\xff\xff\xff\xff", 28, 0) = 0 +1 RPC getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f09 50, 132) = -1 ENODATA (No data available) +1 RPC stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) 0 Hm, stat again? did not we do it a few syscalls back? stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 stat of the target. +1 RPC (the cache got invalidated by link above). setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 0\x00\x00", 4, 0) = 0 +1 RPC So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. Also you can try disabling debug (if you did not already) to see how big of an impact that would make. It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved. Bye, Oleg
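The debug setting referred to above is the one later reported in the thread as lnet.debug=0; on a 1.8-era client either of these forms should work (shown only as a sketch):

  sysctl -w lnet.debug=0
  lctl set_param debug=0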
Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? Kevin On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:> Hello! > > On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: >>>> So even with the metadata going over NFS the opencache in the >>>> client >>>> seems to make quite a difference (I''m not sure how much the NFS >>>> client >>>> caches though). As expected I see no mdt activity for the NFS >>>> export >>>> once cached. I think it would be really nice to be able to enable >>>> the >>>> opencache on any lustre client. A couple of potential workloads >>>> that I >>> A simple workaround for you to enable opencache on a specific >>> client would >>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/ >>> mdc_lib.c:mds_pack_open_flags() >> Yea that works - cheers. FYI some comparisons with a simple find on a >> remote client (~33,000 files): >> >> find /mnt/lustre (not cached) = 41 secs >> find /mnt/lustre (cached) = 19 secs >> find /mnt/lustre (opencache) = 3 secs > > Hm, initially I was going to say that find is not open-intensive so > it should > not benefit from opencache at all. > But then I realized if you have a lot of dirs, then indeed there > would be a > positive impact on subsequent reruns. > I assume that the opencache result is a second run and first run > produces > same 41 seconds? > > BTW, another unintended side-effect you might experience if you have > mixed > opencache enabled/disabled network is if you run something (or open > for write) > on an opencache-enabled client, you might have problems writing (or > executing) > that file from non-opencache enabled nodes as long as the file handle > would remain cached on the client. This is because if open lock was > not requested, > we don''t try to invalidate current ones (expensive) and MDS would > think > the file is genuinely open for write/execution and disallow > conflicting accesses > with EBUSY. > >> performance when compared to something simpler like NFS. Slightly off >> topic (and I''ve kinda asked this before) but is there a good reason >> why link() speeds in Lustre are so slow compare to something like >> NFS? >> A quick comparison of doing a "cp -al" from a remote Lustre client >> and >> an NFS client (to a fast NFS server): >> >> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec >> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec >> >> Is it just the extra depth of the lustre stack/code path? Is there >> anything we could do to speed this up if we know that no other client >> will touch these dirs while we hardlink them? > > Hm, this is a first complaint about this that I hear. > I just looked into strace of cp -fal (which I guess you mant instead > of just -fa that > would just copy everything). 
> > so we traverse the tree down creating a dir structure in parallel > first (or just doing it in readdir order) > > open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > +1 RPC > > fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 > +1 RPC (if no opencache) > > fcntl(3, F_SETFD, FD_CLOEXEC) = 0 > getdents(3, /* 4 entries */, 4096) = 96 > getdents(3, /* 0 entries */, 4096) = 0 > +1 RPC > > close(3) = 0 > +1 RPC (if no opencache) > > lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (should be cached, so no RPC) > > mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 > +1 RPC > > lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > +1 RPC > > stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (should be cached, so no RPC) > > Then we get to files: > link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/ > k/8") = 0 > +1 RPC > > futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, > 0}, {128085 > 6291, 0}}) = 0 > +1 RPC > > then we start traversing the just created tree up and chowning it: > chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 > +1 RPC > > getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", > 0x7fff519f0950, 132) = -1 ENODATA (No data available) > +1 RPC > > stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (not sure why another stat here, we already did it on the way up. > Should be cached) > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", > "system.posix_acl_access", "\x02\x00 > \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff > \xff \x00\x0 > 5\x00\xff\xff\xff\xff", 28, 0) = 0 > +1 RPC > > getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", > 0x7fff519f09 > 50, 132) = -1 ENODATA (No data available) > +1 RPC > > stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) > 0 > Hm, stat again? did not we do it a few syscalls back? > > stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ... > }) = 0 > stat of the target. +1 RPC (the cache got invalidated by link above). > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", > "system.posix_acl_default", "\x02\x0 > 0\x00\x00", 4, 0) = 0 > +1 RPC > > > So I guess there is a certain number of stat RPCs that would not be > present on NFS > due to different ways the caching works, plus all the getxattrs. Not > sure if this > is enough to explain 4x rate difference. > > Also you can try disabling debug (if you did not already) to see how > big of an impact > that would make. It used to be that debug was affecting metadata > loads a lot, though > with recent debug levels adjustments I think it was somewhat improved. > > Bye, > Oleg > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100803/071c2046/attachment.html
Well, you can drop all locks on a given FS that would in effect drop all metadata caches, but will leave data caches intact. echo clear >/proc/fs/lustre/ldlm/namespaces/your_MDC_namespace/lru_size On Aug 3, 2010, at 2:45 PM, Kevin Van Maren wrote:> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? > > Kevin > > > On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote: > >> Hello! >> >> On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: >>>>> So even with the metadata going over NFS the opencache in the client >>>>> seems to make quite a difference (I''m not sure how much the NFS client >>>>> caches though). As expected I see no mdt activity for the NFS export >>>>> once cached. I think it would be really nice to be able to enable the >>>>> opencache on any lustre client. A couple of potential workloads that I >>>> A simple workaround for you to enable opencache on a specific client would >>>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() >>> Yea that works - cheers. FYI some comparisons with a simple find on a >>> remote client (~33,000 files): >>> >>> find /mnt/lustre (not cached) = 41 secs >>> find /mnt/lustre (cached) = 19 secs >>> find /mnt/lustre (opencache) = 3 secs >> >> Hm, initially I was going to say that find is not open-intensive so it should >> not benefit from opencache at all. >> But then I realized if you have a lot of dirs, then indeed there would be a >> positive impact on subsequent reruns. >> I assume that the opencache result is a second run and first run produces >> same 41 seconds? >> >> BTW, another unintended side-effect you might experience if you have mixed >> opencache enabled/disabled network is if you run something (or open for write) >> on an opencache-enabled client, you might have problems writing (or executing) >> that file from non-opencache enabled nodes as long as the file handle >> would remain cached on the client. This is because if open lock was not requested, >> we don''t try to invalidate current ones (expensive) and MDS would think >> the file is genuinely open for write/execution and disallow conflicting accesses >> with EBUSY. >> >>> performance when compared to something simpler like NFS. Slightly off >>> topic (and I''ve kinda asked this before) but is there a good reason >>> why link() speeds in Lustre are so slow compare to something like NFS? >>> A quick comparison of doing a "cp -al" from a remote Lustre client and >>> an NFS client (to a fast NFS server): >>> >>> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec >>> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec >>> >>> Is it just the extra depth of the lustre stack/code path? Is there >>> anything we could do to speed this up if we know that no other client >>> will touch these dirs while we hardlink them? >> >> Hm, this is a first complaint about this that I hear. >> I just looked into strace of cp -fal (which I guess you mant instead of just -fa that >> would just copy everything). 
>> >> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >> >> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >> +1 RPC >> >> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC (if no opencache) >> >> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >> getdents(3, /* 4 entries */, 4096) = 96 >> getdents(3, /* 0 entries */, 4096) = 0 >> +1 RPC >> >> close(3) = 0 >> +1 RPC (if no opencache) >> >> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >> +1 RPC >> >> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC >> >> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> Then we get to files: >> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >> +1 RPC >> >> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {128085 >> 6291, 0}}) = 0 >> +1 RPC >> >> then we start traversing the just created tree up and chowning it: >> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (not sure why another stat here, we already did it on the way up. Should be cached) >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00 >> \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x0 >> 5\x00\xff\xff\xff\xff", 28, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f09 >> 50, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) >> 0 >> Hm, stat again? did not we do it a few syscalls back? >> >> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... >> }) = 0 >> stat of the target. +1 RPC (the cache got invalidated by link above). >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 >> 0\x00\x00", 4, 0) = 0 >> +1 RPC >> >> >> So I guess there is a certain number of stat RPCs that would not be present on NFS >> due to different ways the caching works, plus all the getxattrs. Not sure if this >> is enough to explain 4x rate difference. >> >> Also you can try disabling debug (if you did not already) to see how big of an impact >> that would make. It used to be that debug was affecting metadata loads a lot, though >> with recent debug levels adjustments I think it was somewhat improved. >> >> Bye, >> Oleg >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
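Combining the lock-dropping command above with the earlier timings, a rough recipe for reproducing the cold/warm comparison on a single client would be (namespace names and the mount point are site-specific):

  # drop the cached metadata locks, and with them the cached attributes
  for ns in /proc/fs/lustre/ldlm/namespaces/*mdc*; do echo clear > $ns/lru_size; done
  time find /mnt/lustre > /dev/null    # "not cached" case
  time find /mnt/lustre > /dev/null    # "cached" case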
On Tue, Aug 3, 2010 at 12:49 PM, Daire Byrne <daire.byrne at gmail.com> wrote:> Oleg, > > On Tue, Aug 3, 2010 at 5:21 AM, Oleg Drokin <oleg.drokin at oracle.com> > wrote: > >> So even with the metadata going over NFS the opencache in the client > >> seems to make quite a difference (I''m not sure how much the NFS client > >> caches though). As expected I see no mdt activity for the NFS export > >> once cached. I think it would be really nice to be able to enable the > >> opencache on any lustre client. A couple of potential workloads that I > > > > A simple workaround for you to enable opencache on a specific client > would > > be to add cr_flags |= MDS_OPEN_LOCK; in > mdc/mdc_lib.c:mds_pack_open_flags() > > Yea that works - cheers. FYI some comparisons with a simple find on a > remote client (~33,000 files): > >That''s not to bad even in the uncached case, what kind of round-trip-time delay do you have between your client and servers?> find /mnt/lustre (not cached) = 41 secs > find /mnt/lustre (cached) = 19 secs > find /mnt/lustre (opencache) = 3 secs > > The "ls -lR" case is still having to query the MDS a lot (for > getxattr) which becomes quite noticeable in the WAN case. Apparently > the 1.8.4 client already addresses this (#15587?). I might try that > patch too... >The patch for bug 15587 addresses problems with SLES 11 (maybe others?) patchless clients with CONFIG_FILE_SECURITY_CAPABILITIES enabled. It severely affects performance over a WAN (see bug 21439) because their is no xattr caching each write is requiring an RPC to the MDS and you won''t see any parallelism.> > I guess we really need to have an option for this, but I am not sure > > if we want it on the client, server, or both. > >It would be nice to have as options for both to allow WAN users to see the benefit without patching code.> Doing it client side with the minor modification you suggest is > probably enough for our purposes for the time being. Thanks. > > >> can think of that would benefit are WAN clients and clients that need > >> to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest > >> hardlink snapshot backups). For the WAN case I''d be quite interested > > > > Open is very narrow metadata case, so if you do metadata but no opens you > would > > get zero benefit from open cache. > > I suppose the recursive scan case is a fairly low frequency operation > but is also one that Lustre has always suffered noticeably worse > performance when compared to something simpler like NFS. Slightly off > topic (and I''ve kinda asked this before) but is there a good reason > why link() speeds in Lustre are so slow compare to something like NFS? > A quick comparison of doing a "cp -al" from a remote Lustre client and > an NFS client (to a fast NFS server): >Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.> > cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec > cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec > > Is it just the extra depth of the lustre stack/code path? 
Is there > anything we could do to speed this up if we know that no other client > will touch these dirs while we hardlink them? > > > Also getting this extra lock puts some extra cpu load on MDS, but if we > go this far, > > we can then somewhat simplify rep-ack and hold it for much shorter time > in > > a lot of cases which would greatly help WAN workloads that happen to > create > > files in same dir from many nodes, for example. (see bug 20373, first > patch) > > Just be aware that testing with more than 16000 clients at ORNL clearly > shows > > degradations at LAN latencies. > > Understood. I think we are a long way off hitting those kinds of > limits. The WAN case is interesting because it is the interactive > speed of browsing the filesystem that is usually the most noticeable > (and annoying) artefact of being many miles away from the server. Once > you start accessing the files you want then you are reasonably happy > to be limited by your connection''s overall bandwidth. >I agree, there is a lot of things I''d like to see added to improve that interactive WAN performance. There is a general metadata WAN performance, bug 18526. I think readdir+ would help and larger RPCs to the MDS (bug 17833) are necessary to overcome the transactional nature Lustre stat/getxattr have today. As for statahead helping "ls -l" performance I have some numbers in my LUG2010 presentation ( http://wiki.lustre.org/images/6/60/LUG2010_Filizetti_SMSi.pdf) about the improvements that size on metadata (SOM) adds compared to Lustre 1.8. Jeremy> > Thanks for the feedback, > > Daire > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100803/63b4ee61/attachment.html
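The stripe-count trade-off described above can be set as directory defaults with lfs, roughly as follows (1.8-era option names assumed, where -s is the stripe size and -c the stripe count; directory names are invented):

  lfs setstripe -c 8 -s 1M /mnt/lustre/big_files     # wide striping: first write waits on a lock per OST
  lfs setstripe -c 1 /mnt/lustre/small_files         # single stripe: one lock round trip over the WAN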
Hello! On Aug 3, 2010, at 10:59 PM, Jeremy Filizetti wrote:> Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.Requesting locks on all stripes would be a disaster for shared-file type of workload where a single file is accessed by a lot of clients. If you know that your app is the only one doing accesses and you would agree to do some extra actions (sure this is limited), the glimpse handler could be modified in such a way that it would grant you a write lock if your file handle was opened for write, I think we have support for this on a client already and server support is mostly trivial. So with that change you would open/create a file for write and then do fstat that would fetch you your locks in parallel (currently this would fetch read locks only, so would not help in case of writes). See bug 11715 for discussions and hopefully for patches that would eventually come. This could be done in a nicer way of course, but would be a much bigger piece of work too which would mean it''s much longer to wait too. Bye, Oleg
On Tue, Aug 3, 2010 at 11:14 PM, Oleg Drokin <oleg.drokin at oracle.com> wrote:
> Hello!
>
> On Aug 3, 2010, at 10:59 PM, Jeremy Filizetti wrote:
>
> > Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.
>

Our workload is primarily WORM (write once read many) so we never have shared write access. It would be nice to have a client option to enable this capability that would remain off by default. Extracting a tar can become painful when it has some small files, but we need the extra stripes for the concurrency that provides good performance for the larger files (which take more time anyway).

Jeremy

> Requesting locks on all stripes would be a disaster for shared-file type of
> workload where a single file is accessed by a lot of clients.
> If you know that your app is the only one doing accesses and you would
> agree to do some extra actions (sure this is limited), the glimpse handler
> could be modified in such a way that it would grant you a write lock if
> your file handle was opened for write, I think we have support for this
> on a client already and server support is mostly trivial.
> So with that change you would open/create a file for write and then do
> fstat that would fetch you your locks in parallel (currently this would
> fetch read locks only, so would not help in case of writes).
> See bug 11715 for discussions and hopefully for patches that would
> eventually come.
>
> This could be done in a nicer way of course, but would be a much bigger
> piece of work too which would mean it's much longer to wait too.
>
> Bye,
> Oleg
On 2010-08-03, at 12:45, Kevin Van Maren wrote:> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)?For Lustre you can do "lctl set_param ldlm.namespaces.*.lru_size=clear" will drop all the DLM locks on the clients, which will flush all pages from the cache.>> I just looked into strace of cp -fal (which I guess you meant instead of just -fa that would just copy everything). >> >> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >> >> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >> +1 RPC >> >> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC (if no opencache) >> >> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >> getdents(3, /* 4 entries */, 4096) = 96 >> getdents(3, /* 0 entries */, 4096) = 0 >> +1 RPCHaving large readdir RPCs would help for directories with more than about 170 entries.>> close(3) = 0 >> +1 RPC (if no opencache) >> >> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >> +1 RPC >> >> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPCIf we do the mkdir(), the client does not cache the entry?>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> Then we get to files: >> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >> +1 RPC >> >> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0 >> +1 RPC >> >> then we start traversing the just created tree up and chowning it: >> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPCThis is gone in 1.8.4>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (not sure why another stat here, we already did it on the way up. Should be cached) >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0 >> +1 RPCStrange that it is setting an ACL when it didn''t read one?>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> Hm, stat again? did not we do it a few syscalls back?Gotta love those GNU file utilities. They are very stat happy.>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 >> stat of the target. +1 RPC (the cache got invalidated by link above). >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0 >> +1 RPCHere it is also setting an ACL even though it didn''t get one from the source.>> So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. >> >> Also you can try disabling debug (if you did not already) to see how big of an impact that would make. 
It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved.Useful would be to run "strace -tttT" to get timestamps for each operation to see for which operations it is slower on Lustre than NFS. Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc.
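One rough way to act on the strace suggestion above and produce per-syscall totals like the ones posted further down in the thread (the awk assumes the usual "-tttT" line format and is only a sketch):

  strace -tttT -o /tmp/cp.trace cp -al /mnt/lustre/blah /mnt/lustre/blah2
  awk -F'[(<>]' 'NF > 2 { n = split($1, a, " "); t[a[n]] += $(NF-1) }
                 END { for (s in t) printf "%-12s %8.2fs\n", s, t[s] }' /tmp/cp.trace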
22492 is exist because someone from Sun/Oracle is disable dentry caching, instead of fixing xattr code for 1.8<>2.0 interoperable. He is kill my patch in revalidate dentry (instead of fixing xattr code). in that case client always send one more RPC to server. see bug 17545 for some details. On Aug 4, 2010, at 10:41, Andreas Dilger wrote:> On 2010-08-03, at 12:45, Kevin Van Maren wrote: >> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? > > For Lustre you can do "lctl set_param ldlm.namespaces.*.lru_size=clear" will drop all the DLM locks on the clients, which will flush all pages from the cache. > >>> I just looked into strace of cp -fal (which I guess you meant instead of just -fa that would just copy everything). >>> >>> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >>> >>> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >>> +1 RPC >>> >>> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> +1 RPC (if no opencache) >>> >>> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >>> getdents(3, /* 4 entries */, 4096) = 96 >>> getdents(3, /* 0 entries */, 4096) = 0 >>> +1 RPC > > Having large readdir RPCs would help for directories with more than about 170 entries. > >>> close(3) = 0 >>> +1 RPC (if no opencache) >>> >>> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (should be cached, so no RPC) >>> >>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >>> +1 RPC >>> >>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> +1 RPC > > If we do the mkdir(), the client does not cache the entry? > >>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (should be cached, so no RPC) >>> >>> Then we get to files: >>> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >>> +1 RPC >>> >>> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0 >>> +1 RPC >>> >>> then we start traversing the just created tree up and chowning it: >>> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >>> +1 RPC >>> >>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >>> +1 RPC > > This is gone in 1.8.4 > >>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (not sure why another stat here, we already did it on the way up. Should be cached) >>> >>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0 >>> +1 RPC > > Strange that it is setting an ACL when it didn''t read one? > >>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >>> +1 RPC >>> >>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> Hm, stat again? did not we do it a few syscalls back? > > Gotta love those GNU file utilities. They are very stat happy. > >>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 >>> stat of the target. +1 RPC (the cache got invalidated by link above). 
>>> >>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0 >>> +1 RPC > > Here it is also setting an ACL even though it didn''t get one from the source. > >>> So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. >>> >>> Also you can try disabling debug (if you did not already) to see how big of an impact that would make. It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved. > > Useful would be to run "strace -tttT" to get timestamps for each operation to see for which operations it is slower on Lustre than NFS. > > Cheers, Andreas > -- > Andreas Dilger > Lustre Technical Lead > Oracle Corporation Canada Inc. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
Hello!

On Aug 4, 2010, at 3:41 AM, Andreas Dilger wrote:
>>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
>>> +1 RPC
>>>
>>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> +1 RPC
> If we do the mkdir(), the client does not cache the entry?

No. mkdir cannot return a lock so we cannot cache anything. Same for opens when open cache is disabled.

Bye,
Oleg
Oleg, On Tue, Aug 3, 2010 at 6:50 PM, Oleg Drokin <oleg.drokin at oracle.com> wrote:>> Yea that works - cheers. FYI some comparisons with a simple find on a >> remote client (~33,000 files): >> >> ?find /mnt/lustre (not cached) = 41 secs >> ?find /mnt/lustre (cached) = 19 secs >> ?find /mnt/lustre (opencache) = 3 secs > > Hm, initially I was going to say that find is not open-intensive so it should > not benefit from opencache at all. > But then I realized if you have a lot of dirs, then indeed there would be a > positive impact on subsequent reruns. > I assume that the opencache result is a second run and first run produces > same 41 seconds?Actually I assumed it would be but I guess there must be some repeat opens because the 1st run with opencache is actually better. I have also set lnet.debug=0 to show the difference that makes: find /mnt/lustre (1st run) = 35 secs find /mnt/lustre (2nd run) = 17 secs find /mnt/lustre (1st run opencache) = 23 secs find /mnt/lustre (2nd run opencache) = 0.65 secs Having lnet.debug=0 does make a difference - probably more noticeable over millions of dirs/files. BTW I think having lots of dirs on a 100TB+ filesystem is going to be a common workload for most non-lab lustre users so having a few clients able to cache opens for doing read-only scans of the filesystem would be a good performance win. BTW all my testing so far is on the LAN but I''m also thinking ahead to how all this metadata RPC traffic will work down our >200ms link to our office in Asia.> BTW, another unintended side-effect you might experience if you have mixed > opencache enabled/disabled network is if you run something (or open for write) > on an opencache-enabled client, you might have problems writing (or executing) > that file from non-opencache enabled nodes as long as the file handle > would remain cached on the client. This is because if open lock was not requested, > we don''t try to invalidate current ones (expensive) and MDS would think > the file is genuinely open for write/execution and disallow conflicting accesses > with EBUSY.Right - worth remembering.>> performance when compared to something simpler like NFS. Slightly off >> topic (and I''ve kinda asked this before) but is there a good reason >> why link() speeds in Lustre are so slow compare to something like NFS? >> A quick comparison of doing a "cp -al" from a remote Lustre client and >> an NFS client (to a fast NFS server): > > Hm, this is a first complaint about this that I hear. > I just looked into strace of cp -fal (which I guess you mant instead of just -fa that > would just copy everything). > > so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) > > open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > +1 RPC > > SNIP! > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 > 0\x00\x00", 4, 0) = 0 > +1 RPC > > So I guess there is a certain number of stat RPCs that would not be present on NFS > due to different ways the caching works, plus all the getxattrs. Not sure if this > is enough to explain 4x rate difference.Thanks for breaking down the strace for me - it is interesting to see where the RPCs are coming from. Andreas suggested getting timing info from the strace of "cp -al" which I''ve done for Lustre and a NFS server (bluearc) while hardlinking some kernel source directories. 
I added up the time (seconds) that all the syscalls in a run took with opencache and the lustre 1.8.4 client:

  syscall     lustre   nfs
  --------------------------
  stat        7s       0.01s
  lstat       36s      7s
  link        29s      16s
  getxattr    5s       0.29s
  setxattr    30s      0.25s
  open        1s       2s
  mkdir       6s       3s
  lchown      11s      2s
  futimesat   11s      2s

It doesn't quite explain the 4:1 speed difference, but the (l)stat-heavy "cp -la" is consistently that much faster on NFS. Is the NFS server so much faster for get/setxattr because it returns "EOPNOTSUPP" for setxattr? Can we do something similar for the Lustre client if we don't care about extended attributes? The link() times are still almost twice as slow on Lustre though - that may be related to a slowish (test) MDT disk. Like Andreas said, I don't understand why there is a setxattr RPC when we didn't get any data from getxattr, but that is probably more down to "cp" than Lustre?

I also did a quick test using "rsync" to do the hardlinking (which noticeably doesn't use get/setxattr) and now the difference in speed is more like 2:1. The way rsync launches a couple of instances (sender/receiver) probably helps parallelise things better for Lustre. In this case the overall link() time is similar but the overall lstat() time is 3x slower for Lustre (67 secs/22 secs). The lstat() calls should all be coming from the client cache, but each one seems to be consistently 3x slower than for NFS for some reason.

Cheers,

Daire
Hello! On Aug 4, 2010, at 2:04 PM, Daire Byrne wrote:>> Hm, initially I was going to say that find is not open-intensive so it should >> not benefit from opencache at all. >> But then I realized if you have a lot of dirs, then indeed there would be a >> positive impact on subsequent reruns. >> I assume that the opencache result is a second run and first run produces >> same 41 seconds? > Actually I assumed it would be but I guess there must be some repeat > opens because the 1st run with opencache is actually better. I haveopen followed by stat would also benefit from opencache by removing one RPC for stat.> > syscall lustre nfs > -------------------------- > stat 7s 0.01s > lstat 36s 7s > link 29s 16s > getxattr 5s 0.29s > setxattr 30s 0.25s > open 1s 2s > mkdir 6s 3s > lchown 11s 2s > futimesat 11s 2sHm. That''s interesting. And this is over a high latency link, is it? Was this also with debug disabled? I don''t think lstat is any much different than stat if the target is not symlink. I wonder if most of the difference with lstat comes from the fact that for us lstat is rpc (mostly used after opens or readdirs plus fetches attrs from OSTs too) where as for NFS not only they cache data, their readdirplus is better than statahead because it fetches all file info including size and times, where as statahead confusingly does not caches stat information, only what is available on MDS. I had a stab at patch to fetch OST data in parallel too, but that turned out to be not all that trivial and never worked completely correctly. Might be I need to take another look at it after Johann revamps request sets logic a bit to make adding requests to sets easier.> It doesn''t quite explain the 4:1 speed difference but the (l)stat > heavy "cp -la" is consistently that much faster on NFS. Is the NFS > server so much faster for get/setxattr because it returns "EOPNOTSUPP" > for setxattr? Can we do something similar for the Lustre client if weIf it does return EOPNOTSUPP on the client side then there is no RPC and the reply is instant. For lustre it is an RPC roundtrip which is not exactly cheap.> don''t care about extended attributes? The link() times are still > almost twice as slow on Lustre though - that may be related to a > slowish (test) MDT disk. Like Andreas said I don''t understand whyThere is some more work for link in case of lustre like rep-ack (extra confirmation from client to server that it got the link reply), same with mkdir. I am not sure why such a big difference with chown and time update, though actually I now realise we need to talk to OSTs to update ownership and times there as well which adds up even though it should be sent in parallel.> there is an setxattr RPC when we didn''t get any data from getxattr but > that is probably more down to "cp" than lustre?Yes, I think this is more about cp, you can see nfs also has setxattr attempts. Bye, Oleg
That is a fairly trivial patch. 2.0 uses CR mode to avoid an extra lock cancel on xattr update, while 1.8 uses EX mode for any xattr update. But the real world needs both modes :-) because the client gets the ACL (part of the xattr) with any getattr/open RPC, so the MDS MUST cancel that lock on any conflicting update (because the ACL is used for VFS permission checks). Currently that is "fixed" via an extra RPC from the revalidate_dentry call.

In other words, the MDS needs to be changed to use CR mode with the UPDATE bit for xattr changes, and EX mode with the LOOKUP + UPDATE bits for ACL updates.

Another bug in that area: the client always shows the LOV EA in the xattr list and sends an extra RPC to the MDS to read the LOV EA from disk, but that could easily be generated on the client (via the obd_pack_md function) because the client always has the LOV EA locally. That would save 2 RPCs in some cases (the first to get the LOV EA size, the second to get the LOV EA itself).

On Aug 4, 2010, at 23:59, Andreas Dilger wrote:
> On 2010-08-04, at 01:57, Alexey Lyashkov wrote:
>> 22492 exists because someone from Sun/Oracle disabled dentry caching, instead of fixing the xattr code for 1.8<>2.0 interoperability.
>> He killed my patch in revalidate dentry (instead of fixing the xattr code).
>> In that case the client always sends one more RPC to the server.
>> See bug 17545 for some details.
>
> Patches to "fix xattr interoperability" are welcome.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
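As a concrete illustration of the LOV EA point above (the file name is invented; getfattr comes from the attr package):

  getfattr -d -m - /mnt/lustre/somefile    # listing all xattrs shows lustre.lov and triggers the extra RPCs described above
  lfs getstripe /mnt/lustre/somefile       # the striping information that the LOV EA encodes, via the lfs interface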