Hi,

I was wondering if it is possible to have the client completely cache a recursive listing of a Lustre filesystem such that on a second run it doesn't have to talk to the MDT again. Taking the simplest case where I only have one client browsing a million-file tree (say), I would expect that once the ldlm has cached the locks (lru_size) a second recursive scan (find, ls -R) shouldn't need to talk to the MDT/OST again. But this is not the case, probably because a recursive scan needs to do an open() and getdents() on each directory it finds.

If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected, without the MDS chatter - e.g.

  find /mnt/lustre -type f > /tmp/files.txt
  cat /tmp/files.txt | xargs ls -l

Is there any way to improve the browsing speed and cache directory contents - especially for the case where I only have a single client accessing an entire tree? As an aside, I also noticed that an "ls -l" does a getxattr - does that get cached by the client too? I can imagine it might cause quite a bit of MDS chatter.

Cheers,

Daire
On 2010-07-29, at 04:47, Daire Byrne wrote:> I was wondering if it is possible to have the client completely cache > a recursive listing of a lustre filesystem such that on a second run > it doesn''t have to talk to the MDT again? Taking the simplest case > where I only have one client that is browsing a million file tree > (say), I would expect that once the ldlm has cached the locks > (lru_size) then a second recursive scan (find, ls -R) shouldn''t need > to talk to the MDT/OST again. But this is not the case probably > because a recursive scan needs to do a open() and getdents() on each > directory it finds.The getdents() calls can be returned from the client-side cache, it is only the open() that needs to go to the MDS. Lustre actually does support client-side open cache, but it is currently only used by NFS servers (which, sadly, opens and closes the file for every single write operation on a file). I know Oleg has at times discussed enabling the open cache on the client for regular filesystem access, but I don''t know the tweak needed for this offhand. I know in the past we didn''t do this because there was extra DLM locking overhead for cancelling the open lock, but with the DLM lock cancel batching that may not be as big a performance hit. It wouldn''t be a bad idea to start with a /proc tuneable or "-o openlock" mount option that selectively allows open cache per client mount, so that performance testing can be done. After that we can decide whether this is only good for specific workloads and bad for others, or if it is an improvement for most workloads and should be enabled by default.> If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected without the MDS chatter - e.g. > > find /mnt/lustre -type f > /tmp/files.txt > cat /tmp/files.txt | xargs ls -l > > Is there any way to improve the browsing speed and cache directory > contents - especially for the case where I only have a single client > accessing an entire tree? As an aside I also noticed that a "ls -l" > does a getxattr - does that get cached by the client too? I can > imagine it might cause quite a bit of MDS chatter.So far, Lustre doesn''t cache any xattr on the client beyond the file layout ("lustre.lov" xattr), but it is something I''ve been thinking about. The security.capability attribute is special-cased in the 1.8.4 client to not return any data, and beyond that there aren''t any attributes that I''m aware of that are widely used, so I don''t think there is a pressing demand for this, but if a case can be made for this we''ll definitely look at it more seriously. -- Cheers, Andreas
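To make the comparison concrete, the two access patterns discussed above look roughly like this (paths taken from the example earlier in the thread; a sketch only):

  # stat-only pass: a second run is served from the client cache
  find /mnt/lustre -type f > /tmp/files.txt
  cat /tmp/files.txt | xargs ls -l

  # recursive scan: every directory is re-opened, so each open() goes to the MDS
  ls -R /mnt/lustre > /dev/null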
On Thu, Jul 29, 2010 at 8:54 PM, Andreas Dilger <andreas.dilger at oracle.com> wrote:> On 2010-07-29, at 04:47, Daire Byrne wrote: >> I was wondering if it is possible to have the client completely cache >> a recursive listing of a lustre filesystem such that on a second run >> it doesn''t have to talk to the MDT again? Taking the simplest case >> where I only have one client that is browsing a million file tree >> (say), I would expect that once the ldlm has cached the locks >> (lru_size) then a second recursive scan (find, ls -R) shouldn''t need >> to talk to the MDT/OST again. But this is not the case probably >> because a recursive scan needs to do a open() and getdents() on each >> directory it finds. > > The getdents() calls can be returned from the client-side cache, it is only the open() that needs to go to the MDS. ?Lustre actually does support client-side open cache, but it is currently only used by NFS servers (which, sadly, opens and closes the file for every single write operation on a file).Ah yes... that makes sense. I recall the opencache gave a big boost in performance for NFS exporting but I wasn''t sure if it had become the default. I haven''t been keeping up to date with Lustre developments.> I know Oleg has at times discussed enabling the open cache on the client for regular filesystem access, but I don''t know the tweak needed for this offhand. ?I know in the past we didn''t do this because there was extra DLM locking overhead for cancelling the open lock, but with the DLM lock cancel batching that may not be as big a performance hit. > > It wouldn''t be a bad idea to start with a /proc tuneable or "-o openlock" mount option that selectively allows open cache per client mount, so that performance testing can be done. ?After that we can decide whether this is only good for specific workloads and bad for others, or if it is an improvement for most workloads and should be enabled by default.So just as a quick and dirty check of the opencache I did a "find" on a remote client directly using the lustre client and also through an NFS gateway client (in this case running on the MDS). find /mnt/lustre (not cached) = 51 secs find /mnt/lustre (cached = 22 secs find /mnt/nfs (not cached) = 127 secs find /mnt/nfs (cached) = 15 secs. So even with the metadata going over NFS the opencache in the client seems to make quite a difference (I''m not sure how much the NFS client caches though). As expected I see no mdt activity for the NFS export once cached. I think it would be really nice to be able to enable the opencache on any lustre client. A couple of potential workloads that I can think of that would benefit are WAN clients and clients that need to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest hardlink snapshot backups). For the WAN case I''d be quite interested to see what the overhead of the lock cancellation would be like for a busy filesystem. I suppose we can already test that by doing an NFS export. I don''t suppose you know if CEA''s "ganesha" userspace NFS server has access to the opencache? It can cache data to disk too which is also good for WAN applications.>> If I just stat all the files without doing a recursive scan then it gets everything from the client cache as expected without the MDS chatter - e.g. 
>> >> ?find /mnt/lustre -type f > /tmp/files.txt >> ?cat /tmp/files.txt | xargs ls -l >> >> Is there any way to improve the browsing speed and cache directory >> contents - especially for the case where I only have a single client >> accessing an entire tree? As an aside I also noticed that a "ls -l" >> does a getxattr - does that get cached by the client too? I can >> imagine it might cause quite a bit of MDS chatter. > > So far, Lustre doesn''t cache any xattr on the client beyond the file layout ("lustre.lov" xattr), but it is something I''ve been thinking about. ?The security.capability attribute is special-cased in the 1.8.4 client to not return any data, and beyond that there aren''t any attributes that I''m aware of that are widely used, so I don''t think there is a pressing demand for this, but if a case can be made for this we''ll definitely look at it more seriously.Yea I doubt it makes much difference I just noticed that "ls -l" did it and was wondering what Lustre made of it. Thanks for the insightful reply as usual Andreas, Daire
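The hardlink-snapshot backup workload mentioned above typically takes a form like the following (paths are invented for illustration); it is almost entirely stat() and link() metadata traffic, which is why both the open cache and link() speed matter for it:

  rsync -a --link-dest=/mnt/lustre/backup.yesterday /source/tree/ /mnt/lustre/backup.today/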
Hello! On Jul 30, 2010, at 7:20 AM, Daire Byrne wrote:> Ah yes... that makes sense. I recall the opencache gave a big boost in > performance for NFS exporting but I wasn''t sure if it had become the > default. I haven''t been keeping up to date with Lustre developments.It was default for NFS for quite some time.> So even with the metadata going over NFS the opencache in the client > seems to make quite a difference (I''m not sure how much the NFS client > caches though). As expected I see no mdt activity for the NFS export > once cached. I think it would be really nice to be able to enable the > opencache on any lustre client. A couple of potential workloads that IA simple workaround for you to enable opencache on a specific client would be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() or if you want it to be cluster wide, in the mds/mds_open.c:mds_open() make all conditions checking for MDS_OPEN_LOCK to be true. I guess we really need to have an option for this, but I am not sure if we want it on the client, server, or both.> can think of that would benefit are WAN clients and clients that need > to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest > hardlink snapshot backups). For the WAN case I''d be quite interestedOpen is very narrow metadata case, so if you do metadata but no opens you would get zero benefit from open cache. Also getting this extra lock puts some extra cpu load on MDS, but if we go this far, we can then somewhat simplify rep-ack and hold it for much shorter time in a lot of cases which would greatly help WAN workloads that happen to create files in same dir from many nodes, for example. (see bug 20373, first patch) Just be aware that testing with more than 16000 clients at ORNL clearly shows degradations at LAN latencies. Bye, Oleg
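For anyone wanting to try the workaround above, a minimal sketch of where the one-line change goes (assuming the usual 1.8 source tree layout; the file, function and flag names are the ones quoted above):

  # find the function, add "cr_flags |= MDS_OPEN_LOCK;" inside it,
  # then rebuild and reload the client modules
  grep -n mds_pack_open_flags lustre/mdc/mdc_lib.c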
Oleg, On Tue, Aug 3, 2010 at 5:21 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:>> So even with the metadata going over NFS the opencache in the client >> seems to make quite a difference (I''m not sure how much the NFS client >> caches though). As expected I see no mdt activity for the NFS export >> once cached. I think it would be really nice to be able to enable the >> opencache on any lustre client. A couple of potential workloads that I > > A simple workaround for you to enable opencache on a specific client would > be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags()Yea that works - cheers. FYI some comparisons with a simple find on a remote client (~33,000 files): find /mnt/lustre (not cached) = 41 secs find /mnt/lustre (cached) = 19 secs find /mnt/lustre (opencache) = 3 secs The "ls -lR" case is still having to query the MDS a lot (for getxattr) which becomes quite noticeable in the WAN case. Apparently the 1.8.4 client already addresses this (#15587?). I might try that patch too...> I guess we really need to have an option for this, but I am not sure > if we want it on the client, server, or both.Doing it client side with the minor modification you suggest is probably enough for our purposes for the time being. Thanks.>> can think of that would benefit are WAN clients and clients that need >> to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest >> hardlink snapshot backups). For the WAN case I''d be quite interested > > Open is very narrow metadata case, so if you do metadata but no opens you would > get zero benefit from open cache.I suppose the recursive scan case is a fairly low frequency operation but is also one that Lustre has always suffered noticeably worse performance when compared to something simpler like NFS. Slightly off topic (and I''ve kinda asked this before) but is there a good reason why link() speeds in Lustre are so slow compare to something like NFS? A quick comparison of doing a "cp -al" from a remote Lustre client and an NFS client (to a fast NFS server): cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec Is it just the extra depth of the lustre stack/code path? Is there anything we could do to speed this up if we know that no other client will touch these dirs while we hardlink them?> Also getting this extra lock puts some extra cpu load on MDS, but if we go this far, > we can then somewhat simplify rep-ack and hold it for much shorter time in > a lot of cases which would greatly help WAN workloads that happen to create > files in same dir from many nodes, for example. (see bug 20373, first patch) > Just be aware that testing with more than 16000 clients at ORNL clearly shows > degradations at LAN latencies.Understood. I think we are a long way off hitting those kinds of limits. The WAN case is interesting because it is the interactive speed of browsing the filesystem that is usually the most noticeable (and annoying) artefact of being many miles away from the server. Once you start accessing the files you want then you are reasonably happy to be limited by your connection''s overall bandwidth. Thanks for the feedback, Daire
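For reference, a files/sec figure like the ones quoted above can be produced with something as simple as the following (purely illustrative):

  t0=$(date +%s)
  cp -al /mnt/lustre/blah /mnt/lustre/blah2
  t1=$(date +%s)
  echo "$(find /mnt/lustre/blah | wc -l) files in $((t1 - t0))s"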
Hello! On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:>>> So even with the metadata going over NFS the opencache in the client >>> seems to make quite a difference (I''m not sure how much the NFS client >>> caches though). As expected I see no mdt activity for the NFS export >>> once cached. I think it would be really nice to be able to enable the >>> opencache on any lustre client. A couple of potential workloads that I >> A simple workaround for you to enable opencache on a specific client would >> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() > Yea that works - cheers. FYI some comparisons with a simple find on a > remote client (~33,000 files): > > find /mnt/lustre (not cached) = 41 secs > find /mnt/lustre (cached) = 19 secs > find /mnt/lustre (opencache) = 3 secsHm, initially I was going to say that find is not open-intensive so it should not benefit from opencache at all. But then I realized if you have a lot of dirs, then indeed there would be a positive impact on subsequent reruns. I assume that the opencache result is a second run and first run produces same 41 seconds? BTW, another unintended side-effect you might experience if you have mixed opencache enabled/disabled network is if you run something (or open for write) on an opencache-enabled client, you might have problems writing (or executing) that file from non-opencache enabled nodes as long as the file handle would remain cached on the client. This is because if open lock was not requested, we don''t try to invalidate current ones (expensive) and MDS would think the file is genuinely open for write/execution and disallow conflicting accesses with EBUSY.> performance when compared to something simpler like NFS. Slightly off > topic (and I''ve kinda asked this before) but is there a good reason > why link() speeds in Lustre are so slow compare to something like NFS? > A quick comparison of doing a "cp -al" from a remote Lustre client and > an NFS client (to a fast NFS server): > > cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec > cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec > > Is it just the extra depth of the lustre stack/code path? Is there > anything we could do to speed this up if we know that no other client > will touch these dirs while we hardlink them?Hm, this is a first complaint about this that I hear. I just looked into strace of cp -fal (which I guess you mant instead of just -fa that would just copy everything). 
so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 +1 RPC fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC (if no opencache) fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents(3, /* 4 entries */, 4096) = 96 getdents(3, /* 0 entries */, 4096) = 0 +1 RPC close(3) = 0 +1 RPC (if no opencache) lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 +1 RPC lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) Then we get to files: link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 +1 RPC futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {128085 6291, 0}}) = 0 +1 RPC then we start traversing the just created tree up and chowning it: chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 +1 RPC getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) +1 RPC stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (not sure why another stat here, we already did it on the way up. Should be cached) setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00 \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x0 5\x00\xff\xff\xff\xff", 28, 0) = 0 +1 RPC getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f09 50, 132) = -1 ENODATA (No data available) +1 RPC stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) 0 Hm, stat again? did not we do it a few syscalls back? stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 stat of the target. +1 RPC (the cache got invalidated by link above). setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 0\x00\x00", 4, 0) = 0 +1 RPC So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. Also you can try disabling debug (if you did not already) to see how big of an impact that would make. It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved. Bye, Oleg
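The debug setting referred to above is the one later reported in the thread as lnet.debug=0; on a 1.8-era client either of these forms should work (shown only as a sketch):

  sysctl -w lnet.debug=0
  lctl set_param debug=0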
Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? Kevin On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote:> Hello! > > On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: >>>> So even with the metadata going over NFS the opencache in the >>>> client >>>> seems to make quite a difference (I''m not sure how much the NFS >>>> client >>>> caches though). As expected I see no mdt activity for the NFS >>>> export >>>> once cached. I think it would be really nice to be able to enable >>>> the >>>> opencache on any lustre client. A couple of potential workloads >>>> that I >>> A simple workaround for you to enable opencache on a specific >>> client would >>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/ >>> mdc_lib.c:mds_pack_open_flags() >> Yea that works - cheers. FYI some comparisons with a simple find on a >> remote client (~33,000 files): >> >> find /mnt/lustre (not cached) = 41 secs >> find /mnt/lustre (cached) = 19 secs >> find /mnt/lustre (opencache) = 3 secs > > Hm, initially I was going to say that find is not open-intensive so > it should > not benefit from opencache at all. > But then I realized if you have a lot of dirs, then indeed there > would be a > positive impact on subsequent reruns. > I assume that the opencache result is a second run and first run > produces > same 41 seconds? > > BTW, another unintended side-effect you might experience if you have > mixed > opencache enabled/disabled network is if you run something (or open > for write) > on an opencache-enabled client, you might have problems writing (or > executing) > that file from non-opencache enabled nodes as long as the file handle > would remain cached on the client. This is because if open lock was > not requested, > we don''t try to invalidate current ones (expensive) and MDS would > think > the file is genuinely open for write/execution and disallow > conflicting accesses > with EBUSY. > >> performance when compared to something simpler like NFS. Slightly off >> topic (and I''ve kinda asked this before) but is there a good reason >> why link() speeds in Lustre are so slow compare to something like >> NFS? >> A quick comparison of doing a "cp -al" from a remote Lustre client >> and >> an NFS client (to a fast NFS server): >> >> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec >> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec >> >> Is it just the extra depth of the lustre stack/code path? Is there >> anything we could do to speed this up if we know that no other client >> will touch these dirs while we hardlink them? > > Hm, this is a first complaint about this that I hear. > I just looked into strace of cp -fal (which I guess you mant instead > of just -fa that > would just copy everything). 
> > so we traverse the tree down creating a dir structure in parallel > first (or just doing it in readdir order) > > open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > +1 RPC > > fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 > +1 RPC (if no opencache) > > fcntl(3, F_SETFD, FD_CLOEXEC) = 0 > getdents(3, /* 4 entries */, 4096) = 96 > getdents(3, /* 0 entries */, 4096) = 0 > +1 RPC > > close(3) = 0 > +1 RPC (if no opencache) > > lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (should be cached, so no RPC) > > mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 > +1 RPC > > lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > +1 RPC > > stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (should be cached, so no RPC) > > Then we get to files: > link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/ > k/8") = 0 > +1 RPC > > futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, > 0}, {128085 > 6291, 0}}) = 0 > +1 RPC > > then we start traversing the just created tree up and chowning it: > chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 > +1 RPC > > getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", > 0x7fff519f0950, 132) = -1 ENODATA (No data available) > +1 RPC > > stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) = 0 > (not sure why another stat here, we already did it on the way up. > Should be cached) > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", > "system.posix_acl_access", "\x02\x00 > \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff > \xff \x00\x0 > 5\x00\xff\xff\xff\xff", 28, 0) = 0 > +1 RPC > > getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", > 0x7fff519f09 > 50, 132) = -1 ENODATA (No data available) > +1 RPC > > stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ...}) > 0 > Hm, stat again? did not we do it a few syscalls back? > > stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, > st_size=4096, ... > }) = 0 > stat of the target. +1 RPC (the cache got invalidated by link above). > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", > "system.posix_acl_default", "\x02\x0 > 0\x00\x00", 4, 0) = 0 > +1 RPC > > > So I guess there is a certain number of stat RPCs that would not be > present on NFS > due to different ways the caching works, plus all the getxattrs. Not > sure if this > is enough to explain 4x rate difference. > > Also you can try disabling debug (if you did not already) to see how > big of an impact > that would make. It used to be that debug was affecting metadata > loads a lot, though > with recent debug levels adjustments I think it was somewhat improved. > > Bye, > Oleg > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100803/071c2046/attachment.html
Well, you can drop all locks on a given FS that would in effect drop all metadata caches, but will leave data caches intact. echo clear >/proc/fs/lustre/ldlm/namespaces/your_MDC_namespace/lru_size On Aug 3, 2010, at 2:45 PM, Kevin Van Maren wrote:> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? > > Kevin > > > On Aug 3, 2010, at 11:50 AM, Oleg Drokin <oleg.drokin at oracle.com> wrote: > >> Hello! >> >> On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: >>>>> So even with the metadata going over NFS the opencache in the client >>>>> seems to make quite a difference (I''m not sure how much the NFS client >>>>> caches though). As expected I see no mdt activity for the NFS export >>>>> once cached. I think it would be really nice to be able to enable the >>>>> opencache on any lustre client. A couple of potential workloads that I >>>> A simple workaround for you to enable opencache on a specific client would >>>> be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() >>> Yea that works - cheers. FYI some comparisons with a simple find on a >>> remote client (~33,000 files): >>> >>> find /mnt/lustre (not cached) = 41 secs >>> find /mnt/lustre (cached) = 19 secs >>> find /mnt/lustre (opencache) = 3 secs >> >> Hm, initially I was going to say that find is not open-intensive so it should >> not benefit from opencache at all. >> But then I realized if you have a lot of dirs, then indeed there would be a >> positive impact on subsequent reruns. >> I assume that the opencache result is a second run and first run produces >> same 41 seconds? >> >> BTW, another unintended side-effect you might experience if you have mixed >> opencache enabled/disabled network is if you run something (or open for write) >> on an opencache-enabled client, you might have problems writing (or executing) >> that file from non-opencache enabled nodes as long as the file handle >> would remain cached on the client. This is because if open lock was not requested, >> we don''t try to invalidate current ones (expensive) and MDS would think >> the file is genuinely open for write/execution and disallow conflicting accesses >> with EBUSY. >> >>> performance when compared to something simpler like NFS. Slightly off >>> topic (and I''ve kinda asked this before) but is there a good reason >>> why link() speeds in Lustre are so slow compare to something like NFS? >>> A quick comparison of doing a "cp -al" from a remote Lustre client and >>> an NFS client (to a fast NFS server): >>> >>> cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec >>> cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec >>> >>> Is it just the extra depth of the lustre stack/code path? Is there >>> anything we could do to speed this up if we know that no other client >>> will touch these dirs while we hardlink them? >> >> Hm, this is a first complaint about this that I hear. >> I just looked into strace of cp -fal (which I guess you mant instead of just -fa that >> would just copy everything). 
>> >> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >> >> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >> +1 RPC >> >> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC (if no opencache) >> >> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >> getdents(3, /* 4 entries */, 4096) = 96 >> getdents(3, /* 0 entries */, 4096) = 0 >> +1 RPC >> >> close(3) = 0 >> +1 RPC (if no opencache) >> >> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >> +1 RPC >> >> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC >> >> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> Then we get to files: >> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >> +1 RPC >> >> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {128085 >> 6291, 0}}) = 0 >> +1 RPC >> >> then we start traversing the just created tree up and chowning it: >> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (not sure why another stat here, we already did it on the way up. Should be cached) >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00 >> \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x0 >> 5\x00\xff\xff\xff\xff", 28, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f09 >> 50, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) >> 0 >> Hm, stat again? did not we do it a few syscalls back? >> >> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... >> }) = 0 >> stat of the target. +1 RPC (the cache got invalidated by link above). >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 >> 0\x00\x00", 4, 0) = 0 >> +1 RPC >> >> >> So I guess there is a certain number of stat RPCs that would not be present on NFS >> due to different ways the caching works, plus all the getxattrs. Not sure if this >> is enough to explain 4x rate difference. >> >> Also you can try disabling debug (if you did not already) to see how big of an impact >> that would make. It used to be that debug was affecting metadata loads a lot, though >> with recent debug levels adjustments I think it was somewhat improved. >> >> Bye, >> Oleg >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
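Combining the lock-dropping command above with the earlier timings, a rough recipe for reproducing the cold/warm comparison on a single client would be (namespace names and the mount point are site-specific):

  # drop the cached metadata locks, and with them the cached attributes
  for ns in /proc/fs/lustre/ldlm/namespaces/*mdc*; do echo clear > $ns/lru_size; done
  time find /mnt/lustre > /dev/null    # "not cached" case
  time find /mnt/lustre > /dev/null    # "cached" case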
On Tue, Aug 3, 2010 at 12:49 PM, Daire Byrne <daire.byrne at gmail.com> wrote:> Oleg, > > On Tue, Aug 3, 2010 at 5:21 AM, Oleg Drokin <oleg.drokin at oracle.com> > wrote: > >> So even with the metadata going over NFS the opencache in the client > >> seems to make quite a difference (I''m not sure how much the NFS client > >> caches though). As expected I see no mdt activity for the NFS export > >> once cached. I think it would be really nice to be able to enable the > >> opencache on any lustre client. A couple of potential workloads that I > > > > A simple workaround for you to enable opencache on a specific client > would > > be to add cr_flags |= MDS_OPEN_LOCK; in > mdc/mdc_lib.c:mds_pack_open_flags() > > Yea that works - cheers. FYI some comparisons with a simple find on a > remote client (~33,000 files): > >That''s not to bad even in the uncached case, what kind of round-trip-time delay do you have between your client and servers?> find /mnt/lustre (not cached) = 41 secs > find /mnt/lustre (cached) = 19 secs > find /mnt/lustre (opencache) = 3 secs > > The "ls -lR" case is still having to query the MDS a lot (for > getxattr) which becomes quite noticeable in the WAN case. Apparently > the 1.8.4 client already addresses this (#15587?). I might try that > patch too... >The patch for bug 15587 addresses problems with SLES 11 (maybe others?) patchless clients with CONFIG_FILE_SECURITY_CAPABILITIES enabled. It severely affects performance over a WAN (see bug 21439) because their is no xattr caching each write is requiring an RPC to the MDS and you won''t see any parallelism.> > I guess we really need to have an option for this, but I am not sure > > if we want it on the client, server, or both. > >It would be nice to have as options for both to allow WAN users to see the benefit without patching code.> Doing it client side with the minor modification you suggest is > probably enough for our purposes for the time being. Thanks. > > >> can think of that would benefit are WAN clients and clients that need > >> to do mainly metadata (e.g. scanning the filesystem, rsync --link-dest > >> hardlink snapshot backups). For the WAN case I''d be quite interested > > > > Open is very narrow metadata case, so if you do metadata but no opens you > would > > get zero benefit from open cache. > > I suppose the recursive scan case is a fairly low frequency operation > but is also one that Lustre has always suffered noticeably worse > performance when compared to something simpler like NFS. Slightly off > topic (and I''ve kinda asked this before) but is there a good reason > why link() speeds in Lustre are so slow compare to something like NFS? > A quick comparison of doing a "cp -al" from a remote Lustre client and > an NFS client (to a fast NFS server): >Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.> > cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec > cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec > > Is it just the extra depth of the lustre stack/code path? 
Is there > anything we could do to speed this up if we know that no other client > will touch these dirs while we hardlink them? > > > Also getting this extra lock puts some extra cpu load on MDS, but if we > go this far, > > we can then somewhat simplify rep-ack and hold it for much shorter time > in > > a lot of cases which would greatly help WAN workloads that happen to > create > > files in same dir from many nodes, for example. (see bug 20373, first > patch) > > Just be aware that testing with more than 16000 clients at ORNL clearly > shows > > degradations at LAN latencies. > > Understood. I think we are a long way off hitting those kinds of > limits. The WAN case is interesting because it is the interactive > speed of browsing the filesystem that is usually the most noticeable > (and annoying) artefact of being many miles away from the server. Once > you start accessing the files you want then you are reasonably happy > to be limited by your connection''s overall bandwidth. >I agree, there is a lot of things I''d like to see added to improve that interactive WAN performance. There is a general metadata WAN performance, bug 18526. I think readdir+ would help and larger RPCs to the MDS (bug 17833) are necessary to overcome the transactional nature Lustre stat/getxattr have today. As for statahead helping "ls -l" performance I have some numbers in my LUG2010 presentation ( http://wiki.lustre.org/images/6/60/LUG2010_Filizetti_SMSi.pdf) about the improvements that size on metadata (SOM) adds compared to Lustre 1.8. Jeremy> > Thanks for the feedback, > > Daire > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100803/63b4ee61/attachment.html
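The stripe-count trade-off described above can be set as directory defaults with lfs, roughly as follows (1.8-era option names assumed, where -s is the stripe size and -c the stripe count; directory names are invented):

  lfs setstripe -c 8 -s 1M /mnt/lustre/big_files     # wide striping: first write waits on a lock per OST
  lfs setstripe -c 1 /mnt/lustre/small_files         # single stripe: one lock round trip over the WAN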
Hello! On Aug 3, 2010, at 10:59 PM, Jeremy Filizetti wrote:> Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.Requesting locks on all stripes would be a disaster for shared-file type of workload where a single file is accessed by a lot of clients. If you know that your app is the only one doing accesses and you would agree to do some extra actions (sure this is limited), the glimpse handler could be modified in such a way that it would grant you a write lock if your file handle was opened for write, I think we have support for this on a client already and server support is mostly trivial. So with that change you would open/create a file for write and then do fstat that would fetch you your locks in parallel (currently this would fetch read locks only, so would not help in case of writes). See bug 11715 for discussions and hopefully for patches that would eventually come. This could be done in a nicer way of course, but would be a much bigger piece of work too which would mean it''s much longer to wait too. Bye, Oleg
On Tue, Aug 3, 2010 at 11:14 PM, Oleg Drokin <oleg.drokin at oracle.com> wrote:
> Hello!
>
> On Aug 3, 2010, at 10:59 PM, Jeremy Filizetti wrote:
>
> > Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file the first RPC to each OSC requests the lock rather then requesting the lock from all OSCs when the first lock is requested. This would require some code to change but it would be another nice optimization for WAN performance. We do some work over a 200 ms RTT latency and to create a file striped across 8 OSTs with 1 MB stripes takes 1.6 seconds just to write the first 8 MBs sine the locks are synchronous operations. For a single stripe it would only take ~200 ms.
>

Our workload is primarily WORM (write once read many) so we never have shared write access. It would be nice to have a client option to enable this capability that would remain off by default. Extracting a tar can become painful when it has some small files, but we need the extra stripes for the concurrency that provides good performance for the larger files (which take more time anyway).

Jeremy

> Requesting locks on all stripes would be a disaster for shared-file type of
> workload where a single file is accessed by a lot of clients.
> If you know that your app is the only one doing accesses and you would
> agree to do some extra actions (sure this is limited), the glimpse handler
> could be modified in such a way that it would grant you a write lock if
> your file handle was opened for write, I think we have support for this
> on a client already and server support is mostly trivial.
> So with that change you would open/create a file for write and then do
> fstat that would fetch you your locks in parallel (currently this would
> fetch read locks only, so would not help in case of writes).
> See bug 11715 for discussions and hopefully for patches that would
> eventually come.
>
> This could be done in a nicer way of course, but would be a much bigger
> piece of work too which would mean it's much longer to wait too.
>
> Bye,
> Oleg
On 2010-08-03, at 12:45, Kevin Van Maren wrote:> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)?For Lustre you can do "lctl set_param ldlm.namespaces.*.lru_size=clear" will drop all the DLM locks on the clients, which will flush all pages from the cache.>> I just looked into strace of cp -fal (which I guess you meant instead of just -fa that would just copy everything). >> >> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >> >> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >> +1 RPC >> >> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPC (if no opencache) >> >> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >> getdents(3, /* 4 entries */, 4096) = 96 >> getdents(3, /* 0 entries */, 4096) = 0 >> +1 RPCHaving large readdir RPCs would help for directories with more than about 170 entries.>> close(3) = 0 >> +1 RPC (if no opencache) >> >> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >> +1 RPC >> >> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> +1 RPCIf we do the mkdir(), the client does not cache the entry?>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (should be cached, so no RPC) >> >> Then we get to files: >> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >> +1 RPC >> >> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0 >> +1 RPC >> >> then we start traversing the just created tree up and chowning it: >> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >> +1 RPC >> >> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPCThis is gone in 1.8.4>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> (not sure why another stat here, we already did it on the way up. Should be cached) >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0 >> +1 RPCStrange that it is setting an ACL when it didn''t read one?>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >> +1 RPC >> >> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >> Hm, stat again? did not we do it a few syscalls back?Gotta love those GNU file utilities. They are very stat happy.>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 >> stat of the target. +1 RPC (the cache got invalidated by link above). >> >> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0 >> +1 RPCHere it is also setting an ACL even though it didn''t get one from the source.>> So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. >> >> Also you can try disabling debug (if you did not already) to see how big of an impact that would make. 
It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved.Useful would be to run "strace -tttT" to get timestamps for each operation to see for which operations it is slower on Lustre than NFS. Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc.
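One rough way to act on the strace suggestion above and produce per-syscall totals like the ones posted further down in the thread (the awk assumes the usual "-tttT" line format and is only a sketch):

  strace -tttT -o /tmp/cp.trace cp -al /mnt/lustre/blah /mnt/lustre/blah2
  awk -F'[(<>]' 'NF > 2 { n = split($1, a, " "); t[a[n]] += $(NF-1) }
                 END { for (s in t) printf "%-12s %8.2fs\n", s, t[s] }' /tmp/cp.trace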
22492 is exist because someone from Sun/Oracle is disable dentry caching, instead of fixing xattr code for 1.8<>2.0 interoperable. He is kill my patch in revalidate dentry (instead of fixing xattr code). in that case client always send one more RPC to server. see bug 17545 for some details. On Aug 4, 2010, at 10:41, Andreas Dilger wrote:> On 2010-08-03, at 12:45, Kevin Van Maren wrote: >> Since Bug 22492 hit a lot of people, it sounds like opencache isn''t generally useful unless enabled on every node. Is there an easy way to force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)? > > For Lustre you can do "lctl set_param ldlm.namespaces.*.lru_size=clear" will drop all the DLM locks on the clients, which will flush all pages from the cache. > >>> I just looked into strace of cp -fal (which I guess you meant instead of just -fa that would just copy everything). >>> >>> so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) >>> >>> open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 >>> +1 RPC >>> >>> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> +1 RPC (if no opencache) >>> >>> fcntl(3, F_SETFD, FD_CLOEXEC) = 0 >>> getdents(3, /* 4 entries */, 4096) = 96 >>> getdents(3, /* 0 entries */, 4096) = 0 >>> +1 RPC > > Having large readdir RPCs would help for directories with more than about 170 entries. > >>> close(3) = 0 >>> +1 RPC (if no opencache) >>> >>> lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (should be cached, so no RPC) >>> >>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0 >>> +1 RPC >>> >>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> +1 RPC > > If we do the mkdir(), the client does not cache the entry? > >>> stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (should be cached, so no RPC) >>> >>> Then we get to files: >>> link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0 >>> +1 RPC >>> >>> futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0 >>> +1 RPC >>> >>> then we start traversing the just created tree up and chowning it: >>> chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0 >>> +1 RPC >>> >>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >>> +1 RPC > > This is gone in 1.8.4 > >>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> (not sure why another stat here, we already did it on the way up. Should be cached) >>> >>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0 >>> +1 RPC > > Strange that it is setting an ACL when it didn''t read one? > >>> getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available) >>> +1 RPC >>> >>> stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 >>> Hm, stat again? did not we do it a few syscalls back? > > Gotta love those GNU file utilities. They are very stat happy. > >>> stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 >>> stat of the target. +1 RPC (the cache got invalidated by link above). 
>>> >>> setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 0) = 0 >>> +1 RPC > > Here it is also setting an ACL even though it didn''t get one from the source. > >>> So I guess there is a certain number of stat RPCs that would not be present on NFS due to different ways the caching works, plus all the getxattrs. Not sure if this is enough to explain 4x rate difference. >>> >>> Also you can try disabling debug (if you did not already) to see how big of an impact that would make. It used to be that debug was affecting metadata loads a lot, though with recent debug levels adjustments I think it was somewhat improved. > > Useful would be to run "strace -tttT" to get timestamps for each operation to see for which operations it is slower on Lustre than NFS. > > Cheers, Andreas > -- > Andreas Dilger > Lustre Technical Lead > Oracle Corporation Canada Inc. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
Hello!

On Aug 4, 2010, at 3:41 AM, Andreas Dilger wrote:
>>> mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
>>> +1 RPC
>>>
>>> lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> +1 RPC
> If we do the mkdir(), the client does not cache the entry?

No. mkdir cannot return a lock so we cannot cache anything. Same for opens when open cache is disabled.

Bye,
Oleg
Oleg, On Tue, Aug 3, 2010 at 6:50 PM, Oleg Drokin <oleg.drokin at oracle.com> wrote:>> Yea that works - cheers. FYI some comparisons with a simple find on a >> remote client (~33,000 files): >> >> ?find /mnt/lustre (not cached) = 41 secs >> ?find /mnt/lustre (cached) = 19 secs >> ?find /mnt/lustre (opencache) = 3 secs > > Hm, initially I was going to say that find is not open-intensive so it should > not benefit from opencache at all. > But then I realized if you have a lot of dirs, then indeed there would be a > positive impact on subsequent reruns. > I assume that the opencache result is a second run and first run produces > same 41 seconds?Actually I assumed it would be but I guess there must be some repeat opens because the 1st run with opencache is actually better. I have also set lnet.debug=0 to show the difference that makes: find /mnt/lustre (1st run) = 35 secs find /mnt/lustre (2nd run) = 17 secs find /mnt/lustre (1st run opencache) = 23 secs find /mnt/lustre (2nd run opencache) = 0.65 secs Having lnet.debug=0 does make a difference - probably more noticeable over millions of dirs/files. BTW I think having lots of dirs on a 100TB+ filesystem is going to be a common workload for most non-lab lustre users so having a few clients able to cache opens for doing read-only scans of the filesystem would be a good performance win. BTW all my testing so far is on the LAN but I''m also thinking ahead to how all this metadata RPC traffic will work down our >200ms link to our office in Asia.> BTW, another unintended side-effect you might experience if you have mixed > opencache enabled/disabled network is if you run something (or open for write) > on an opencache-enabled client, you might have problems writing (or executing) > that file from non-opencache enabled nodes as long as the file handle > would remain cached on the client. This is because if open lock was not requested, > we don''t try to invalidate current ones (expensive) and MDS would think > the file is genuinely open for write/execution and disallow conflicting accesses > with EBUSY.Right - worth remembering.>> performance when compared to something simpler like NFS. Slightly off >> topic (and I''ve kinda asked this before) but is there a good reason >> why link() speeds in Lustre are so slow compare to something like NFS? >> A quick comparison of doing a "cp -al" from a remote Lustre client and >> an NFS client (to a fast NFS server): > > Hm, this is a first complaint about this that I hear. > I just looked into strace of cp -fal (which I guess you mant instead of just -fa that > would just copy everything). > > so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) > > open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 > +1 RPC > > SNIP! > > setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x0 > 0\x00\x00", 4, 0) = 0 > +1 RPC > > So I guess there is a certain number of stat RPCs that would not be present on NFS > due to different ways the caching works, plus all the getxattrs. Not sure if this > is enough to explain 4x rate difference.Thanks for breaking down the strace for me - it is interesting to see where the RPCs are coming from. Andreas suggested getting timing info from the strace of "cp -al" which I''ve done for Lustre and a NFS server (bluearc) while hardlinking some kernel source directories. 
I added up the time (seconds) that all the syscalls in a run took with opencache and the lustre 1.8.4 client:

  syscall     lustre   nfs
  --------------------------
  stat        7s       0.01s
  lstat       36s      7s
  link        29s      16s
  getxattr    5s       0.29s
  setxattr    30s      0.25s
  open        1s       2s
  mkdir       6s       3s
  lchown      11s      2s
  futimesat   11s      2s

It doesn't quite explain the 4:1 speed difference, but the (l)stat-heavy "cp -la" is consistently that much faster on NFS. Is the NFS server so much faster for get/setxattr because it returns "EOPNOTSUPP" for setxattr? Can we do something similar for the Lustre client if we don't care about extended attributes? The link() times are still almost twice as slow on Lustre though - that may be related to a slowish (test) MDT disk. Like Andreas said, I don't understand why there is a setxattr RPC when we didn't get any data from getxattr, but that is probably more down to "cp" than Lustre?

I also did a quick test using "rsync" to do the hardlinking (which noticeably doesn't use get/setxattr) and now the difference in speed is more like 2:1. The way rsync launches a couple of instances (sender/receiver) probably helps parallelise things better for Lustre. In this case the overall link() time is similar but the overall lstat() time is 3x slower for Lustre (67 secs/22 secs). The lstat() calls should all be coming from the client cache, but each one seems to be consistently 3x slower than for NFS for some reason.

Cheers,

Daire
Hello! On Aug 4, 2010, at 2:04 PM, Daire Byrne wrote:>> Hm, initially I was going to say that find is not open-intensive so it should >> not benefit from opencache at all. >> But then I realized if you have a lot of dirs, then indeed there would be a >> positive impact on subsequent reruns. >> I assume that the opencache result is a second run and first run produces >> same 41 seconds? > Actually I assumed it would be but I guess there must be some repeat > opens because the 1st run with opencache is actually better. I haveopen followed by stat would also benefit from opencache by removing one RPC for stat.> > syscall lustre nfs > -------------------------- > stat 7s 0.01s > lstat 36s 7s > link 29s 16s > getxattr 5s 0.29s > setxattr 30s 0.25s > open 1s 2s > mkdir 6s 3s > lchown 11s 2s > futimesat 11s 2sHm. That''s interesting. And this is over a high latency link, is it? Was this also with debug disabled? I don''t think lstat is any much different than stat if the target is not symlink. I wonder if most of the difference with lstat comes from the fact that for us lstat is rpc (mostly used after opens or readdirs plus fetches attrs from OSTs too) where as for NFS not only they cache data, their readdirplus is better than statahead because it fetches all file info including size and times, where as statahead confusingly does not caches stat information, only what is available on MDS. I had a stab at patch to fetch OST data in parallel too, but that turned out to be not all that trivial and never worked completely correctly. Might be I need to take another look at it after Johann revamps request sets logic a bit to make adding requests to sets easier.> It doesn''t quite explain the 4:1 speed difference but the (l)stat > heavy "cp -la" is consistently that much faster on NFS. Is the NFS > server so much faster for get/setxattr because it returns "EOPNOTSUPP" > for setxattr? Can we do something similar for the Lustre client if weIf it does return EOPNOTSUPP on the client side then there is no RPC and the reply is instant. For lustre it is an RPC roundtrip which is not exactly cheap.> don''t care about extended attributes? The link() times are still > almost twice as slow on Lustre though - that may be related to a > slowish (test) MDT disk. Like Andreas said I don''t understand whyThere is some more work for link in case of lustre like rep-ack (extra confirmation from client to server that it got the link reply), same with mkdir. I am not sure why such a big difference with chown and time update, though actually I now realise we need to talk to OSTs to update ownership and times there as well which adds up even though it should be sent in parallel.> there is an setxattr RPC when we didn''t get any data from getxattr but > that is probably more down to "cp" than lustre?Yes, I think this is more about cp, you can see nfs also has setxattr attempts. Bye, Oleg
That is a fairly trivial patch. 2.0 uses CR mode to avoid an extra lock cancel on xattr update, while 1.8 uses EX mode for any xattr update. But the real world needs both modes :-) because the client gets the ACL (part of the xattr) with any getattr/open RPC, so the MDS MUST cancel that lock on any conflicting update (because the ACL is used for VFS permission checks). Currently that is "fixed" via an extra RPC from the revalidate_dentry call.

In other words, the MDS needs to be changed to use CR mode with the UPDATE bit for xattr changes, and EX mode with the LOOKUP + UPDATE bits for ACL updates.

Another bug in that area: the client always shows the LOV EA in the xattr list and sends an extra RPC to the MDS to read the LOV EA from disk, but that could easily be generated on the client (via the obd_pack_md function) because the client always has the LOV EA locally. That would save 2 RPCs in some cases (the first to get the LOV EA size, the second to get the LOV EA itself).

On Aug 4, 2010, at 23:59, Andreas Dilger wrote:
> On 2010-08-04, at 01:57, Alexey Lyashkov wrote:
>> 22492 exists because someone from Sun/Oracle disabled dentry caching, instead of fixing the xattr code for 1.8<>2.0 interoperability.
>> He killed my patch in revalidate dentry (instead of fixing the xattr code).
>> In that case the client always sends one more RPC to the server.
>> See bug 17545 for some details.
>
> Patches to "fix xattr interoperability" are welcome.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
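As a concrete illustration of the LOV EA point above (the file name is invented; getfattr comes from the attr package):

  getfattr -d -m - /mnt/lustre/somefile    # listing all xattrs shows lustre.lov and triggers the extra RPCs described above
  lfs getstripe /mnt/lustre/somefile       # the striping information that the LOV EA encodes, via the lfs interface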