Hi,

Due to some experiments with dcache-related code we've been doing with shadow
and others, it became clear that the statahead code is quite complicated,
probably for no good reason. The hardest part to follow is its interaction
with the dcache. The feature does a number of complex things and makes other
parts (like ll_lookup_it()) harder to follow too.

After a number of discussions with people we'd like to share our vision of
the feature and propose a slightly different solution.

We think statahead should have nothing to do with the dcache. It's about
inodes and attributes only, so it would be good to decouple it from the
dcache. The only things statahead should do are:
1) detect that statahead is needed (policy, out of scope for this message)
2) scan part of the directory (probably using readdir(), skipping RPCs)
3) find/create inodes for the found fids
4) lock these inodes (note we propose to use inodes as the serialization
   point so that lockless getattr can be used)
5) issue getattr RPCs (probably lockless)
6) unlock the inodes upon getattr completion

Then, when stat(2) is called, it first has to look up the fid by name. For
this we can use the page cache just filled by MDS_READDIR. If the directory
isn't being modified at the time, the entries will be there and we can create
dentries in the dcache. They will be valid until the UPDATE lock is cancelled;
not even a LOOKUP lock is needed.

Another possible optimization is lockless getattr. Given that most supported
kernels don't pass an intent to ->getattr(), stat(2) may need two RPCs: one in
ll_lookup_it() and another in ll_getattr_it(), as the lock is released between
them. stat(2) gives no guarantee about attributes; it gives a snapshot of
them. Attributes can change right before the userspace application gets them.
So why don't we introduce some simple mechanism making attributes valid for a
short time, at least for the process that executed the lookup? This could help
statahead as well, we think.

Comments? Suggestions?

thanks, Alex
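[Illustration: a minimal user-space sketch of steps 2-6 above, not Lustre
code. readdir(3) stands in for scanning the MDS_READDIR page cache, a worker
thread calling stat(2) stands in for the async getattr RPC, and a POSIX
semaphore stands in for the per-inode lock that serializes the prefetch
against a later stat(2) caller. All names (sa_entry, sa_getattr, ...) are
made up for this sketch.]

/*
 * Model of the proposed statahead pipeline: scan, create entries,
 * "lock" them, prefetch attributes asynchronously, "unlock" on
 * completion.  This is a sketch, not client code.
 */
#include <sys/stat.h>
#include <dirent.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define SA_MAX 32

struct sa_entry {
    char        name[256];
    sem_t       done;     /* models the inode lock: posted on completion */
    struct stat attrs;    /* models attributes filled by the getattr RPC */
    int         valid;
};

static void *sa_getattr(void *arg)          /* step 5: "issue getattr RPC" */
{
    struct sa_entry *e = arg;

    e->valid = (stat(e->name, &e->attrs) == 0);
    sem_post(&e->done);                     /* step 6: "unlock" on completion */
    return NULL;
}

int main(int argc, char **argv)
{
    struct sa_entry ents[SA_MAX];
    pthread_t tids[SA_MAX];
    struct dirent *de;
    DIR *d;
    int n = 0, i;

    if (argc > 1 && chdir(argv[1]) != 0)
        return 1;
    d = opendir(".");
    if (d == NULL)
        return 1;

    while (n < SA_MAX && (de = readdir(d)) != NULL) {       /* step 2 */
        if (de->d_name[0] == '.')
            continue;
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        sem_init(&ents[n].done, 0, 0);                      /* steps 3+4 */
        pthread_create(&tids[n], NULL, sa_getattr, &ents[n]);
        n++;
    }
    closedir(d);

    /* a later stat(2) caller blocks here until the prefetch has landed */
    for (i = 0; i < n; i++) {
        sem_wait(&ents[i].done);
        pthread_join(tids[i], NULL);
        if (ents[i].valid)
            printf("%-20s %lld bytes\n", ents[i].name,
                   (long long)ents[i].attrs.st_size);
    }
    return 0;
}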
On Thu, 2008-07-24 at 14:19 +0400, Alex Zhuravlev wrote:
> We think statahead should have nothing to do with the dcache. It's about
> inodes and attributes only, so it would be good to decouple it from the
> dcache. The only things statahead should do are:
> 1) detect that statahead is needed (policy, out of scope for this message)
> 2) scan part of the directory (probably using readdir(), skipping RPCs)
> 3) find/create inodes for the found fids
> 4) lock these inodes (note we propose to use inodes as the serialization
>    point so that lockless getattr can be used)
> 5) issue getattr RPCs (probably lockless)
> 6) unlock the inodes upon getattr completion

Another point of view: statahead never creates dentries or inodes. This
function (or thread) only sends all the needed lock enqueues to the MDS and
puts the granted locks into the cache. The user thread will call ->lookup()
and will sleep until the lock is granted, inside mdc_enqueue(), so no new
wait queue is needed.

In this case statahead becomes very simple: call ->readdir() in the
background; in the fill callback skip dot/dotdot entries, skip hidden files
if they are not needed, and possibly apply other filters. As a result a new
RPC is added to ptlrpcd or to its own set, with a very simple interpret
callback that only processes the granted lock.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
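[Illustration: a tiny sketch of the entry filter such a fill callback could
apply. The function name and the "skip hidden" policy flag are hypothetical,
introduced only for this example, not Lustre API.]

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* decide whether the fill callback should skip a directory entry */
static bool sa_skip_entry(const char *name, bool skip_hidden)
{
    /* always skip "." and ".." */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return true;
    /* optionally skip hidden (dot) files */
    if (skip_hidden && name[0] == '.')
        return true;
    return false;
}

int main(void)
{
    const char *names[] = { ".", "..", ".hidden", "README", "data.0" };
    unsigned i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        printf("%-10s %s\n", names[i],
               sa_skip_entry(names[i], true) ? "skip" : "prefetch");
    return 0;
}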
I strongly agree with this. A good way to verify that we have a favorable
implementation is to see whether it can be ported to other OSes.

Peter

On 7/24/08 4:19 AM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote:
> [...]
It's actually a very interesting point: how much of Lustre's internals should
we expose outside? I mean the flexibility vs. complexity trade-off in the
context of a portable API.

thanks, Alex

Peter Braam wrote:
> I strongly agree with this. A good way to verify that we have a favorable
> implementation is to see whether it can be ported to other OSes.
> [...]
Alex Zhuravlev wrote:
> [...]
> Another possible optimization is lockless getattr. Given that most supported
> kernels don't pass an intent to ->getattr(), stat(2) may need two RPCs: one
> in ll_lookup_it() and another in ll_getattr_it(), as the lock is released
> between them. stat(2) gives no guarantee about attributes; it gives a
> snapshot of them. Attributes can change right before the userspace
> application gets them. So why don't we introduce some simple mechanism
> making attributes valid for a short time, at least for the process that
> executed the lookup? This could help statahead as well, we think.

Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs be
lockless or not? If not, how do we distinguish them from the plain stat(2)
case? By intent or something else? Statahead is based on a forecast, which may
be wrong. We should guarantee that the proper lock(s) are held even if the
statahead guess is wrong.

Another optimization (maybe) to be considered is whether it is necessary to
start one statahead thread for each "ls -l" operation. As Nikita said, maybe
we can use a single thread for all statahead.

Thanks!
--
Fan Yong
Yong Fan wrote:
> Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs
> be lockless or not? If not, how do we distinguish them from the plain
> stat(2) case? By intent or something else? Statahead is based on a forecast,
> which may be wrong. We should guarantee that the proper lock(s) are held
> even if the statahead guess is wrong.

No, I don't think we need the lock always: you just did a getattr and filled
the inode with fresh attributes; then you release the lock, somebody changes
the attributes on the server, and only then does the userspace application
get the attributes (already non-fresh). So what does the lock give you in
this case?

> Another optimization (maybe) to be considered is whether it is necessary to
> start one statahead thread for each "ls -l" operation. As Nikita said, maybe
> we can use a single thread for all statahead.

I'm not sure we need any statahead thread at all. What's wrong with issuing a
number of async RPCs from ll_getattr_it()? This way the user's application
drives statahead directly: each time stat(2) is called you tune the statahead
window and send a few more RPCs, like data readahead does.

thanks, Alex
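[Illustration: a toy model of the readahead-style window tuning suggested
above, not Lustre code. sa_issue() stands in for sending an async getattr
RPC; the names and the growth policy are made up for this sketch.]

#include <stdio.h>

#define SA_ENTRIES    100   /* entries in the directory being listed */
#define SA_WINDOW_MAX  32

static int sa_next;         /* next directory entry to prefetch */

/* stands in for "send an async getattr RPC for entry idx" */
static void sa_issue(int idx)
{
    printf("  prefetch entry %d\n", idx);
}

/* called from the stat(2) path (think ll_getattr_it()) for entry 'idx' */
static void sa_on_stat(int idx, int *window)
{
    /* the consumer is catching up with the prefetched entries: grow the
     * window and issue a few more RPCs, like data readahead does */
    if (sa_next - idx <= *window / 2) {
        if (*window < SA_WINDOW_MAX)
            *window = *window ? *window * 2 : 2;
        printf("stat(%d): statahead window is now %d\n", idx, *window);
        while (sa_next < idx + *window && sa_next < SA_ENTRIES)
            sa_issue(sa_next++);
    }
}

int main(void)
{
    int window = 0;
    int i;

    /* model "ls -l" calling stat(2) on the first 20 entries in order */
    for (i = 0; i < 20; i++)
        sa_on_stat(i, &window);
    return 0;
}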
Yong Fan wrote:
> Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs
> be lockless or not?

Forgot to answer this one... I believe it's to be decided by the server, which
knows how contended a given inode is. We do the same for i_size in the glimpse
path.

thanks, Alex
On Thu, Jul 24, 2008 at 9:41 PM, Alex Zhuravlev <Alex.Zhuravlev at sun.com> wrote:
> No, I don't think we need the lock always: you just did a getattr and filled
> the inode with fresh attributes; then you release the lock, somebody changes
> the attributes on the server, and only then does the userspace application
> get the attributes (already non-fresh). So what does the lock give you in
> this case?

If I have an application that does:
1. Update stat data on one node.
2. Barrier
3. Read the same stat data on a different node

I should be guaranteed to get the correct data, right? So stat should return
data that was valid either when the system call started, when it ended, or
somewhere in between (not data that was invalid even before the system call
started).

--
Ragnar Kjærstad
Ragnar Kjærstad wrote:
> If I have an application that does:
> 1. Update stat data on one node.
> 2. Barrier
> 3. Read the same stat data on a different node
>
> I should be guaranteed to get the correct data, right? So stat should return
> data that was valid either when the system call started, when it ended, or
> somewhere in between (not data that was invalid even before the system call
> started).

Indeed, we can't use this trick in all cases. I think it's a subject for
discussion what we consider the barrier: for a regular stat(2) it can be the
start of the system call; for the "ls -l" case it can be open(2). Isn't this
a reasonable cost for a significant performance improvement of "ls -l"? My
understanding is that for NFS with readdir+ support, the barrier is open(2).

Thanks for this good point,
Alex
On Fri, 2008-07-25 at 08:41 +0400, Alex Zhuravlev wrote:
> > Another optimization (maybe) to be considered is whether it is necessary
> > to start one statahead thread for each "ls -l" operation. As Nikita said,
> > maybe we can use a single thread for all statahead.
>
> I'm not sure we need any statahead thread at all. What's wrong with issuing
> a number of async RPCs from ll_getattr_it()? This way the user's application
> drives statahead directly: each time stat(2) is called you tune the
> statahead window and send a few more RPCs, like data readahead does.

No statahead thread is needed. As Alex says, if we have an UPDATE lock on the
parent, we can create a valid dentry without a lookup and attach a new locked
inode to it.

This looks easy to implement: call ll_readdir() with i_mutex held and pass a
custom fill callback. In the callback we allocate a new dentry plus an empty
locked inode (or attach the new dentry to an existing inode if we only have
the UPDATE lock; ll_find_alias()?), and submit an async getattr RPC to
ptlrpcd if needed. The RPC completion callback sets the md_lock on the inode
and unlocks the inode. ll_getattr_it() should also call wait_on_inode() to be
sure the statahead has finished.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
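[Illustration: a user-space model of the synchronization described above, not
Lustre code. The "locked" flag plus condvar stand in for the locked
placeholder inode; sa_complete() plays the role of the RPC completion
callback, and sa_wait() plays the role of wait_on_inode() in ll_getattr_it().
All names here are invented for the sketch.]

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct sa_inode {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int             locked;   /* 1 while the async getattr is in flight */
    long            size;     /* attribute filled by the "RPC" */
};

static void sa_complete(struct sa_inode *i, long size) /* RPC interpret cb */
{
    pthread_mutex_lock(&i->mtx);
    i->size = size;
    i->locked = 0;                        /* "unlock inode" */
    pthread_cond_broadcast(&i->cond);
    pthread_mutex_unlock(&i->mtx);
}

static void sa_wait(struct sa_inode *i)   /* models wait_on_inode() */
{
    pthread_mutex_lock(&i->mtx);
    while (i->locked)
        pthread_cond_wait(&i->cond, &i->mtx);
    pthread_mutex_unlock(&i->mtx);
}

static void *async_getattr(void *arg)     /* models the ptlrpcd-side work */
{
    usleep(1000);                         /* pretend the RPC takes a while */
    sa_complete(arg, 4096);
    return NULL;
}

int main(void)
{
    struct sa_inode ino = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, 1, 0 };
    pthread_t tid;

    /* fill callback: placeholder created locked, RPC submitted async */
    pthread_create(&tid, NULL, async_getattr, &ino);

    /* later, the stat(2) path blocks until the prefetch has completed */
    sa_wait(&ino);
    printf("getattr done, size = %ld\n", ino.size);

    pthread_join(tid, NULL);
    return 0;
}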
Alex Lyashkov wrote:
> This looks easy to implement: call ll_readdir() with i_mutex held and pass a
> custom fill callback. In the callback we allocate a new dentry plus an empty
> locked inode (or attach the new dentry to an existing inode if we only have
> the UPDATE lock; ll_find_alias()?), and submit an async getattr RPC to
> ptlrpcd if needed. The RPC completion callback sets the md_lock on the inode
> and unlocks the inode. ll_getattr_it() should also call wait_on_inode() to
> be sure the statahead has finished.

I think it'd be better to get common infrastructure into ll_lookup_it() that
can resolve a name to a fid from readdir's cache. Having that in the lookup
path means any syscall can benefit from it, not only statahead. For example,
we could do open-by-fid on the MDS and the MDS wouldn't need to resolve the
name itself. This infrastructure could probably also be used for WBC.

thanks, Alex
On Jul 25, 2008 12:37 +0300, Alexey Lyashkov wrote:
> On Fri, 2008-07-25 at 08:41 +0400, Alex Zhuravlev wrote:
> > I'm not sure we need any statahead thread at all. What's wrong with
> > issuing a number of async RPCs from ll_getattr_it()? This way the user's
> > application drives statahead directly: each time stat(2) is called you
> > tune the statahead window and send a few more RPCs, like data readahead
> > does.

I think it is very useful to consider these "lockless MDT getattr" RPCs as
"glimpse" requests. That idea has served us very well for OST attributes, and
I think the same would be true for MDT attributes. We should always send
these getattr requests with a special flag (is a "getattr intent" enough, or
do we need a different flag?). Ideally this DLM RPC should work the same way
for both OST glimpses and MDT glimpses: the client can optionally be granted
a lock if the inode has not been changed recently, but gets only the
attributes returned if the lock is busy.

> No statahead thread is needed. As Alex says, if we have an UPDATE lock on
> the parent, we can create a valid dentry without a lookup and attach a new
> locked inode to it.
>
> This looks easy to implement: call ll_readdir() with i_mutex held and pass
> a custom fill callback.

One problem with this idea is that we cannot do the callbacks in the context
of the ptlrpcd thread (see bug 15927).

> In the callback we allocate a new dentry plus an empty locked inode (or
> attach the new dentry to an existing inode if we only have the UPDATE lock;
> ll_find_alias()?), and submit an async getattr RPC to ptlrpcd if needed.
> The RPC completion callback sets the md_lock on the inode and unlocks the
> inode. ll_getattr_it() should also call wait_on_inode() to be sure the
> statahead has finished.

IMHO it would also be useful to start the OST glimpse in the callback as soon
as the MDT "glimpse" is returned. Without a change like this we are limited
to a speedup of <= 2x from statahead until size-on-MDS is done. That is
because we are hiding the MDT RPC latency, but still have to wait for the
parallel OST RPC latency, so:

    speedup = (old time / new time) = (MDT RPC + max(OST RPC)) / max(OST RPC)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
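[Illustration: a quick numeric example of the bound above; the RPC latencies
are made-up example values, not measurements.]

#include <stdio.h>

int main(void)
{
    double mdt_rpc = 1.0;   /* assumed MDT getattr latency, ms */
    double ost_rpc = 1.0;   /* assumed max OST glimpse latency, ms */

    /* statahead hides the MDT latency, but the OST glimpse wait remains */
    double speedup = (mdt_rpc + ost_rpc) / ost_rpc;

    printf("speedup = %.1fx\n", speedup);   /* 2.0x when the latencies match */

    /* if the OST glimpse is also prefetched from the callback, both
     * latencies are hidden and the per-entry wait largely disappears */
    return 0;
}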
Andreas Dilger wrote:
> IMHO it would also be useful to start the OST glimpse in the callback as
> soon as the MDT "glimpse" is returned. Without a change like this we are
> limited to a speedup of <= 2x from statahead until size-on-MDS is done.
> That is because we are hiding the MDT RPC latency, but still have to wait
> for the parallel OST RPC latency, so:
>
>     speedup = (old time / new time) = (MDT RPC + max(OST RPC)) / max(OST RPC)

Definitely.

thanks, Alex
On Jul 24, 2008 22:04 -0700, Ragnar Kjærstad wrote:
> If I have an application that does:
> 1. Update stat data on one node.
> 2. Barrier
> 3. Read the same stat data on a different node
>
> I should be guaranteed to get the correct data, right? So stat should return
> data that was valid either when the system call started, when it ended, or
> somewhere in between (not data that was invalid even before the system call
> started).

That depends on which stat data was changed. If it was e.g. atime, then there
is no guarantee that the other node will get an updated copy. If a file write
updates mtime, there is no guarantee that other clients will get the updated
mtime if they already hold a file lock.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.