Hi,

Due to some experiments with dcache-related code we've been doing with shadow
and others, it became clear that the statahead code is quite complicated,
probably for no good reason. The hardest part to follow is its interaction
with the dcache. The feature does a number of complex things and makes other
parts (like ll_lookup_it()) harder to follow too.

After a number of discussions with people we'd like to share our vision of
the feature and propose a slightly different solution.

We think statahead should have nothing to do with the dcache. It's about
inodes and attributes only, so it would be good to decouple it from the
dcache. The only things statahead should do are:
1) detect that statahead is needed (policy, out of scope for this message)
2) scan part of the directory (probably using readdir(), skipping RPCs)
3) find/create inodes for the found fids
4) lock these inodes (note we propose to use inodes as the serialization
   point so that lockless getattr can be used)
5) issue getattr RPCs (probably lockless)
6) unlock the inodes upon getattr completion

Then, when stat(2) is called, it first has to look up the fid by name. For
this we can use the page cache just filled by MDS_READDIR. If the directory
isn't being modified at the time, the entries will be there and we can create
dentries in the dcache. They will be valid until the UPDATE lock is cancelled;
not even a LOOKUP lock is needed.

Another possible optimization is lockless getattr. Given that most supported
kernels don't pass an intent to ->getattr(), stat(2) may need two RPCs: one in
ll_lookup_it() and another in ll_getattr_it(), as the lock is released between
them. stat(2) gives no guarantee about attributes; it gives a snapshot of
them. Attributes can change right before the userspace application gets them.
So why don't we introduce some simple mechanism making attributes valid for a
short time, at least for the process that executed the lookup? This could help
statahead as well, we think.

Comments? Suggestions?

thanks, Alex
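[Illustration: a minimal user-space sketch of steps 2-6 above, not Lustre
code. readdir(3) stands in for scanning the MDS_READDIR page cache, a worker
thread calling stat(2) stands in for the async getattr RPC, and a POSIX
semaphore stands in for the per-inode lock that serializes the prefetch
against a later stat(2) caller. All names (sa_entry, sa_getattr, ...) are
made up for this sketch.]

/*
 * Model of the proposed statahead pipeline: scan, create entries,
 * "lock" them, prefetch attributes asynchronously, "unlock" on
 * completion.  This is a sketch, not client code.
 */
#include <sys/stat.h>
#include <dirent.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define SA_MAX 32

struct sa_entry {
    char        name[256];
    sem_t       done;     /* models the inode lock: posted on completion */
    struct stat attrs;    /* models attributes filled by the getattr RPC */
    int         valid;
};

static void *sa_getattr(void *arg)          /* step 5: "issue getattr RPC" */
{
    struct sa_entry *e = arg;

    e->valid = (stat(e->name, &e->attrs) == 0);
    sem_post(&e->done);                     /* step 6: "unlock" on completion */
    return NULL;
}

int main(int argc, char **argv)
{
    struct sa_entry ents[SA_MAX];
    pthread_t tids[SA_MAX];
    struct dirent *de;
    DIR *d;
    int n = 0, i;

    if (argc > 1 && chdir(argv[1]) != 0)
        return 1;
    d = opendir(".");
    if (d == NULL)
        return 1;

    while (n < SA_MAX && (de = readdir(d)) != NULL) {       /* step 2 */
        if (de->d_name[0] == '.')
            continue;
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        sem_init(&ents[n].done, 0, 0);                      /* steps 3+4 */
        pthread_create(&tids[n], NULL, sa_getattr, &ents[n]);
        n++;
    }
    closedir(d);

    /* a later stat(2) caller blocks here until the prefetch has landed */
    for (i = 0; i < n; i++) {
        sem_wait(&ents[i].done);
        pthread_join(tids[i], NULL);
        if (ents[i].valid)
            printf("%-20s %lld bytes\n", ents[i].name,
                   (long long)ents[i].attrs.st_size);
    }
    return 0;
}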
On Thu, 2008-07-24 at 14:19 +0400, Alex Zhuravlev wrote:
> We think statahead should have nothing to do with the dcache. It's about
> inodes and attributes only, so it would be good to decouple it from the
> dcache. The only things statahead should do are:
> 1) detect that statahead is needed (policy, out of scope for this message)
> 2) scan part of the directory (probably using readdir(), skipping RPCs)
> 3) find/create inodes for the found fids
> 4) lock these inodes (note we propose to use inodes as the serialization
>    point so that lockless getattr can be used)
> 5) issue getattr RPCs (probably lockless)
> 6) unlock the inodes upon getattr completion

Another point of view: statahead never creates dentries or inodes. This
function (or thread) only sends all the needed lock enqueues to the MDS and
puts the granted locks into the cache. The user thread will call ->lookup()
and will sleep until the lock is granted, inside mdc_enqueue(), so no new
wait queue is needed.

In this case statahead becomes very simple: call ->readdir() in the
background; in the fill callback skip dot/dotdot entries, skip hidden files
if they are not needed, and possibly apply other filters. As a result a new
RPC is added to ptlrpcd or to its own set, with a very simple interpret
callback that only processes the granted lock.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
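[Illustration: a tiny sketch of the entry filter such a fill callback could
apply. The function name and the "skip hidden" policy flag are hypothetical,
introduced only for this example, not Lustre API.]

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* decide whether the fill callback should skip a directory entry */
static bool sa_skip_entry(const char *name, bool skip_hidden)
{
    /* always skip "." and ".." */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return true;
    /* optionally skip hidden (dot) files */
    if (skip_hidden && name[0] == '.')
        return true;
    return false;
}

int main(void)
{
    const char *names[] = { ".", "..", ".hidden", "README", "data.0" };
    unsigned i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        printf("%-10s %s\n", names[i],
               sa_skip_entry(names[i], true) ? "skip" : "prefetch");
    return 0;
}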
I strongly agree with this. A good way to verify that we have a favorable
implementation is to see whether it can be ported to other OSes.

Peter

On 7/24/08 4:19 AM, "Alex Zhuravlev" <Alex.Zhuravlev at Sun.COM> wrote:
> [...]
It's actually a very interesting point: how much of Lustre's internals should
we expose outside? I mean the flexibility vs. complexity trade-off in the
context of a portable API.

thanks, Alex

Peter Braam wrote:
> I strongly agree with this. A good way to verify that we have a favorable
> implementation is to see whether it can be ported to other OSes.
> [...]
Alex Zhuravlev wrote:
> [...]
> Another possible optimization is lockless getattr. Given that most supported
> kernels don't pass an intent to ->getattr(), stat(2) may need two RPCs: one
> in ll_lookup_it() and another in ll_getattr_it(), as the lock is released
> between them. stat(2) gives no guarantee about attributes; it gives a
> snapshot of them. Attributes can change right before the userspace
> application gets them. So why don't we introduce some simple mechanism
> making attributes valid for a short time, at least for the process that
> executed the lookup? This could help statahead as well, we think.

Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs be
lockless or not? If not, how do we distinguish them from the plain stat(2)
case? By intent or something else? Statahead is based on a forecast, which may
be wrong. We should guarantee that the proper lock(s) are held even if the
statahead guess is wrong.

Another optimization (maybe) to be considered is whether it is necessary to
start one statahead thread for each "ls -l" operation. As Nikita said, maybe
we can use a single thread for all statahead.

Thanks!
--
Fan Yong
Yong Fan wrote:
> Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs
> be lockless or not? If not, how do we distinguish them from the plain
> stat(2) case? By intent or something else? Statahead is based on a forecast,
> which may be wrong. We should guarantee that the proper lock(s) are held
> even if the statahead guess is wrong.

No, I don't think we need the lock always: you just did a getattr and filled
the inode with fresh attributes; then you release the lock, somebody changes
the attributes on the server, and only then does the userspace application
get the attributes (already non-fresh). So what does the lock give you in
this case?

> Another optimization (maybe) to be considered is whether it is necessary to
> start one statahead thread for each "ls -l" operation. As Nikita said, maybe
> we can use a single thread for all statahead.

I'm not sure we need any statahead thread at all. What's wrong with issuing a
number of async RPCs from ll_getattr_it()? This way the user's application
drives statahead directly: each time stat(2) is called you tune the statahead
window and send a few more RPCs, like data readahead does.

thanks, Alex
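[Illustration: a toy model of the readahead-style window tuning suggested
above, not Lustre code. sa_issue() stands in for sending an async getattr
RPC; the names and the growth policy are made up for this sketch.]

#include <stdio.h>

#define SA_ENTRIES    100   /* entries in the directory being listed */
#define SA_WINDOW_MAX  32

static int sa_next;         /* next directory entry to prefetch */

/* stands in for "send an async getattr RPC for entry idx" */
static void sa_issue(int idx)
{
    printf("  prefetch entry %d\n", idx);
}

/* called from the stat(2) path (think ll_getattr_it()) for entry 'idx' */
static void sa_on_stat(int idx, int *window)
{
    /* the consumer is catching up with the prefetched entries: grow the
     * window and issue a few more RPCs, like data readahead does */
    if (sa_next - idx <= *window / 2) {
        if (*window < SA_WINDOW_MAX)
            *window = *window ? *window * 2 : 2;
        printf("stat(%d): statahead window is now %d\n", idx, *window);
        while (sa_next < idx + *window && sa_next < SA_ENTRIES)
            sa_issue(sa_next++);
    }
}

int main(void)
{
    int window = 0;
    int i;

    /* model "ls -l" calling stat(2) on the first 20 entries in order */
    for (i = 0; i < 20; i++)
        sa_on_stat(i, &window);
    return 0;
}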
Yong Fan wrote:
> Lockless getattr RPC for "ls -l" is a good idea. But will all getattr RPCs
> be lockless or not?

Forgot to answer this one... I believe it's to be decided by the server, which
knows how contended a given inode is. We do the same for i_size in the glimpse
path.

thanks, Alex
On Thu, Jul 24, 2008 at 9:41 PM, Alex Zhuravlev <Alex.Zhuravlev at sun.com> wrote:
> No, I don't think we need the lock always: you just did a getattr and filled
> the inode with fresh attributes; then you release the lock, somebody changes
> the attributes on the server, and only then does the userspace application
> get the attributes (already non-fresh). So what does the lock give you in
> this case?

If I have an application that does:
1. Update stat data on one node.
2. Barrier
3. Read the same stat data on a different node

I should be guaranteed to get the correct data, right? So stat should return
data that was valid either when the system call started, when it ended, or
somewhere in between (not data that was invalid even before the system call
started).

--
Ragnar Kjærstad
Ragnar Kjærstad wrote:
> If I have an application that does:
> 1. Update stat data on one node.
> 2. Barrier
> 3. Read the same stat data on a different node
>
> I should be guaranteed to get the correct data, right? So stat should return
> data that was valid either when the system call started, when it ended, or
> somewhere in between (not data that was invalid even before the system call
> started).

Indeed, we can't use this trick in all cases. I think it's a subject for
discussion what we consider the barrier: for a regular stat(2) it can be the
start of the system call; for the "ls -l" case it can be open(2). Isn't this
a reasonable cost for a significant performance improvement of "ls -l"? My
understanding is that for NFS with readdir+ support, the barrier is open(2).

Thanks for this good point,
Alex
On Fri, 2008-07-25 at 08:41 +0400, Alex Zhuravlev wrote:
> > Another optimization (maybe) to be considered is whether it is necessary
> > to start one statahead thread for each "ls -l" operation. As Nikita said,
> > maybe we can use a single thread for all statahead.
>
> I'm not sure we need any statahead thread at all. What's wrong with issuing
> a number of async RPCs from ll_getattr_it()? This way the user's application
> drives statahead directly: each time stat(2) is called you tune the
> statahead window and send a few more RPCs, like data readahead does.

No statahead thread is needed. As Alex says, if we have an UPDATE lock on the
parent, we can create a valid dentry without a lookup and attach a new locked
inode to it.

This looks easy to implement: call ll_readdir() with i_mutex held and pass a
custom fill callback. In the callback we allocate a new dentry plus an empty
locked inode (or attach the new dentry to an existing inode if we only have
the UPDATE lock; ll_find_alias()?), and submit an async getattr RPC to
ptlrpcd if needed. The RPC completion callback sets the md_lock on the inode
and unlocks the inode. ll_getattr_it() should also call wait_on_inode() to be
sure the statahead has finished.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
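[Illustration: a user-space model of the synchronization described above, not
Lustre code. The "locked" flag plus condvar stand in for the locked
placeholder inode; sa_complete() plays the role of the RPC completion
callback, and sa_wait() plays the role of wait_on_inode() in ll_getattr_it().
All names here are invented for the sketch.]

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct sa_inode {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int             locked;   /* 1 while the async getattr is in flight */
    long            size;     /* attribute filled by the "RPC" */
};

static void sa_complete(struct sa_inode *i, long size) /* RPC interpret cb */
{
    pthread_mutex_lock(&i->mtx);
    i->size = size;
    i->locked = 0;                        /* "unlock inode" */
    pthread_cond_broadcast(&i->cond);
    pthread_mutex_unlock(&i->mtx);
}

static void sa_wait(struct sa_inode *i)   /* models wait_on_inode() */
{
    pthread_mutex_lock(&i->mtx);
    while (i->locked)
        pthread_cond_wait(&i->cond, &i->mtx);
    pthread_mutex_unlock(&i->mtx);
}

static void *async_getattr(void *arg)     /* models the ptlrpcd-side work */
{
    usleep(1000);                         /* pretend the RPC takes a while */
    sa_complete(arg, 4096);
    return NULL;
}

int main(void)
{
    struct sa_inode ino = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, 1, 0 };
    pthread_t tid;

    /* fill callback: placeholder created locked, RPC submitted async */
    pthread_create(&tid, NULL, async_getattr, &ino);

    /* later, the stat(2) path blocks until the prefetch has completed */
    sa_wait(&ino);
    printf("getattr done, size = %ld\n", ino.size);

    pthread_join(tid, NULL);
    return 0;
}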
Alex Lyashkov wrote:
> This looks easy to implement: call ll_readdir() with i_mutex held and pass a
> custom fill callback. In the callback we allocate a new dentry plus an empty
> locked inode (or attach the new dentry to an existing inode if we only have
> the UPDATE lock; ll_find_alias()?), and submit an async getattr RPC to
> ptlrpcd if needed. The RPC completion callback sets the md_lock on the inode
> and unlocks the inode. ll_getattr_it() should also call wait_on_inode() to
> be sure the statahead has finished.

I think it'd be better to get common infrastructure into ll_lookup_it() that
can resolve a name to a fid from readdir's cache. Having that in the lookup
path means any syscall can benefit from it, not only statahead. For example,
we could do open-by-fid on the MDS and the MDS wouldn't need to resolve the
name itself. This infrastructure could probably also be used for WBC.

thanks, Alex
On Jul 25, 2008 12:37 +0300, Alexey Lyashkov wrote:
> On Fri, 2008-07-25 at 08:41 +0400, Alex Zhuravlev wrote:
> > I'm not sure we need any statahead thread at all. What's wrong with
> > issuing a number of async RPCs from ll_getattr_it()? This way the user's
> > application drives statahead directly: each time stat(2) is called you
> > tune the statahead window and send a few more RPCs, like data readahead
> > does.

I think it is very useful to consider these "lockless MDT getattr" RPCs as
"glimpse" requests. That idea has served us very well for OST attributes, and
I think the same would be true for MDT attributes. We should always send
these getattr requests with a special flag (is a "getattr intent" enough, or
do we need a different flag?). Ideally this DLM RPC should work the same way
for both OST glimpses and MDT glimpses: the client can optionally be granted
a lock if the inode has not been changed recently, but gets only the
attributes returned if the lock is busy.

> No statahead thread is needed. As Alex says, if we have an UPDATE lock on
> the parent, we can create a valid dentry without a lookup and attach a new
> locked inode to it.
>
> This looks easy to implement: call ll_readdir() with i_mutex held and pass
> a custom fill callback.

One problem with this idea is that we cannot do the callbacks in the context
of the ptlrpcd thread (see bug 15927).

> In the callback we allocate a new dentry plus an empty locked inode (or
> attach the new dentry to an existing inode if we only have the UPDATE lock;
> ll_find_alias()?), and submit an async getattr RPC to ptlrpcd if needed.
> The RPC completion callback sets the md_lock on the inode and unlocks the
> inode. ll_getattr_it() should also call wait_on_inode() to be sure the
> statahead has finished.

IMHO it would also be useful to start the OST glimpse in the callback as soon
as the MDT "glimpse" is returned. Without a change like this we are limited
to a speedup of <= 2x from statahead until size-on-MDS is done. That is
because we are hiding the MDT RPC latency, but still have to wait for the
parallel OST RPC latency, so:

    speedup = (old time / new time) = (MDT RPC + max(OST RPC)) / max(OST RPC)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
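[Illustration: a quick numeric example of the bound above; the RPC latencies
are made-up example values, not measurements.]

#include <stdio.h>

int main(void)
{
    double mdt_rpc = 1.0;   /* assumed MDT getattr latency, ms */
    double ost_rpc = 1.0;   /* assumed max OST glimpse latency, ms */

    /* statahead hides the MDT latency, but the OST glimpse wait remains */
    double speedup = (mdt_rpc + ost_rpc) / ost_rpc;

    printf("speedup = %.1fx\n", speedup);   /* 2.0x when the latencies match */

    /* if the OST glimpse is also prefetched from the callback, both
     * latencies are hidden and the per-entry wait largely disappears */
    return 0;
}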
Andreas Dilger wrote:
> IMHO it would also be useful to start the OST glimpse in the callback as
> soon as the MDT "glimpse" is returned. Without a change like this we are
> limited to a speedup of <= 2x from statahead until size-on-MDS is done.
> That is because we are hiding the MDT RPC latency, but still have to wait
> for the parallel OST RPC latency, so:
>
>     speedup = (old time / new time) = (MDT RPC + max(OST RPC)) / max(OST RPC)

Definitely.

thanks, Alex
On Jul 24, 2008 22:04 -0700, Ragnar Kjærstad wrote:
> If I have an application that does:
> 1. Update stat data on one node.
> 2. Barrier
> 3. Read the same stat data on a different node
>
> I should be guaranteed to get the correct data, right? So stat should return
> data that was valid either when the system call started, when it ended, or
> somewhere in between (not data that was invalid even before the system call
> started).

That depends on which stat data was changed. If it was e.g. atime, then there
is no guarantee that the other node will get an updated copy. If a file write
updates mtime, there is no guarantee that other clients will get the updated
mtime if they already hold a file lock.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.