Hi,

Of the many RPCs used in Lustre, LDLM_ENQUEUE seems to be the one most frequently used to communicate between the client and the MDS. I have a few queries about it:

1) Is LDLM_ENQUEUE the only interface (RPC) for CREATE/OPEN-type requests through which the client can interact with the MDS?

I tried a couple of experiments and found that LDLM_ENQUEUE comes into the picture while mounting the filesystem, as well as when we look up, create, or open a file. I was expecting the MDS_REINT RPC to be invoked for a CREATE/OPEN request via mdc_create(), but it seems that Lustre issues LDLM_ENQUEUE even for CREATE/OPEN (by packing the intent-related data). Please correct me if I am wrong.

2) In which cases (which system calls) does the MDS_REINT RPC get invoked?

Thanks,
Vilobh
Graduate Research Associate
Department of Computer Science
The Ohio State University, Columbus, Ohio
On 10/19/10 7:33 AM, Vilobh Meshram wrote:
> 1) Is LDLM_ENQUEUE the only interface (RPC) for CREATE/OPEN-type requests through which the client can interact with the MDS?
>
> I was expecting the MDS_REINT RPC to be invoked for a CREATE/OPEN request via mdc_create(), but it seems that Lustre issues LDLM_ENQUEUE even for CREATE/OPEN (by packing the intent-related data). Please correct me if I am wrong.

For the OPEN/CREATE case, the client communicates with the MDS through the LDLM_ENQUEUE interface.

> 2) In which cases (which system calls) does the MDS_REINT RPC get invoked?

You can try mkdir/symlink/mknod to trigger MDS_REINT.

Cheers,
--
Nasf
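For anyone who wants to see these paths for themselves, a small userspace program that exercises the metadata operations mentioned above is enough to generate the corresponding RPCs; run it on a Lustre client while watching the MDC RPC counters (for example the mdc stats file under /proc/fs/lustre) and you should see MDS_REINT traffic for mkdir/symlink/mknod and intent enqueues for open/create. The mount point path below is an assumption; adjust it for your filesystem.

```c
/* Sketch: drive a few metadata operations against a Lustre mount so the
 * resulting RPC types can be observed in the client's mdc stats.
 * "/mnt/lustre" is an assumed mount point; change it for your system. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *dir  = "/mnt/lustre/reint_dir";      /* mkdir   -> MDS_REINT */
    const char *link = "/mnt/lustre/reint_symlink";  /* symlink -> MDS_REINT */
    const char *node = "/mnt/lustre/reint_fifo";     /* mknod   -> MDS_REINT */
    const char *file = "/mnt/lustre/intent_file";    /* open(O_CREAT) -> intent enqueue */

    if (mkdir(dir, 0755) != 0)
        perror("mkdir");
    if (symlink("target", link) != 0)
        perror("symlink");
    if (mknod(node, S_IFIFO | 0644, 0) != 0)
        perror("mknod");

    int fd = open(file, O_CREAT | O_WRONLY, 0644);   /* open/create via LDLM_ENQUEUE */
    if (fd < 0)
        perror("open");
    else
        close(fd);

    return 0;
}
```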
Hi,

From my exploration it seems that for create/open-type requests, LDLM_ENQUEUE is the RPC through which the client talks to the MDS. Please confirm this.

Since LDLM_ENQUEUE appears to be the RPC that interfaces with the MDS for these operations, I am planning to send the LDLM_ENQUEUE RPC with some additional buffer from the client to the MDS, so that based on some specific condition the MDS can fill in information in the buffer sent from the client. I have made some modifications to the code for the LDLM_ENQUEUE RPC, but I am getting kernel panics. Can someone please suggest a good way to tackle this problem? I am using Lustre 1.8.1.1 and cannot upgrade to Lustre 2.0.

Thanks,
Vilobh
Graduate Research Associate
Department of Computer Science
The Ohio State University, Columbus, Ohio
On 2010-10-19, at 14:28, Vilobh Meshram wrote:
> From my exploration it seems that for create/open-type requests, LDLM_ENQUEUE is the RPC through which the client talks to the MDS. Please confirm this.
>
> Since LDLM_ENQUEUE appears to be the RPC that interfaces with the MDS, I am planning to send the LDLM_ENQUEUE RPC with some additional buffer from the client to the MDS, so that based on some specific condition the MDS can fill in information in the buffer sent from the client.

This isn't correct. LDLM_ENQUEUE is used for enqueueing locks. It just happens that when Lustre wants to create a new file it enqueues a lock on the parent directory with the "intent" to create a new file. The MDS currently always replies "you cannot have the lock for the directory; I created the requested file for you". Similarly, when the client is getting attributes on a file, it needs a lock on that file in order to cache the attributes, and to save RPCs the attributes are returned with the lock.

> I have made some modifications to the code for the LDLM_ENQUEUE RPC, but I am getting kernel panics. Can someone please suggest a good way to tackle this problem? I am using Lustre 1.8.1.1 and cannot upgrade to Lustre 2.0.

It would REALLY be a lot easier to have this discussion with you if you actually told us what it is you are working on. Not only could we focus on the higher-level issue that you are trying to solve (instead of possibly wasting a lot of time focusing on a small issue that may in fact be completely irrelevant), but with many ideas related to Lustre it has probably already been discussed at length by the Lustre developers sometime over the past 8 years that we've been working on it. I suspect that the readership of this list could give you a lot of assistance with whatever you are working on, if you will only tell us what it actually is you are trying to do.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
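To make the intent mechanism described above a little more concrete, here is a minimal, self-contained sketch in C of the idea: the open/create arguments ride along with a lock request on the parent directory, and the server answers with the operation's result instead of the lock. All of the names here (md_intent, enqueue_lock_with_intent, and so on) are invented for illustration and deliberately do not match the real mdc/ldlm code in any Lustre release.

```c
/* Conceptual model of an "intent" enqueue.  Every name here is made up
 * for illustration; this is not the Lustre client API. */
#include <stdio.h>

#define IT_LOOKUP  0x01
#define IT_OPEN    0x02
#define IT_CREAT   0x04

struct md_intent {
    int it_op;       /* what the caller really wants (bitmask)        */
    int it_flags;    /* open flags / create mode                      */
    int it_status;   /* result of the operation, filled by the server */
};

/* Stand-in for the single LDLM_ENQUEUE RPC: the "server" executes the
 * intent and refuses the directory lock, returning the open result. */
static int enqueue_lock_with_intent(const char *parent, const char *name,
                                    struct md_intent *it)
{
    printf("enqueue lock on '%s' with intent 0x%x for '%s'\n",
           parent, it->it_op, name);
    it->it_status = 0;          /* "file created/opened for you"           */
    return 0;                   /* lock not granted, result returned instead */
}

int main(void)
{
    struct md_intent it = { .it_op = IT_OPEN | IT_CREAT, .it_flags = 0644 };

    /* One RPC carries both the lock request and the open/create intent. */
    enqueue_lock_with_intent("/parent/dir", "newfile", &it);
    printf("open result: %d\n", it.it_status);
    return 0;
}
```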
Hi Andreas,

Thanks for your e-mail. We are trying to do the following; please let me know if anything is unclear.

Say we have two clients, C1 and C2, and an MDS, and C1 and C2 share a file.
1) When client C1 performs an open/create-type request to the MDS, we want to follow the normal path that Lustre performs.
2) Now say C2 tries to open the same file that was opened by C1.
3) At the MDS end we maintain some data structure to scan and see if the file was already opened by some client (in this case C1 has opened this file).
4) If the MDS finds that some client (C1 here) has already opened the file, then it sends the new client (C2 here) some information about the client that originally opened the file.
5) Once C2 gets the information, it is up to C2 to take further actions.
6) By this process we can save the time spent in the locking mechanism for C2. Basically we aim to bypass Lustre's locking scheme for files already opened by some client, by maintaining some kind of data structure.

Please let us know your thoughts on the above approach. Is this a feasible design, and moving ahead, can we expect any complications?

So, considering the problem statement, I need a way for C2 to extract the information from the data structure maintained at the MDS. In order to do that, C2 will send a request with intent = create|open, which will be an LDLM_ENQUEUE RPC. I need to modify this RPC such that:
1) I can enclose some additional buffer whose size is known to me.
2) When we pack the reply at the MDS side, we should be able to include this buffer in the reply message.
3) At the client side, we should be able to extract the information about the buffer from the reply message.

As of now, I need help with the above three steps.

Thanks,
Vilobh
Graduate Research Associate
Department of Computer Science
The Ohio State University, Columbus, Ohio

On Tue, Oct 19, 2010 at 6:53 PM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> This isn't correct. LDLM_ENQUEUE is used for enqueueing locks. [...]
> It would REALLY be a lot easier to have this discussion with you if you actually told us what it is you are working on.
On 2010-10-19, at 20:04, Vilobh Meshram wrote:
> Say we have two clients, C1 and C2, and an MDS, and C1 and C2 share a file.
> 1) When client C1 performs an open/create-type request to the MDS, we want to follow the normal path that Lustre performs.
> 2) Now say C2 tries to open the same file that was opened by C1.
> 3) At the MDS end we maintain some data structure to scan and see if the file was already opened by some client (in this case C1 has opened this file).
> 4) If the MDS finds that some client (C1 here) has already opened the file, then it sends the new client (C2 here) some information about the client that originally opened the file.

While I understand the basic concept, I don't really see how your proposal will actually improve performance. If C2 already has to contact the MDS and get a reply from it, then wouldn't it be about the same to simply perform the open as is done today? The number of MDS RPCs is the same, and in fact this would avoid further message overhead between C1 and C2.

> 5) Once C2 gets the information, it is up to C2 to take further actions.
> 6) By this process we can save the time spent in the locking mechanism for C2. Basically we aim to bypass Lustre's locking scheme for files already opened by some client, by maintaining some kind of data structure.
>
> Please let us know your thoughts on the above approach. Is this a feasible design, and moving ahead, can we expect any complications?

There is a separate proposal that has been underway in the Linux community for some time, to allow a user process to get a file handle (i.e. a binary blob returned from a new name_to_handle() syscall) from the kernel for a given pathname, and then later use that file handle in another process to open a file descriptor without re-traversing the path.

I've been thinking this would be very useful for Lustre (and MPI in general), and have tried to steer the Linux development in a direction that would allow this to happen. Is this in line with what you are investigating?

While this wouldn't eliminate the actual MDS open RPC (i.e. the LDLM_ENQUEUE you have been discussing), it could avoid the path traversal from each client, possibly saving {path_elements * num_clients} additional RPCs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
bzzz.tomas at gmail.com, 2010-Oct-20 08:11 UTC:
On 10/20/10 11:55 AM, Andreas Dilger wrote:
> There is a separate proposal that has been underway in the Linux community for some time, to allow a user process to get a file handle (i.e. a binary blob returned from a new name_to_handle() syscall) from the kernel for a given pathname, and then later use that file handle in another process to open a file descriptor without re-traversing the path.
>
> I've been thinking this would be very useful for Lustre (and MPI in general), and have tried to steer the Linux development in a direction that would allow this to happen. Is this in line with what you are investigating?

With FIDs this is quite possible, and even safe, if the application can learn the FID (using xattr_get or an ioctl). Then it should be trivial to export the FID namespace on the MDS via a special .lustre-fids directory?

Another idea was to do the whole path traversal on the MDS within a single RPC, but that'd require a fair amount of changes to llite and/or the VFS, and it keeps the MDS a bottleneck.

thanks, z
On 2010-10-20, at 02:11, bzzz.tomas at gmail.com wrote:
> With FIDs this is quite possible, and even safe, if the application can learn the FID (using xattr_get or an ioctl). Then it should be trivial to export the FID namespace on the MDS via a special .lustre-fids directory?

I'm reluctant to expose the whole FID namespace to applications, since this completely bypasses all directory permissions and allows opening files based only on their inode permissions. If we require a name_to_handle() syscall to succeed first, before allowing open_by_handle() to work, then at least we know that one of the involved processes was able to do a full path traversal.

> Another idea was to do the whole path traversal on the MDS within a single RPC, but that'd require a fair amount of changes to llite and/or the VFS, and it keeps the MDS a bottleneck.

This was discussed a long time ago, and has the potential drawback that if one of the path components is over-mounted on the client (e.g. a local RAM-based tmpfs on a Lustre root filesystem) then the MDS-side path traversal would be incorrect. It could return an entry underneath the mountpoint, instead of inside it.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
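At the time of this thread the name_to_handle()/open_by_handle() interface was still a proposal; it later landed in mainline Linux (around 2.6.39) as name_to_handle_at() and open_by_handle_at(). As a rough illustration of the usage pattern being discussed, here is a sketch using the form the interface eventually took; the paths are placeholders, error handling is minimal, and open_by_handle_at() requires privilege (CAP_DAC_READ_SEARCH) on stock Linux, so the second half would normally run in a specially-authorized process.

```c
/* Sketch of the handle-based open pattern under discussion, written
 * against the interface as it eventually appeared in Linux 2.6.39+
 * (name_to_handle_at / open_by_handle_at).  Paths are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Step 1: one process resolves the path to an opaque handle. */
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id;

    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, "/mnt/lustre/shared_file",
                          fh, &mount_id, 0) < 0) {
        perror("name_to_handle_at");
        return 1;
    }
    printf("handle: %u bytes, type %d\n", fh->handle_bytes, fh->handle_type);

    /* Step 2: any process holding the blob (and an fd on the mount) can
     * reopen the file without walking the path again. */
    int mnt_fd = open("/mnt/lustre", O_RDONLY | O_DIRECTORY);
    if (mnt_fd < 0) {
        perror("open mount point");
        free(fh);
        return 1;
    }

    int fd = open_by_handle_at(mnt_fd, fh, O_RDONLY);
    if (fd < 0)
        perror("open_by_handle_at");
    else
        close(fd);

    close(mnt_fd);
    free(fh);
    return 0;
}
```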
bzzz.tomas at gmail.com, 2010-Oct-20 08:30 UTC:
On 10/20/10 12:24 PM, Andreas Dilger wrote:
> I'm reluctant to expose the whole FID namespace to applications, since this completely bypasses all directory permissions and allows opening files based only on their inode permissions. If we require a name_to_handle() syscall to succeed first, before allowing open_by_handle() to work, then at least we know that one of the involved processes was able to do a full path traversal.

Yes, this is a good point. Could it be solved by using a FID plus a capability/signature?

>> Another idea was to do the whole path traversal on the MDS within a single RPC, but that'd require a fair amount of changes to llite and/or the VFS, and it keeps the MDS a bottleneck.
>
> This was discussed a long time ago, and has the potential drawback that if one of the path components is over-mounted on the client (e.g. a local RAM-based tmpfs on a Lustre root filesystem) then the MDS-side path traversal would be incorrect. It could return an entry underneath the mountpoint, instead of inside it.

Yes, and that could be solved if the server returned a series of FIDs; then the client could check whether any of those is over-mounted?

thanks, z
On 20 October 2010 12:30, <bzzz.tomas at gmail.com> wrote:
> Yes, and that could be solved if the server returned a series of FIDs; then the client could check whether any of those is over-mounted?

This is what sufficiently smart NFSv4 clients are supposed to do, by the way, I believe: issue a compound RPC with a sequence of LOOKUP requests and traverse the returned sequence of file IDs locally, checking for mount points.

Nikita.
I do like the idea of a collective open, but I'm wondering if it can be implemented simply enough to be worth the effort. True, it avoids the O(n) load on the server of all the clients (re)populating their namespace caches, but it's only useful for parallel jobs - a scale-out NAS-style workload can't benefit. Ultimately the O(n) will have to be replaced with something that scales O(log n) (e.g. with a fat tree of caching proxy servers).

> On 10/20/10 12:24 PM, Andreas Dilger wrote:
>> I'm reluctant to expose the whole FID namespace to applications,

??? It can just be opaque bytes to the app.

>> since this completely bypasses all directory permissions and allows opening files based only on their inode permissions. If we require a name_to_handle() syscall to succeed first, before allowing open_by_handle() to work, then at least we know that one of the involved processes was able to do a full path traversal.

I think this defeats the scalability objective - we're trying to avoid having to pull the namespace into every client, aren't we?

> Yes, this is a good point. Could it be solved by using a FID plus a capability/signature?

Yes, I think capabilities are the only way collective open can be made secure "properly". And given the way we believe capabilities have to be implemented for scalability (i.e. to keep the capability cache down to a reasonable size on the server), any open by one node in a given client cluster may well have to confer the right to use the FID on any of its peers.

>>> Another idea was to do the whole path traversal on the MDS within a single RPC, but that'd require a fair amount of changes to llite and/or the VFS, and it keeps the MDS a bottleneck.

That's an optimization rather than a scalability feature. How much does it complicate the code? I'd hate to see something new, tricky, and fragile complicate further development.

Cheers,
Eric
bzzz.tomas at gmail.com, 2010-Oct-20 13:40 UTC:
On 10/20/10 5:30 PM, Eric Barton wrote:
> I do like the idea of a collective open, but I'm wondering if it can be implemented simply enough to be worth the effort. True, it avoids the O(n) load on the server of all the clients (re)populating their namespace caches, but it's only useful for parallel jobs - a scale-out NAS-style workload can't benefit. Ultimately the O(n) will have to be replaced with something that scales O(log n) (e.g. with a fat tree of caching proxy servers).

In the long term I'd prefer the proxy approach, because that way we could improve a number of cases, including existing POSIX apps doing open, stat, etc.

>>> Another idea was to do the whole path traversal on the MDS within a single RPC, but that'd require a fair amount of changes to llite and/or the VFS, and it keeps the MDS a bottleneck.
>
> That's an optimization rather than a scalability feature. How much does it complicate the code? I'd hate to see something new, tricky, and fragile complicate further development.

Yes, this is an optimization. The good thing here is that a single client can benefit a lot from this (replacing a few RPCs with a single one). The bad thing is that it can be quite complicated on the client side (the server-side part looks OK).

thanks, z
On Wed, Oct 20, 2010 at 12:38:44PM +0400, Nikita Danilov wrote:
> On 20 October 2010 12:30, <bzzz.tomas at gmail.com> wrote:
>> Yes, and that could be solved if the server returned a series of FIDs; then the client could check whether any of those is over-mounted?
>
> This is what sufficiently smart NFSv4 clients are supposed to do, by the way, I believe: issue a compound RPC with a sequence of LOOKUP requests and traverse the returned sequence of file IDs locally, checking for mount points.

Yes. The detection and replication of server-side mountpoints on the client side is called "mirror mounts" in Solaris, and it's quite handy. For clients the main issue is going to be whether the VFS allows plugins to resolve more than one path component at a time.
Eric Barton wrote:
> I do like the idea of a collective open, but I'm wondering if it can be implemented simply enough to be worth the effort. True, it avoids the O(n) load on the server of all the clients (re)populating their namespace caches, but it's only useful for parallel jobs - a scale-out NAS-style workload can't benefit. Ultimately the O(n) will have to be replaced with something that scales O(log n) (e.g. with a fat tree of caching proxy servers).

Eric makes a good point in that only parallel jobs really need this feature. Unfortunately, at scale the system (both clients and servers) *really does* need something like this, especially if we continue pushing users to perform N-1 file I/O instead of 'file per process'. I too am in agreement that some sort of capability mechanism is the best approach. I wonder if this is something that could be done outside of POSIX and supported through a parallel I/O library? Perhaps a single application thread could make a special open call (/proc magic perhaps?) and obtain a glob of opaque bytes which is then broadcast to the rest of the clients via MPI. Traversing the namespace would be avoided on all but one client. In such a scenario I don't feel that enforcing Unix permissions at every level of the path is needed or sensible; the operation should be treated as a simple logical open. The question for the Lustre experts: can enough state be packed into an opaque object such that the receiving client can construct the necessary cache state?
On Wed, Oct 20, 2010 at 10:51:06AM -0400, Paul Nowoczynski wrote:
> I too am in agreement that some sort of capability mechanism is the best approach. I wonder if this is something that could be done outside of POSIX and supported through a parallel I/O library? [...] The question for the Lustre experts: can enough state be packed into an opaque object such that the receiving client can construct the necessary cache state?

POSIX already has what you're asking for, and it's called openg() ;)
Yes! I think I was at this HEC meeting a few years ago?? :)

Here are the pointers to the man pages if anyone else is interested:
http://www.opengroup.org/platform/hecewg/

So my question wasn't so much about the interface, which is why I posed a scenario based on MPI, but rather: how feasible is it to import the necessary state from the client issuing openg() to the rest?
paul

Nicolas Williams wrote:
> POSIX already has what you're asking for, and it's called openg() ;)
bzzz.tomas at gmail.com, 2010-Oct-20 15:22 UTC:
On 10/20/10 6:51 PM, Paul Nowoczynski wrote:
> Perhaps a single application thread could make a special open call (/proc magic perhaps?) and obtain a glob of opaque bytes which is then broadcast to the rest of the clients via MPI. Traversing the namespace would be avoided on all but one client. In such a scenario I don't feel that enforcing Unix permissions at every level of the path is needed or sensible; the operation should be treated as a simple logical open. The question for the Lustre experts: can enough state be packed into an opaque object such that the receiving client can construct the necessary cache state?

Could you explain why it is so important to skip the intermediate lookups? Those are to be done once, and then the clients will do them locally. Is it because your nodes are getting new paths all the time, or because the nodes are rebooted very often and lose their cache?

thanks, z
Note that I was in contact with the HECEWG also, and openg() was proposed to be renamed to something more understandable. The current Linux name_to_handle() proposal could be used to export a blob identifier (a file handle that holds a FID, fs UUID, and a cookie/capability in the Lustre case) to userspace; then MPI-IO or some other mechanism can be used to distribute this to other client processes, and open_by_handle() to convert this back into an open file descriptor.

One question is whether mpi_open() could be used as a collective operation (allowing this to be handled inside the Lustre ADIO layer) or if it would need specific application support?

Cheers, Andreas

On 2010-10-20, at 9:16, Paul Nowoczynski <pauln at psc.edu> wrote:
> Yes! I think I was at this HEC meeting a few years ago?? :)
> Here are the pointers to the man pages if anyone else is interested:
> http://www.opengroup.org/platform/hecewg/
>
> So my question wasn't so much about the interface, which is why I posed a scenario based on MPI, but rather: how feasible is it to import the necessary state from the client issuing openg() to the rest?
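The collective pattern Andreas and Paul are describing can be sketched roughly as follows: one rank resolves the path to an opaque handle, the blob is broadcast with MPI, and the other ranks open by handle without touching the namespace. This is only an illustration of the idea under discussion, written against the handle syscalls as they later appeared in Linux (name_to_handle_at/open_by_handle_at) rather than openg() or any Lustre-specific interface; the mount path is a placeholder, and the MDS-side open-state issues raised later in the thread are ignored here.

```c
/* Sketch of a "collective open": rank 0 resolves the path, everyone
 * else opens by the broadcast handle.  Illustrative only; uses the
 * generic Linux handle syscalls, not openg() or any Lustre API. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    size_t blob_size = sizeof(struct file_handle) + MAX_HANDLE_SZ;
    struct file_handle *fh = malloc(blob_size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int mount_id;
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(AT_FDCWD, "/mnt/lustre/shared_file",
                              fh, &mount_id, 0) < 0)
            perror("name_to_handle_at");     /* single path traversal */
    }

    /* Distribute the opaque handle; no other rank walks the path. */
    MPI_Bcast(fh, (int)blob_size, MPI_BYTE, 0, MPI_COMM_WORLD);

    int mnt_fd = open("/mnt/lustre", O_RDONLY | O_DIRECTORY);
    int fd = open_by_handle_at(mnt_fd, fh, O_RDONLY);
    if (fd < 0)
        perror("open_by_handle_at");
    else
        close(fd);

    if (mnt_fd >= 0)
        close(mnt_fd);
    free(fh);
    MPI_Finalize();
    return 0;
}
```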
On 2010-10-20, at 7:30, Eric Barton <eeb at whamcloud.com> wrote:
>> On 10/20/10 12:24 PM, Andreas Dilger wrote:
>>> I'm reluctant to expose the whole FID namespace to applications,
>
> ??? It can just be opaque bytes to the app.

This was in reply to Alex Z's comments that we can just do open-by-FID from userspace.

>>> since this completely bypasses all directory permissions and allows opening files based only on their inode permissions. If we require a name_to_handle() syscall to succeed first, before allowing open_by_handle() to work, then at least we know that one of the involved processes was able to do a full path traversal.
>
> I think this defeats the scalability objective - we're trying to avoid having to pull the namespace into every client, aren't we?

The name_to_handle() only needs to be called on a single node, and open_by_handle() is called on the other nodes. I agree that this doesn't avoid the full O(n) RPCs for the open itself, but at least it does avoid the full path traversal from every client and on the MDS (replacing it with an MPI broadcast of the handle).

Cheers, Andreas
bzzz.tomas at gmail.com wrote:
> Could you explain why it is so important to skip the intermediate lookups? Those are to be done once, and then the clients will do them locally. Is it because your nodes are getting new paths all the time, or because the nodes are rebooted very often and lose their cache?

It's for scalability reasons. When N clients traverse the namespace with the purpose of opening the same file, the result is a storm of RPC requests which bear down on the metadata server. This type of activity becomes prohibitive, especially when you start considering client counts > 10^4. An operation such as this is ripe for optimization, because every client in the network is trying to build the same state. If you have a method for a single client to 'learn' the final state, i.e. the pathname -> FID translation, and broadcast it to its cohorts, it's a huge win because it eliminates an O(N) operation.
paul
> The name_to_handle() only needs to be called on a single node, and open_by_handle() is called on the other nodes. I agree that this doesn't avoid the full O(n) RPCs for the open itself, but at least it does avoid the full path traversal from every client and on the MDS (replacing it with an MPI broadcast of the handle).

Andreas,
excuse my ignorance, but why does open_by_handle() need to issue an RPC? If it's to obtain the layout, couldn't the layout be encoded into the 'handle'?
p
bzzz.tomas at gmail.com, 2010-Oct-20 16:49 UTC:
On 10/20/10 8:43 PM, Paul Nowoczynski wrote:
> It's for scalability reasons. When N clients traverse the namespace with the purpose of opening the same file, the result is a storm of RPC requests which bear down on the metadata server. This type of activity becomes prohibitive, especially when you start considering client counts > 10^4. An operation such as this is ripe for optimization, because every client in the network is trying to build the same state. If you have a method for a single client to 'learn' the final state, i.e. the pathname -> FID translation, and broadcast it to its cohorts, it's a huge win because it eliminates an O(N) operation.
> paul

Clear enough, but what is the bottleneck here: the MDS handling lots of RPCs, or the network passing the RPCs?

thanks, z
On 2010-10-20, at 10:46, Paul Nowoczynski <pauln at psc.edu> wrote:
> excuse my ignorance, but why does open_by_handle() need to issue an RPC? If it's to obtain the layout, couldn't the layout be encoded into the 'handle'?

In theory, yes. Practically, there is a size limit on the handle, and in large filesystems the layout is larger than this limit.

Also, it depends on whether we want the MDS to have consistent behavior with the resulting open file descriptor or not.

I suppose in many cases it would be possible to fake out an open file on the client without telling the MDS, but then there will be strange problems in some cases (e.g. stat() of the file, errors on close, etc.) that would result, since the MDS won't know anything about the other openers. Maybe that is acceptable, I don't know.

Cheers, Andreas
On Wed, Oct 20, 2010 at 12:46:56PM -0400, Paul Nowoczynski wrote:
> excuse my ignorance, but why does open_by_handle() need to issue an RPC? If it's to obtain the layout, couldn't the layout be encoded into the 'handle'?

If you don't mind having a huge handle, then yes, we could skip additional RPCs. A handle would have to consist of {MGS address, FID, layout, access type, capability}, or so.
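For illustration, a self-describing "fat handle" along the lines Nicolas sketches might be laid out something like the structure below. This is purely hypothetical: the field names, sizes, and the idea of embedding the layout and a capability come from the discussion above, not from any Lustre data structure, and a real design would also need versioning and an integrity check over the whole blob.

```c
/* Hypothetical layout of a self-contained "fat" open handle, as discussed
 * above.  None of these types exist in Lustre; sizes are illustrative. */
#include <stdint.h>

struct fat_handle {
    uint32_t fh_version;        /* format version of this blob            */
    uint32_t fh_len;            /* total length, including fh_layout[]    */

    char     fh_fsname[40];     /* which filesystem / MGS this refers to  */
    uint64_t fh_fid_seq;        /* file identifier (FID sequence)         */
    uint32_t fh_fid_oid;        /* FID object id                          */
    uint32_t fh_fid_ver;        /* FID version                            */

    uint32_t fh_access;         /* access type granted (read/write/...)   */
    uint8_t  fh_capability[64]; /* server-signed capability covering the
                                 * FID + access, so the handle cannot be
                                 * forged or widened by the client        */

    uint32_t fh_layout_len;     /* striping/layout description follows;
                                 * this is what makes the handle large
                                 * for widely striped files               */
    uint8_t  fh_layout[];       /* variable-size layout blob              */
};
```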
bzzz.tomas at gmail.com wrote:
> Clear enough, but what is the bottleneck here: the MDS handling lots of RPCs, or the network passing the RPCs?

I could be wrong, but my guess is that the network congestion caused by this communication pattern is the more serious problem. The MDS should be able to easily service lookup RPCs, since only the first few necessitate a read I/O from the disk.
On Wed, Oct 20, 2010 at 11:00:53AM -0600, Andreas Dilger wrote:
> I suppose in many cases it would be possible to fake out an open file on the client without telling the MDS, but then there will be strange problems in some cases (e.g. stat() of the file, errors on close, etc.) that would result, since the MDS won't know anything about the other openers. Maybe that is acceptable, I don't know.

Well, if we're going to add openg() (or whatever its name), we might as well add variants of stat() that don't require getting the size when the app doesn't need it, and forget about SOM, or forget about SOM when we know that a file might be open by unknown clients (recovery issues here).

Another possibility is that the handle encodes the current size, and that writing past that size requires an RPC to establish open state, but this ignores truncation.

Another possibility is to say that a handle is only good as long as the original file descriptor remains open (recovery issues here), and that the client can tell the MDS that it will be sharing its handle with other clients. Or the client could tell the MDS which clients will share that handle (recovery issues here too).

Some sort of additional RPC seems hard to avoid here, but maybe it could be async for clients opening by handle.

Nico
--
bzzz.tomas at gmail.com, 2010-Oct-20 17:18 UTC:
On 10/20/10 9:11 PM, Paul Nowoczynski wrote:
> I could be wrong, but my guess is that the network congestion caused by this communication pattern is the more serious problem. The MDS should be able to easily service lookup RPCs, since only the first few necessitate a read I/O from the disk.

But then the network should be able to deal with a storm of <max RPCs in flight> * <# clients> to read/write data?

Or is it a specific switch being the bottleneck to a specific node?

Because if it isn't the network but the MDS being the real bottleneck, then a proxy might be a solution, like Eric said above. Not sure whether this is important in your case, but it would allow existing apps to be used.

Of course, a distribution tree for a handle may scale better.

thanks, z
Have a look at this; it explains the type of problem networks have in dealing with these communication patterns:
http://www.pdl.cmu.edu/Incast/

And yes, a proxy is a workable solution, and probably the most well-rounded. The disadvantage is that it would presumably require more engineering to deploy.
p

bzzz.tomas at gmail.com wrote:
> But then the network should be able to deal with a storm of <max RPCs in flight> * <# clients> to read/write data?
>
> Or is it a specific switch being the bottleneck to a specific node?
>
> Because if it isn't the network but the MDS being the real bottleneck, then a proxy might be a solution, like Eric said above.
On 2010-10-20, at 11:18, bzzz.tomas at gmail.com wrote:
> But then the network should be able to deal with a storm of <max RPCs in flight> * <# clients> to read/write data?
>
> Or is it a specific switch being the bottleneck to a specific node?

I think there is definitely non-trivial overhead in the MDS threads descending into the filesystem to do path lookup and permission checking that would be avoided.

> Because if it isn't the network but the MDS being the real bottleneck, then a proxy might be a solution, like Eric said above. Not sure whether this is important in your case, but it would allow existing apps to be used.
>
> Of course, a distribution tree for a handle may scale better.

I don't think the actual distribution of the handle is a significant factor (this can be done via efficient broadcast in the MPI layer). If we want to keep the MDS state consistent with N openers of the file, then that may take more effort.

However, I also just thought of a partial solution to the MDS state issue - if the original client doing name_to_handle() also gets the MDS open lock, then it can somewhat act as a "proxy" for the remaining clients that are opening via the handle. The MDS will know that the client with the MDS open lock may be doing other opens, and if the handle also contains the layout, as Paul proposed, then it seems possible to get at least a reasonable representation of the file on each client without an additional MDS RPC from each one. Those clients may still have issues if contacting the MDS for that file, but maybe not.

Actually implementing this is left as an exercise for the reader...

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On Wed, Oct 20, 2010 at 09:18:59PM +0400, bzzz.tomas at gmail.com wrote:
> Because if it isn't the network but the MDS being the real bottleneck, then a proxy might be a solution, like Eric said above. Not sure whether this is important in your case, but it would allow existing apps to be used.

MDSes are typically CPU bound, so that's likely the issue. The problem, though, is that the MDS does need to track open file state for SOM and for dealing with unlinks. The semantics of open-by-handle might be such that unlinks of files opened by handle can cause the file to disappear, and syscalls on FDs opened by handle could then return EBADF or EIO or some new error code. But if open-by-handle semantics don't allow for that, then the MDS needs to track open file state, and it's hard to see how to avoid RPCs to the MDS to establish that state (the original client could tell the MDS about all the clients that will open-by-handle, but this seems unlikely to perform so much better than N smaller RPCs as to justify it, and the open-by-handle API suddenly gets much more complex).

Nico
--
On 2010-10-20, at 11:13, Nicolas Williams wrote:
> Well, if we're going to add openg() (or whatever its name), we might as well add variants of stat() that don't require getting the size when the app doesn't need it

That is "stat_lite" (under various different names), and it was also under discussion for adding to the Linux kernel, until it turned from being a sensible API into a Linux-designed-by-committee API from hell (IMHO, of course) and stopped dead in its tracks.

> Another possibility is to say that a handle is only good as long as the original file descriptor remains open (recovery issues here), and that the client can tell the MDS that it will be sharing its handle with other clients.

That is partly what the MDS open lock does. It was intended for NFS servers, to allow them to open and close a file locally for their clients without MDS RPCs.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
bzzz.tomas at gmail.com, 2010-Oct-20 17:40 UTC:
On 10/20/10 9:29 PM, Nicolas Williams wrote:
> MDSes are typically CPU bound, so that's likely the issue. The problem, though, is that the MDS does need to track open file state for SOM and for dealing with unlinks. [...] Some sort of additional RPC seems hard to avoid here, but maybe it could be async for clients opening by handle.

I guess for this purpose they may just disable SOM and take a few steps away from POSIX. Probably inter-client data consistency isn't that important any more ;) then get rid of the MDS and the namespace completely, using some sort of FID.

thanks, z
On 2010-10-20, at 11:40, bzzz.tomas at gmail.com wrote:
> On 10/20/10 9:29 PM, Nicolas Williams wrote:
>> MDSes are typically CPU bound, so that's likely the issue. The problem,
>> though, is that the MDS does need to track open file state for SOM and
>> for dealing with unlinks. The semantics of open-by-handle might be such
>> that unlinks of files opened by handle can cause the file to disappear,
>> and syscalls on FDs opened by handle could then return EBADF or EIO or
>> some new error code. But if open-by-handle semantics don't allow for that,
>> then the MDS needs to track open file state, and it's hard to see how to
>> avoid RPCs to the MDS to establish that state (the original client could
>> tell the MDS about all the clients that will open-by-handle, but this
>> seems unlikely to perform so much better than N smaller RPCs as to
>> justify it, and the open-by-handle API suddenly gets much more complex).
>
> I guess for this purpose they may just disable SOM and take a few steps away
> from POSIX. probably inter-client data consistency isn't that important
> any more ;) then get rid of the MDS and namespace completely, using some
> sort of FID.

I don't think that most customers want to drop POSIX and namespaces completely, because of the huge number of tools/apps that depend on them, but rather to have an API that can improve the performance of select applications that have a need for it.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
bzzz.tomas at gmail.com
2010-Oct-20 18:09 UTC
[Lustre-devel] Queries regarding LDLM_ENQUEUE
On 10/20/10 10:01 PM, Andreas Dilger wrote:
>> I guess for this purpose they may just disable SOM and take a few steps away
>> from POSIX. probably inter-client data consistency isn't that important
>> any more ;) then get rid of the MDS and namespace completely, using some
>> sort of FID.
>
> I don't think that most customers want to drop POSIX and namespaces completely, because of the huge number of tools/apps that depend on them, but rather to have an API that can improve the performance of select applications that have a need for it.

oh, sorry for that sort of joke.. what I meant is that we could probably provide another user-visible API that allows the regular namespace to be bypassed, for example.

thanks, z
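For what it's worth, later Lustre 2.x clients do expose a limited namespace bypass of this kind: a .lustre/fid pseudo-directory under the mount point through which a file can be opened by its FID (e.g. as printed by "lfs path2fid") without walking the regular path. This is not available in the 1.8 series discussed in this thread; a hedged sketch of what using it looks like, assuming the bracketed [seq:oid:ver] FID string form:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Open a file by FID through the .lustre/fid pseudo-directory of a
 * Lustre 2.x client mount. */
static int open_by_fid(const char *mntpt, const char *fid, int flags)
{
        char path[PATH_MAX];

        snprintf(path, sizeof(path), "%s/.lustre/fid/%s", mntpt, fid);
        return open(path, flags);
}

int main(int argc, char **argv)
{
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <mountpoint> <fid>\n", argv[0]);
                return 1;
        }
        fd = open_by_fid(argv[1], argv[2], O_RDONLY);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        close(fd);
        return 0;
}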
Thanks Andreas for the e-mail.

I am trying to modify the LDLM_ENQUEUE RPC so that the reply comes back with an extra buffer (say the string "Hello World") filled in by the MDS. I have explained the use case in my last e-mail; please refer to my e-mail sent on 10/19. I have attached the diff files.

I am getting a kernel panic at the MDS end when I apply the attached changes. Can someone please suggest where I might be going wrong?

Thanks,
Vilobh
*Graduate Research Associate
Department of Computer Science
The Ohio State University Columbus Ohio*

On Wed, Oct 20, 2010 at 3:55 AM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> On 2010-10-19, at 20:04, Vilobh Meshram wrote:
> > We are trying to do the following things. Please let me know if things are not clear:
> >
> > Say we have 2 clients C1 and C2 and an MDS. Say C1 and C2 share a file.
> > 1) When client C1 performs an open/create kind of request to the MDS, we want to follow the normal path which Lustre performs.
> > 2) Now say C2 tries to open the same file which was opened by C1.
> > 3) At the MDS end we maintain some data structure to scan and see if the file was already opened by some client (in this case C1 has opened this file).
> > 4) If the MDS finds that some client (C1 here) has already opened the file, then it sends the new client (C2 here) some information about the client which initially opened the file.
>
> While I understand the basic concept, I don't really see how your proposal will actually improve performance. If C2 already has to contact the MDS and get a reply from it, then wouldn't it be about the same to simply perform the open as is done today? The number of MDS RPCs is the same, and in fact this would avoid further message overhead between C1 and C2.
>
> > 5) Once C2 gets the information it is up to C2 to take further actions.
> > 6) By this process we can save the time spent in the locking mechanism for C2. Basically we aim to bypass the locking scheme of Lustre for files already opened by some client, by maintaining some kind of data structure.
> >
> > Please let us know your thoughts on the above approach. Is this a feasible design, and moving ahead can we see any complications?
>
> There is a separate proposal that has been underway in the Linux community for some time, to allow a user process to get a file handle (i.e. a binary blob returned from a new name_to_handle() syscall) from the kernel for a given pathname, and then later use that file handle in another process to open a file descriptor without re-traversing the path.
>
> I've been thinking this would be very useful for Lustre (and MPI in general), and have tried to steer the Linux development in a direction that would allow this to happen. Is this in line with what you are investigating?
>
> While this wouldn't eliminate the actual MDS open RPC (i.e. the LDLM_ENQUEUE you have been discussing), it could avoid the path traversal from each client, possibly saving {path_elements * num_clients} additional RPCs,
>
> > So considering the problem statement, I need a way for C2 to extract the information from the data structure maintained at the MDS. In order to do that, C2 will send a request with intent = create|open, which will be a LDLM_ENQUEUE RPC. I need to modify this RPC such that:
> > 1) I can enclose some additional buffer whose size is known to me.
> > 2) When we pack the reply at the MDS side we should be able to include this buffer in the reply message.
> > 3) At the client side we should be able to extract the information from > the reply message about the buffer. > > > > As of now , I need help in above three steps. > > > > Thanks, > > Vilobh > > Graduate Research Associate > > Department of Computer Science > > The Ohio State University Columbus Ohio > > > > > > On Tue, Oct 19, 2010 at 6:53 PM, Andreas Dilger < > andreas.dilger at oracle.com> wrote: > > On 2010-10-19, at 14:28, Vilobh Meshram wrote: > > > From my exploration it seems like for create/open kind of request > LDLM_ENQUEUE is the RPC through which the client talks to MDS.Please confirm > on this. > > > > > > Since I could figure out that LDLM_ENQUEUE is the only RPC to interface > with MDS I am planning to send the LDLM_ENQUEUE RPC with some additonal > buffer from the client to the MDS so that based on some specific condition > the MDS can fill the information in the buffer sent from the client. > > > > This isn''t correct. LDLM_ENQUEUE is used for enqueueing locks. It just > happens that when Lustre wants to create a new file it enqueues a lock on > the parent directory with the "intent" to create a new file. The MDS > currently always replies "you cannot have the lock for the directory, I > created the requested file for you". Similarly, when the client is getting > attributes on a file, it needs a lock on that file in order to cache the > attributes, and to save RPCs the attributes are returned with the lock. > > > > > I have made some modifications to the code for the LDLM_ENQUEUE RPC but > I am getting kernel panics.Can someone please help me and suggest me what is > a good way to tackle this problem.I am using Lustre 1.8.1.1 and I cannot > upgrade to Lustre 2.0. > > > > It would REALLY be a lot easier to have this discussion with you if you > actually told us what it is you are working on. Not only could we focus on > the higher-level issue that you are trying to solve (instead of possibly > wasting a lot of time focussing in a small issue that may in fact be > completely irrelevant), but with many ideas related to Lustre it has > probably already been discussed at length by the Lustre developers sometime > over the past 8 years that we''ve been working on it. I suspect that the > readership of this list could probably give you a lot of assistance with > whatever you are working on, if you will only tell us what it actually is > you are trying to do. > > > > > On Mon, Oct 18, 2010 at 7:33 PM, Vilobh Meshram < > vilobh.meshram at gmail.com> wrote: > > >> Out of the many RPC''s used in Lustre seems like LDLM_ENQUEUE is the > most frequently used RPC to communicate between the client and the MDS.I > have few queries regarding the same :- > > >> > > >> 1) Is LDLM_ENQUEUE the only interface(RPC here) for CREATE/OPEN kind > of request ; through which the client can interact with the MDS ? > > >> > > >> I tried couple of experiments and found out that LDLM_ENQUEUE comes > into picture while mounting the FS as well as when we do a lookup,create or > open a file.I was expecting the MDS_REINT RPC to get invoked in case of a > CREATE/OPEN request via mdc_create() but it seems like Lustre invokes > LDLM_ENQEUE even for CREATE/OPEN( by packing the intent related data). > > >> Please correct me if I am wrong. > > >> > > >> 2) In which cases (which system calls) does the MDS_REINT RPC will get > invoked ? > > > > > > Cheers, Andreas > > -- > > Andreas Dilger > > Lustre Technical Lead > > Oracle Corporation Canada Inc. 
> > > > > > > Cheers, Andreas > -- > Andreas Dilger > Lustre Technical Lead > Oracle Corporation Canada Inc. > >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-devel/attachments/20101021/495a5f4a/attachment-0001.html -------------- next part -------------- *** ./lustre/ldlm/ldlm_lockd.c 2010-10-21 17:49:05.000000000 -0400 --- ../fresh/lustre/ldlm/ldlm_lockd.c 2010-10-15 15:37:02.000000000 -0400 *************** int ldlm_handle_enqueue(struct ptlrpc_re *** 997,1017 **** struct obd_device *obddev = req->rq_export->exp_obd; struct ldlm_reply *dlm_rep; struct ldlm_request *dlm_req; ! __u32 size[4] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREPLY_OFF] = sizeof(*dlm_rep) }; int rc = 0; __u32 flags; ldlm_error_t err = ELDLM_OK; struct ldlm_lock *lock = NULL; void *cookie = NULL; - int i; - char *str = "Hello World Sun"; - char *str_target; ENTRY; LDLM_DEBUG_NOLOCK("server-side enqueue handler START"); ! printk("\n Inside function %s server-side enqueue handler START",__func__); ! for(i=0;i<3;i++) printk("\n Inside function %s size[%d]:%d",__func__,i,size[i]); dlm_req = lustre_swab_reqbuf(req, DLM_LOCKREQ_OFF, sizeof(*dlm_req), lustre_swab_ldlm_request); if (dlm_req == NULL) { --- 997,1013 ---- struct obd_device *obddev = req->rq_export->exp_obd; struct ldlm_reply *dlm_rep; struct ldlm_request *dlm_req; ! __u32 size[3] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREPLY_OFF] = sizeof(*dlm_rep) }; int rc = 0; __u32 flags; ldlm_error_t err = ELDLM_OK; struct ldlm_lock *lock = NULL; void *cookie = NULL; ENTRY; LDLM_DEBUG_NOLOCK("server-side enqueue handler START"); ! dlm_req = lustre_swab_reqbuf(req, DLM_LOCKREQ_OFF, sizeof(*dlm_req), lustre_swab_ldlm_request); if (dlm_req == NULL) { *************** existing_lock: *** 1126,1148 **** int buffers = 2; lock_res_and_lock(lock); - printk("\n Exsisting lock lock->l_resource->lr_lvb_len:%u",lock->l_resource->lr_lvb_len); if (lock->l_resource->lr_lvb_len) { - printk("\n Inside function %s , inside condition lock->l_resource->lr_lvb_len so buffers=3",__func__); size[DLM_REPLY_REC_OFF] = lock->l_resource->lr_lvb_len; buffers = 3; } - //size[DLM_REPLY_REC_OFF] = 16; - //buffer = buffer + 1; - if(lock->l_resource->lr_lvb_len == 0) - { - buffers++; - size[DLM_REPLY_REC_OFF] = 0; - } - buffers++; - size[DLM_REPLY_REC_OFF+1] = 16; unlock_res_and_lock(lock); ! printk("\n Inside function %s , outside condition lock->l_resource->lr_lvb_len so buffers=2",__func__); if (OBD_FAIL_CHECK_ONCE(OBD_FAIL_LDLM_ENQUEUE_EXTENT_ERR)) GOTO(out, rc = -ENOMEM); --- 1122,1133 ---- int buffers = 2; lock_res_and_lock(lock); if (lock->l_resource->lr_lvb_len) { size[DLM_REPLY_REC_OFF] = lock->l_resource->lr_lvb_len; buffers = 3; } unlock_res_and_lock(lock); ! 
if (OBD_FAIL_CHECK_ONCE(OBD_FAIL_LDLM_ENQUEUE_EXTENT_ERR)) GOTO(out, rc = -ENOMEM); *************** existing_lock: *** 1156,1164 **** if (dlm_req->lock_desc.l_resource.lr_type == LDLM_EXTENT) lock->l_req_extent = lock->l_policy_data.l_extent; - printk("%s: \twill do lock-enq...\n", __func__); err = ldlm_lock_enqueue(obddev->obd_namespace, &lock, cookie, (int *)&flags); - printk("%s: \tafter lock-enq...\n", __func__); if (err) GOTO(out, err); --- 1141,1147 ---- *************** existing_lock: *** 1178,1185 **** dlm_rep->lock_flags |= dlm_req->lock_flags & LDLM_INHERIT_FLAGS; lock->l_flags |= dlm_req->lock_flags & LDLM_INHERIT_FLAGS; - str_target = lustre_msg_buf(req->rq_repmsg, DLM_REPLY_REC_OFF+1,16); - memcpy(str_target,str,16); /* Don''t move a pending lock onto the export if it has already * been evicted. Cancel it now instead. (bug 5683) */ if (req->rq_export->exp_failed || --- 1161,1166 ---- *************** existing_lock: *** 1232,1238 **** EXIT; out: - printk("\n [VM] Inside function %s got a hit at out",__func__); req->rq_status = rc ?: err; /* return either error - bug 11190 */ if (!req->rq_packed_final) { err = lustre_pack_reply(req, 1, NULL, NULL); --- 1213,1218 ---- *************** existing_lock: *** 1248,1257 **** if (rc == 0) { lock_res_and_lock(lock); - printk("\n Inside function %s , inside if condition rc=0 the place where we do a memcpy for offset = DLM_REPLY_REC_OFF",__func__); size[DLM_REPLY_REC_OFF] = lock->l_resource->lr_lvb_len; - printk("\n Inside function %s , size[DLM_REPLY_REC_OFF] : %u , lock->l_resource->lr_lvb_len :%u",__func__,size[DLM_REPLY_REC_OFF],lock->l_resource->lr_lvb_len); - size[DLM_REPLY_REC_OFF+1]= 16; if (size[DLM_REPLY_REC_OFF] > 0) { void *lvb = lustre_msg_buf(req->rq_repmsg, DLM_REPLY_REC_OFF, --- 1228,1234 ---- *************** existing_lock: *** 1264,1270 **** } unlock_res_and_lock(lock); } else { - printk("\n Inside function %s , inside else condition rc=0 the place where we do a memcpy for offset = DLM_REPLY_REC_OFF",__func__); lock_res_and_lock(lock); ldlm_resource_unlink_lock(lock); ldlm_lock_destroy_nolock(lock); --- 1241,1246 ---- -------------- next part -------------- *** ./lustre/ldlm/ldlm_request.c 2010-10-21 22:26:28.000000000 -0400 --- ../fresh/lustre/ldlm/ldlm_request.c 2010-10-15 15:37:02.000000000 -0400 *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 389,395 **** int cleanup_phase = 1; ENTRY; - printk("\n Inside function %s",__func__); lock = ldlm_handle2lock(lockh); /* ldlm_cli_enqueue is holding a reference on this lock. */ if (!lock) { --- 389,394 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 401,407 **** LASSERT(!is_replay); LDLM_DEBUG(lock, "client-side enqueue END (%s)", rc == ELDLM_LOCK_ABORTED ? "ABORTED" : "FAILED"); - printk("\n Inside %s if client lock aborted or failed",__func__); if (rc == ELDLM_LOCK_ABORTED) { /* Before we return, swab the reply */ reply = lustre_swab_repbuf(req, DLM_LOCKREPLY_OFF, --- 400,405 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 433,440 **** GOTO(cleanup, rc = -EPROTO); } - printk("\n Inside function %s we have received a reply",__func__); - /* lock enqueued on the server */ cleanup_phase = 0; --- 431,436 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 463,469 **** * again. 
*/ if ((*flags) & LDLM_FL_LOCK_CHANGED) { int newmode = reply->lock_desc.l_req_mode; - printk("\n Inside function %s in condition (*flags) & LDLM_FL_LOCK_CHANGED)",__func__); LASSERT(!is_replay); if (newmode && newmode != lock->l_req_mode) { LDLM_DEBUG(lock, "server returned different mode %s", --- 459,464 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 504,510 **** * because it cannot handle asynchronous ASTs robustly (see * bug 7311). */ (LIBLUSTRE_CLIENT && type == LDLM_EXTENT)) { - printk("\n Inside function %s in condition ((*flags) & LDLM_FL_AST_SENT ||(LIBLUSTRE_CLIENT && type == LDLM_EXTENT))",__func__); lock_res_and_lock(lock); lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_BL_AST; unlock_res_and_lock(lock); --- 499,504 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 515,521 **** * clobber the LVB with an older one. */ if (lvb_len && (lock->l_req_mode != lock->l_granted_mode)) { void *tmplvb; - printk("\n Inside function %s in condition lvb_len && (lock->l_req_mode != lock->l_granted_mode) , lvb_len:%d",__func__,lvb_len); tmplvb = lustre_swab_repbuf(req, DLM_REPLY_REC_OFF, lvb_len, lvb_swabber); if (tmplvb == NULL) --- 509,514 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 524,530 **** } if (!is_replay) { - printk("\n Inside function %s in condition !is_replay",__func__); rc = ldlm_lock_enqueue(ns, &lock, NULL, flags); if (lock->l_completion_ast != NULL) { int err = lock->l_completion_ast(lock, *flags, NULL); --- 517,522 ---- *************** int ldlm_cli_enqueue_fini(struct obd_exp *** 536,542 **** } if (lvb_len && lvb != NULL) { - printk("\n Inside function %s in condition lvb_len && lvb != NULL",__func__); /* Copy the LVB here, and not earlier, because the completion * AST (if any) can override what we got in the reply */ memcpy(lvb, lock->l_lvb_data, lvb_len); --- 528,533 ---- *************** static inline int ldlm_req_handles_avail *** 560,578 **** __u32 *size, int bufcount, int off) { int avail = min_t(int, LDLM_MAXREQSIZE, CFS_PAGE_SIZE - 512); ! printk("\n Inside function %s",__func__); ! printk("\n avail--before = %d",avail); avail -= lustre_msg_size(class_exp2cliimp(exp)->imp_msg_magic, bufcount, size); ! printk("\n avail--after = %d",avail); ! if (likely(avail >= 0)){ avail /= (int)sizeof(struct lustre_handle); - printk("\n avail--likely = %d",avail); - } else avail = 0; avail += LDLM_LOCKREQ_HANDLES - off; ! printk("\n avail--lats = %d",avail); return avail; } --- 551,565 ---- __u32 *size, int bufcount, int off) { int avail = min_t(int, LDLM_MAXREQSIZE, CFS_PAGE_SIZE - 512); ! avail -= lustre_msg_size(class_exp2cliimp(exp)->imp_msg_magic, bufcount, size); ! if (likely(avail >= 0)) avail /= (int)sizeof(struct lustre_handle); else avail = 0; avail += LDLM_LOCKREQ_HANDLES - off; ! return avail; } *************** struct ptlrpc_request *ldlm_prep_elc_req *** 597,622 **** CFS_LIST_HEAD(head); ENTRY; - printk("\n Inside function %s, opc=%d",__func__, opc); if (cancels == NULL) cancels = &head; if (exp_connect_cancelset(exp)) { /* Estimate the amount of free space in the request. */ - printk("\n Inside exp_connect_cancelset(exp) in func %s",__func__); LASSERT(bufoff < bufcount); avail = ldlm_req_handles_avail(exp, size, bufcount, canceloff); - printk("\n In function %s avail = %d",__func__,avail); flags = ns_connect_lru_resize(ns) ? LDLM_CANCEL_LRUR : LDLM_CANCEL_AGED; - printk("\n In function %s ns_connect_lru_resize(ns) :%d",__func__,ns_connect_lru_resize(ns)); to_free = !ns_connect_lru_resize(ns) && opc == LDLM_ENQUEUE ? 
1 : 0; /* Cancel lru locks here _only_ if the server supports * EARLY_CANCEL. Otherwise we have to send extra CANCEL * rpc, what will make us slower. */ - printk("\n In function %s count = %d",__func__,count); if (avail > count) count += ldlm_cancel_lru_local(ns, cancels, to_free, avail - count, 0, flags); --- 584,604 ---- *************** struct ptlrpc_request *ldlm_prep_elc_req *** 624,632 **** pack = count; else pack = avail; - printk("\n In function %s pack = %d",__func__,pack); size[bufoff] = ldlm_request_bufsize(pack, opc); - printk("\n In function %s , bufoff : %d , size[bufoff]= %u",__func__,bufoff,size[bufoff]); } req = ptlrpc_prep_req(class_exp2cliimp(exp), version, --- 606,612 ---- *************** struct ptlrpc_request *ldlm_prep_enqueue *** 657,663 **** struct list_head *cancels, int count) { - printk("\n Inside function %s \n",__func__); return ldlm_prep_elc_req(exp, LUSTRE_DLM_VERSION, LDLM_ENQUEUE, bufcount, size, DLM_LOCKREQ_OFF, LDLM_ENQUEUE_CANCEL_OFF, cancels, count); --- 637,642 ---- *************** int ldlm_cli_enqueue(struct obd_export * *** 679,697 **** struct ldlm_lock *lock; struct ldlm_request *body; struct ldlm_reply *reply; ! __u32 size[4] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREQ_OFF] = sizeof(*body), [DLM_REPLY_REC_OFF] = lvb_len ? lvb_len : ! sizeof(struct ost_lvb), ! [DLM_REPLY_REC_OFF+1] = 16}; int is_replay = *flags & LDLM_FL_REPLAY; int req_passed_in = 1, rc, err; struct ptlrpc_request *req; - int i; ENTRY; - printk("\n Inside function %s \n",__func__); - for(i=0;i<4;i++) printk("\n size[%d] : %d",i,size[i]); LASSERT(exp != NULL); /* If we''re replaying this lock, just check some invariants. --- 658,672 ---- struct ldlm_lock *lock; struct ldlm_request *body; struct ldlm_reply *reply; ! __u32 size[3] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREQ_OFF] = sizeof(*body), [DLM_REPLY_REC_OFF] = lvb_len ? lvb_len : ! sizeof(struct ost_lvb) }; int is_replay = *flags & LDLM_FL_REPLAY; int req_passed_in = 1, rc, err; struct ptlrpc_request *req; ENTRY; LASSERT(exp != NULL); /* If we''re replaying this lock, just check some invariants. *************** int ldlm_cli_enqueue(struct obd_export * *** 700,706 **** lock = ldlm_handle2lock(lockh); LASSERT(lock != NULL); LDLM_DEBUG(lock, "client-side enqueue START"); - printk("\n Client-side enqueue START in %s",__func__); LASSERT(exp == lock->l_conn_export); } else { lock = ldlm_lock_create(ns, res_id, einfo->ei_type, --- 675,680 ---- *************** int ldlm_cli_enqueue(struct obd_export * *** 736,742 **** /* lock not sent to server yet */ if (reqp == NULL || *reqp == NULL) { ! req = ldlm_prep_enqueue_req(exp,3, size, NULL, 0); if (req == NULL) { failed_lock_cleanup(ns, lock, lockh, einfo->ei_mode); LDLM_LOCK_PUT(lock); --- 710,716 ---- /* lock not sent to server yet */ if (reqp == NULL || *reqp == NULL) { ! req = ldlm_prep_enqueue_req(exp, 2, size, NULL, 0); if (req == NULL) { failed_lock_cleanup(ns, lock, lockh, einfo->ei_mode); LDLM_LOCK_PUT(lock); *************** int ldlm_cli_enqueue(struct obd_export * *** 746,752 **** if (reqp) *reqp = req; } else { - printk("\n [VM]got a hit at case where reqp is not NULL in %s",__func__); req = *reqp; LASSERTF(lustre_msg_buflen(req->rq_reqmsg, DLM_LOCKREQ_OFF) > sizeof(*body), "buflen[%d] = %d, not %d\n", --- 720,725 ---- *************** int ldlm_cli_enqueue(struct obd_export * *** 768,774 **** /* Continue as normal. */ if (!req_passed_in) { size[DLM_LOCKREPLY_OFF] = sizeof(*reply); ! 
ptlrpc_req_set_repsize(req, 4, size); } /* --- 741,747 ---- /* Continue as normal. */ if (!req_passed_in) { size[DLM_LOCKREPLY_OFF] = sizeof(*reply); ! ptlrpc_req_set_repsize(req, 3, size); } /* *************** int ldlm_cli_enqueue(struct obd_export * *** 784,793 **** RETURN(0); } - printk("\n in --func-- %s SENDING REQUEST",__func__); LDLM_DEBUG(lock, "sending request"); rc = ptlrpc_queue_wait(req); - printk("\n in --func-- %s REQUEST SENT after ptlrpc_queue_wait",__func__); err = ldlm_cli_enqueue_fini(exp, req, einfo->ei_type, policy ? 1 : 0, einfo->ei_mode, flags, lvb, lvb_len, lvb_swabber, lockh, rc); --- 757,764 ---- -------------- next part -------------- *** ./lustre/mdc/mdc_locks.c 2010-10-20 20:58:51.000000000 -0400 --- ../fresh/lustre/mdc/mdc_locks.c 2010-10-15 15:37:15.000000000 -0400 *************** static struct ptlrpc_request *mdc_intent *** 252,264 **** int repbufcount = 5; int mode; int rc; - int i; ENTRY; - printk("\n Inside function %s",__func__); - for(i=0;i<6;i++) printk("\n size[%d] : %d",i,size[i]); - for(i=0;i<5;i++) printk("\n repsize[%d] : %d",i,repsize[i]); - it->it_create_mode = (it->it_create_mode & ~S_IFMT) | S_IFREG; if (mdc_exp_is_2_0_server(exp)) { size[DLM_INTENT_REC_OFF] = sizeof(struct mdt_rec_create); --- 252,259 ---- *************** static struct ptlrpc_request *mdc_intent *** 381,405 **** struct ptlrpc_request *req; struct ldlm_intent *lit; struct obd_device *obddev = class_exp2obd(exp); ! __u32 size[] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREQ_OFF] = sizeof(struct ldlm_request), [DLM_INTENT_IT_OFF] = sizeof(*lit), [DLM_INTENT_REC_OFF] = sizeof(struct mdt_body), [DLM_INTENT_REC_OFF+1]= data->namelen + 1, ! [DLM_INTENT_REC_OFF+2]= 0, ! [DLM_INTENT_REC_OFF+3]= 16 }; ! __u32 repsize[] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREPLY_OFF] = sizeof(struct ldlm_reply), [DLM_REPLY_REC_OFF] = sizeof(struct mdt_body), [DLM_REPLY_REC_OFF+1] = obddev->u.cli. cl_max_mds_easize, [DLM_REPLY_REC_OFF+2] = LUSTRE_POSIX_ACL_MAX_SIZE, ! [DLM_REPLY_REC_OFF+3] = 0, ! [DLM_REPLY_REC_OFF+4] = 16 }; obd_valid valid = OBD_MD_FLGETATTR | OBD_MD_FLEASIZE | OBD_MD_FLACL | OBD_MD_FLMODEASIZE | OBD_MD_FLDIREA; ! int bufcount = 6; ! int i=0; ENTRY; if (mdc_exp_is_2_0_server(exp)) { --- 376,397 ---- struct ptlrpc_request *req; struct ldlm_intent *lit; struct obd_device *obddev = class_exp2obd(exp); ! __u32 size[6] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREQ_OFF] = sizeof(struct ldlm_request), [DLM_INTENT_IT_OFF] = sizeof(*lit), [DLM_INTENT_REC_OFF] = sizeof(struct mdt_body), [DLM_INTENT_REC_OFF+1]= data->namelen + 1, ! [DLM_INTENT_REC_OFF+2]= 0 }; ! __u32 repsize[6] = { [MSG_PTLRPC_BODY_OFF] = sizeof(struct ptlrpc_body), [DLM_LOCKREPLY_OFF] = sizeof(struct ldlm_reply), [DLM_REPLY_REC_OFF] = sizeof(struct mdt_body), [DLM_REPLY_REC_OFF+1] = obddev->u.cli. cl_max_mds_easize, [DLM_REPLY_REC_OFF+2] = LUSTRE_POSIX_ACL_MAX_SIZE, ! [DLM_REPLY_REC_OFF+3] = 0 }; obd_valid valid = OBD_MD_FLGETATTR | OBD_MD_FLEASIZE | OBD_MD_FLACL | OBD_MD_FLMODEASIZE | OBD_MD_FLDIREA; ! 
int bufcount = 5; ENTRY; if (mdc_exp_is_2_0_server(exp)) { *************** static struct ptlrpc_request *mdc_intent *** 407,418 **** size[DLM_INTENT_REC_OFF+2] = data->namelen + 1; bufcount = 6; } - - printk("%s: prep-enq-req: bufcnt=%d\n", __func__, bufcount); - for(i=0; i<bufcount; i++) { - printk("\tsize[%d]=%u\n", i,size[i] ); - printk("\trepsize[%d]=%u\n", i,repsize[i] ); - } req = ldlm_prep_enqueue_req(exp, bufcount, size, NULL, 0); if (req) { /* pack the intent */ --- 399,404 ---- *************** static int mdc_finish_enqueue(struct obd *** 455,461 **** struct ldlm_reply *lockrep; ENTRY; - printk("\n Inside function %s",__func__); LASSERT(rc >= 0); /* Similarly, if we''re going to replay this request, we don''t want to * actually get a lock, just perform the intent. */ --- 441,446 ---- *************** static int mdc_finish_enqueue(struct obd *** 517,523 **** /* We know what to expect, so we do any byte flipping required here */ if (it->it_op & (IT_OPEN | IT_UNLINK | IT_LOOKUP | IT_GETATTR)) { struct mds_body *body; ! printk("\n Inside function %s inside condition IT_OPEN , IT_LOOKUP , IT_GETATTR",__func__); body = lustre_swab_repbuf(req, DLM_REPLY_REC_OFF, sizeof(*body), lustre_swab_mds_body); if (body == NULL) { --- 502,508 ---- /* We know what to expect, so we do any byte flipping required here */ if (it->it_op & (IT_OPEN | IT_UNLINK | IT_LOOKUP | IT_GETATTR)) { struct mds_body *body; ! body = lustre_swab_repbuf(req, DLM_REPLY_REC_OFF, sizeof(*body), lustre_swab_mds_body); if (body == NULL) { *************** int mdc_enqueue(struct obd_export *exp, *** 587,593 **** int rc; ENTRY; - printk("\n Inside function %s \n",__func__); fid_build_reg_res_name((void *)&data->fid1, &res_id); LASSERTF(einfo->ei_type == LDLM_IBITS,"lock type %d\n", einfo->ei_type); if (it->it_op & (IT_UNLINK | IT_GETATTR | IT_READDIR)) --- 572,577 ---- *************** int mdc_intent_getattr_async(struct obd_ *** 924,933 **** int flags = LDLM_FL_HAS_INTENT; ENTRY; - printk("%s: name: %.*s in inode "LPU64", intent: %s flags %#o\n",__func__, - op_data->namelen, op_data->name, op_data->fid1.id, - ldlm_it2str(it->it_op), it->it_flags); - CDEBUG(D_DLMTRACE,"name: %.*s in inode "LPU64", intent: %s flags %#o\n", op_data->namelen, op_data->name, op_data->fid1.id, ldlm_it2str(it->it_op), it->it_flags); --- 908,913 ----
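Regarding the panic itself, one pattern worth checking in the ldlm_lockd.c hunk above: lustre_msg_buf() returns NULL when the reply message was not packed with a buffer at the requested offset (or the buffer is smaller than asked for), and for intent enqueues the reply layout is likely chosen by the MDS intent handler rather than by ldlm_handle_enqueue() itself, so an unconditional memcpy() into DLM_REPLY_REC_OFF+1 can easily dereference NULL. A hedged sketch of a defensive variant, as a fragment in the style of the 1.8 tree and using only helpers already visible in the patch:

/* Copy an extra payload into a reply buffer only if that buffer actually
 * exists and is large enough; lustre_msg_buf() returns NULL otherwise. */
static int ldlm_copy_extra_repbuf(struct ptlrpc_request *req, int offset,
                                  const void *data, int len)
{
        void *buf = lustre_msg_buf(req->rq_repmsg, offset, len);

        if (buf == NULL) {
                CERROR("reply has no buffer at offset %d (len %d)\n",
                       offset, len);
                return -EPROTO;
        }
        memcpy(buf, data, len);
        return 0;
}

The client side has the matching constraint: the buffer count and sizes handed to ptlrpc_req_set_repsize() in ldlm_cli_enqueue() (see the ldlm_request.c hunk) need to agree with what the server actually packs, or the reply will not fit the preallocated reply buffer.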