Frederik Ferner
2010-May-06 15:57 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
On our Lustre system we are seeing the following error fairly regularly. So far we have not had complaints from users and have not noticed any negative effects, but it would still be nice to understand the errors better. The systems reporting these errors are NFS exporters for subtrees of the Lustre file system.

On the Lustre client/NFS server:

May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.68.8@tcp. The mds_getattr_lock operation failed with -13
May 6 14:23:09 i16-storage1 kernel: LustreError: Skipped 10 previous similar messages
May 6 14:23:09 i16-storage1 kernel: LustreError: 3515:0:(llite_nfs.c:223:ll_get_parent()) failure -13 inode 108443563 get parent
May 6 14:23:09 i16-storage1 kernel: LustreError: 3515:0:(llite_nfs.c:223:ll_get_parent()) Skipped 10 previous similar messages

On the MDS:

May 6 14:23:08 cs04r-sc-mds01-01 kernel: LustreError: 3595:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-13) req@ffff81042936a000 x4806957/t0 o34->33a488dc-5987-fee2-b810-00ff4304bf53@NET_0x20000ac176821_UUID:0/0 lens 312/128 e 0 to 0 dl 1273152288 ref 1 fl Interpret:/0/0 rc -13/0
May 6 14:23:08 cs04r-sc-mds01-01 kernel: LustreError: 3595:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped 14 previous similar messages

We've checked the inodes mentioned in the various messages and can't spot anything that would make them different from other directories where this does not seem to happen. Unfortunately we have so far not been able to reproduce it.

Does anyone know if we should worry about these messages, or if we can safely ignore them? Or should we assume that some of our users have a problem accessing data that they have just not reported, even though I find that unlikely? I've seen a thread mentioning similar messages [1] but could not find any conclusion.

Our MDS, OSSes and the clients involved are all running Lustre 1.6.7.2.ddn3.5 on RHEL5. If necessary I can probably find exactly which patches the ddn3.5 version has applied on top of 1.6.7.2.

Kind regards,
Frederik

[1] http://lists.lustre.org/pipermail/lustre-discuss/2008-January/006309.html

--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.)
Andreas Dilger
2010-May-07 04:32 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
On 2010-05-06, at 11:57, Frederik Ferner wrote:
> On our Lustre system we are seeing the following error fairly regularly,
> so far we have not had complaints from users and have not noticed any
> negative effects, but it would still be nice to understand the errors
> better. The systems reporting these errors are NFS exporters for
> subtrees of the Lustre file system.
>
> On the Lustre client/NFS server:
>
> May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.23.68.8@tcp. The mds_getattr_lock
> operation failed with -13

-13 is -EACCES (per /usr/include/asm-generic/errno-base.h) or equivalent.

That just means that someone tried to access a file they don't have permission to access. Why this is being printed on the console is a bit of a mystery, since I haven't seen anything similar. I wonder if NFS is going down some obscure code path that is returning the error to the RPC handler instead of stashing this "normal" error code inside the reply.

In any case it is harmless and expected (sigh). I'd hope it would have been removed in newer versions, but I don't know at all.

> Does anyone know if we should worry about those messages or if we can
> safely ignore them? Or should we assume that some of our users might
> have a problem accessing data that they have just not reported? Even
> though I find that unlikely.

Cheers,
Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
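[Editorial aside, not part of the original exchange: the errno lookup Andreas describes can be checked from Python's standard library instead of reading errno-base.h by hand. Lustre console messages report the negated errno, so "-13" corresponds to errno 13.]

```python
import errno
import os

# The console message reports the negated errno: "operation failed with -13".
code = abs(-13)

name = errno.errorcode[code]  # symbolic name for errno 13
desc = os.strerror(code)      # human-readable description (platform-dependent)

print("%d = %s (%s)" % (code, name, desc))
```

On Linux this prints `13 = EACCES (Permission denied)`, matching the "-EACCES" diagnosis above.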
Frederik Ferner
2010-May-07 09:12 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
Andreas, thanks for your reply.

Andreas Dilger wrote:
> On 2010-05-06, at 11:57, Frederik Ferner wrote:
>> On our Lustre system we are seeing the following error fairly
>> regularly, so far we have not had complaints from users and have
>> not noticed any negative effects, but it would still be nice to
>> understand the errors better. The systems reporting these errors
>> are NFS exporters for subtrees of the Lustre file system.
>>
>> On the Lustre client/NFS server:
>>
>> May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error
>> occurred while communicating with 172.23.68.8@tcp. The
>> mds_getattr_lock operation failed with -13
>
> -13 is -EACCES (per /usr/include/asm-generic/errno-base.h) or
> equivalent
>
> That just means that someone tried to access a file they don't have
> permission to access. As to why this is being printed on the console
> is a bit of a mystery, since I haven't seen anything similar. I
> wonder if NFS is going down some obscure code path that is returning
> the error to the RPC handler instead of stashing this "normal" error
> code inside the reply.

It does not happen every time someone tries to access a directory/file they don't have access to; a simple attempt to change into a directory where you don't have enough permissions does not trigger the log entry. I still suspect some of our users/applications are doing something strange, but I'm happy to ignore these errors unless a user complains and we can reproduce it.

> In any case it is harmless and expected (sigh). I'd hope it would
> have been removed in newer versions, but I don't know at all.
>
>> Does anyone know if we should worry about those messages or if we
>> can safely ignore them? Or should we assume that some of our users
>> might have a problem accessing data that they have just not
>> reported? Even though I find that unlikely.

Thanks,
Frederik

--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
Andreas Dilger
2010-May-07 18:45 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
On 2010-05-07, at 05:12, Frederik Ferner wrote:
> Andreas Dilger wrote:
>> On 2010-05-06, at 11:57, Frederik Ferner wrote:
>>> On our Lustre system we are seeing the following error fairly
>>> regularly, so far we have not had complaints from users and have
>>> not noticed any negative effects, but it would still be nice to
>>> understand the errors better. The systems reporting these errors
>>> are NFS exporters for subtrees of the Lustre file system.
>>>
>>> On the Lustre client/NFS server:
>>>
>>> May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error
>>> occurred while communicating with 172.23.68.8@tcp. The
>>> mds_getattr_lock operation failed with -13
>>
>> -13 is -EACCES (per /usr/include/asm-generic/errno-base.h) or
>> equivalent. That just means that someone tried to access a file
>> they don't have permission to access. As to why this is being
>> printed on the console is a bit of a mystery, since I haven't seen
>> anything similar. I wonder if NFS is going down some obscure code
>> path that is returning the error to the RPC handler instead of
>> stashing this "normal" error code inside the reply.
>
> It does not happen every time someone tries to access a
> directory/file they don't have access to; a simple attempt to change
> into a directory where you don't have enough permissions does not
> trigger the log entry. I still suspect some of our users/applications
> are doing something strange, but I'm happy to ignore these errors
> unless a user complains and we can reproduce it.

It would still be good to figure out what is causing it. If you can accept the performance impact, you could enable more Lustre debugging on the MDS, and then e.g. have a syslog trigger that dumps the kernel debug log when this message is printed:

lctl set_param debug=+rpctrace   # will have minor impact
lctl set_param debug=+entry      # might have significant impact

That said, I'd hate to go chasing a bug in 1.6.x that is fixed in 1.8 already.

Cheers,
Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
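[Editorial aside, not part of the original exchange: Andreas's "syslog trigger" idea could be sketched roughly as below. Everything here is an assumption for illustration — the log path, the `tail -F` approach, and the dump-file naming are made up; `lctl dk <file>` dumps and clears the kernel debug buffer. This is a sketch, not a tested monitoring tool.]

```python
import re
import subprocess
import time

# Pattern matching the console message quoted in this thread.
ERR_RE = re.compile(r"LustreError: .*operation failed with (-\d+)")

def dump_debug_log(dest=None):
    """Dump (and clear) the Lustre kernel debug buffer with 'lctl dk'."""
    if dest is None:
        dest = "/tmp/lustre-debug-%d.log" % int(time.time())
    subprocess.run(["lctl", "dk", dest], check=True)
    return dest

def watch(logfile="/var/log/messages"):
    """Follow syslog and dump the debug buffer when the error appears."""
    tail = subprocess.Popen(["tail", "-F", logfile],
                            stdout=subprocess.PIPE, text=True)
    for line in tail.stdout:
        m = ERR_RE.search(line)
        if m and m.group(1) == "-13":
            print("matched:", line.strip())
            print("debug log dumped to", dump_debug_log())
```

A real deployment would more likely hang this off syslog-ng/rsyslog program output than a `tail` loop, but the trigger logic is the same.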
Frederik Ferner
2010-May-12 10:53 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
Andreas Dilger wrote:
> On 2010-05-07, at 05:12, Frederik Ferner wrote:
>> Andreas Dilger wrote:
>>> On 2010-05-06, at 11:57, Frederik Ferner wrote:
>>>> On our Lustre system we are seeing the following error fairly
>>>> regularly, so far we have not had complaints from users and
>>>> have not noticed any negative effects, but it would still be
>>>> nice to understand the errors better. The systems reporting
>>>> these errors are NFS exporters for subtrees of the Lustre file
>>>> system.
>>>>
>>>> On the Lustre client/NFS server:
>>>>
>>>> May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error
>>>> occurred while communicating with 172.23.68.8@tcp. The
>>>> mds_getattr_lock operation failed with -13
>>>
>>> -13 is -EACCES (per /usr/include/asm-generic/errno-base.h) or
>>> equivalent. That just means that someone tried to access a file
>>> they don't have permission to access. As to why this is being
>>> printed on the console is a bit of a mystery, since I haven't
>>> seen anything similar. I wonder if NFS is going down some obscure
>>> code path that is returning the error to the RPC handler instead
>>> of stashing this "normal" error code inside the reply.
>>
>> It does not happen every time someone tries to access a
>> directory/file they don't have access to; a simple attempt to
>> change into a directory where you don't have enough permissions
>> does not trigger the log entry. I still suspect some of our
>> users/applications are doing something strange, but I'm happy to
>> ignore these errors unless a user complains and we can reproduce
>> it.
>
> It would still be good to figure out what is causing it. If you can
> accept the performance impact, you could enable more Lustre
> debugging on the MDS, and then e.g. have a syslog trigger that dumps
> the kernel debug log when this message is printed:
>
> lctl set_param debug=+rpctrace   # will have minor impact
> lctl set_param debug=+entry      # might have significant impact

We may be able to do that in our next maintenance window at the beginning of June. We'll report back. So far we have not managed to reproduce it on our test file system, so we can't test there.

> That said, I'd hate to go chasing a bug in 1.6.x that is fixed in 1.8
> already.

Understood; unfortunately we are not really in a position to upgrade to 1.8 any time soon. And as it has not caused any real problem as far as I can tell, we are not going to force the upgrade just because of these log entries.

Thanks,
Frederik

--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
Frederik Ferner
2010-Sep-08 16:08 UTC
[Lustre-discuss] Lustre, NFS and mds_getattr_lock operation
Hi Andreas, list,

Reviving an old thread now that I have managed to look into it during the current maintenance window. Note though that we are evaluating an upgrade to 1.8.4 in the near future, so I would not call this a high-priority investigation; I'm partly doing it to see how far I can get, and partly to make sure it's not hiding a real problem.

Unfortunately I don't really understand the debug logs myself. I've attached one of the debug logs that I created on our MDS after running 'lctl set_param debug=+rpctrace' and 'lctl set_param debug=+trace'; the log was dumped as soon as the error message appeared in /var/log/messages. Note that I did not manage to set the suggested debug flag "+entry", that returned:

lnet.debug=+entry
error: set_param: writing to file /proc/sys/lnet/debug: Invalid argument

Any help understanding the debug log etc. would be much appreciated.

Kind regards,
Frederik

Frederik Ferner wrote:
> Andreas Dilger wrote:
>> On 2010-05-07, at 05:12, Frederik Ferner wrote:
>>> Andreas Dilger wrote:
>>>> On 2010-05-06, at 11:57, Frederik Ferner wrote:
>>>>> On our Lustre system we are seeing the following error fairly
>>>>> regularly, so far we have not had complaints from users and
>>>>> have not noticed any negative effects, but it would still be
>>>>> nice to understand the errors better. The systems reporting
>>>>> these errors are NFS exporters for subtrees of the Lustre file
>>>>> system.
>>>>>
>>>>> On the Lustre client/NFS server:
>>>>>
>>>>> May 6 14:23:09 i16-storage1 kernel: LustreError: 11-0: an error
>>>>> occurred while communicating with 172.23.68.8@tcp. The
>>>>> mds_getattr_lock operation failed with -13
>>>>
>>>> -13 is -EACCES (per /usr/include/asm-generic/errno-base.h) or
>>>> equivalent. That just means that someone tried to access a file
>>>> they don't have permission to access. As to why this is being
>>>> printed on the console is a bit of a mystery, since I haven't
>>>> seen anything similar. I wonder if NFS is going down some
>>>> obscure code path that is returning the error to the RPC handler
>>>> instead of stashing this "normal" error code inside the reply.
>>>
>>> It does not happen every time someone tries to access a
>>> directory/file they don't have access to; a simple attempt to
>>> change into a directory where you don't have enough permissions
>>> does not trigger the log entry. I still suspect some of our
>>> users/applications are doing something strange, but I'm happy to
>>> ignore these errors unless a user complains and we can reproduce
>>> it.
>>
>> It would still be good to figure out what is causing it. If you
>> can accept the performance impact, you could enable more Lustre
>> debugging on the MDS, and then e.g. have a syslog trigger that dumps
>> the kernel debug log when this message is printed:
>>
>> lctl set_param debug=+rpctrace   # will have minor impact
>> lctl set_param debug=+entry      # might have significant impact
>
> We may be able to do that in our next maintenance window at the
> beginning of June. We'll report back.
>
> So far we have not managed to reproduce it on our test file system so
> we can't test there.
>
>> That said, I'd hate to go chasing a bug in 1.6.x that is fixed in 1.8
>> already.
>
> Understood; unfortunately we are not really in a position to upgrade to
> 1.8 any time soon. And as it has not caused any real problem as far as
> I can tell, we are not going to force the upgrade just because of these
> log entries.
>
> Thanks,
> Frederik

--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110

-------------- next part --------------
A non-text attachment was scrubbed...
Name: processing_e_161808534488000.gz
Type: application/x-gzip
Size: 3120614 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100908/80cbef17/attachment-0001.gz
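[Editorial aside, not part of the original thread: the MDS console line quoted throughout this thread can be dissected mechanically, which helps when correlating it with a debug dump like the attached one. A sketch follows; the field interpretation (x<xid>/t<transno>, o<opcode>, client UUID@NID, trailing rc <rc>/<status>) is our reading of the `target_send_reply_msg()` "@@@" format and should be double-checked against the Lustre source. The "@" characters mangled to " at " by the list archiver are restored in the sample.]

```python
import re

# One "@@@" line from target_send_reply_msg(), as seen on the MDS console.
SAMPLE = ("LustreError: 3595:0:(ldlm_lib.c:1643:target_send_reply_msg()) "
          "@@@ processing error (-13) req@ffff81042936a000 x4806957/t0 "
          "o34->33a488dc-5987-fee2-b810-00ff4304bf53@NET_0x20000ac176821_UUID:0/0 "
          "lens 312/128 e 0 to 0 dl 1273152288 ref 1 fl Interpret:/0/0 rc -13/0")

# Presumed field layout: RPC xid/transno, opcode, client export, final rc.
REQ_RE = re.compile(
    r"x(?P<xid>\d+)/t(?P<transno>\d+) "
    r"o(?P<opcode>\d+)->(?P<client>\S+) "
    r".*rc (?P<rc>-?\d+)/"
)

m = REQ_RE.search(SAMPLE)
if m:
    print("xid=%s opcode=%s rc=%s" % (m.group("xid"), m.group("opcode"),
                                      m.group("rc")))
    print("client=%s" % m.group("client"))
```

Grepping a debug dump for the extracted xid (here 4806957) should locate the full RPC trace for the failing request once `+rpctrace` is enabled.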