I am getting quite a few errors similar to the following on the MDS
server, which is running the latest 1.6.7.1 patched kernel. The clients
are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
and this cluster has 130 nodes/Lustre clients and uses a GigE network.

May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0

May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0

I don't see the same errors on another cluster/Lustre installation with
2000 Lustre clients which uses an InfiniBand network.

I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
not sure if any of those bugs apply to this error. I would appreciate it
if someone could help me understand these errors and possibly suggest a
solution.

TIA
Nirmal
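P.S. For what it's worth, the -116 in the second message appears to be
ESTALE. A quick way to double-check that mapping on one of the nodes
(just a sketch, assuming Python is installed, as it is on stock RHEL5):

  # decode errno 116 on Linux; should print ESTALE and its description
  python -c 'import errno, os; print errno.errorcode[116], os.strerror(116)'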
On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
> I am getting quite a few errors similar to the following on the MDS
> server, which is running the latest 1.6.7.1 patched kernel. The clients
> are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
> and this cluster has 130 nodes/Lustre clients and uses a GigE network.
>
> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>
> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>
> I don't see the same errors on another cluster/Lustre installation with
> 2000 Lustre clients which uses an InfiniBand network.

we see this sometimes when a job that is using a shared library that
lives on Lustre is killed - presumably the un-memory-mapping of the .so
from a bunch of nodes at once confuses Lustre a bit.

what is your inode 772769? eg.

  find /some/lustre/fs/ -inum 772769

if the file is a .so then that would be similar to what we are seeing.

we have this listed in the "probably harmless" section of the errors
that we get from Lustre, so if it's not harmless then we'd very much
like to know about it :)

this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless
1.6.4.3 on clients w/ 2.6.23.17 kernels.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

> I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
> not sure if any of those bugs apply to this error. I would appreciate it
> if someone could help me understand these errors and possibly suggest a
> solution.
>
> TIA
> Nirmal
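To batch that find check over every inode that shows up in these close
errors on the MDS, a rough sketch along these lines might help; the
syslog path and the /mnt/lustre mount point are placeholders, not the
real paths on either cluster:

  # on the MDS: collect the inode numbers from the close errors
  grep 'no handle for file close ino' /var/log/messages \
      | sed 's/.*close ino \([0-9]*\):.*/\1/' | sort -un > /tmp/close-inums

  # copy /tmp/close-inums to a client that has the filesystem mounted,
  # then resolve each inode to a path (walks the whole tree, so it is slow)
  while read ino; do
      find /mnt/lustre -inum "$ino" 2>/dev/null
  done < /tmp/close-inums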
I got a couple more of these errors over the weekend. One of the files
on which the error occurred was an ASCII log file, while the other was a
dynamically linked MPI binary that was being accessed from multiple
nodes. The PBS job that was running was a hybrid MPI/OpenMP program
using 20 nodes and 6 cores per node, and it was killed when it exceeded
its walltime. The user confirmed that there was no corruption in any of
the output files. (A rough way to check which processes still had the
binary memory-mapped is sketched after the quoted message below.)

The following are the error messages that I found in the log files:

May 10 14:59:37 lustre3 kernel: LustreError: 7566:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 2570697: cookie 0xcfe66441300e06ad req@ffff81041057a800 x2975034/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc 0/0

May 10 14:59:37 lustre3 kernel: LustreError: 7566:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff81041057a800 x2975034/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc -116/0

May 10 14:59:37 lustre3 kernel: LustreError: 7739:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 2558291: cookie 0xcfe66441300e07da req@ffff81040fa46c00 x2975035/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc 0/0

May 10 14:59:37 lustre3 kernel: LustreError: 7739:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff81040fa46c00 x2975035/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc -116/0

Nirmal

Robin Humble wrote:
> On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
>> I am getting quite a few errors similar to the following on the MDS
>> server, which is running the latest 1.6.7.1 patched kernel. The clients
>> are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
>> and this cluster has 130 nodes/Lustre clients and uses a GigE network.
>>
>> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>>
>> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>>
>> I don't see the same errors on another cluster/Lustre installation with
>> 2000 Lustre clients which uses an InfiniBand network.
>
> we see this sometimes when a job that is using a shared library that
> lives on Lustre is killed - presumably the un-memory-mapping of the .so
> from a bunch of nodes at once confuses Lustre a bit.
>
> what is your inode 772769? eg.
>
>   find /some/lustre/fs/ -inum 772769
>
> if the file is a .so then that would be similar to what we are seeing.
>
> we have this listed in the "probably harmless" section of the errors
> that we get from Lustre, so if it's not harmless then we'd very much
> like to know about it :)
>
> this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless
> 1.6.4.3 on clients w/ 2.6.23.17 kernels.
>
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility
>
>> I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
>> not sure if any of those bugs apply to this error. I would appreciate it
>> if someone could help me understand these errors and possibly suggest a
>> solution.
>>
>> TIA
>> Nirmal
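Following up on the memory-mapping theory above: a rough way to see
which processes on a compute node still have the MPI binary (or any
Lustre-resident .so) mapped is sketched below. The mount point and
binary path are placeholders rather than real paths from this cluster:

  # list processes that have the binary open or memory-mapped
  lsof /mnt/lustre/path/to/mpi_binary

  # or scan every process's maps for anything under the Lustre mount
  grep -l '/mnt/lustre' /proc/[0-9]*/maps 2>/dev/null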