I am getting quite a few errors similar to the following on the MDS
server, which is running the latest 1.6.7.1 patched kernel. The clients
are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
and this cluster has 130 nodes/Lustre clients and uses a GigE network.

May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0

May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0

I don't see the same errors on another cluster/Lustre installation with
2000 Lustre clients which uses an InfiniBand network.

I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
not sure if any of those bugs apply to this error. I would appreciate it
if someone could help me understand these errors and possibly suggest a
solution.

TIA
Nirmal
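P.S. For what it's worth, the -116 in the second message appears to be
ESTALE. A quick way to double-check that mapping on one of the nodes
(just a sketch, assuming Python is installed, as it is on stock RHEL5):

  # decode errno 116 on Linux; should print ESTALE and its description
  python -c 'import errno, os; print errno.errorcode[116], os.strerror(116)'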
On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
> I am getting quite a few errors similar to the following on the MDS
> server, which is running the latest 1.6.7.1 patched kernel. The clients
> are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
> and this cluster has 130 nodes/Lustre clients and uses a GigE network.
>
> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>
> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>
> I don't see the same errors on another cluster/Lustre installation with
> 2000 Lustre clients which uses an InfiniBand network.

we see this sometimes when a job that is using a shared library that
lives on Lustre is killed - presumably the un-memory-mapping of the .so
from a bunch of nodes at once confuses Lustre a bit.

what is your inode 772769? eg.

  find /some/lustre/fs/ -inum 772769

if the file is a .so then that would be similar to what we are seeing.

we have this listed in the "probably harmless" section of the errors
that we get from Lustre, so if it's not harmless then we'd very much
like to know about it :)

this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless
1.6.4.3 on clients w/ 2.6.23.17 kernels.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

> I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
> not sure if any of those bugs apply to this error. I would appreciate it
> if someone could help me understand these errors and possibly suggest a
> solution.
>
> TIA
> Nirmal
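To batch that find check over every inode that shows up in these close
errors on the MDS, a rough sketch along these lines might help; the
syslog path and the /mnt/lustre mount point are placeholders, not the
real paths on either cluster:

  # on the MDS: collect the inode numbers from the close errors
  grep 'no handle for file close ino' /var/log/messages \
      | sed 's/.*close ino \([0-9]*\):.*/\1/' | sort -un > /tmp/close-inums

  # copy /tmp/close-inums to a client that has the filesystem mounted,
  # then resolve each inode to a path (walks the whole tree, so it is slow)
  while read ino; do
      find /mnt/lustre -inum "$ino" 2>/dev/null
  done < /tmp/close-inums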
I got a couple more of these errors over the weekend. One of the files
on which the error occurred was an ASCII log file, while the other was a
dynamically linked MPI binary that was being accessed from multiple
nodes. The PBS job that was running was a hybrid MPI/OpenMP program
using 20 nodes and 6 cores per node, and it was killed when it exceeded
its walltime. The user confirmed that there was no corruption in any of
the output files. (A rough way to check which processes still had the
binary memory-mapped is sketched after the quoted message below.)

The following are the error messages that I found in the log files:

May 10 14:59:37 lustre3 kernel: LustreError: 7566:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 2570697: cookie 0xcfe66441300e06ad req@ffff81041057a800 x2975034/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc 0/0

May 10 14:59:37 lustre3 kernel: LustreError: 7566:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff81041057a800 x2975034/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc -116/0

May 10 14:59:37 lustre3 kernel: LustreError: 7739:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 2558291: cookie 0xcfe66441300e07da req@ffff81040fa46c00 x2975035/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc 0/0

May 10 14:59:37 lustre3 kernel: LustreError: 7739:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff81040fa46c00 x2975035/t0 o35->30090fc1-eb50-ca15-b57a-41ea32f1d9db@:0/0 lens 296/1680 e 0 to 0 dl 1241985583 ref 1 fl Interpret:/0/0 rc -116/0

Nirmal

Robin Humble wrote:
> On Thu, May 07, 2009 at 10:45:31AM -0500, Nirmal Seenu wrote:
>> I am getting quite a few errors similar to the following on the MDS
>> server, which is running the latest 1.6.7.1 patched kernel. The clients
>> are running the 1.6.7 patchless client on the 2.6.18-128.1.6.el5 kernel,
>> and this cluster has 130 nodes/Lustre clients and uses a GigE network.
>>
>> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(mds_open.c:1567:mds_close()) @@@ no handle for file close ino 772769: cookie 0xcfe66441310829d4 req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc 0/0
>>
>> May 7 04:13:48 lustre3 kernel: LustreError: 7213:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101ca8a3800 x2681218/t0 o35->fedc91f9-4de7-c789-6bdd-1de1f5e3dd33@NET_0x20000c0a8f109_UUID:0/0 lens 296/1680 e 0 to 0 dl 1241687634 ref 1 fl Interpret:/0/0 rc -116/0
>>
>> I don't see the same errors on another cluster/Lustre installation with
>> 2000 Lustre clients which uses an InfiniBand network.
>
> we see this sometimes when a job that is using a shared library that
> lives on Lustre is killed - presumably the un-memory-mapping of the .so
> from a bunch of nodes at once confuses Lustre a bit.
>
> what is your inode 772769? eg.
>
>   find /some/lustre/fs/ -inum 772769
>
> if the file is a .so then that would be similar to what we are seeing.
>
> we have this listed in the "probably harmless" section of the errors
> that we get from Lustre, so if it's not harmless then we'd very much
> like to know about it :)
>
> this cluster is IB, rhel5, x86_64, 1.6.6 on servers and patchless
> 1.6.4.3 on clients w/ 2.6.23.17 kernels.
>
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility
>
>> I looked at the following bugs: 19328, 18946, 18192 and 19085, but I am
>> not sure if any of those bugs apply to this error. I would appreciate it
>> if someone could help me understand these errors and possibly suggest a
>> solution.
>>
>> TIA
>> Nirmal
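Following up on the memory-mapping theory above: a rough way to see
which processes on a compute node still have the MPI binary (or any
Lustre-resident .so) mapped is sketched below. The mount point and
binary path are placeholders rather than real paths from this cluster:

  # list processes that have the binary open or memory-mapped
  lsof /mnt/lustre/path/to/mpi_binary

  # or scan every process's maps for anything under the Lustre mount
  grep -l '/mnt/lustre' /proc/[0-9]*/maps 2>/dev/null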