Craig Prescott
2008-Feb-15 18:15 UTC
[Lustre-discuss] mds_close operation failed with -107
Hi; We had a couple of our Lustre clients get horked last night (one o2ib, and one tcp), df on the clients reported ''Input/output error'' for the Lustre fs. Here''s what was in the syslog of our MDS/MGS regarding the two nodes: Feb 14 22:50:43 hpcmds kernel: LustreError: 26165:0:(handler.c:1498:mds_handle()) operation 35 on unconnected MDS from 12345-10.13.18.156 at tcp Feb 14 22:50:43 hpcmds kernel: LustreError: 26165:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-107) req at ffff81011fa81400 x7753405/t0 o35-><?>@<?>:-1 lens 296/0 ref 0 fl Interpret:/0/0 rc -107/0 Feb 14 22:50:43 hpcmds kernel: LustreError: 26165:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 9 previous similar messages Feb 14 23:01:54 hpcmds kernel: LustreError: 31054:0:(handler.c:1498:mds_handle()) operation 35 on unconnected MDS from 12345-10.13.26.125 at o2ib Feb 14 23:01:54 hpcmds kernel: LustreError: 31054:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-107) req at ffff810123416c00 x73394313/t0 o35-><?>@<?>:-1 lens 296/0 ref 0 fl Interpret:/0/0 rc -107/0 The corresponding syslog from one of the clients is appended below (they were very similar). Does anyone recognize this? It says ENOTCONN, but there is no evidence of anything being wrong with our ethernet or IB networks. There aren''t any locking complaints. We are running Lustre 1.6.3 (plus a couple of patches) with a combo MGS/MDS. Kernel is a fully patched 2.6.18-8.1.14el5 everywhere. Thanks, Craig Feb 14 23:01:54 r5b-s14 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40 at o2ib. The mds_close operation failed with -107 Feb 14 23:01:54 r5b-s14 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81022dfac400: Connection to service ufhpc-MDT0000 via nid 10.13.24.40 at o2ib was lost; in progress operations using this service will wait for recovery to complete. Feb 14 23:01:54 r5b-s14 kernel: Lustre: Skipped 5 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17647:0:(file.c:97:ll_close_inode_openhandle()) inode 41118870 mdc close failed: rc = -4 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT0000; in progress operations using this service will fail. Feb 14 23:01:54 r5b-s14 kernel: LustreError: 15785:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81020f386400 x73394323/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 15785:0:(file.c:97:ll_close_inode_openhandle()) inode 24445371 mdc close failed: rc = -108 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17719:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8102166c0000 x73394324/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 15785:0:(file.c:97:ll_close_inode_openhandle()) Skipped 12 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17640:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81020f386400 x73394331/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17640:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 6 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17640:0:(file.c:97:ll_close_inode_openhandle()) inode 41118867 mdc close failed: rc = -108 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17640:0:(file.c:97:ll_close_inode_openhandle()) Skipped 3 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17254:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81015e0e0000 x73394332/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17254:0:(file.c:97:ll_close_inode_openhandle()) inode 42611428 mdc close failed: rc = -108 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17254:0:(file.c:97:ll_close_inode_openhandle()) Skipped 2 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17323:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101c63af800 x73394341/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17323:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 8 previous similar messages Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17323:0:(file.c:97:ll_close_inode_openhandle()) inode 24249026 mdc close failed: rc = -108 Feb 14 23:01:54 r5b-s14 kernel: LustreError: 17323:0:(file.c:97:ll_close_inode_openhandle()) Skipped 11 previous similar messages Feb 14 23:02:35 r5b-s14 kernel: LustreError: 5746:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81014dc04200 x73394350/t0 o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl Rpc:/0/0 rc 0/0 Feb 14 23:02:35 r5b-s14 kernel: LustreError: 5746:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 8 previous similar messages Feb 14 23:02:35 r5b-s14 kernel: LustreError: 5746:0:(file.c:97:ll_close_inode_openhandle()) inode 42195621 mdc close failed: rc = -108 Feb 14 23:02:35 r5b-s14 kernel: LustreError: 5746:0:(file.c:97:ll_close_inode_openhandle()) Skipped 2 previous similar messages
What does an ifconfig of the mds and clients look like? Can you run "ibcheckerrors -v" on an infiniband node and send the output? Also send the output of "ibqueryerrors.pl". Thanks! On Feb 15, 2008, at 1:15 PM, Craig Prescott wrote:> > Hi; > > We had a couple of our Lustre clients get horked last night (one > o2ib, and one tcp), df on the clients reported ''Input/output error'' > for the Lustre fs. > > Here''s what was in the syslog of our MDS/MGS regarding the two > nodes: > > Feb 14 22:50:43 hpcmds kernel: LustreError: > 26165:0:(handler.c:1498:mds_handle()) operation 35 on unconnected MDS > from 12345-10.13.18.156 at tcp > Feb 14 22:50:43 hpcmds kernel: LustreError: > 26165:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error > (-107) req at ffff81011fa81400 x7753405/t0 o35-><?>@<?>:-1 lens 296/0 > ref > 0 fl Interpret:/0/0 rc -107/0 > Feb 14 22:50:43 hpcmds kernel: LustreError: > 26165:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 9 previous > similar messages > Feb 14 23:01:54 hpcmds kernel: LustreError: > 31054:0:(handler.c:1498:mds_handle()) operation 35 on unconnected MDS > from 12345-10.13.26.125 at o2ib > Feb 14 23:01:54 hpcmds kernel: LustreError: > 31054:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error > (-107) req at ffff810123416c00 x73394313/t0 o35-><?>@<?>:-1 lens 296/0 > ref > 0 fl Interpret:/0/0 rc -107/0 > > The corresponding syslog from one of the clients is appended below > (they were very similar). > > Does anyone recognize this? It says ENOTCONN, but there is no > evidence > of anything being wrong with our ethernet or IB networks. There > aren''t > any locking complaints. > > We are running Lustre 1.6.3 (plus a couple of patches) with a combo > MGS/MDS. Kernel is a fully patched 2.6.18-8.1.14el5 everywhere. > > Thanks, > Craig > > > Feb 14 23:01:54 r5b-s14 kernel: LustreError: 11-0: an error occurred > while communicating with 10.13.24.40 at o2ib. The mds_close operation > failed with -107 > Feb 14 23:01:54 r5b-s14 kernel: Lustre: > ufhpc-MDT0000-mdc-ffff81022dfac400: Connection to service ufhpc- > MDT0000 > via nid 10.13.24.40 at o2ib was lost; in progress operations using this > service will wait for recovery to complete. > Feb 14 23:01:54 r5b-s14 kernel: Lustre: Skipped 5 previous similar > messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17647:0:(file.c:97:ll_close_inode_openhandle()) inode 41118870 mdc > close > failed: rc = -4 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: 167-0: This client was > evicted by ufhpc-MDT0000; in progress operations using this service > will > fail. > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 15785:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff81020f386400 x73394323/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 15785:0:(file.c:97:ll_close_inode_openhandle()) inode 24445371 mdc > close > failed: rc = -108 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17719:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff8102166c0000 x73394324/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 15785:0:(file.c:97:ll_close_inode_openhandle()) Skipped 12 previous > similar messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17640:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff81020f386400 x73394331/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17640:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 6 previous > similar messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17640:0:(file.c:97:ll_close_inode_openhandle()) inode 41118867 mdc > close > failed: rc = -108 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17640:0:(file.c:97:ll_close_inode_openhandle()) Skipped 3 previous > similar messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17254:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff81015e0e0000 x73394332/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17254:0:(file.c:97:ll_close_inode_openhandle()) inode 42611428 mdc > close > failed: rc = -108 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17254:0:(file.c:97:ll_close_inode_openhandle()) Skipped 2 previous > similar messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17323:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff8101c63af800 x73394341/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17323:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 8 previous > similar messages > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17323:0:(file.c:97:ll_close_inode_openhandle()) inode 24249026 mdc > close > failed: rc = -108 > Feb 14 23:01:54 r5b-s14 kernel: LustreError: > 17323:0:(file.c:97:ll_close_inode_openhandle()) Skipped 11 previous > similar messages > Feb 14 23:02:35 r5b-s14 kernel: LustreError: > 5746:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff81014dc04200 x73394350/t0 > o35->ufhpc-MDT0000_UUID at 10.13.24.40@o2ib:12 lens 296/1736 ref 1 fl > Rpc:/0/0 rc 0/0 > Feb 14 23:02:35 r5b-s14 kernel: LustreError: > 5746:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 8 previous > similar messages > Feb 14 23:02:35 r5b-s14 kernel: LustreError: > 5746:0:(file.c:97:ll_close_inode_openhandle()) inode 42195621 mdc > close > failed: rc = -108 > Feb 14 23:02:35 r5b-s14 kernel: LustreError: > 5746:0:(file.c:97:ll_close_inode_openhandle()) Skipped 2 previous > similar messages > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discussAaron Knister Associate Systems Analyst Center for Ocean-Land-Atmosphere Studies (301) 595-7000 aaron at iges.org