I have been trying to find out whether anyone else has reported or seen something similar. I would be happy to file a bug report, but I searched Bugzilla for a while and haven't found much.

The weirdest thing is that the MDS/OSS servers are fine, but the client's whole network stack gets screwed up. It stops pinging altogether, and it is very odd that Lustre would cause problems to this extent. Has anyone heard of or seen anything like this? Included below are the syslogs from the client, covering the time when its network stack hung, and from the MDS/MGS.

Note: the client cfd-mds-01 is not running any MDS/MGT services; it is just a patchless client for now.

Client (lustre-client-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)

Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(socklnd.c: 1661:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.14.23 at tcp, ip 192.168.14.23:1022, with error
Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(events.c: 189:client_bulk_callback()) event type 1, status -5, desc ffff8100c7672000
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1020 -> 192.168.14.20/988
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:18:43 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:22:52 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) @@@ Request x1317187295870120 sent from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 2997s ago has timed out (limit 1344s).
Oct 22 13:24:13 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- ffff8103216b8800: Connection to service cfd-OST0003 via nid 192.168.14.23 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Oct 22 13:24:13 cfd-mds-01 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.14.23 at tcp. The ost_connect operation failed with -16
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: 4686:0:(import.c: 508:import_select_connection()) cfd-OST0003-osc-ffff8103216b8800: tried all connections, increasing latency to 6s
Oct 22 13:24:38 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- ffff8103216b8800: Connection restored to service cfd-OST0003 using nid 192.168.14.23 at tcp.
Oct 22 13:26:50 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186310 sent from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 7s ago has timed out (limit 7s).
Oct 22 13:26:50 cfd-mds-01 kernel: LustreError: 166-1: MGC192.168.14.20 at tcp: Connection to service MGS via nid 192.168.14.20 at tcp was lost; in progress operations using this service will fail.
Oct 22 13:26:56 cfd-mds-01 kernel: Lustre: 4685:0:(client.c: 1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186311 sent from MGC192.168.14.20 at tcp to NID 192.168.14.20 at tcp 6s ago has timed out (limit 6s).
Oct 22 13:26:57 cfd-mds-01 kernel: Lustre: cfd-MDT0000-mdc- ffff8103216b8800: Connection to service cfd-MDT0000 via nid 192.168.14.20 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 102:lnet_connect_console_error()) Connection to 192.168.14.20 at tcp at host 192.168.14.20 on port 988 took too long: that node may be hung or experiencing high load.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186316 sent from cfd-OST0003-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 12s ago has timed out (limit 12s).
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: cfd-OST0003-osc- ffff8103216b8800: Connection to service cfd-OST0003 via nid 192.168.14.23 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Oct 22 13:27:02 cfd-mds-01 kernel: Lustre: Skipped 3 previous similar messages
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) @@@ Request x1317187296186309 sent from cfd-OST0001-osc-ffff8103216b8800 to NID 192.168.14.23 at tcp 44s ago has timed out (limit 44s).
Oct 22 13:27:08 cfd-mds-01 kernel: Lustre: 4684:0:(client.c: 1383:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 22 13:27:12 cfd-mds-01 kernel: Lustre: 4682:0:(socklnd_cb.c: 2173:ksocknal_find_timed_out_conn()) A connection with 12345-192.168.14.22 at tcp (192.168.14.22:988) timed out; the network or node may be down.
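
A note for anyone reading these logs: the numeric codes in the Lustre messages are standard Linux errno values, which the kernel usually reports negated. As a rough reading aid, here is a minimal sketch that maps the codes seen in these logs to their errno names. It is not part of Lustre; the `decode` helper, the `LINUX_ERRNO` table, and the regular expression are illustrative assumptions and cover only the message shapes that appear in this post.

```python
import re

# Linux errno numbers (asm-generic values); only the codes seen in these logs.
LINUX_ERRNO = {
    5: "EIO",             # Input/output error
    16: "EBUSY",          # Device or resource busy
    110: "ETIMEDOUT",     # Connection timed out
    113: "EHOSTUNREACH",  # No route to host
}

# Hypothetical pattern: matches "Error -110", "status -5",
# "failed with -16" and "network error 113" as they appear in this post.
CODE_RE = re.compile(r"(?:Error|status|failed with|network error)\s+-?(\d+)")


def decode(line):
    """Return the errno name for the code in a log line, or None if absent/unknown."""
    match = CODE_RE.search(line)
    if match is None:
        return None
    return LINUX_ERRNO.get(int(match.group(1)))


if __name__ == "__main__":
    sample = "Error -110 connecting 192.168.14.21/1021 -> 192.168.14.20/988"
    print(decode(sample))
```

Read this way, the repeated -110 on the client is a plain TCP connect timeout to the MGS/MDS node on port 988, and the 113 on the MDS side means the server had no route to the client.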

MDS (lustre-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)

Oct 22 12:34:50 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 2173:ksocknal_find_timed_out_conn()) A connection with 12345-192.168.14.21 at tcp (192.168.14.21:1021) timed out; the network or node may be down.
Oct 22 13:27:14 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 915:ksocknal_launch_packet()) No usable routes to 12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 915:ksocknal_launch_packet()) No usable routes to 12345-192.168.14.21 at tcp
Oct 22 13:27:26 cfd-mds-00 kernel: Lustre: 24837:0:(socklnd_cb.c: 2181:ksocknal_find_timed_out_conn()) An unexpected network error 113 occurred with 12345-192.168.14.21 at tcp (192.168.14.21:1022
Oct 22 13:30:11 cfd-mds-00 kernel: Lustre: cfd-MDT0000: haven't heard from client 2d7ea85b-2184-0e60-e96f-fe2cd01a4b3e (at 192.168.14.21 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Oct 22 13:30:30 cfd-mds-00 kernel: Lustre: MGS: haven't heard from client 58fe30f2-259f-a304-9f56-696ec03db7c0 (at 192.168.14.21 at tcp) in 227 seconds. I think it's dead, and I am evicting it.

Derek Yarnell
UNIX Systems Administrator
University of Maryland Institute for Advanced Computer Studies