Chad Cantwell
2009-Oct-24 00:46 UTC
[Lustre-discuss] dmesg unable to connect flood from OST server, and soft lockup in syslog
Hi, I have a lustre 1.8.1.1 system with 1 MDS and 3 OST servers. 2 of the OST servers currently have 4 ~6.5TB OSTs, and the 3rd has 8 ~6.5TB OSTs. This setup was originally Lustre 1.6.x, but the current OSTs were formatted originally as 1.8.1.1 (we''ve done some disk upgrades and copying data around to phase out old OSTs, with a writecont to clear them out of the ost list). This may be irrelevant, just some background. Anyway, at times now, a client system will crash, and this seems to sometimes cause an OST server to lose control. Here''s an example from the dmesg on the server: Lustre: 3680:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection with 12345-10.1.255.235 at tcp (10.1.255.235:1023) timed out; the network or node may be down. Lustre: 3676:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 10.1.255.235/988 Lustre: 3676:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 10.1.255.235 at tcp at host 10.1.255.235 was unreachable: the network or that node may be down, or Lustre may be misconfigured. Lustre: 3676:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 10.1.1.54 at tcp->10.1.255.235 at tcp Lustre: 3812:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1317161589925083 sent from lustre-OST0011 to NID 10.1.255.235 at tcp 7s ago has timed out (limit 7s). req at ffff810074c5a000 x1317161589925083/t0 o106->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340405 ref 1 fl Rpc:/0/0 rc 0/0 Lustre: 3678:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 10.1.255.235/988 Lustre: 3678:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 10.1.255.235 at tcp at host 10.1.255.235 was unreachable: the network or that node may be down, or Lustre may be misconfigured. Lustre: 3678:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 10.1.1.54 at tcp->10.1.255.235 at tcp Lustre: 3812:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1317161589925083 sent from lustre-OST0011 to NID 10.1.255.235 at tcp 7s ago has timed out (limit 7s). req at ffff810074c5a000 x1317161589925083/t0 o106->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340412 ref 1 fl Rpc:/2/0 rc 0/0 Lustre: 3677:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 10.1.255.235/988 Lustre: 3677:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 10.1.255.235 at tcp at host 10.1.255.235 was unreachable: the network or that node may be down, or Lustre may be misconfigured. Lustre: 3677:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 10.1.1.54 at tcp->10.1.255.235 at tcp Lustre: 3812:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1317161589925083 sent from lustre-OST0011 to NID 10.1.255.235 at tcp 7s ago has timed out (limit 7s). req at ffff810074c5a000 x1317161589925083/t0 o106->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340419 ref 1 fl Rpc:/2/0 rc 0/0 Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o106->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340426 ref 2 fl Rpc:/2/0 rc 0/0 Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o106->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340426 ref 2 fl Rpc:/2/0 rc 0/0 LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 26 previous similar messages 20091023_162803 5-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3812:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp Lustre: 3811:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.1.255.235 at tcp it continues on like that for thousands of lines, flooding the dmesg buffer. At the same time on this OST server in the /var/log/messages, we get: Oct 23 16:26:59 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340426 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:26:59 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340426 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:26:59 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 26 previous similar messages Oct 23 16:27:06 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340433 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:07 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340434 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:07 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 36616 previous similar messages Oct 23 16:27:09 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340436 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:09 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 76698 previous similar messages Oct 23 16:27:17 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340444 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:17 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 38612 previous similar messages Oct 23 16:27:25 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340452 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:25 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 310433 previous similar messages Oct 23 16:27:26 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:27:41 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340468 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:27:41 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 380440 previous similar messages Oct 23 16:27:45 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:27:51 nas-50-4 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_15:3811] Oct 23 16:27:55 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:28:13 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810074c5a000 x1317161589925083/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340500 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:28:13 nas-50-4 kernel: LustreError: 3812:0:(events.c:66:request_out_callback()) Skipped 1094662 previous similar messages Oct 23 16:28:19 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:28:29 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:28:39 nas-50-4 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_23:3819] Oct 23 16:28:39 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:28:40 nas-50-4 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_15:3811] Oct 23 16:28:49 nas-50-4 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_23:3819] Oct 23 16:28:49 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:28:50 nas-50-4 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_15:3811] Oct 23 16:28:59 nas-50-4 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_23:3819] Oct 23 16:28:59 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:29:00 nas-50-4 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_15:3811] Oct 23 16:29:17 nas-50-4 kernel: LustreError: 3811:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff810099281400 x1317161589925107/t0 o10 6->@NET_0x200000a01ffeb_UUID:15/16 lens 296/424 e 0 to 1 dl 1256340564 ref 2 fl Rpc:/2/0 rc 0/0 Oct 23 16:29:17 nas-50-4 kernel: LustreError: 3811:0:(events.c:66:request_out_callback()) Skipped 3641941 previous similar messages Oct 23 16:29:22 nas-50-4 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_23:3819] Oct 23 16:29:22 nas-50-4 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_16:3812] Oct 23 16:29:30 nas-50-4 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_15:3811] We''re not really sure if this is a problem with Lustre 1.8.1.1, or a problem with load, or some other issue. At the same time we upgraded Lustre versions, we had to temporarily change the transport to tcp from o2ib which it used to be under 1.6, because not all of our client nodes have infiniband cards right now. Also, at the same time, load may have increased, so I''m not 100% sure we wouldn''t have had this problem previously with similiar load characteristics as now. When the above happens, an ''lfs check servers'' on a client showed one OST with ''Resource temporarily unavailable'' which caused access to hang for many applications. The only fix for this is to reboot the OST server. If the problem isn''t caught and intervened upon within time for the lustre clients to recover, this can be rather disruptive to the users. Does anyone have any ideas on what the problem is? It looks to me like the lustre server is doing something wrong, if it writes to the dmesg the same error so many times. Shouldn''t it just stop trying to contact 10.1.255.235 after a few tries and give up without OST threads hanging? If we can''t find a way to make this work reliably in Lustre 1.8.1.1, is it safe to try to go back to Lustre 1.6 and rerun the tunefs.lustre commands with --writeconf to start with clean logs, or might the ldiskfs backend be incompatible with the 1.6 series e2fsprogs? Thanks, Chad