Dear list,

We have found a bug on Lustre 1.8.1.1: sometimes the death of a single client causes a soft lockup on an OSS. The affected OSS reaches a very high system CPU usage and then becomes unreachable through "lctl ping" from time to time. The system log shows the following (the two messages below repeat many times, partly interleaved on the console):

Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
. . .
Mar 23 01:05:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:05:45 boss34 kernel: CPU 11:
Mar 23 01:05:45 boss34 kernel: Modules linked in: autofs4(U) hidp(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) sg(U) shpchp(U) ixgbe(U) pcspkr(U) serio_raw(U) hpilo(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) libata(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 23 01:05:45 boss34 kernel: Pid: 5781, comm: ll_ost_405 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Mar 23 01:05:45 boss34 kernel: RIP: 0010:[<ffffffff8006dc7e>] [<ffffffff8006dc7e>] do_gettimeofday+0x2c/0x8f
Mar 23 01:05:45 boss34 kernel: RSP: 0018:ffff8103065f5200 EFLAGS: 00000246
Mar 23 01:05:45 boss34 kernel: RAX: 0000000000000001 RBX: ffff8103065f5230 RCX: 000000003021bf6a
Mar 23 01:05:45 boss34 kernel: RDX: 0008b6f19d2513c6 RSI: 000000000008b6f1 RDI: ffff8103065f5230
Mar 23 01:05:45 boss34 kernel: RBP: ffff8105f74f1600 R08: 0000000000000000 R09: 0000000000000567
Mar 23 01:05:45 boss34 kernel: R10: ffffffff88708c7b R11: ffffffff88708c4e R12: 000000003021b1eb
Mar 23 01:05:45 boss34 kernel: R13: 0000000000000018 R14: ffff8103065f5170 R15: 0000000000000206
Mar 23 01:05:45 boss34 kernel: FS: 00002ac00ee25220(0000) GS:ffff81032a9d7540(0000) knlGS:0000000000000000
Mar 23 01:05:45 boss34 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 01:05:45 boss34 kernel: CR2: 00000000f7f7b000 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 23 01:05:45 boss34 kernel:
Mar 23 01:05:45 boss34 kernel: Call Trace:
Mar 23 01:05:45 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 last message repeated 37 times
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 last message repeated 36 times
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp

The ll_ost_405 thread also gets rescheduled onto other CPUs and keeps triggering the lockup there; the messages below repeated every few minutes, on CPUs 11, 3, 15 and 7, from 01:05 until about 09:32:

[root at boss34 ~]# grep stuck /var/log/messages
Mar 23 01:05:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:14:10 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:18:19 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:22:28 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:36 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:46 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:56 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:34:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
. . .
Mar 23 01:47:12 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
. . .
Mar 23 03:01:56 boss34 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_405:5781]
. . .
Mar 23 03:14:21 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
. . .
Mar 23 08:00:43 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 08:01:13 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 08:09:01 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
. . .
Mar 23 09:27:51 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 09:32:10 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]

Attached is the Ganglia monitoring graph of the OSS. When the system CPU usage reached its peak, our monitoring client could no longer "lctl ping" the OSS. The problem only stopped after we rebooted the failed Lustre client node. Is there a way to avoid this problem?

Best Regards
Lu Wang

--------------------------------------------------------------
Computing Center, IHEP          Office: Computing Center, 123
19B Yuquan Road                 Tel: (+86) 10 88236012-607
P.O. Box 918-7                  Fax: (+86) 10 8823 6839
Beijing 100049, China           Email: Lu.Wang at ihep.ac.cn
--------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: boss34.bmp
Type: application/octet-stream
Size: 281130 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100326/c8bc0534/attachment-0001.obj
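A minimal shell sketch of the kind of check described in the report above. The log strings, the log path and the "lctl ping" test are taken from the message; the script itself, the OSS NID and the 60-second interval are only assumptions, not an existing tool:

#!/bin/sh
# Hypothetical watchdog sketch, not part of Lustre: warn when the OSS syslog
# starts filling with the messages shown above and the OSS stops answering
# LNET pings.  OSS_NID and the interval are placeholder values.
OSS_NID="192.168.57.34@tcp"   # placeholder: the NID of the OSS to watch
LOG=/var/log/messages

while true; do
    # Count the tell-tale messages seen so far in the OSS syslog.
    lockups=$(grep -c "BUG: soft lockup" "$LOG")
    noroute=$(grep -c "No usable routes to 12345-" "$LOG")

    # Verify that the OSS still answers LNET pings.
    if ! lctl ping "$OSS_NID" > /dev/null 2>&1; then
        echo "$(date): $OSS_NID not answering lctl ping (soft lockups: $lockups, no-route msgs: $noroute)"
    fi
    sleep 60
done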
Michael Sternberg
2010-Mar-26 19:27 UTC
[Lustre-discuss] A Failed client soft lockup one OSS
+1 on this one, in my case using lustre-1.8.2 on RHEL-5.4 over o2ib, with patchless clients. My OSS complains about hung service threads:

  Service thread pid 16590 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
  . . .
  Service thread pid 16590 completed after 2403.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

In this case, n337 (172.16.1.87) went dead, and another client (login2) suffered. The solution was to reboot n337, the initial hung client.

NB: I take the overload warning at its word. The OST is one giant RAID5, which I have scheduled to split into several RAID 1+0 sets next week.

regards,
Michael


On Mar 26, 2010, at 2:29 , Lu Wang wrote:

> We find bug on Lustre 1.8.1.1. Sometimes one client's dead may cause soft lockup on OSS. The certain OSS may reach a high CPU System% usage, and then became unreachable through "lctl ping" from now and then.

Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 59 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 57 previous similar messages
Mar 13 18:45:13 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 454325 previous similar messages
Mar 13 18:45:22 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527529 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:45:29 login2 -- MARK --
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 903192 previous similar messages
Mar 13 18:45:41 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527548 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 1804452 previous similar messages
Mar 13 18:46:18 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527585 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 3611781 previous similar messages
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527595 ref 2 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 3611782 previous similar messages
Mar 13 18:47:27 oss01 kernel: LustreError: 138-a: carbonfs-OST0000: A client on nid 172.16.1.87 at tcp was evicted due to a lock blocking callback to 172.16.1.87 at tcp timed out: rc -107
Mar 13 18:47:33 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:47:33 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3623858 previous similar messages
Mar 13 18:47:33 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527660 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:48:33 oss01 kernel:
Mar 13 18:48:33 oss01 kernel: 0002000000000000 ffffc20010c6d870 ffffffff8884d460 ffffffff00000000
Mar 13 18:48:33 oss01 kernel: 0004b95a9d0332d0 ffffffff889c41e5 ffffffff889c3beb ffff8101172516e0
Mar 13 18:48:33 oss01 kernel: Call Trace:
Mar 13 18:48:33 oss01 kernel: [<ffffffff80047152>] try_to_wake_up+0x472/0x484
Mar 13 18:48:33 oss01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 13 18:48:33 oss01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 13 18:48:33 oss01 kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
Mar 13 18:48:33 oss01 kernel: [<ffffffff8008ac95>] __wake_up_common+0x3e/0x68
Mar 13 18:48:33 oss01 kernel: [<ffffffff8008c270>] __activate_task+0x56/0x6d
Mar 13 18:48:33 oss01 kernel: [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Mar 13 18:48:33 oss01 kernel: [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
Mar 13 18:48:33 oss01 kernel: [<ffffffff801522a0>] list_del+0xb/0x71
Mar 13 18:48:33 oss01 kernel: [<ffffffff8887314b>] :lnet:LNetMDAttach+0x37b/0x4c0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88879b90>] :lnet:LNetPut+0x700/0x800
Mar 13 18:48:33 oss01 kernel: [<ffffffff88879c06>] :lnet:LNetPut+0x776/0x800
Mar 13 18:48:33 oss01 kernel: [<ffffffff88944380>] :ptlrpc:ldlm_lock_create+0x540/0x9f0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88947eb6>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb20
Mar 13 18:48:33 oss01 kernel: [<ffffffff8895b5f0>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88966a06>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Mar 13 18:48:33 oss01 kernel: [<ffffffff889693f9>] :ptlrpc:ldlm_handle_enqueue+0xbf9/0x11f0
Mar 13 18:48:33 oss01 kernel: [<ffffffff889736f3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Mar 13 18:48:33 oss01 kernel: [<ffffffff889763eb>] :ptlrpc:ptlrpc_expire_one_request+0x12b/0x630
Mar 13 18:48:33 oss01 kernel: [<ffffffff88976975>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88977199>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88977d26>] :ptlrpc:ptlrpc_check_reply+0x1c6/0x610
Mar 13 18:48:33 oss01 kernel: [<ffffffff8897efc8>] :ptlrpc:ptlrpc_queue_wait+0x988/0x16f0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88982613>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88985265>] :ptlrpc:ptl_send_rpc+0xb45/0xde0
Mar 13 18:48:33 oss01 kernel: [<ffffffff889872e8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 18:48:33 oss01 kernel: [<ffffffff8898d565>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Mar 13 18:48:33 oss01 kernel: [<ffffffff8898f101>] :ptlrpc:request_out_callback+0xe1/0x1e0
Mar 13 18:48:33 oss01 kernel: [<ffffffff8899185d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 13 18:48:33 oss01 kernel: [<ffffffff889944fe>] :ptlrpc:ptlrpc_server_handle_request+0xa8e/0x1130
Mar 13 18:48:33 oss01 kernel: [<ffffffff88996d00>] :ptlrpc:ptlrpc_main+0x0/0x1420
Mar 13 18:48:33 oss01 kernel: [<ffffffff88997f58>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Mar 13 18:48:33 oss01 kernel: ffffffff889bbca0 ffff810117356708 ffff810178f94940 ffffffff888711a4
Mar 13 18:48:33 oss01 kernel: [<ffffffff88c3e5f0>] :ost:ost_blocking_ast+0x0/0xaa0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88c462f7>] :ost:ost_handle+0x4e17/0x53e0
Mar 13 18:48:33 oss01 kernel: [<ffffffff88c90f4a>] :obdfilter:filter_intent_policy+0x65a/0x760
Mar 13 18:48:33 oss01 kernel: ll_ost_100 R running task 0 16590 1 16591 16589 (L-TLB)
Mar 13 18:48:33 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 16590
Mar 13 18:48:33 oss01 kernel: Lustre: Service thread pid 16590 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 7264682 previous similar messages
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527745 ref 2 fl Rpc:/2/0 rc 0/0
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 7264681 previous similar messages
Mar 13 18:50:03 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
. . .
Mar 13 19:17:39 oss01 kernel: LustreError: 16549:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 4985591 previous similar messages
Mar 13 19:22:40 login2 kernel: LustreError: 4413:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Mar 13 19:22:40 login2 kernel: LustreError: 4413:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Mar 13 19:25:08 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 7s ago has timed out (7s prior to deadline).
Mar 13 19:25:08 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 20801286 previous similar messages
Mar 13 19:25:08 oss01 kernel: req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268529908 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 19:25:15 login2 kernel: Lustre: carbonfs-OST0000-osc-ffff81042c2ffc00: Connection restored to service carbonfs-OST0000 using nid 172.17.130.1 at o2ib.
Mar 13 19:25:15 login2 kernel: Lustre: Skipped 1 previous similar message
Mar 13 19:25:15 oss01 kernel: Lustre: 16590:0:(service.c:1380:ptlrpc_server_handle_request()) @@@ Request x1329711551107702 took longer than estimated (812+1590s); client may timeout. req at ffff81012fc25800 x1329711551107702/t0 o101->cc99a57b-05eb-215d-d552-1b3b78978588 at NET_0x50000ac116e02_UUID:0/0 lens 296/352 e 5 to 0 dl 1268528325 ref 1 fl Complete:/0/0 rc 301/301
Mar 13 19:25:15 oss01 kernel: LustreError: 137-5: UUID 'sandbox-OST0000_UUID' is not available for connect (no target)
Mar 13 19:25:15 oss01 kernel: Lustre: Service thread pid 16590 completed after 2403.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Mar 13 19:50:51 oss01 -- MARK --

[root at n337 ~]# who -r
         run-level 3  Mar 13 19:22                   last=S
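The NID of the dead client appears directly in the messages above (here 172.16.1.87). A rough sketch of pulling it out of the OSS syslog so the offending node can be found and rebooted; this is not an official Lustre tool, and the log path and regular expression are assumptions (in a live syslog the NID is written with a literal "@", e.g. 172.16.1.87@tcp):

# List the client NIDs named in the network-error and eviction messages,
# most frequent first, to identify the dead client.
grep -hE "has failed due to network error|was evicted due to a lock blocking callback" /var/log/messages \
  | grep -oE "[0-9]+(\.[0-9]+){3}@(tcp|o2ib)[0-9]*" \
  | sort | uniq -c | sort -rn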
Michael Sternberg
2010-Mar-26 19:31 UTC
[Lustre-discuss] A Failed client soft lockup one OSS
PS: The syslog snippet I posted is slightly out of order; I merged the logs from login2 and oss01 and did a simple sort(1).

Michael
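One way such a merge could be done (a sketch only; the input file names are placeholders and GNU sort's month-name key type is assumed):

# Merge the syslog copies from the two hosts, ordering entries by
# month, day and time-of-day rather than by source file.
sort -s -k1,1M -k2,2n -k3,3 messages.oss01 messages.login2 > merged.log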
Hello!

On Mar 26, 2010, at 3:27 PM, Michael Sternberg wrote:

> +1 on this one, in my case using lustre-1.8.2 on RHEL-5.4 over o2ib, with patchless clients.

In your case you have an instance of bug 21937. There is a workaround patch in that bug.

> Mar 13 18:48:33 oss01 kernel: [<ffffffff8008c86b>] default_wake_function+0x0/0xe
> Mar 13 18:48:33 oss01 kernel: [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
> Mar 13 18:48:33 oss01 kernel: [<ffffffff801522a0>] list_del+0xb/0x71
> Mar 13 18:48:33 oss01 kernel: [<ffffffff8887314b>] :lnet:LNetMDAttach+0x37b/0x4c0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88879b90>] :lnet:LNetPut+0x700/0x800
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88879c06>] :lnet:LNetPut+0x776/0x800
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88944380>] :ptlrpc:ldlm_lock_create+0x540/0x9f0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88947eb6>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb20
> Mar 13 18:48:33 oss01 kernel: [<ffffffff8895b5f0>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88966a06>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff889693f9>] :ptlrpc:ldlm_handle_enqueue+0xbf9/0x11f0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff889736f3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
> Mar 13 18:48:33 oss01 kernel: [<ffffffff889763eb>] :ptlrpc:ptlrpc_expire_one_request+0x12b/0x630
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88976975>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88977199>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88977d26>] :ptlrpc:ptlrpc_check_reply+0x1c6/0x610
> Mar 13 18:48:33 oss01 kernel: [<ffffffff8897efc8>] :ptlrpc:ptlrpc_queue_wait+0x988/0x16f0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88982613>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff88985265>] :ptlrpc:ptl_send_rpc+0xb45/0xde0
> Mar 13 18:48:33 oss01 kernel: [<ffffffff889872e8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
> Mar 13 18:48:33 oss01 kernel: [<ffffffff8898d565>] :ptlrpc:lustre_msg_set_opc+0x45/0x120

Bye,
    Oleg