Recently I found that on an OSS an ll_ost thread gets a soft lockup, which started after a RAID rebuild. After rebooting the OSS this OST belongs to, we ran e2fsck with "-fn" and "-fp", but the symptom still occurs. In some cases we see ll_ost_creat hang instead of ll_ost after the reboot. While the soft lockup is occurring, we usually cannot delete or create files on this OST (paths found with "lfs find -O <uuid>"). And when I tried to run ll_recover_lost_found_objs after mounting the OST as ldiskfs, the recovery thread also got a soft lockup. (A sketch of these commands follows this message.)

Kernel and Lustre version: kernel 2.6.18-194.17.1.el5_lustre.1.8.5smp
HW: J4400 and Sun x4270 for the OSSes
No other hardware problem was found.

######### disk error occurred
Mar 13 03:11:26 oss20 kernel: mptbase: ioc2: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Mar 13 03:11:46 oss20 kernel: sd 2:0:27:0: SCSI error: return code = 0x08000002
Mar 13 03:11:46 oss20 kernel: sdbx: Current: sense key: Medium Error
Mar 13 03:11:46 oss20 kernel: Add. Sense: Unrecovered read error
Mar 13 03:11:46 oss20 kernel:
Mar 13 03:11:46 oss20 kernel: Info fld=0x5656e6b4
Mar 13 03:11:46 oss20 kernel: end_request: I/O error, dev sdbx, sector 1448535732
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at 1448535728 on sdbx)
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at 1448535736 on sdbx)
...

########### call trace occurred
Mar 13 09:12:52 oss20 kernel: Lustre: scratch-OST0034: haven't heard from client scratch-mdtlov_UUID (at 10.3.52.161@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Mar 13 09:12:52 oss20 kernel: Lustre: Skipped 2 previous similar messages
Mar 13 09:13:47 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810450875000 x1390684636813427/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:47 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8101e17a5400 x1390684636813440/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:57 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff81036449d000 x1390684636813453/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:34 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8104dec0f000 x1390684636813541/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 288/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:38 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff81010fd4b800 x1390684636813561/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:18:55 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810367104c00 x1390684636814090/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 576/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:02 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810453592000 x1390684636814172/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 320/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:23 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 147s: evicting client at 10.3.27.16@o2ib ns: filter-scratch-OST0034_UUID lock: ffff810239b84e00/0x8134b24415b8c002 lrc: 3/0,0 mode: PW/PW res: 8770349/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0x2eb65052eeaac8dc expref: 17 pid: 17958 timeout 9632848008
Mar 13 09:20:24 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810341db8c00 x1390684636814189/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:25 oss20 kernel: LustreError: 17878:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff81022a56ac00 x1390697545930782/t0 o101-><?>@<?>:0/0 lens 296/0 e 0 to 0 dl 1331598031 ref 1 fl Interpret:/0/0 rc -107/0
Mar 13 09:22:07 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.3.30.72@o2ib ns: filter-scratch-OST0034_UUID lock: ffff81014a9c8000/0x8134b244156839db lrc: 3/0,0 mode: PW/PW res: 8770383/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020 remote: 0xb5fbb5ddee10ed12 expref: 24 pid: 17841 timeout 9632952155
Mar 13 09:22:08 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff81016e6d1000 x1390684636814642/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:22:08 oss20 kernel: LustreError: 17966:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff810217882c00 x1390697597420037/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1331598144 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:22:11 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.3.15.9@o2ib ns: filter-scratch-OST0034_UUID lock: ffff8101b899ce00/0x8134b2441567880a lrc: 3/0,0 mode: PW/PW res: 8770073/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0xffb69ae75cb1e32e expref: 9 pid: 17861 timeout 9632956878
Mar 13 09:22:12 oss20 kernel: LustreError: 17951:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff81037d4f7000 x1390697363779506/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1331598148 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:25:09 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8104b2367000 x1390684636814999/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:27:44 oss20 kernel: LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.3.11.76@o2ib ns: filter-scratch-OST0034_UUID lock: ffff810460625c00/0x8134b24415b86381 lrc: 3/0,0 mode: PW/PW res: 8798589/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x237e7fdbefcbfed expref: 8 pid: 17872 timeout 9633289138
Mar 13 09:27:49 oss20 kernel: Lustre: Service thread pid 18061 was inactive for 442.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 09:27:49 oss20 kernel: Pid: 18061, comm: ll_ost_io_92
Mar 13 09:27:49 oss20 kernel:
Mar 13 09:27:49 oss20 kernel: Call Trace:
Mar 13 09:27:49 oss20 kernel: [<ffffffff8006462b>] __down_write_nested+0x7a/0x92
Mar 13 09:27:49 oss20 kernel: [<ffffffff8b2b99c0>] filter_destroy+0x9b0/0x1fb0 [obdfilter]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afb4980>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afb8060>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afdda74>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afe077a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8af9befc>] __ldlm_handle2lock+0x8c/0x360 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afe0889>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8b26b050>] ost_destroy+0x660/0x790 [ost]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afdcbc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afdca15>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8b274a36>] ost_handle+0x1556/0x55b0 [ost]
Mar 13 09:27:49 oss20 kernel: [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:27:49 oss20 kernel: [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afec15a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afec8a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 09:27:49 oss20 kernel: [<ffffffff8afed817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11

############## call trace occurred for the 2nd time
Mar 13 09:29:18 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810502953400 x1390684636815395/t0 o401->@NET_0x500000a0334a1_UUID:17/18 lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:29:18 oss20 kernel: LustreError: 18126:0:(client.c:841:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Mar 13 09:29:34 oss20 kernel: Lustre: Service thread pid 17995 was inactive for 442.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 09:29:34 oss20 kernel: Pid: 17995, comm: ll_ost_io_26
Mar 13 09:29:34 oss20 kernel:
Mar 13 09:29:34 oss20 kernel: Call Trace:
Mar 13 09:29:34 oss20 kernel: [<ffffffff8006e20b>] do_gettimeofday+0x40/0x90
Mar 13 09:29:34 oss20 kernel: [<ffffffff80028ace>] sync_page+0x0/0x42
Mar 13 09:29:34 oss20 kernel: [<ffffffff80063802>] io_schedule+0x3f/0x67
Mar 13 09:29:34 oss20 kernel: [<ffffffff80028b0c>] sync_page+0x3e/0x42
Mar 13 09:29:34 oss20 kernel: [<ffffffff80063946>] __wait_on_bit_lock+0x36/0x66
Mar 13 09:29:34 oss20 kernel: [<ffffffff8003fc3c>] __lock_page+0x5e/0x64
Mar 13 09:29:34 oss20 kernel: [<ffffffff800a0ab3>] wake_bit_function+0x0/0x23
Mar 13 09:29:34 oss20 kernel: [<ffffffff80013b32>] find_lock_page+0x69/0xa2
Mar 13 09:29:34 oss20 kernel: [<ffffffff80025bba>] find_or_create_page+0x22/0x72
Mar 13 09:29:34 oss20 kernel: [<ffffffff8b2c5315>] filter_get_page+0x35/0x70 [obdfilter]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8b2c5951>] filter_preprw_read+0x601/0xd30 [obdfilter]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8aecd3a0>] LNetPut+0x710/0x820 [lnet]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afd7ed3>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afe2765>] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8b2c7dcc>] filter_preprw+0x1d4c/0x1db0 [obdfilter]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afd8658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afe077a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afd8658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8b26dab3>] ost_brw_read+0xb33/0x1a70 [ost]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afdca15>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8008cfd9>] default_wake_function+0x0/0xe
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afdcbc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8b276353>] ost_handle+0x2e73/0x55b0 [ost]
Mar 13 09:29:34 oss20 kernel: [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:29:34 oss20 kernel: [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afec15a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afec8a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:29:34 oss20 kernel: [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 09:29:34 oss20 kernel: [<ffffffff8afed817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 09:29:35 oss20 kernel: [<ffffffff8afec8e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel: [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 09:29:35 oss20 kernel:
Mar 13 09:30:05 oss20 kernel: Lustre: 18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel: req@ffff8104e15ef000 x1390697425321956/t0 o4->104e3eaf-521c-6fd2-0ded-c959aeb89264@NET_0x500000a03122f_UUID:0/0 lens 448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:05 oss20 kernel: Lustre: 18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel: req@ffff81041144d400 x1390697425321961/t0 o4->104e3eaf-521c-6fd2-0ded-c959aeb89264@NET_0x500000a03122f_UUID:0/0 lens 448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre: 17982:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-49), not sending early reply
Mar 13 09:30:07 oss20 kernel: req@ffff81025e7f6c50 x1387502968075317/t0 o4->e6cc42f9-1723-0037-1bef-8c11b5bb0cd5@:0/0 lens 448/416 e 2 to 0 dl 1331598612 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre: 17982:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
Mar 13 09:30:09 oss20 kernel: Lustre: 18027:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-50), not sending early reply
Mar 13 09:30:09 oss20 kernel: req@ffff8105c4a83800 x1390670171393616/t0 o4->0c31049a-ac01-884e-a74a-f549ca801551@:0/0 lens 448/416 e 2 to 0 dl 1331598614 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:09 oss20 kernel: Lustre: 18027:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 8 previous similar messages
Mar 13 09:30:26 oss20 kernel: Lustre: 18079:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-196), not sending early reply
Mar 13 09:30:26 oss20 kernel: req@ffff81047934d800 x1390668976327099/t0 o4->240e8b56-f2f0-beab-0460-c95e7af9ebf2@NET_0x500000a031347_UUID:0/0 lens 448/416 e 3 to 0 dl 1331598631 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:26 oss20 kernel: Lustre: 18079:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
Mar 13 09:30:56 oss20 kernel: Lustre: 18021:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-157), not sending early reply
Mar 13 09:30:56 oss20 kernel: req@ffff810301bdd850 x1390697295264668/t0 o4->819aa1e7-48e9-c8c1-c639-4d7767aac076@NET_0x500000a030b07_UUID:0/0 lens 448/416 e 3 to 0 dl 1331598661 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:56 oss20 kernel: Lustre: 18021:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Mar 13 09:31:32 oss20 kernel: Lustre: 18039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-108), not sending early reply
Mar 13 09:31:32 oss20 kernel: req@ffff810471a57000 x1390668308715230/t0 o4->80fad2a1-6a67-0aad-69a7-2ea3c9cf6909@NET_0x500000a030e16_UUID:0/0 lens 448/416 e 1 to 0 dl 1331598697 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:31:32 oss20 kernel: Lustre: 18039:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Mar 13 09:32:04 oss20 kernel: Lustre: 18046:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-49), not sending early reply
Mar 13 09:32:04 oss20 kernel: req@ffff81010eda4800 x1390697612903849/t0 o4->cdc6f8ec-952c-4d6a-7fd2-dca6fe51a617@NET_0x500000a032018_UUID:0/0 lens 448/416 e 2 to 0 dl 1331598729 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:32:04 oss20 kernel: Lustre: 18046:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 13 previous similar messages
Mar 13 09:32:05 oss20 kernel: Lustre: 17923:0:(ldlm_lib.c:574:target_handle_reconnect()) scratch-OST0034: 104e3eaf-521c-6fd2-0ded-c959aeb89264 reconnecting
Mar 13 09:32:05 oss20 kernel: Lustre: 17923:0:(ldlm_lib.c:874:target_handle_connect()) scratch-OST0034: refuse reconnection from 104e3eaf-521c-6fd2-0ded-c959aeb89264@10.3.18.47@o2ib to 0xffff810293ab2000; still busy with 9 active RPCs

################ disk disabled manually by admin using "mdadm /dev/md17 --fail /dev/sdbx"
Mar 13 10:03:22 oss20 kernel: raid5: Disk failure on sdbx, disabling device. Operation continuing on 9 devices
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel: --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel: disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel: disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel: disk 2, o:0, dev:sdbx
Mar 13 10:03:23 oss20 kernel: disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel: disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel: disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel: disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel: disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel: disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel: disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel: --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel: disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel: disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel: disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel: disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel: disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel: disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel: disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel: disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel: disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel: --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel: disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel: disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel: disk 2, o:1, dev:sdcp
Mar 13 10:03:23 oss20 kernel: disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel: disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel: disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel: disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel: disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel: disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel: disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: md: syncing RAID array md17

############## md rebuild done
Mar 13 14:07:22 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:09:24 oss20 kernel: md: unbind<sdbx>
Mar 13 14:09:24 oss20 kernel: md: export_rdev(sdbx)
Mar 13 14:17:24 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:25:48 oss20 kernel: md: md17: sync done.

################ ll_ost kernel thread soft lockup occurred
Mar 13 19:10:04 oss20 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_ost_29:17868]
Mar 13 19:10:04 oss20 kernel: CPU 2:
Mar 13 19:10:05 oss20 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U) raid1(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sg(U) igb(U) mlx4_core(U) i2c_i801(U) 8021q(U) i2c_core(U) dca(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) ahci(U) libata(U) mptfc(U) scsi_transport_fc(U) mptspi(U) scsi_transport_spi(U) shpchp(U
Mar 13 19:10:05 oss20 kernel: mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 13 19:10:05 oss20 kernel: Pid: 17868, comm: ll_ost_29 Tainted: G 2.6.18-194.17.1.el5_lustre.1.8.5smp #1
Mar 13 19:10:05 oss20 kernel: RIP: 0010:[<ffffffff8b218cf4>] [<ffffffff8b218cf4>] :ldiskfs:ldiskfs_find_entry+0x1d4/0x5b0
Mar 13 19:10:05 oss20 kernel: RSP: 0018:ffff8102c7f637e0 EFLAGS: 00000202
Mar 13 19:10:05 oss20 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000036f3f4cb
Mar 13 19:10:05 oss20 kernel: RDX: ffff8104fa4d1c00 RSI: ffff8102c7f637c0 RDI: ffff81010c43e7a8
Mar 13 19:10:05 oss20 kernel: RBP: ffff810254a1d7f0 R08: ffff8100708a5ff8 R09: ffff8100708a5000
Mar 13 19:10:05 oss20 kernel: R10: ffff81010c618038 R11: 000000004f5f1cf2 R12: 0000000000000000
Mar 13 19:10:05 oss20 kernel: R13: 0000000000000002 R14: ffff8101d7e17cd0 R15: ffffffff80063ac8
Mar 13 19:10:05 oss20 kernel: FS: 00002b99048b1220(0000) GS:ffff81010c499240(0000) knlGS:0000000000000000
Mar 13 19:10:05 oss20 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 13 19:10:05 oss20 kernel: CR2: 000000316a03b6a0 CR3: 00000003738ff000 CR4: 00000000000006e0
Mar 13 19:10:05 oss20 kernel:
Mar 13 19:10:05 oss20 kernel: Call Trace:
Mar 13 19:10:05 oss20 kernel: [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b21ab23>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Mar 13 19:10:05 oss20 kernel: [<ffffffff80036d8e>] __lookup_hash+0x10b/0x12f
Mar 13 19:10:05 oss20 kernel: [<ffffffff800e861d>] lookup_one_len+0x53/0x61
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b2b136d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Mar 13 19:10:05 oss20 kernel: [<ffffffff80286b42>] __down_trylock+0x44/0x4e
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b2cc38b>] :obdfilter:filter_lvbo_init+0x3bb/0x69b
Mar 13 19:10:05 oss20 kernel: [<ffffffff8aecd3a0>] :lnet:LNetPut+0x710/0x820
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afa22c0>] :ptlrpc:ldlm_resource_get+0x900/0xa60
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b270600>] :ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afc12a0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8af98efa>] :ptlrpc:ldlm_lock_create+0xba/0x9f0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afddfb1>] :ptlrpc:lustre_swab_buf+0x81/0x170
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afbbd30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afbbd30>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afc12a0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b270600>] :ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afbe5b0>] :ptlrpc:ldlm_handle_enqueue+0x670/0x1210
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afdcbc8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 19:10:05 oss20 kernel: [<ffffffff8b2784c3>] :ost:ost_handle+0x4fe3/0x55b0
Mar 13 19:10:05 oss20 kernel: [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 19:10:05 oss20 kernel: [<ffffffff800767e2>] smp_send_reschedule+0x4e/0x53
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afec15a>] :ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
Mar 13 19:10:05 oss20 kernel: [<ffffffff8afec8a8>] :ptlrpc:ptlrpc_wait_event+0x2d8/0x310
Mar 13 19:10:05 oss20 kernel: [<ffffffff8008b403>] __wake_up_common+0x3e/0x68
Mar 13 19:10:06 oss20 kernel: [<ffffffff8afed817>] :ptlrpc:ptlrpc_main+0xf37/0x10f0
Mar 13 19:10:06 oss20 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 19:10:06 oss20 kernel: [<ffffffff8afec8e0>] :ptlrpc:ptlrpc_main+0x0/0x10f0
Mar 13 19:10:06 oss20 kernel: [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 19:10:06 oss20 kernel:

Taeyoung Hong
Senior Researcher
Supercomputing Center, KISTI, Korea
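For concreteness, a sketch of the recovery commands described at the top of this message. The device name (/dev/md17) and OST UUID form (scratch-OST0034_UUID) are taken from the logs; the mount points (/mnt/ost, /lustre) are assumptions, and the OST must be stopped before the e2fsck runs:

e2fsck -fn /dev/md17    # -f: force a full check; -n: answer "no" to all prompts (read-only)
e2fsck -fp /dev/md17    # -p: "preen" mode, automatically fixing only safe problems

# mount the OST as plain ldiskfs and restore orphaned objects from lost+found
mount -t ldiskfs /dev/md17 /mnt/ost
ll_recover_lost_found_objs -d /mnt/ost/lost+found

# on a client: list the files that have objects on the affected OST
lfs find -O scratch-OST0034_UUID /lustre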
You have a bad disk at 3:11; then at 9:27 you got a soft lockup (doing a read?); you failed the bad disk at 10:03; the rebuild finished at 14:25; and then you got another soft lockup at 19:10 (with a different stack trace).

You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7 kernel patches. Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (or at least patch your kernel with that fix): md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5). This fix does NOT appear to have been picked up by whamcloud.

Kevin
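For reference, a sketch of the array check suggested above, via md's sysfs interface; md17 is taken from the logs, and the "check" action only counts mismatches, it does not rewrite data:

echo check > /sys/block/md17/md/sync_action   # start a background consistency check
cat /proc/mdstat                              # watch the check progress
cat /sys/block/md17/md/mismatch_cnt           # non-zero afterwards means inconsistent stripes were found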
On Mar 18, 2012, at 9:22 PM, Tae Young Hong wrote:
> [original message quoted in full; see above]
On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote:
> You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7 kernel patches. Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix).
>
> md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5). This fix does NOT appear to have been picked up by whamcloud.

as you say, the md rebuild bug is in all kernels < 2.6.32
http://marc.info/?l=linux-raid&m=130192650924540&w=2

the Whamcloud fix is LU-824, which landed in git a tad after 1.8.7-wc1.

I also asked RedHat nicely, and they added the same patch to RHEL5.8 kernels, which IMHO is the correct place for a fundamental md fix. so once Lustre supports RHEL5.8 servers, the patch in Lustre isn't needed any more.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Thank you for your information.

Today I tested our OSS after reading bugzilla 24264: after patching the kernel (http://review.whamcloud.com/#change,1672), I rebuilt the md in question with one new disk added (because we had only 9 disks left for the RAID6 8+2), reran e2fsck -fn, and finally tried to mount it, but I still saw the ll_ost soft lockup. The call trace messages are the same as before, so I think ours is not the case you described.

Anyway, yesterday I tried the simplest method below, to see whether ldiskfs works properly on its own:

mount -t ldiskfs -o ro,extents,mballoc /dev/md17 /mnt/kkk
find /mnt/kkk -type f | while read f; do echo $f >&2 ; cat $f > /dev/null ; done

I got the following syslog messages while running this "find/cat" command; however, the command finished without any other kernel error or soft lockup:

Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=978944, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=282624, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=290816, inode=0, rec_len=0, name_len=0
Mar 19 19:01:15 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341379: rec_len is smaller than minimal - offset=528384, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1196032, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1187840, inode=0, rec_len=0, name_len=0
...

cat ldiskfs_syslog_error.20120320 | awk '{print $15}' | sort | uniq -c
      4 #341257:
     12 #341258:
      4 #341259:
      4 #341379:
     12 #341380:
      4 #341381:
     12 #341382:
     12 #341383:
      4 #341384:
      4 #341507:
      4 #341510:

The data blocks of these 11 directories seem corrupted. I don't know what I can do further.

regards,
Taeyoung Hong

On 2012. 3. 19., at 11:27, Robin Humble wrote:
> [Robin's message quoted in full; see above]
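For what it's worth, the directory inode numbers in the error summary above can be mapped back to path names with debugfs's ncheck command; a sketch, with the device name from the logs and the inode numbers from the summary (debugfs opens the device read-only by default):

debugfs -R "ncheck 341257 341258 341259 341382" /dev/md17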
It worked. Thank you for the valuable tip. I never expected this simple command (e2fsck -fD) could be the solution.

regards,
Taeyoung Hong

On 2012. 3. 21., at 7:43, Bernd Schubert wrote:
> I'm removing lustre-discuss, as I'm an FhGFS developer now and I don't think my boss would like to see me posting to Lustre lists... Anyway, I'm still reading and sometimes helping here.
>
> On 03/20/2012 03:42 PM, Tae Young Hong wrote:
>> [message quoted in full; see above]
>
> Ideal would be to upload an e2image somewhere so that we can fix e2fsck. If you would be willing to do that, the procedure would be:
>
> e2image -r /dev/md17 /path/to/md17.e2image
> tar cvfS image.tar /path/to/md17.e2image
>
> You probably don't want to do the "e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2" command from the man page, as compressing all the zeros usually takes several days of bzip2 time.
>
> However, most recent e2fsprogs versions now also may create qcow2 images, which should be the fastest method to create the metadata image. However, I'm not sure whether the current e2fsprogs version for Lustre can do that yet.
>
> Anyway, to fix the htree problem, running "e2fsck -fvD /dev/md17" should do the trick. But I really would like to see the issue solved in e2fsck some day, so that it is fixed automatically without "-D".
>
> Cheers,
> Bernd
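Putting the thread's resolution together as a sketch (device name from the logs; the Lustre mount point is an assumption, and the OST must be unmounted while e2fsck runs):

umount /mnt/kkk                       # make sure the OST is not mounted anywhere
e2fsck -fvD /dev/md17                 # -D rebuilds the directories, replacing the corrupted htree blocks
mount -t lustre /dev/md17 /mnt/ost    # bring the OST back into service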