Recently I found that an ll_ost thread on one of our OSSes hit a soft lockup, which
started after a RAID rebuild. After rebooting the OSS that serves this OST, we ran
e2fsck with "-fn" and then with "-fp", but the symptom still occurs. In some cases
ll_ost_creat hangs instead of ll_ost after the reboot.
While the soft lockup is occurring, we usually cannot delete or create files
belonging to this OST (found with "lfs find -O  ...uuid"). And when I mounted the
OST as ldiskfs and ran ll_recover_lost_found_objs, the recovery thread also hit a
soft lockup.
Kernel and Lustre version: 2.6.18-194.17.1.el5_lustre.1.8.5smp
HW: Sun J4400 arrays and Sun x4270 OSS nodes
No other hardware problem has been found.
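For reference, the check-and-recover steps described above amount to roughly the following. This is a dry-run sketch that only prints the commands; the device, mount point, and OST UUID are placeholders from our setup, not exact values:

```shell
# Dry-run sketch of the recovery steps described above.
# DEV, MNT and the OST UUID are placeholders -- substitute your own.
DEV=${DEV:-/dev/md17}
MNT=${MNT:-/mnt/ost}
run() { echo "+ $*"; }   # print each command instead of executing it

run e2fsck -fn "$DEV"                          # read-only check first
run e2fsck -fp "$DEV"                          # then let e2fsck fix ("preen")
run lfs find /scratch -O scratch-OST0034_UUID  # list files striped on this OST
run mount -t ldiskfs "$DEV" "$MNT"             # mount the OST as plain ldiskfs
run ll_recover_lost_found_objs -d "$MNT/lost+found"  # restore orphaned objects
```

Remove the `run` wrapper to execute the commands for real (with the OST unmounted from Lustre first).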
 
#########  disk error occurred
Mar 13 03:11:26 oss20 kernel: mptbase: ioc2: LogInfo(0x31080000):
Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Mar 13 03:11:46 oss20 kernel: sd 2:0:27:0: SCSI error: return code = 0x08000002
Mar 13 03:11:46 oss20 kernel: sdbx: Current: sense key: Medium Error
Mar 13 03:11:46 oss20 kernel:     Add. Sense: Unrecovered read error
Mar 13 03:11:46 oss20 kernel:
Mar 13 03:11:46 oss20 kernel: Info fld=0x5656e6b4
Mar 13 03:11:46 oss20 kernel: end_request: I/O error, dev sdbx, sector
1448535732
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at
1448535728 on sdbx)
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at
1448535736 on sdbx)
...
########### call trace occurred
Mar 13 09:12:52 oss20 kernel: Lustre: scratch-OST0034: haven't heard
from client scratch-mdtlov_UUID (at 10.3.52.161 at o2ib) in 227 seconds. I think
it's dead, and I am evicting it.
Mar 13 09:12:52 oss20 kernel: Lustre: Skipped 2 previous similar messages
Mar 13 09:13:47 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810450875000 x1390684636813427/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:47 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8101e17a5400 x1390684636813440/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:57 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81036449d000 x1390684636813453/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:34 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8104dec0f000 x1390684636813541/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 288/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:38 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81010fd4b800 x1390684636813561/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:18:55 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810367104c00 x1390684636814090/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 576/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:02 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810453592000 x1390684636814172/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 320/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:23 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 147s: evicting client at 10.3.27.16 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff810239b84e00/0x8134b24415b8c002 lrc: 3/0,0
mode: PW/PW res: 8770349/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 0x80000020 remote: 0x2eb65052eeaac8dc expref:
17 pid: 17958 timeout 9632848008
Mar 13 09:20:24 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810341db8c00 x1390684636814189/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:25 oss20 kernel: LustreError:
17878:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff81022a56ac00 x1390697545930782/t0 o101-><?>@<?>:0/0
lens 296/0 e 0 to 0 dl 1331598031 ref 1 fl Interpret:/0/0 rc -107/0
Mar 13 09:22:07 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.3.30.72 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff81014a9c8000/0x8134b244156839db lrc: 3/0,0
mode: PW/PW res: 8770383/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x10020 remote: 0xb5fbb5ddee10ed12 expref: 24 pid: 17841
timeout 9632952155
Mar 13 09:22:08 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81016e6d1000 x1390684636814642/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:22:08 oss20 kernel: LustreError:
17966:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff810217882c00 x1390697597420037/t0 o400-><?>@<?>:0/0
lens 192/0 e 0 to 0 dl 1331598144 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:22:11 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.3.15.9 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff8101b899ce00/0x8134b2441567880a lrc: 3/0,0
mode: PW/PW res: 8770073/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x20 remote: 0xffb69ae75cb1e32e expref: 9 pid: 17861 timeout
9632956878
Mar 13 09:22:12 oss20 kernel: LustreError:
17951:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff81037d4f7000 x1390697363779506/t0 o400-><?>@<?>:0/0
lens 192/0 e 0 to 0 dl 1331598148 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:25:09 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8104b2367000 x1390684636814999/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:27:44 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 151s: evicting client at 10.3.11.76 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff810460625c00/0x8134b24415b86381 lrc: 3/0,0
mode: PW/PW res: 8798589/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x20 remote: 0x237e7fdbefcbfed expref: 8 pid: 17872 timeout
9633289138
Mar 13 09:27:49 oss20 kernel: Lustre: Service thread pid 18061 was inactive for
442.00s. The thread might be hung, or it might only be slow and will resume
later. Dumping the stack trace for debugging purposes:
Mar 13 09:27:49 oss20 kernel: Pid: 18061, comm: ll_ost_io_92
Mar 13 09:27:49 oss20 kernel:
Mar 13 09:27:49 oss20 kernel: Call Trace:
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8006462b>]
__down_write_nested+0x7a/0x92
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b2b99c0>]
filter_destroy+0x9b0/0x1fb0 [obdfilter]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb4980>]
ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb8060>]
ldlm_completion_ast+0x0/0x880 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdda74>]
lustre_msg_add_version+0x34/0x110 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe077a>]
lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8af9befc>]
__ldlm_handle2lock+0x8c/0x360 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe0889>]
lustre_pack_reply+0x29/0xb0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b26b050>]
ost_destroy+0x660/0x790 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdcbc8>]
lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdca15>]
lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b274a36>]
ost_handle+0x1556/0x55b0 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:27:49 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec15a>]
ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec8a8>]
ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afed817>]
ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11 
############## call trace occurred for the 2nd time
Mar 13 09:29:18 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810502953400 x1390684636815395/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:29:18 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) Skipped 1 previous similar
message
Mar 13 09:29:34 oss20 kernel: Lustre: Service thread pid 17995 was inactive for
442.00s. The thread might be hung, or it might only be slow and will resume
later. Dumping the stack trace for debugging purposes:
Mar 13 09:29:34 oss20 kernel: Pid: 17995, comm: ll_ost_io_26
Mar 13 09:29:34 oss20 kernel:
Mar 13 09:29:34 oss20 kernel: Call Trace:
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8006e20b>]
do_gettimeofday+0x40/0x90
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028ace>] sync_page+0x0/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063802>] io_schedule+0x3f/0x67
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028b0c>] sync_page+0x3e/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063946>]
__wait_on_bit_lock+0x36/0x66
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8003fc3c>] __lock_page+0x5e/0x64
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800a0ab3>]
wake_bit_function+0x0/0x23
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80013b32>]
find_lock_page+0x69/0xa2
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80025bba>]
find_or_create_page+0x22/0x72
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5315>]
filter_get_page+0x35/0x70 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5951>]
filter_preprw_read+0x601/0xd30 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8aecd3a0>] LNetPut+0x710/0x820
[lnet]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd7ed3>]
ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe2765>]
lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c7dcc>]
filter_preprw+0x1d4c/0x1db0 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>]
ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe077a>]
lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>]
ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b26dab3>]
ost_brw_read+0xb33/0x1a70 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdca15>]
lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008cfd9>]
default_wake_function+0x0/0xe
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdcbc8>]
lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b276353>]
ost_handle+0x2e73/0x55b0 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec15a>]
ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec8a8>]
ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afed817>]
ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8afec8e0>] ptlrpc_main+0x0/0x10f0
[ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 09:29:35 oss20 kernel:
Mar 13 09:30:05 oss20 kernel: Lustre:
18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req at ffff8104e15ef000 x1390697425321956/t0
o4->104e3eaf-521c-6fd2-0ded-c959aeb89264 at NET_0x500000a03122f_UUID:0/0 lens
448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:05 oss20 kernel: Lustre:
18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req at ffff81041144d400 x1390697425321961/t0
o4->104e3eaf-521c-6fd2-0ded-c959aeb89264 at NET_0x500000a03122f_UUID:0/0 lens
448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre:
17982:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-49), not sending early reply
Mar 13 09:30:07 oss20 kernel:   req at ffff81025e7f6c50 x1387502968075317/t0
o4->e6cc42f9-1723-0037-1bef-8c11b5bb0cd5@:0/0 lens 448/416 e 2 to 0 dl
1331598612 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre:
17982:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar
messages
Mar 13 09:30:09 oss20 kernel: Lustre:
18027:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-50), not sending early reply
Mar 13 09:30:09 oss20 kernel:   req at ffff8105c4a83800 x1390670171393616/t0
o4->0c31049a-ac01-884e-a74a-f549ca801551@:0/0 lens 448/416 e 2 to 0 dl
1331598614 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:09 oss20 kernel: Lustre:
18027:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 8 previous similar
messages
Mar 13 09:30:26 oss20 kernel: Lustre:
18079:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-196), not sending early reply
Mar 13 09:30:26 oss20 kernel:   req at ffff81047934d800 x1390668976327099/t0
o4->240e8b56-f2f0-beab-0460-c95e7af9ebf2 at NET_0x500000a031347_UUID:0/0 lens
448/416 e 3 to 0 dl 1331598631 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:26 oss20 kernel: Lustre:
18079:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar
messages
Mar 13 09:30:56 oss20 kernel: Lustre:
18021:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-157), not sending early reply
Mar 13 09:30:56 oss20 kernel:   req at ffff810301bdd850 x1390697295264668/t0
o4->819aa1e7-48e9-c8c1-c639-4d7767aac076 at NET_0x500000a030b07_UUID:0/0 lens
448/416 e 3 to 0 dl 1331598661 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:56 oss20 kernel: Lustre:
18021:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 2 previous similar
messages
Mar 13 09:31:32 oss20 kernel: Lustre:
18039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-108), not sending early reply
Mar 13 09:31:32 oss20 kernel:   req at ffff810471a57000 x1390668308715230/t0
o4->80fad2a1-6a67-0aad-69a7-2ea3c9cf6909 at NET_0x500000a030e16_UUID:0/0 lens
448/416 e 1 to 0 dl 1331598697 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:31:32 oss20 kernel: Lustre:
18039:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar
message
Mar 13 09:32:04 oss20 kernel: Lustre:
18046:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-49), not sending early reply
Mar 13 09:32:04 oss20 kernel:   req at ffff81010eda4800 x1390697612903849/t0
o4->cdc6f8ec-952c-4d6a-7fd2-dca6fe51a617 at NET_0x500000a032018_UUID:0/0 lens
448/416 e 2 to 0 dl 1331598729 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:32:04 oss20 kernel: Lustre:
18046:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 13 previous similar
messages
Mar 13 09:32:05 oss20 kernel: Lustre:
17923:0:(ldlm_lib.c:574:target_handle_reconnect()) scratch-OST0034:
104e3eaf-521c-6fd2-0ded-c959aeb89264 reconnecting
Mar 13 09:32:05 oss20 kernel: Lustre:
17923:0:(ldlm_lib.c:874:target_handle_connect()) scratch-OST0034: refuse
reconnection from 104e3eaf-521c-6fd2-0ded-c959aeb89264 at 10.3.18.47@o2ib to
0xffff810293ab2000; still busy with 9 active RPCs
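A quick way to summarize an eviction storm like the one above is to count the "lock callback timer expired" events per client NID. This is a small helper I am sketching here (not something from the original report); the sample input below is abbreviated from the log:

```shell
# Count lock-callback evictions per client from syslog-style input on stdin.
count_evictions() {
  grep 'lock callback timer expired' |               # keep only eviction lines
    sed -n 's/.*evicting client at \([^ ]*\).*/\1/p' |  # extract the client NID
    sort | uniq -c | sort -rn                        # tally per client
}

count_evictions <<'EOF'
Mar 13 09:20:23 oss20 kernel: LustreError: lock callback timer expired after 147s: evicting client at 10.3.27.16 ns: filter-scratch-OST0034_UUID
Mar 13 09:22:07 oss20 kernel: LustreError: lock callback timer expired after 100s: evicting client at 10.3.30.72 ns: filter-scratch-OST0034_UUID
Mar 13 09:22:11 oss20 kernel: LustreError: lock callback timer expired after 100s: evicting client at 10.3.15.9 ns: filter-scratch-OST0034_UUID
Mar 13 09:27:44 oss20 kernel: LustreError: lock callback timer expired after 151s: evicting client at 10.3.11.76 ns: filter-scratch-OST0034_UUID
EOF
```

In practice you would feed it `/var/log/messages` from the OSS; note that syslog line wrapping in the mail above must be undone first, since each event is a single line in the real log.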
################ disk disabled manually by admin using "mdadm /dev/md17
--fail /dev/sdbx"
Mar 13 10:03:22 oss20 kernel: raid5: Disk failure on sdbx, disabling device.
Operation continuing on 9 devices
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:0, dev:sdbx
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:1, dev:sdcp
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: md: syncing RAID array md17
############## md rebuild done 
Mar 13 14:07:22 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:09:24 oss20 kernel: md: unbind<sdbx>
Mar 13 14:09:24 oss20 kernel: md: export_rdev(sdbx)
Mar 13 14:17:24 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:25:48 oss20 kernel: md: md17: sync done.
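For completeness, the manual fail-and-replace sequence implied by the log above is roughly the following. This is a dry-run sketch that only prints the commands; the device names are from our array:

```shell
# Dry-run sketch of replacing the failed member of md17.
run() { echo "+ $*"; }   # print each command instead of executing it

run mdadm /dev/md17 --fail /dev/sdbx     # mark the bad disk as failed
run mdadm /dev/md17 --remove /dev/sdbx   # remove it from the array
run mdadm /dev/md17 --add /dev/sdcp      # add the replacement; rebuild starts
run cat /proc/mdstat                     # watch rebuild progress
```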
################ ll_ost kernel thread  soft lockup occurred
Mar 13 19:10:04 oss20 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[ll_ost_29:17868]
Mar 13 19:10:04 oss20 kernel: CPU 2:
Mar 13 19:10:05 oss20 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U)
ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U)
raid1(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) rdma_ucm(U)
qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U)
ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U)
ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) ib_mthca(U)
ib_mad(U) ib_core(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U)
button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U)
parport(U) joydev(U) sg(U) igb(U) mlx4_core(U) i2c_i801(U) 8021q(U) i2c_core(U)
dca(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U)
dm_mod(U) dm_mem_cache(U) ahci(U) libata(U) mptfc(U) scsi_transport_fc(U)
mptspi(U) scsi_transport_spi(U) shpchp(U
Mar 13 19:10:05 oss20 kernel:  mptsas(U) mptscsih(U) mptbase(U)
scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U)
ohci_hcd(U) ehci_hcd(U)
Mar 13 19:10:05 oss20 kernel: Pid: 17868, comm: ll_ost_29 Tainted: G     
2.6.18-194.17.1.el5_lustre.1.8.5smp #1
Mar 13 19:10:05 oss20 kernel: RIP: 0010:[<ffffffff8b218cf4>] 
[<ffffffff8b218cf4>] :ldiskfs:ldiskfs_find_entry+0x1d4/0x5b0
Mar 13 19:10:05 oss20 kernel: RSP: 0018:ffff8102c7f637e0  EFLAGS: 00000202
Mar 13 19:10:05 oss20 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX:
0000000036f3f4cb
Mar 13 19:10:05 oss20 kernel: RDX: ffff8104fa4d1c00 RSI: ffff8102c7f637c0 RDI:
ffff81010c43e7a8
Mar 13 19:10:05 oss20 kernel: RBP: ffff810254a1d7f0 R08: ffff8100708a5ff8 R09:
ffff8100708a5000
Mar 13 19:10:05 oss20 kernel: R10: ffff81010c618038 R11: 000000004f5f1cf2 R12:
0000000000000000
Mar 13 19:10:05 oss20 kernel: R13: 0000000000000002 R14: ffff8101d7e17cd0 R15:
ffffffff80063ac8
Mar 13 19:10:05 oss20 kernel: FS:  00002b99048b1220(0000)
GS:ffff81010c499240(0000) knlGS:0000000000000000
Mar 13 19:10:05 oss20 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 13 19:10:05 oss20 kernel: CR2: 000000316a03b6a0 CR3: 00000003738ff000 CR4:
00000000000006e0
Mar 13 19:10:05 oss20 kernel:
Mar 13 19:10:05 oss20 kernel: Call Trace:
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b21ab23>]
:ldiskfs:ldiskfs_lookup+0x53/0x281
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80036d8e>]
__lookup_hash+0x10b/0x12f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800e861d>]
lookup_one_len+0x53/0x61
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2b136d>]
:obdfilter:filter_fid2dentry+0x42d/0x740
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80286b42>]
__down_trylock+0x44/0x4e
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2cc38b>]
:obdfilter:filter_lvbo_init+0x3bb/0x69b
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8aecd3a0>]
:lnet:LNetPut+0x710/0x820
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afa22c0>]
:ptlrpc:ldlm_resource_get+0x900/0xa60
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>]
:ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8af98efa>]
:ptlrpc:ldlm_lock_create+0xba/0x9f0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afddfb1>]
:ptlrpc:lustre_swab_buf+0x81/0x170
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>]
:ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>]
:ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>]
:ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbe5b0>]
:ptlrpc:ldlm_handle_enqueue+0x670/0x1210
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afdcbc8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2784c3>]
:ost:ost_handle+0x4fe3/0x55b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec15a>]
:ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec8a8>]
:ptlrpc:ptlrpc_wait_event+0x2d8/0x310
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afed817>]
:ptlrpc:ptlrpc_main+0xf37/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afec8e0>]
:ptlrpc:ptlrpc_main+0x0/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 19:10:06 oss20 kernel:
Taeyoung Hong
Senior Researcher
Supercomputing Center, KISTI
Korea
You had a bad disk at 3:11, then a soft lockup at 9:27 (during a read?); you
failed the bad disk at 10:03, the rebuild finished at 14:25, and then you got
another soft lockup at 19:10 (different stack trace).
You are running 1.8.5, which does not have the fix for the known MD raid5/6
rebuild corruption bug.  That fix was released in the Oracle Lustre 1.8.7 kernel
patches.  Unless you already applied that patch, you might want to run a check
of your raid arrays and consider an upgrade (at least patch your kernel with
that fix).
The patch is md-avoid-corrupted-ldiskfs-after-rebuild.patch in the
2.6-rhel5.series (note that this bug is NOT specific to rhel5). The fix does NOT
appear to have been picked up by Whamcloud.
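For anyone following along, a consistency check of an md array can be triggered through sysfs. A dry-run sketch (md17 is the array from the log above):

```shell
# Dry-run sketch of an md consistency check via sysfs.
run() { echo "+ $*"; }   # print each command instead of executing it

run sh -c 'echo check > /sys/block/md17/md/sync_action'  # start the check
run cat /sys/block/md17/md/mismatch_cnt                  # mismatched blocks found
run cat /proc/mdstat                                     # check progress
```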
Kevin
On Mar 18, 2012, at 9:22 PM, Tae Young Hong wrote:
Recently I found that on an OSS, ll_ost  thread got soft-lockup, which was
started after RAID rebuilding. After rebooting the OSS on which this OST belongs
to, we ran e2fsck with "-fn" and "-fp"
But this symptom still occurs. in some cases, we could see  ll_ost_creat got
hung instead of ll_ost after reboot.
during this soft lockup occurred, we couldn''t usually delete or create
files from/to  this OST has owned (lfs find -O  ...uuid). and when I tried to
run ll_recover_lost_found_objs after mounting the OST with ldiskfs, we could see
the recovery thread got soft lockup.
kerne and Lustre ver: kernel 2.6.18-194.17.1.el5_lustre.1.8.5smp
HW: J4400 and Sun x4270 for OSS
There is no other hardware problem found.
#########  disk error occurred
Mar 13 03:11:26 oss20 kernel: mptbase: ioc2: LogInfo(0x31080000):
Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Mar 13 03:11:46 oss20 kernel: sd 2:0:27:0: SCSI error: return code = 0x08000002
Mar 13 03:11:46 oss20 kernel: sdbx: Current: sense key: Medium Error
Mar 13 03:11:46 oss20 kernel:     Add. Sense: Unrecovered read error
Mar 13 03:11:46 oss20 kernel:
Mar 13 03:11:46 oss20 kernel: Info fld=0x5656e6b4
Mar 13 03:11:46 oss20 kernel: end_request: I/O error, dev sdbx, sector
1448535732
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at
1448535728 on sdbx)
Mar 13 03:11:47 oss20 kernel: raid5:md17: read error corrected (8 sectors at
1448535736 on sdbx)
?
########### call trace occurred
Mar 13 09:12:52 oss20 kernel: Lustre: scratch-OST0034: haven''t heard
from client scratch-mdtlov_UUID (at 10.3.52.161 at o2ib) in 227 seconds. I think
it''s dead, and I am evicting it.
Mar 13 09:12:52 oss20 kernel: Lustre: Skipped 2 previous similar messages
Mar 13 09:13:47 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810450875000 x1390684636813427/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:47 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8101e17a5400 x1390684636813440/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:13:57 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81036449d000 x1390684636813453/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:34 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8104dec0f000 x1390684636813541/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 288/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:14:38 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81010fd4b800 x1390684636813561/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:18:55 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810367104c00 x1390684636814090/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 576/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:02 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810453592000 x1390684636814172/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 320/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:23 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 147s: evicting client at 10.3.27.16 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff810239b84e00/0x8134b24415b8c002 lrc: 3/0,0
mode: PW/PW res: 8770349/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 0x80000020 remote: 0x2eb65052eeaac8dc expref:
17 pid: 17958 timeout 9632848008
Mar 13 09:20:24 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff810341db8c00 x1390684636814189/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 224/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:20:25 oss20 kernel: LustreError:
17878:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff81022a56ac00 x1390697545930782/t0 o101-><?>@<?>:0/0
lens 296/0 e 0 to 0 dl 1331598031 ref 1 fl Interpret:/0/0 rc -107/0
Mar 13 09:22:07 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.3.30.72 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff81014a9c8000/0x8134b244156839db lrc: 3/0,0
mode: PW/PW res: 8770383/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x10020 remote: 0xb5fbb5ddee10ed12 expref: 24 pid: 17841
timeout 9632952155
Mar 13 09:22:08 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff81016e6d1000 x1390684636814642/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:22:08 oss20 kernel: LustreError:
17966:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff810217882c00 x1390697597420037/t0 o400-><?>@<?>:0/0
lens 192/0 e 0 to 0 dl 1331598144 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:22:11 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 100s: evicting client at 10.3.15.9 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff8101b899ce00/0x8134b2441567880a lrc: 3/0,0
mode: PW/PW res: 8770073/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x20 remote: 0xffb69ae75cb1e32e expref: 9 pid: 17861 timeout
9632956878
Mar 13 09:22:12 oss20 kernel: LustreError:
17951:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) 
req at ffff81037d4f7000 x1390697363779506/t0 o400-><?>@<?>:0/0
lens 192/0 e 0 to 0 dl 1331598148 ref 1 fl Interpret:H/0/0 rc -107/0
Mar 13 09:25:09 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req at
ffff8104b2367000 x1390684636814999/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 352/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:27:44 oss20 kernel: LustreError:
0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
after 151s: evicting client at 10.3.11.76 at o2ib  ns:
filter-scratch-OST0034_UUID lock: ffff810460625c00/0x8134b24415b86381 lrc: 3/0,0
mode: PW/PW res: 8798589/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->4095) flags: 0x20 remote: 0x237e7fdbefcbfed expref: 8 pid: 17872 timeout
9633289138
Mar 13 09:27:49 oss20 kernel: Lustre: Service thread pid 18061 was inactive for
442.00s. The thread might be hung, or it might only be slow and will resume
later. Dumping the stack trace for debugging purposes:
Mar 13 09:27:49 oss20 kernel: Pid: 18061, comm: ll_ost_io_92
Mar 13 09:27:49 oss20 kernel:
Mar 13 09:27:49 oss20 kernel: Call Trace:
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8006462b>]
__down_write_nested+0x7a/0x92
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b2b99c0>]
filter_destroy+0x9b0/0x1fb0 [obdfilter]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb4980>]
ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afb8060>]
ldlm_completion_ast+0x0/0x880 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdda74>]
lustre_msg_add_version+0x34/0x110 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe077a>]
lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8af9befc>]
__ldlm_handle2lock+0x8c/0x360 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afe0889>]
lustre_pack_reply+0x29/0xb0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b26b050>]
ost_destroy+0x660/0x790 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdcbc8>]
lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afdca15>]
lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8b274a36>]
ost_handle+0x1556/0x55b0 [ost]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:27:49 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec15a>]
ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afec8a8>]
ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8afed817>]
ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:27:49 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
############## call trace occurred for the 2nd time
Mar 13 09:29:18 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@ffff810502953400 x1390684636815395/t0 o401->@NET_0x500000a0334a1_UUID:17/18
lens 256/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Mar 13 09:29:18 oss20 kernel: LustreError:
18126:0:(client.c:841:ptlrpc_import_delay_req()) Skipped 1 previous similar
message
Mar 13 09:29:34 oss20 kernel: Lustre: Service thread pid 17995 was inactive for
442.00s. The thread might be hung, or it might only be slow and will resume
later. Dumping the stack trace for debugging purposes:
Mar 13 09:29:34 oss20 kernel: Pid: 17995, comm: ll_ost_io_26
Mar 13 09:29:34 oss20 kernel:
Mar 13 09:29:34 oss20 kernel: Call Trace:
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8006e20b>]
do_gettimeofday+0x40/0x90
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028ace>] sync_page+0x0/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063802>] io_schedule+0x3f/0x67
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80028b0c>] sync_page+0x3e/0x42
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80063946>]
__wait_on_bit_lock+0x36/0x66
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8003fc3c>] __lock_page+0x5e/0x64
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800a0ab3>]
wake_bit_function+0x0/0x23
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80013b32>]
find_lock_page+0x69/0xa2
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80025bba>]
find_or_create_page+0x22/0x72
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5315>]
filter_get_page+0x35/0x70 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c5951>]
filter_preprw_read+0x601/0xd30 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8aecd3a0>] LNetPut+0x710/0x820
[lnet]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd7ed3>]
ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe2765>]
lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b2c7dcc>]
filter_preprw+0x1d4c/0x1db0 [obdfilter]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>]
ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afe077a>]
lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afd8658>]
ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b26dab3>]
ost_brw_read+0xb33/0x1a70 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdca15>]
lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008cfd9>]
default_wake_function+0x0/0xe
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afdcbc8>]
lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8b276353>]
ost_handle+0x2e73/0x55b0 [ost]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 09:29:34 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec15a>]
ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afec8a8>]
ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 09:29:34 oss20 kernel:  [<ffffffff8afed817>]
ptlrpc_main+0xf37/0x10f0 [ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8afec8e0>] ptlrpc_main+0x0/0x10f0
[ptlrpc]
Mar 13 09:29:35 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 09:29:35 oss20 kernel:
Mar 13 09:30:05 oss20 kernel: Lustre:
18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req@ffff8104e15ef000 x1390697425321956/t0
o4->104e3eaf-521c-6fd2-0ded-c959aeb89264@NET_0x500000a03122f_UUID:0/0 lens
448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:05 oss20 kernel: Lustre:
18022:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-239), not sending early reply
Mar 13 09:30:05 oss20 kernel:   req@ffff81041144d400 x1390697425321961/t0
o4->104e3eaf-521c-6fd2-0ded-c959aeb89264@NET_0x500000a03122f_UUID:0/0 lens
448/416 e 5 to 0 dl 1331598610 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre:
17982:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-49), not sending early reply
Mar 13 09:30:07 oss20 kernel:   req@ffff81025e7f6c50 x1387502968075317/t0
o4->e6cc42f9-1723-0037-1bef-8c11b5bb0cd5@:0/0 lens 448/416 e 2 to 0 dl
1331598612 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:07 oss20 kernel: Lustre:
17982:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar
messages
Mar 13 09:30:09 oss20 kernel: Lustre:
18027:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-50), not sending early reply
Mar 13 09:30:09 oss20 kernel:   req@ffff8105c4a83800 x1390670171393616/t0
o4->0c31049a-ac01-884e-a74a-f549ca801551@:0/0 lens 448/416 e 2 to 0 dl
1331598614 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:09 oss20 kernel: Lustre:
18027:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 8 previous similar
messages
Mar 13 09:30:26 oss20 kernel: Lustre:
18079:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-196), not sending early reply
Mar 13 09:30:26 oss20 kernel:   req@ffff81047934d800 x1390668976327099/t0
o4->240e8b56-f2f0-beab-0460-c95e7af9ebf2@NET_0x500000a031347_UUID:0/0 lens
448/416 e 3 to 0 dl 1331598631 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:26 oss20 kernel: Lustre:
18079:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 7 previous similar
messages
Mar 13 09:30:56 oss20 kernel: Lustre:
18021:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-157), not sending early reply
Mar 13 09:30:56 oss20 kernel:   req@ffff810301bdd850 x1390697295264668/t0
o4->819aa1e7-48e9-c8c1-c639-4d7767aac076@NET_0x500000a030b07_UUID:0/0 lens
448/416 e 3 to 0 dl 1331598661 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:30:56 oss20 kernel: Lustre:
18021:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 2 previous similar
messages
Mar 13 09:31:32 oss20 kernel: Lustre:
18039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-108), not sending early reply
Mar 13 09:31:32 oss20 kernel:   req@ffff810471a57000 x1390668308715230/t0
o4->80fad2a1-6a67-0aad-69a7-2ea3c9cf6909@NET_0x500000a030e16_UUID:0/0 lens
448/416 e 1 to 0 dl 1331598697 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:31:32 oss20 kernel: Lustre:
18039:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 1 previous similar
message
Mar 13 09:32:04 oss20 kernel: Lustre:
18046:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-49), not sending early reply
Mar 13 09:32:04 oss20 kernel:   req@ffff81010eda4800 x1390697612903849/t0
o4->cdc6f8ec-952c-4d6a-7fd2-dca6fe51a617@NET_0x500000a032018_UUID:0/0 lens
448/416 e 2 to 0 dl 1331598729 ref 2 fl Interpret:/0/0 rc 0/0
Mar 13 09:32:04 oss20 kernel: Lustre:
18046:0:(service.c:808:ptlrpc_at_send_early_reply()) Skipped 13 previous similar
messages
Mar 13 09:32:05 oss20 kernel: Lustre:
17923:0:(ldlm_lib.c:574:target_handle_reconnect()) scratch-OST0034:
104e3eaf-521c-6fd2-0ded-c959aeb89264 reconnecting
Mar 13 09:32:05 oss20 kernel: Lustre:
17923:0:(ldlm_lib.c:874:target_handle_connect()) scratch-OST0034: refuse
reconnection from 104e3eaf-521c-6fd2-0ded-c959aeb89264@10.3.18.47@o2ib
to 0xffff810293ab2000; still busy with 9 active RPCs
################ disk disabled manually by admin using "mdadm /dev/md17 --fail /dev/sdbx"
Mar 13 10:03:22 oss20 kernel: raid5: Disk failure on sdbx, disabling device.
Operation continuing on 9 devices
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:0, dev:sdbx
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: RAID5 conf printout:
Mar 13 10:03:23 oss20 kernel:  --- rd:10 wd:9 fd:1
Mar 13 10:03:23 oss20 kernel:  disk 0, o:1, dev:sdbv
Mar 13 10:03:23 oss20 kernel:  disk 1, o:1, dev:sdbw
Mar 13 10:03:23 oss20 kernel:  disk 2, o:1, dev:sdcp
Mar 13 10:03:23 oss20 kernel:  disk 3, o:1, dev:sdby
Mar 13 10:03:23 oss20 kernel:  disk 4, o:1, dev:sdbz
Mar 13 10:03:23 oss20 kernel:  disk 5, o:1, dev:sdca
Mar 13 10:03:23 oss20 kernel:  disk 6, o:1, dev:sdcb
Mar 13 10:03:23 oss20 kernel:  disk 7, o:1, dev:sdcc
Mar 13 10:03:23 oss20 kernel:  disk 8, o:1, dev:sdcd
Mar 13 10:03:23 oss20 kernel:  disk 9, o:1, dev:sdce
Mar 13 10:03:23 oss20 kernel: md: syncing RAID array md17
############## md rebuild done
Mar 13 14:07:22 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:09:24 oss20 kernel: md: unbind<sdbx>
Mar 13 14:09:24 oss20 kernel: md: export_rdev(sdbx)
Mar 13 14:17:24 oss20 kernel: md: rebuild md17 throttled due to IO
Mar 13 14:25:48 oss20 kernel: md: md17: sync done.
################ ll_ost kernel thread  soft lockup occurred
Mar 13 19:10:04 oss20 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[ll_ost_29:17868]
Mar 13 19:10:04 oss20 kernel: CPU 2:
Mar 13 19:10:05 oss20 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U)
ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U)
raid1(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) rdma_ucm(U)
qlgc_vnic(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U)
ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U)
ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) ib_mthca(U)
ib_mad(U) ib_core(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U)
backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U)
button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U)
parport(U) joydev(U) sg(U) igb(U) mlx4_core(U) i2c_i801(U) 8021q(U) i2c_core(U)
dca(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U)
dm_mod(U) dm_mem_cache(U) ahci(U) libata(U) mptfc(U) scsi_transport_fc(U)
mptspi(U) scsi_transport_spi(U) shpchp(U
Mar 13 19:10:05 oss20 kernel:  mptsas(U) mptscsih(U) mptbase(U)
scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U)
ohci_hcd(U) ehci_hcd(U)
Mar 13 19:10:05 oss20 kernel: Pid: 17868, comm: ll_ost_29 Tainted: G     
2.6.18-194.17.1.el5_lustre.1.8.5smp #1
Mar 13 19:10:05 oss20 kernel: RIP: 0010:[<ffffffff8b218cf4>] 
[<ffffffff8b218cf4>] :ldiskfs:ldiskfs_find_entry+0x1d4/0x5b0
Mar 13 19:10:05 oss20 kernel: RSP: 0018:ffff8102c7f637e0  EFLAGS: 00000202
Mar 13 19:10:05 oss20 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX:
0000000036f3f4cb
Mar 13 19:10:05 oss20 kernel: RDX: ffff8104fa4d1c00 RSI: ffff8102c7f637c0 RDI:
ffff81010c43e7a8
Mar 13 19:10:05 oss20 kernel: RBP: ffff810254a1d7f0 R08: ffff8100708a5ff8 R09:
ffff8100708a5000
Mar 13 19:10:05 oss20 kernel: R10: ffff81010c618038 R11: 000000004f5f1cf2 R12:
0000000000000000
Mar 13 19:10:05 oss20 kernel: R13: 0000000000000002 R14: ffff8101d7e17cd0 R15:
ffffffff80063ac8
Mar 13 19:10:05 oss20 kernel: FS:  00002b99048b1220(0000)
GS:ffff81010c499240(0000) knlGS:0000000000000000
Mar 13 19:10:05 oss20 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 13 19:10:05 oss20 kernel: CR2: 000000316a03b6a0 CR3: 00000003738ff000 CR4:
00000000000006e0
Mar 13 19:10:05 oss20 kernel:
Mar 13 19:10:05 oss20 kernel: Call Trace:
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b21ab23>]
:ldiskfs:ldiskfs_lookup+0x53/0x281
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80036d8e>]
__lookup_hash+0x10b/0x12f
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800e861d>]
lookup_one_len+0x53/0x61
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2b136d>]
:obdfilter:filter_fid2dentry+0x42d/0x740
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80286b42>]
__down_trylock+0x44/0x4e
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2cc38b>]
:obdfilter:filter_lvbo_init+0x3bb/0x69b
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8aecd3a0>]
:lnet:LNetPut+0x710/0x820
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afa22c0>]
:ptlrpc:ldlm_resource_get+0x900/0xa60
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>]
:ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8af98efa>]
:ptlrpc:ldlm_lock_create+0xba/0x9f0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afddfb1>]
:ptlrpc:lustre_swab_buf+0x81/0x170
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>]
:ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbbd30>]
:ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afc12a0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b270600>]
:ost:ost_blocking_ast+0x0/0x9b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afbe5b0>]
:ptlrpc:ldlm_handle_enqueue+0x670/0x1210
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afdcbc8>]
:ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8b2784c3>]
:ost:ost_handle+0x4fe3/0x55b0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff80150e00>] __next_cpu+0x19/0x28
Mar 13 19:10:05 oss20 kernel:  [<ffffffff800767e2>]
smp_send_reschedule+0x4e/0x53
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec15a>]
:ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8afec8a8>]
:ptlrpc:ptlrpc_wait_event+0x2d8/0x310
Mar 13 19:10:05 oss20 kernel:  [<ffffffff8008b403>]
__wake_up_common+0x3e/0x68
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afed817>]
:ptlrpc:ptlrpc_main+0xf37/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8afec8e0>]
:ptlrpc:ptlrpc_main+0x0/0x10f0
Mar 13 19:10:06 oss20 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Mar 13 19:10:06 oss20 kernel:
Taeyoung Hong
Senior Researcher
Supercomputing Center, KISTI
Korea
On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote:
> You are running 1.8.5, which does not have the fix for the known MD raid5/6
> rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7
> kernel patches. Unless you already applied that patch, you might want to run
> a check of your raid arrays and consider an upgrade (at least patch your
> kernel with that fix).
>
> md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note
> that this bug is NOT specific to rhel5). This fix does NOT appear to have
> been picked up by whamcloud.

as you say, the md rebuild bug is in all kernels < 2.6.32
  http://marc.info/?l=linux-raid&m=130192650924540&w=2

the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1.

I also asked RedHat nicely, and they added the same patch to RHEL5.8 kernels,
which IMHO is the correct place for a fundamental md fix.

so once Lustre supports RHEL5.8 servers, then the patch in Lustre isn't
needed any more.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Thank you for your information.
Today I tested our OSS after reading bugzilla 24264: after patching the
kernel (http://review.whamcloud.com/#change,1672), I rebuilt the md in question
with one new disk added (because we had only 9 disks for RAID6 8+2), then
reran e2fsck -fn, and finally tried to mount it, but I still saw the ll_ost soft
lockup. The call trace messages are the same as before, so I think ours is not
the case you described.
Anyway, yesterday I tried the simplest method below, to see whether ldiskfs
works properly on its own:

mount -t ldiskfs -o ro,extents,mballoc /dev/md17 /mnt/kkk
find /mnt/kkk -type f | while read f; do echo $f >&2 ; cat $f > /dev/null ; done

I got the following syslog messages while running this "find/cat" loop;
however, the command finished without any other kernel error or soft lockup.
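For anyone repeating this read-verification pass, the same loop can be rehearsed first on a throwaway directory; this sketch (the helper name verify_tree is mine, not from the thread) adds read -r and an explicit failure message, which makes it a little more robust than the one-liner above:

```shell
#!/bin/sh
# Read every regular file under a tree, reporting any read failure.
# Same idea as the find/cat one-liner above, demonstrated on a
# temporary directory instead of the ldiskfs mount point.
verify_tree() {
    find "$1" -type f | while read -r f; do
        echo "$f" >&2                          # progress on stderr
        cat "$f" > /dev/null || echo "READ ERROR: $f"
    done
    echo "read pass complete"
}

d=$(mktemp -d)
echo hello > "$d/a"
mkdir "$d/sub" && echo world > "$d/sub/b"
verify_tree "$d"
rm -rf "$d"
```

On the real OST the argument would simply be the read-only ldiskfs mount point, e.g. /mnt/kkk.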
Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than
minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than
minimal - offset=978944, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than
minimal - offset=282624, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than
minimal - offset=290816, inode=0, rec_len=0, name_len=0
Mar 19 19:01:15 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341379: rec_len is smaller than
minimal - offset=528384, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than
minimal - offset=1196032, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17):
htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than
minimal - offset=1187840, inode=0, rec_len=0, name_len=0
...
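One plausible reason such entries can hang a thread rather than just log an error: ext3/ldiskfs walks a directory block by jumping forward rec_len bytes per entry, so an entry with rec_len=0 never advances the scan. A toy model of that walk (scan_block is a hypothetical helper, not kernel code):

```shell
#!/bin/sh
# Model of scanning one directory block: each entry's rec_len is the
# distance to the next entry, so rec_len=0 means no forward progress.
# Usage: scan_block <blocksize> <rec_len> [<rec_len> ...]
scan_block() {
    blocksize=$1; shift
    offset=0
    for rec_len in "$@"; do
        if [ "$rec_len" -lt 12 ]; then   # smallest legal entry is 12 bytes
            echo "bad entry at offset $offset: rec_len=$rec_len"
            return 1
        fi
        offset=$((offset + rec_len))
        if [ "$offset" -ge "$blocksize" ]; then break; fi
    done
    echo "scanned to offset $offset"
}

scan_block 4096 12 16 0 24   # third entry is corrupt, like the syslog above
```

A scan that refuses to advance and has no error-out path is exactly the shape of an ll_ost soft lockup in ldiskfs_find_entry.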
cat ldiskfs_syslog_error.20120320 | awk '{print $15}' | sort | uniq -c
      4 #341257:
     12 #341258:
      4 #341259:
      4 #341379:
     12 #341380:
      4 #341381:
     12 #341382:
     12 #341383:
      4 #341384:
      4 #341507:
      4 #341510:
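For reference, the tally works because field 15 of these syslog lines is the "#<inode>:" token; here is the same pipeline run on an inline sample (lines copied from the log above):

```shell
#!/bin/sh
# Count corrupt-directory reports per inode, as in the tally above.
# $15 of each message is the "#<inode>:" token printed by ldiskfs.
sample='Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=978944, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=282624, inode=0, rec_len=0, name_len=0'

printf '%s\n' "$sample" | awk '{print $15}' | sort | uniq -c
```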
11 directory data blocks seem to be corrupted; I don't know what I can do
further.
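One thing that can be done with those inode numbers is mapping them back to pathnames with debugfs's ncheck command (against the unmounted /dev/md17, e.g. debugfs -R "ncheck 341259 341258" /dev/md17). The sketch below demonstrates it on a throwaway file-backed filesystem instead of the real device; it assumes e2fsprogs is installed:

```shell
#!/bin/sh
# Map inode numbers back to pathnames with debugfs ncheck,
# demonstrated on a small file-backed ext2 image.
ncheck_demo() {
    img=$(mktemp) || return 1
    dd if=/dev/zero of="$img" bs=1M count=4 2>/dev/null
    mke2fs -F -q "$img" || { rm -f "$img"; return 1; }
    # inode 11 is lost+found on a freshly created ext2/3 filesystem
    debugfs -R "ncheck 11" "$img" 2>/dev/null
    rm -f "$img"
}
ncheck_demo
```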
regards,
Taeyoung Hong
On 19 Mar 2012, at 11:27, Robin Humble wrote:
> On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote:
>> You are running 1.8.5, which does not have the fix for the known MD
raid5/6 rebuild corruption bug.  That fix was released in the Oracle Lustre
1.8.7 kernel patches.  Unless you already applied that patch, you might want to
run a check of your raid arrays and consider an upgrade (at least patch your
kernel with that fix).
>> 
>> md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series
(note that this bug is NOT specific to rhel5).  This fix does NOT appear to have
been picked up by whamcloud.
> 
> as you say, the md rebuild bug is in all kernels < 2.6.32
>  http://marc.info/?l=linux-raid&m=130192650924540&w=2
> 
> the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1.
> 
> I also asked RedHat nicely, and they added the same patch to RHEL5.8
> kernels, which IMHO is the correct place for a fundamental md fix.
> 
> so once Lustre supports RHEL5.8 servers, then the patch in Lustre
> isn't needed any more.
> 
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility
It worked. Thank you for the valuable tip. I never expected this simple
command (e2fsck -fD) could be the solution.

regards,
Taeyoung Hong

On 21 Mar 2012, at 7:43, Bernd Schubert wrote:

> I'm removing lustre-discuss as I'm an FhGFS developer now and I don't
> think my boss would like to see me posting to Lustre lists... Anyway,
> I'm still reading and sometimes helping here.
>
> On 03/20/2012 03:42 PM, Tae Young Hong wrote:
>> [...]
>
> Ideal would be to upload an e2image somewhere so that we can fix e2fsck.
> If you would be willing to do that, the procedure would be:
>
> e2image -r /dev/md17 /path/to/md17.e2image
> tar cvfS image.tar /path/to/md17.e2image
>
> You probably don't want to do the
> "e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2" command from the man
> page, as compressing all the zeros usually takes several days of bzip2
> time.
>
> However, most recent e2fsprogs versions now also may create qcow2
> images, which should be the fastest method to create the meta-data
> image. However, I'm not sure if the current e2fsprogs version for
> Lustre already can do that.
>
> Anyway, to fix the htree problem, running "e2fsck -fvD /dev/md17"
> should do the trick. But I really would like to see the issue solved
> in e2fsck to fix it automatically without "-D" some day.
>
> Cheers,
> Bernd
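Pulling Bernd's advice together: first snapshot the metadata with e2image (so the broken state is preserved for the e2fsprogs developers), then run e2fsck with -D to rebuild the corrupt directory blocks. A sketch on a throwaway file-backed image rather than /dev/md17 (assumes e2fsprogs; the thread's actual invocations were e2fsck -fD / -fvD on the real device):

```shell
#!/bin/sh
# Capture a metadata image, then repair directories with e2fsck -fD.
# On the real OST: e2image -r /dev/md17 md17.e2image; e2fsck -fvD /dev/md17
repair_demo() {
    img=$(mktemp) || return 1
    dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null
    mke2fs -F -q "$img" || { rm -f "$img"; return 1; }
    e2image -r "$img" "$img.e2i"          # raw metadata image (sparse file)
    e2fsck -fD -y "$img" >/dev/null 2>&1  # -D rebuilds/optimizes directories
    rc=$?
    rm -f "$img" "$img.e2i"
    echo "e2fsck exit=$rc"                # 0 = clean, 1 = errors corrected
}
repair_demo
```

The point of -D is that it rewrites directory blocks wholesale, which is why it clears rec_len corruption that plain e2fsck -fn / -fp passes leave in place.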