Niklas Edmundsson
2007-Sep-25 13:21 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
We managed to trigger another intriguing Lustre failure by simply rebooting our two OSS's (which have two OST's each), and then mounting the OST's on both OSS's in parallel (i.e. mount ost1; mount ost2 on both OSS's at the same time). The filesystem was mounted by 191 clients, of which twenty-ish had been recently active. The interconnect used is standard tcp over GigE.

Both OSS's were locking up intermittently after this; it all ended with having to do a hard reboot. Taking it slow and leisurely, mounting one OST, then waiting, then another one, then waiting, until we had all four mounted, resulted in a working Lustre backend.

The OSS's are dual-CPU amd64 boxes. I tried finding a corresponding issue in bugzilla, but I failed miserably as usual. Thoughts and suggestions are welcome...

We got tons of soft lockups and call traces; here is the first one from each OSS (with some added interesting log lines from the first OSS):

-------------------8<------------------------
[ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s
[ 1385.082596] LustreError: 6877:0:(ldlm_lockd.c:214:waiting_locks_callback()) ### lock callback timer expired: evicting client f993878f-7f88-addc-4b1f-394853cdb358 at NET_0x2000082ef4e43_UUID nid 130.239.78.67 at tcp ns: filter-hpfs-OST0000_UUID lock: ffff810052c08780/0x3a2961c77b2b19c1 lrc: 2/0,0 mode: PW/PW res: 1416400/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 10020 remote: 0x171c36570b6bb69d expref: 82 pid: 6401
[ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req at ffff8100e5c84800 x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
[ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
[ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago) req at ffff8100e24ee400 x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
[ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
[ 1395.921898] BUG: soft lockup detected on CPU#0!
[ 1395.926481]
[ 1395.926482] Call Trace:
[ 1395.930474] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1395.936758] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1395.942820] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1395.949332] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1395.955751] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1395.961811] <EOI> [<ffffffff803153b9>] number+0x109/0x270
[ 1395.967399] [<ffffffff80315ad1>] vsnprintf+0x5b1/0x630
[ 1395.972707] [<ffffffff883ad5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1395.979488] [<ffffffff8022b8e3>] __wake_up_common+0x43/0x80
[ 1395.985252] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.992056] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.998859] [<ffffffff884d1962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1396.004963] [<ffffffff8853b7a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1396.011651] [<ffffffff8853ccbb>] :ksocklnd:ksocknal_send+0x2eb/0x330
[ 1396.018188] [<ffffffff884d5601>] :ptlrpc:request_out_callback+0xe1/0x1f0
[ 1396.025056] [<ffffffff883e0b5b>] :lnet:lnet_enq_event_locked+0x9b/0xf0
[ 1396.031747] [<ffffffff883e0f69>] :lnet:lnet_finalize+0x119/0x260
[ 1396.037914] [<ffffffff883e7ba4>] :lnet:lnet_send+0x9c4/0x9e0
[ 1396.043728] [<ffffffff883e3437>] :lnet:lnet_prep_send+0x67/0xb0
[ 1396.049807] [<ffffffff883e8963>] :lnet:LNetPut+0x6a3/0x740
[ 1396.055445] [<ffffffff883e2c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1396.061370] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.067643] [<ffffffff884cc9ce>] :ptlrpc:ptl_send_rpc+0xc7e/0xf70
[ 1396.073914] [<ffffffff884c555a>] :ptlrpc:ptlrpc_check_set+0x66a/0xb90
[ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
[ 1396.087674] [<ffffffff884c5c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1396.094191] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1396.099560] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.105652] [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1396.120369] [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1396.127684] [<ffffffff884ad156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1396.134986] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1396.139923] [<ffffffff80287ebe>] transfer_objects+0x4e/0x80
[ 1396.145663] [<ffffffff8849c22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1396.152382] [<ffffffff88499928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1396.159615] [<ffffffff8849a366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1396.167454] [<ffffffff884b0d70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1396.173975] [<ffffffff884ad5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1396.181110] [<ffffffff886ceeec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1396.187913] [<ffffffff884af280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1396.194624] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.200888] [<ffffffff884d09b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1396.207939] [<ffffffff884d3e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1396.214636] [<ffffffff8869d4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1396.220607] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1396.225979] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1396.231617] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1396.237559] [<ffffffff8843bd05>] :obdclass:class_handle2object+0xd5/0x160
[ 1396.244535] [<ffffffff884cd1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1396.251506] [<ffffffff884d1a65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1396.257870] [<ffffffff884d7bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1396.265626] [<ffffffff884d951e>] :ptlrpc:ptlrpc_start_thread+0x95e/0xa00
[ 1396.272492] [<ffffffff80416ae0>] thread_return+0x0/0x100
[ 1396.277940] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1396.283579] [<ffffffff883adab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1396.289997] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1396.295627] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1396.300925] [<ffffffff884d9e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1396.307083] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.313161] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1396.318197] [<ffffffff884d95c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1396.324193] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------

-------------------8<------------------------
[ 1362.254118] BUG: soft lockup detected on CPU#1!
[ 1362.258808]
[ 1362.258809] Call Trace:
[ 1362.262940] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1362.269408] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1362.275742] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1362.282462] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1362.289117] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1362.295395] <EOI> [<ffffffff8031588b>] vsnprintf+0x36b/0x630
[ 1362.301519] [<ffffffff883ab5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1362.308574] [<ffffffff885397a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1362.315501] [<ffffffff884cf962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1362.321764] [<ffffffff883e0c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1362.327918] [<ffffffff884ce89f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
[ 1362.334858] [<ffffffff884eeb3e>] :ptlrpc:ptlrpc_lprocfs_rpc_sent+0x1e/0x120
[ 1362.342266] [<ffffffff884ca9e1>] :ptlrpc:ptl_send_rpc+0xc91/0xf70
[ 1362.348687] [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
[ 1362.356283] [<ffffffff884bd37e>] :ptlrpc:ptlrpc_import_delay_req+0x22e/0x240
[ 1362.363771] [<ffffffff884c311a>] :ptlrpc:ptlrpc_check_set+0x22a/0xb90
[ 1362.370626] [<ffffffff884c3c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1362.377440] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1362.382975] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.389284] [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1362.404585] [<ffffffff884aa5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1362.412120] [<ffffffff884ab156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1362.419684] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1362.424810] [<ffffffff8849a22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1362.431730] [<ffffffff88497928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1362.439217] [<ffffffff88498366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1362.447388] [<ffffffff884aed70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1362.454153] [<ffffffff884ab5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1362.461525] [<ffffffff886cceec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1362.468519] [<ffffffff884ad280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1362.475411] [<ffffffff884c910b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1362.481894] [<ffffffff884ce9b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1362.489175] [<ffffffff884d1e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1362.496054] [<ffffffff8869b4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1362.502251] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1362.507825] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1362.513661] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1362.519778] [<ffffffff88439d05>] :obdclass:class_handle2object+0xd5/0x160
[ 1362.526982] [<ffffffff884cb1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1362.534215] [<ffffffff884cfa65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1362.540699] [<ffffffff884d5bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1362.548745] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1362.554566] [<ffffffff883abab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1362.561220] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1362.566962] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1362.572462] [<ffffffff884d7e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1362.578796] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.585055] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1362.590266] [<ffffffff884d75c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1362.596491] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se  |  nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 FUNDAMENTALISM is never having to open your mind.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
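[Editor's note: the "slow and leisurely" workaround Niklas describes can be sketched as a small script. Device paths, mount points, and the settle delay are all assumptions, not taken from the thread; MOUNT defaults to echo so the sketch is a dry run, and would be set to mount for real use.]

```shell
#!/bin/sh
# Sketch of mounting OSTs one at a time with a pause between them,
# instead of mounting them all in parallel on both OSS nodes.
# Hypothetical devices/mount points -- substitute your own.
MOUNT=${MOUNT:-"echo mount"}   # dry run by default; use MOUNT=mount for real
WAIT=${WAIT:-0}                # e.g. WAIT=60 to let recovery settle per OST

for dev in /dev/sdb1 /dev/sdc1; do
    # Mount one OST, then wait before starting the next one.
    $MOUNT -t lustre "$dev" "/mnt/ost-${dev##*/}"
    if [ "$WAIT" -gt 0 ]; then
        sleep "$WAIT"
    fi
done
```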
Andreas Dilger
2007-Sep-25 21:03 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
On Sep 25, 2007 15:21 +0200, Niklas Edmundsson wrote:
> Both OSS's were really locking up intermittently after this, it all
> ended with having to do a hard reboot. Taking it slow and leisurely
> mounting one OST, then wait, another one, then wait until we had all
> four mounted resulted in a working lustre backend.
>
> We got tons of soft lockups and call traces, here are the first one
> from each OSS (with some added interesting log lines from the first OSS):
> -------------------8<------------------------
> [ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s
>
> [ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req at ffff8100e5c84800 x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
> [ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
> [ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago) req at ffff8100e24ee400 x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
> [ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
> [ 1395.921898] BUG: soft lockup detected on CPU#0!

I can imagine if you are getting 2M requests timing out then the CPU will be busy.

> [ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
> [ 1396.105652] [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
> [ 1396.120369] [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720

> [ 1362.348687] [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
> [ 1362.389284] [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290

Looks like something related to the parallel blocking callback code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Oleg Drokin
2007-Sep-25 22:32 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
Hello!

On Tue, Sep 25, 2007 at 03:21:55PM +0200, Niklas Edmundsson wrote:
> We managed to trigger another intriguing lustre failure by simply
> rebooting our two OSS's (which have two OST's each), and then mounting
> the OST's on both OSS's in parallel (ie mount ost1 ; mount ost2 on
> both OSS's at the same time). The filesystem was mounted by 191
> clients, of which twenty-ish had been recently active. Interconnect
> used is standard tcp over GigE.

First of all, you need to have the patch from bug 13614 if you don't have it yet:
https://bugzilla.lustre.org/attachment.cgi?id=12679

Bye,
    Oleg
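[Editor's note: applying a Bugzilla attachment like the one above typically follows this pattern. The real patch would be fetched from the URL in Oleg's mail; the tiny tree and patch below are fabricated stand-ins so the dry-run-then-apply steps can be shown end to end, and the source-tree name is an assumption.]

```shell
# Sketch: verify a patch applies cleanly with --dry-run before applying it.
set -e

# Fabricated stand-in for the lustre-1.6.2 tree and the bug 13614 attachment:
mkdir -p lustre-1.6.2
printf 'old line\n' > lustre-1.6.2/example.c
cat > bug13614.patch <<'EOF'
--- a/example.c
+++ b/example.c
@@ -1 +1 @@
-old line
+new line
EOF

# -p1 strips the leading a/ and b/ components; -d runs inside the tree.
patch -d lustre-1.6.2 -p1 --dry-run < bug13614.patch   # check it applies
patch -d lustre-1.6.2 -p1 < bug13614.patch             # apply, then rebuild
```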
Niklas Edmundsson
2007-Sep-26 07:26 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
On Tue, 25 Sep 2007, Oleg Drokin wrote:

> Hello!
>
> On Tue, Sep 25, 2007 at 03:21:55PM +0200, Niklas Edmundsson wrote:
>> We managed to trigger another intriguing lustre failure by simply
>> rebooting our two OSS's (which have two OST's each), and then mounting
>> the OST's on both OSS's in parallel (ie mount ost1 ; mount ost2 on
>> both OSS's at the same time). The filesystem was mounted by 191
>> clients, of which twenty-ish had been recently active. Interconnect
>> used is standard tcp over GigE.
>
> First of all, you need to have the patch from bug 13614 if you don't have it yet:
> https://bugzilla.lustre.org/attachment.cgi?id=12679

Nope, I didn't have that one. I'll apply it and see if it improves the situation.

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se  |  nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 Iraqi Bingo: B-52... F-16... M-1... F-18... F-117...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=