Niklas Edmundsson
2007-Sep-25 13:21 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
We managed to trigger another intriguing Lustre failure by simply rebooting our two OSS's (which have two OST's each), and then mounting the OST's on both OSS's in parallel (i.e. mount ost1; mount ost2 on both OSS's at the same time). The filesystem was mounted by 191 clients, of which twenty-ish had been recently active. The interconnect used is standard tcp over GigE.

Both OSS's were locking up intermittently after this; it all ended with having to do a hard reboot. Taking it slow and leisurely, mounting one OST, then waiting, then another one, then waiting, until we had all four mounted, resulted in a working Lustre backend.

The OSS's are dual-CPU amd64 boxes. I tried finding a corresponding issue in bugzilla, but I failed miserably as usual. Thoughts and suggestions are welcome...

We got tons of soft lockups and call traces; here is the first one from each OSS (with some added interesting log lines from the first OSS):

-------------------8<------------------------
[ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s
[ 1385.082596] LustreError: 6877:0:(ldlm_lockd.c:214:waiting_locks_callback()) ### lock callback timer expired: evicting client f993878f-7f88-addc-4b1f-394853cdb358 at NET_0x2000082ef4e43_UUID nid 130.239.78.67 at tcp ns: filter-hpfs-OST0000_UUID lock: ffff810052c08780/0x3a2961c77b2b19c1 lrc: 2/0,0 mode: PW/PW res: 1416400/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 10020 remote: 0x171c36570b6bb69d expref: 82 pid: 6401
[ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req at ffff8100e5c84800 x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
[ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
[ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago) req at ffff8100e24ee400 x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
[ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
[ 1395.921898] BUG: soft lockup detected on CPU#0!
[ 1395.926481]
[ 1395.926482] Call Trace:
[ 1395.930474] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1395.936758] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1395.942820] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1395.949332] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1395.955751] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1395.961811] <EOI> [<ffffffff803153b9>] number+0x109/0x270
[ 1395.967399] [<ffffffff80315ad1>] vsnprintf+0x5b1/0x630
[ 1395.972707] [<ffffffff883ad5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1395.979488] [<ffffffff8022b8e3>] __wake_up_common+0x43/0x80
[ 1395.985252] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.992056] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.998859] [<ffffffff884d1962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1396.004963] [<ffffffff8853b7a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1396.011651] [<ffffffff8853ccbb>] :ksocklnd:ksocknal_send+0x2eb/0x330
[ 1396.018188] [<ffffffff884d5601>] :ptlrpc:request_out_callback+0xe1/0x1f0
[ 1396.025056] [<ffffffff883e0b5b>] :lnet:lnet_enq_event_locked+0x9b/0xf0
[ 1396.031747] [<ffffffff883e0f69>] :lnet:lnet_finalize+0x119/0x260
[ 1396.037914] [<ffffffff883e7ba4>] :lnet:lnet_send+0x9c4/0x9e0
[ 1396.043728] [<ffffffff883e3437>] :lnet:lnet_prep_send+0x67/0xb0
[ 1396.049807] [<ffffffff883e8963>] :lnet:LNetPut+0x6a3/0x740
[ 1396.055445] [<ffffffff883e2c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1396.061370] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.067643] [<ffffffff884cc9ce>] :ptlrpc:ptl_send_rpc+0xc7e/0xf70
[ 1396.073914] [<ffffffff884c555a>] :ptlrpc:ptlrpc_check_set+0x66a/0xb90
[ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
[ 1396.087674] [<ffffffff884c5c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1396.094191] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1396.099560] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.105652] [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1396.120369] [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1396.127684] [<ffffffff884ad156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1396.134986] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1396.139923] [<ffffffff80287ebe>] transfer_objects+0x4e/0x80
[ 1396.145663] [<ffffffff8849c22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1396.152382] [<ffffffff88499928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1396.159615] [<ffffffff8849a366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1396.167454] [<ffffffff884b0d70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1396.173975] [<ffffffff884ad5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1396.181110] [<ffffffff886ceeec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1396.187913] [<ffffffff884af280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1396.194624] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.200888] [<ffffffff884d09b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1396.207939] [<ffffffff884d3e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1396.214636] [<ffffffff8869d4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1396.220607] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1396.225979] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1396.231617] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1396.237559] [<ffffffff8843bd05>] :obdclass:class_handle2object+0xd5/0x160
[ 1396.244535] [<ffffffff884cd1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1396.251506] [<ffffffff884d1a65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1396.257870] [<ffffffff884d7bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1396.265626] [<ffffffff884d951e>] :ptlrpc:ptlrpc_start_thread+0x95e/0xa00
[ 1396.272492] [<ffffffff80416ae0>] thread_return+0x0/0x100
[ 1396.277940] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1396.283579] [<ffffffff883adab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1396.289997] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1396.295627] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1396.300925] [<ffffffff884d9e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1396.307083] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.313161] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1396.318197] [<ffffffff884d95c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1396.324193] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------

-------------------8<------------------------
[ 1362.254118] BUG: soft lockup detected on CPU#1!
[ 1362.258808]
[ 1362.258809] Call Trace:
[ 1362.262940] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1362.269408] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1362.275742] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1362.282462] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1362.289117] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1362.295395] <EOI> [<ffffffff8031588b>] vsnprintf+0x36b/0x630
[ 1362.301519] [<ffffffff883ab5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1362.308574] [<ffffffff885397a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1362.315501] [<ffffffff884cf962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1362.321764] [<ffffffff883e0c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1362.327918] [<ffffffff884ce89f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
[ 1362.334858] [<ffffffff884eeb3e>] :ptlrpc:ptlrpc_lprocfs_rpc_sent+0x1e/0x120
[ 1362.342266] [<ffffffff884ca9e1>] :ptlrpc:ptl_send_rpc+0xc91/0xf70
[ 1362.348687] [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
[ 1362.356283] [<ffffffff884bd37e>] :ptlrpc:ptlrpc_import_delay_req+0x22e/0x240
[ 1362.363771] [<ffffffff884c311a>] :ptlrpc:ptlrpc_check_set+0x22a/0xb90
[ 1362.370626] [<ffffffff884c3c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1362.377440] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1362.382975] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.389284] [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1362.404585] [<ffffffff884aa5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1362.412120] [<ffffffff884ab156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1362.419684] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1362.424810] [<ffffffff8849a22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1362.431730] [<ffffffff88497928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1362.439217] [<ffffffff88498366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1362.447388] [<ffffffff884aed70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1362.454153] [<ffffffff884ab5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1362.461525] [<ffffffff886cceec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1362.468519] [<ffffffff884ad280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1362.475411] [<ffffffff884c910b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1362.481894] [<ffffffff884ce9b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1362.489175] [<ffffffff884d1e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1362.496054] [<ffffffff8869b4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1362.502251] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1362.507825] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1362.513661] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1362.519778] [<ffffffff88439d05>] :obdclass:class_handle2object+0xd5/0x160
[ 1362.526982] [<ffffffff884cb1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1362.534215] [<ffffffff884cfa65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1362.540699] [<ffffffff884d5bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1362.548745] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1362.554566] [<ffffffff883abab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1362.561220] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1362.566962] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1362.572462] [<ffffffff884d7e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1362.578796] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.585055] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1362.590266] [<ffffffff884d75c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1362.596491] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se  |  nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 FUNDAMENTALISM is never having to open your mind.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
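[Editor's note: the "slow and leisurely" workaround Niklas describes can be sketched as a small script. Device paths, mount points, and the settle delay are all assumptions, not taken from the thread; MOUNT defaults to echo so the sketch is a dry run, and would be set to mount for real use.]

```shell
#!/bin/sh
# Sketch of mounting OSTs one at a time with a pause between them,
# instead of mounting them all in parallel on both OSS nodes.
# Hypothetical devices/mount points -- substitute your own.
MOUNT=${MOUNT:-"echo mount"}   # dry run by default; use MOUNT=mount for real
WAIT=${WAIT:-0}                # e.g. WAIT=60 to let recovery settle per OST

for dev in /dev/sdb1 /dev/sdc1; do
    # Mount one OST, then wait before starting the next one.
    $MOUNT -t lustre "$dev" "/mnt/ost-${dev##*/}"
    if [ "$WAIT" -gt 0 ]; then
        sleep "$WAIT"
    fi
done
```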
Andreas Dilger
2007-Sep-25 21:03 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
On Sep 25, 2007 15:21 +0200, Niklas Edmundsson wrote:
> Both OSS's were really locking up intermittently after this, it all
> ended with having to do a hard reboot. Taking it slow and leisurely
> mounting one OST, then wait, another one, then wait until we had all
> four mounted resulted in a working lustre backend.
>
> We got tons of soft lockups and call traces, here are the first one
> from each OSS (with some added interesting log lines from the first OSS):
> -------------------8<------------------------
> [ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s
>
> [ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req at ffff8100e5c84800 x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
> [ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
> [ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago) req at ffff8100e24ee400 x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
> [ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
> [ 1395.921898] BUG: soft lockup detected on CPU#0!

I can imagine if you are getting 2M requests timing out then the CPU will be busy.

> [ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
> [ 1396.105652] [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
> [ 1396.120369] [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720

> [ 1362.348687] [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
> [ 1362.389284] [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290

Looks like something related to the parallel blocking callback code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Oleg Drokin
2007-Sep-25 22:32 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
Hello!

On Tue, Sep 25, 2007 at 03:21:55PM +0200, Niklas Edmundsson wrote:
> We managed to trigger another intriguing lustre failure by simply
> rebooting our two OSS's (which have two OST's each), and then mounting
> the OST's on both OSS's in parallel (ie mount ost1 ; mount ost2 on
> both OSS's at the same time). The filesystem was mounted by 191
> clients, of which twenty-ish had been recently active. Interconnect
> used is standard tcp over GigE.

First of all, you need to have the patch from bug 13614 if you don't have it yet:
https://bugzilla.lustre.org/attachment.cgi?id=12679

Bye,
    Oleg
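[Editor's note: applying a Bugzilla attachment like the one above typically follows this pattern. The real patch would be fetched from the URL in Oleg's mail; the tiny tree and patch below are fabricated stand-ins so the dry-run-then-apply steps can be shown end to end, and the source-tree name is an assumption.]

```shell
# Sketch: verify a patch applies cleanly with --dry-run before applying it.
set -e

# Fabricated stand-in for the lustre-1.6.2 tree and the bug 13614 attachment:
mkdir -p lustre-1.6.2
printf 'old line\n' > lustre-1.6.2/example.c
cat > bug13614.patch <<'EOF'
--- a/example.c
+++ b/example.c
@@ -1 +1 @@
-old line
+new line
EOF

# -p1 strips the leading a/ and b/ components; -d runs inside the tree.
patch -d lustre-1.6.2 -p1 --dry-run < bug13614.patch   # check it applies
patch -d lustre-1.6.2 -p1 < bug13614.patch             # apply, then rebuild
```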
Niklas Edmundsson
2007-Sep-26 07:26 UTC
[Lustre-discuss] Lustre 1.6.2 + 2.6.18 - lockup doing recover
On Tue, 25 Sep 2007, Oleg Drokin wrote:

> Hello!
>
> On Tue, Sep 25, 2007 at 03:21:55PM +0200, Niklas Edmundsson wrote:
>> We managed to trigger another intriguing lustre failure by simply
>> rebooting our two OSS's (which have two OST's each), and then mounting
>> the OST's on both OSS's in parallel (ie mount ost1 ; mount ost2 on
>> both OSS's at the same time). The filesystem was mounted by 191
>> clients, of which twenty-ish had been recently active. Interconnect
>> used is standard tcp over GigE.
>
> First of all, you need to have the patch from bug 13614 if you don't have it yet:
> https://bugzilla.lustre.org/attachment.cgi?id=12679

Nope, I didn't have that one. I'll apply it and see if it improves the situation.

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se  |  nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 Iraqi Bingo: B-52... F-16... M-1... F-18... F-117...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=