Temple Jason
2010-Sep-14 09:21 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
Hello,

I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4. The first day we brought our system online with the new version, we started seeing clients getting stuck in this soft lockup loop. The load shoots up over 120, and eventually the node becomes unusable and requires a hard reset. I've seen loops like this on the server side in previous lustre versions, but to have it happen on the client is completely new. Here is a bit of what I see in the logs:

Sep 13 21:11:39 wn122 kernel: LustreError: 27016:0:(statahead.c:289:ll_sai_entry_fini()) ASSERTION(sa_is_stopped(sai)) failed
Sep 13 21:11:39 wn122 kernel: LustreError: 27016:0:(statahead.c:289:ll_sai_entry_fini()) LBUG
Sep 13 21:11:39 wn122 kernel: Pid: 27016, comm: athena.py
Sep 13 21:11:39 wn122 kernel:
Sep 13 21:11:39 wn122 kernel: Call Trace:
Sep 13 21:11:39 wn122 kernel: [<ffffffff885cf6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Sep 13 21:11:39 wn122 kernel: [<ffffffff885cfbda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Sep 13 21:11:39 wn122 kernel: [<ffffffff885d7f00>] tracefile_init+0x0/0x110 [libcfs]
Sep 13 21:11:39 wn122 kernel: [<ffffffff889247e9>] ll_statahead_exit+0x409/0x500 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff8008a4b4>] default_wake_function+0x0/0xe
Sep 13 21:11:39 wn122 kernel: [<ffffffff888d2f2e>] ll_intent_drop_lock+0x8e/0xb0 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff8891c2bb>] ll_lookup_it+0x30b/0x7c0 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff888e250d>] __ll_inode_revalidate_it+0x5bd/0x650 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff886f34f4>] ldlm_lock_add_to_lru+0x74/0xe0 [ptlrpc]
Sep 13 21:11:39 wn122 kernel: [<ffffffff88919431>] ll_convert_intent+0xb1/0x170 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff8891d4c7>] ll_lookup_nd+0x207/0x400 [lustre]
Sep 13 21:11:39 wn122 kernel: [<ffffffff8002229b>] d_alloc+0x174/0x1a9
Sep 13 21:11:39 wn122 kernel: [<ffffffff8000cc40>] do_lookup+0xe5/0x1e6
Sep 13 21:11:39 wn122 kernel: [<ffffffff80009fce>] __link_path_walk+0xa01/0xf42
Sep 13 21:11:39 wn122 kernel: [<ffffffff8000e803>] link_path_walk+0x5c/0xe5
Sep 13 21:11:39 wn122 kernel: [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
Sep 13 21:11:39 wn122 kernel: [<ffffffff800f4fa9>] compat_sys_getdents+0xaf/0xbd
Sep 13 21:11:39 wn122 kernel: [<ffffffff8000c9df>] do_path_lookup+0x270/0x2e8
Sep 13 21:11:39 wn122 kernel: [<ffffffff800123f7>] getname+0x15b/0x1c1
Sep 13 21:11:39 wn122 kernel: [<ffffffff80023318>] __user_walk_fd+0x37/0x4c
Sep 13 21:11:39 wn122 kernel: [<ffffffff800320b8>] sys_faccessat+0xe4/0x18d
Sep 13 21:11:39 wn122 kernel: [<ffffffff80034cd4>] vfs_readdir+0x94/0xa9
Sep 13 21:11:39 wn122 kernel: [<ffffffff800f4fa9>] compat_sys_getdents+0xaf/0xbd
Sep 13 21:11:39 wn122 kernel: [<ffffffff8006149b>] sysenter_do_call+0x1b/0x67
Sep 13 21:11:39 wn122 kernel: [<ffffffff8002c205>] dummy_inode_permission+0x0/0x3
Sep 13 21:11:39 wn122 kernel:
Sep 13 21:11:39 wn122 kernel: LustreError: dumping log to /tmp/lustre-log.1284405099.27016
Sep 13 21:11:44 wn122 dhclient: DHCPREQUEST on eth0 to 148.187.67.113 port 67
Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ptlrpcd:31817]
Sep 13 21:11:49 wn122 kernel: CPU 3:
Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U) ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Sep 13 21:11:49 wn122 kernel: Pid: 31817, comm: ptlrpcd Tainted: G 2.6.18-128.7.1.el5 #1
Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb7>] [<ffffffff80064cb7>] .text.lock.spinlock+0x5/0x30
Sep 13 21:11:49 wn122 kernel: RSP: 0018:ffff8101ec177cb8 EFLAGS: 00000282
Sep 13 21:11:49 wn122 kernel: RAX: 000000000000004f RBX: 0000000000000000 RCX: 0000000000000000
Sep 13 21:11:49 wn122 kernel: RDX: ffff81035956b480 RSI: ffff810253c2d400 RDI: ffff810552ccb500
Sep 13 21:11:49 wn122 kernel: RBP: ffff810192294000 R08: 5a5a5a5a5a5a5a5a R09: 5a5a5a5a5a5a5a5a
Sep 13 21:11:49 wn122 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 0000000000000038
Sep 13 21:11:49 wn122 kernel: R13: ffff81045b0150c0 R14: ffff81067fc57000 R15: ffffffff886f5168
Sep 13 21:11:49 wn122 kernel: FS: 00002b5af649d240(0000) GS:ffff81010c4c8e40(0000) knlGS:0000000000000000
Sep 13 21:11:49 wn122 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 13 21:11:49 wn122 kernel: CR2: 0000000008183094 CR3: 0000000000201000 CR4: 00000000000006e0
Sep 13 21:11:49 wn122 kernel:
Sep 13 21:11:49 wn122 kernel: Call Trace:
Sep 13 21:11:49 wn122 kernel: [<ffffffff889249dc>] :lustre:ll_statahead_interpret+0xfc/0x5b0
Sep 13 21:11:49 wn122 kernel: [<ffffffff88861779>] :mdc:mdc_intent_getattr_async_interpret+0x459/0x490
Sep 13 21:11:49 wn122 kernel: [<ffffffff88729246>] :ptlrpc:ptlrpc_check_set+0x1186/0x1440
Sep 13 21:11:49 wn122 kernel: [<ffffffff8004ac1b>] try_to_del_timer_sync+0x51/0x5a
Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d05d>] :ptlrpc:ptlrpcd_check+0xdd/0x1f0
Sep 13 21:11:49 wn122 kernel: [<ffffffff80095004>] process_timeout+0x0/0x5
Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d9a1>] :ptlrpc:ptlrpcd+0x1b1/0x259
Sep 13 21:11:49 wn122 kernel: [<ffffffff8008a4b4>] default_wake_function+0x0/0xe
Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 13 21:11:49 wn122 kernel: [<ffffffff8875d7f0>] :ptlrpc:ptlrpcd+0x0/0x259
Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Sep 13 21:11:49 wn122 kernel:
Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_sa_27016:28593]
Sep 13 21:11:49 wn122 kernel: CPU 0:
Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U) ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Sep 13 21:11:49 wn122 kernel: Pid: 28593, comm: ll_sa_27016 Tainted: G 2.6.18-128.7.1.el5 #1
Sep 13 21:11:49 wn122 kernel: RIP: 0010:[<ffffffff80064cb4>] [<ffffffff80064cb4>] .text.lock.spinlock+0x2/0x30
Sep 13 21:11:49 wn122 kernel: RSP: 0000:ffff810251edfcf8 EFLAGS: 00000282
Sep 13 21:11:49 wn122 kernel: RAX: 0000000000000001 RBX: ffff8101cf193e80 RCX: 0000000000000000
Sep 13 21:11:49 wn122 kernel: RDX: 0000000000000012 RSI: 000000000abe79b0 RDI: ffff810552ccb500
Sep 13 21:11:49 wn122 kernel: RBP: 0000000000000286 R08: 0000000351edfde0 R09: 0000000000000000
Sep 13 21:11:49 wn122 kernel: R10: ffff810311d9a400 R11: 0000000000000248 R12: ffff81023f907a00
Sep 13 21:11:49 wn122 kernel: R13: ffffffff88726dba R14: ffff810162543bc0 R15: 0000000000000000
Sep 13 21:11:49 wn122 kernel: FS: 0000000000000000(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
Sep 13 21:11:49 wn122 kernel: CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Sep 13 21:11:49 wn122 kernel: CR2: 000000000c0f900c CR3: 000000066b863000 CR4: 00000000000006e0
Sep 13 21:11:49 wn122 kernel:
Sep 13 21:11:49 wn122 kernel: Call Trace:
Sep 13 21:11:49 wn122 kernel: [<ffffffff88923787>] :lustre:ll_sai_entry_to_stated+0x87/0x330
Sep 13 21:11:49 wn122 kernel: [<ffffffff8000d0de>] dput+0x2c/0x114
Sep 13 21:11:49 wn122 kernel: [<ffffffff88923f8b>] :lustre:do_statahead_interpret+0x55b/0x5c0
Sep 13 21:11:49 wn122 kernel: [<ffffffff88925641>] :lustre:ll_statahead_thread+0x7b1/0x1750
Sep 13 21:11:49 wn122 kernel: [<ffffffff8008a4b4>] default_wake_function+0x0/0xe
Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 13 21:11:49 wn122 kernel: [<ffffffff88924e90>] :lustre:ll_statahead_thread+0x0/0x1750
Sep 13 21:11:49 wn122 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

After we had 3 clients go into this state the first day, and 2 the next, we decided to downgrade the clients back to 1.8.3, while leaving the servers at 1.8.4. This morning, we had 2 more clients get stuck again, so my assumption is that there is a problem with the 1.8.4 servers.

These are RedHat 5.4 servers and clients, with the 2.6.18-194.3.1.el5_lustre.1.8.4 kernel.

Any help on this issue would be great.

Thanks,

Jason Temple
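[Editor's note: the "dumping log to /tmp/lustre-log.1284405099.27016" line above refers to the binary Lustre debug dump written when the LBUG fired. As a sketch (assuming the stock `lctl` utility shipped with the 1.8 client tools, run as root on the affected node), that dump can be converted to readable text before filing a bug:]

```shell
# Convert the binary Lustre debug dump written at LBUG time into text,
# then pull out the statahead-related events near the failure.
# Paths come from the log above; adjust to the file on your node.
lctl debug_file /tmp/lustre-log.1284405099.27016 /tmp/lustre-log.27016.txt
grep -n statahead /tmp/lustre-log.27016.txt | tail -20
```

The decoded text log is usually what developers ask for alongside the console trace.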
Fan Yong
2010-Sep-14 09:55 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/14/10 5:21 PM, Temple Jason wrote:
> Hello,
>
> I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4.
> The first day we brought our system online with the new version, we
> started seeing clients getting stuck in this soft lockup loop.
> [...]
>
> Sep 13 21:11:39 wn122 kernel: LustreError:
> 27016:0:(statahead.c:289:ll_sai_entry_fini())
> ASSERTION(sa_is_stopped(sai)) failed
>
The soft lockup is caused by this assertion: when the ASSERTION triggers, the thread is still holding a spin_lock, which leaves the other threads busy-waiting on that lock.

> [full call traces, register dumps, and module lists trimmed; see the original message above]
>
> After we had 3 clients go into this state the first day, and 2 the
> next, we decided to downgrade the clients back to 1.8.3, while leaving
> the servers at 1.8.4. This morning, we had 2 more clients get stuck
> again, so my assumption is that there is a problem with the 1.8.4 servers.
>
This assertion looks related to statahead, and that feature only affects client-side behavior; it involves nothing on the server. On the other hand, we never saw this on lustre-1.8.3 or earlier, and as far as I know there were almost no statahead changes between lustre-1.8.3 and lustre-1.8.4, so I am not sure whether your assumption is right.

Anyway, to keep your system running, you can first disable statahead on the client side and see what happens after that. The client-side proc interface "statahead_max" is for that.

Cheers,
Nasf

> These are RedHat 5.4 servers and clients, with the
> 2.6.18-194.3.1.el5_lustre.1.8.4 kernel.
>
> Any help on this issue would be great.
>
> Thanks,
>
> Jason Temple
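[Editor's note: as a sketch of Fan Yong's workaround (command forms from 1.8-era clients; the exact proc path can differ between minor versions), statahead is disabled per client like this:]

```shell
# Disable Lustre statahead on a 1.8 client; takes effect immediately,
# no remount needed. Via lctl:
lctl set_param llite.*.statahead_max=0

# ...or by writing the proc files directly on older clients:
for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > "$f"; done

# Verify the setting on every mounted filesystem:
lctl get_param llite.*.statahead_max
```

Note this trades away the `ls -l`/readdir performance benefit that statahead provides, so it is a mitigation rather than a fix.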
Gabriele Paciucci
2010-Sep-14 12:55 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
I have the same problem; I set statahead_max to 0!

On 09/14/2010 11:55 AM, Fan Yong wrote:
> Anyway, to make your system run, you can disable statahead on
> client-side firstly, to check what will happen after that. The
> client-side proc interface "statahead_max" is for that.
>
> [remainder of the quoted reply and the original log trimmed; see the messages above]

--
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci
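[Editor's note: a quick way to confirm the workaround has actually stopped statahead activity on a client is to watch the per-mount statahead counters; the `statahead_stats` proc file name is from the 1.8-era llite proc tree and may vary between versions:]

```shell
# With statahead_max=0, these counters should stop increasing,
# and no new ll_sa_* kernel threads should appear.
cat /proc/fs/lustre/llite/*/statahead_stats
ps ax | grep '[l]l_sa_'
```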
Fan Yong
2010-Sep-17 15:47 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/14/10 8:55 PM, Gabriele Paciucci wrote:
> I have the same problem, I put the statahead_max to 0 !!!

In fact, I have made a patch for statahead-related issues (including
this one) against lustre-1.8, which is under inspection:
http://review.whamcloud.com/#change,2
If possible, you can try this patch.

Cheers,
--
Nasf

> On 09/14/2010 11:55 AM, Fan Yong wrote:
>> On 9/14/10 5:21 PM, Temple Jason wrote:
>>> Hello,
>>>
>>> I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4.
>>> The first day we brought our system online with the new version, we
>>> started seeing clients getting stuck in this soft lockup loop. The
>>> load shoots up over 120, and eventually the node becomes unusable
>>> and requires a hard reset. I've seen loops like this on the server
>>> side in previous lustre versions, but to have it happen on the
>>> client is completely new. Here is a bit of what I see in the logs:
>>>
>>> Sep 13 21:11:39 wn122 kernel: LustreError:
>>> 27016:0:(statahead.c:289:ll_sai_entry_fini())
>>> ASSERTION(sa_is_stopped(sai)) failed
>>>
>> The soft lockup is caused by this ASSERTION: when the ASSERTION is
>> triggered, the thread holds a spin_lock, which leaves other threads
>> busy-waiting on that spin_lock.
>>> [...]
>>>
>>> After we had 3 clients go into this state the first day, and 2 the
>>> next, we decided to downgrade the clients back to 1.8.3, while
>>> leaving the servers at 1.8.4. This morning, we had 2 more clients
>>> get stuck again, so my assumption is that there is a problem with
>>> the 1.8.4 servers.
>>>
>> This ASSERTION looks related to statahead; that feature only affects
>> client-side behavior, nothing on the server. On the other hand, we
>> never saw it on lustre-1.8.3 or earlier versions, and as far as I
>> know there were almost no statahead changes between lustre-1.8.3 and
>> lustre-1.8.4, so I am not sure your assumption is right.
>>
>> Anyway, to keep your system running, you can first disable statahead
>> on the client side and check what happens after that. The client-side
>> proc interface "statahead_max" is for that.
>>
>> Cheers,
>> Nasf
>>> These are RedHat 5.4 servers and clients, with the
>>> 2.6.18-194.3.1.el5_lustre.1.8.4 kernel.
>>>
>>> Any help on this issue would be great.
>>>
>>> Thanks,
>>>
>>> Jason Temple
Peter Jones
2010-Sep-17 17:18 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
The URL does not work for me, but if it is a statahead issue then surely
turning statahead off would be a simple workaround to avoid having to
apply a patch.

Fan Yong wrote:
> On 9/14/10 8:55 PM, Gabriele Paciucci wrote:
>
>> I have the same problem, I put the statahead_max to 0 !!!
>>
> In fact, I have made a patch for statahead-related issues (including
> this one) against lustre-1.8, which is under inspection:
> http://review.whamcloud.com/#change,2
> If possible, you can try this patch.
>
> Cheers,
> --
> Nasf
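The workaround under discussion amounts to writing 0 to each client's
`statahead_max` proc file. A minimal sketch, assuming the standard Lustre
1.8 client proc layout under /proc/fs/lustre/llite (the helper name and
the overridable root path are illustrative, not from the thread):

```shell
#!/bin/sh
# Sketch: disable client-side statahead by setting statahead_max to 0.
# disable_statahead is a hypothetical helper; the default proc path is
# the usual Lustre 1.8 client layout - adjust if your tree differs.
disable_statahead() {
    root=${1:-/proc/fs/lustre/llite}
    for f in "$root"/*/statahead_max; do
        [ -e "$f" ] || continue
        echo 0 > "$f"    # 0 turns statahead off; a positive value re-enables it
        echo "disabled statahead via $f"
    done
}

# On a real client one would run:
#   disable_statahead
# or equivalently, via lctl:
#   lctl set_param llite.*.statahead_max=0
```

The loop form and the `lctl set_param` form should be equivalent; the
latter is the more common administrative idiom.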
Fan Yong
2010-Sep-17 17:24 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/18/10 1:18 AM, Peter Jones wrote:
> The URL does not work for me, but if it is a statahead issue then
> surely turning statahead off would be a simple workaround to avoid
> having to apply a patch.

Strange, it is a public link; I can access it without logging in.
Anyway, after inspection, I will attach it to Lustre Bugzilla for
further review/verification.

Cheers,
Nasf

> Fan Yong wrote:
>> On 9/14/10 8:55 PM, Gabriele Paciucci wrote:
>>> I have the same problem, I put the statahead_max to 0 !!!
>> In fact, I have made a patch for statahead-related issues (including
>> this one) against lustre-1.8, which is under inspection:
>> http://review.whamcloud.com/#change,2
>> If possible, you can try this patch.
>>
>> Cheers,
>> --
>> Nasf
Jeff Becker
2010-Sep-17 17:28 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 09/17/10 10:24, Fan Yong wrote:
> On 9/18/10 1:18 AM, Peter Jones wrote:
>> The URL does not work for me, but if it is a statahead issue then
>> surely turning statahead off would be a simple workaround to avoid
>> having to apply a patch.
> Strange, it is a public link; I can access it without logging in.
> Anyway, after inspection, I will attach it to Lustre Bugzilla for
> further review/verification.

FWIW - the link worked for me. Thanks.

-jeff
Robert Read
2010-Sep-18 04:42 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
Hi Peter,

Perhaps the link got mangled by your mail client? (It does have some
seemingly unusual characters for a URL.) My interpretation of Gabriele's
reply is that the problem occurred even with statahead disabled, so in
that case this patch might be worth trying.

robert

On Sep 17, 2010, at 10:18 , Peter Jones wrote:
> The URL does not work for me, but if it is a statahead issue then
> surely turning statahead off would be a simple workaround to avoid
> having to apply a patch.
paciucci at gmail.com
2010-Sep-18 06:12 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
No, I have disabled the statahead cache to avoid the problem.

-----Original Message-----
From: "Robert Read" <rread at whamcloud.com>
Date: Sat Sep 18 04:42:18 GMT 2010
To: "Peter Jones" <peter.x.jones at oracle.com>
CC: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

Hi Peter,

Perhaps the link got mangled by your mail client? (It does have some
seemingly unusual characters for a URL.) My interpretation of Gabriele's
reply is that the problem occurred even with statahead disabled, so in
that case this patch might be worth trying.

robert

On Sep 17, 2010, at 10:18 , Peter Jones wrote:
> The URL does not work for me, but if it is a statahead issue then
> surely turning statahead off would be a simple workaround to avoid
> having to apply a patch.
Temple Jason
2010-Sep-20 09:36 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
It appears that turning off statahead does indeed avoid the soft lockup
bug. But this seems to me to be a workaround, and not a solution.

Is statahead not useful for performance gains? I am not comfortable
making my users' jobs waste more cpu time because I have to implement a
workaround instead of a fix.

Is there one in the works? Nasf - does your patch solve the bug, or does
it just disable statahead by default?

Thanks,

Jason

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of paciucci at gmail.com
Sent: Saturday, 18 September 2010 08:13
To: rread at whamcloud.com; peter.x.jones at oracle.com
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

No, I have disabled the statahead cache to avoid the problem.
Fan Yong
2010-Sep-20 09:49 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/20/10 5:36 PM, Temple Jason wrote:
> It appears that turning off statahead does indeed avoid the soft
> lockup bug. But this seems to me to be a workaround, and not a
> solution.
>
> Is statahead not useful for performance gains? I am not comfortable
> making my users' jobs waste more cpu time because I have to implement
> a workaround instead of a fix.
>
> Is there one in the works? Nasf - does your patch solve the bug, or
> does it just disable statahead by default?

Statahead is used to optimize the performance of traversing large
directories in Lustre. My patch does not disable statahead; it tries to
fix the soft lockup issues, and it also contains some other fixes that
improve system stability and performance a bit. I think the patch is
worth trying.

Cheers,
Nasf
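Since the thread turns on whether statahead is actually on or off on a
given client, a small read-back sketch may help. It assumes the standard
Lustre 1.8 client proc layout under /proc/fs/lustre/llite; the helper
name and the overridable root path are illustrative, not from the thread:

```shell
#!/bin/sh
# Sketch: report the current statahead_max value for each mounted
# Lustre filesystem on this client. show_statahead is a hypothetical
# helper; the default path is the usual 1.8 client proc layout.
show_statahead() {
    root=${1:-/proc/fs/lustre/llite}
    for d in "$root"/*; do
        [ -e "$d/statahead_max" ] || continue
        printf '%s: statahead_max=%s\n' "$d" "$(cat "$d/statahead_max")"
    done
}

# On a real client:
#   show_statahead
# or equivalently:
#   lctl get_param llite.*.statahead_max
```

A value of 0 means the workaround is in effect; a positive value means
statahead is active.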
Gabriele Paciucci
2010-Sep-20 12:23 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
No, sorry for my English. Since I disabled statahead, the problem
doesn't occur anymore.

On 09/18/2010 06:42 AM, Robert Read wrote:
> Hi Peter,
>
> Perhaps the link got mangled by your mail client? (It does have some
> seemingly unusual characters for a URL.) My interpretation of
> Gabriele's reply is that the problem occurred even with statahead
> disabled, so in that case this patch might be worth trying.
>
> robert

--
_Gabriele Paciucci_
http://www.linkedin.com/in/paciucci
Peter Jones
2010-Sep-20 13:53 UTC
[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
Yes, turning off statahead is a workaround. Of course it is a trade-off
and ultimately for you to decide. However, the statahead feature
(introduced in 1.6.2, I think) has been very problematic, and many
previous attempts at fixing issues have caused subtle regressions. Most
sites seem to be able to run statahead ok with the latest 1.8.x
releases, so I think that we are just down to edge cases. We will of
course still continue to remove the edge cases in future releases, and
perhaps Fan Yong's patch might land and be tested out as part of a
release, but, as things stand, this is still just a proposed fix. If you
do decide to try the patch, please post your experiences to the bz
ticket - thanks!

Temple Jason wrote:
> It appears that turning off statahead does indeed avoid the soft
> lockup bug. But this seems to me to be a workaround, and not a
> solution.
>
> Is statahead not useful for performance gains? I am not comfortable
> making my users' jobs waste more cpu time because I have to implement
> a workaround instead of a fix.
>
> Is there one in the works? Nasf - does your patch solve the bug, or
> does it just disable statahead by default?