I just had two nodes hang with the following soft lockup messages. I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client (1.6.5.1). My nodes do not have swap configured on them (no local disks). We do have a tool that looks for out-of-memory conditions, and neither of the nodes in question reported a problem (not that it is perfect). Does the problem look like an issue with Lustre?

Thanks,
Craig

Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
Oct 22 08:06:45 h53 kernel: CPU 2:
Oct 22 08:06:45 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:45 h53 kernel: Pid: 418, comm: kswapd0 Tainted: PF 2.6.18-92.1.13.el5 #1
Oct 22 08:06:45 h53 kernel: RIP: 0010:[<ffffffff80064be4>] [<ffffffff80064be4>] .text.lock.spinlock+0x2/0x30
Oct 22 08:06:45 h53 kernel: RSP: 0000:ffff81042ea099b8 EFLAGS: 00000286
Oct 22 08:06:45 h53 kernel: RAX: ffff810423adeee8 RBX: ffff81035d272200 RCX: ffff8101c866eb28
Oct 22 08:06:45 h53 kernel: RDX: 0000000000000000 RSI: ffff810282e8eb10 RDI: ffff81035d272388
Oct 22 08:06:45 h53 kernel: RBP: 0000000000000086 R08: ffff81000001b600 R09: ffff81010af7be18
Oct 22 08:06:45 h53 kernel: R10: ffff810107b41a60 R11: ffffffff888117d0 R12: 0000000000000086
Oct 22 08:06:45 h53 kernel: R13: ffff81010d4dce88 R14: ffff81010eb8f0c0 R15: 0000000000000086
Oct 22 08:06:45 h53 kernel: FS: 0000000000000000(0000) GS:ffff81010eba1ec0(0000) knlGS:0000000000000000
Oct 22 08:06:45 h53 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 22 08:06:45 h53 kernel: CR2: 00000000371b1000 CR3: 000000014772f000 CR4: 00000000000006e0
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:45 h53 kernel: Call Trace:
Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
Oct 22 08:06:45 h53 kernel: [<ffffffff800d4589>] slab_destroy+0x80/0xa8
Oct 22 08:06:45 h53 kernel: [<ffffffff800d46cd>] free_block+0x11c/0x13b
Oct 22 08:06:45 h53 kernel: [<ffffffff8877c8e0>] :lov:lov_teardown_async_page+0x320/0x4e0
Oct 22 08:06:45 h53 kernel: [<ffffffff801453a1>] radix_tree_delete+0x150/0x187
Oct 22 08:06:45 h53 kernel: [<ffffffff887f7ba1>] :lustre:__ll_put_llap+0x3e1/0x610
Oct 22 08:06:45 h53 kernel: [<ffffffff80023601>] __pagevec_free+0x21/0x2e
Oct 22 08:06:45 h53 kernel: [<ffffffff887f830d>] :lustre:ll_removepage+0x1dd/0x240
Oct 22 08:06:45 h53 kernel: [<ffffffff800c087c>] __remove_from_page_cache+0x1b/0x3a
Oct 22 08:06:45 h53 kernel: [<ffffffff888117e0>] :lustre:ll_releasepage+0x10/0x20
Oct 22 08:06:45 h53 kernel: [<ffffffff800c4fb1>] shrink_inactive_list+0x4e1/0x7f9
Oct 22 08:06:45 h53 kernel: [<ffffffff800479c3>] __pagevec_release+0x19/0x22
Oct 22 08:06:45 h53 kernel: [<ffffffff800c49af>] shrink_active_list+0x416/0x426
Oct 22 08:06:45 h53 kernel: [<ffffffff80012c22>] shrink_zone+0xf6/0x11c
Oct 22 08:06:45 h53 kernel: [<ffffffff8005792d>] kswapd+0x337/0x45e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff800575f6>] kswapd+0x0/0x45e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff80032625>] kthread+0xfe/0x132
Oct 22 08:06:45 h53 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff80032527>] kthread+0x0/0x132
Oct 22 08:06:45 h53 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:46 h53 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [convert_emiss.e:27258]
Oct 22 08:06:46 h53 kernel: CPU 4:
Oct 22 08:06:46 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:46 h53 kernel: Pid: 27258, comm: convert_emiss.e Tainted: PF 2.6.18-92.1.13.el5 #1
Oct 22 08:06:46 h53 kernel: RIP: 0010:[<ffffffff80064be7>] [<ffffffff80064be7>] .text.lock.spinlock+0x5/0x30
Oct 22 08:06:46 h53 kernel: RSP: 0018:ffff8102ba4c3770 EFLAGS: 00000286
Oct 22 08:06:46 h53 kernel: RAX: ffff810423adeec0 RBX: ffff8102e617aad0 RCX: 0000000000000000
Oct 22 08:06:46 h53 kernel: RDX: 0000000000000000 RSI: 0000000004df5578 RDI: ffff810423adeec0
Oct 22 08:06:46 h53 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
Oct 22 08:06:46 h53 kernel: R10: ffff810164619cb0 R11: 00000000000000d0 R12: 0000000000000000
Oct 22 08:06:46 h53 kernel: R13: 0000000000000000 R14: ffff810423c36468 R15: ffff81042ebc4000
Oct 22 08:06:46 h53 kernel: FS: 00002b8ce6c66630(0000) GS:ffff81042fc23d40(0000) knlGS:0000000000000000
Oct 22 08:06:46 h53 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 08:06:46 h53 kernel: CR2: 0000000001e46604 CR3: 0000000307228000 CR4: 00000000000006e0
Oct 22 08:06:46 h53 kernel:
Oct 22 08:06:46 h53 kernel: Call Trace:
Oct 22 08:06:46 h53 kernel: [<ffffffff887066d5>] :osc:osc_queue_async_io+0x7c5/0x1020
Oct 22 08:06:46 h53 kernel: [<ffffffff8024394c>] udp_push_pending_frames+0x236/0x25b
Oct 22 08:06:46 h53 kernel: [<ffffffff80064b11>] _spin_lock_bh+0x9/0x14
Oct 22 08:06:46 h53 kernel: [<ffffffff80030b4b>] release_sock+0x13/0xaa
Oct 22 08:06:46 h53 kernel: [<ffffffff8005215c>] udp_sendmsg+0x47f/0x566
Oct 22 08:06:46 h53 kernel: [<ffffffff887fcb44>] :lustre:ll_prepare_write+0x484/0x1360
Oct 22 08:06:46 h53 kernel: [<ffffffff888117d0>] :lustre:ll_releasepage+0x0/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff88628574>] :ptlrpc:ldlm_resource_get+0x1b4/0xa70
Oct 22 08:06:46 h53 kernel: [<ffffffff8877daf3>] :lov:lov_queue_async_io+0x353/0x450
Oct 22 08:06:46 h53 kernel: [<ffffffff886263c6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 22 08:06:46 h53 kernel: [<ffffffff887f3eb0>] :lustre:queue_or_sync_write+0x310/0x1270
Oct 22 08:06:46 h53 kernel: [<ffffffff8002e341>] __wake_up+0x38/0x4f
Oct 22 08:06:46 h53 kernel: [<ffffffff887faf3c>] :lustre:ll_commit_write+0x2dc/0x6e0
Oct 22 08:06:46 h53 kernel: [<ffffffff88797d0f>] :lov:lov_finish_set+0x67f/0x6e0
Oct 22 08:06:46 h53 kernel: [<ffffffff8879b455>] :lov:lov_set_add_req+0x15/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff8000fe3c>] generic_file_buffered_write+0x4dc/0x6d3
Oct 22 08:06:46 h53 kernel: [<ffffffff887e4fb9>] :lustre:ll_inode_size_unlock+0x89/0xe0
Oct 22 08:06:46 h53 kernel: [<ffffffff8000df96>] current_fs_time+0x3b/0x40
Oct 22 08:06:46 h53 kernel: [<ffffffff88798533>] :lov:lov_fini_cancel_set+0x1e3/0x260
Oct 22 08:06:46 h53 kernel: [<ffffffff80016088>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 22 08:06:46 h53 kernel: [<ffffffff8863cf60>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
Oct 22 08:06:46 h53 kernel: [<ffffffff887d9ce0>] :lustre:ll_glimpse_callback+0x0/0x440
Oct 22 08:06:46 h53 kernel: [<ffffffff800c0727>] __generic_file_write_nolock+0x8f/0xa8
Oct 22 08:06:46 h53 kernel: [<ffffffff887cf509>] :lustre:ll_extent_unlock+0x419/0x520
Oct 22 08:06:46 h53 kernel: [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:46 h53 kernel: [<ffffffff887d054a>] :lustre:ll_file_get_tree_lock_iov+0x44a/0x5e0
Oct 22 08:06:46 h53 kernel: [<ffffffff8011fb2f>] avc_has_perm+0x43/0x55
Oct 22 08:06:46 h53 kernel: [<ffffffff80063ae4>] mutex_lock+0xd/0x1d
Oct 22 08:06:46 h53 kernel: [<ffffffff800c0788>] generic_file_writev+0x48/0xa2
Oct 22 08:06:46 h53 kernel: [<ffffffff887d129f>] :lustre:ll_file_writev+0xbbf/0xe20
Oct 22 08:06:46 h53 kernel: [<ffffffff887d151a>] :lustre:ll_file_write+0x1a/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff8001649a>] vfs_write+0xce/0x174
Oct 22 08:06:46 h53 kernel: [<ffffffff80016d67>] sys_write+0x45/0x6e
Oct 22 08:06:46 h53 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83

--
Craig Tierney (craig.tierney at noaa.gov)
On Oct 22, 2008 14:37 -0600, Craig Tierney wrote:
> I just had two nodes hang with the following soft lockup messages.
> I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client
> (1.6.5.1). My nodes do not have swap configured on them (no local
> disks). We do have a tool that looks for out-of-memory conditions,
> and neither of the nodes in question reported a problem (not that it
> is perfect).

Note that soft lockups are only a warning. It shouldn't mean that the node is completely dead, only that some thread was hogging the CPU.

> Does the problem look like an issue with Lustre?

Lots of Lustre functions on the stack...

> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
> Oct 22 08:06:45 h53 kernel: Call Trace:
> Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
> Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0

Do you have particularly large files in use (e.g. in the realm of 1TB or more)? It seems possible that having a very large number of pages to clean up could cause a report like this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
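A quick way to act on Andreas's question is to scan the client mount point for anything at or above 1 TB. The following is a minimal sketch, assuming a hypothetical mount point of /mnt/lustre; it walks the whole tree and stat()s every file, so expect it to take a while on a large filesystem. Depending on the Lustre version, "lfs find" may also offer a size filter, which would avoid the manual walk.

#!/usr/bin/env python
# Minimal sketch: walk a Lustre client mount point and report files at or
# above a size threshold (1 TB here).  The mount point below is a
# hypothetical example; adjust it for your site.
import os
import sys

MOUNT_POINT = "/mnt/lustre"   # hypothetical client mount point
THRESHOLD = 1 << 40           # 1 TB in bytes

def find_large_files(root, threshold):
    """Yield (path, size) for every regular file at or above 'threshold'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue      # file vanished or permission denied
            if size >= threshold:
                yield path, size

if __name__ == "__main__":
    root = MOUNT_POINT
    if len(sys.argv) > 1:
        root = sys.argv[1]
    for path, size in find_large_files(root, THRESHOLD):
        print("%8.1f GB  %s" % (size / 2.0 ** 30, path))

Run as "python find_large.py /path/to/mount"; anything it prints is a candidate for the kind of file Andreas is asking about.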
Andreas Dilger wrote:
> On Oct 22, 2008 14:37 -0600, Craig Tierney wrote:
>> I just had two nodes hang with the following soft lockup messages.
>> I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client
>> (1.6.5.1). My nodes do not have swap configured on them (no local
>> disks). We do have a tool that looks for out-of-memory conditions,
>> and neither of the nodes in question reported a problem (not that it
>> is perfect).
>
> Note that soft lockups are only a warning. It shouldn't mean that the
> node is completely dead, only that some thread was hogging the CPU.

The two soft lockup messages (one in kswapd0 and the other in the user process convert_emiss) repeated for 6 hours before I rebooted the node. I don't recall whether I could still log in to the node.

>> Does the problem look like an issue with Lustre?
>
> Lots of Lustre functions on the stack...
>
>> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
>> Oct 22 08:06:45 h53 kernel: Call Trace:
>> Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
>> Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
>
> Do you have particularly large files in use (e.g. in the realm of 1TB or
> more)? It seems possible that having a very large number of pages to
> clean up could cause a report like this.

My first guess would be no, we don't create files that large. But it is entirely possible a user did something wrong with this code that caused some large files (append vs. create). I will check it out.

Thanks,
Craig

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

--
Craig Tierney (craig.tierney at noaa.gov)
On Oct 27, 2008 10:16 -0600, Craig Tierney wrote:
> Andreas Dilger wrote:
>> Note that soft lockups are only a warning. It shouldn't mean that the
>> node is completely dead, only that some thread was hogging the CPU.
>
> The two soft lockup messages (one in kswapd0 and the other in the user
> process convert_emiss) repeated for 6 hours before I rebooted the node.
> I don't recall whether I could still log in to the node.

Ah, then the spewing of the "warning" messages is likely what made the node unusable :-(. Console messages are printed with all interrupts disabled, and that can be a problem in cases like this. Unfortunately, this printing happens outside of the Lustre code, so we can't fix it without patching the kernel.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
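Until a patched kernel is in place, one mitigation worth considering is to keep such warnings off the console entirely, so the node spends less time in printk with interrupts disabled. The sketch below is a minimal example (assuming it is run as root on the client): it lowers console_loglevel, the first field of /proc/sys/kernel/printk, so the messages still reach the kernel log buffer and syslog but are no longer written synchronously to the console. Whether this would have kept these particular nodes responsive is not certain.

#!/usr/bin/env python
# Minimal sketch: lower the kernel's console log level so soft lockup
# reports (and other warnings) stay in the log buffer / syslog but are
# not printed synchronously to the console.  Assumes it is run as root.
# This does not fix the underlying lockup; it only limits the time the
# kernel spends printing to the console with interrupts off.

PRINTK = "/proc/sys/kernel/printk"
CONSOLE_LOGLEVEL = 3  # with level 3, only emergency/alert/critical
                      # messages still reach the console

def read_printk():
    f = open(PRINTK)
    try:
        # fields: console_loglevel, default_loglevel, minimum, boot default
        return f.read().strip()
    finally:
        f.close()

print("kernel.printk before: %s" % read_printk())

f = open(PRINTK, "w")
try:
    f.write("%d\n" % CONSOLE_LOGLEVEL)  # a single value sets console_loglevel
finally:
    f.close()

print("kernel.printk after:  %s" % read_printk())

The same effect can be had from the shell with "dmesg -n 3" or by echoing a value into /proc/sys/kernel/printk.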
Andreas Dilger wrote:
> On Oct 27, 2008 10:16 -0600, Craig Tierney wrote:
>> Andreas Dilger wrote:
>>> Note that soft lockups are only a warning. It shouldn't mean that the
>>> node is completely dead, only that some thread was hogging the CPU.
>>
>> The two soft lockup messages (one in kswapd0 and the other in the user
>> process convert_emiss) repeated for 6 hours before I rebooted the node.
>> I don't recall whether I could still log in to the node.
>
> Ah, then the spewing of the "warning" messages is likely what made the
> node unusable :-(. Console messages are printed with all interrupts
> disabled, and that can be a problem in cases like this. Unfortunately,
> this printing happens outside of the Lustre code, so we can't fix it
> without patching the kernel.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Thanks for your responses. This last one is very informative, and I can look into patching the kernel if the problem recurs.

Thanks,
Craig

--
Craig Tierney (craig.tierney at noaa.gov)