I just had two nodes hang with the following soft lockup messages. I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client (1.6.5.1). My nodes do not have swap configured on them (no local disks). We do have a tool that looks for out-of-memory conditions, and neither of the nodes in question reported a problem (not that it is perfect). Does the problem look like an issue with Lustre?

Thanks,
Craig

Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
Oct 22 08:06:45 h53 kernel: CPU 2:
Oct 22 08:06:45 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:45 h53 kernel: Pid: 418, comm: kswapd0 Tainted: PF 2.6.18-92.1.13.el5 #1
Oct 22 08:06:45 h53 kernel: RIP: 0010:[<ffffffff80064be4>] [<ffffffff80064be4>] .text.lock.spinlock+0x2/0x30
Oct 22 08:06:45 h53 kernel: RSP: 0000:ffff81042ea099b8 EFLAGS: 00000286
Oct 22 08:06:45 h53 kernel: RAX: ffff810423adeee8 RBX: ffff81035d272200 RCX: ffff8101c866eb28
Oct 22 08:06:45 h53 kernel: RDX: 0000000000000000 RSI: ffff810282e8eb10 RDI: ffff81035d272388
Oct 22 08:06:45 h53 kernel: RBP: 0000000000000086 R08: ffff81000001b600 R09: ffff81010af7be18
Oct 22 08:06:45 h53 kernel: R10: ffff810107b41a60 R11: ffffffff888117d0 R12: 0000000000000086
Oct 22 08:06:45 h53 kernel: R13: ffff81010d4dce88 R14: ffff81010eb8f0c0 R15: 0000000000000086
Oct 22 08:06:45 h53 kernel: FS: 0000000000000000(0000) GS:ffff81010eba1ec0(0000) knlGS:0000000000000000
Oct 22 08:06:45 h53 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 22 08:06:45 h53 kernel: CR2: 00000000371b1000 CR3: 000000014772f000 CR4: 00000000000006e0
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:45 h53 kernel: Call Trace:
Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
Oct 22 08:06:45 h53 kernel: [<ffffffff800d4589>] slab_destroy+0x80/0xa8
Oct 22 08:06:45 h53 kernel: [<ffffffff800d46cd>] free_block+0x11c/0x13b
Oct 22 08:06:45 h53 kernel: [<ffffffff8877c8e0>] :lov:lov_teardown_async_page+0x320/0x4e0
Oct 22 08:06:45 h53 kernel: [<ffffffff801453a1>] radix_tree_delete+0x150/0x187
Oct 22 08:06:45 h53 kernel: [<ffffffff887f7ba1>] :lustre:__ll_put_llap+0x3e1/0x610
Oct 22 08:06:45 h53 kernel: [<ffffffff80023601>] __pagevec_free+0x21/0x2e
Oct 22 08:06:45 h53 kernel: [<ffffffff887f830d>] :lustre:ll_removepage+0x1dd/0x240
Oct 22 08:06:45 h53 kernel: [<ffffffff800c087c>] __remove_from_page_cache+0x1b/0x3a
Oct 22 08:06:45 h53 kernel: [<ffffffff888117e0>] :lustre:ll_releasepage+0x10/0x20
Oct 22 08:06:45 h53 kernel: [<ffffffff800c4fb1>] shrink_inactive_list+0x4e1/0x7f9
Oct 22 08:06:45 h53 kernel: [<ffffffff800479c3>] __pagevec_release+0x19/0x22
Oct 22 08:06:45 h53 kernel: [<ffffffff800c49af>] shrink_active_list+0x416/0x426
Oct 22 08:06:45 h53 kernel: [<ffffffff80012c22>] shrink_zone+0xf6/0x11c
Oct 22 08:06:45 h53 kernel: [<ffffffff8005792d>] kswapd+0x337/0x45e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff800575f6>] kswapd+0x0/0x45e
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff80032625>] kthread+0xfe/0x132
Oct 22 08:06:45 h53 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 22 08:06:45 h53 kernel: [<ffffffff8009e0e8>] keventd_create_kthread+0x0/0xc4
Oct 22 08:06:45 h53 kernel: [<ffffffff80032527>] kthread+0x0/0x132
Oct 22 08:06:45 h53 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 22 08:06:45 h53 kernel:
Oct 22 08:06:46 h53 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [convert_emiss.e:27258]
Oct 22 08:06:46 h53 kernel: CPU 4:
Oct 22 08:06:46 h53 kernel: Modules linked in: zxfs(U) rs_iscsi(PFU) dm_mod(U) sd_mod(U) rs_cfg(PU) autofs4(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) nfsd(U) exportfs(U) auth_rpcgss(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) sr_mod(U) cdrom(U) sg(U) ib_mthca(U) pcspkr(U) usb_storage(U) ata_piix(U) ib_mad(U) libata(U) ib_core(U) i2c_i801(U) i2c_core(U) shpchp(U) scsi_mod(U) serio_raw(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) e1000(U) e1000e(U)
Oct 22 08:06:46 h53 kernel: Pid: 27258, comm: convert_emiss.e Tainted: PF 2.6.18-92.1.13.el5 #1
Oct 22 08:06:46 h53 kernel: RIP: 0010:[<ffffffff80064be7>] [<ffffffff80064be7>] .text.lock.spinlock+0x5/0x30
Oct 22 08:06:46 h53 kernel: RSP: 0018:ffff8102ba4c3770 EFLAGS: 00000286
Oct 22 08:06:46 h53 kernel: RAX: ffff810423adeec0 RBX: ffff8102e617aad0 RCX: 0000000000000000
Oct 22 08:06:46 h53 kernel: RDX: 0000000000000000 RSI: 0000000004df5578 RDI: ffff810423adeec0
Oct 22 08:06:46 h53 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
Oct 22 08:06:46 h53 kernel: R10: ffff810164619cb0 R11: 00000000000000d0 R12: 0000000000000000
Oct 22 08:06:46 h53 kernel: R13: 0000000000000000 R14: ffff810423c36468 R15: ffff81042ebc4000
Oct 22 08:06:46 h53 kernel: FS: 00002b8ce6c66630(0000) GS:ffff81042fc23d40(0000) knlGS:0000000000000000
Oct 22 08:06:46 h53 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 08:06:46 h53 kernel: CR2: 0000000001e46604 CR3: 0000000307228000 CR4: 00000000000006e0
Oct 22 08:06:46 h53 kernel:
Oct 22 08:06:46 h53 kernel: Call Trace:
Oct 22 08:06:46 h53 kernel: [<ffffffff887066d5>] :osc:osc_queue_async_io+0x7c5/0x1020
Oct 22 08:06:46 h53 kernel: [<ffffffff8024394c>] udp_push_pending_frames+0x236/0x25b
Oct 22 08:06:46 h53 kernel: [<ffffffff80064b11>] _spin_lock_bh+0x9/0x14
Oct 22 08:06:46 h53 kernel: [<ffffffff80030b4b>] release_sock+0x13/0xaa
Oct 22 08:06:46 h53 kernel: [<ffffffff8005215c>] udp_sendmsg+0x47f/0x566
Oct 22 08:06:46 h53 kernel: [<ffffffff887fcb44>] :lustre:ll_prepare_write+0x484/0x1360
Oct 22 08:06:46 h53 kernel: [<ffffffff888117d0>] :lustre:ll_releasepage+0x0/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff88628574>] :ptlrpc:ldlm_resource_get+0x1b4/0xa70
Oct 22 08:06:46 h53 kernel: [<ffffffff8877daf3>] :lov:lov_queue_async_io+0x353/0x450
Oct 22 08:06:46 h53 kernel: [<ffffffff886263c6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 22 08:06:46 h53 kernel: [<ffffffff887f3eb0>] :lustre:queue_or_sync_write+0x310/0x1270
Oct 22 08:06:46 h53 kernel: [<ffffffff8002e341>] __wake_up+0x38/0x4f
Oct 22 08:06:46 h53 kernel: [<ffffffff887faf3c>] :lustre:ll_commit_write+0x2dc/0x6e0
Oct 22 08:06:46 h53 kernel: [<ffffffff88797d0f>] :lov:lov_finish_set+0x67f/0x6e0
Oct 22 08:06:46 h53 kernel: [<ffffffff8879b455>] :lov:lov_set_add_req+0x15/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff8000fe3c>] generic_file_buffered_write+0x4dc/0x6d3
Oct 22 08:06:46 h53 kernel: [<ffffffff887e4fb9>] :lustre:ll_inode_size_unlock+0x89/0xe0
Oct 22 08:06:46 h53 kernel: [<ffffffff8000df96>] current_fs_time+0x3b/0x40
Oct 22 08:06:46 h53 kernel: [<ffffffff88798533>] :lov:lov_fini_cancel_set+0x1e3/0x260
Oct 22 08:06:46 h53 kernel: [<ffffffff80016088>] __generic_file_aio_write_nolock+0x36c/0x3b8
Oct 22 08:06:46 h53 kernel: [<ffffffff8863cf60>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
Oct 22 08:06:46 h53 kernel: [<ffffffff887d9ce0>] :lustre:ll_glimpse_callback+0x0/0x440
Oct 22 08:06:46 h53 kernel: [<ffffffff800c0727>] __generic_file_write_nolock+0x8f/0xa8
Oct 22 08:06:46 h53 kernel: [<ffffffff887cf509>] :lustre:ll_extent_unlock+0x419/0x520
Oct 22 08:06:46 h53 kernel: [<ffffffff8009e30e>] autoremove_wake_function+0x0/0x2e
Oct 22 08:06:46 h53 kernel: [<ffffffff887d054a>] :lustre:ll_file_get_tree_lock_iov+0x44a/0x5e0
Oct 22 08:06:46 h53 kernel: [<ffffffff8011fb2f>] avc_has_perm+0x43/0x55
Oct 22 08:06:46 h53 kernel: [<ffffffff80063ae4>] mutex_lock+0xd/0x1d
Oct 22 08:06:46 h53 kernel: [<ffffffff800c0788>] generic_file_writev+0x48/0xa2
Oct 22 08:06:46 h53 kernel: [<ffffffff887d129f>] :lustre:ll_file_writev+0xbbf/0xe20
Oct 22 08:06:46 h53 kernel: [<ffffffff887d151a>] :lustre:ll_file_write+0x1a/0x20
Oct 22 08:06:46 h53 kernel: [<ffffffff8001649a>] vfs_write+0xce/0x174
Oct 22 08:06:46 h53 kernel: [<ffffffff80016d67>] sys_write+0x45/0x6e
Oct 22 08:06:46 h53 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83

--
Craig Tierney (craig.tierney at noaa.gov)
On Oct 22, 2008 14:37 -0600, Craig Tierney wrote:
> I just had two nodes hang with the following soft lockup messages.
> I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client
> (1.6.5.1). My nodes do not have swap configured on them (no local
> disks). We do have a tool that looks for out-of-memory conditions,
> and neither of the nodes in question reported a problem (not that it
> is perfect).

Note that soft lockups are only a warning. It shouldn't mean that the node is completely dead, only that some thread was hogging the CPU.

> Does the problem look like an issue with Lustre?

Lots of Lustre functions on the stack...

> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
> Oct 22 08:06:45 h53 kernel: Call Trace:
> Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
> Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0

Do you have particularly large files in use (e.g. in the realm of 1TB or more)? It seems possible that having a very large number of pages to clean up could cause a report like this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
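A quick way to act on Andreas's question is to scan the client mount point for anything at or above 1 TB. The following is a minimal sketch, assuming a hypothetical mount point of /mnt/lustre; it walks the whole tree and stat()s every file, so expect it to take a while on a large filesystem. Depending on the Lustre version, "lfs find" may also offer a size filter, which would avoid the manual walk.

#!/usr/bin/env python
# Minimal sketch: walk a Lustre client mount point and report files at or
# above a size threshold (1 TB here).  The mount point below is a
# hypothetical example; adjust it for your site.
import os
import sys

MOUNT_POINT = "/mnt/lustre"   # hypothetical client mount point
THRESHOLD = 1 << 40           # 1 TB in bytes

def find_large_files(root, threshold):
    """Yield (path, size) for every regular file at or above 'threshold'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue      # file vanished or permission denied
            if size >= threshold:
                yield path, size

if __name__ == "__main__":
    root = MOUNT_POINT
    if len(sys.argv) > 1:
        root = sys.argv[1]
    for path, size in find_large_files(root, THRESHOLD):
        print("%8.1f GB  %s" % (size / 2.0 ** 30, path))

Run as "python find_large.py /path/to/mount"; anything it prints is a candidate for the kind of file Andreas is asking about.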
Andreas Dilger wrote:
> On Oct 22, 2008 14:37 -0600, Craig Tierney wrote:
>> I just had two nodes hang with the following soft lockup messages.
>> I am running CentOS 5.2 (2.6.18-92.1.13.el5) with the patchless client
>> (1.6.5.1). My nodes do not have swap configured on them (no local
>> disks). We do have a tool that looks for out-of-memory conditions,
>> and neither of the nodes in question reported a problem (not that it
>> is perfect).
>
> Note that soft lockups are only a warning. It shouldn't mean that the
> node is completely dead, only that some thread was hogging the CPU.

The two soft lockup messages (one in kswapd0 and the other in the user process convert_emiss) repeated for 6 hours before I rebooted the node. I don't recall whether I could still log in to the node.

>> Does the problem look like an issue with Lustre?
>
> Lots of Lustre functions on the stack...
>
>> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
>> Oct 22 08:06:45 h53 kernel: Call Trace:
>> Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
>> Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
>
> Do you have particularly large files in use (e.g. in the realm of 1TB or
> more)? It seems possible that having a very large number of pages to
> clean up could cause a report like this.

My first guess would be no, we don't create files that large. But it is entirely possible a user did something wrong with this code that caused some large files (append vs. create). I will check it out.

Thanks,
Craig

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

--
Craig Tierney (craig.tierney at noaa.gov)
On Oct 27, 2008 10:16 -0600, Craig Tierney wrote:
> Andreas Dilger wrote:
>> Note that soft lockups are only a warning. It shouldn't mean that the
>> node is completely dead, only that some thread was hogging the CPU.
>
> The two soft lockup messages (one in kswapd0 and the other in the user
> process convert_emiss) repeated for 6 hours before I rebooted the node.
> I don't recall whether I could still log in to the node.

Ah, then the spewing of the "warning" messages is likely what made the node unusable :-(. Console messages are printed with all interrupts disabled, and that can be a problem in cases like this. Unfortunately, this printing happens outside of the Lustre code, so we can't fix it without patching the kernel.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
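Until a patched kernel is in place, one mitigation worth considering is to keep such warnings off the console entirely, so the node spends less time in printk with interrupts disabled. The sketch below is a minimal example (assuming it is run as root on the client): it lowers console_loglevel, the first field of /proc/sys/kernel/printk, so the messages still reach the kernel log buffer and syslog but are no longer written synchronously to the console. Whether this would have kept these particular nodes responsive is not certain.

#!/usr/bin/env python
# Minimal sketch: lower the kernel's console log level so soft lockup
# reports (and other warnings) stay in the log buffer / syslog but are
# not printed synchronously to the console.  Assumes it is run as root.
# This does not fix the underlying lockup; it only limits the time the
# kernel spends printing to the console with interrupts off.

PRINTK = "/proc/sys/kernel/printk"
CONSOLE_LOGLEVEL = 3  # with level 3, only emergency/alert/critical
                      # messages still reach the console

def read_printk():
    f = open(PRINTK)
    try:
        # fields: console_loglevel, default_loglevel, minimum, boot default
        return f.read().strip()
    finally:
        f.close()

print("kernel.printk before: %s" % read_printk())

f = open(PRINTK, "w")
try:
    f.write("%d\n" % CONSOLE_LOGLEVEL)  # a single value sets console_loglevel
finally:
    f.close()

print("kernel.printk after:  %s" % read_printk())

The same effect can be had from the shell with "dmesg -n 3" or by echoing a value into /proc/sys/kernel/printk.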
Andreas Dilger wrote:
> On Oct 27, 2008 10:16 -0600, Craig Tierney wrote:
>> Andreas Dilger wrote:
>>> Note that soft lockups are only a warning. It shouldn't mean that the
>>> node is completely dead, only that some thread was hogging the CPU.
>>
>> The two soft lockup messages (one in kswapd0 and the other in the user
>> process convert_emiss) repeated for 6 hours before I rebooted the node.
>> I don't recall whether I could still log in to the node.
>
> Ah, then the spewing of the "warning" messages is likely what made the
> node unusable :-(. Console messages are printed with all interrupts
> disabled, and that can be a problem in cases like this. Unfortunately,
> this printing happens outside of the Lustre code, so we can't fix it
> without patching the kernel.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Thanks for your responses. This last one is very informative, and I can look into patching the kernel if the problem recurs.

Thanks,
Craig

--
Craig Tierney (craig.tierney at noaa.gov)