Greetings,

The below console output is from a 1.8.4 OST (RHEL5.5, 2.6.18-194.3.1.el5_lustre.1.8.4, x86_64). Not saying it is a Lustre bug for sure. Just wondering if anyone has seen this or something very similar. Updating to 1.8.6 WC variant isn't an option at this time.

If anyone has some insight into this I'd appreciate the feedback.

Thanks,

--Jeff

BUG: soft lockup - CPU#6 stuck for 10s! [kswapd0:409]
CPU 6:
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ib_iser(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ib_srp(U) rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ib_sa(U) mptsas(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) joydev(U) shpchp(U) sg(U) mlx4_core(U) e1000e(U) serio_raw(U) pcspkr(U) i2c_i801(U) i2c_core(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) mptspi(U) scsi_transport_spi(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 409, comm: kswapd0 Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1
RIP: 0010:[<ffffffff801011bf>] [<ffffffff801011bf>] dqput+0x105/0x19f
RSP: 0018:ffff8101be805cd0 EFLAGS: 00000202
RAX: ffff81012e03f000 RBX: 0000000000000000 RCX: ffff81012e03f000
RDX: ffffffffffffffe2 RSI: 0000000000000002 RDI: ffff81012f4f01c0
RBP: ffff81007fb4c918 R08: ffff810000018b00 R09: ffff81007fb4c918
R10: ffff8101be805c60 R11: ffffffff8b6448f0 R12: ffff8101be805c60
R13: ffffffff8b6448f0 R14: 00000000ffffffe2 R15: ffffffff8b6448f0
FS: 0000000000000000(0000) GS:ffff8101bfc2adc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000402000 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
[<ffffffff8010182b>] dquot_drop+0x30/0x5e
[<ffffffff8b647e83>] :ldiskfs:ldiskfs_dquot_drop+0x43/0x70
[<ffffffff80022d99>] clear_inode+0xb4/0x123
[<ffffffff80034e52>] dispose_list+0x41/0xe0
[<ffffffff8002d6a7>] shrink_icache_memory+0x1b7/0x1e6
[<ffffffff8003f466>] shrink_slab+0xdc/0x153
[<ffffffff80057e59>] kswapd+0x343/0x46c
[<ffffffff800a0ab2>] autoremove_wake_function+0x0/0x2e
[<ffffffff80057b16>] kswapd+0x0/0x46c
[<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032890>] kthread+0xfe/0x132
[<ffffffff8009d728>] request_module+0x0/0x14d
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032792>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

--
------------------------------
Jeff Johnson
Manager
Aeon Computing
jeff.johnson "at" aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101 f: 858-412-3845
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

On 08/10/2011 01:40 AM, Jeff Johnson wrote:
> Greetings,
>
> The below console output is from a 1.8.4 OST (RHEL5.5,
> 2.6.18-194.3.1.el5_lustre.1.8.4, x86_64). Not saying it is a Lustre bug
> for sure. Just wondering if anyone has seen this or something very
> similar. Updating to 1.8.6 WC variant isn't an option at this time.

It was stuck in a kernel swap thread for more than 10 seconds. Possibly a race condition on the disk.

>
> If anyone has some insight into this I'd appreciate the feedback.
>
> Thanks,
>
> --Jeff
>
> BUG: soft lockup - CPU#6 stuck for 10s! [kswapd0:409]

More to the point, it shouldn't be swapping. What is

    sysctl -a | grep swappiness

and

    cat /proc/meminfo | grep -i swap

?

Likely you have some process with a memory leak, and you need to flush cache/swap every now and then to make sure it doesn't fill up.

> CPU 6:

> RIP: 0010:[<ffffffff801011bf>] [<ffffffff801011bf>] dqput+0x105/0x19f

This is a quota put. It has some nice spin locks in there, and there could be some allocations in some of the function calls. I haven't checked.
http://lxr.free-electrons.com/source/fs/quota/dquot.c?a=microblaze#L718

> RSP: 0018:ffff8101be805cd0 EFLAGS: 00000202
> RAX: ffff81012e03f000 RBX: 0000000000000000 RCX: ffff81012e03f000
> RDX: ffffffffffffffe2 RSI: 0000000000000002 RDI: ffff81012f4f01c0
> RBP: ffff81007fb4c918 R08: ffff810000018b00 R09: ffff81007fb4c918
> R10: ffff8101be805c60 R11: ffffffff8b6448f0 R12: ffff8101be805c60
> R13: ffffffff8b6448f0 R14: 00000000ffffffe2 R15: ffffffff8b6448f0
> FS: 0000000000000000(0000) GS:ffff8101bfc2adc0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000402000 CR3: 0000000000201000 CR4: 00000000000006e0
>
> Call Trace:
> [<ffffffff8010182b>] dquot_drop+0x30/0x5e
> [<ffffffff8b647e83>] :ldiskfs:ldiskfs_dquot_drop+0x43/0x70
> [<ffffffff80022d99>] clear_inode+0xb4/0x123
> [<ffffffff80034e52>] dispose_list+0x41/0xe0
> [<ffffffff8002d6a7>] shrink_icache_memory+0x1b7/0x1e6
> [<ffffffff8003f466>] shrink_slab+0xdc/0x153
> [<ffffffff80057e59>] kswapd+0x343/0x46c
> [<ffffffff800a0ab2>] autoremove_wake_function+0x0/0x2e
> [<ffffffff80057b16>] kswapd+0x0/0x46c
> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> [<ffffffff80032890>] kthread+0xfe/0x132
> [<ffffffff8009d728>] request_module+0x0/0x14d
> [<ffffffff8005dfb1>] child_rip+0xa/0x11
> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> [<ffffffff80032792>] kthread+0x0/0x132
> [<ffffffff8005dfa7>] child_rip+0x0/0x11

There are a couple of bugs in RHEL that this could be similar to.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
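
For anyone who wants to collect the two values asked about in the reply (vm.swappiness and the Swap* lines of /proc/meminfo) from several OSS nodes in a consistent form, here is a minimal Python sketch. It only reads the standard Linux /proc files; the function names (parse_swap_fields, swap_used_kb, read_swappiness) are made up for illustration and are not part of Lustre or any existing tool. It reports swap usage only and does not diagnose the soft lockup itself.

```python
import re
from pathlib import Path


def parse_swap_fields(meminfo_text):
    """Return {field_name: kB} for the Swap* lines of /proc/meminfo-style text.

    Roughly equivalent to `grep -i swap /proc/meminfo`, but with the
    values converted to integers (in kB).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        # Lines look like "SwapTotal:      2096472 kB"
        match = re.match(r"^(Swap\w+):\s+(\d+)\s*kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def swap_used_kb(fields):
    """Swap currently in use, in kB, from a parse_swap_fields() result."""
    return fields["SwapTotal"] - fields["SwapFree"]


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read vm.swappiness (the value `sysctl vm.swappiness` reports)."""
    return int(Path(path).read_text().strip())
```

On a Linux node this could be used as parse_swap_fields(Path("/proc/meminfo").read_text()) together with read_swappiness(); a nonzero swap_used_kb() would suggest the node has swapped at some point since boot.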