Niklas Edmundsson
2007-Oct-09 12:15 UTC
[Lustre-discuss] soft lockup on Lustre 1.6.2 + Ubuntu 2.6.15 patchless
Hi all!

We managed to trigger a soft lockup on a Lustre client during stress testing. The clients run the Lustre 1.6.2 patchless client on the Ubuntu Dapper 2.6.15 kernel. As far as I understand, the Ubuntu 2.6.15 kernel includes the patch needed for the patchless client to work.

As usual, I didn't find anything similar in Bugzilla or by googling. Any hints on what's going wrong would be helpful.

This is the description I got of what was done:

Let 44 tasks (22 nodes with 2 tasks each) do:
  rsync -a master rankNN
When finished, on another node do:
  rm -rf rank*

That node immediately produced the following:

[696899.812091] BUG: soft lockup detected on CPU#0!
[696899.919769] CPU 0:
[696899.968179] Modules linked in: osc mgc lustre lov lquota mdc ksocklnd ptlrpc obdclass lnet lvfs libcfs nfs lockd sunrpc iptable_filter ip_tables openafs ipv6 autofs4 ext2 ext3 jbd md_mod ipmi_devintf ipmi_si ipmi_msghandler tg3 mx_driver mx_mcp hw_random psmouse i2c_amd756 shpchp pci_hotplug i2c_core pcspkr serio_raw evdev xfs exportfs dm_mod ide_generic ohci_hcd usbcore ide_cd cdrom ide_disk generic amd74xx thermal processor fan fbcon tileblit font bitblit softcursor capability commoncap
[696900.973641] Pid: 16696, comm: rm Tainted: P 2.6.15-29-amd64-server #1
[696901.133969] RIP: 0010:[<ffffffff801a8630>] <ffffffff801a8630>{__d_lookup+288}
[696901.298891] RSP: 0018:ffff8100a5f83c58 EFLAGS: 00000286
[696901.427423] RAX: ffff8100a7795250 RBX: 0000000000000000 RCX: 0000000000000014
[696901.595905] RDX: 000000000001d578 RSI: 00c399114511d578 RDI: ffff8100d1f0b238
[696901.763747] RBP: ffff81005c7f2e00 R08: 000000080585b155 R09: ffff81005b1ed000
[696901.931231] R10: 0000000000000050 R11: ffffffff801e3ac0 R12: 0000000188517366
[696902.098758] R13: 0000000000000010 R14: 0000000000000246 R15: 0000000000000283
[696902.266669] FS: 00002aaaaadfb6d0(0000) GS:ffffffff80453800(0000) knlGS:00000000556a9a90
[696902.456911] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[696902.593287] CR2: 0000000000511988 CR3: 00000000bc4c6000 CR4: 00000000000006e0
[696902.763134]
[696902.763135] Call Trace:<ffffffff88514035>{:ptlrpc:lock_res_and_lock+53} <ffffffff8019d3dc>{do_lookup+60}
[696903.017468] <ffffffff8019df66>{__link_path_walk+2518} <ffffffff88660787>{:lustre:ll_readdir+3191}
[696903.241819] <ffffffff8019e4f0>{link_path_walk+128} <ffffffff801a95a0>{update_atime+64}
[696903.441728] <ffffffff8019eaf8>{path_lookup+440} <ffffffff8019ec9e>{__user_walk+62}
[696903.632782] <ffffffff80197e06>{vfs_lstat+38} <ffffffff801a95a0>{update_atime+64}
[696903.823144] <ffffffff801984af>{sys_newlstat+31} <ffffffff801a254e>{vfs_readdir+174}
[696904.015504] <ffffffff801a2884>{sys_getdents64+180} <ffffffff8010fd82>{system_call+126}
[696904.217213]
[696904.873276] LustreError: 16698:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81015a2d7600 x1868515/t0 o101->MGS@130.239.78.233@tcp:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
[696905.274439] LustreError: 16698:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[696906.232276] LustreError: 4194:0:(client.c:961:ptlrpc_expire_one_request()) @@@ timeout (sent at 1191925375, 100s ago) req@ffff81018965b800 x1868505/t0 o250->MGS@130.239.78.233@tcp:26 lens 304/328 ref 2 fl Rpc:/0/0 rc 0/-22
[696906.694129] LustreError: 4194:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 7 previous similar messages

/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke at hpc2n.umu.se
---------------------------------------------------------------------------
KISS (:<) Keep It Simple Stupid
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=