Hi again!

We're able to consistently kill the Lustre client with bonnie++ in
combination with striping. This is Lustre 1.6.4.1, Debian 2.6.18 amd64
kernel with Lustre patches on both servers and clients (i.e. not the
patchless client, even though we're pretty sure that it's the same bug
that bites us using the Ubuntu 2.6.15 kernel and the patchless client).

All machines are dual Opterons connected with GigE.

We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSs with
2 OST targets (~1.2TB) each.

We're able to consistently cause a Lustre client lock-up by doing the
following:

cd /into-lustre-filesystem
mkdir striped
lfs setstripe striped 0 -1 -1
cd striped
mkdir host1 host2 host3 host4 host5
for i in host1 host2 host3 host4 host5; do
    rsh $i "cd $PWD; bonnie++ -d $i -n 60:0:0:30 > res.$i 2>&1" &
done

After 10-15 minutes it locks up with the following stack trace:

=======
Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!
Jan 25 11:16:23
Jan 25 11:16:23 Call Trace:
Jan 25 11:16:23 <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 11:16:23 [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 11:16:23 [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 11:16:23 [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 11:16:23 [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 11:16:23 <EOI> [<ffffffff804187e3>] __lock_text_start+0x3/0x10
Jan 25 11:16:23 [<ffffffff8851d97c>] :ptlrpc:ptlrpc_check_set+0x6bc/0xb70
Jan 25 11:16:23 [<ffffffff88518f0a>] :ptlrpc:__ptlrpc_free_req+0x67a/0x6e0
Jan 25 11:16:23 [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 11:16:23 [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 11:16:23 [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 11:16:23 [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 11:16:23 [<ffffffff8028da91>] filp_close+0x71/0x90
Jan 25 11:16:23 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23 [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 11:16:23 [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 11:16:23 [<ffffffff8020ac42>] child_rip+0x0/0x12
=======

Then:

mkdir striped-4ways
lfs setstripe striped-4ways 0 -1 4

and repeat the above test. After 10-15 minutes it locks up, this time
with a bunch of LustreErrors before the stack trace:

=======
Jan 25 13:30:40 LustreError: 5748:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1201264136, 103s ago) req@ffff8100e3317e00 x1219785/t0 o6->hpfs-OST0004_UUID@130.239.78.239@tcp:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
Jan 25 13:30:40 Lustre: hpfs-OST0004-osc-ffff8100ecad4000: Connection to service hpfs-OST0004 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 25 13:30:54 BUG: soft lockup detected on CPU#1!
Jan 25 13:30:54
Jan 25 13:30:54 Call Trace:
Jan 25 13:30:54 <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 13:30:54 [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 13:30:54 [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 13:30:54 [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 13:30:54 [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 13:30:54 <EOI> [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff80418a69>] .text.lock.spinlock+0x0/0x97
Jan 25 13:30:54 [<ffffffff884385be>] :lnet:LNetMEAttach+0x24e/0x330
Jan 25 13:30:54 [<ffffffff88524771>] :ptlrpc:ptl_send_rpc+0x711/0xf20
Jan 25 13:30:54 [<ffffffff8851c727>] :ptlrpc:ptlrpc_unregister_reply+0x107/0x2f0
Jan 25 13:30:54 [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff88529ae7>] :ptlrpc:lustre_msg_add_flags+0x47/0x120
Jan 25 13:30:54 [<ffffffff8851d923>] :ptlrpc:ptlrpc_check_set+0x663/0xb70
Jan 25 13:30:54 [<ffffffff885447ea>] :ptlrpc:ptlrpc_fail_import+0x9a/0x220
Jan 25 13:30:54 [<ffffffff8852869f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
Jan 25 13:30:54 [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 13:30:54 [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 13:30:54 [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 13:30:54 [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 13:30:54 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54 [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 13:30:54 [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 13:30:54 [<ffffffff8020ac42>] child_rip+0x0/0x12
=======

Note that the two stack traces are somewhat different.

If run in a non-striped directory it doesn't lock up.
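For reference, the old-style setstripe arguments above are <stripe size>
<starting OST index> <stripe count>, so "0 -1 -1" means default size, any
starting OST, striped over all OSTs. A quick sanity check that the striping
actually applied (assuming the usual 1.6 lfs syntax) is:

lfs getstripe striped
lfs getstripe striped-4ways

which should show the stripe counts we set on the two directories
(-1, i.e. all OSTs, and 4 respectively).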
/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 "Jake, honey, when did we become Republicans?" - Celeste Kane
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Hi Niklas,

On Friday 25 January 2008 07:10:47 am Niklas Edmundsson wrote:
> We're able to consistently kill the Lustre client with bonnie++ in
> combination with striping.

Out of curiosity, I tried to reproduce your experiment, and didn't
encounter any problem. All the bonnie++ processes ran fine.

There are a lot of significant differences between our test
environments, but I thought it might be useful to know the results of
your test case on a different system.

> This is Lustre 1.6.4.1, Debian 2.6.18 amd64 kernel with Lustre
> patches on both servers and clients

I used Lustre 1.6.4.1 on RHEL4 with the
2.6.9-55.0.9.EL_lustre.1.6.4.1smp x86_64 (amd64) kernel.

> All machines are dual Opterons connected with GigE.

They are Intel quad-cores (E5345) connected with IB.

> We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSs
> with 2 OST targets (~1.2TB) each.

We have 9 servers: 1 MDS with MGS and MDT, and 8 OSSs with 2 OSTs each.

> Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!

> After 10-15 minutes it locks up, this time with a bunch of
> LustreErrors before the stack trace:

Those look like a network interruption problem, but it's hard to tell
whether that's the cause or the consequence. Could it be that your
Ethernet switches dropped some packets?
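If you want to rule that out, the NIC drop counters on the clients and
OSSs might be worth a look. Something like this (the interface name is
just an example, and the exact counter names depend on the driver):

ethtool -S eth0 | grep -iE 'drop|err'
netstat -s | grep -i retrans

plus the error/drop counters on the switch ports themselves, if you can
get at them.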
Cheers,
-- 
Kilian

Hi,

that's interesting for me. Can you just try what happens if you delete
a large directory (lots of files, a couple of GB total space) from
this client?

I have a test cluster running 1.6.4.1 on a vanilla 2.6.18.8 kernel.
The clients are patchless; servers and clients have been rock stable
for weeks. But I have only one dual Opteron machine (the others are
mostly Athlons and a couple of Pentiums), connected with GigE, which
is a rock solid machine if I don't mount Lustre. If I mount Lustre on
this machine it crashes all the time. The last crash happened directly
after I tried to delete a large directory from this client.

Up to now I thought I must have done something wrong with the
installation of this client, because it behaves completely differently
from the others, but maybe I am wrong?
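Something like this would create the kind of directory I mean (path,
file count and file size are only an example; adjust so the total is a
couple of GB):

mkdir /mnt/lustre/deltest
cd /mnt/lustre/deltest
for i in $(seq 1 2000); do
    dd if=/dev/zero of=file.$i bs=1M count=2 2>/dev/null
done
cd /
rm -rf /mnt/lustre/deltest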
Harald

On Friday 25 January 2008 04:10 pm, Niklas Edmundsson wrote:
> Hi again!
>
> We're able to consistently kill the Lustre client with bonnie++ in
> combination with striping.
> [...]

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik
der Universitaet Bonn
On Fri, 25 Jan 2008, Kilian CAVALOTTI wrote:

> Hi Niklas,
>
> Out of curiosity, I tried to reproduce your experiment, and didn't
> encounter any problem. All the bonnie++ processes ran fine.

Interesting...

> There are a lot of significant differences between our test
> environments, but I thought it might be useful to know the results
> of your test case on a different system.
>
> I used Lustre 1.6.4.1 on RHEL4 with the
> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp x86_64 (amd64) kernel.
>
> They are Intel quad-cores (E5345) connected with IB.

Not identical environments, but it still suggests that there's
something funky with the 2.6.18 support...

> Those look like a network interruption problem, but it's hard to
> tell whether that's the cause or the consequence. Could it be that
> your Ethernet switches dropped some packets?

Given that it's TCP, packet drops shouldn't affect things in that way,
IMHO. My guess is that something is writing outside its buffer and
killing some random part of the kernel; that's usually when we see
these kinds of problems... Usually pure hell to debug :/

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 "Run out of small children to butcher?" -- G'Kar
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
On Fri, 25 Jan 2008, Harald van Pee wrote:

> that's interesting for me. Can you just try what happens if you
> delete a large directory (lots of files, a couple of GB total space)
> from this client?

Works, as long as we only have one client doing the rm on the
directory. If we do the rm concurrently from multiple clients, the MDS
bugs out.

> I have a test cluster running 1.6.4.1 on a vanilla 2.6.18.8 kernel.
> [...]
> Up to now I thought I must have done something wrong with the
> installation of this client, because it behaves completely
> differently from the others, but maybe I am wrong?

It might be the same bug, or not... IMHO it's an indication of a
buffer overrun that happens more often on a 64-bit box due to the
increased storage needed for pointers and so on... But with only one
machine crashing it's hard to rule out other issues.
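For the record, the concurrent case is essentially the same rsh loop
as in my first mail, pointed at rm instead of bonnie++ (the directory
name here is hypothetical; all clients remove the same tree at once):

for i in host1 host2 host3 host4 host5; do
    rsh $i "rm -rf /into-lustre-filesystem/striped/bigdir" &
done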
/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 Riker: If it becomes necessary to fight, can someone find @N@
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=