Hi again!

We're able to consistently kill the Lustre client with bonnie++ in
combination with striping. This is Lustre 1.6.4.1, Debian 2.6.18 amd64
kernel with Lustre patches on both servers and clients (i.e. not the
patchless client, even though we're pretty sure that it's the same bug
that bites us using the Ubuntu 2.6.15 kernel and the patchless client).

All machines are dual Opterons connected with GigE.

We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSs with
2 OST targets (~1.2TB) each.

We're able to consistently cause a Lustre client lock-up by doing the
following:

cd /into-lustre-filesystem
mkdir striped
lfs setstripe striped 0 -1 -1
cd striped
mkdir host1 host2 host3 host4 host5
for i in host1 host2 host3 host4 host5; do
    rsh $i "cd $PWD; bonnie++ -d $i -n 60:0:0:30 > res.$i 2>&1" &
done

After 10-15 minutes it locks up with the following stack trace:

=======
Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!
Jan 25 11:16:23
Jan 25 11:16:23 Call Trace:
Jan 25 11:16:23 <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 11:16:23 [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 11:16:23 [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 11:16:23 [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 11:16:23 [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 11:16:23 <EOI> [<ffffffff804187e3>] __lock_text_start+0x3/0x10
Jan 25 11:16:23 [<ffffffff8851d97c>] :ptlrpc:ptlrpc_check_set+0x6bc/0xb70
Jan 25 11:16:23 [<ffffffff88518f0a>] :ptlrpc:__ptlrpc_free_req+0x67a/0x6e0
Jan 25 11:16:23 [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 11:16:23 [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 11:16:23 [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 11:16:23 [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 11:16:23 [<ffffffff8028da91>] filp_close+0x71/0x90
Jan 25 11:16:23 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 11:16:23 [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 11:16:23 [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 11:16:23 [<ffffffff8020ac42>] child_rip+0x0/0x12
=======

Then:

mkdir striped-4ways
lfs setstripe striped-4ways 0 -1 4

and repeat the above test. After 10-15 minutes it locks up, this time
with a bunch of LustreErrors before the stack trace:

=======
Jan 25 13:30:40 LustreError: 5748:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1201264136, 103s ago) req@ffff8100e3317e00 x1219785/t0 o6->hpfs-OST0004_UUID@130.239.78.239@tcp:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
Jan 25 13:30:40 Lustre: hpfs-OST0004-osc-ffff8100ecad4000: Connection to service hpfs-OST0004 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 25 13:30:54 BUG: soft lockup detected on CPU#1!
Jan 25 13:30:54
Jan 25 13:30:54 Call Trace:
Jan 25 13:30:54 <IRQ> [<ffffffff80263eec>] softlockup_tick+0xfc/0x120
Jan 25 13:30:54 [<ffffffff8023f207>] update_process_times+0x57/0x90
Jan 25 13:30:54 [<ffffffff8021a423>] smp_local_timer_interrupt+0x23/0x50
Jan 25 13:30:54 [<ffffffff8021ad31>] smp_apic_timer_interrupt+0x41/0x50
Jan 25 13:30:54 [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
Jan 25 13:30:54 <EOI> [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff80418a69>] .text.lock.spinlock+0x0/0x97
Jan 25 13:30:54 [<ffffffff884385be>] :lnet:LNetMEAttach+0x24e/0x330
Jan 25 13:30:54 [<ffffffff88524771>] :ptlrpc:ptl_send_rpc+0x711/0xf20
Jan 25 13:30:54 [<ffffffff8851c727>] :ptlrpc:ptlrpc_unregister_reply+0x107/0x2f0
Jan 25 13:30:54 [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff88529ae7>] :ptlrpc:lustre_msg_add_flags+0x47/0x120
Jan 25 13:30:54 [<ffffffff8851d923>] :ptlrpc:ptlrpc_check_set+0x663/0xb70
Jan 25 13:30:54 [<ffffffff885447ea>] :ptlrpc:ptlrpc_fail_import+0x9a/0x220
Jan 25 13:30:54 [<ffffffff8852869f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
Jan 25 13:30:54 [<ffffffff8852dc70>] :ptlrpc:reply_in_callback+0x0/0x2b0
Jan 25 13:30:54 [<ffffffff8854804c>] :ptlrpc:ptlrpcd_check+0x17c/0x2a0
Jan 25 13:30:54 [<ffffffff8023ef50>] process_timeout+0x0/0x10
Jan 25 13:30:54 [<ffffffff8024b91c>] add_wait_queue+0x1c/0x60
Jan 25 13:30:54 [<ffffffff885487ad>] :ptlrpc:ptlrpcd+0xed/0x272
Jan 25 13:30:54 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54 [<ffffffff8022f490>] default_wake_function+0x0/0x10
Jan 25 13:30:54 [<ffffffff8020ac4c>] child_rip+0xa/0x12
Jan 25 13:30:54 [<ffffffff885486c0>] :ptlrpc:ptlrpcd+0x0/0x272
Jan 25 13:30:54 [<ffffffff8020ac42>] child_rip+0x0/0x12
=======

Note that the two stack traces are somewhat different.

If run in a non-striped directory it doesn't lock up.
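For reference, the old-style setstripe arguments above are <stripe size>
<starting OST index> <stripe count>, so "0 -1 -1" means default size, any
starting OST, striped over all OSTs. A quick sanity check that the striping
actually applied (assuming the usual 1.6 lfs syntax) is:

lfs getstripe striped
lfs getstripe striped-4ways

which should show the stripe counts we set on the two directories
(-1, i.e. all OSTs, and 4 respectively).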
/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 "Jake, honey, when did we become Republicans?" - Celeste Kane
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Hi Niklas,

On Friday 25 January 2008 07:10:47 am Niklas Edmundsson wrote:
> We're able to consistently kill the Lustre client with bonnie++ in
> combination with striping.

Out of curiosity, I tried to reproduce your experiment, and didn't
encounter any problem. All the bonnie++ processes ran fine.

There are a lot of significant differences between our test
environments, but I thought it might be useful to know the results of
your test case on a different system.

> This is Lustre 1.6.4.1, Debian 2.6.18 amd64 kernel with Lustre
> patches on both servers and clients

I used Lustre 1.6.4.1 on RHEL4 with the
2.6.9-55.0.9.EL_lustre.1.6.4.1smp x86_64 (amd64) kernel.

> All machines are dual Opterons connected with GigE.

They are Intel quad-cores (E5345) connected with IB.

> We have 5 servers: 1 MDS with 1 MGS and 1 MDT target, and 4 OSSs
> with 2 OST targets (~1.2TB) each.

We have 9 servers: 1 MDS with MGS and MDT, and 8 OSSs with 2 OSTs each.

> Jan 25 11:16:23 BUG: soft lockup detected on CPU#1!

> After 10-15 minutes it locks up, this time with a bunch of
> LustreErrors before the stack trace:

Those look like a network interruption problem, but it's hard to tell
whether that's the cause or the consequence. Could it be that your
Ethernet switches dropped some packets?
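If you want to rule that out, the NIC drop counters on the clients and
OSSs might be worth a look. Something like this (the interface name is
just an example, and the exact counter names depend on the driver):

ethtool -S eth0 | grep -iE 'drop|err'
netstat -s | grep -i retrans

plus the error/drop counters on the switch ports themselves, if you can
get at them.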
Cheers,
-- 
Kilian

Hi,

that's interesting for me. Can you just try what happens if you delete
a large directory (lots of files, a couple of GB total space) from
this client?

I have a test cluster running 1.6.4.1 on a vanilla 2.6.18.8 kernel.
The clients are patchless; servers and clients have been rock stable
for weeks. But I have only one dual Opteron machine (the others are
mostly Athlons and a couple of Pentiums), connected with GigE, which
is a rock solid machine if I don't mount Lustre. If I mount Lustre on
this machine it crashes all the time. The last crash happened directly
after I tried to delete a large directory from this client.

Up to now I thought I must have done something wrong with the
installation of this client, because it behaves completely differently
from the others, but maybe I am wrong?
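Something like this would create the kind of directory I mean (path,
file count and file size are only an example; adjust so the total is a
couple of GB):

mkdir /mnt/lustre/deltest
cd /mnt/lustre/deltest
for i in $(seq 1 2000); do
    dd if=/dev/zero of=file.$i bs=1M count=2 2>/dev/null
done
cd /
rm -rf /mnt/lustre/deltest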
Harald

On Friday 25 January 2008 04:10 pm, Niklas Edmundsson wrote:
> Hi again!
>
> We're able to consistently kill the Lustre client with bonnie++ in
> combination with striping.
> [...]

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik
der Universitaet Bonn
On Fri, 25 Jan 2008, Kilian CAVALOTTI wrote:

> Hi Niklas,
>
> Out of curiosity, I tried to reproduce your experiment, and didn't
> encounter any problem. All the bonnie++ processes ran fine.

Interesting...

> There are a lot of significant differences between our test
> environments, but I thought it might be useful to know the results
> of your test case on a different system.
>
> I used Lustre 1.6.4.1 on RHEL4 with the
> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp x86_64 (amd64) kernel.
>
> They are Intel quad-cores (E5345) connected with IB.

Not identical environments, but it still suggests that there's
something funky with the 2.6.18 support...

> Those look like a network interruption problem, but it's hard to
> tell whether that's the cause or the consequence. Could it be that
> your Ethernet switches dropped some packets?

Given that it's TCP, packet drops shouldn't affect things in that way,
IMHO. My guess is that something is writing outside its buffer and
killing some random part of the kernel; that's usually when we see
these kinds of problems... Usually pure hell to debug :/

/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 "Run out of small children to butcher?" -- G'Kar
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
On Fri, 25 Jan 2008, Harald van Pee wrote:

> that's interesting for me. Can you just try what happens if you
> delete a large directory (lots of files, a couple of GB total space)
> from this client?

Works, as long as we only have one client doing the rm on the
directory. If we do the rm concurrently from multiple clients, the MDS
bugs out.

> I have a test cluster running 1.6.4.1 on a vanilla 2.6.18.8 kernel.
> [...]
> Up to now I thought I must have done something wrong with the
> installation of this client, because it behaves completely
> differently from the others, but maybe I am wrong?

It might be the same bug, or not... IMHO it's an indication of a
buffer overrun that happens more often on a 64-bit box due to the
increased storage needed for pointers and so on... But with only one
machine crashing it's hard to rule out other issues.
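For the record, the concurrent case is essentially the same rsh loop
as in my first mail, pointed at rm instead of bonnie++ (the directory
name here is hypothetical; all clients remove the same tree at once):

for i in host1 host2 host3 host4 host5; do
    rsh $i "rm -rf /into-lustre-filesystem/striped/bigdir" &
done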
/Nikke

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se   |   nikke@hpc2n.umu.se
---------------------------------------------------------------------------
 Riker: If it becomes necessary to fight, can someone find @N@
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=