This is the strangest problem I have seen. I have a Lustre filesystem mounted on a Linux server, and it is being exported over NFS to various Alpha systems. The Alphas mount it just fine; however, under heavy load the NFS server stops responding, as does the Lustre mount on the export server. The weird thing is that if I mount the NFS export on another server and run the same benchmark (bonnie), everything is fine. The Lustre mount on the export server can take a real pounding (I've seen it push 300 MB/sec), so I don't know why NFS is crashing it.

On the NFS export server I see these messages:

Lustre: 4224:0:(o2iblnd_cb.c:412:kiblnd_handle_rx()) PUT_NACK from 192.168.64.70@o2ib
LustreError: 4400:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197415542, 100s ago) req@ffff810827bfbc00 x38827/t0 o36->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 14256/672 ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress operations using this service will wait for recovery to complete.

A trace of the hung NFS daemons reveals the following:

Dec 11 18:46:33 cpu3 kernel: nfsd S ffff8108246ff008 0 4729 1 4730 4728 (L-TLB)
Dec 11 18:46:33 cpu3 kernel: ffff81082be0daa0 0000000000000046 ffff810824710740 000064b0886cfdc4
Dec 11 18:46:33 cpu3 kernel: 0000000000000009 ffff81082fc6f7e0 ffffffff802dcae0 000000814fbeae1f
Dec 11 18:46:33 cpu3 kernel: 0000000003d51554 ffff81082fc6f9c8 0000000000000000 ffff8108246ff000
Dec 11 18:46:33 cpu3 kernel: Call Trace:
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80061839>] schedule_timeout+0x8a/0xad
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80092b26>] process_timeout+0x0/0x5
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88700a3d>] :ptlrpc:ptlrpc_queue_wait+0xa9d/0x1250
Dec 11 18:46:33 cpu3 kernel: [<ffffffff886d67a1>] :ptlrpc:ldlm_resource_putref+0x331/0x3b0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8870a2c5>] :ptlrpc:lustre_msg_set_flags+0x45/0x120
Dec 11 18:46:33 cpu3 kernel: [<ffffffff800884f8>] default_wake_function+0x0/0xe
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a37d0>] :mdc:mdc_reint+0xc0/0x240
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a5c77>] :mdc:mdc_unlink_pack+0x117/0x140
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a4ab7>] :mdc:mdc_unlink+0x307/0x3d0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff801405f7>] __next_cpu+0x19/0x28
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80087090>] find_busiest_group+0x20d/0x621
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80009499>] __d_lookup+0xb0/0xff
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8886ced6>] :lustre:ll_unlink+0x1d6/0x370
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8883b791>] :lustre:ll_inode_permission+0xa1/0xc0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80047fc8>] vfs_unlink+0xc2/0x108
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857c57a>] :nfsd:nfsd_unlink+0x1de/0x24b
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88583e9a>] :nfsd:nfsd3_proc_remove+0xa8/0xb5
Dec 11 18:46:33 cpu3 kernel: [<ffffffff885791c4>] :nfsd:nfsd_dispatch+0xd7/0x198
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88488514>] :sunrpc:svc_process+0x44d/0x70b
Dec 11 18:46:33 cpu3 kernel: [<ffffffff800625bf>] __down_read+0x12/0x92
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff885796fb>] :nfsd:nfsd+0x1ae/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8005bfb1>] child_rip+0xa/0x11
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8005bfa7>] child_rip+0x0/0x11
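[For readers trying to reproduce this, a minimal sketch of the kind of setup described above, assuming the Lustre filesystem is mounted at /mnt/lustre on the export server and the Alpha clients sit on 192.168.64.0/24; the mount point, export options, hostnames and network are assumptions, not the exact configuration from the report. The last two commands show one common way to get per-task stack traces like the one quoted.]

    # /etc/exports on the Linux box that carries the Lustre client mount
    # (fsid= is needed on many kernels when exporting a non-block filesystem)
    /mnt/lustre  192.168.64.0/24(rw,sync,no_subtree_check,fsid=1)

    # re-read the export table on the NFS server
    exportfs -ra

    # on a client (Linux mount syntax shown; the Alphas may use different options)
    mount -t nfs -o vers=3,tcp exportserver:/mnt/lustre /mnt/data

    # drive load against the NFS mount (flags vary between bonnie and bonnie++)
    bonnie -d /mnt/data -s 8192

    # capture stack traces of hung nfsd threads: SysRq-T dumps every
    # task's stack into the kernel log
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 40 nfsd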
Hello!

On Dec 11, 2007, at 6:51 PM, Aaron S. Knister wrote:

> This is the strangest problem I have seen. I have a Lustre filesystem
> mounted on a Linux server, and it is being exported over NFS to various
> Alpha systems. The Alphas mount it just fine; however, under heavy load
> the NFS server stops responding, as does the Lustre mount on the
> export server. [...]
> Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service
> data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.

Any messages on the MDS at this time?

Bye,
    Oleg
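[For anyone following along, a couple of ways to answer that question on the MDS node itself; the output path below is a placeholder.]

    # on the MDS (192.168.64.70@o2ib in the messages above), check the kernel log
    dmesg | grep -i -e Lustre -e LustreError
    tail -200 /var/log/messages | grep -i lustre

    # dump the in-kernel Lustre debug buffer to a file for later inspection
    lctl dk /tmp/lustre-mds-debug.log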
Yes, it turns out it's bug 14379. I applied the provided patches and everything works fine now. Thanks for the follow-up!

-Aaron

On Dec 12, 2007, at 11:23 AM, Oleg Drokin wrote:

> Hello!
>
> On Dec 11, 2007, at 6:51 PM, Aaron S. Knister wrote:
>
>> This is the strangest problem I have seen. I have a Lustre filesystem
>> mounted on a Linux server, and it is being exported over NFS to various
>> Alpha systems. [...]
>
> Any messages on the MDS at this time?
>
> Bye,
>     Oleg

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water
(301) 595-7001
aaron@iges.org
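[In case someone else lands on this thread with the same symptoms: the general shape of applying the attachments from bug 14379 and rebuilding looks roughly like the following. The source directory, patch filename, and configure arguments are placeholders, and the right rebuild procedure depends on whether Lustre was installed from packages or from source.]

    # placeholders: adjust the source tree path and patch name to your setup
    cd /usr/src/lustre-1.6.x
    patch -p1 < ~/bug14379-attachment.patch

    # rebuild and reinstall the Lustre modules (sketch only; configure
    # arguments depend on the kernel you are building against)
    ./configure --with-linux=/usr/src/kernels/$(uname -r)-$(uname -m)
    make && make install

    # reload the modules and remount the filesystem
    # (assumes the MGS is colocated with the MDS at 192.168.64.70@o2ib)
    umount /mnt/lustre
    lustre_rmmod
    modprobe lustre
    mount -t lustre 192.168.64.70@o2ib:/data /mnt/lustre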