This is the strangest problem I have seen. I have a Lustre filesystem mounted on a Linux server, and it is being exported over NFS to various Alpha systems. The Alphas mount it just fine; however, under heavy load the NFS server stops responding, as does the Lustre mount on the export server. The weird thing is that if I mount the NFS export on another server and run the same benchmark (bonnie), everything is fine. The Lustre mount on the export server can take a real pounding (I've seen it push 300 MB/sec), so I don't know why NFS is crashing it.

On the NFS export server I see these messages:

Lustre: 4224:0:(o2iblnd_cb.c:412:kiblnd_handle_rx()) PUT_NACK from 192.168.64.70@o2ib
LustreError: 4400:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197415542, 100s ago) req@ffff810827bfbc00 x38827/t0 o36->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 14256/672 ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress operations using this service will wait for recovery to complete.

A trace of the hung NFS daemons reveals the following:

Dec 11 18:46:33 cpu3 kernel: nfsd S ffff8108246ff008 0 4729 1 4730 4728 (L-TLB)
Dec 11 18:46:33 cpu3 kernel: ffff81082be0daa0 0000000000000046 ffff810824710740 000064b0886cfdc4
Dec 11 18:46:33 cpu3 kernel: 0000000000000009 ffff81082fc6f7e0 ffffffff802dcae0 000000814fbeae1f
Dec 11 18:46:33 cpu3 kernel: 0000000003d51554 ffff81082fc6f9c8 0000000000000000 ffff8108246ff000
Dec 11 18:46:33 cpu3 kernel: Call Trace:
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80061839>] schedule_timeout+0x8a/0xad
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80092b26>] process_timeout+0x0/0x5
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88700a3d>] :ptlrpc:ptlrpc_queue_wait+0xa9d/0x1250
Dec 11 18:46:33 cpu3 kernel: [<ffffffff886d67a1>] :ptlrpc:ldlm_resource_putref+0x331/0x3b0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8870a2c5>] :ptlrpc:lustre_msg_set_flags+0x45/0x120
Dec 11 18:46:33 cpu3 kernel: [<ffffffff800884f8>] default_wake_function+0x0/0xe
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a37d0>] :mdc:mdc_reint+0xc0/0x240
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a5c77>] :mdc:mdc_unlink_pack+0x117/0x140
Dec 11 18:46:33 cpu3 kernel: [<ffffffff887a4ab7>] :mdc:mdc_unlink+0x307/0x3d0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff801405f7>] __next_cpu+0x19/0x28
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80087090>] find_busiest_group+0x20d/0x621
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80009499>] __d_lookup+0xb0/0xff
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8886ced6>] :lustre:ll_unlink+0x1d6/0x370
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8883b791>] :lustre:ll_inode_permission+0xa1/0xc0
Dec 11 18:46:33 cpu3 kernel: [<ffffffff80047fc8>] vfs_unlink+0xc2/0x108
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857c57a>] :nfsd:nfsd_unlink+0x1de/0x24b
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88583e9a>] :nfsd:nfsd3_proc_remove+0xa8/0xb5
Dec 11 18:46:33 cpu3 kernel: [<ffffffff885791c4>] :nfsd:nfsd_dispatch+0xd7/0x198
Dec 11 18:46:33 cpu3 kernel: [<ffffffff88488514>] :sunrpc:svc_process+0x44d/0x70b
Dec 11 18:46:33 cpu3 kernel: [<ffffffff800625bf>] __down_read+0x12/0x92
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff885796fb>] :nfsd:nfsd+0x1ae/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8005bfb1>] child_rip+0xa/0x11
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel: [<ffffffff8005bfa7>] child_rip+0x0/0x11
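[For readers trying to reproduce this, a minimal sketch of the kind of setup described above, assuming the Lustre filesystem is mounted at /mnt/lustre on the export server and the Alpha clients sit on 192.168.64.0/24; the mount point, export options, hostnames and network are assumptions, not the exact configuration from the report. The last two commands show one common way to get per-task stack traces like the one quoted.]

    # /etc/exports on the Linux box that carries the Lustre client mount
    # (fsid= is needed on many kernels when exporting a non-block filesystem)
    /mnt/lustre  192.168.64.0/24(rw,sync,no_subtree_check,fsid=1)

    # re-read the export table on the NFS server
    exportfs -ra

    # on a client (Linux mount syntax shown; the Alphas may use different options)
    mount -t nfs -o vers=3,tcp exportserver:/mnt/lustre /mnt/data

    # drive load against the NFS mount (flags vary between bonnie and bonnie++)
    bonnie -d /mnt/data -s 8192

    # capture stack traces of hung nfsd threads: SysRq-T dumps every
    # task's stack into the kernel log
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 40 nfsd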
Hello!

On Dec 11, 2007, at 6:51 PM, Aaron S. Knister wrote:

> This is the strangest problem I have seen. I have a Lustre filesystem
> mounted on a Linux server, and it is being exported over NFS to various
> Alpha systems. The Alphas mount it just fine; however, under heavy load
> the NFS server stops responding, as does the Lustre mount on the
> export server. [...]
> Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service
> data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress
> operations using this service will wait for recovery to complete.

Any messages on the MDS at this time?

Bye,
    Oleg
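[For anyone following along, a couple of ways to answer that question on the MDS node itself; the output path below is a placeholder.]

    # on the MDS (192.168.64.70@o2ib in the messages above), check the kernel log
    dmesg | grep -i -e Lustre -e LustreError
    tail -200 /var/log/messages | grep -i lustre

    # dump the in-kernel Lustre debug buffer to a file for later inspection
    lctl dk /tmp/lustre-mds-debug.log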
Yes, it turns out it's bug 14379. I applied the provided patches and everything works fine now. Thanks for the follow-up!

-Aaron

On Dec 12, 2007, at 11:23 AM, Oleg Drokin wrote:

> Hello!
>
> On Dec 11, 2007, at 6:51 PM, Aaron S. Knister wrote:
>
>> This is the strangest problem I have seen. I have a Lustre filesystem
>> mounted on a Linux server, and it is being exported over NFS to various
>> Alpha systems. [...]
>
> Any messages on the MDS at this time?
>
> Bye,
>     Oleg

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water
(301) 595-7001
aaron@iges.org
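[In case someone else lands on this thread with the same symptoms: the general shape of applying the attachments from bug 14379 and rebuilding looks roughly like the following. The source directory, patch filename, and configure arguments are placeholders, and the right rebuild procedure depends on whether Lustre was installed from packages or from source.]

    # placeholders: adjust the source tree path and patch name to your setup
    cd /usr/src/lustre-1.6.x
    patch -p1 < ~/bug14379-attachment.patch

    # rebuild and reinstall the Lustre modules (sketch only; configure
    # arguments depend on the kernel you are building against)
    ./configure --with-linux=/usr/src/kernels/$(uname -r)-$(uname -m)
    make && make install

    # reload the modules and remount the filesystem
    # (assumes the MGS is colocated with the MDS at 192.168.64.70@o2ib)
    umount /mnt/lustre
    lustre_rmmod
    modprobe lustre
    mount -t lustre 192.168.64.70@o2ib:/data /mnt/lustre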