We seem to have a bit of a problem with our Lustre test setup. Under high load (about 250 MB/s of writes from one client to 3 OSSes), the client starts losing its connection to OSTs (sometimes to the MDS) after about an hour, and the writes hang for about 10 minutes:

One of the servers:

Oct 27 22:41:41 f9sn005 kernel: Lustre: 4067:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre-OST0003: 98ef67d6-49f3-2447-a52a-9cf3f49130cd reconnecting
Oct 27 22:41:41 f9sn005 kernel: Lustre: 4067:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Oct 27 22:41:41 f9sn005 kernel: Lustre: 4184:0:(ldlm_lib.c:835:target_handle_connect()) lustre-OST0000: refuse reconnection from 98ef67d6-49f3-2447-a52a-9cf3f49130cd@194.249.156.61@tcp to 0xffff81023154b200; still busy with 2 active RPCs
Oct 27 22:41:41 f9sn005 kernel: LustreError: 4184:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error (-16) req@ffff81020ab50c50 x1317422623456031/t0 o8->98ef67d6-49f3-2447-a52a-9cf3f49130cd@NET_0x20000c2f99c3d_UUID:0/0 lens 368/264 e 0 to 0 dl 1256680301 ref 1 fl Interpret:/0/0 rc -16/0
Oct 27 22:41:41 f9sn005 kernel: LustreError: 4184:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 1 previous similar message
Oct 27 22:43:20 f9sn005 kernel: Lustre: 4083:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre-OST0000: 98ef67d6-49f3-2447-a52a-9cf3f49130cd reconnecting

Client:

Oct 27 22:41:40 pikolit Lustre: 399:0:(client.c:1430:ptlrpc_expire_one_request()) @@@ Request x1317422623456004 sent from lustre-OST0003-osc-ffff88073512ec00 to NID 194.249.156.130@tcp 7s ago has timed out (7s prior to deadline).
Oct 27 22:41:40 pikolit req@ffff8805ba4a0c00 x1317422623456004/t0 o103->lustre-OST0003_UUID@194.249.156.130@tcp:17/18 lens 312/384 e 0 to 1 dl 1256679700 ref 2 fl Rpc:N/0/0 rc 0/0
Oct 27 22:41:40 pikolit Lustre: 399:0:(client.c:1430:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Oct 27 22:41:40 pikolit Lustre: lustre-OST0003-osc-ffff88073512ec00: Connection to service lustre-OST0003 via nid 194.249.156.130@tcp was lost; in progress operations using this service will wait for recovery to complete.
Oct 27 22:41:40 pikolit Lustre: Skipped 8 previous similar messages
Oct 27 22:41:40 pikolit LustreError: 399:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Oct 27 22:41:40 pikolit LustreError: 399:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 7 previous similar messages
Oct 27 22:41:40 pikolit LustreError: 399:0:(ldlm_request.c:1586:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Oct 27 22:41:40 pikolit LustreError: 399:0:(ldlm_request.c:1586:ldlm_cli_cancel_list()) Skipped 7 previous similar messages
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623455903 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623453859 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454918 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454640 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454554 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454424 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454296 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623454159 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 4 match 1317422623453984 offset 192 length 416: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1795:lnet_parse_put()) Dropping PUT from 12345-194.249.156.130@tcp portal 18 match 1317422623456004 offset 192 length 192: 2
Oct 27 22:41:41 pikolit Lustre: 28152:0:(lib-move.c:1834:lnet_parse_get()) Dropping GET from 12345-194.249.156.130@tcp portal 8 match 1317422623456018 offset 0 length 917504
Oct 27 22:41:41 pikolit Lustre: lustre-OST0003-osc-ffff88073512ec00: Connection restored to service lustre-OST0003 using nid 194.249.156.130@tcp.

There are no disk- or filesystem-related errors, nor are there problems with the network. The Lustre setup consists of 3 SLC5 (a RHEL5-compatible distribution) servers, all of them hosting OSTs and one of them the MDT as well. They were recently upgraded to 1.8.1.1, but the problem was already present before that (with 1.8.0). The clients (the writing one included) run a 2.6.30 kernel with modules for the unpatched kernel, last compiled from the b1_8 branch a couple of days ago.
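If it helps in narrowing this down, here is a rough sketch of what we can check on the client and the servers with lctl; the defaults mentioned in the comments are from memory, and the set_param value is only an example, not something we have confirmed makes a difference:

    # static RPC timeout and adaptive-timeout bounds
    # (1.8 defaults should be timeout=100, at_min=0, at_max=600)
    lctl get_param timeout at_min at_max

    # as an experiment only: raise the static timeout at runtime
    lctl set_param timeout=300

    # list Lustre devices and their state on the node
    lctl dl

    # right after a hang, dump the Lustre debug buffer for more context
    lctl dk /tmp/lustre-debug.log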
Is this a known problem under high load?

Cheers,
Dejan