Hi all,

I have been troubleshooting a strange problem with our Lustre setup. Under high load, our developers are complaining that various processes they run error out with an I/O error.

Our setup is small: 1 MDS, 2 OSSs (10 OSTs, 5 per OSS) and 13 clients (152 cores). The storage is all local, 60 TB usable (30 TB per OSS) in a software RAID6 setup. All of the machines are connected via 10 Gig Ethernet. The clients run Rocks 5.3 (CentOS 5.4) with an unpatched vanilla CentOS kernel and Lustre 1.8.3; the servers run CentOS 5.4 with kernel 2.6.18-164.11.1.el5_lustre.1.8.2.

So far I've not been able to pinpoint where I should begin to look. I have been trawling through log files that quite frankly don't make much sense to me.

Here is the messages output from the OSS:

##############################

Jul 6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jul 6 15:08:26 helium kernel: Lustre: 6539:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul 6 15:08:26 helium kernel: Lustre: 6539:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0006: refuse reconnection from 593eb160-edce-8ceb-3f93-6f743cacd1a1@10.0.0.54@tcp to 0xffff81026241b800; still busy with 5 active RPCs
Jul 6 15:08:26 helium kernel: LustreError: 6539:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16) req@ffff810282fe1000 x1340041377953748/t0 o8->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393006 ref 1 fl Interpret:/0/0 rc -16/0
Jul 6 15:08:26 helium kernel: LustreError: 6660:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff810357883c00 x1340041377934618/t0 o4->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:08:26 helium kernel: LustreError: 6660:0:(ost_handler.c:1061:ost_brw_write()) Skipped 1 previous similar message
Jul 6 15:08:26 helium kernel: LustreError: 6704:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff81082af44050 x1340041377934964/t0 o4->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:08:27 helium kernel: Lustre: 7062:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul 6 15:08:27 helium kernel: Lustre: 7062:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse reconnection from 593eb160-edce-8ceb-3f93-6f743cacd1a1@10.0.0.54@tcp to 0xffff81025a535e00; still busy with 3 active RPCs
Jul 6 15:08:27 helium kernel: LustreError: 7062:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16) req@ffff810698233850 x1340041377955630/t0 o8->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393007 ref 1 fl Interpret:/0/0 rc -16/0
Jul 6 15:08:27 helium kernel: Lustre: 6692:0:(ost_handler.c:1219:ost_brw_write()) AC3-OST0006: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul 6 15:08:27 helium kernel: Lustre: 6692:0:(ost_handler.c:1219:ost_brw_write()) Skipped 6 previous similar messages
Jul 6 15:08:27 helium kernel: LustreError: 6720:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT req@ffff8103a72aec00 x1340041377933615/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278392946 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:08:27 helium kernel: LustreError: 6720:0:(ost_handler.c:829:ost_brw_read()) Skipped 1 previous similar message
Jul 6 15:08:29 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul 6 15:08:29 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) Skipped 1 previous similar message
Jul 6 15:08:37 helium kernel: Lustre: 7058:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul 6 15:08:39 helium kernel: Lustre: 6522:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul 6 15:10:09 helium kernel: Lustre: 7064:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul 6 15:10:09 helium kernel: Lustre: 7064:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse reconnection from e6ce7565-0b86-5925-82fc-015770c5143c@10.0.0.53@tcp to 0xffff810267bbc400; still busy with 1 active RPCs
Jul 6 15:10:09 helium kernel: LustreError: 7064:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16) req@ffff81069588b800 x1340036504045855/t0 o8->e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393109 ref 1 fl Interpret:/0/0 rc -16/0
Jul 6 15:10:09 helium kernel: LustreError: 6649:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT req@ffff81053a96f400 x1340036504013451/t0 o3->e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393053 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:10:09 helium kernel: LustreError: 6649:0:(ost_handler.c:829:ost_brw_read()) Skipped 2 previous similar messages
Jul 6 15:10:13 helium kernel: Lustre: 6649:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID id 12345-10.0.0.53@tcp - client will retry
Jul 6 15:10:13 helium kernel: Lustre: 6649:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar messages
Jul 6 15:10:17 helium kernel: Lustre: 7016:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul 6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp ns: filter-AC3-OST0006_UUID lock: ffff8103a644e200/0xded0540147d4c8c7 lrc: 3/0,0 mode: PR/PR res: 8432287/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850f9f expref: 878 pid: 7043 timeout 4896983243
Jul 6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) Skipped 1 previous similar message
Jul 6 15:10:17 helium kernel: LustreError: 6715:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff81027fcc5400 x1340041378188643/t0 o4-><?>@<?>:0/0 lens 448/0 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc -107/0
Jul 6 15:10:19 helium kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp ns: filter-AC3-OST0008_UUID lock: ffff810489298400/0xded0540147d4c8b9 lrc: 3/0,0 mode: PR/PR res: 8127131/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850fbb expref: 34 pid: 6592 timeout 4896985868
Jul 6 15:10:19 helium kernel: LustreError: 6730:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT req@ffff810355aeec00 x1340041378188706/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:10:19 helium kernel: LustreError: 6665:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT req@ffff81069e442400 x1340041378188736/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:10:19 helium kernel: LustreError: 6665:0:(ost_handler.c:825:ost_brw_read()) Skipped 5 previous similar messages
Jul 6 15:10:20 helium kernel: Lustre: 6730:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul 6 15:10:20 helium kernel: LustreError: 6714:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff8106965e0800 x1340041378196268/t0 o3-><?>@<?>:0/0 lens 448/0 e 0 to 0 dl 1278393072 ref 1 fl Interpret:/0/0 rc -107/0
Jul 6 15:11:10 helium kernel: Lustre: 7119:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: d9a4606b-2d46-7e5b-c67d-c05610d8af95 reconnecting
Jul 6 15:18:00 helium kernel: Lustre: 6979:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul 6 15:30:40 helium kernel: Lustre: 7119:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850745218 sent from AC3-OST0006 to NID 10.0.0.54@tcp 7s ago has timed out (7s prior to deadline).
Jul 6 15:30:40 helium kernel: req@ffff8104fd018000 x1339860850745218/t0 o104->@NET_0x200000a000036_UUID:15/16 lens 296/384 e 0 to 1 dl 1278394240 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 6 15:30:40 helium kernel: LustreError: 138-a: AC3-OST0006: A client on nid 10.0.0.54@tcp was evicted due to a lock blocking callback to 10.0.0.54@tcp timed out: rc -107
Jul 6 15:30:40 helium kernel: LustreError: 7119:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff8102688f7400 ns: filter-AC3-OST0006_UUID lock: ffff81012bdcb200/0xded0540147d6bb48 lrc: 3/0,0 mode: --/PW res: 8432457/0 rrc: 2 type: EXT [0->376831] (req 0->376831) flags: 0x0 remote: 0x6c318a80f18346f1 expref: 357 pid: 7119 timeout 0
Jul 6 15:30:40 helium kernel: LustreError: 7119:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff8103870e8400 x1340041383514587/t0 o101->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 296/352 e 0 to 0 dl 1278394284 ref 1 fl Interpret:/0/0 rc -107/0
Jul 6 15:30:41 helium kernel: LustreError: 6679:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT req@ffff81082a335c50 x1340041383502349/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278394307 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:30:41 helium kernel: Lustre: 6679:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0006: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul 6 15:30:41 helium kernel: Lustre: 6679:0:(ost_handler.c:886:ost_brw_read()) Skipped 6 previous similar messages
Jul 6 15:30:41 helium kernel: LustreError: 8166:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 12345-10.0.0.54@tcp with bad export cookie 16055425036052657553
Jul 6 15:32:08 helium kernel: Lustre: 7134:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850745435 sent from AC3-OST0005 to NID 10.0.0.56@tcp 7s ago has timed out (7s prior to deadline).
Jul 6 15:32:08 helium kernel: req@ffff810695874000 x1339860850745435/t0 o104->@:15/16 lens 296/384 e 0 to 1 dl 1278394328 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 6 15:32:08 helium kernel: LustreError: 138-a: AC3-OST0005: A client on nid 10.0.0.56@tcp was evicted due to a lock blocking callback to 10.0.0.56@tcp timed out: rc -107
Jul 6 15:32:08 helium kernel: LustreError: 7134:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff8102552f5400 ns: filter-AC3-OST0005_UUID lock: ffff810485fade00/0xded0540147d6ead5 lrc: 3/0,0 mode: --/PW res: 8108669/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xd1f8ae5995b5c83b expref: 620 pid: 7134 timeout 0
Jul 6 15:32:08 helium kernel: LustreError: 7134:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff81009ddd6800 x1340491441560517/t0 o101->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 296/352 e 0 to 0 dl 1278394372 ref 1 fl Interpret:/0/0 rc -107/0
Jul 6 15:32:08 helium kernel: LustreError: 6610:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT req@ffff810357887c00 x1340491441559786/t0 o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:32:08 helium kernel: LustreError: 6610:0:(ost_handler.c:825:ost_brw_read()) Skipped 3 previous similar messages
Jul 6 15:32:08 helium kernel: Lustre: 6610:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul 6 15:32:08 helium kernel: Lustre: 6610:0:(ost_handler.c:886:ost_brw_read()) Skipped 3 previous similar messages
Jul 6 15:32:08 helium kernel: LustreError: 6720:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT req@ffff81040de33000 x1340491441560035/t0 o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
Jul 6 15:32:16 helium kernel: LustreError: 12001:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 12345-10.0.0.56@tcp with bad export cookie 16055425036052176730
Jul 6 15:32:19 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul 6 15:32:21 helium kernel: Lustre: 6712:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul 6 15:32:21 helium kernel: Lustre: 6712:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar messages
Jul 6 15:36:56 helium kernel: Lustre: 6970:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746067 sent from AC3-OST0006 to NID 10.0.0.57@tcp 11s ago has timed out (11s prior to deadline).
Jul 6 15:36:56 helium kernel: req@ffff810086c83000 x1339860850746067/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394616 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:37:01 helium diskmond: 168:Polling all 48 slots for drive fault
Jul 6 15:37:10 helium diskmond: sata4/5 device(/dev/sdal) is running bad
Jul 6 15:37:10 helium diskmond: please back up and replace the disk soon.
Jul 6 15:38:05 helium kernel: Lustre: 6537:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746176 sent from AC3-OST0005 to NID 10.0.0.57@tcp 8s ago has timed out (8s prior to deadline).
Jul 6 15:38:05 helium kernel: req@ffff810341fdc800 x1339860850746176/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394685 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:38:18 helium kernel: Lustre: 6524:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746195 sent from AC3-OST0008 to NID 10.0.0.57@tcp 7s ago has timed out (7s prior to deadline).
Jul 6 15:38:18 helium kernel: req@ffff810282961400 x1339860850746195/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394698 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:39:07 helium kernel: Lustre: 6793:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746269 sent from AC3-OST0006 to NID 10.0.0.57@tcp 11s ago has timed out (11s prior to deadline).
Jul 6 15:39:07 helium kernel: req@ffff81027ce5b400 x1339860850746269/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394747 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:40:19 helium kernel: Lustre: 6851:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746437 sent from AC3-OST0005 to NID 10.0.0.57@tcp 8s ago has timed out (8s prior to deadline).
Jul 6 15:40:19 helium kernel: req@ffff8104135bb800 x1339860850746437/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394819 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:42:10 helium kernel: Lustre: 6935:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746615 sent from AC3-OST0005 to NID 10.0.0.53@tcp 7s ago has timed out (7s prior to deadline).
Jul 6 15:42:10 helium kernel: req@ffff81025b227c00 x1339860850746615/t0 o106->@NET_0x200000a000035_UUID:15/16 lens 296/424 e 0 to 1 dl 1278394930 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:46:01 helium kernel: Lustre: 6957:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746990 sent from AC3-OST0007 to NID 172.16.16.124@tcp 7s ago has timed out (7s prior to deadline).
Jul 6 15:46:01 helium kernel: req@ffff8103fabd9800 x1339860850746990/t0 o106->@NET_0x20000ac10107c_UUID:15/16 lens 296/424 e 0 to 1 dl 1278395161 ref 2 fl Rpc:/0/0 rc 0/0

########################

Here is the output from the client at the same time:

Jul 6 15:10:17 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_write operation failed with -107
Jul 6 15:10:17 compute-0-3 kernel: LustreError: Skipped 1 previous similar message
Jul 6 15:10:17 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.
Jul 6 15:10:17 compute-0-3 kernel: LustreError: Skipped 4 previous similar messages
Jul 6 15:10:17 compute-0-3 kernel: LustreError: 3095:0:(namei.c:1176:ll_objects_destroy()) obd destroy objid 0x18542c4@0x0 error -5
Jul 6 15:10:17 compute-0-3 kernel: LustreError: 6419:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Jul 6 15:10:18 compute-0-3 kernel: LustreError: 6779:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0006-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul 6 15:10:18 compute-0-3 kernel: LustreError: 6779:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810117997500 (8432287/0/0/0) (rc: 1)
Jul 6 15:10:18 compute-0-3 kernel: LustreError: 6687:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jul 6 15:10:18 compute-0-3 kernel: LustreError: 6782:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_read operation failed with -107
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0008; in progress operations using this service will fail.
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 20660:0:(rw.c:122:ll_brw()) error from obd_brw: rc = -4
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff81041604b9c0 (5521743/0/0/0) (rc: 1)
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 6686:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81042061a800 x1340041378196315/t0 o4->AC3-OST0008_UUID@172.16.16.2@tcp:6/4 lens 512/624 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 3571:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 78 previous similar messages
Jul 6 15:10:20 compute-0-3 kernel: LustreError: 6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

Kind Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
Telephone: +61 2 8088 4223
Fax: +61 2 8088 4201
www.cmcrc.com

Capital Markets CRC Ltd - Confidential Communication
The information contained in this e-mail is confidential. It is intended for the addressee only. If you receive this e-mail by mistake please promptly inform us by reply e-mail and then delete the e-mail and destroy any printed copy. You must not disclose or use in any way the information in the e-mail. There is no warranty that this e-mail is error or virus free. It may be a private communication, and if so, does not represent the views of the CMCRC and its associates. If it is a private communication, care should be taken in opening it to ensure that undue offence is not given.
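A note on the numeric codes in these messages: Lustre reports negated errno values, so rc -16 is EBUSY, -107 is ENOTCONN, -5 is EIO and -4 is EINTR, which is why the applications end up seeing "I/O error". A minimal Python sketch (not part of the original post; the patterns it greps for are only the ones visible in the excerpts above) that decodes them from a saved syslog excerpt:

#!/usr/bin/env python
# Decode the numeric return codes in Lustre log lines (rc -16, rc -107, ...):
# they are negated errno values. Hypothetical usage:
#   grep Lustre /var/log/messages | python decode_rc.py
import errno
import os
import re
import sys

RC_RE = re.compile(r'(?:rc = |rc |failed with )(-\d+)')

def decode_line(line):
    """Return human-readable names for every negative return code on a line."""
    out = []
    for code in RC_RE.findall(line):
        num = -int(code)                              # "-107" -> 107
        name = errno.errorcode.get(num, 'unknown')    # 107 -> "ENOTCONN"
        out.append('%s -> %s (%s)' % (code, name, os.strerror(num)))
    return out

if __name__ == '__main__':
    seen = set()
    for line in sys.stdin:
        for entry in decode_line(line):
            if entry not in seen:                     # report each code once
                seen.add(entry)
                print(entry)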
Hi Peter,

which 10GbE card do you have? I've solved a similar problem with a NetXen card (HP Blade mezzanine card) by using the proprietary nx_nic driver instead of the "open source" driver. In any case, the problem is that your users are filling the network between the clients and the OSTs!

On 07/06/2010 08:19 AM, Peter Kitchener wrote:
> [...]

--
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci

Pursuant to legislative Decree n. 196/03 you are hereby informed that this email contains confidential information intended only for use of the addressee. If you are not the addressee and have received this email by mistake, please send this email to the sender. You may not copy or disseminate this message to anyone. Thank You.
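One quick way to test whether the network or the NIC driver is the culprit, as Gabriele suggests, is to watch the interface's error and drop counters on the OSS and on an affected client while the load is running (ethtool -i <iface> will also show which driver is actually loaded). A minimal sketch follows; the interface name eth0 is an assumption, adjust it for your setup.

#!/usr/bin/env python
# Watch error/drop counters for one interface in /proc/net/dev while the
# cluster is under load; steadily climbing drops point at the NIC, its driver
# or the network itself. IFACE is an assumption -- set it to the 10GbE
# interface that carries the Lustre traffic.
import time

IFACE = 'eth0'

def read_counters(iface):
    """Return (rx_errs, rx_drop, tx_errs, tx_drop) for the interface."""
    for line in open('/proc/net/dev'):
        if line.strip().startswith(iface + ':'):
            fields = line.split(':', 1)[1].split()
            # rx: bytes packets errs drop fifo frame compressed multicast
            # tx: bytes packets errs drop fifo colls carrier compressed
            return int(fields[2]), int(fields[3]), int(fields[10]), int(fields[11])
    raise ValueError('interface %s not found in /proc/net/dev' % iface)

if __name__ == '__main__':
    prev = read_counters(IFACE)
    while True:
        time.sleep(10)
        cur = read_counters(IFACE)
        rx_e, rx_d, tx_e, tx_d = [c - p for c, p in zip(cur, prev)]
        print('last 10s: rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d'
              % (rx_e, rx_d, tx_e, tx_d))
        prev = cur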
The source of the I/O errors is the eviction of the client by the OSS, which was not able to reclaim a lock from that client within the specified timeout window (100s).

OSS:
=====
Jul 6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp ns: filter-AC3-OST0006_UUID lock: ffff8103a644e200/0xded0540147d4c8c7 lrc: 3/0,0 mode: PR/PR res: 8432287/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850f9f expref: 878 pid: 7043 timeout 4896983243

Client:
=====
Jul 6 15:10:17 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_write operation failed with -107
Jul 6 15:10:17 compute-0-3 kernel: LustreError: Skipped 1 previous similar message
Jul 6 15:10:17 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.

The question is why the lock callback expired. It is possible that your network was unhealthy, which caused the eviction. There is another possibility: I have been seeing similar evictions on our large cluster (~600 nodes) running RHEL 5.4 and Lustre 1.8.3, where I know that the network is healthy. I have searched the Lustre bugzilla for similar problems and found several similar cases reported, for example bugs 22964, 23190, 22175 and others. It looks like there might be a Lustre bug which causes a lock deadlock, which then causes the OSS to evict the client.

Best regards,
Wojciech

On 6 July 2010 10:23, Gabriele Paciucci <paciucci@gmail.com> wrote:
> [...]
--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
Hi All,

The NICs we're using are Broadcom Corporation NetXtreme II BCM57711 10-Gigabit PCIe cards (dual-port SFP+), connected to a Dell PowerConnect 8024F. We're currently using the open-source bnx2x driver as provided in the kernel, and there don't appear to be any dropped packets on the clients or servers.

Could it be that the lock callback timer threshold is too low? I'm uncertain whether there is a way of altering this threshold. Could there be some sort of kernel buffer or limit set too low on the clients/servers?

Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
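For reference, a rough sketch of how one might inspect the relevant knobs, assuming the stock Lustre 1.8 /proc layout; eth2 is only a placeholder for the 10GbE interface name, so adjust for your setup:

# Lustre RPC timeout and adaptive-timeout bounds (run on a client or a server)
cat /proc/sys/lustre/timeout
cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max

# Per-NIC drop/discard/error counters that ifconfig does not always report
ethtool -S eth2 | grep -iE 'drop|discard|error'

# Kernel socket buffer ceilings that can throttle TCP at 10GbE speeds
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem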
On 07/05/2010 11:19 PM, Peter Kitchener wrote:
> [...]
> Jul 6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125 at tcp) in 227 seconds. I think it's dead, and I am evicting it.

There is a bug in lustre 1.8.2 and 1.8.3 that makes the ptlrpcd get stuck for long periods of time (around 10 minutes was the longest that I saw) on lustre clients under certain work loads. If the ptlrpcd is dead, the client may stop sending all RPCs to the servers, and the servers evict the client because they haven't heard from it in a while.

See bug 22897 for a description of the bug. But the fix is a simple one-liner in bug 22786, attachment 29866. The fix will first appear in lustre 1.8.4. I would highly recommend to anyone using 1.8.2 or 1.8.3 that they add that patch.

I don't know if that is the cause of your particular evictions, because there can be many causes of evictions. But the "haven't heard from client ... in 227 seconds" message was one of the symptoms, and the client failing with -107 (ENOTCONN) against multiple OSTs (and/or MDS, MGS...) at the same time was another symptom.

Chris
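A quick way to corroborate that on a suspect client, sketched below; nothing here is Lustre-specific beyond the thread names, and the sysrq step needs kernel.sysrq enabled:

# A wedged ptlrpcd tends to sit in D state or on the same wait channel for minutes
ps -eo pid,stat,wchan:32,comm | grep ptlrpcd

# Dump all kernel task stacks to the kernel log for a closer look (noisy on a busy node)
echo t > /proc/sysrq-trigger
dmesg | grep -A 20 ptlrpcd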
Hi Chris,

> See bug 22897 for a description of the bug. But the fix is a simple
> one-liner in bug 22786, attachment 29866. The fix will first appear in
> lustre 1.8.4. I would highly recommend to anyone using 1.8.2 or 1.8.3
> that they add that patch.

How would I safely do that when I've installed Lustre from the RPMs? Do I obtain the source tarball or the patched kernel source RPM and then apply the patch? I'm not sure, and I'm being pressured by my dev team to fix this ASAP.

Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
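The usual shape of that rebuild is roughly the following, sketched with placeholder names: the tarball version, the patch file name and the kernel source path all need to match your installation, and this assumes the one-liner in attachment 29866 applies to the Lustre tree itself (ptlrpcd lives under lustre/ptlrpc), not to the kernel:

# Unpack the Lustre source that matches the installed version
# (1.8.2 on the servers, 1.8.3 on the clients)
tar xzf lustre-1.8.2.tar.gz
cd lustre-1.8.2

# Apply the one-liner from bugzilla; adjust the -p level to whatever the patch headers expect
patch -p0 < ../attachment-29866.patch

# Configure against the kernel those nodes already run, then package it up
# (servers point at the Lustre-patched kernel source; patchless clients can
# point at their stock CentOS kernel-devel tree)
./configure --with-linux=/usr/src/kernels/2.6.18-164.11.1.el5_lustre.1.8.2
make rpms

The resulting lustre, lustre-modules (and, on servers, lustre-ldiskfs) RPMs would then replace the stock ones; no kernel rebuild should be needed if only the Lustre code changes.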
Hi,

The ptlrpcd bug is a real problem, but I don't see anything in Peter's logs pointing to an eviction caused by ptlrpcd; instead they point to timeouts in the communication between an OST and the client. Peter could also downgrade to 1.8.1.1, which does not suffer from that bug.

My suggested action plan:

1. First of all, run the same Lustre version on clients and servers (1.8.3).
2. Second, check /proc/sys/lustre/ldlm_timeout: 6 seconds for the MDS, 20 seconds for the OSSes!!
3. Third: do you have enough memory on the servers for all the clients' locks? Please refer to:
   http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50417791_pgfId-1290875
   A server typically starts to struggle once it holds more than about 500k locks.

bye

On 07/07/2010 02:12 AM, Christopher J. Morrone wrote:
> There is a bug in lustre 1.8.2 and 1.8.3 that makes the ptlrpcd get
> stuck for long periods of time (around 10 minutes was the longest that I
> saw) on lustre clients under certain work loads. [...]

--
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci
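A quick way to check points 2 and 3 from a shell on each server, as a sketch that assumes the stock 1.8 /proc layout (the per-namespace lock_count files in particular may sit elsewhere on other builds):

# Current LDLM (lock callback) timeout on this node
cat /proc/sys/lustre/ldlm_timeout

# Rough total of locks currently held across all server lock namespaces
cat /proc/fs/lustre/ldlm/namespaces/*/lock_count | awk '{sum += $1} END {print sum}'

# Memory headroom on the OSS/MDS, since every granted lock pins some kernel memory
free -m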
Here is a dump of a log of some kind from the client at the time of the eviction. Does this shed any further light on the problem? 00000100:00000400:7:1278392906.021560:0:6677:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934503 sent from AC3-OST0006-osc-ffff81043e754c00 to NID 172.16.16.2 at tcp 11s ago has timed o ut (11s prior to deadline). req at ffff81043f31dc00 x1340041377934503/t0 o101->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:/0/0 rc 0/0 00000100:02000400:7:1278392906.021598:0:6677:0:(import.c:159:ptlrpc_set_import_discon()) AC3-OST0006-osc-ffff81043e754c00: Connection to service AC3-OST0006 via nid 172.16.16.2 at tcp was lost; in progress operations using this service will wait for recovery to complete. 00000100:00080000:7:1278392906.021611:0:6677:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN 00000100:00080000:7:1278392906.021615:0:6677:0:(import.c:403:ptlrpc_fail_import()) AC3-OST0006_UUID: waking up pinger00000100:00080000:4:1278392906.021634:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0006_UUID: starting recovery 00000100:00080000:4:1278392906.021639:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:4:1278392906.021645:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4379111012 00000100:00080000:4:1278392906.021645:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4379111012 00000100:00080000:4:1278392906.021651:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp 00000100:00000400:7:1278392906.030554:0:6687:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934535 sent from AC3-OST0006-osc-ffff81043e754c00 to NID 172.16.16.2 at tcp 11s ago has timed o ut (11s prior to deadline). req at ffff81031686b000 x1340041377934535/t0 o101->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:/0/0 rc 0/0 00000100:00080000:7:1278392906.030567:0:6687:0:(import.c:182:ptlrpc_set_import_discon()) osc: import ffff81042018c000 already not connected (conn 177, was 176): CONNECTING 00000100:00000400:5:1278392907.030414:0:6686:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934536 sent from AC3-OST0008-osc-ffff81043e754c00 to NID 172.16.16.2 at tcp 12s ago has timed o ut (12s prior to deadline). req at ffff8102b439bc00 x1340041377934536/t0 o101->AC3-OST0008_UUID at 172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392907 ref 1 fl Rpc:/0/0 rc 0/0 00000100:02000400:5:1278392907.030434:0:6686:0:(import.c:159:ptlrpc_set_import_discon()) AC3-OST0008-osc-ffff81043e754c00: Connection to service AC3-OST0008 via nid 172.16.16.2 at tcp was lost; in progress ope rations using this service will wait for recovery to complete. 
00000100:00080000:5:1278392907.030449:0:6686:0:(import.c:167:ptlrpc_set_import_discon()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from FULL to DISCONN 00000100:00080000:5:1278392907.030452:0:6686:0:(import.c:403:ptlrpc_fail_import()) AC3-OST0008_UUID: waking up pinger 00000100:00080000:4:1278392907.030465:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0008_UUID: starting recovery00000100:00080000:4:1278392907.030470:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:4:1278392907.030475:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 437911101200000100:00080000:4:1278392907.030480:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp 00000400:00000100:3:1278392907.213541:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 4 match 1340041377934536 offset 192 length 352: 200000400:00000100:3:1278392907.213550:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 4 match 1340041377934503 offset 192 length 352: 2 00000400:00000100:3:1278392907.213554:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 4 match 1340041377934535 offset 192 length 352: 200000100:02020000:0:1278392907.213815:0:3572:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2 at tcp. The ost_connect operation failed with -16 00000100:00080000:0:1278392907.213828:0:3572:0:(import.c:1091:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to DISCONN00000100:00080000:0:1278392907.213831:0:3572:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of AC3-OST0006_UUID on 172.16.16.2 at tcp failed (-16) 00000100:02020000:0:1278392907.213839:0:3572:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2 at tcp. 
The ost_connect operation failed with -1600000100:00080000:0:1278392907.213841:0:3572:0:(import.c:1091:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:0:1278392907.213843:0:3572:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of AC3-OST0008_UUID on 172.16.16.2 at tcp failed (-16) 00000100:00080000:6:1278392907.226136:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff8101b4588c00 x1340041377955935/t0 o3->AC3-OST0008_UUI D at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 00000100:00080000:6:1278392907.226192:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff81042061b800 x1340041377955949/t0 o4->AC3-OST0008_UUI D at 172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 00000100:00080000:6:1278392907.226202:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff81042061a400 x1340041377955951/t0 o4->AC3-OST0008_UUI D at 172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 00000100:00080000:6:1278392907.226207:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff8104201e5000 x1340041377955955/t0 o4->AC3-OST0006_UUID at 172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 00000100:00080000:6:1278392907.226211:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff81042030f800 x1340041377955956/t0 o3->AC3-OST0008_UUI D at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/000000100:00080000:6:1278392907.226219:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req at ffff81042061a000 x1340041377955959/t0 o4->AC3-OST0008_UUI D at 172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 00000100:00080000:4:1278392917.022748:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0006_UUID: starting recovery00000100:00080000:4:1278392917.022753:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:4:1278392917.022759:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4724612378 00000100:00080000:4:1278392917.022766:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp00000100:00080000:0:1278392917.023419:0:3572:0:(import.c:868:ptlrpc_connect_interpret()) reconnected to AC3-OST0006_UUID at 172.16.16.2@tcp after partition 00000100:00080000:0:1278392917.023425:0:3572:0:(import.c:891:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to RECOVER 1 dl 1278392951 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392917.023440:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933846 new x1340041377973369 1 dl 1278392951 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392917.023452:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934227 new x1340041377973370 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0 
00000100:00080000:0:1278392917.023459:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934618 new x1340041377973371 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392917.023466:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934964 new x1340041377973372 1 dl 1278392955 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392917.023472:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377940508 new x1340041377973373 00000100:00080000:0:1278392917.023475:0:3572:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL 00000100:00080000:6:1278392917.023480:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973369 new x1340041377973374 00000002:00080000:0:1278392917.023492:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:02000000:0:1278392917.023497:0:3572:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2 at tcp. 00000100:00080000:6:1278392917.023513:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973370 new x1340041377973375 ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:X/0/0 rc 0/0 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:X/0/0 rc 0/0 00000100:00080000:6:1278392917.023543:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973371 new x1340041377973376 00000002:00080000:0:1278392917.023544:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00080000:6:1278392917.023572:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973372 new x1340041377973377 00000100:00080000:6:1278392917.023596:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973373 new x1340041377973378 00000100:00080000:4:1278392919.644883:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0008_UUID: starting recovery 00000100:00080000:4:1278392919.644889:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:4:1278392919.644894:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4724613386 00000100:00080000:4:1278392919.644900:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp 00000100:00080000:0:1278392919.645477:0:3572:0:(import.c:868:ptlrpc_connect_interpret()) reconnected to AC3-OST0008_UUID at 172.16.16.2@tcp after partition 00000100:00080000:0:1278392919.645482:0:3572:0:(import.c:891:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to RECOVER 00000100:00080000:0:1278392919.645486:0:3572:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0008_UUID at 172.16.16.2@tcp 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392919.645495:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933615 new x1340041377977349 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392919.645505:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933659 new x1340041377977350 1 dl 1278392952 ref 
2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278392919.645511:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933703 new x1340041377977351 00000100:00080000:0:1278392919.645514:0:3572:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from RECOVER to FULL 00000100:00080000:6:1278392919.645518:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977350 new x1340041377977353 00000002:00080000:0:1278392919.645531:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:02000000:0:1278392919.645536:0:3572:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0008-osc-ffff81043e754c00: Connection restored to service AC3-OST0008 using nid 172.16.16.2 at tcp. lens 296/544 e 0 to 1 dl 1278392907 ref 1 fl Rpc:X/0/0 rc 0/0 ns 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 ns 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 ns 448/592 e 0 to 1 dl 1278392952 ref 2 fl Rpc:S/0/0 rc -11/0 00000100:00080000:6:1278392919.645591:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977351 new x1340041377977354 00000002:00080000:0:1278392919.645593:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00080000:6:1278392919.645729:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977349 new x1340041377977355 00000100:02020000:2:1278393017.800317:0:3571:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2 at tcp. The ost_write operation failed with -107 00000100:00080000:2:1278393017.800332:0:3571:0:(recover.c:196:ptlrpc_request_handle_notconn()) import AC3-OST0006-osc-ffff81043e754c00 of AC3-OST0006_UUID at 172.16.16.2@tcp abruptly disconnected: reconnecting rations using this service will wait for recovery to complete. 00000100:00080000:2:1278393017.800343:0:3571:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN 00000100:00080000:2:1278393017.800350:0:3571:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:2:1278393017.800357:0:3571:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4724623378 00000100:00080000:2:1278393017.800366:0:3571:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcpt 0 o8->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278393031 ref 1 fl Interpret:RN/0/0 rc 0/0 00000100:00080000:0:1278393017.801096:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to EVICTED 00000100:02020000:0:1278393017.801099:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail. 
00000100:00080000:0:1278393017.801106:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0006_UUID at 172.16.16.2@tcp; invalidating 00000002:00080000:0:1278393017.801150:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00080000:0:1278393017.801188:0:6779:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0006-osc-ffff81043e754c00 to AC3-OST0006_UUID at 172.16.16.2@tcp 00000100:00080000:0:1278393017.801190:0:6779:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0006_UUID INVALID 00020000:01000000:0:1278393017.801201:0:6779:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID inactive 00000002:00080000:0:1278393017.801206:0:6779:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00020000:00080000:7:1278393017.810055:0:6430:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00020000:00080000:5:1278393017.827252:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00020000:00080000:5:1278393017.827298:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00020000:00080000:5:1278393017.827325:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00020000:00080000:7:1278393017.833456:0:3095:0:(lov_request.c:1099:lov_prep_destroy_set()) lov idx 6 inactive 00000080:00020000:7:1278393017.833461:0:3095:0:(namei.c:1176:ll_objects_destroy()) obd destroy objid 0x18542c4 at 0x0 error -5 00020000:00080000:4:1278393017.912169:0:6430:0:(lov_obd.c:2372:lov_join_lru()) lov idx 6 inactive 00020000:00080000:4:1278393017.913654:0:6419:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00000080:00020000:4:1278393017.913660:0:6419:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO 00020000:00080000:4:1278393017.922360:0:6419:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00010000:00020000:1:1278393018.325707:0:6779:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0006-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 
00010000:00020000:1:1278393018.325727:0:6779:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810117997500 (8432287/0/0/0) (rc: 1) 00020000:00080000:7:1278393018.325761:0:6687:0:(lov_obd.c:1727:lov_brw_check()) lov idx 6 inactive 00000080:00020000:7:1278393018.325781:0:6687:0:(llite_mmap.c:210:ll_tree_unlock()) couldn''t unlock -5 00020000:00080000:7:1278393018.327514:0:6687:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00020000:00080000:2:1278393018.332887:0:6782:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00000080:00020000:2:1278393018.332895:0:6782:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO 00000100:00080000:1:1278393018.333028:0:6779:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff81042018c000 AC3-OST0006_UUID: changing import state from EVICTED to RECOVER 00000100:00080000:1:1278393018.333032:0:6779:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0006_UUID at 172.16.16.2@tcp 00000100:00080000:1:1278393018.333035:0:6779:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL 00020000:01000000:1:1278393018.333042:0:6779:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID active 00000002:00080000:1:1278393018.333051:0:6779:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:02000000:1:1278393018.333056:0:6779:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2 at tcp. 00000100:02020000:4:1278393020.758895:0:20660:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2 at tcp. The ost_read operation failed with -107 g erations using this service will wait for recovery to complete. 00000100:00080000:4:1278393020.758919:0:20660:0:(import.c:167:ptlrpc_set_import_discon()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from FULL to DISCONN 00000100:00080000:4:1278393020.758925:0:20660:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:4:1278393020.758928:0:20660:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4724626000 00000100:00080000:4:1278393020.758935:0:20660:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp 16.16.2 at tcp:6/4 lens 448/592 e 0 to 1 dl 1278393085 ref 1 fl Rpc:RS/0/0 rc 0/-107 UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0 t0 o8->AC3-OST0008_UUID at 172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278393038 ref 1 fl Interpret:RN/0/0 rc 0/0 00000100:00080000:0:1278393020.770920:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to EVICTED 00000100:02020000:0:1278393020.770923:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0008; in progress operations using this service will fail. 
00000100:00080000:0:1278393020.770930:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0008_UUID at 172.16.16.2@tcp; invalidating 00000002:00080000:0:1278393020.771032:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00080000:0:1278393020.771066:0:6784:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0008-osc-ffff81043e754c00 to AC3-OST0008_UUID at 172.16.16.2@tcp 00000100:00080000:0:1278393020.771068:0:6784:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0008_UUID INVALID 2 at tcp:6/4 lens 448/592 e 0 to 1 dl 1278393085 ref 1 fl Rpc:RES/0/0 rc -4/-107 00020000:01000000:0:1278393020.771084:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID inactive 00000002:00080000:0:1278393020.771087:0:6784:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000080:00020000:4:1278393020.771103:0:20660:0:(rw.c:122:ll_brw()) error from obd_brw: rc = -4 00010000:00020000:0:1278393020.777506:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 00010000:00020000:0:1278393020.777518:0:6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff81041604b9c0 (5521743/0/0/0) (rc: 1) 00010000:00020000:0:1278393020.777531:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 00010000:00020000:0:1278393020.777533:0:6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810266786e00 (8127131/0/0/0) (rc: 1) 00020000:00080000:5:1278393020.777583:0:6686:0:(lov_obd.c:1727:lov_brw_check()) lov idx 8 inactive 00000080:00020000:5:1278393020.777624:0:6686:0:(llite_mmap.c:210:ll_tree_unlock()) couldn''t unlock -5 00000100:00020000:2:1278393020.777678:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061a800 x1340041378196315/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 512/624 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/01068:0:6784:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0008_UUID INVALID 00000100:00020000:2:1278393020.778942:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061a000 x1340041378196323/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/01084:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID inactive 00020000:00080000:5:1278393020.779394:0:6686:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive 00000100:00020000:2:1278393020.780168:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101f525e000 x1340041378196326/t0 o3->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07506:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 
00000100:00020000:2:1278393020.781246:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101f189fc00 x1340041378196328/t0 o3->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07531:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 00000100:00020000:2:1278393020.782563:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061b400 x1340041378196329/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07583:0:6686:0:(lov_obd.c:1727:lov_brw_check()) lov idx 8 inactive 00000100:00020000:2:1278393020.783344:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8104212fc400 x1340041378196334/t0 o3->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:2:1278393020.783559:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061ac00 x1340041378196337/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/09394:0:6686:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive 00000100:00020000:2:1278393020.783727:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061c800 x1340041378196341/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00020000:00080000:7:1278393020.783778:0:6785:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive 00000080:00020000:7:1278393020.783782:0:6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO 00000100:00020000:2:1278393020.784052:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061bc00 x1340041378196347/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:2:1278393020.784197:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061cc00 x1340041378196354/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 480/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/03782:0:6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO 00000100:00020000:2:1278393020.784325:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061b000 x1340041378196360/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:2:1278393020.784436:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81042061b800 x1340041378196369/t0 o4->AC3-OST0008_UUID at 172.16.16.2@tcp:6/4 lens 464/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00080000:0:1278393020.792064:0:6784:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from EVICTED to RECOVER 00000100:00080000:0:1278393020.792068:0:6784:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0008_UUID at 172.16.16.2@tcp 00000100:00080000:0:1278393020.792071:0:6784:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from RECOVER to FULL 00020000:01000000:0:1278393020.792078:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID active 
00000002:00080000:0:1278393020.792087:0:6784:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:02000000:0:1278393020.792092:0:6784:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0008-osc-ffff81043e754c00: Connection restored to service AC3-OST0008 using nid 172.16.16.2 at tcp. 00000400:00000100:4:1278393020.806533:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 8 match 1340041378188715 offset 0 length 1048576: 2 00000400:00000100:4:1278393020.814069:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 8 match 1340041378188716 offset 0 length 1048576: 2 00000400:00000100:4:1278393020.828140:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 8 match 1340041378188729 offset 0 length 1048576: 2 00000400:00000100:4:1278393020.834691:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 8 match 1340041378188730 offset 0 length 1048576: 2 00000400:00000100:4:1278393020.841916:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2 at tcp portal 8 match 1340041378188736 offset 0 length 1048576: 2 00000100:02020000:2:1278394241.028730:0:7780:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2 at tcp. The ldlm_enqueue operation failed with -107 00000100:00080000:2:1278394241.028748:0:7780:0:(recover.c:196:ptlrpc_request_handle_notconn()) import AC3-OST0006-osc-ffff81043e754c00 of AC3-OST0006_UUID at 172.16.16.2@tcp abruptly disconnected: reconnecting rations using this service will wait for recovery to complete. 00000100:00080000:2:1278394241.028763:0:7780:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN 00000100:00080000:2:1278394241.028769:0:7780:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:2:1278394241.028772:0:7780:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2 at tcp last attempt 4724724153 00000100:00080000:2:1278394241.028783:0:7780:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2 at tcp/172.16.16.2 at tcp 6.16.2 at tcp:28/4 lens 296/544 e 0 to 1 dl 1278394306 ref 1 fl Rpc:RS/0/0 rc 0/-107 g 00000100:00080000:2:1278394241.029211:0:22741:0:(import.c:182:ptlrpc_set_import_discon()) osc: import ffff81042018c000 already not connected (conn 180, was 179): CONNECTING 00010000:00020000:2:1278394241.029215:0:22741:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway 00010000:00020000:2:1278394241.029225:0:22741:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107 00000100:00080000:2:1278394241.029446:0:3572:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4) req at ffff8102e5404400 x1340041383549552/t0 o8->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278394268 ref 1 fl Interpret:RN/0/0 rc 0/0 00000100:00080000:2:1278394241.029451:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to EVICTED 
00000100:02020000:2:1278394241.029454:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail. 00000100:00080000:2:1278394241.029460:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0006_UUID at 172.16.16.2@tcp; invalidating 00000002:00080000:2:1278394241.029546:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00080000:2:1278394241.029564:0:7781:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0006-osc-ffff81043e754c00 to AC3-OST0006_UUID at 172.16.16.2@tcp 00000100:00080000:2:1278394241.029567:0:7781:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0006_UUID INVALID 00000100:00080000:2:1278394241.029594:0:7780:0:(client.c:2098:ptlrpc_queue_wait()) @@@ "mv" awake: (EVICTED == FULL or 1/0 == 1) req at ffff81012b338800 x1340041383514587/t0 o101->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278394306 ref 1 fl Rpc:RES/0/0 rc -4/-107 00000100:00020000:2:1278394241.029668:0:7780:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012b338800 x1340041383549557/t0 o101->AC3-OST0006_UUID at 172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0 00020000:00020000:2:1278394241.029685:0:7780:0:(lov_request.c:209:lov_update_enqueue_set()) enqueue objid 0x18ea266 subobj 0x80ab49 on OST idx 6: rc -108 00020000:01000000:2:1278394241.030074:0:7781:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID inactive 00000002:00080000:2:1278394241.030079:0:7781:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320 00000100:00020000:7:1278394241.052192:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81032a347800 x1340041383549562/t0 o3->AC3-OST0006_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:7:1278394241.052453:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff810436b9ac00 x1340041383549566/t0 o3->AC3-OST0006_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:7:1278394241.052749:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101767fe000 x1340041383549569/t0 o3->AC3-OST0006_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00000100:00020000:7:1278394241.053042:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012a81ac00 x1340041383549573/t0 o3->AC3-OST0006_UUID at 172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0 00020000:00080000:7:1278394241.221422:0:7301:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00000080:00020000:7:1278394241.221429:0:7301:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO 00020000:00080000:6:1278394241.232801:0:7640:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive 00000100:00080000:2:1278394241.435957:0:7781:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff81042018c000 AC3-OST0006_UUID: changing import state 
from EVICTED to RECOVER
00000100:00080000:2:1278394241.435965:0:7781:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0006_UUID at 172.16.16.2@tcp
00000100:00080000:2:1278394241.435968:0:7781:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL
00020000:01000000:2:1278394241.435975:0:7781:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID active
00000002:00080000:2:1278394241.435983:0:7781:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:2:1278394241.435989:0:7781:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2 at tcp.
ut (7s prior to deadline). req at ffff810346c34c00 x1340041411931387/t0 o101->AC3-OST0002_UUID at 172.16.16.4@tcp:28/4 lens 296/544 e 0 to 1 dl 1278419851 ref 1 fl Rpc:/0/0 rc 0/0
erations using this service will wait for recovery to complete.

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
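For anyone wanting to capture the same kind of trace, the text above is in the format of the Lustre kernel debug buffer; a minimal way to grab it on a client is sketched below, where the output file name is just an example and the extra debug flag is optional:

# Optionally widen the debug mask, then dump and clear the buffer to a file
echo +rpctrace > /proc/sys/lnet/debug
lctl dk /tmp/lustre-client-debug.log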
Hi Everyone,

Is there any sort of ETA on 1.8.4? Also,

> See bug 22897 for a description of the bug. But the fix is a simple
> one-liner in bug 22786, attachment 29866. The fix will first appear in
> lustre 1.8.4. I would highly recommend to anyone using 1.8.2 or 1.8.3
> that they add that patch.

I have attempted to build the OST/MDT RPMs with the patch mentioned by Chris previously against the currently running Lustre kernel source RPM (2.6.18-164.11.1.el5_lustre.1.8.2). When I install the resulting lustre and lustre-modules RPMs and attempt to remount the OST/MDT partitions, I get the following output for each partition. I've had to roll back to the RPMs from the Lustre website, as they work.

Jul 14 10:23:06 beryllium kernel: LDISKFS-fs: mounted filesystem md3 with ordered data mode
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_free_blocks
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_free_blocks
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_journal_start_sb
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_journal_start_sb
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol __ldiskfs_journal_stop
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol __ldiskfs_journal_stop
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_ext_search_right
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_ext_search_right
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_mark_inode_dirty
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_mark_inode_dirty
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_journal_abort_handle
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_journal_abort_handle
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_ext_calc_credits_for_insert
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_ext_calc_credits_for_insert
Jul 14 10:23:07 beryllium modprobe: FATAL: Error inserting fsfilt_ldiskfs (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.2/kernel/fs/lustre/fsfilt_ldiskfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_xattr_set_handle
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_xattr_set_handle
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_ext_insert_extent
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_ext_insert_extent
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_mb_new_blocks
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_mb_new_blocks
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_ext_search_left
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_ext_search_left
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_bread
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_bread
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_ext_walk_space
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_ext_walk_space
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_mb_discard_inode_preallocations
Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: Unknown symbol ldiskfs_mb_discard_inode_preallocations
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(fsfilt.c:124:fsfilt_get_ops()) Can't find fsfilt_ldiskfs interface
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(mgc_request.c:404:mgc_fs_setup()) No fstype ldiskfs rc=-256
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(mgc_request.c:876:mgc_set_info_async()) set_fs got -256
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(obd_mount.c:852:server_mgc_set_fs()) can't set_fs -256
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(obd_mount.c:1653:server_fill_super()) Unable to start targets: -256
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(obd_mount.c:1436:server_put_super()) no obd AC3-OST0000
Jul 14 10:23:07 beryllium kernel: LustreError: 12640:0:(obd_mount.c:147:server_deregister_mount()) AC3-OST0000 not registered
Jul 14 10:23:11 beryllium kernel: Lustre: 12720:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1341198938669057 sent from MGC172.16.16.5@tcp to NID 172.16.16.5@tcp 5s ago has timed out (5s prior to deadline).
Jul 14 10:23:11 beryllium kernel: req@ffff810155257800 x1341198938669057/t0 o250->MGS@MGC172.16.16.5@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1279066991 ref 1 fl Rpc:N/0/0 rc 0/0
Jul 14 10:23:11 beryllium kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Jul 14 10:23:11 beryllium kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Jul 14 10:23:11 beryllium kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Jul 14 10:23:11 beryllium kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Jul 14 10:23:12 beryllium kernel: Lustre: server umount AC3-OST0000 complete
Jul 14 10:23:12 beryllium kernel: LustreError: 12640:0:(obd_mount.c:2042:lustre_fill_super()) Unable to mount (-256)
Jul 14 10:23:12 beryllium kernel: LDISKFS-fs: barriers enabled
Jul 14 10:23:19 beryllium kernel: kjournald2 starting: pid 13086, dev md4:8, commit interval 5 seconds
Jul 14 10:23:19 beryllium kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 14 10:23:19 beryllium kernel: LDISKFS FS on md4, internal journal on md4:8
Jul 14 10:23:19 beryllium kernel: LDISKFS-fs: delayed allocation enabled
Jul 14 10:23:19 beryllium kernel: LDISKFS-fs: file extents enabled
Jul 14 10:23:19 beryllium kernel: LDISKFS-fs: mballoc enabled
Jul 14 10:23:19 beryllium kernel: LDISKFS-fs: recovery complete.

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
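(A quick sanity check before remounting is to confirm that the rebuilt packages really replaced the stock ones and target the running kernel. This is only a sketch using standard tools; the module path is the one shown in the modprobe error above.)

    # Sketch: confirm installed Lustre packages and module vermagic match the running kernel
    uname -r
    rpm -qa | grep -i lustre
    # the vermagic reported here should match `uname -r` exactly
    modinfo -F vermagic /lib/modules/$(uname -r)/kernel/fs/lustre/fsfilt_ldiskfs.ko
    modinfo -F vermagic ldiskfs    # resolved via modules.dep; should report the same kernel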
Yes, it is scheduled for release by the end of the month.

Peter Kitchener wrote:
> Is there any sort of ETA on 1.8.4?
On 2010-07-13, at 18:56, Peter Kitchener wrote:
> Is there any sort of ETA on 1.8.4?

The scheduled release date is July 31.

>> See bug 22897 for a description of the bug. But the fix is a simple
>> one-liner in bug 22786, attachment 29866. The fix will first appear in
>> lustre 1.8.4. I would highly recommend to anyone using 1.8.2 or 1.8.3
>> that they add that patch.
>
> I have attempted to build the OST/MDT RPMs with the patch mentioned by Chris previously against the currently running Lustre kernel source RPM (2.6.18-164.11.1.el5_lustre.1.8.2). When I install the resulting lustre and lustre-modules RPMs and attempt to remount the OST/MDT partitions, I get the following output for each partition. I've had to roll back to the RPMs from the Lustre website, as they work.
>
> Jul 14 10:23:06 beryllium kernel: LDISKFS-fs: mounted filesystem md3 with ordered data mode
> Jul 14 10:23:07 beryllium kernel: fsfilt_ldiskfs: disagrees about version of symbol ldiskfs_free_blocks

Looks like you either didn't rebuild the ldiskfs module, or you forgot to reload the Lustre modules before starting again. You can use "lustre_rmmod" to remove all of the Lustre modules.

If you rebooted, or did this already, and this error is still present, then it looks like you somehow didn't build the modules correctly.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
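(As a rough illustration of that suggestion on one of the OSS nodes: lustre_rmmod is the command named above, the rest is standard tooling; /dev/md3 is taken from the log, and /mnt/ost is a placeholder mount point.)

    # Sketch: unload every Lustre/ldiskfs module, confirm they are gone, then retry the mount
    umount -a -t lustre                        # in case any target is still mounted
    lustre_rmmod                               # removes all Lustre and ldiskfs modules
    lsmod | grep -E 'ldiskfs|fsfilt'           # the old ldiskfs/fsfilt_ldiskfs should no longer be listed
    modprobe -v fsfilt_ldiskfs                 # should now load cleanly, pulling in the matching ldiskfs
    mount -t lustre /dev/md3 /mnt/ost          # retry mounting the OST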
On 07/07/2010 01:04 AM, Gabriele Paciucci wrote:
> Hi,
> the ptlrpc bug is a problem, but I don't find in Peter's logs any
> reference to an eviction caused by the ptlrpc, but instead by a timeout
> during the communication between an OST and the client. But Peter could
> downgrade to 1.8.1.1, which does not suffer from the problem.

The bug that I describe does not have any messages about the ptlrpcd performing evictions. The server's "I think it's dead, and I am evicting it", and other messages about the server timing out on the client, are the only messages that you will see with the bug that I described.

But like I said, there are many possible reasons for timeouts, so it could easily be something else.

Chris
On Tuesday, July 20, 2010, Christopher J. Morrone wrote:
> On 07/07/2010 01:04 AM, Gabriele Paciucci wrote:
> > Hi,
> > the ptlrpc bug is a problem, but I don't find in Peter's logs any
> > reference to an eviction caused by the ptlrpc, but instead by a timeout
> > during the communication between an OST and the client. But Peter could
> > downgrade to 1.8.1.1, which does not suffer from the problem.
>
> The bug that I describe does not have any messages about the ptlrpcd
> performing evictions. The server's "I think it's dead, and I am
> evicting it", and other messages about the server timing out on the
> client, are the only messages that you will see with the bug that I
> described.
>
> But like I said, there are many possible reasons for timeouts, so it
> could easily be something else.

For the record, while stress testing Lustre I can easily reproduce evictions with any Lustre version. However, it is terribly difficult to debug them without additional tools. I have opened a bugzilla for that, but I don't think I will have time for those tools any time soon.

https://bugzilla.lustre.org/show_bug.cgi?id=23190

--
Bernd Schubert
DataDirect Networks
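(In the meantime, a rough way to capture more context around an eviction with the stock tools, assuming the usual /proc/sys/lnet paths; adjust to taste.)

    # Sketch: dump the kernel debug buffer on the affected node right after an eviction
    lctl dk /tmp/lustre-debug.log          # "dk" = debug_kernel: dump and clear the in-memory log
    # optionally widen the debug mask beforehand so RPC traffic is recorded
    echo +rpctrace > /proc/sys/lnet/debug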
Hi All,

So I've applied the patch for bug 22897 to the 1.8.2 source and rebuilt the Lustre RPMs. So far our development team has reported an improvement in the frequency of the I/O errors, but still not enough for them to be 100% happy.

In the meantime I've been double-checking all the settings and getting a better understanding of the timeouts. As we use software RAID on the OSTs, I've made sure that these are set up as the operations manual states.

My next port of call is going to be to update the bnx2x driver that our BCM57711 10Gig Ethernet cards use on all the clients and the MDT/OSTs. I doubt it's going to make a difference, but the version of the driver is the stock CentOS 5.4 one. I have also had a snoop around the switch to make sure there isn't any odd setting floating around that could be the cause of the problem.

Any thoughts?

Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
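(While chasing the remaining timeouts, these are the sorts of checks I'd run; only a sketch: the parameter names are the standard 1.8 adaptive-timeout ones, 172.16.16.2@tcp is one of the OSS nids from the earlier logs, and eth2 is a placeholder for the 10GbE interface.)

    # Sketch: inspect Lustre timeout settings and look for NIC-level drops on a client
    lctl get_param timeout at_min at_max at_history   # current (adaptive) timeout settings
    lctl ping 172.16.16.2@tcp                         # basic LNET reachability to an OSS
    ethtool -S eth2 | grep -iE 'drop|err|pause'       # per-NIC counters on the BCM57711 port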