Abe Ingersoll
2009-Aug-27 01:52 UTC
[Lustre-discuss] lustre errors when system stressed; bad hardware?
Is this likely bad IB hardware/switch/cables? Bad RAM? It's lustre-1.8.1 on CentOS-5.x with four OSSs exporting two DataDirect/FC OSTs each, a separate MDT/MGS, and clients, all over QLogic ib_ipath IB. The clients are simultaneously running iozone against the lustre fs:

iozone -a -g 32G -f /mnt/sparta/iozone/iozone.`hostname`.file -M -R -b report-`hostname`.xls -i 0 -i 1 -i 2 -i 3 -i 4 -i 5 -i 6 -n 4g -y 4096

One client and an OSS spit out the errors below; iozone appears to continue on just fine.

client:

Lustre: 2637:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked()) Closing conn to 10.168.22.106@o2ib: error 0(waiting)
LustreError: 3898:0:(events.c:66:request_out_callback()) @@@ type 4, status -103 req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OST0003_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 3 fl Rpc:/0/0 rc 0/0
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289524 sent from sparta-OST0003-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 0s ago has failed due to network error (limit 7s). req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OST0003_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection to service sparta-OST0003 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.168.22.106@o2ib. The ost_connect operation failed with -16
LustreError: Skipped 2 previous similar messages
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289487 sent from sparta-OST0002-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 7s ago has timed out (limit 7s). req@ffff8102192b6400 x1312117388289487/t0 o4->sparta-OST0002_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335106 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection to service sparta-OST0002 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection restored to service sparta-OST0002 using nid 10.168.22.106@o2ib.
Lustre: 3928:0:(import.c:508:import_select_connection()) sparta-OST0003-osc-ffff8102198f1000: tried all connections, increasing latency to 6s
Lustre: 3928:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection restored to service sparta-OST0003 using nid 10.168.22.106@o2ib.
oss:

Lustre: 4790:0:(o2iblnd_cb.c:955:kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1 waiting 1: failed 12
Lustre: 4790:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked()) Closing conn to 10.168.22.104@o2ib: error -5(waiting)
LustreError: 4790:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8100ae208000
LustreError: 4790:0:(events.c:367:server_bulk_callback()) event type 2, status -5, desc ffff8100ae208000
LustreError: 5086:0:(ost_handler.c:1014:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff8101fc9fc800 x1312117388289524/t0 o4->2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID:0/0 lens 448/416 e 0 to 0 dl 1251335106 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 5086:0:(ost_handler.c:1150:ost_brw_write()) sparta-OST0003: ignoring bulk IO comm error with 2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID id 12345-10.168.22.104@o2ib - client will retry
Lustre: 4953:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0003: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
Lustre: 4953:0:(ldlm_lib.c:835:target_handle_connect()) sparta-OST0003: refuse reconnection from 2920ef40-0b97-644f-178a-5e74613e467b@10.168.22.104@o2ib to 0xffff810421231000; still busy with 2 active RPCs
LustreError: 4953:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-16) req@ffff8103baae8c00 x1312117388289527/t0 o8->2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID:0/0 lens 368/264 e 0 to 0 dl 1251335200 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 4953:0:(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: 5075:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0002: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
Lustre: 5010:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0003: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
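For what it's worth, my rough plan for spot-checking the links on the client and the OSS that logged these errors is below -- just the stock infiniband-diags tools from OFED, so the exact tool names and flags may differ on other OFED releases. Does this look like the right place to start?

# Local HCA state, rate and LID (run on both the client and the OSS).
ibstat

# Link state, width and speed of every link visible from this node;
# a port stuck at 1x/SDR, or one that keeps flapping, tends to show up here.
iblinkinfo

# Per-port error counters (symbol errors, link downed, rcv errors, ...).
perfquery

# Reset the counters, rerun the iozone load, then query again so the
# numbers reflect only the stressed period.
perfquery -R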
Isaac Huang
2009-Aug-27 21:37 UTC
[Lustre-discuss] lustre errors when system stressed; bad hardware?
On Wed, Aug 26, 2009 at 06:52:24PM -0700, Abe Ingersoll wrote:
> ......
> kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1
> waiting 1: failed 12

12 == IB_WC_RETRY_EXC_ERR, which usually indicates faulty links in the network, or some other application (like an MPI application) hogging network resources unfavorably against Lustre. We once observed such errors at times when there was no IO at all - a bad MPI implementation was resending so aggressively upon RNR that even the tiny bit of keepalive traffic from Lustre would end up with IB_WC_RETRY_EXC_ERR.

Diagnostics from OFED and the fabric should point you to faulty hardware, and setting up IB QoS should prevent Lustre from being hurt badly by someone else. Meanwhile, there's a potential workaround mentioned here:

https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36

But it's certainly not a good solution in the long run.

Thanks,
Isaac
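P.S. By "diagnostics from OFED" I mean roughly the following; the exact tool names and options depend on which OFED release you have installed:

# Fabric-wide sweep from any node that can reach the subnet manager:
# reports bad links, speed/width mismatches and excessive error counters.
ibdiagnet

# Summarize ports across the fabric whose error counters exceed thresholds.
ibqueryerrors

# Clear the port counters fabric-wide, rerun the iozone load for a while,
# then re-check -- counters that climb only under load usually point at
# marginal cables or ports.
ibclearerrors
ibclearcounters
ibqueryerrors

IB QoS itself is configured in the subnet manager (with OpenSM, QoS is enabled via opensm -Q plus a QoS policy file), but the details depend heavily on which SM you run, so I won't guess at a configuration here.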