Hi;

I have started getting numerous dump logs, timeouts and client evictions. Our environment:

200 compute nodes
The MDT/MGS is on an x86_64 server with 8 GB of memory and two dual-core AMD processors
The OSS is on an x86_64 server with 8 GB of memory and two dual-core AMD processors
One OST, RAID 6, ~8 TB (recently shrunk from 9 TB)
Lustre 1.6.4.2

The timeouts cause our login node to hang long enough to be noticeable to the users before recovery completes. I have attached a couple of the logs and the syslog from this morning; all of the logs and messages are pretty similar. If anyone has any suggestions, I would appreciate it.

Thanks,
Denise

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1229661071.27314
Type: application/octet-stream
Size: 9840 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081219/55dcf04a/attachment-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1229685498.27300.gz
Type: application/x-gzip
Size: 797686 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081219/55dcf04a/attachment-0001.bin
-------------- next part --------------
Dec 19 04:17:28 oss1 kernel: Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41 at tcp of 172.16.2.47 at tcp down 11396654252 seconds in the future
Dec 19 04:17:28 oss1 kernel: LustreError: 27065:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive from 12345-172.16.2.47 at tcp, ip 172.16.2.47:1021, with error
Dec 19 04:17:28 oss1 kernel: LustreError: 27065:0:(socklnd.c:1599:ksocknal_destroy_conn()) Skipped 5 previous similar messages
Dec 19 04:17:28 oss1 kernel: LustreError: 27065:0:(events.c:307:server_bulk_callback()) event type 2, status -5, desc 00000101b3b2a000
Dec 19 04:17:28 oss1 kernel: LustreError: 27311:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req at 000001013b81ec00 x1689415/t0 o4->bece0307-a4e2-5aa1-3b02-673feb3c5e7e at NET_0x20000ac10022f_UUID:-1 lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 19 04:17:28 oss1 kernel: LustreError: 27311:0:(ost_handler.c:1073:ost_brw_write()) Skipped 5 previous similar messages
Dec 19 04:17:28 oss1 kernel: Lustre: 27311:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring bulk IO comm error with bece0307-a4e2-5aa1-3b02-673feb3c5e7e at NET_0x20000ac10022f_UUID id 12345-172.16.2.47 at tcp - client will retry
Dec 19 04:17:28 oss1 kernel: Lustre: 27311:0:(ost_handler.c:1205:ost_brw_write()) Skipped 15 previous similar messages
Dec 19 04:17:28 oss1 kernel: Lustre: 27138:0:(ldlm_lib.c:519:target_handle_reconnect()) lustre-OST0000: bece0307-a4e2-5aa1-3b02-673feb3c5e7e reconnecting
Dec 19 04:17:28 oss1 kernel: Lustre: 27138:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 20 previous similar messages
Dec 19 04:17:28 oss1 kernel: Lustre: 27138:0:(ldlm_lib.c:747:target_handle_connect()) lustre-OST0000: refuse reconnection from bece0307-a4e2-5aa1-3b02-673feb3c5e7e at 172.16.2.47@tcp to 0x00000101e69aa000; still busy with 8 active RPCs
Dec 19 04:17:28 oss1 kernel: Lustre: 27138:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 1 previous similar message
Dec 19 04:17:28 oss1 kernel: LustreError: 27138:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req at 000001023c7bfe00 x1689445/t0 o8->bece0307-a4e2-5aa1-3b02-673feb3c5e7e at NET_0x20000ac10022f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
Dec 19 04:17:40 oss1 kernel: 
Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41 at tcp of 172.16.2.47 at tcp down 11397159540 seconds in the future Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 27277: it was inactive for 100s Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Skipped 1 previous similar message Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 27277 Dec 19 04:18:18 oss1 kernel: ll_ost_io_67 S 0000010218f55af8 0 27300 1 27301 27299 (L-TLB) Dec 19 04:18:18 oss1 kernel: 0000010218f55948 0000000000000046 0000010140002030 ffffffff00000073 Dec 19 04:18:18 oss1 kernel: 0000000000000001 00000000a02ba9f2 0000010140067a80 0000000300000004 Dec 19 04:18:18 oss1 kernel: 000001021c220800 00000000000000a2 Dec 19 04:18:18 oss1 kernel: Call Trace:<4>Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) Skipped 2 previous similar messages Dec 19 04:18:18 oss1 kernel: ll_ost_io_44 S 0000010198895af8 0 27277 1 27278 27276 (L-TLB) Dec 19 04:18:18 oss1 kernel: 0000010198895948 0000000000000046 0000010037ef1030 ffffffff00000073 Dec 19 04:18:18 oss1 kernel: 0000000000000001 00000000a02ba9f2 000001014005fa80 0000000100000004 Dec 19 04:18:18 oss1 kernel: 00000102380b8030 000000000000010b Dec 19 04:18:18 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer+293} <ffffffff8013f448>{__mod_timer+293} <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:18 oss1 kernel: <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:18 oss1 kernel: <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:18 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:18 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528}<ffffffff80133566>{default_wake_function+0} Dec 19 04:18:18 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:18 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:18 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:18 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:18 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:18 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:18 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:18 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:18 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:18 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27300 Dec 19 04:18:19 oss1 kernel: <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:19 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:19 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} 
<ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:19 oss1 kernel: ll_ost_io_93 S 000001016dafbaf8 0 27326 1 27327 27325 (L-TLB) Dec 19 04:18:19 oss1 kernel: 000001016dafb948 0000000000000046 00000102380b8030 ffffffff00000073 Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2 000001014005fa80 0000000100000004 Dec 19 04:18:19 oss1 kernel: 00000101d1dd4800 00000000000000c8 Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer+293} <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:19 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:19 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:19 oss1 kernel: ll_ost_io_89 S 00000101bbe01af8 0 27322 1 27323 27321 (L-TLB) Dec 19 04:18:19 oss1 kernel: 00000101bbe01948 0000000000000046 0000010037ef1030 ffffffff00000073 Dec 19 04:18:19 oss1 kernel: ll_ost_io_31 S 000001016257daf8 0 27264 1 27265 27263 (L-TLB) Dec 19 04:18:19 oss1 kernel: 000001016257d948 0000000000000046 0000010140002030 ffffffff00000073 Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2 0000010140067a80 0000000300000004 Dec 19 04:18:19 oss1 kernel: 0000010239280800 0000000000000092 Dec 19 04:18:19 oss1 kernel: Call Trace:0000000000000001 00000000a02ba9f2 000001014005fa80 0000000100000004 Dec 19 04:18:19 oss1 kernel: 0000010227575800 00000000000000d5 Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer+293} <ffffffff8013f448>{__mod_timer+293} <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:19 oss1 kernel: <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:19 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} 
<ffffffffa057c4ab>{:ost:ost_handle+12187}<ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:19 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:19 oss1 kernel: Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337}<ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:19 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff801335b7>{__wake_up_common+67}<ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} Dec 19 04:18:19 oss1 kernel: <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:19 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} <ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:19 oss1 kernel: ll_ost_io_01 S 0000010237121af8 0 27234 1 27235 27233 (L-TLB) Dec 19 04:18:19 oss1 kernel: 0000010237121948 0000000000000046 0000010227575800 ffffffff00000073 Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2 000001014005fa80 0000000100000004 Dec 19 04:18:19 oss1 kernel: 0000010237fb5800 00000000000000c0 Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:19 oss1 kernel: <ffffffff8013f448>{__mod_timer+293} <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:19 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:19 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:19 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:19 oss1 kernel: ll_ost_io_02 S 000001014d971af8 0 27235 1 27236 
27234 (L-TLB) Dec 19 04:18:19 oss1 kernel: 000001014d971948 0000000000000046 0000010140002030 ffffffff00000073 Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2 0000010140067a80 0000000300000004 Dec 19 04:18:19 oss1 kernel: 0000010237fb5030 000000000000009c Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer+293} <ffffffff80321053>{schedule_timeout+367} Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout+0} <ffffffffa0576998>{:ost:ost_brw_write+7448} Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read+8528} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:20 oss1 kernel: <ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69} Dec 19 04:18:20 oss1 kernel: <ffffffffa0571590>{:ost:ost_bulk_timeout+0} <ffffffffa057c4ab>{:ost:ost_handle+12187} Dec 19 04:18:20 oss1 kernel: <ffffffff80131923>{recalc_task_prio+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920} Dec 19 04:18:20 oss1 kernel: <ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830} Dec 19 04:18:20 oss1 kernel: <ffffffffa028c02e>{:libcfs:lcw_update_time+30} <ffffffff8013f448>{__mod_timer+293} Dec 19 04:18:20 oss1 kernel: <ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456} <ffffffff80133566>{default_wake_function+0} Dec 19 04:18:20 oss1 kernel: <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0} Dec 19 04:18:20 oss1 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0} Dec 19 04:18:20 oss1 kernel: <ffffffff80110ddb>{child_rip+0} Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27235 Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27234 Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27264 Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27322 Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27326 Dec 19 04:18:20 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27277 Dec 19 04:20:48 oss1 kernel: LustreError: 27300:0:(ost_handler.c:1062:ost_brw_write()) @@@ timeout on bulk GET req at 000001013b83c800 x1689418/t0 o4->bece0307-a4e2-5aa1-3b02-673feb3c5e7e at NET_0x20000ac10022f_UUID:-1 lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0 Dec 19 04:20:48 oss1 kernel: LustreError: 27300:0:(ost_handler.c:1062:ost_brw_write()) Skipped 1 previous similar message Dec 19 04:20:48 oss1 kernel: Lustre: 27300:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring bulk IO comm error with bece0307-a4e2-5aa1-3b02-673feb3c5e7e at NET_0x20000ac10022f_UUID id 12345-172.16.2.47 at tcp - client will retry Dec 19 04:20:48 oss1 kernel: Lustre: 27300:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 27300 disabled after 250.0272s Dec 19 04:20:48 oss1 kernel: Lustre: 27300:0:(watchdog.c:312:lcw_update_time()) Skipped 1 previous similar message Dec 19 04:25:04 oss1 kernel: Lustre: 27182:0:(ldlm_lib.c:519:target_handle_reconnect()) lustre-OST0000: 8789d360-6d52-0f7f-3558-f376de52aaf3 reconnecting Dec 19 04:27:53 oss1 kernel: Lustre: 27173:0:(ldlm_lib.c:519:target_handle_reconnect()) lustre-OST0000: bece0307-a4e2-5aa1-3b02-673feb3c5e7e reconnecting Dec 19 07:45:56 oss1 kernel: Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41 at tcp of 172.16.2.13 at tcp down 11521628444 seconds in the future Dec 19 07:47:59 oss1 kernel: LustreError: 138-a: lustre-OST0000: A client on nid 172.16.2.13 at 
tcp was evicted due to a lock glimpse callback to 172.16.2.13 at tcp timed out: rc -107
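A knob that often comes up alongside messages like these is the global Lustre RPC timeout (obd_timeout). The lines below are only a sketch of a possible mitigation, assuming the stock 1.6.x /proc paths and that the filesystem name is "lustre" (it appears as lustre-OST0000 above); the 1.6 default of 100 seconds also lines up with the "inactive for 100s" watchdogs later in the thread:

    # current value (check on servers and clients)
    cat /proc/sys/lustre/timeout
    # raise it temporarily, e.g. to 300 seconds, on each node
    echo 300 > /proc/sys/lustre/timeout
    # or set it persistently for the whole filesystem, run on the MGS
    lctl conf_param lustre.sys.timeout=300

Raising the timeout only papers over an unreliable network, but it can keep clients from being evicted while the real cause is tracked down.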
On Fri, Dec 19, 2008 at 08:42:16AM -0700, Denise Hummel wrote:
> Hi;
>
> I have started getting numerous dump logs, timeouts and client
> evictions. Our environment:
> ......
> Dec 19 04:17:28 oss1 kernel: Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41 at tcp of 172.16.2.47 at tcp down 11396654252 seconds in the future

Please look in bug 14300 for a fix of this.

Isaac
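For anyone hitting the same messages, it is worth confirming which Lustre build is actually loaded on the servers before and after picking up the fix from bug 14300. A minimal sketch, assuming the stock 1.6.x /proc layout and an RPM-based install:

    # running Lustre version on the OSS/MDS
    cat /proc/fs/lustre/version
    # installed Lustre packages
    rpm -qa | grep -i lustre

The huge "down ... seconds in the future" values in the lnet_notify message look more like bad time bookkeeping than real clock skew, but running ntpq -p (or simply date) on the servers and a few clients to confirm the clocks agree is a cheap cross-check.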
Hi Isaac;

> > Dec 19 04:17:28 oss1 kernel: Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41 at tcp of 172.16.2.47 at tcp down 11396654252 seconds in the future
>
> Please look in bug 14300 for a fix of this.
>
> Isaac

I did not see where this bug would fix all the dumps I am getting - about 50 a day now:

Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 27277: it was inactive for 100s
Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Skipped 1 previous similar message
Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 27277
Dec 19 04:18:18 oss1 kernel: ll_ost_io_44 S 0000010198895af8 0 27277 1 27278 27276 (L-TLB)
Dec 19 04:18:18 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1229685498.27300

... and much more of the same (the ll_ost_io thread stack traces) - it is in the log I sent with the original email. With these dumps I end up with timeouts and evictions and noticeable hangs until recovery. Does the fix for bug 14300 fix this issue also?

Thanks,
Denise
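For what it is worth, the /tmp/lustre-log.* files referenced in these messages are binary Lustre debug dumps rather than plain text; lctl can convert them into something readable. A small sketch, using one of the filenames from the messages above:

    lctl debug_file /tmp/lustre-log.1229685498.27300 /tmp/lustre-log.1229685498.27300.txt
    less /tmp/lustre-log.1229685498.27300.txt

The decoded log normally shows what the blocked ll_ost_io thread was doing when the watchdog fired, which makes the dumps easier to line up against the syslog timeline.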
Hi;

This is a continuation of the timeouts and dumps. The login node has hung three times today - no one can log in, and anyone already logged in cannot work. The hangs last five minutes or more. The messages on the OSS are:

Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A client on nid 172.16.100.1 at tcp was evicted due to a lock blocking callback to 172.16.100.1 at tcp timed out: rc -107
Dec 22 13:00:44 oss1 kernel: LustreError: 27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET req at 00000100bff5c800 x91545/t0 o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1 lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:44 oss1 kernel: LustreError: 27166:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export 0000010087dc1000 ns: filter-lustre-OST0000_UUID lock: 0000010191bd54c0/0x0 lrc: 1/0,0 mode: --/PW res: 14313567/0 rrc: 1 type: EXT [0->8191] (req 0->8191) flags: 20000080 remote: 0x4c7a2f2234322fa9 expref: 29 pid: 27166
Dec 22 13:00:44 oss1 kernel: LustreError: 27166:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req at 00000101397be600 x91686/t0 o101->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1 lens 232/288 ref 0 fl Interpret:/0/0 rc -107/0
Dec 22 13:00:44 oss1 kernel: Lustre: 27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring bulk IO comm error with ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id 12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:44 oss1 kernel: LustreError: 27314:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET req at 0000010065c0ca00 x91546/t0 o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1 lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError: 27256:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET req at 0000010008df4a00 x91586/t0 o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1 lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError: 27256:0:(ost_handler.c:1065:ost_brw_write()) Skipped 1 previous similar message
Dec 22 13:00:45 oss1 kernel: Lustre: 27285:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring bulk IO comm error with ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id 12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:45 oss1 kernel: Lustre: 27285:0:(ost_handler.c:1205:ost_brw_write()) Skipped 1 previous similar messages
Dec 22 13:00:48 oss1 kernel: Lustre: 27245:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring bulk IO comm error with ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id 12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:48 oss1 kernel: Lustre: 27245:0:(ost_handler.c:1205:ost_brw_write()) Skipped 3 previous similar messages
Dec 22 13:00:48 oss1 kernel: LustreError: 29210:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req at 00000100bff84600 x91689/t0 o13-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Dec 22 13:00:48 oss1 kernel: LustreError: 27116:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from 12345-172.16.100.1 at tcp with bad export cookie 13571729019316464878
Dec 22 13:00:48 oss1 kernel: LustreError: 27114:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from 12345-172.16.100.1 at tcp with bad export cookie 13571729019316464878

The messages in the syslog on the login node are:

Dec 22 13:00:45 head kernel: LustreError: 11-0: an error occurred while communicating with 172.16.100.41 at tcp. The ldlm_enqueue operation failed with -107
Dec 22 13:00:45 head kernel: Lustre: lustre-OST0000-osc-000001018197f800: Connection to service lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Dec 22 13:00:58 head kernel: LustreError: 11-0: an error occurred while communicating with 172.16.100.41 at tcp. The ost_statfs operation failed with -107
Dec 22 13:00:58 head kernel: LustreError: 3270:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError: 3270:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed: -107
Dec 22 13:00:58 head kernel: LustreError: 3271:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError: 3271:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed: -107
Dec 22 13:00:58 head kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:00:58 head kernel: LustreError: 3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for recoverable error req at 00000101819c4a00 x91544/t0 o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl Complete:E/0/0 rc -5/-22
Dec 22 13:00:58 head kernel: LustreError: 3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 8 previous similar messages
Dec 22 13:00:58 head kernel: LustreError: 19930:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:00:58 head kernel: LustreError: 19930:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Dec 22 13:00:58 head kernel: LustreError: 19930:0:(file.c:1052:ll_glimpse_size()) Skipped 543 previous similar messages
Dec 22 13:00:58 head kernel: LustreError: 19998:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Dec 22 13:00:58 head kernel: LustreError: 19998:0:(file.c:1052:ll_glimpse_size()) Skipped 195 previous similar messages
Dec 22 13:01:00 head kernel: LustreError: 3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for recoverable error req at 000001017aa67c00 x91546/t0 o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl Complete:E/0/0 rc -5/-22
Dec 22 13:01:00 head kernel: LustreError: 3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 1 previous similar message
Dec 22 13:01:00 head kernel: LustreError: 3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 00000100080d9400 x92325/t0 o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError: 3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 00000100f9fda200 x93266/t0 o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError: 3248:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 35 previous similar messages
Dec 22 13:01:00 head kernel: LustreError: Skipped 5 previous similar messages
Dec 22 13:01:00 head kernel: LustreError: 19947:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 00000102faba0200 x91689/t0 o13->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 128/272 ref 1 fl Rpc:RS/0/0 rc -107/-107
Dec 22 13:01:00 head kernel: LustreError: 19947:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
Dec 22 13:01:00 head kernel: LustreError: 19947:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc -5
Dec 22 13:01:00 head kernel: LustreError: 20029:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc -5
Dec 22 13:01:00 head kernel: Lustre: lustre-OST0000-osc-000001018197f800: Connection restored to service lustre-OST0000 using nid 172.16.100.41 at tcp.
Dec 22 13:09:10 head kernel: LustreError: 11-0: an error occurred while communicating with 172.16.100.41 at tcp. The ldlm_enqueue operation failed with -107
Dec 22 13:09:10 head kernel: Lustre: lustre-OST0000-osc-000001018197f800: Connection to service lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Dec 22 13:09:24 head kernel: LustreError: 3268:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Dec 22 13:09:24 head kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:09:24 head kernel: LustreError: 23487:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:09:24 head kernel: LustreError: 23487:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Dec 22 13:09:24 head kernel: LustreError: 23487:0:(file.c:1052:ll_glimpse_size()) Skipped 131 previous similar messages
Dec 22 13:09:24 head kernel: LustreError: 23487:0:(namei.c:1132:ll_objects_destroy()) obd destroy objid 0xa485b6 error -5
Dec 22 13:09:24 head kernel: Lustre: lustre-OST0000-osc-000001018197f800: Connection restored to service lustre-OST0000 using nid 172.16.100.41 at tcp.

I also have 56 Lustre dump logs from a span of about two minutes. I have attached lustre.logs.tar.gz with all of the dumps. If anyone has any suggestions, I would appreciate it.

Denise

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.logs.tar.gz
Type: application/x-compressed-tar
Size: 93452 bytes
Desc: not available
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081222/b34ffa99/attachment-0001.bin
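One quick sanity check, since the OSS messages above all point at the client on nid 172.16.100.1@tcp: it is worth confirming from the login node itself that this really is its NID and that LNet can reach the OSS at all. A minimal sketch, run on the login node (172.16.100.41@tcp is the OSS NID taken from the logs above):

    # list this node's LNet NIDs
    lctl list_nids
    # LNet-level round trip to the OSS
    lctl ping 172.16.100.41@tcp

If the ping stalls or fails while the evictions are happening, the problem is below Lustre, in the TCP path between the two nodes.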
On Dec 22, 2008 13:22 -0700, Denise Hummel wrote:
> Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
> client on nid 172.16.100.1 at tcp was evicted due to a lock blocking
> callback to 172.16.100.1 at tcp timed out: rc -107
> Dec 22 13:00:44 oss1 kernel: LustreError:
> 27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
> req at 00000100bff5c800 x91545/t0
> 27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
> bulk IO comm error with

These messages could relate to network problems on the oss1 node. That said, this is most interesting if only oss1 is showing these messages. In particular, "eviction on bulk GET" indicates the network stopped working in the middle of a data transfer.

> The messages in the syslog on the login node are:
> lustre-OST0000-osc-000001018197f800: Connection to service
> lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress
> operations using this service will wait for recovery to complete.

This is just the client's version of the same issue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
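A minimal sketch of the kind of low-level checks implied here, run on oss1; the interface name eth0 is an assumption, so substitute whatever interface carries the tcp LNet traffic:

    # NIC-level error and drop counters
    ethtool -S eth0 | grep -i -E 'err|drop|crc'
    ifconfig eth0 | grep -i -E 'errors|dropped'
    # TCP retransmissions and failures
    netstat -s | grep -i -E 'retrans|fail'
    # LNet's own message counters
    cat /proc/sys/lnet/stats

Counters that climb steadily while the evictions happen would point at the cabling or switch path rather than at Lustre itself.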
Hi;

Thanks. I have suspected the network but have not been able to pinpoint the problem. I have looked at the Ethernet and InfiniBand switches and found a few with IGMP turned on and some multicast issues; those have been fixed. I am checking the network stats on the OSS, MDT and nodes and find a few dropped packets, but the system stats do not indicate a heavy load during that time. If anyone has any suggestions on anything else I can look at, please let me know.

Thanks for all of your help,
Denise

On Mon, 2008-12-22 at 20:49 -0700, Andreas Dilger wrote:
> On Dec 22, 2008 13:22 -0700, Denise Hummel wrote:
> > Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
> > client on nid 172.16.100.1 at tcp was evicted due to a lock blocking
> > callback to 172.16.100.1 at tcp timed out: rc -107
> > Dec 22 13:00:44 oss1 kernel: LustreError:
> > 27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
> > req at 00000100bff5c800 x91545/t0
> > 27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
> > bulk IO comm error with
>
> These messages could relate to network problems on the oss1 node. That
> said, this is most interesting if only oss1 is showing these messages.
> In particular, "eviction on bulk GET" indicates the network stopped
> working in the middle of a data transfer.
>
> > The messages in the syslog on the login node are:
> > lustre-OST0000-osc-000001018197f800: Connection to service
> > lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress
> > operations using this service will wait for recovery to complete.
>
> This is just the client's version of the same issue.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
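Two more things that can be watched from the Lustre side while the switches are being checked, sketched here on the assumption that the stock 1.6 /proc entries are present:

    # per-peer state and queue depths as LNet on the OSS sees them
    cat /proc/sys/lnet/peers
    # LNet round trip from the OSS to one of the clients named in the evictions
    lctl ping 172.16.2.47@tcp

A peer that sits with messages queued for a long time, or a ping that stalls, narrows the problem down to that client's network path.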
On Tue, Dec 23, 2008 at 06:45:09AM -0700, Denise Hummel wrote:
> Hi;
>
> Thanks. I have suspected the network but have not been able to
> pinpoint the problem. I have looked at the Ethernet and InfiniBand
> switches and found a few with IGMP turned on and some multicast issues;
> those have been fixed. I am checking the network stats on the OSS, MDT
> and nodes and find a few dropped packets, but the system stats do
> not indicate a heavy load during that time. If anyone has any
> suggestions on anything else I can look at, please let me know.

Turning on console logging of low-level network errors may reveal something useful:

echo +neterror > /proc/sys/lnet/printk

Isaac
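As a small addition to the suggestion above: the same /proc file shows which message classes are currently being printed, and (if I remember the mask syntax correctly) the flag can be dropped again with a leading '-' once enough evidence has been captured:

    # what is currently mirrored to the console
    cat /proc/sys/lnet/printk
    # enable low-level network error logging
    echo +neterror > /proc/sys/lnet/printk
    # turn it back off afterwards
    echo -neterror > /proc/sys/lnet/printk

The extra messages land in dmesg/syslog on the node where the flag was set, so enabling it on both oss1 and the login node covers both ends of the failing transfers.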
Hi;

Thanks for all of the help and suggestions. I was able to narrow the problem down to two racks of Intel nodes. Checking the HP ProCurve switches in each rack showed that LACP was enabled on all of the ports, which caused ports to be randomly blocked by LACP with off-line and on-line messages. Lustre was being accurate: the nodes really were timing out (repeatedly), and I think that caused a cascade effect on the rest of the nodes. I turned off LACP on Dec. 24 and have not had a Lustre timeout since.

Again, thanks for all of your help,
Denise

On Tue, 2008-12-23 at 20:04 -0500, Isaac Huang wrote:
> On Tue, Dec 23, 2008 at 06:45:09AM -0700, Denise Hummel wrote:
> > Hi;
> >
> > Thanks. I have suspected the network but have not been able to
> > pinpoint the problem. I have looked at the Ethernet and InfiniBand
> > switches and found a few with IGMP turned on and some multicast issues;
> > those have been fixed. I am checking the network stats on the OSS, MDT
> > and nodes and find a few dropped packets, but the system stats do
> > not indicate a heavy load during that time. If anyone has any
> > suggestions on anything else I can look at, please let me know.
>
> Turning on console logging of low-level network errors may reveal
> something useful:
>
> echo +neterror > /proc/sys/lnet/printk
>
> Isaac
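For anyone chasing the same symptom later: after a switch change like this, a cheap way to confirm the problem has really gone away is to keep watching the OSS syslog for the old signature messages (the path below assumes syslog goes to /var/log/messages); on the ProCurve side, show lacp (command name from memory of the ProCurve CLI) should confirm the ports are no longer running LACP.

    # on oss1: any new watchdogs, bulk timeouts or evictions since the change?
    grep -i -E 'watchdog|timeout on bulk|evicted' /var/log/messages | tail -20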