Brock Palen
2008-Sep-05 02:58 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
I am having clients lose their connection to the MDS.  Messages on the clients look like this:

Sep 4 19:51:30 nyx-login2 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc44e800: Connection to service nobackup-MDT0000 via nid 10.164.3.246 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Sep 4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246 at tcp. The mds_connect operation failed with -16

It keeps trying to connect and spitting out "mds_connect failed -16".  The clients never recover.

On the MDS all I see is:

Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3 at 141.212.31.43@tcp to 0x000001037c109000; still busy with 3 active RPCs

This RPC message is common across many hosts.  Clients and servers are all using TCP.

Is this enough information?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
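A note for readers decoding these messages: the -16 in "mds_connect operation failed with -16" is the kernel errno -EBUSY, which is consistent with the server's "still busy with N active RPCs" refusal.  A quick way to confirm an errno from any client shell (a sketch; uses the Python 2 that ships with distributions of this era):

  python -c 'import errno, os; print errno.errorcode[16], os.strerror(16)'
  EBUSY Device or resource busy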
Brian J. Murrell
2008-Sep-05 03:19 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
On Thu, 2008-09-04 at 22:58 -0400, Brock Palen wrote:
> It keeps trying to connect and spitting out "mds_connect failed -16".
> The clients never recover.
> [...]
> Is this enough information?

Probably.  If you are running 1.6.5, try disabling statahead on all of
your clients...

# echo 0 > /proc/fs/lustre/.../statahead_max

Of course, this setting goes back to its default of 32 on a reboot.

b.
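For anyone applying this workaround cluster-wide, here is a minimal sketch.  It assumes root ssh access to the clients and that Brian's elided "..." is the 1.6-era path /proc/fs/lustre/llite/<fs-instance>/statahead_max; the host names are placeholders, not from this thread:

  #!/bin/sh
  # Sketch only: disable Lustre statahead on a list of client nodes.
  for host in nyx-login1 nyx-login2; do
      ssh root@"$host" \
          'for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > "$f"; done'
  done

Since the value reverts to 32 at boot, the inner one-liner would also need to run from an init script (e.g. rc.local) on each client to persist across reboots.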
Brock Palen
2008-Sep-05 03:34 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
>> Is this enough information?
>
> Probably.  If you are running 1.6.5, try disabling statahead on all of
> your clients...
>
> # echo 0 > /proc/fs/lustre/.../statahead_max

I thought statahead was fixed in 1.6.5?  That was the main reason we
upgraded, and the login nodes are already showing that behavior again.
I will try it out.

> Of course, this setting goes back to its default of 32 on a reboot.
>
> b.
Brock Palen
2008-Sep-05 04:15 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
Looks like that didn't fix it.  One of the login nodes repeated the
behavior.

The strange thing is that the MDS does not show anything about the NID
of the client.  The client just says it lost the connection, but the
MDS never says it has not heard from the client and is kicking it out.
Very strange.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Sep 4, 2008, at 11:34 PM, Brock Palen wrote:
> I thought statahead was fixed in 1.6.5?  That was the main reason we
> upgraded, and the login nodes are already showing that behavior again.
> I will try it out.
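When a client is wedged like this, the state of its connection to the MDS can be checked directly on the client.  A sketch, assuming the 1.6-era /proc layout (the mdc device name varies per mount, hence the wildcard) and using the MDS NID from the logs above:

  cat /proc/fs/lustre/mdc/*-mdc-*/import   # import state: FULL, DISCONN, ...
  lctl ping 10.164.3.246@tcp               # raw LNET reachability to the MDS

If lctl ping succeeds while the import stays stuck, the network path is fine and the problem is at the Lustre/RPC layer, matching the symptoms in this thread.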
Brian J. Murrell
2008-Sep-05 14:01 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
On Fri, 2008-09-05 at 00:15 -0400, Brock Palen wrote:
> Looks like that didn't fix it.  One of the login nodes repeated the
> behavior.

So what are the messages the client logged when the problem occurred?
And what, if anything, was logged on the MDS at the same time?

b.
Brock Palen
2008-Sep-05 14:21 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
I had to reboot the MDS to get the problem to go away.  I will watch
and see if it reappears.  I screwed up and deleted the wrong
/var/log/messages, so I don't have the messages.

I am watching this issue.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Sep 5, 2008, at 10:01 AM, Brian J. Murrell wrote:
> So what are the messages the client logged when the problem occurred?
> And what, if anything, was logged on the MDS at the same time?
>
> b.
Jerome, Ron
2008-Sep-05 14:48 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
For what it's worth, I've seen similar problems with clients not being able to connect to OSSs.

SERVER OS: Linux oss1 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1 SMP Thu Jun 26 01:38:50 EDT 2008 i686 i686 i386 GNU/Linux
CLIENT OS: Linux x15 2.6.18-53.1.14.el5_lustre.1.6.5smp #1 SMP Mon May 12 22:24:24 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

On the client side I see this...

==================================== CLIENT LOG ================================================
Sep 1 15:17:22 x15 kernel: Lustre: Request x30990319 sent from data-OST0004-osc-ffff81022067ec00 to NID 192.168.0.201 at tcp 100s ago has timed out (limit 100s).
Sep 1 15:17:22 x15 kernel: Lustre: Skipped 9 previous similar messages
Sep 1 15:17:22 x15 kernel: Lustre: data-OST0004-osc-ffff81022067ec00: Connection to service data-OST0004 via nid 192.168.0.201 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Sep 1 15:17:22 x15 kernel: LustreError: 3834:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 1 15:17:22 x15 kernel: LustreError: 3834:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 1 15:17:22 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:17:22 x15 kernel: LustreError: Skipped 2 previous similar messages
Sep 1 15:17:47 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 6s
Sep 1 15:17:47 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) Skipped 4 previous similar messages
Sep 1 15:17:47 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:18:37 x15 last message repeated 2 times
Sep 1 15:19:02 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:19:27 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 26s
Sep 1 15:19:27 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Sep 1 15:19:27 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:20:17 x15 last message repeated 2 times
Sep 1 15:21:07 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:21:07 x15 kernel: LustreError: Skipped 1 previous similar message
Sep 1 15:21:57 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 51s
Sep 1 15:21:57 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) Skipped 5 previous similar messages
Sep 1 15:22:22 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:22:22 x15 kernel: LustreError: Skipped 2 previous similar messages
Sep 1 15:24:52 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:24:52 x15 kernel: LustreError: Skipped 5 previous similar messages
Sep 1 15:27:22 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 51s
Sep 1 15:27:22 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) Skipped 12 previous similar messages
Sep 1 15:29:27 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:29:27 x15 kernel: LustreError: Skipped 10 previous similar messages
Sep 1 15:37:47 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 51s
Sep 1 15:37:47 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) Skipped 24 previous similar messages
Sep 1 15:38:12 x15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.0.201 at tcp. The ost_connect operation failed with -16
Sep 1 15:38:12 x15 kernel: LustreError: Skipped 20 previous similar messages
Sep 1 15:48:12 x15 kernel: Lustre: 3216:0:(import.c:395:import_select_connection()) data-OST0004-osc-ffff81022067ec00: tried all connections, increasing latency to 51s
==================================== END CLIENT LOG ================================================

Server log at the corresponding time...

==================================== SERVER LOG ================================================
Aug 31 04:02:04 oss1 syslogd 1.4.1: restart.
Sep 1 15:15:42 oss1 kernel: LustreError: 2450:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
Sep 1 15:15:42 oss1 kernel: LustreError: 2450:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
Sep 1 15:15:42 oss1 kernel: Lustre: 2450:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 2450
Sep 1 15:15:42 oss1 kernel: ldlm_cn_01 R running 5128 2450 1 2451 2448 (L-TLB)
Sep 1 15:15:42 oss1 kernel: f8df6614 48bc3fd2 c042dabe 00000082 e8addfc0 e1f77fcc 00000001 e8ad5ef4
Sep 1 15:15:42 oss1 kernel: 48bc3f5e 00059c88 e77907c0 23c34600 e737d2f4 f8d73b5c cb9f7000 e47d8000
Sep 1 15:15:42 oss1 kernel: 48bc3f3c c041f851 00000000 00000000 48bc3f5e 00059c88 00000080 e737d298
Sep 1 15:15:42 oss1 kernel: Call Trace:
Sep 1 15:15:42 oss1 kernel: [<f8df6614>] lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep 1 15:15:42 oss1 kernel: [<c042dabe>] lock_timer_base+0x15/0x2f
Sep 1 15:15:42 oss1 kernel: [<f8d73b5c>] lc_watchdog_touch_ms+0xac/0x270 [libcfs]
Sep 1 15:15:42 oss1 kernel: [<c041f851>] __wake_up+0x2a/0x3d
Sep 1 15:15:42 oss1 kernel: [<f8e0dffe>] ptlrpc_main+0xabe/0x16d0 [ptlrpc]
Sep 1 15:15:42 oss1 kernel: [<c0420291>] default_wake_function+0x0/0xc
Sep 1 15:15:42 oss1 kernel: [<f8e0d540>] ptlrpc_main+0x0/0x16d0 [ptlrpc]
Sep 1 15:15:42 oss1 kernel: [<c0405c0f>] kernel_thread_helper+0x7/0x10
Sep 1 15:15:42 oss1 kernel: ======================
Sep 1 15:15:42 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1220296542.2450
Sep 1 15:17:22 oss1 kernel: Lustre: 2500:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:17:47 oss1 kernel: Lustre: 2484:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:18:12 oss1 kernel: Lustre: 2481:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:18:37 oss1 kernel: Lustre: 2651:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:19:02 oss1 kernel: Lustre: 2481:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:19:27 oss1 kernel: Lustre: 2672:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:19:52 oss1 kernel: Lustre: 2677:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:20:42 oss1 kernel: Lustre: 2489:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:20:42 oss1 kernel: Lustre: 2489:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 1 previous similar message
Sep 1 15:21:57 oss1 kernel: Lustre: 2500:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:21:57 oss1 kernel: Lustre: 2500:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 2 previous similar messages
Sep 1 15:24:27 oss1 kernel: Lustre: 2720:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:24:27 oss1 kernel: Lustre: 2720:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 5 previous similar messages
Sep 1 15:25:42 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 2450: it was inactive for 600s
Sep 1 15:25:42 oss1 kernel: Lustre: 0:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 2450
Sep 1 15:25:42 oss1 kernel: ldlm_cn_01 D 5853A7AE 5128 2450 1 2451 2448 (L-TLB)
Sep 1 15:25:42 oss1 kernel: e8ad5ce0 00000046 000001b0 5853a7ae 000990b5 00000000 0000000a f5cd5550
Sep 1 15:25:42 oss1 kernel: f7c1c550 5853d51e 000990b5 00002d70 00000001 f5cd565c c20147e0 00000000
Sep 1 15:25:42 oss1 kernel: e8ad5eb4 f8d830a0 f8d7615c ffffffff 00000000 00000000 00000000 f8d75fa0
Sep 1 15:25:42 oss1 kernel: Call Trace:
Sep 1 15:25:42 oss1 kernel: [<f8d67d8c>] lbug_with_loc+0x9c/0xc0 [libcfs]
Sep 1 15:25:42 oss1 kernel: [<f8d7287f>] libcfs_assertion_failed+0x5f/0x60 [libcfs]
Sep 1 15:25:42 oss1 kernel: [<f8d8e3cb>] __ldlm_handle2lock+0x4db/0x6b0 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8df815e>] lustre_msg_get_flags+0x4e/0x130 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8defe79>] ptlrpc_send_reply+0x299/0x840 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8dc3ae1>] ldlm_request_cancel+0x131/0x7e0 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8dfd53f>] lustre_pack_reply+0x2f/0x140 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8dc423d>] ldlm_handle_cancel+0xad/0x570 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8df7854>] lustre_msg_get_opc+0xc4/0x210 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8df6614>] lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8dc487e>] ldlm_cancel_handler+0x17e/0xb80 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8e06305>] ptlrpc_check_req+0x15/0x220 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<f8df7e9e>] lustre_msg_get_handle+0xce/0x210 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<c042ccfd>] do_gettimeofday+0x31/0xce
Sep 1 15:25:42 oss1 kernel: [<f8e09fdd>] ptlrpc_server_handle_request+0xf3d/0x1d90 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<c042ccfd>] do_gettimeofday+0x31/0xce
Sep 1 15:25:42 oss1 kernel: [<f8d730bc>] lcw_update_time+0x1c/0x170 [libcfs]
Sep 1 15:25:42 oss1 kernel: [<f8d730bc>] lcw_update_time+0x1c/0x170 [libcfs]
Sep 1 15:25:42 oss1 kernel: [<f8df6614>] lustre_msg_get_conn_cnt+0xc4/0x210 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<c042dabe>] lock_timer_base+0x15/0x2f
Sep 1 15:25:42 oss1 kernel: [<f8d73b5c>] lc_watchdog_touch_ms+0xac/0x270 [libcfs]
Sep 1 15:25:42 oss1 kernel: [<c041f851>] __wake_up+0x2a/0x3d
Sep 1 15:25:42 oss1 kernel: [<f8e0dffe>] ptlrpc_main+0xabe/0x16d0 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<c0420291>] default_wake_function+0x0/0xc
Sep 1 15:25:42 oss1 kernel: [<f8e0d540>] ptlrpc_main+0x0/0x16d0 [ptlrpc]
Sep 1 15:25:42 oss1 kernel: [<c0405c0f>] kernel_thread_helper+0x7/0x10
Sep 1 15:25:42 oss1 kernel: ======================
Sep 1 15:25:42 oss1 kernel: LustreError: dumping log to /tmp/lustre-log.1220297142.2450
Sep 1 15:29:02 oss1 kernel: Lustre: 2654:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:29:02 oss1 kernel: Lustre: 2654:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 10 previous similar messages
Sep 1 15:37:47 oss1 kernel: Lustre: 2703:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:37:47 oss1 kernel: Lustre: 2703:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 20 previous similar messages
Sep 1 15:48:12 oss1 kernel: Lustre: 2699:0:(ldlm_lib.c:760:target_handle_connect()) data-OST0004: refuse reconnection from 85d01dbb-c5f4-a525-3fc8-4c316490b85a at 192.168.0.115@tcp to 0xcb9f7000; still busy with 2 active RPCs
Sep 1 15:48:12 oss1 kernel: Lustre: 2699:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 24 previous similar messages
==================================== END SERVER LOG ================================================

_________________________________________
Ron Jerome
National Research Council Canada
_________________________________________

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brock Palen
> Sent: September 5, 2008 10:22 AM
> To: Brian J. Murrell
> Cc: Lustre Discuss
> Subject: Re: [Lustre-discuss] Lustre clients failing, and can't reconnect
>
> I had to reboot the MDS to get the problem to go away.
> I will watch and see if it reappears.  I screwed up and deleted the
> wrong /var/log/messages, so I don't have the messages.
Brian J. Murrell
2008-Sep-05 15:10 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
On Fri, 2008-09-05 at 10:48 -0400, Jerome, Ron wrote:
> ==================================== SERVER LOG ================================================
> Aug 31 04:02:04 oss1 syslogd 1.4.1: restart.
> Sep 1 15:15:42 oss1 kernel: LustreError: 2450:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
> Sep 1 15:15:42 oss1 kernel: LustreError: 2450:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG

Did you notice this ASSERTION/LBUG?  All bets are off when you hit one
of these.  A thread that LBUGs is put into an uninterruptible sleep,
which can certainly account for hung clients.

This looks like bug 16496.

b.
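A sketch of how one might confirm this condition on an affected server, using only standard Linux tools (nothing Lustre-specific assumed): the LBUG leaves its marker in the kernel log, and the stuck service thread shows up in uninterruptible sleep (state D), just as in Ron's second stack dump:

  dmesg | grep -c LBUG                              # any LBUG since boot?
  ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'   # threads stuck in D state

A server in this state generally needs a reboot to recover, which matches Brock's experience earlier in the thread.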
Mag Gam
2008-Sep-13 15:44 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
By disabling statahead_max, what consequences can a client face?

# echo 0 > /proc/fs/lustre/.../statahead_max

On Fri, Sep 5, 2008 at 11:10 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> Did you notice this ASSERTION/LBUG?  All bets are off when you hit one
> of these.  A thread that LBUGs is put into an uninterruptible sleep,
> which can certainly account for hung clients.
>
> This looks like bug 16496.
>
> b.
Brian J. Murrell
2008-Sep-13 16:16 UTC
[Lustre-discuss] Lustre clients failing, and can't reconnect
On Sat, 2008-09-13 at 11:44 -0400, Mag Gam wrote:
> By disabling statahead_max, what consequences can a client face?
>
> # echo 0 > /proc/fs/lustre/.../statahead_max

The client does not benefit from the optimizations that can be done to
speed up "ls -l".

b.
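To make the trade-off concrete, here is a hedged sketch of measuring it on a client.  It assumes the tunable lives under /proc/fs/lustre/llite/ in this 1.6-era release (the path behind the "..." above is an assumption), and /mnt/lustre/big_dir is a placeholder for any large directory on the filesystem:

  # Time a metadata-heavy listing with statahead off, then restored:
  for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > "$f"; done
  echo 3 > /proc/sys/vm/drop_caches            # avoid measuring the cache
  time ls -l /mnt/lustre/big_dir > /dev/null

  for f in /proc/fs/lustre/llite/*/statahead_max; do echo 32 > "$f"; done
  echo 3 > /proc/sys/vm/drop_caches
  time ls -l /mnt/lustre/big_dir > /dev/null   # statahead back at its default

The difference is only the lost prefetching of file attributes during directory traversal; correctness is unaffected, which is why disabling it was a reasonable workaround to try in this thread.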