Joan Marc
2011-Aug-25 16:37 UTC
[Lustre-discuss] Lustre I/O server (OSS) goes down; the other I/O server takes over the service, overloads itself, and performance goes down.
Hello,

I'm seeing something I cannot explain; I'm quite new to Lustre.

n4 and n5 are two I/O servers (OSSs), each one taking care of different datasets. When one of them goes down, the other is supposed to take over the service as the backup node. But n4 goes down without any heavy load, and when n5 takes over the service its CPU and memory usage go through the roof and performance drops to the floor.

Can someone tell me which lines to look at, so I have a starting point for checking this issue?

Here is about 30 minutes of syslog with the Lustre-related messages from n4:
http://pastebin.com/q1iGwDxw

Here is roughly the same 30 minutes from n5:
http://pastebin.com/4Bg5repa

Many thanks.

Marc
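As a first pass over those pastebins, it can help to strip everything except the Lustre messages and count them per minute, so the bursts stand out before reading individual lines. The following is a minimal sketch in Python; the filename n4-syslog.txt stands for a locally saved copy of the n4 pastebin and is only illustrative.

import re
from collections import Counter

per_minute = Counter()

with open("n4-syslog.txt") as log:  # hypothetical local copy of the pastebin dump
    for line in log:
        if "Lustre" not in line:  # keep only Lustre / LustreError messages
            continue
        # syslog timestamps look like "Aug 22 18:58:13"; bucket by minute
        m = re.match(r"(\w{3}\s+\d+\s+\d{2}:\d{2})", line)
        if m:
            per_minute[m.group(1)] += 1

# lexicographic sort is chronological here because the excerpt covers a single day
for minute, count in sorted(per_minute.items()):
    print(f"{minute}  {count} Lustre messages")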
Joan Marc
2011-Aug-26 10:05 UTC
[Lustre-discuss] Lustre I/O server (OSS) goes down; the other I/O server takes over the service, overloads itself, and performance goes down.
Replying to myself to summarize the batch of logs I sent yesterday. I'm trying to understand the following block of log messages. It seems quite strange to me to see such large numbers of similar messages (for example, 21018 occurrences of the same refused reconnection) in such a small amount of time, less than a second. Is there any way to ask Lustre WHY it is refusing that connection?

Many thanks.

Marc

Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) scratch-OST0034: 1ccbe619-062d-09df-9e3b-3ff3960bfa6a reconnecting
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) Skipped 21021 previous similar messages
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) Skipped 21018 previous similar messages
Aug 22 18:59:01 s_sys at n4 crond[21701]: (perf) CMD (/opt/perf/bin/ganglia_metrics.pl -jobs -bond0 &>/dev/null)
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) Skipped previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Watchdog triggered for pid 13380: it was inactive for 200.00s
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 13380
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: ll_ost_io_49 S ffff81000900c980 0 13380 1 13381 13379 (L-TLB)
Aug 22 18:59:53 s_kernel at n4 kernel: ffff81061fd55a10 0000000000000046 ffff8104ca43f400 0000000000000000
Aug 22 18:59:53 s_kernel at n4 kernel: 0000000000000000 000000000000000a ffff8106201db040 ffff8106400677d0
Aug 22 18:59:53 s_kernel at n4 kernel: 00063866b1dac473 0000000000000b17 ffff8106201db228 0000000100000000
Aug 22 18:59:53 s_kernel at n4 kernel: Call Trace:
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888c5f27>] :ptlrpc:target_queue_recovery_request+0xb37/0x12b0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88b5d11b>] :ost:ost_handle+0x37b/0x5cd0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c5d06>] free_pages_bulk+0x1ee/0x266
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c9cb4>] zone_statistics+0x3e/0x6d
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8014a88b>] __next_cpu+0x19/0x28
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800893ba>] find_busiest_group+0x20d/0x621
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888f7325>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a228>] enqueue_task+0x41/0x56
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888ffa6a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8890190d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88903e9e>] :ptlrpc:ptlrpc_server_handle_request+0xa9e/0x1160
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80088c83>] __wake_up_common+0x3e/0x68
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88907388>] :ptlrpc:ptlrpc_main+0x1208/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800b5eed>] audit_syscall_exit+0x31b/0x336
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88906180>] :ptlrpc:ptlrpc_main+0x0/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e013>] child_rip+0x0/0x11
Aug 22 18:59:53 s_kernel at n4 kernel:
Aug 22 18:59:53 s_kernel at n4 kernel: LustreError: dumping log to /tmp/lustre-log.1314032393.13380

On 25 August 2011 18:37, Joan Marc <joanmarcriera at gmail.com> wrote:

> Hello,
>
> I'm seeing something I cannot explain; I'm quite new to Lustre.
>
> n4 and n5 are two I/O servers (OSSs), each one taking care of different
> datasets. When one of them goes down, the other is supposed to take over
> the service as the backup node. But n4 goes down without any heavy load,
> and when n5 takes over the service its CPU and memory usage go through the
> roof and performance drops to the floor.
>
> Can someone tell me which lines to look at, so I have a starting point for
> checking this issue?
>
> Here is about 30 minutes of syslog with the Lustre-related messages from n4:
> http://pastebin.com/q1iGwDxw
>
> Here is roughly the same 30 minutes from n5:
> http://pastebin.com/4Bg5repa
>
> Many thanks.
>
> Marc
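To put a number on how quickly those refusals pile up, the rate-limited "Skipped N previous similar messages" counters can be summed per source location, since each one stands for N messages that were suppressed between two printed ones. The following is a minimal sketch in Python; as before, n4-syslog.txt is only an illustrative name for a locally saved copy of the pastebin.

import re
from collections import Counter

skipped = Counter()
pattern = re.compile(r"\((\S+)\) Skipped (\d+) previous similar messages")

with open("n4-syslog.txt") as log:  # hypothetical local copy of the pastebin dump
    for line in log:
        m = pattern.search(line)
        if m:
            source, count = m.group(1), int(m.group(2))
            skipped[source] += count + 1  # +1 for the message that was actually printed

for source, total in skipped.most_common():
    print(f"{total:8d}  {source}")

On the excerpt above, that counts the roughly 21,000 suppressed "reconnecting" and "refuse reconnection" messages against ldlm_lib.c:542 and ldlm_lib.c:799 respectively; the timestamps of the surrounding printed messages then show how long a window those suppressed messages actually span.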