Joan Marc
2011-Aug-25 16:37 UTC
[Lustre-discuss] Lustre I/O server (OSS) goes down; the other I/O server takes over the service, overloads itself, and performance goes down.
Hello,

I'm seeing something I cannot explain; I'm quite new to Lustre.

n4 and n5 are two I/O servers (OSSs), each one taking care of different datasets. When one of them goes down, the other is supposed to take over the service as the backup node. But n4 goes down without any heavy load, and when n5 takes over the service its CPU and memory usage go through the roof and performance drops to the floor.

Can someone tell me which lines to look at, so I have a starting point for checking this issue?

Here is about 30 minutes of syslog with the Lustre-related messages from n4:
http://pastebin.com/q1iGwDxw

Here is roughly the same 30 minutes from n5:
http://pastebin.com/4Bg5repa

Many thanks.

Marc
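As a first pass over those pastebins, it can help to strip everything except the Lustre messages and count them per minute, so the bursts stand out before reading individual lines. The following is a minimal sketch in Python; the filename n4-syslog.txt stands for a locally saved copy of the n4 pastebin and is only illustrative.

import re
from collections import Counter

per_minute = Counter()

with open("n4-syslog.txt") as log:  # hypothetical local copy of the pastebin dump
    for line in log:
        if "Lustre" not in line:  # keep only Lustre / LustreError messages
            continue
        # syslog timestamps look like "Aug 22 18:58:13"; bucket by minute
        m = re.match(r"(\w{3}\s+\d+\s+\d{2}:\d{2})", line)
        if m:
            per_minute[m.group(1)] += 1

# lexicographic sort is chronological here because the excerpt covers a single day
for minute, count in sorted(per_minute.items()):
    print(f"{minute}  {count} Lustre messages")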
Joan Marc
2011-Aug-26 10:05 UTC
[Lustre-discuss] Lustre I/O server (OSS) goes down; the other I/O server takes over the service, overloads itself, and performance goes down.
Replying to myself to summarize the batch of logs I sent yesterday. I'm trying to understand the following block of log messages. It seems quite strange to me to see such large numbers of similar messages (for example, 21018 occurrences of the same refused reconnection) in such a small amount of time, less than a second. Is there any way to ask Lustre WHY it is refusing that connection?

Many thanks.

Marc

Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) scratch-OST0034: 1ccbe619-062d-09df-9e3b-3ff3960bfa6a reconnecting
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) Skipped 21021 previous similar messages
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:58:13 s_kernel at n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) Skipped 21018 previous similar messages
Aug 22 18:59:01 s_sys at n4 crond[21701]: (perf) CMD (/opt/perf/bin/ganglia_metrics.pl -jobs -bond0 &>/dev/null)
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from 1ccbe619-062d-09df-9e3b-3ff3960bfa6a at 10.2.1.15@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:59:17 s_kernel at n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) Skipped previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Watchdog triggered for pid 13380: it was inactive for 200.00s
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 13380
Aug 22 18:59:53 s_kernel at n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel at n4 kernel: ll_ost_io_49 S ffff81000900c980 0 13380 1 13381 13379 (L-TLB)
Aug 22 18:59:53 s_kernel at n4 kernel: ffff81061fd55a10 0000000000000046 ffff8104ca43f400 0000000000000000
Aug 22 18:59:53 s_kernel at n4 kernel: 0000000000000000 000000000000000a ffff8106201db040 ffff8106400677d0
Aug 22 18:59:53 s_kernel at n4 kernel: 00063866b1dac473 0000000000000b17 ffff8106201db228 0000000100000000
Aug 22 18:59:53 s_kernel at n4 kernel: Call Trace:
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888c5f27>] :ptlrpc:target_queue_recovery_request+0xb37/0x12b0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88b5d11b>] :ost:ost_handle+0x37b/0x5cd0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c5d06>] free_pages_bulk+0x1ee/0x266
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800c9cb4>] zone_statistics+0x3e/0x6d
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8014a88b>] __next_cpu+0x19/0x28
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800893ba>] find_busiest_group+0x20d/0x621
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888f7325>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a228>] enqueue_task+0x41/0x56
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff888ffa6a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8890190d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88903e9e>] :ptlrpc:ptlrpc_server_handle_request+0xa9e/0x1160
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff80088c83>] __wake_up_common+0x3e/0x68
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88907388>] :ptlrpc:ptlrpc_main+0x1208/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff800b5eed>] audit_syscall_exit+0x31b/0x336
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e01d>] child_rip+0xa/0x11
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff88906180>] :ptlrpc:ptlrpc_main+0x0/0x13d0
Aug 22 18:59:53 s_kernel at n4 kernel: [<ffffffff8005e013>] child_rip+0x0/0x11
Aug 22 18:59:53 s_kernel at n4 kernel:
Aug 22 18:59:53 s_kernel at n4 kernel: LustreError: dumping log to /tmp/lustre-log.1314032393.13380

On 25 August 2011 18:37, Joan Marc <joanmarcriera at gmail.com> wrote:

> Hello,
>
> I'm seeing something I cannot explain; I'm quite new to Lustre.
>
> n4 and n5 are two I/O servers (OSSs), each one taking care of different
> datasets. When one of them goes down, the other is supposed to take over
> the service as the backup node. But n4 goes down without any heavy load,
> and when n5 takes over the service its CPU and memory usage go through the
> roof and performance drops to the floor.
>
> Can someone tell me which lines to look at, so I have a starting point for
> checking this issue?
>
> Here is about 30 minutes of syslog with the Lustre-related messages from n4:
> http://pastebin.com/q1iGwDxw
>
> Here is roughly the same 30 minutes from n5:
> http://pastebin.com/4Bg5repa
>
> Many thanks.
>
> Marc
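To put a number on how quickly those refusals pile up, the rate-limited "Skipped N previous similar messages" counters can be summed per source location, since each one stands for N messages that were suppressed between two printed ones. The following is a minimal sketch in Python; as before, n4-syslog.txt is only an illustrative name for a locally saved copy of the pastebin.

import re
from collections import Counter

skipped = Counter()
pattern = re.compile(r"\((\S+)\) Skipped (\d+) previous similar messages")

with open("n4-syslog.txt") as log:  # hypothetical local copy of the pastebin dump
    for line in log:
        m = pattern.search(line)
        if m:
            source, count = m.group(1), int(m.group(2))
            skipped[source] += count + 1  # +1 for the message that was actually printed

for source, total in skipped.most_common():
    print(f"{total:8d}  {source}")

On the excerpt above, that counts the roughly 21,000 suppressed "reconnecting" and "refuse reconnection" messages against ldlm_lib.c:542 and ldlm_lib.c:799 respectively; the timestamps of the surrounding printed messages then show how long a window those suppressed messages actually span.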