On all of our OSSs, ll_ost_io_512 goes into "D" state after heavy I/O.
Additionally, some of our clients lose their mounts as a result.
Could this be due to the megasas reset shown below? If so, which
party is at fault here: Lustre, megasas, or both? I know I can increase
the megasas timeout using /sys/block/sdb/device/timeout. Is this
advisable?
Thank You,
jeff
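
For reference, here is a minimal sketch of how the timeout mentioned
above could be read and raised. The helper name scsi_timeout and the
value 120 are only illustrative assumptions, not a recommendation;
the sysfs path is the one from this post, and the value is in seconds.

```shell
#!/bin/sh
# scsi_timeout FILE [SECONDS]   (hypothetical helper, not part of any package)
# Print the timeout stored in FILE; if SECONDS is given, write it first.
scsi_timeout() {
    file=$1
    new=$2
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    if [ -n "$new" ]; then
        # Needs root on a real sysfs file; the setting is lost on reboot.
        echo "$new" > "$file" || return 1
    fi
    cat "$file"
}

# Example calls (assuming sdb is the megasas-backed device):
#   scsi_timeout /sys/block/sdb/device/timeout        # show current value
#   scsi_timeout /sys/block/sdb/device/timeout 120    # raise it, then show it
```

A longer timeout would only make the SCSI layer wait longer before it
resets the controller; it would not by itself explain or fix the LBUG
shown in the log below.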
[root@oss1 ~]# ps axuwww|grep 19120
root 19120 0.0 0.0 0 0 ? D 10:05 0:00 [ll_ost_io_512]
[root@oss1 ~]# dmesg|grep -C 20 ll_ost_io_512
Lustre: Skipped 5 previous similar messages
LustreError: 4991:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-107) req@0000010037cedc00 x61664/t0
o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 4991:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped
18 previous similar messages
megasas: RESET -12714968 cmd=2a <c=2 t=1 l=0>
megasas: reset successful
Lustre: 6146:0:(filter_io_26.c:732:filter_commitrw_write())
lustre0-OST0000: slow direct_io 36s
Lustre: 6146:0:(filter_io_26.c:732:filter_commitrw_write()) Skipped 1
previous similar message
Lustre: 6146:0:(filter_io_26.c:745:filter_commitrw_write())
lustre0-OST0000: slow commitrw commit 36s
Lustre: 6146:0:(filter_io_26.c:745:filter_commitrw_write()) Skipped 1
previous similar message
Lustre: 12672:0:(filter_io_26.c:732:filter_commitrw_write())
lustre0-OST0000: slow direct_io 42s
Lustre: 12672:0:(filter_io_26.c:745:filter_commitrw_write())
lustre0-OST0000: slow commitrw commit 42s
LustreError: 19120:0:(filter.c:1575:filter_iobuf_get())
ASSERTION(thread_id < filter->fo_iobuf_count) failed
LustreError: 19120:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
Lustre: 19120:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing
stack for process 19120
ll_ost_io_512 R running task 0 19120 1 19119 (L-TLB)
000001022b373ee8 000001021a71c070 0000010233255c00 000001022b373db8
0000010037d245c0 0000000000000200 0000000000000000 0000000000000000
000000000000045e 00000000ffffffff
Call Trace:<ffffffffa02ac539>{:ptlrpc:ptlrpc_main+0}
<ffffffff80110e1b>{child_rip+0}
LustreError: dumping log to /tmp/lustre-log.1181916357.19120
Lustre: 19120:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET
upcall /usr/lib/lustre/lnet_upcall
LBUG,/cache/build/BUILD/lustre-1.6.0.1/lnet/libcfs/tracefile.c,libcfs_assertion_failed,433
Lustre: 19120:0:(linux-debug.c:98:libcfs_run_upcall()) Skipped 21
previous similar messages
Lustre: 5071:0:(ldlm_lib.c:497:target_handle_reconnect())
lustre0-OST0000: 40018968-32d5-68f5-4d5c-565dbeaf7d52 reconnecting
Lustre: 5071:0:(ldlm_lib.c:709:target_handle_connect())
lustre0-OST0000: refuse reconnection from
40018968-32d5-68f5-4d5c-565dbeaf7d52@10.2.2.107@tcp to
0x0000010196bfe000/2
LustreError: 5071:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-16) req@0000010006bc8400 x53868/t0
o8->40018968-32d5-68f5-4d5c-565dbeaf7d52@NET_0x200000a02026b_UUID:-1
lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
LustreError: 5071:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped
13 previous similar messages
Lustre: 4909:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid
19120: it was inactive for 100s
Lustre: 4909:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing
stack for process 19120
ll_ost_io_512 D 00000000000001b1 0 19120 1 19119 (L-TLB)
000001018e131908 0000000000000046 000001018e131898 ffffffff00000074
ffffffffa01bff45 00000000a01bff4c 000001000106daa0 0000000180373ca8
000001018e7ef030 0000000000024494
Call Trace:<ffffffff80134a16>{remove_wait_queue+16}
<ffffffffa01b7a75>{:libcfs:lbug_with_loc+145}
<ffffffffa01bd1d5>{:libcfs:collect_pages_on_cpu+0}
<ffffffffa043ab87>{:obdfilter:filter_iobuf_get+68}
<ffffffffa0446da9>{:obdfilter:filter_preprw+4299}
<ffffffffa0416d3b>{:ost:ost_brw_read+2434}
<ffffffffa02a4471>{:ptlrpc:lustre_msg_get_version+64}
<ffffffffa041c657>{:ost:ost_handle+7330}
<ffffffffa01bcf69>{:libcfs:libcfs_debug_vmsg2+1713}
<ffffffff801e9ce7>{vsnprintf+1406}
<ffffffff801e9dca>{snprintf+131}
<ffffffffa02aae57>{:ptlrpc:ptlrpc_server_handle_request+2528}
<ffffffff8013f100>{__mod_timer+293}
<ffffffffa02acd1b>{:ptlrpc:ptlrpc_main+2018}
<ffffffffa02ab97a>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa02ab97a>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa02ab97a>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffff80110e23>{child_rip+8}
<ffffffffa02ac539>{:ptlrpc:ptlrpc_main+0}
<ffffffff80110e1b>{child_rip+0}
LustreError: dumping log to /tmp/lustre-log.1181916457.19120
Lustre: 4992:0:(ldlm_lib.c:497:target_handle_reconnect())
lustre0-OST0000: 40018968-32d5-68f5-4d5c-565dbeaf7d52 reconnecting
Lustre: 4992:0:(ldlm_lib.c:709:target_handle_connect())
lustre0-OST0000: refuse reconnection from
40018968-32d5-68f5-4d5c-565dbeaf7d52@10.2.2.107@tcp to
0x0000010196bfe000/2
LustreError: 4992:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-16) req@0000010037d83400 x53871/t0
o8->40018968-32d5-68f5-4d5c-565dbeaf7d52@NET_0x200000a02026b_UUID:-1
lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
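
To see how long the slow I/O stalls in the OSS log above were, something
like the following sketch could be used. The function name
slow_io_summary is made up for illustration, and it assumes a grep that
supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Print each "slow ..." duration reported by Lustre, largest first.
# Reads kernel log text on stdin, so it also works on saved logs.
slow_io_summary() {
    grep -Eo 'slow [a-z_ ]+ [0-9]+s' |
        # Prefix each match with its numeric duration so sort can order it.
        awk '{ secs = $NF; sub(/s$/, "", secs); print secs, $0 }' |
        sort -rn |
        # Drop the numeric prefix again, leaving only the original message.
        cut -d' ' -f2-
}

# Example call on the live kernel log:
#   dmesg | slow_io_summary
```

Comparing the largest durations with the SCSI timeout on the device may
help show whether the stalls line up with the megasas reset.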
On Clients:
LustreError: 2310:0:(client.c:950:ptlrpc_expire_one_request()) @@@
timeout (sent at 1181916375, 100s ago) req@0000010403309200 x55228/t0
o3->lustre0-OST0002_UUID@10.1.1.35@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 rc 0/-22
Lustre: lustre0-OST0002-osc-0000010404826800: Connection to service
lustre0-OST0002 via nid 10.1.1.35@tcp was lost; in progress operations
using this service will wait for recovery to complete.
LustreError: 2311:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@0000010404bc0600 x55270/t0
o8->lustre0-OST0002_UUID@10.1.1.35@tcp:28 lens 304/328 ref 1 fl Rpc:R/0/0 rc 0/-16
LustreError: 2311:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@0000010404ac9800 x55275/t0
o8->lustre0-OST0002_UUID@10.1.1.35@tcp:28 lens 304/328 ref 1 fl Rpc:R/0/0 rc 0/-16
LustreError: 2311:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@0000010404af6600 x55280/t0
o8->lustre0-OST0002_UUID@10.1.1.35@tcp:28 lens 304/328 ref 1 fl Rpc:R/0/0 rc 0/-16
--
Jeff Blasius / jeff.blasius@yale.edu
Phone: (203)432-9940 51 Prospect Rm. 011
High Performance Computing (HPC)
UNIX Systems Administrator, WorkStation Support (WSS)
Yale University Information Technology Services (ITS)