LBUG is a fatal error, and you will need to reboot the MDS. Strongly consider
upgrading to a newer release (e.g. 1.8.8) to fix this problem.
Cheers, Andreas
On 2013-01-31, at 12:10, Theodoros Stylianos Kondylis
<theodoros.stylianos.kondylis at gmail.com> wrote:
Hello everyone,
We are running Lustre v1.8.4, and since last night we have been facing problems
on our scratch FS. So I went to the respective MDS, where in /var/log/messages I
found the following:
Jan 24 23:12:13 mds kernel: LustreError:
5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(handle) failed
Jan 24 23:12:13 mds kernel: LustreError:
5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) LBUG
Jan 24 23:12:13 mds kernel: Pid: 5107, comm: ll_mdt_106
Jan 24 23:12:13 mds kernel:
Jan 24 23:12:13 mds kernel: Call Trace:
Jan 24 23:12:13 mds kernel: [<ffffffffa05d88ba>]
libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa05d8e2c>] lbug_with_loc+0x6c/0xd0
[libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa05e2550>] tracefile_init+0x0/0x170
[libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa06eb24f>]
__ldlm_handle2lock+0x35f/0x3b0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa06ec00f>]
ldlm_lock_decref+0x1f/0xb0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa09ba124>]
mds_verify_child+0x4e4/0x8b0 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09bcf39>]
mds_get_parent_child_locked+0x4e9/0x980 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09e0886>] mds_open+0x716/0x3680
[mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b73c7>]
mds_reint_rec+0x157/0x2d0 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09ac207>] mds_reint+0x1d7/0x450
[mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b1ae0>]
mds_intent_policy+0x6f0/0xd20 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa06ec958>]
ldlm_lock_enqueue+0x468/0xbf0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa070fe85>]
ldlm_handle_enqueue+0x265/0x1300 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b2bcc>] mds_handle+0xabc/0x5150
[mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa073ea32>]
ptlrpc_server_handle_request+0x902/0x1330 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa07411a4>] ptlrpc_main+0x6d4/0x1610
[ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffff8020cf49>] child_rip+0xa/0x11
Jan 24 23:12:13 mds kernel:
Jan 24 23:12:13 mds kernel: LustreError: dumping log to
/tmp/lustre-log.1359065533.5107
By googling I found that this is a bug fixed in version 1.8.6
(http://jira.whamcloud.com/browse/LU-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel).
After that comes a long list of LustreErrors like the one below, and the list
keeps growing as we speak.
Jan 24 23:25:38 mds kernel: LustreError:
23676:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1359066138, 200s ago); not entering recovery in server code, just
going back to sleep ns: mds-scratch-MDT0000_UUID lock:
ffff88080a81f800/0x48997a721de79abd lrc: 3/0,1 mode: --/EX res:
381225206/1751756546 bits 0x3 rrc: 63 type: IBT flags: 0x4004030 remote: 0x0
expref: -99 pid: 23676 timeout: 0
Jan 24 23:25:38 mds kernel: LustreError: dumping log to
/tmp/lustre-log.1359066338.23676
Jan 24 23:26:06 mds kernel: Lustre:
23677:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
ba8ef0e3-630b-714c-4a37-b7558adba3d0 reconnecting
Jan 24 23:26:25 mds kernel: LustreError:
23659:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1359066185, 200s ago); not entering recovery in server code, just
going back to sleep ns: mds-scratch-MDT0000_UUID lock:
ffff880c04ded200/0x48997a721e02cf62 lrc: 3/1,0 mode: --/CR res:
381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote: 0x0
expref: -99 pid: 23659 timeout: 0
Jan 24 23:26:26 mds kernel: LustreError:
5071:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1359066186, 200s ago); not entering recovery in server code, just
going back to sleep ns: mds-scratch-MDT0000_UUID lock:
ffff880b4b213a00/0x48997a721e03a0b3 lrc: 3/1,0 mode: --/CR res:
381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote: 0x0
expref: -99 pid: 5071 timeout: 0
Jan 24 23:26:29 mds kernel: Lustre:
5033:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
58837d54-5d72-cecb-593b-e7d78c9c54d3 reconnecting
Jan 24 23:27:15 mds kernel: LustreError:
5101:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1359066235, 200s ago); not entering recovery in server code, just
going back to sleep ns: mds-scratch-MDT0000_UUID lock:
ffff88090f7c7600/0x48997a721e1e8189 lrc: 3/1,0 mode: --/CR res:
381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote: 0x0
expref: -99 pid: 5101 timeout: 0
Jan 24 23:27:18 mds kernel: Lustre:
23684:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
141ba2f1-5a32-282b-1b9f-fde05c0caa1a reconnecting
Jan 24 23:27:22 mds kernel: Lustre:
5099:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
e57b39b0-2e26-0e11-3995-e801a3fb4ce3 reconnecting
Jan 24 23:27:24 mds kernel: Lustre:
23672:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
72b809ed-de60-1c38-f17e-145a2294c6db reconnecting
Jan 24 23:27:24 mds kernel: Lustre:
23672:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous similar
message
Jan 24 23:27:26 mds kernel: LustreError:
23665:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1359066246, 200s ago); not entering recovery in server code, just
going back to sleep ns: mds-scratch-MDT0000_UUID lock:
ffff8808c4952000/0x48997a721e2588eb lrc: 3/1,0 mode: --/CR res:
381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote: 0x0
expref: -99 pid: 23665 timeout: 0
Jan 24 23:27:29 mds kernel: Lustre:
5108:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
c9029114-746d-964e-8712-906a310f598f reconnecting
Jan 24 23:27:29 mds kernel: Lustre:
5108:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous similar
message
Jan 24 23:27:38 mds kernel: Lustre:
5032:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
d3e7f205-2e08-6614-3bf0-c2344ea5bf09 reconnecting
Jan 24 23:27:38 mds kernel: Lustre:
5032:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 2 previous similar
messages
Jan 24 23:27:56 mds kernel: Lustre:
5066:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
4e67a94e-7e0d-4b29-6272-587a72de2498 reconnecting
Jan 24 23:27:56 mds kernel: Lustre:
5066:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 10 previous similar
messages
Jan 24 23:28:36 mds kernel: Lustre:
23686:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
f5637638-bf4a-320a-6967-7cffaea36273 reconnecting
Jan 24 23:28:36 mds kernel: Lustre:
23686:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 8 previous similar
messages
Jan 24 23:28:36 mds kernel: Lustre:
5039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add
any time (5/-383), not sending early reply
Jan 24 23:28:36 mds kernel: req at ffff88054a4ea400 x1425048289875237/t0
o101->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 1024/3680 e 4 to 0 dl 1359066521 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 23:30:54 mds kernel: Lustre:
5023:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:30:54 mds kernel: LustreError:
5023:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff880425d99c00 x1425048289890866/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066754 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:01 mds kernel: Lustre:
5124:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:01 mds kernel: LustreError:
5124:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff88081b5c4000 x1425048289890867/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066761 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:08 mds kernel: Lustre:
5086:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:08 mds kernel: LustreError:
5086:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff8806213cfc00 x1425048289891170/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066768 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:15 mds kernel: Lustre:
5081:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:15 mds kernel: LustreError:
5081:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff88011d249000 x1425048289891171/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066775 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:22 mds kernel: Lustre:
5017:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:22 mds kernel: LustreError:
5017:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff8805d32c3000 x1425048289891172/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066782 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:29 mds kernel: Lustre:
5040:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:29 mds kernel: LustreError:
5040:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff8803a768a800 x1425048289891173/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066789 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:43 mds kernel: Lustre:
5039:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse
reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to
0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:43 mds kernel: Lustre:
5039:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous similar
message
Jan 24 23:31:43 mds kernel: LustreError:
5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req
at ffff88040fe6b000 x1425048289891477/t0
o38->ee75239f-4115-3c1d-a11a-7649df079499 at NET_0x500000a031c01_UUID:0/0
lens 368/264 e 0 to 0 dl 1359066803 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:43 mds kernel: LustreError:
5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar
message
Jan 24 23:31:50 mds kernel: Lustre:
23682:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000:
ee75239f-4115-3c1d-a11a-7649df079499 reconnecting
Jan 24 23:31:50 mds kernel: Lustre:
23682:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 22 previous similar
messages
I was wondering whether failing over the MDT and then rebooting the MDS would
solve the problem.
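For concreteness, this is roughly what I have in mind for the failover; the device and mountpoint names (MDT_DEV, MDT_MNT) are placeholders for our actual configuration, and the script only prints the steps as a dry run:

```shell
#!/bin/sh
# Hypothetical names -- substitute the real MDT device and mountpoint.
MDT_DEV=/dev/sdb    # assumption: the MDT's shared block device
MDT_MNT=/mnt/mdt    # assumption: the server-side mountpoint

# Dry run: print each step instead of executing it.
# Replace 'echo "+ $*"' with '"$@"' to actually run the commands.
run() { echo "+ $*"; }

run umount "$MDT_MNT"                        # stop the MDT on the hung MDS
run mount -t lustre "$MDT_DEV" "$MDT_MNT"    # start it on the failover node
```

My worry is that with an MDS thread stuck in the LBUG, the umount itself may hang, in which case I suppose the MDS has to be power-cycled before mounting the MDT on the failover node and letting clients go through recovery.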
If you have any ideas or proposals, I would be more than happy to hear them.
Stelios.
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss