Hi,

Today our MDS started to behave unstably. The /proc/fs/lustre/health_check file reported that the MDS device is not healthy (a minimal sketch of how this file can be polled follows the log snippets below). All clients connected to the ddn_home file system got stuck, and the MDS server started to refuse client connections and, after some time, to evict clients. Can someone help me get to the bottom of this problem? Log snippets from the MDS and from one client are attached below.

Thanks,
Wojciech

Lustre 1.6.6
RHEL4
2.6.9-67.0.22.EL_lustre.1.6.6smp
600 clients
24 OSTs / 4 OSSes

MDS: mds01

Feb 6 11:42:34 mds01 kernel: LustreError: 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920454, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 0000010238fdb700/0x2dea4e283aa99c1c lrc: 3/0,1 mode: --/EX res: 38667385/3946061039 bits 0x2 rrc: 16 type: IBT flags: 4004030 remote: 0x0 expref: -99 pid 19469
Feb 6 11:42:34 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1233920554.19471
Feb 6 11:42:34 mds01 kernel: LustreError: 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 6 11:42:34 mds01 kernel: LustreError: 19509:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920454, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 000001014281e040/0x2dea4e283aa9ba14 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19509
Feb 6 11:42:38 mds01 kernel: LustreError: 19506:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920458, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 00000101882b6700/0x2dea4e283aabc0f4 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19506
Feb 6 11:42:38 mds01 kernel: LustreError: 19506:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
Feb 6 11:42:39 mds01 kernel: LustreError: 19463:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920459, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 000001036a58fdc0/0x2dea4e283aac7ad6 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19463
Feb 6 11:42:50 mds01 kernel: LustreError: 19498:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920470, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 0000010242c97940/0x2dea4e283ab844cb lrc: 3/1,0 mode: --/CR res: 38667385/3946061039 bits 0x3 rrc: 16 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19498
Feb 6 11:42:50 mds01 kernel: LustreError: 19498:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 6 previous similar messages
Feb 6 11:44:17 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19469: it was inactive for 200s
Feb 6 11:44:17 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19471
Feb 6 11:44:17 mds01 kernel: ll_mdt_50 S 0000000000000000 0 19471 1 19472 19470 (L-TLB)
... lustre watchdog log dumps ...
Feb 6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: 3d03a918-dd07-998f-226e-16a7634433bf reconnecting
Feb 6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
Feb 6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from 3d03a918-dd07-998f-226e-16a7634433bf@node-h27@tcp to 0x000001027449c000; still busy with 3 active RPCs
Feb 6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 1 previous similar message
Feb 6 11:44:17 mds01 kernel: LustreError: 19507:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16) req@0000010402541c50 x2988280/t0 o38->3d03a918-dd07-998f-226e-16a7634433bf@NET_0x200000a8f081b_UUID:0/0 lens 304/200 e 0 to 0 dl 1233920754 ref 1 fl Interpret:/0/0 rc -16/0
Feb 6 11:44:17 mds01 kernel: LustreError: 19507:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar message
Feb 6 11:44:17 mds01 kernel: ll_mdt_103 S 0000000000000000 0 19525 1 19526 19524 (L-TLB)
... lustre watchdog log dumps ...
Feb 6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: d225b081-9db9-b8c4-9e6d-6558af87cdeb reconnecting
Feb 6 11:44:35 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1233920674.19495
Feb 6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from d225b081-9db9-b8c4-9e6d-6558af87cdeb@node-h02@tcp to 0x00000102763b8000; still busy with 5 active RPCs
Feb 6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 2 previous similar messages
Feb 6 11:44:35 mds01 kernel: LustreError: 19473:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16) req@000001007a4ee800 x2710478/t0 o38->d225b081-9db9-b8c4-9e6d-6558af87cdeb@NET_0x200000a8f0802_UUID:0/0 lens 304/200 e 0 to 0 dl 1233920774 ref 1 fl Interpret:/0/0 rc -16/0
Feb 6 11:44:35 mds01 kernel: LustreError: 19473:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2 previous similar messages
Feb 6 11:45:00 mds01 kernel: LustreError: 19485:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920600, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 00000101a3c94280/0x2dea4e283abe817a lrc: 3/1,0 mode: --/CR res: 38656801/3946041493 bits 0x3 rrc: 26 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19485
Feb 6 11:45:14 mds01 kernel: LustreError: 19550:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920614, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 00000103d3b90280/0x2dea4e283abed080 lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 23 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19550
Feb 6 11:45:14 mds01 kernel: LustreError: 19550:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 10 previous similar messages
Feb 6 11:45:31 mds01 kernel: LustreError: 19520:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920631, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 000001025e8334c0/0x2dea4e283abfa192 lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 24 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19520
Feb 6 11:45:31 mds01 kernel: LustreError: 19520:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 6 11:46:40 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19485: it was inactive for 200s
Feb 6 11:46:40 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Skipped 1 previous similar message
Feb 6 11:46:40 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19485
Feb 6 11:46:40 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) Skipped 7 previous similar messages
Feb 6 11:46:40 mds01 kernel: ll_mdt_63 S 0000000000000000 0 19485 1 19486 19484 (L-TLB)
...
...
Feb 6 11:57:27 mds01 kernel: Lustre: ddn_home-MDT0000: haven't heard from client d225b081-9db9-b8c4-9e6d-6558af87cdeb (at node-h02@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Feb 6 11:57:27 mds01 kernel: Lustre: Skipped 4 previous similar messages
Feb 6 11:57:35 mds01 kernel: LustreError: 19467:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233921355, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-MDT0000_UUID lock: 0000010341014dc0/0x2dea4e283ad5221d lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 43 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19467
Feb 6 11:57:35 mds01 kernel: LustreError: 19467:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 7 previous similar messages
Feb 6 11:58:43 mds01 kernel: Lustre: ddn_home-MDT0000: haven't heard from client a50193fa-1aef-44fe-f917-7b3063f7cc30 (at node-g10@tcp) in 221 seconds. I think it's dead, and I am evicting it.
Feb 6 11:58:43 mds01 kernel: Lustre: Skipped 1 previous similar message
Feb 6 11:59:15 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19467: it was inactive for 200s
Feb 6 11:59:15 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Skipped 7 previous similar messages
Feb 6 11:59:15 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19467
Feb 6 11:59:15 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) Skipped 7 previous similar messages
Feb 6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: 3d03a918-dd07-998f-226e-16a7634433bf reconnecting
Feb 6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 6 previous similar messages
Feb 6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from 3d03a918-dd07-998f-226e-16a7634433bf@node-h27@tcp to 0x000001029c566000; still busy with 2 active RPCs
Feb 6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 6 previous similar messages
Feb 6 11:59:15 mds01 kernel: LustreError: 19502:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16) req@0000010216915000 x2989437/t0 o38->3d03a918-dd07-998f-226e-16a7634433bf@NET_0x200000a8f081b_UUID:0/0 lens 304/200 e 0 to 0 dl 1233921655 ref 1 fl Interpret:/0/0 rc -16/0
Feb 6 11:59:15 mds01 kernel: LustreError: 19502:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 6 previous similar messages
Feb 6 11:59:15 mds01 kernel: ll_mdt_46 S 0000000000000000 0 19467 1 19468 19466 (L-TLB)

Client: node-h02

Feb 6 11:44:30 node-h02 kernel: Lustre: Request x2710369 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID mds01@tcp 200s ago has timed out (limit 200s).
Feb 6 11:44:30 node-h02 kernel: Lustre: Request x2710376 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID mds01@tcp 200s ago has timed out (limit 200s).
Feb 6 11:44:30 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection to service ddn_home-MDT0000 via nid mds01@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 6 11:44:30 node-h02 kernel: Lustre: Skipped 2 previous similar messages
Feb 6 11:44:34 node-h02 kernel: LustreError: 11-0: an error occurred while communicating with mds01@tcp. The mds_connect operation failed with -16
Feb 6 11:46:10 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to mds02@tcp/mds02@tcp
Feb 6 11:49:30 node-h02 kernel: Lustre: Request x2710506 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID mds02@tcp 200s ago has timed out (limit 200s).
Feb 6 11:50:20 node-h02 kernel: Lustre: 15881:0:(import.c:410:import_select_connection()) ddn_home-MDT0000-mdc-00000100069fc400: tried all connections, increasing latency to 9s
Feb 6 11:50:20 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to mds01@tcp/mds01@tcp
Feb 6 11:50:20 node-h02 kernel: LustreError: 167-0: This client was evicted by ddn_home-MDT0000; in progress operations using this service will fail.
Feb 6 11:50:20 node-h02 kernel: LustreError: 9041:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb 6 11:50:20 node-h02 kernel: LustreError: 9301:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb 6 11:50:20 node-h02 kernel: LustreError: 9189:0:(file.c:2925:ll_inode_revalidate_fini()) failure -5 inode 38656801
Feb 6 11:50:20 node-h02 kernel: LustreError: 9301:0:(mdc_locks.c:598:mdc_enqueue()) Skipped 1 previous similar message
Feb 6 11:50:20 node-h02 kernel: LustreError: 9080:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb 6 11:50:20 node-h02 kernel: LustreError: 9080:0:(file.c:2925:ll_inode_revalidate_fini()) failure -5 inode 38656801
Feb 6 11:50:20 node-h02 kernel: LustreError: 9080:0:(file.c:2925:ll_inode_revalidate_fini()) Skipped 1 previous similar message
Feb 6 11:50:20 node-h02 kernel: LustreError: 9367:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101adf1d200 x2710664/t0 o35->ddn_home-MDT0000_UUID@mds01@tcp:23/10 lens 296/688 e 0 to 200 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb 6 11:50:20 node-h02 kernel: LustreError: 9367:0:(file.c:116:ll_close_inode_openhandle()) inode 26872571 mdc close failed: rc = -108
Feb 6 11:50:20 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection restored to service ddn_home-MDT0000 using nid mds01@tcp.
Feb 6 11:50:20 node-h02 kernel: LustreError: 9080:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell .
Feb 6 11:50:20 node-h02 kernel: LustreError: 9187:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell .
Feb 6 11:50:20 node-h02 kernel: LustreError: 9187:0:(mdc_request.c:741:mdc_close()) Skipped 77 previous similar messages
Feb 6 11:53:40 node-h02 kernel: Lustre: Request x2710802 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID mds01@tcp 200s ago has timed out (limit 200s).
Feb 6 11:53:40 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection to service ddn_home-MDT0000 via nid mds01@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 6 11:53:40 node-h02 kernel: LustreError: 11-0: an error occurred while communicating with mds01@tcp. The mds_connect operation failed with -16
Feb 6 11:54:30 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to mds02@tcp/mds02@tcp
Feb 6 11:57:50 node-h02 kernel: Lustre: Request x2710887 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID mds02@tcp 200s ago has timed out (limit 200s).
Feb 6 11:58:40 node-h02 kernel: Lustre: 15881:0:(import.c:410:import_select_connection()) ddn_home-MDT0000-mdc-00000100069fc400: tried all connections, increasing latency to 14s
Feb 6 11:58:40 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to mds01@tcp/mds01@tcp
Feb 6 11:58:40 node-h02 kernel: LustreError: 167-0: This client was evicted by ddn_home-MDT0000; in progress operations using this service will fail.
Feb 6 11:58:40 node-h02 kernel: LustreError: 10470:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
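For context, here is a minimal sketch of how the health file mentioned above can be polled on the MDS. The /proc/fs/lustre/health_check path is the one from the post; the assumption that the file reads exactly "healthy" when all devices are fine (and something else, such as "NOT HEALTHY", when they are not) is mine and may differ between Lustre versions:

  #!/bin/sh
  # Hypothetical health poller for an MDS node such as mds01 (not from the
  # original post). Assumes the Lustre 1.6-era /proc path and that the file
  # contains exactly "healthy" when every local Lustre device is OK.
  HEALTH=/proc/fs/lustre/health_check

  while true; do
      STATUS=$(cat "$HEALTH" 2>/dev/null)
      if [ "$STATUS" != "healthy" ]; then
          # Anything else (e.g. "NOT HEALTHY") means at least one device has
          # reported a problem; log it so the time of the first failure is kept.
          logger -t lustre-health "health_check failed: ${STATUS:-unreadable}"
      fi
      sleep 60
  done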
Andreas Dilger
2009-Feb-06 22:23 UTC
[Lustre-discuss] mds device unhealthy - clients got stuck
On Feb 06, 2009 17:48 +0000, Wojciech Turek wrote:
> Today our MDS started to behave unstably. The /proc/fs/lustre/health_check
> file reported that the MDS device is not healthy. All clients connected to
> the ddn_home file system got stuck, and the MDS server started to refuse
> client connections and, after some time, to evict clients. Can someone
> help me get to the bottom of this problem? Log snippets from the MDS and
> from one client are attached below.
>
> Lustre 1.6.6
> RHEL4
> 2.6.9-67.0.22.EL_lustre.1.6.6smp
> 600 clients
> 24 OSTs / 4 OSSes

NB - please post plain-text emails to the mailing list.

> Feb 6 11:42:34 mds01 kernel: LustreError:
> 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock
> timed out (enqueued at 1233920454, 100s ago); not entering recovery in
> server code, just going back to sleep ns: mds-ddn_home-

It looks like the MDS is stuck in a lock deadlock of some kind. It is worthwhile to get a stack trace (sysrq-t), capture any debug logs dumped by the initial watchdog timer, and file a bug. If the debug logs do not contain "dlmtrace" then it may be very hard to debug this problem. The MDS will need to be rebooted in any case, which should resolve the problem for the time being.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
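For anyone who hits the same problem, a rough sketch of those diagnostic steps on the MDS follows. It assumes the 1.6-era /proc interfaces, the "+flag" syntax for the lnet debug mask, and the lctl debug_file (df) sub-command; exact paths and syntax may vary by release:

  # 1. Make sure DLM lock tracing is in the debug mask before the next hang;
  #    without dlmtrace the dumped logs are of little use for a lock deadlock.
  echo +dlmtrace > /proc/sys/lnet/debug

  # 2. Decode the binary dumps the watchdog already left in /tmp (for example
  #    the file named in the log above) so they can be attached to the bug.
  lctl df /tmp/lustre-log.1233920554.19471 /tmp/lustre-log.1233920554.19471.txt

  # 3. Take a stack trace of all tasks (sysrq-t) while the MDS threads are
  #    stuck, then save the kernel ring buffer.
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger
  dmesg > /tmp/mds01-sysrq-t.txt

After collecting these, reboot the MDS as suggested and attach the decoded debug logs and the sysrq-t output to the bug report.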