Thomas Roth
2008-Oct-13 18:21 UTC
[Lustre-discuss] LBUG mds_reint.c, questions about recovery time
Hi all,

I just ran into an LBUG on an MDS still running Lustre version 1.6.3 with kernel 2.6.18, Debian Etch; see the kern.log excerpt below. You will probably tell me that this is a known bug, already fixed or about to be fixed (I'm unsure how to search for such a thing in Bugzilla).

But my main question concerns the subsequent recovery. It seems to have worked fine; however, it took two hours. What influences the recovery time?

During this period I was watching /proc/fs/lustre/mds/lustre-MDT0000/recovery_status. It continually showed a remaining time of around 2100 sec, fluctuating between 2400 and 1900, until the last 10 minutes or so, when the time really went down. So is this just a rough guess by Lustre as to what the remaining recovery time might be?

recovery_status also showed 346 connected clients, of which 146 had finished long before, the others obviously not. I wanted to be very clever and manually unmounted Lustre on a number of our batch nodes which were not using Lustre at that time. This neither changed the reported number of connected clients nor had any perceptible effect on the recovery.

-----------------------------------------------------------------------------------------------------------------------------------
Oct 13 17:10:58 kernel: LustreError: 9132:0:(mds_reint.c:1512:mds_orphan_add_link()) ASSERTION(inode->i_nlink == 1) failed:dir nlink == 0
Oct 13 17:10:58 kernel: LustreError: 9132:0:(mds_reint.c:1512:mds_orphan_add_link()) LBUG
Oct 13 17:10:58 kernel: Lustre: 9132:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 9132
Oct 13 17:10:58 kernel: ll_mdt_77 R running 0 9132 1 9133 9131 (L-TLB)
Oct 13 17:10:58 kernel: e14eb98c 00000046 55c3b8b1 000016ab 0000006e 0000000a c084b550 e1abeaa0
Oct 13 17:10:58 kernel: 8d1b6f09 001aedee 0000c81c 00000001 c0116bb3 dffcc000 ea78f060 c02cbab0
Oct 13 17:10:58 kernel: dffcc000 00000082 c0117c15 0013fa7b 00000000 00000001 3638acd3 00003931
Oct 13 17:10:58 kernel: Call Trace:
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c011de22>] printk+0x14/0x18
Oct 13 17:10:58 kernel: [<c0136851>] __print_symbol+0x9f/0xa8
Oct 13 17:10:58 kernel: [<c0116bb3>] task_rq_lock+0x31/0x58
Oct 13 17:10:58 kernel: [<c0117c15>] try_to_wake_up+0x355/0x35f
Oct 13 17:10:58 kernel: [<c01166f5>] __wake_up_common+0x2f/0x53
Oct 13 17:10:58 kernel: [<c0116b46>] __wake_up+0x2a/0x3d
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c011d854>] release_console_sem+0x1b4/0x1bc
Oct 13 17:10:58 kernel: [<c012c6d8>] __kernel_text_address+0x18/0x23
Oct 13 17:10:58 kernel: [<c0103b62>] show_trace_log_lvl+0x47/0x6a
Oct 13 17:10:58 kernel: [<c0103c13>] show_stack_log_lvl+0x8e/0x96
Oct 13 17:10:58 kernel: [<c0104107>] show_stack+0x20/0x25
Oct 13 17:10:58 kernel: [<fa1bef79>] lbug_with_loc+0x69/0xc0 [libcfs]
Oct 13 17:10:58 kernel: [<fa689448>] mds_orphan_add_link+0xcb8/0xd20 [mds]
Oct 13 17:10:58 kernel: [<fa69c87a>] mds_reint_unlink+0x292a/0x3fd0 [mds]
Oct 13 17:10:58 kernel: [<fa3ac990>] lustre_swab_ldlm_request+0x0/0x20 [ptlrpc]
Oct 13 17:10:58 kernel: [<fa688495>] mds_reint_rec+0xf5/0x3f0 [mds]
Oct 13 17:10:58 kernel: [<fa39f788>] ptl_send_buf+0x1b8/0xb00 [ptlrpc]
Oct 13 17:10:58 kernel: [<fa66bfeb>] mds_reint+0xcb/0x8a0 [mds]
Oct 13 17:10:58 kernel: [<fa67f998>] mds_handle+0x3048/0xb9df [mds]
Oct 13 17:10:58 kernel: [<fa4ac402>] LNetMEAttach+0x142/0x4a0 [lnet]
Oct 13 17:10:58 kernel: [<fa2dcd91>] class_handle_free_cb+0x21/0x190 [obdclass]
Oct 13 17:10:58 kernel: [<c0124d83>] do_gettimeofday+0x31/0xce
Oct 13 17:10:58 kernel: [<fa2dc06b>] class_handle2object+0xbb/0x2a0 [obdclass]
Oct 13 17:10:58 kernel: [<fa3aca00>] lustre_swab_ptlrpc_body+0x0/0xc0 [ptlrpc]
Oct 13 17:10:58 kernel: [<fa3a9b5a>] lustre_swab_buf+0xfa/0x180 [ptlrpc]
Oct 13 17:10:58 kernel: [<c0125aac>] lock_timer_base+0x15/0x2f
Oct 13 17:10:59 kernel: [<c0125bbd>] __mod_timer+0x99/0xa3
Oct 13 17:10:59 kernel: [<fa3a6efe>] lustre_msg_get_conn_cnt+0xce/0x220 [ptlrpc]
Oct 13 17:10:59 kernel: [<fa3b8e56>] ptlrpc_main+0x2016/0x2f40 [ptlrpc]
Oct 13 17:10:59 kernel: [<c01b6dc0>] __next_cpu+0x12/0x21
Oct 13 17:10:59 kernel: [<c012053d>] do_exit+0x711/0x71b
Oct 13 17:10:59 kernel: [<c0117c1f>] default_wake_function+0x0/0xc
Oct 13 17:10:59 kernel: [<fa3b6e40>] ptlrpc_main+0x0/0x2f40 [ptlrpc]
Oct 13 17:10:59 kernel: [<c0101005>] kernel_thread_helper+0x5/0xb
Oct 13 17:12:38 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 9132: it was inactive for 100s

Cheers,
Thomas
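As a side note, the kind of polling described above can be scripted on the MDS. This is only a rough sketch, assuming the 1.6.x /proc layout used in this post; the exact field names inside recovery_status may differ between Lustre versions:

   # re-read the MDT recovery status every 10 seconds
   watch -n 10 cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status

   # or pull out just the counters of interest (field names assumed)
   grep -E 'status|time_remaining|connected_clients|completed_clients' \
       /proc/fs/lustre/mds/lustre-MDT0000/recovery_status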
Wojciech Turek
2008-Oct-13 23:00 UTC
[Lustre-discuss] LBUG mds_reint.c, questions about recovery time
Thomas Roth wrote:
> But my main question concerns the subsequent recovery. It seems to have
> worked fine; however, it took two hours. What influences the recovery time?
> [...]

Lustre recovery time is 2.5 x the timeout. You can find the timeout by running this command on the MDS:

   cat /proc/sys/lustre/timeout
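To make that concrete, a quick check on the MDS could look like the following sketch (the value of 100 s is only the 1.6 default used for illustration, not a number taken from Thomas's system):

   # read the Lustre timeout (obd_timeout, in seconds)
   cat /proc/sys/lustre/timeout
   # -> e.g. 100

   # rule of thumb from above: recovery window ~ 2.5 x timeout
   #    2.5 * 100 = 250 seconds

A two-hour recovery is much longer than that, which could mean the timeout on that system is set higher than the default, or that the recovery window kept being extended while clients were still reconnecting.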