We are currently trying to stand up a Lustre file system in a system test environment before moving it into production. Twice in the last week the file system has locked up, and the only way to recover was to reboot all attached clients along with the MDS/MDT. We are currently running Lustre 1.8.2.

Here is the LBUG info we are receiving. If there is anything else I can provide to help find the cause, please let me know.

Sep 30 05:30:01 edclxs200 auditd[4529]: Audit daemon rotating log files
Sep 30 15:45:22 edclxs200 kernel: LustreError: 7193:0:(mds_reint.c:1772:mds_orphan_add_link()) ASSERTION(inode->i_nlink == 2) failed: dir nlink == 1
Sep 30 15:45:22 edclxs200 kernel: LustreError: 7193:0:(mds_reint.c:1772:mds_orphan_add_link()) LBUG
Sep 30 15:45:22 edclxs200 kernel: Lustre: 7193:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 7193
Sep 30 15:45:22 edclxs200 kernel: ll_mdt_30 R running task 0 7193 1 7195 7192 (L-TLB)
Sep 30 15:45:22 edclxs200 kernel: ffff810592dd7100 ffff81010ba88000 0000000000000282 0000000000000082
Sep 30 15:45:22 edclxs200 kernel: 0000008100001400 ffff810348753ef8 0000000000000001 0000000000000001
Sep 30 15:45:22 edclxs200 kernel: ffff810345bcc5b8 0000000000000000 ffff810348615e10 ffffffff8008ac95
Sep 30 15:45:22 edclxs200 kernel: Call Trace:
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff8008ac95>] __wake_up_common+0x3e/0x68
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff88703f58>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff88702d00>] :ptlrpc:ptlrpc_main+0x0/0x1420
Sep 30 15:45:22 edclxs200 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Sep 30 15:45:22 edclxs200 kernel:
Sep 30 15:45:22 edclxs200 kernel: LustreError: dumping log to /tmp/lustre-log.1285879522.7193
Sep 30 15:48:42 edclxs200 kernel: Lustre: Service thread pid 7193 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 30 15:48:42 edclxs200 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 7193
Sep 30 15:48:42 edclxs200 kernel: ll_mdt_30 D ffffffff8014e8f3 0 7193 1 7195 7192 (L-TLB)
Sep 30 15:48:42 edclxs200 kernel: ffff8103486157f0 0000000000000046 0000000000000000 ffffffff8006b921
Sep 30 15:48:42 edclxs200 kernel: ffff8103486157b0 0000000000000009 ffff81034df607e0 ffff81034fcd3080
Sep 30 15:48:42 edclxs200 kernel: 000290eac846946f 0000000000001d1b ffff81034df609c8 000000098003bcc8

Rocky
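For anyone triaging a similar hang, the assertion and LBUG lines can be pulled out of syslog mechanically. The following is only a rough sketch: the regular expression is inferred from the message layout in the log excerpt above, and the names `LUSTRE_MSG` and `find_lbugs` are made up for this illustration rather than taken from any Lustre tool.

```python
import re

# Layout inferred from the console messages in this thread (an assumption,
# not an official format description):
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
LUSTRE_MSG = re.compile(
    r"(?P<level>LustreError|Lustre): "
    r"(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def find_lbugs(lines):
    """Return (pid, file, line, function, message) for each ASSERTION or LBUG line."""
    hits = []
    for text in lines:
        m = LUSTRE_MSG.search(text)
        if m is None:
            continue
        msg = m["msg"].strip()
        if msg.startswith("ASSERTION") or msg == "LBUG":
            hits.append((int(m["pid"]), m["file"], int(m["line"]), m["func"], msg))
    return hits


# Three lines copied from the log in this message.
sample = [
    "Sep 30 15:45:22 edclxs200 kernel: LustreError: 7193:0:(mds_reint.c:1772:mds_orphan_add_link()) ASSERTION(inode->i_nlink == 2) failed: dir nlink == 1",
    "Sep 30 15:45:22 edclxs200 kernel: LustreError: 7193:0:(mds_reint.c:1772:mds_orphan_add_link()) LBUG",
    "Sep 30 15:45:22 edclxs200 kernel: Lustre: 7193:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 7193",
]

for hit in find_lbugs(sample):
    print(hit)
```

Run against the full syslog file, this gives the source file, line number, and function of each failed assertion, which is usually the most useful thing to quote when searching the bug tracker.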
From further research it looks as though this is a known problem with open-unlinked directories in 1.8.2, and a fix is attached to bug 22177. Would an upgrade to 1.8.4 be advised?

Thanks again,
Rocky

From: Ronald K Long <rklong at usgs.gov>
To: lustre-discuss at lists.lustre.org
Date: 10/01/2010 06:32 AM
Subject: [Lustre-discuss] Lustre file system crashing
Sent by: lustre-discuss-bounces at lists.lustre.org

[quoted message text and log snipped]
On Fri, Oct 01, 2010 at 07:03:36AM -0500, Ronald K Long wrote:
> From further research it looks as though this is a known problem with
> open-unlinked directories in 1.8.2 and a fix is attached to bug 22177.

Yes, the fix is to increment nlink by 2 instead of 1. If you want to patch 1.8.2, you need to apply attachment 28798 (main fix) and 29030 (just fix a warning printed to the console).

> Would an upgrade to 1.8.4 be advised?

Yes, the fix (with many others) is included in 1.8.4.

HTH

Cheers,
Johann
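To see why incrementing by 1 trips the assertion quoted in the log, here is a toy model of the link-count arithmetic. It is not Lustre source code: the function name `relink_orphan_dir` is hypothetical, and the starting link count of 0 for an open-unlinked directory is an assumption made for the illustration. Only the `i_nlink == 2` check and the "by 2 instead of 1" change come from this thread.

```python
def relink_orphan_dir(nlink, step):
    """Toy model: bump an unlinked directory's link count, then apply the
    same invariant the log reports (a directory must end up with nlink == 2)."""
    nlink += step
    if nlink != 2:
        # Mirrors the wording of the message in the log excerpt.
        raise AssertionError(f"dir nlink == {nlink}")
    return nlink


# Assumed starting state: the directory was removed while still open,
# so its link count has dropped to 0.
print(relink_orphan_dir(0, 2))  # fixed behaviour: the invariant holds

try:
    relink_orphan_dir(0, 1)     # behaviour before the fix
except AssertionError as err:
    print("assertion failed:", err)
```

Under that assumption, a step of 1 leaves the count at 1, which matches the "dir nlink == 1" text in the failed assertion, while a step of 2 satisfies the check.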