Frederik Ferner
2009-Nov-10 13:18 UTC
[Lustre-discuss] MDS LBUG: mds_inode_is_orphan(dchild->d_inode)) failed:dchild
Hi,

occasionally we run into an LBUG on our MDS (ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed:dchild)[1].

A quick search revealed an old post from May[2] and at least one bug that may be related (https://bugzilla.lustre.org/show_bug.cgi?id=16492, which is against 1.6.5, or https://bugzilla.lustre.org/show_bug.cgi?id=17764, against 1.4.X with reports for other versions as well).

We are currently running 1.6.6 on RHEL 5 on the MDS, and I'm not sure I understand the state of these bugs. Can someone help me work out whether this bug should be fixed in a later version of 1.6.X? The changelog[3] seems to suggest that at least 16492 is, but it mentions that bug number in two places, which slightly confuses me. The second description, however, might fit the problem we see.

Before I attempt to upgrade our MDS, I would like to know if the bug is fixed in 1.6.7.2.

Kind regards,
Frederik

[1] Syslog has the following entries for the LBUG:

Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: LustreError: 20109:0:(mds_open.c:1156:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed:dchild 71229c0:d0c22313 (ffff8103065c8150) inode ffff810391a98528/118630848/3502383891
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: LustreError: 20109:0:(mds_open.c:1156:mds_open()) LBUG
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: Lustre: 20109:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 20109
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: ll_mdt_25 R running task 0 20109 1 20110 20108 (L-TLB)
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: 0000000000000000 ffffffff8006d940 ffff81020200f140 ffffffff8869877d
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: ffff810205fa3000 ffff810205fa30e8 ffff8101ffe39480 ffffffff88696476
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: ffff810205fa3190 0000000000000000 ffff8101fb0e5e10 ffffffff800893bb
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: Call Trace:
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff8006d940>] do_gettimeofday+0x50/0x92
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff88696476>] :libcfs:lcw_update_time+0x16/0x100
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff800893bb>] __wake_up_common+0x3e/0x68
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff887ea22c>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff800b4610>] audit_syscall_exit+0x31b/0x336
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff887e9420>] :ptlrpc:ptlrpc_main+0x0/0xf90
Nov 10 09:51:19 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Followed by a watchdog:

Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 20109: it was inactive for 200s
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 20109
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: ll_mdt_25 D ffff810428e4a5a8 0 20109 1 20110 20108 (L-TLB)
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: ffff8101fb0e5700 0000000000000046 0000000000000000 ffffffff80450560
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: ffff8101fb0e56c0 000000000000000a ffff8102030bd7e0 ffff8102471be040
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: 0009bf805a7afba6 0000000000002403 ffff8102030bd9c8 0000000300000484
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: Call Trace:
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8868dc6b>] :libcfs:lbug_with_loc+0xbb/0xc0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88ab0eb7>] :mds:mds_open+0x2017/0x332e
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff800893bb>] __wake_up_common+0x3e/0x68
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8885d2d1>] :ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88a8d8c9>] :mds:mds_reint_rec+0x1d9/0x2b0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88ab4133>] :mds:mds_open_unpack+0x2f3/0x410
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88a8091a>] :mds:mds_reint+0x35a/0x420
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88a7efa2>] :mds:fixup_handle_for_resent_req+0x52/0x200
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88a84974>] :mds:mds_intent_policy+0x484/0xc30
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff886db50c>] :lnet:LNetMDBind+0x2ac/0x400
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887a4156>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887a1916>] :ptlrpc:ldlm_lock_enqueue+0x186/0x990
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8879e73d>] :ptlrpc:ldlm_lock_create+0x9ad/0x9e0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887c34d0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5c0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887c0de5>] :ptlrpc:ldlm_handle_enqueue+0xca5/0x12a0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887c3a90>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88a89155>] :mds:mds_handle+0x4035/0x4cf0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>] __next_cpu+0x19/0x28
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff800756b4>] smp_send_reschedule+0x4e/0x53
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8873c031>] :obdclass:class_handle2object+0xd1/0x160
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887db705>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887e50da>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887e72c2>] :ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8006d940>] do_gettimeofday+0x50/0x92
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff88696476>] :libcfs:lcw_update_time+0x16/0x100
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff800893bb>] __wake_up_common+0x3e/0x68
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887ea22c>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff800b4610>] audit_syscall_exit+0x31b/0x336
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff887e9420>] :ptlrpc:ptlrpc_main+0x0/0xf90
Nov 10 09:54:39 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

[2] http://lists.lustre.org/pipermail/lustre-discuss/2009-May/010469.html
[3] http://lists.lustre.org/pipermail/lustre-discuss/2009-May/010469.html

--
Frederik Ferner
Computer Systems Administrator          phone: +44 1235 77 8624
Diamond Light Source Ltd.               mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)