Frederik Ferner
2011-Jul-08 13:26 UTC
[Lustre-discuss] MDS LBUG: ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed
All,

we are experiencing what looks like the same MDS LBUG with increasing frequency; see below for a sample stack trace. This seems to affect only one client at a time, and even this client will recover after some time (usually minutes, but sometimes longer) and continue to work without requiring an immediate MDS reboot.

In the recent past, it seems to have affected one specific client more often than others. This client mainly acts as an NFS exporter for the Lustre file system. All attempts to trigger the LBUG with known actions have been unsuccessful so far. Attempts to trigger it on the test file system have been equally unsuccessful, but we are still working on this.

As far as I can see, this could be this bug: https://bugzilla.lustre.org/show_bug.cgi?id=17764, but there has been no recent activity there, and I'm not entirely sure it is the same bug.

As far as I can see, the log dumps don't contain any useful information, but I'm happy to provide a sample file if someone offers to look at it.

I'm also looking for suggestions on how to go about debugging this problem, ideally initially with as little performance impact as possible, so that we might apply it on the production system until we can reproduce the problem on a test file system. Once we can reproduce it on the test file system, debugging with performance implications should be possible as well.
The MDS and clients are currently running Lustre 1.8.3.ddn3.3 on Red Hat Enterprise 5.

> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 8ad94b2:0cae8d46 (ffff8101995b0300) inode ffff81041d4e8548/145593522/21276602 2
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) LBUG
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Lustre: 4037:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 4037
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ll_mdt_49 R running task 0 4037 1 4038 4036 (L-TLB)
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ffff810226da0d00 ffff810247120000 0000000000000286 0000000000000082
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: 0000008100001400 ffff8101db219ef8 0000000000000001 0000000000000001
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ffff8101ead74db8 0000000000000000 ffff810423223e10 ffffffff8008aee7
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Call Trace:
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8008aee7>] __wake_up_common+0x3e/0x68
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff887acee8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8008cabd>] default_wake_function+0x0/0xe
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff800b7310>] audit_syscall_exit+0x336/0x362
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff887abc90>] :ptlrpc:ptlrpc_main+0x0/0x1420
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel:
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: dumping log to /tmp/lustre-log.1309949645.4037

Kind regards,
Frederik

-- 
Frederik Ferner                 phone: +44 1235 77 8624
Computer Systems Administrator  mob:   +44 7917 08 5110
Diamond Light Source Ltd.

(Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.)

-- 
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
Frederik Ferner
2011-Aug-23 13:44 UTC
[Lustre-discuss] MDS LBUG: ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed
All,

I'd like to follow up on this, as I can now repeatedly reproduce the LBUG on our test file system. I've managed to reproduce it on every version up to Lustre 1.8.6-wc1 on the MDS that I've tried so far. I've also reported it as LU-534 (http://jira.whamcloud.com/browse/LU-534) and included current stack traces etc. there.

To repeat the basic reproduction steps here: export a Lustre file system via NFS(v3) from a Lustre client, mount it on one other system over NFS, and run racer on the file system over NFS; after a few minutes (sometimes one or two hours), the MDS LBUGs with the ASSERTION in the subject.

If anyone has any suggestions for debug flags to enable, or other ideas on how to track down the exact problem, I'd like to hear them.

Kind regards,
Frederik

On 08/07/11 14:26, Frederik Ferner wrote:
> All,
>
> we are experiencing what looks like the same MDS LBUG with increasing
> frequency, see below for a sample stack trace. This seems to affect only
> one client at a time and even this client will recover after some time
> (usually minutes but sometimes longer) and continue to work even without
> requiring immediate MDS reboots.
>
> In the recent past, it seems to have affected one specific client more
> often than others. This client is mainly a NFS exporter for the Lustre
> file system. All attempts to trigger the LBUG with known actions have
> been unsuccessful so far. Attempts to trigger it on the test file system
> have equally not been successful but we are still working on this.
>
> As far as I can see, this could be this bug
> https://bugzilla.lustre.org/show_bug.cgi?id=17764 but there has been no
> recent activity. And I'm not entirely sure this is the same bug.
>
> As far as I can see the log dumps don't contain any useful information,
> but I'm happy to provide a sample file if someone offers to look at it.
-- 
Frederik Ferner                 phone: +44 1235 77 8624
Computer Systems Administrator  mob:   +44 7917 08 5110
Diamond Light Source Ltd.
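For reference, the reproduction recipe described in the follow-up can be sketched as the commands below. This is a minimal sketch, not Frederik's exact setup: the hostnames, file system name, mount points, and export options are illustrative assumptions; `racer.sh` is the racer stress test shipped with Lustre in `lustre/tests/racer`.

```shell
# On a Lustre client that will act as the NFS server.
# (Hostnames "mds01", "nfsserver" and fsname "testfs" are assumptions.)
mount -t lustre mds01@tcp:/testfs /mnt/lustre

# Export the Lustre mount over NFSv3 to the second machine.
# (Export options here are illustrative, not the reported configuration.)
exportfs -o rw,no_root_squash nfsclient:/mnt/lustre

# On the other system ("nfsclient"), mount the export over NFSv3:
mount -t nfs -o vers=3 nfsserver:/mnt/lustre /mnt/nfs

# Run racer against the NFS mount and wait; per the report the MDS
# hits the ASSERTION after minutes up to one or two hours.
sh /usr/lib64/lustre/tests/racer/racer.sh /mnt/nfs
```

The key point of the recipe is that racer's concurrent create/rename/unlink/open activity arrives at the MDS indirectly, via the NFS re-export path, which is what appears to race an open against orphan processing in `mds_open()`.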