I seem to have generated some kind of problem with my test Lustre fs. I'm using it for, among other things, temp storage for the gentoo portage tools on my test system, mostly because portage does all kinds of thrashing of the filesystem while it's running.

I've got one particular package which, when I try to emerge it, gets to one particular step, then gets killed. Digging down into the code, the proximate cause of the crash appears to be that in namei.c, ll_lookup_it, parent->i_sb->s_fs_info is NULL. This causes a null pointer dereference in kernel space.

However, looking further, just prior to the sequence which leads up to the null pointer, there's a pair of messages in the log:

Sep 12 09:47:55 localhost Lustre: client umount complete
Sep 12 09:47:55 localhost VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

which look pretty alarming.

So, my questions: Has anybody ever seen this kind of thing? Am I right in assuming that the bad s_fs_info is likely due to something having attempted to umount the filesystem? And finally, what are likely causes for the fs to spontaneously umount like that?

Thanks in advance...
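For anyone reading along without the Lustre source handy: s_fs_info is the VFS's generic per-superblock private-data pointer, and the crash report implies Lustre keeps its per-mount state there. A minimal illustrative sketch of what the faulting access amounts to (not the actual Lustre code; the helper name here is made up):

    /* Illustrative only -- not the Lustre source.  The helper name is
     * hypothetical; the point is that once the superblock has been torn
     * down, s_fs_info is NULL and any access through it oopses. */
    #include <linux/fs.h>
    #include <linux/err.h>

    static void *example_sb_private(struct inode *parent)
    {
            struct super_block *sb = parent->i_sb;

            if (!sb->s_fs_info)              /* fs already (being) unmounted */
                    return ERR_PTR(-ENODEV); /* a defensive check would bail here */

            return sb->s_fs_info;            /* normal case: per-mount info block */
    }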
Hi,

Could you tell us what kernel, patch series and Lustre version you are running on client and servers? This may well be a new problem that we haven't seen yet.

Thanks!

- Peter
From: "Peter J. Braam" <braam@clusterfs.com> Date: Tue, 12 Sep 2006 10:33:57 -0400 Hi Could you tell us what kernel, patch series and Lustre version you are running on client and servers? 2.6.15 vanilla. The patch series that I posted a few weeks ago (derived from rhel-fc5) Lustre 1.6b4, plus a couple of fixes that I believe are slated for next release (for instance the paper-over for special xattrs) The client in question and the servers are all running that identical code base, with the exception that I''ve added a bunch of extra printk''s to the client while debugging it. This may well be a new problem that we haven''t seen yet. My current best guess is that the pointer to the info block being null is actually a side effect of the funny message about something attempting to unmount the fs while things are in flight. If you think that makes sense, then the question becomes what would cause the kernel to unmount it at that time? I''m not enough of a vfs wizard to have a good idea about that one yet. Any hints appreciated :-}
   From: jrd@jrd.org
   Date: Tue, 12 Sep 2006 10:49:25 -0400

   My current best guess is that the pointer to the info block being null is
   actually a side effect of the funny message about something attempting to
   unmount the fs while things are in flight.

Further info: It appears that what's happening is that the user app is exiting, and the exit code is going through its litany of cleanups. It calls deactivate_super, which seems to be taking the path where it concludes that this is the last reference to this superblock, and therefore the fs should be unmounted. Not clear yet why it thinks there are no more references to this sb.
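For reference, here is a rough, from-memory sketch of the 2.6.x deactivate_super() in fs/super.c (simplified; check the actual tree for the exact code). The relevant point is that once s_active drops to zero the VFS tears the filesystem down via kill_sb(), which is what would produce the "client umount complete" and "Busy inodes after unmount" messages while other users are still in flight:

    /* Rough sketch of 2.6.x fs/super.c:deactivate_super() -- simplified,
     * from memory; verify against your kernel tree before relying on it. */
    void deactivate_super(struct super_block *s)
    {
            struct file_system_type *fs = s->s_type;

            if (atomic_dec_and_lock(&s->s_active, &sb_lock)) {
                    /* s_active hit zero: the VFS thinks this was the last
                     * user, so it tears the filesystem down right here. */
                    s->s_count -= S_BIAS - 1;
                    spin_unlock(&sb_lock);
                    down_write(&s->s_umount);
                    fs->kill_sb(s);        /* -> Lustre's client umount path */
                    put_filesystem(fs);
                    put_super(s);
            }
    }

So the interesting question is why s_active reached zero while lookups were still going on.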
I had a similar problem (I think). After installing beta5 yesterday on 2.6.12 (server) and 2.6.16.28 (client), it runs stable with no additional patches (+ vserver on the client).
Managed to kill it with:

cd /mnt/lustremount/somedir
mkdir t
ln -s t t
cd t
rm t

(after this it dies)
From: "Konrad Gutkowski" <kgutkowski@wi.ps.pl> Date: Wed, 13 Sep 2006 02:03:27 +0200 managed to kill it with: cd /mnt/lustremount/somedir mkdir t ln -s t t cd t rm t (after this it dies) Interesting! That kills mine too, but it''s not clear from the logs that it''s the same root cause as the case I was chasing. For instance, it doesn''t appear to be doing the spurious unmount. However, it is generating a BUG (below) which suggests that there may be a reference-counting problem happening, so it could be related. Here''s the BUG: Sep 13 09:29:25 localhost ----------- [cut here ] --------- [please bite here ] --------- Sep 13 09:29:25 localhost Kernel BUG at include/linux/dcache.h:306 Sep 13 09:29:25 localhost invalid operand: 0000 [1] SMP Sep 13 09:29:25 localhost CPU 0 Sep 13 09:29:25 localhost Modules linked in: osc mgc lustre lov mdc ksocklnd ptlrpc obdclass lnet lvfs libcfs Sep 13 09:29:25 localhost Pid: 8301, comm: bash Not tainted 2.6.15-sc-lustre-1.6b4-devo #5 Sep 13 09:29:25 localhost RIP: 0010:[<ffffffff8018d647>] <ffffffff8018d647>{path_lookup+423} Sep 13 09:29:25 localhost RSP: 0018:ffff810036167db8 EFLAGS: 00010246 Sep 13 09:29:25 localhost RAX: 0000000000000000 RBX: ffff810036167e18 RCX: 0000000000000fff Sep 13 09:29:25 localhost RDX: ffff8100358e65b8 RSI: 0000000000000001 RDI: ffff81003ff36744 Sep 13 09:29:25 localhost RBP: 0000000000000001 R08: 0000000000000001 R09: ffff810036167e78 Sep 13 09:29:25 localhost R10: 0000000000000000 R11: 0000000000000246 R12: ffff81003a634000 Sep 13 09:29:25 localhost R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000002 Sep 13 09:29:25 localhost FS: 0000000000dcfae0(0000) GS:ffffffff8064b800(0000) knlGS:0000000000000000 Sep 13 09:29:25 localhost CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Sep 13 09:29:25 localhost CR2: 00000000005c73ec CR3: 0000000036171000 CR4: 00000000000006e0 Sep 13 09:29:25 localhost Process bash (pid: 8301, threadinfo ffff810036166000, task ffff810037dd23c0) Sep 13 09:29:25 localhost Stack: 0000000000495bb2 ffff81003a634000 ffff810036167e18 0000000000495bb2 Sep 13 09:29:25 localhost 000000003a634000 ffffffff8018db29 0000000000495bb2 ffff810036167e78 Sep 13 09:29:25 localhost 00000000005c7400 ffff810036167ef8 Sep 13 09:29:25 localhost Call Trace:<ffffffff8018db29>{__user_walk_it+105} <ffffffff80187550>{vfs_stat+96} Sep 13 09:29:25 localhost <ffffffff802628c1>{__up_read+33} <ffffffff80120879>{do_page_fault+1161} Sep 13 09:29:25 localhost <ffffffff80187a4f>{sys_newstat+31} <ffffffff8017dc34>{vfs_read+308} Sep 13 09:29:25 localhost <ffffffff8010e961>{error_exit+0} <ffffffff8010dc26>{system_call+126} I don''t know what more to make of that one yet. Have you debugged this any further? I''ve been looking around in the code and nothing is jumping out at me yet.
   From: jrd@jrd.org
   Date: Wed, 13 Sep 2006 09:56:30 -0400

   From: "Konrad Gutkowski" <kgutkowski@wi.ps.pl>
   Date: Wed, 13 Sep 2006 02:03:27 +0200

   Managed to kill it with:

   cd /mnt/lustremount/somedir
   mkdir t
   ln -s t t
   cd t
   rm t

   (after this it dies)

   Interesting! That kills mine too, but it's not clear from the logs that it's
   the same root cause as the case I was chasing. For instance, it doesn't
   appear to be doing the spurious unmount. However, it is generating a BUG
   (below) which suggests that there may be a reference-counting problem
   happening, so it could be related.

Further information on this, and a question:

I constructed a test case using your recipe, and one that ought to have been close, but using a different way of creating the symlink (create it under a different name, then rename it). I instrumented the code a bit, and tried comparing the two scenarios to try to get a handle on what's different in the case which crashes vs the case which doesn't.

What appears to be happening is that after rm'ing the symlink, in response to doing an ls in the directory in which it used to live, we are handling an ldlm request, which ends up calling down into ll_mdc_blocking_ast. That function does some other stuff, then ends up (at namei.c:240, more or less) traversing over the dentry which represents the containing dir (somedir/t, in your example). I'm not sure I'm reading the code correctly, but it appears to be removing unused entries? In any event, in both the working and failing cases, it encounters the dentry representing the symlink, but in the failing case, that dentry's inode is NULL, whereas in the working case, it points to something.

In the code at that point there's a comment reading

    /* XXX Print some debug here? */

I infer from that that somebody thought that path through the code might indicate a problem; it certainly seems to in the case I've been working on.

So the question for any CFS developer types reading this list: Should it be the case that at that point in namei.c, the dentries in question shouldn't have NULL inodes? If there is one with a NULL inode, is it possible to make any inferences about where that might have happened, and what should have happened instead?

I'm not much of a VFS wizard, and am learning this code as I go along, so I'm not at all sure I know what to expect to see in here. Thanks in advance...
Hello!

On Fri, Sep 15, 2006 at 11:44:20AM -0400, jrd@jrd.org wrote:

> What appears to be happening is that after rm'ing the symlink, in response to
> doing an ls in the directory in which it used to live, we are handling an ldlm
> request, which ends up calling down into ll_mdc_blocking_ast. That function
> does some other stuff, then ends up (at namei.c:240, more or less) traversing
> over the dentry which represents the containing dir (somedir/t, in your
> example). I'm not sure I'm reading the code correctly, but it appears to be
> removing unused entries? In any event, in both the working and failing cases,

The code lustre-invalidates (for lack of a better term) all dentries of that inode (ll_unhash_aliases()). Also another bit of (pretty new, 1.4.7-specific) code only works if the inode represents a directory. It walks all the cached children of the directory and drops all negative dentries (dentries with NULL inode).

> it encounters the dentry representing the symlink, but in the failing case,
> that dentry's inode is NULL, whereas in the working case, it points to
> something.
> In the code at that point there's a comment reading
> /* XXX Print some debug here? */

Yes, this is the negative dentry removal code.

> So the question for any CFS developer types reading this list: Should it be
> the case that at that point in namei.c, the dentries in question shouldn't
> have NULL inodes? If there is one with a NULL inode, is it possible to make
> any inferences about where that might have happened, and what should have
> happened instead?

A dentry with a NULL inode means this is a negative dentry, i.e. a dentry that got ENOENT from the server. Of course it is pretty strange to have a negative dentry for an existing symlink.

The crash kernel BUG dump you posted earlier seems to have nothing to do with this code, though, unless you think it somehow managed to screw up the dentry refcount and we hit the BUG_ON() later doing dget on it.

I'd like to say that I tried your script with a RHEL4 kernel and did not see any problem. I will try with a more modern kernel later. Meanwhile, I wonder if you can see the problem with a CFS-released patchset? If you take the latest 1.6 beta, there should be fc5 (2.6.15?) and sles10 (2.6.16-based) series.

Bye,
    Oleg
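To make the "drop all negative dentries" step concrete, here is an illustrative sketch only (not the actual Lustre code, and the child-list field name varies across 2.6 kernels, d_child vs. d_u.d_child) of what walking a directory's cached children and unhashing the negative ones amounts to in VFS terms:

    /* Illustrative sketch, not the Lustre source.  A negative dentry is
     * simply one whose d_inode is NULL; "dropping" it means unhashing it
     * from the dcache so later lookups go back to the server. */
    #include <linux/dcache.h>
    #include <linux/list.h>

    static void drop_negative_children(struct dentry *dir)
    {
            struct dentry *child;

            spin_lock(&dcache_lock);        /* 2.6.x-era global dcache lock */
            list_for_each_entry(child, &dir->d_subdirs, d_child) {
                    if (child->d_inode == NULL)     /* negative dentry */
                            __d_drop(child);        /* unhash it */
            }
            spin_unlock(&dcache_lock);
    }

The surprising part in the failing case is that the dentry for a symlink that still exists on the MDS shows up as negative here.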
   From: Oleg Drokin <green@clusterfs.com>
   Date: Sat, 16 Sep 2006 23:27:22 +0300

   Hello!

   On Fri, Sep 15, 2006 at 11:44:20AM -0400, jrd@jrd.org wrote:

   > What appears to be happening is that after rm'ing the symlink, in response to
   > doing an ls in the directory in which it used to live, we are handling an ldlm
   > request, which ends up calling down into ll_mdc_blocking_ast. That function
   > does some other stuff, then ends up (at namei.c:240, more or less) traversing
   > over the dentry which represents the containing dir (somedir/t, in your
   > example). I'm not sure I'm reading the code correctly, but it appears to be
   > removing unused entries? In any event, in both the working and failing cases,

   The code lustre-invalidates (for lack of a better term) all dentries of that
   inode (ll_unhash_aliases()).
   Also another bit of (pretty new, 1.4.7-specific) code only works if the inode
   represents a directory. It walks all the cached children of the directory
   and drops all negative dentries (dentries with NULL inode).

Ok.

   > it encounters the dentry representing the symlink, but in the failing case,
   > that dentry's inode is NULL, whereas in the working case, it points to
   > something.
   > In the code at that point there's a comment reading
   > /* XXX Print some debug here? */

   Yes, this is the negative dentry removal code.

Ok. What's the significance of that comment? Just that we're removing negative dentries?

Also, note that in the case that works, we get to the "identical" place in the code, and the dentry for the symlink is *not* negative. The failing case is

    mkdir foo
    ln -s foo foo
    cd foo
    rm foo
    ls

and the working case is anything like

    mkdir foo
    ln -s foo foo/bar
    cd foo
    mv bar foo
    rm foo
    ls

So far it's not clear to me why, at the "rm foo", those two scenarios should be doing different things at that point in the code. But as I've said, I'm really not a VFS wizard, I'm learning this stuff as I go along :-{

   > So the question for any CFS developer types reading this list: Should it be
   > the case that at that point in namei.c, the dentries in question shouldn't
   > have NULL inodes? If there is one with a NULL inode, is it possible to make
   > any inferences about where that might have happened, and what should have
   > happened instead?

   A dentry with a NULL inode means this is a negative dentry, i.e. a dentry that
   got ENOENT from the server. Of course it is pretty strange to have a negative
   dentry for an existing symlink.

   The crash kernel BUG dump you posted earlier seems to have nothing to do
   with this code, though, unless you think it somehow managed to screw up the
   dentry refcount and we hit the BUG_ON() later doing dget on it.

I don't really know. It's entirely possible that there are two bugs. I started spending time on this one because both this one and the other one seemed to have to do with reference-counting things, and it seemed plausible that they have the same root cause. Also, this recipe for reproducing is many many times simpler than the other one.

   I'd like to say that I tried your script with a RHEL4 kernel and did not
   see any problem. I will try with a more modern kernel later.
   Meanwhile, I wonder if you can see the problem with a CFS-released patchset?
   If you take the latest 1.6 beta, there should be fc5 (2.6.15?) and
   sles10 (2.6.16-based) series.

This recipe was reported by Konrad Gutkowski <kgutkowski@wi.ps.pl>; it sounded like he was running stock kernels and patchsets. Konrad, can you confirm?

If need be I can dig out and build up a RHEL or something kernel. I don't have anything in house running either RHEL or a modern SUSE, but I expect it will exhibit the same behaviour if I can come up with one of those kernels and run my gentoo system with it.
I used vanilla 2.6.12.6 on the server and 2.6.16.28 on the client. The server was patched with the vanilla series; the client wasn't patched. When I changed to 2.6.12.6 on the client, the problem vanished.
From: "Konrad Gutkowski" <kgutkowski@wi.ps.pl> Date: Sun, 17 Sep 2006 13:57:14 +0200 i used vanilla 2.6.12.6 on the server and 2.6.16.28 on the client server was patched with vanilla series, client wasnt patched when i changed to 2.6.12.6 on the client the problem vanished Interesting. The fact that you got the problem with an unpatched 2.6.16 client suggests that the problem may be something generic in modern kernels, or at least that something between 12 and 16 changed in such a way as to expose a discrepancy in how the lustre server and client interact. The stuff I''m running started life as the rhel-fc5 patchset, 2.6.15. Unfortunately, that doesn''t really narrow it down a lot, as there appears to have been a fair bit of thrash in the vfs code even between 12 and 15. Oleg, shall I assume this means it''s not interesting for me to try to construct an rhel kernel and try it again there? ----- Original Message ----- From: "John R. Dunning" <jrd@jrd.org> To: "Oleg Drokin" <green@clusterfs.com> Cc: <jrd@jrd.org>; <lustre-discuss@clusterfs.com> Sent: Saturday, September 16, 2006 11:47 PM Subject: [Lustre-discuss] Odd problem with a lustrefs > From: Oleg Drokin <green@clusterfs.com> > Date: Sat, 16 Sep 2006 23:27:22 +0300 > > Hello! > > On Fri, Sep 15, 2006 at 11:44:20AM -0400, jrd@jrd.org wrote: > > > What appears to be happening is that after rm''ing the symlink, in > response to > > doing an ls in the directory in which it used to live, we are > handling an ldlm > > request, which ends up calling down into ll_mdc_blocking_ast. That > function > > does some other stuff, then ends up (at namei.c:240, more or less) > traversing > > over the dentry which represents the containing dir (somedir/t, in > your > > example). I''m not sure I''m reading the code correctly, but it > appears to be > > removing unused entries? In any event, in both the working and > failing cases, > > The code lustre-invalidates (for lack of better term) all dentries of > that inode > (ll_unhash_aliases()). > Also another bit of (pretty new, 1.4.7 specific) code only works if > inode > represents a directory. It walks all the children of the directory > cached > and drops all negative dentries (dentries with NULL inode). > > Ok. > > > it encounters the dentry representing the symlink, but in the failing > case, > > that dentry''s inode is NULL, whereas in the working case, it points > to > > something. > > In the code at that point there''s a comment reading > > /* XXX Print some debug here? */ > > Yes, this is negative dentry removal code. > > Ok. What''s the significance of that comment? Just that we''re > removing negative dentries? > > Also, note that in the case that works, we get to the "identical" > place in the code, and the dentry for the symlink is *not* negative. > The failing case is > > mkdir foo > ln -s foo foo > cd foo > rm foo > ls > > and the working case is anything like > > mkdir foo > ln -s foo foo/bar > cd foo > mv bar foo > rm foo > ls > > So far it''s not clear to me why, at the "rm foo", those two scenarios > should be doing different things at that point in the code. But as > I''ve said, I''m really not a vfs wizard, I''m learning this stuff as I > go along :-{ > > > So the question for any CFS developer types reading this list: > Should it be > > the case that at that point in namei.c, the dentries in question > shouldn''t > > have NULL inodes? 