thr3ads.net - Lustre discuss - [Lustre-discuss] Kernel oops with dangling symlinks [Oct 2006]

If this information is useful, please help other people find it:
Share via:

Goswin von Brederlow

2006-Oct-12 11:35 UTC

[Lustre-discuss] Kernel oops with dangling symlinks

Hi,

I''m using Lustre 1.4.6 and a vanilla 2.6.15 kernel and dangling
symlinks on lustre cause kernel oopses.

I have a nfs export mounted on /data/nfs, lustre mounted on
/data/lustre (both by the automounter) and a dangling symlink
/data/lustre/nonexistent -> /data/nfs/nonexistent.

When I run ''cat /data/lustre/nonexistent'' (gives a
''No such file or
directory'') the mount count of /data/nfs seem to get reduced one more
time than it is raised. This results in the mount count reaching 0 and
/data/nfs being sort of umounted. fs/super.c:
generic_shutdown_super() gets called and prints ''VFS: Busy inodes
after umount. Self-destruct in 5 seconds. Have a nice day...''

Roughly 5 seconds later I get a kernel oops from some NFS function
on trying to access /data/nfs.


Can anyone reproduce this behaviour or even have a patch for the
problem?

MfG
        Goswin

Oleg Drokin

2006-Oct-12 12:03 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Hello!

On Thu, Oct 12, 2006 at 07:34:52PM +0200, Goswin von Brederlow wrote:
> When I run ''cat /data/lustre/nonexistent'' (gives a
''No such file or
> directory'') the mount count of /data/nfs seem to get reduced one
more
> time than it is raised. This results in the mount count reaching 0 and
> /data/nfs being sort of umounted. fs/super.c:
> generic_shutdown_super() gets called and prints ''VFS: Busy inodes
> after umount. Self-destruct in 5 seconds. Have a nice day...''
> Roughly 5 seconds later I get a kernel oops from some NFS function
> on trying to access /data/nfs.
You must be running on x86_64 or other 64 bit arch on client.
We run into this problem two days ago as well.
The simple workaround for you would be to change return type for ll_follow_link
from int to long (in llite/symlink.c).
Of course this is not a proper fix (which is under work), but it will get rid
of the oops and restore normal operations.

Bye,
    Oleg

Goswin von Brederlow

2006-Oct-13 02:23 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Oleg Drokin <green@clusterfs.com> writes:
> Hello!
>
> On Thu, Oct 12, 2006 at 07:34:52PM +0200, Goswin von Brederlow wrote:
>
>> When I run ''cat /data/lustre/nonexistent'' (gives a
''No such file or
>> directory'') the mount count of /data/nfs seem to get reduced
one more
>> time than it is raised. This results in the mount count reaching 0 and
>> /data/nfs being sort of umounted. fs/super.c:
>> generic_shutdown_super() gets called and prints ''VFS: Busy
inodes
>> after umount. Self-destruct in 5 seconds. Have a nice day...''
>> Roughly 5 seconds later I get a kernel oops from some NFS function
>> on trying to access /data/nfs.
>
> You must be running on x86_64 or other 64 bit arch on client.
> We run into this problem two days ago as well.
> The simple workaround for you would be to change return type for
ll_follow_link
> from int to long (in llite/symlink.c).
> Of course this is not a proper fix (which is under work), but it will get
rid
> of the oops and restore normal operations.
>
> Bye,
>     Oleg
Hi,

I actualy found that problem 5 minutes after my mail when I run the debug
output against the source to see the code path taken for a dangling
symlink in lustre and non lustre.

I noticed that lustre retruns -2 (debug output in from ll_follow_link)
but was quite surprised that __do_follow_link didn''t see that as an
error. The relevant code fragment is

        void *cookie; 
        cookie = dentry->d_inode->i_op->follow_link(dentry, nd); 
        error = PTR_ERR(cookie); 
        if (!IS_ERR(cookie)) { 

and the prototype is

        void * (*follow_link) (struct dentry *, struct nameidata *); 

but thanks to your mail it is pretty clear now. Lustre has

        static int ll_follow_link(struct dentry *dentry, struct nameidata *nd) 

resulting in 0xFFFFFFFE being returned instead of 0xFFFFFFFFFFFFFFFE.

The correct fix is to change the function prototype to void *, not to
long though. The function should also use ERR_PTR(rc) from
<linux/err.h> to convert the error code into what linux expects. Just
to be correct in case the error encoding ever changes.

MfG
        Goswin

Andreas Dilger

2006-Oct-13 02:37 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

On Oct 13, 2006  10:22 +0200, Goswin von Brederlow
wrote:> Oleg Drokin <green@clusterfs.com> writes:
> > The simple workaround for you would be to change return type for
> > ll_follow_link from int to long (in llite/symlink.c).
> > Of course this is not a proper fix (which is under work), but it will
get
> > rid of the oops and restore normal operations.
> 
> The correct fix is to change the function prototype to void *, not to
> long though. The function should also use ERR_PTR(rc) from
> <linux/err.h> to convert the error code into what linux expects. Just
> to be correct in case the error encoding ever changes.
I believe Oleg is aware of that - it''s just that Lustre needs to run on
many older kernels as well.  The proper fix is to have a configure check
for the return type of ->follow_link and add some #ifdefs to the Lustre
code to change the return type appropriately.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Oleg Drokin

2006-Oct-13 05:21 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Hello!

On Fri, Oct 13, 2006 at 10:22:25AM +0200, Goswin von Brederlow wrote:
> The correct fix is to change the function prototype to void *, not to
> long though. The function should also use ERR_PTR(rc) from
> <linux/err.h> to convert the error code into what linux expects. Just
> to be correct in case the error encoding ever changes.
Well, lustre need to compile on a lot of kernels.
The correct fix is to have int for kernels that expect int and void *
for kernels that expect void *. That would be proper fix and that''s
what we plan to include ultimately into our released tree.

Bye,
    Oleg

Goswin von Brederlow

2006-Oct-13 07:00 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Oleg Drokin <green@clusterfs.com> writes:
> Hello!
>
> On Fri, Oct 13, 2006 at 10:22:25AM +0200, Goswin von Brederlow wrote:
>
>> The correct fix is to change the function prototype to void *, not to
>> long though. The function should also use ERR_PTR(rc) from
>> <linux/err.h> to convert the error code into what linux expects.
Just
>> to be correct in case the error encoding ever changes.
>
> Well, lustre need to compile on a lot of kernels.
> The correct fix is to have int for kernels that expect int and void *
> for kernels that expect void *. That would be proper fix and
that''s
> what we plan to include ultimately into our released tree.
>
> Bye,
>     Oleg
Was that changed between 2.4.x and 2.6.x? I only looked at the docs of
the 2.6.15 thats why I came up with void *. No offense ment.

MfG
        Goswin

Oleg Drokin

2006-Oct-13 17:29 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Hello!

On Fri, Oct 13, 2006 at 03:00:14PM +0200, Goswin von Brederlow
wrote:> > Well, lustre need to compile on a lot of kernels.
> > The correct fix is to have int for kernels that expect int and void *
> > for kernels that expect void *. That would be proper fix and
that''s
> > what we plan to include ultimately into our released tree.
> Was that changed between 2.4.x and 2.6.x? I only looked at the docs of
> the 2.6.15 thats why I came up with void *. No offense ment.
This was in fact changed at least twice.
With 2.4 it was returning int and was supposed to actually follow symlink.
With early 2.6 (up to 2.6.14, I think) it returns int and might just
set symlink info in nameidata to be followed separately,
now after 2.6.15 it returns void * with rest of semantic similar to earlier
2.6

Bye,
    Oleg

Qi Yong

2006-Oct-17 23:46 UTC

head link

[Lustre-discuss] Kernel oops with dangling symlinks

Andreas Dilger wrote:> On Oct 13, 2006  10:22 +0200, Goswin von Brederlow wrote:
>> Oleg Drokin <green@clusterfs.com> writes:
>>> The simple workaround for you would be to change return type for
>>> ll_follow_link from int to long (in llite/symlink.c).
>>> Of course this is not a proper fix (which is under work), but it
will get
>>> rid of the oops and restore normal operations.
>> The correct fix is to change the function prototype to void *, not to
>> long though. The function should also use ERR_PTR(rc) from
>> <linux/err.h> to convert the error code into what linux expects.
Just
>> to be correct in case the error encoding ever changes.
> 
> I believe Oleg is aware of that - it''s just that Lustre needs to
run on
> many older kernels as well.  The proper fix is to have a configure check
> for the return type of ->follow_link and add some #ifdefs to the Lustre
> code to change the return type appropriately.
What about the fix in b1_5 one month ago?  I didn''t see further
objections.

I suggest that we simply convert "int foo()" to "void *
foo()".  In this way,
we do not need configure check and #ifdefs. Isn''t configure test
fragile and
#ifdefs ugly?

For 2.6.13+, we are correct in both API and ABI. For older kernel, we are still
correct in ABI.  The only badness is that it brings a compile warning to old
kernels.
Also we don''t need to worry about removing configure tests and #ifdefs
when old
kernels are gone.
-- 
Qi Yong
System Software Engineer
Cluster File Systems, Inc.

Lustre discuss - Oct 2006 - Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks

[Lustre-discuss] Kernel oops with dangling symlinks