zam@clusterfs.com
2007-Jan-18  05:04 UTC
[Lustre-devel] [Bug 11562] racer-correctness test fails on b1_4_sles10 kernel 2.6.16.21-08-sles10 on x86_64
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11562
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
The problem is reproducible locally in a SL10.1 environment.
The following hot fix (with debugging code) fixes the problem.
However, it has not been tried on buffalo.
diff -u -p -r1.27.40.3.14.3.26.4 symlink.c
--- lustre/llite/symlink.c      16 Nov 2006 19:21:39 -0000      1.27.40.3.14.3.26.4
+++ lustre/llite/symlink.c      18 Jan 2007 10:41:09 -0000
@@ -130,7 +130,7 @@ static LL_FOLLOW_LINK_RETURN_TYPE ll_fol
         struct inode *inode = dentry->d_inode;
         struct ll_inode_info *lli = ll_i2info(inode);
         struct lookup_intent *it = ll_nd2it(nd);
-        struct ptlrpc_request *request;
+        struct ptlrpc_request *request = NULL;
         int rc;
         char *symname;
         ENTRY;
@@ -145,6 +145,17 @@ static LL_FOLLOW_LINK_RETURN_TYPE ll_fol
         }
         CDEBUG(D_VFSTRACE, "VFS Op\n");
+
+        {
+                int dummy = 1;
+                printk("SP x%p lelel = %d\n", &dummy, current->link_count);
+        }
+
+        if (current->link_count > 5) {
+                path_release(nd);
+                GOTO(out, rc = -ELOOP);
+        }
+
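The effect of the depth check in the hot fix can be illustrated outside the kernel. The following is a minimal user-space sketch, not Lustre or VFS code: demo_fs, demo_find, demo_follow and DEMO_MAX_DEPTH are hypothetical names, and the small table only stands in for a directory. It shows how a depth check turns unbounded recursion on a self-referential link into an -ELOOP error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit, mirroring the "link_count > 5" check in the hot fix. */
#define DEMO_MAX_DEPTH 5

/* One directory entry: a NULL target means a regular file, not a symlink. */
struct demo_entry {
        const char *name;
        const char *target;
};

/* Toy namespace (an assumption for this sketch); "foo" points to itself. */
static const struct demo_entry demo_fs[] = {
        { "foo",  "foo"  },
        { "bar",  "data" },
        { "data", NULL   },
};

/* Linear lookup of a name in the toy namespace. */
static const struct demo_entry *demo_find(const char *name)
{
        size_t i;

        for (i = 0; i < sizeof(demo_fs) / sizeof(demo_fs[0]); i++)
                if (strcmp(demo_fs[i].name, name) == 0)
                        return &demo_fs[i];
        return NULL;
}

/* Resolve name; link_count is the number of symlinks followed so far. */
int demo_follow(const char *name, int link_count)
{
        const struct demo_entry *e = demo_find(name);

        if (e == NULL)
                return -ENOENT;
        if (e->target == NULL)
                return 0;               /* reached a regular file */
        if (link_count + 1 > DEMO_MAX_DEPTH)
                return -ELOOP;          /* stop instead of recursing further */
        return demo_follow(e->target, link_count + 1);
}
```

Calling demo_follow("foo", 0) stops after a bounded number of recursive calls and reports -ELOOP, while an ordinary link such as "bar" resolves normally.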
A simpler test was found that causes a stack overflow in a Lustre client
without the hot fix:
       $ ln -sf foo foo
       $ ls foo
The debugging code above gives information about how stack usage
grows with each ll_follow_link call:
SP 0xffff810001defb14 lelel = 1
SP 0xffff810001def8c4 lelel = 2
SP 0xffff810001def674 lelel = 3
SP 0xffff810001def424 lelel = 4
SP 0xffff810001def1d4 lelel = 5
SP 0xffff810001deef84 lelel = 6
This means the following functions together consume 592 bytes of stack per nesting level:
link_path_walk
__link_path_walk
do_follow_link
__do_follow_link
__vfs_follow_link
(link_path_walk again)  
In particular, link_path_walk takes 200 bytes and __link_path_walk takes 280
(from the checkstack.pl report).
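The 592-byte figure can be cross-checked from the SP values printed by the debugging code. The following is a small sketch that subtracts consecutive samples; the helper names sp_delta and sp_deltas_all_equal are made up for this illustration, and the addresses are copied from the debug output.

```c
#include <stddef.h>
#include <stdint.h>

/* SP values copied from the debug output, nesting levels 1 to 6. */
static const uint64_t sp_samples[] = {
        0xffff810001defb14ULL,
        0xffff810001def8c4ULL,
        0xffff810001def674ULL,
        0xffff810001def424ULL,
        0xffff810001def1d4ULL,
        0xffff810001deef84ULL,
};

#define SP_SAMPLE_COUNT (sizeof(sp_samples) / sizeof(sp_samples[0]))

/* Stack bytes consumed between sample i and sample i + 1 (stack grows down). */
uint64_t sp_delta(size_t i)
{
        return sp_samples[i] - sp_samples[i + 1];
}

/* Returns 1 if every consecutive pair of samples differs by 'expected' bytes. */
int sp_deltas_all_equal(uint64_t expected)
{
        size_t i;

        for (i = 0; i + 1 < SP_SAMPLE_COUNT; i++)
                if (sp_delta(i) != expected)
                        return 0;
        return 1;
}
```

Each consecutive pair differs by 0x250, that is 592 bytes, so the per-level cost is constant across all six samples.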
For comparison,
the same functions in the same kernel for the i386 arch take:
__link_path_walk:                    280
link_path_walk:                      200
Stack usage report for the newer kernel 2.6.20-rc5 on x86_64:
link_path_walk [vmlinux]:            152
__link_path_walk [vmlinux]:          104
Stack usage report for 2.6.9-rhel4 on x86_64:
0xffffffff80184f6f link_path_walk:                      192
0xffffffff80183f80 __link_path_walk:                    136
zam@clusterfs.com
2007-Jan-18  11:12 UTC
[Lustre-devel] [Bug 11562] racer-correctness test fails on b1_4_sles10 kernel 2.6.16.21-08-sles10 on x86_64
Created an attachment (id=9371)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9371&action=view)
reduce stack usage in __vfs_follow_link
__vfs_follow_link mistakenly pushed the whole intent structure onto the stack,
which increased stack usage by 72 bytes (on the x86_64 arch). As a result, the
recursive symlink lookup crashed before the MAX_NESTED_LINKS limit was reached.
The fix gets rid of the temporary stack-allocated variable.
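The kind of change described here can be sketched in plain C. This is an illustration only, not the contents of attachment 9371: struct demo_intent, its 72-byte size (chosen to match the figure quoted) and both functions are hypothetical. The first variant makes a temporary copy of the structure in its own stack frame; the second works through the caller's pointer and needs no copy.

```c
#include <stdint.h>

/* Hypothetical stand-in for the intent structure: 9 * 8 = 72 bytes. */
struct demo_intent {
        uint64_t words[9];
};

/*
 * Variant with a temporary: 'tmp' adds sizeof(struct demo_intent) bytes
 * to this frame, unless the compiler manages to optimise the copy away.
 */
uint64_t demo_sum_with_copy(const struct demo_intent *it)
{
        struct demo_intent tmp = *it;   /* whole structure copied to the stack */
        uint64_t sum = 0;
        int i;

        for (i = 0; i < 9; i++)
                sum += tmp.words[i];
        return sum;
}

/* Variant without the temporary: only the pointer is used. */
uint64_t demo_sum_in_place(const struct demo_intent *it)
{
        uint64_t sum = 0;
        int i;

        for (i = 0; i < 9; i++)
                sum += it->words[i];
        return sum;
}
```

Both variants return the same result. In a recursive call chain, any per-frame saving of this kind is multiplied by the nesting depth.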