Hi all,

We're having an odd problem with a SNFS (cvfs) filesystem on one of our systems, call it X. If you don't know, SNFS is a shared filesystem that uses separate metadata servers. Since we have so many systems that access the same filesystem, we have a system identical to the one with the problem, for redundancy purposes. The control system, call it Y, doesn't have this problem.

Here's the gist, and I'm simplifying this down to the essentials. Say the filesystem is mounted as /foo/bar. Under that is a directory duh and in that is a file f, so /foo/bar/duh/f. On system Y, that file is visible and accessible, no problems. On system X, doing a "/bin/ls /foo/bar/duh" (or "cd /foo/bar/duh; /bin/ls") lists file f but any command that tries to access the file (e.g. via a stat, open, etc. system call) fails saying file not found.

I've been using dtrace to try to determine what is happening in the kernel during one of these system calls and, specifically, I've been looking at the lookup*at kernel functions, but so far I've not been able to find much. At this point I'm somewhat stumped as to how to approach the problem further.

While I'm not looking for a script from anyone, I would appreciate any advice on how to figure out why the kernel (snfs/cvfs driver?) is not able to access the file from system X. Remember that I can use system Y as a control system.

Thanks,
Justin

[...]
> On system X, doing a "/bin/ls /foo/bar/duh" (or "cd /foo/bar/duh;
> /bin/ls") lists file f but any command that tries to access the file
> (e.g. via a stat, open, etc. system call) fails saying file not found.
[...]
> While I'm not looking for a script from anyone, I would appreciate any
> advice on how to figure out why the kernel (snfs/cvfs driver?) is not
> able to access the file from system X. Remember that I can use system
> Y as a control system.

Well, without any filesystem knowledge, I would start by looking at one specific syscall, say 'stat'. Then I would find out exactly which syscall it is:

$ dtrace -n 'syscall::stat*:entry{trace(copyinstr(arg0))}'

Then I would record every function executed during the syscall processing, along with the function return values (let's say it's stat64, and you are doing 'ls -l /foo/bar/duh/f'):

$ dtrace -x flowindent -n 'syscall::stat64:entry/copyinstr(arg0)=="/foo/bar/duh/f"/{self->go=1} fbt:::entry/self->go/{} fbt:::return/self->go/{trace(arg1)} syscall::stat64:return/self->go/{self->go=0; exit(0)}'

Then compare one run where the syscall succeeded and one where it failed.

--
Vlad
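
For repeated runs on both systems, the one-liner above may be easier to handle as a script file. The following is only a sketch of that idea: the helper name gen_trace_script is made up for illustration, and the syscall name and path are the example values from the message above, not something confirmed for the real system.

```shell
#!/bin/sh
# gen_trace_script SYSCALL PATH
# Hypothetical helper: prints a D script equivalent to the flowindent
# one-liner above, with the syscall name and file path filled in.
gen_trace_script() {
    sc=$1
    path=$2
    cat <<EOF
#pragma D option flowindent

/* Start tracing when this thread calls the syscall on the target path. */
syscall::${sc}:entry
/copyinstr(arg0) == "${path}"/
{
        self->go = 1;
}

/* Empty action: flowindent prints the name of each kernel function entered. */
fbt:::entry
/self->go/
{
}

/* For fbt return probes, arg1 holds the function's return value. */
fbt:::return
/self->go/
{
        trace(arg1);
}

/* Stop once the traced syscall returns. */
syscall::${sc}:return
/self->go/
{
        self->go = 0;
        exit(0);
}
EOF
}

# Example values taken from the message above.
gen_trace_script stat64 /foo/bar/duh/f
```

Redirect the output to a file (for example trace_stat.d), start it as root with "dtrace -s trace_stat.d -o trace.out" on each system, and run the failing command from another shell. Using the same generated script on X and Y keeps the two traces comparable.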

Vlad,

This is exactly what I'd been doing, though I was getting stumped at what to look for in the fbt trace of lstat64(). However, once I did an sdiff between that trace and one from a system that doesn't have the problem, it pointed to this being an SNFS issue. Specifically, the first difference in the syscall traces was in what seemed to be an SNFS name cache search function.

This has been helpful in dealing with Quantum (who owns SNFS) and now we're thinking, based on other testing as well, that it may be a known bug in SNFS that on rare occasions causes directory corruption.

So at the very least, you reassured me that I was on the right track in analyzing the problem.

Thanks,
Justin

-----Original Message-----
From: dtrace-discuss-bounces at opensolaris.org [mailto:dtrace-discuss-bounces at opensolaris.org] On Behalf Of Vladimir Marek
Sent: Wednesday, June 25, 2008 12:16 AM
To: dtrace-discuss at opensolaris.org
Subject: Re: [dtrace-discuss] Shared filesystem weirdness

[...]
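
The comparison step described above (finding the first place where the trace from the good system and the trace from the bad system diverge) could be sketched roughly as follows. This is an assumption-laden illustration, not the procedure actually used in this thread: the function names normalize and first_divergence are invented, and the two sample traces are made-up data with placeholder kernel function names, shaped like flowindent output.

```shell
#!/bin/sh
# Reduce a flowindent trace to "direction function" pairs, dropping the
# CPU column and the traced return values (often pointers that differ
# between hosts), so that only the call sequence is compared.
normalize() {
    awk '$2 == "->" || $2 == "<-" || $2 == "=>" || $2 == "<=" { print $2, $3 }' "$1"
}

# Print the first position at which the two call sequences differ.
# Prints nothing if the sequences are identical.
first_divergence() {
    good=$(mktemp)
    bad=$(mktemp)
    normalize "$1" > "$good"
    normalize "$2" > "$bad"
    paste -d '|' "$good" "$bad" |
        awk -F'|' '$1 != $2 { printf "line %d: good=[%s] bad=[%s]\n", NR, $1, $2; exit }'
    rm -f "$good" "$bad"
}

# Made-up sample traces; the function names are placeholders only.
dir=$(mktemp -d)
cat > "$dir/trace.Y" <<'EOF'
CPU FUNCTION
  0  => lstat64
  0    -> lookup_step_a
  0    <- lookup_step_a                 0
  0    -> lookup_step_b
  0    <- lookup_step_b                 0
  0  <= lstat64
EOF
cat > "$dir/trace.X" <<'EOF'
CPU FUNCTION
  1  => lstat64
  1    -> lookup_step_a
  1    <- lookup_step_a                 0
  1    -> cache_search
  1    <- cache_search                  2
  1  <= lstat64
EOF

first_divergence "$dir/trace.Y" "$dir/trace.X"
```

On the sample data this reports the fourth call-sequence entry as the first difference. Once the diverging function is known, the full traces (including return values) around that point are the place to look, as with the sdiff output mentioned above.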