starlight.2012q2 at binnacle.cx
2012-May-09 17:54 UTC
[Samba] "file not found" under high-contention
Hello, For several years I've been experiencing an intermittent Samba error when running a very intense, highly parallel build/compile jobset. A file is reported as "not found" even though it most certainly exists and re-running the compile jobset always succeeds. Samba version is 3.6.4 running on CentOS 5.8 with 64-bit kernel 2.6.18-308.4.1.el5. Windows side is 64-bit Window Server 2008 (NT 6.1) with latest updates. Used to see same problem with W2K3 64-bit and CentOS 4 on similar hardware. Windows machine is attached via Infiniband directly to CentOS machine where the files are hosted. Other Linux systems access the compile directory with NFSv3 over gigabit ethernet. "kernel oplocks = no" is set due to troublesome behavior where open files have their modify time temporarily set to the present as seen from NFSv3. This causes 'make' to rebuild objects unnecessarily. Since the compile jobs never attempt to write the same file turning off kernel oplocks appears to have no downsides. The new version of Samba performs better, but this is now causing the failure to happen more often, to the point where it's becoming rather annoying. Have been hoping someone else would figure this out and fix it, but waiting four years hasn't done it. I hate figuring out these sorts of problem, but am becoming resigned to attempting it. Can anyone suggest an efficient approach for narrowing down and identifying where the problem is? I don't see writing a test case as possible with the information available at present. Thanks
On Wed, May 09, 2012 at 01:54:06PM -0400, starlight.2012q2 at binnacle.cx wrote:> > I hate figuring out these sorts of problem, > but am becoming resigned to attempting it. > Can anyone suggest an efficient approach > for narrowing down and identifying where > the problem is? I don't see writing a > test case as possible with the information > available at present.I'm afraid you're going to have to track this one down yourself, or at least get it to a point wher we have a smbd log, trace file or reproducible test case. Start by figuring out the minimal setup required to reproduce, and then try running with debug level 10 logs and look for the NT_STATUS_NOT_FOUND errors. Sorry, but there isn't much else to go on in your original message. Jeremy.
starlight.2012q2 at binnacle.cx
2012-May-09 23:00 UTC
[Samba] "file not found" under high-contention
Well sometimes it's easy too. . . Turns out that Samba is delivering inodes that are different when the underlying file is the same. Have some hard-links created in the build, and the script logic that suppresses 'ln -f' when they're already correctly linked was failing on the Windows side. This lead to brief instants when the files really were not present. Fixed by suppressing the re-hard-link when 'uname -s' starts with CYGWIN_NT-. I suppose there is some good reason why the Samba inode values don't reflect any semblance to the Linux reality, but it's not helpful. At 01:42 PM 5/9/2012 -0700, Jeremy Allison wrote:>Well I'm sorry, but you have to understand that >your original report is no better than "my car >won't start - what's wrong ?". > >Not trying to be difficult but we need more to >go on than you gave us. >
starlight.2012q2 at binnacle.cx
2012-May-11 14:08 UTC
[Samba] "file not found" under high-contention
On Wed, May 09, 2012 at 07:00:25PM -0400, starlight.2012q2 at binnacle.cx wrote: How are you getting the inode numbers from Samba ? Should mention that the information from CYGWIN 'ls -i' allows one to verify if two Windows-native NTFS files are the same file with different hard-linked pathnames.
starlight.2012q2 at binnacle.cx
2012-May-11 15:00 UTC
[Samba] "file not found" under high-contention
Here' it is from an old-version MKS Tookit POV: $ mksinfo Toolkit for Developers Release 9.0p3 Install date: 4/5/2006 10:34:43 $ ls -1i s?/inc/df_types.h 3 se/inc/df_types.h 2079 sp/inc/df_types.h 2079 sr/inc/df_types.h $ stat -l s?/inc/df_types.h se/inc/df_types.h: nlink: 1 sp/inc/df_types.h: nlink: 3 sr/inc/df_types.h: nlink: 3 So somewhat better, but still not 100% correct. Seems like some bad interaction is happening between CYGWIN and Samba. also $ stat -A s?/inc/df_types.h se/inc/df_types.h: atime: 1336747408 se/inc/df_types.h: ctime: 1291836195 se/inc/df_types.h: nlink: 1 se/inc/df_types.h: mode: 0x100777 se/inc/df_types.h: mtime: 1291836195 se/inc/df_types.h: smode: -rwxrwxrwx se/inc/df_types.h: size: 8546 sp/inc/df_types.h: atime: 1336747408 sp/inc/df_types.h: ctime: 1291836195 sp/inc/df_types.h: nlink: 3 sp/inc/df_types.h: mode: 0x100777 sp/inc/df_types.h: mtime: 1291836195 sp/inc/df_types.h: smode: -rwxrwxrwx sp/inc/df_types.h: size: 8546 sr/inc/df_types.h: atime: 1336747408 sr/inc/df_types.h: ctime: 1291836195 sr/inc/df_types.h: nlink: 3 sr/inc/df_types.h: mode: 0x100777 sr/inc/df_types.h: mtime: 1291836195 sr/inc/df_types.h: smode: -rwxrwxrwx sr/inc/df_types.h: size: 8546
starlight.2012q2 at binnacle.cx
2012-May-11 15:21 UTC
[Samba] "file not found" under high-contention
At 07:49 AM 5/11/2012 -0700, Jeremy Allison wrote:>On Fri, May 11, 2012 at 10:39:57AM -0400, >Need to see the over the wire trace to really >know I'm afraid.Ran 'procmon'. Here is a trace of the Samba scenario: Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryStandardInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000141 AllocationSize: 1,048,576 EndOfFile: 8,546 NumberOfLinks: 1 DeletePending: False Directory: False Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryFileInternalInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000097 IndexNumber: 0x81f Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryBasicInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000124 CreationTime: 12/8/2010 15:23:15 LastAccessTime: 5/11/2012 11:11:36 LastWriteTime: 12/8/2010 15:23:15 ChangeTime: 12/8/2010 15:23:15 FileAttributes: N Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryStandardInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000180 AllocationSize: 1,048,576 EndOfFile: 8,546 NumberOfLinks: 1 DeletePending: False Directory: False Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryFileInternalInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000083 IndexNumber: 0x81f Date & Time: 5/11/2012 11:11:36 Event Class: File System Operation: QueryNameInformationFile Result: SUCCESS Path: \\siobhan\d\df\sr\inc\df_types.h TID: 5608 Duration: 0.0000078 Name: \siobhan\d\df\sr\inc\df_types.h =================== And here is the trace of the good, NTFS scenario: Date & Time: 5/11/2012 11:18:46 Event Class: File System Operation: QueryFileInternalInformationFile Result: SUCCESS Path: C:\flume\env\kf_i_errlog TID: 6804 Duration: 0.0000024 IndexNumber: 0x1400000001557f Date & Time: 5/11/2012 11:18:46 Event Class: File System Operation: QueryBasicInformationFile Result: SUCCESS Path: C:\flume\env\kf_i_errlog TID: 6804 Duration: 0.0000020 CreationTime: 9/27/2010 21:41:52 LastAccessTime: 9/27/2010 21:41:52 LastWriteTime: 8/15/2001 13:41:14 ChangeTime: 5/11/2012 11:18:35 FileAttributes: A Date & Time: 5/11/2012 11:18:46 Event Class: File System Operation: QueryStandardInformationFile Result: SUCCESS Path: C:\flume\env\kf_i_errlog TID: 6804 Duration: 0.0000022 AllocationSize: 0 EndOfFile: 0 NumberOfLinks: 2 DeletePending: False Directory: False Date & Time: 5/11/2012 11:18:46 Event Class: File System Operation: QueryFileInternalInformationFile Result: SUCCESS Path: C:\flume\env\kf_i_errlog TID: 6804 Duration: 0.0000017 IndexNumber: 0x1400000001557f Date & Time: 5/11/2012 11:18:46 Event Class: File System Operation: QueryNameInformationFile Result: SUCCESS Path: C:\flume\env\kf_i_errlog TID: 6804 Duration: 0.0000023 Name: \flume\env\kf_i_errlog
starlight.2012q2 at binnacle.cx
2012-May-11 15:31 UTC
[Samba] "file not found" under high-contention
So it looks like CYGWIN is deciding to invoke some oddball i-node fabrication when a file is located on a CIFS share and actually use the NTFS value when it's NTFS. Maybe the logic is designed to support the FAT32 scenario and they never accounted for CIFS. The 'IndexNumber: 0x81f' valued returned by Samba is the actual i-node value. I might post a report to CYGWIN on this and see what they say. Thanks for the help!