Hi,

we have a user with simultaneously starting fortran runs that fail
about 10% of the time because Lustre sometimes returns ENOENT instead
of EACCES to an open() request on a read-only file.

you might think this is an odd thing to want to do, but a standard
fortran
    open(3,file='file',status='old')
always does a rw open() before a ro open(), as well as a pile of other
bizarre stuff. the rw open should always fail with EACCES, and the
subsequent ro open succeeds. however if ENOENT is returned instead of
EACCES then fortran exits.

attached is a minimal C code that triggers the problem by emulating a
subset of gfortran/ifort's open() statement and checking for EACCES.

standard Sun kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp on IB and GigE
appears to have this problem, as does a patched RHEL with 1.6.4.2 on
GigE, and all RHEL and kernel.org patchless kernels we've tested
including with Lustre cvs b1_6 from a couple of days ago.

turning off statahead doesn't change anything.
I couldn't find anything in bz.

steps to reproduce:

    # on a Lustre fs
    gcc -o openFileMinimal openFileMinimal.c
    touch file
    chmod 400 file
    ./openFileMinimal file & ./openFileMinimal file &

the correct result would be no output (all opens fail with EACCES).
typical output is one of the 2 processes getting ENOENT. eg:

    % ./openFileMinimal file & ./openFileMinimal file &
    [1] 4948
    [2] 4949
    % "No such file or directory" on existing file. loop 2/1000

    [2]+  Exit 1                  ./openFileMinimal file
    [1]-  Done                    ./openFileMinimal file

let me know if you want me to create a bz entry.

cheers,
robin
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv)
{
    int i, f;
    struct stat b;

    if ( argc < 2 ) {
        printf( "run as an ordinary user to a file without rw access, like:\n" );
        printf( "   %s /lustre/some/root/owned/file\n", argv[0] );
        printf( "... and then run 2 copies and watch for failures\n" );
        exit(0);
    }

    for ( i=0; i<1000; i++ ) {
        stat( argv[1], &b );
        f = open( argv[1], O_RDWR );
        if ( f < 0 ) {
            if ( errno != EACCES ) {  // EACCES is correct. other errors aren't
                printf( "\"%s\" on existing file. loop %d/1000\n", strerror(errno), i );
                exit(1);
            }
        }
        else {
            printf( "rw open succeeded - shouldn't happen - this test needs to be "
                    "run on a file with read-only access (eg. chmod 400, not owned "
                    "by you, or on a read-only Lustre fs).\n" );
            exit(2);
        }
    }

    exit(0);
}
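
For readers who want to see why a wrong errno is fatal here, the following is a minimal sketch of the rw-then-ro fallback that the message above attributes to the fortran runtime. It is not taken from gfortran or ifort; the names open_old_rw_then_ro and count_open_failures are hypothetical, and the assumption that only EACCES triggers the read-only retry comes from the description in this thread rather than from either runtime's source.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption, not runtime code): open an existing
 * file the way the thread describes status='old' behaving.
 * Returns a file descriptor, or -1 with errno set.
 */
int open_old_rw_then_ro(const char *path)
{
    int fd;

    assert(path != NULL);

    /* Step 1: try read-write first. */
    fd = open(path, O_RDWR);
    if (fd >= 0)
        return fd;

    /* Step 2: only a permission failure leads to the read-only retry.
     * Any other errno (for example a spurious ENOENT) is passed back
     * to the caller, which treats it as fatal. */
    if (errno != EACCES)
        return -1;

    /* Step 3: fall back to read-only. */
    return open(path, O_RDONLY);
}

/* Hypothetical driver: repeat the open and count how often it fails. */
int count_open_failures(const char *path, int iterations)
{
    int i, failures = 0;

    for (i = 0; i < iterations; i++) {
        int fd = open_old_rw_then_ro(path);
        if (fd < 0)
            failures++;
        else
            close(fd);
    }
    return failures;
}
```

On a file that exists and is readable, count_open_failures() should return 0 no matter how many copies run at once, so any non-zero count from concurrent runs would point at the filesystem returning something other than EACCES for the rw open.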

On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
> Hi,

Hi.

> we have a user with simultaneously starting fortran runs that fail
> about 10% of the time because Lustre sometimes returns ENOENT instead
> of EACCES to an open() request on a read-only file.

I can reproduce this on 1.6.6 as well using your reproducer.

Can you file a bug in our bugzilla about it? Please include your
reproducer program.

b.

On Thursday 30 October 2008, Brian J. Murrell wrote:
> On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
> > Hi,
>
> Hi.
>
> > we have a user with simultaneously starting fortran runs that fail
> > about 10% of the time because Lustre sometimes returns ENOENT instead
> > of EACCES to an open() request on a read-only file.
>
> I can reproduce this on 1.6.6 as well using your reproducer.

We have also seen this bug on our systems (reported by a user running a
Fortran code). We have servers with both 1.4
(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6
(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.

The error is seen towards both server versions from a cluster with patchless
1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).

However the error is not seen from another cluster running _patched_ 1.6.5.1
on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

/Peter

> Can you file a bug in our bugzilla about it? Please include your
> reproducer program.
>
> b.

On Thu, Oct 30, 2008 at 08:28:05AM -0400, Brian J. Murrell wrote:
>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>> we have a user with simultaneously starting fortran runs that fail
>> about 10% of the time because Lustre sometimes returns ENOENT instead
>> of EACCES to an open() request on a read-only file.
>I can reproduce this on 1.6.6 as well using your reproducer.

thanks for looking into it so quickly.

>Can you file a bug in our bugzilla about it? Please include your
>reproducer program.

https://bugzilla.lustre.org/show_bug.cgi?id=17545

cheers,
robin

On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
>On Thursday 30 October 2008, Brian J. Murrell wrote:
>> On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>> > we have a user with simultaneously starting fortran runs that fail
>> > about 10% of the time because Lustre sometimes returns ENOENT instead
>> > of EACCES to an open() request on a read-only file.
>>
>> I can reproduce this on 1.6.6 as well using your reproducer.
>
>We have also seen this bug on our systems (reported by a user running a
>Fortran code). We have servers with both 1.4
>(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6
>(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.
>
>The error is seen towards both server versions from a cluster with patchless
>1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).
>
>However the error is not seen from another cluster running _patched_ 1.6.5.1
>on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have
thought it'd boot with a RHEL5 userland!? :-) and you are right - my
openFileMinimal test case runs without problem. ie. 2.6.9 seems a lot
more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the
above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always
die with:
  Fortran runtime error: Stale NFS file handle

        program blah
        implicit none
        integer i
        do i=1,1000
           open(3,file='file',status='old')
           close(3)
        enddo
        stop
        end

so although my cut-down C code reproducer doesn't trigger anything, it
seems Lustre still has issues with the real fortran code.

the user's jobs would probably run ok in this RHEL4 environment though,
as they don't run 10 copies at once.

it's a slightly different variant of the bug as well (different error
code), or maybe it's just a totally different bug.

cheers,
robin

>
>/Peter
>
>> Can you file a bug in our bugzilla about it? Please include your
>> reproducer program.
>>
>> b.