thr3ads.net - Lustre discuss - [Lustre-discuss] open() ENOENT bug [Oct 2008]

Robin Humble

2008-Oct-30 05:35 UTC

[Lustre-discuss] open() ENOENT bug

Hi,

we have a user with simultaneously starting fortran runs that fail
about 10% of the time because Lustre sometimes returns ENOENT instead
of EACCES to an open() request on a read-only file.

you might think this is an odd thing to want to do, but a standard
fortran open(3,file='file',status='old') always does a rw open() before
a ro open(), as well as a pile of other bizarre stuff. the rw open
should always fail with EACCES, and the subsequent ro open succeeds.
however if ENOENT is returned instead of EACCES then fortran exits.
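
The fallback described above can be sketched in C. This is only an illustration of the sequence, not code taken from gfortran or ifort: the name open_old_file is hypothetical, and the sketch assumes the runtime retries read-only only when the read-write open fails with EACCES, so any other errno (such as a spurious ENOENT) is passed straight back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not taken from any Fortran runtime: try a
 * read-write open first and fall back to read-only only when the
 * failure is a permission error.
 * Returns a file descriptor, or -1 with errno set by open(). */
int open_old_file(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd >= 0)
        return fd;                    /* rw open worked */

    if (errno == EACCES)
        return open(path, O_RDONLY);  /* expected case for a chmod 400 file */

    return -1;                        /* e.g. ENOENT: no fallback is attempted */
}
```

With this structure a wrong ENOENT from the first open() is fatal even though the file exists, which matches the failure seen in the fortran runs.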

attached is a minimal C code that triggers the problem by emulating a
subset of gfortran/ifort's open() statement and checking for EACCES.

standard Sun kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp on IB and GigE
appears to have this problem, as does a patched RHEL with 1.6.4.2 on
GigE, and all RHEL and kernel.org patchless kernels we've tested
including with Lustre cvs b1_6 from a couple of days ago. turning off
statahead doesn't change anything.

I couldn't find anything in bz.

steps to reproduce:
  # on a Lustre fs
  gcc -o openFileMinimal openFileMinimal.c
  touch file
  chmod 400 file
  ./openFileMinimal file & ./openFileMinimal file &

the correct result would be no output (all opens fail with EACCES).

typical output is one of the 2 processes getting ENOENT. eg:
   % ./openFileMinimal file & ./openFileMinimal file &
  [1] 4948
  [2] 4949
   % "No such file or directory" on existing file. loop 2/1000
  [2]+  Exit 1                  ./openFileMinimal file
  [1]-  Done                    ./openFileMinimal file
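
Launching the copies by hand can also be automated. The following is a rough sketch, assuming a harness that forks several workers which all run the same stat() + rw open() probe at once; the names probe and run_concurrent_probes are hypothetical and not part of the attached reproducer.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One worker: repeat the stat() + read-write open() probe.
 * Returns 0 if every open failed with EACCES (the expected result),
 *         1 if an open failed with any other errno,
 *         2 if an open unexpectedly succeeded. */
static int probe(const char *path, int loops)
{
    struct stat sb;
    for (int i = 0; i < loops; i++) {
        stat(path, &sb);              /* result is not used */
        int fd = open(path, O_RDWR);
        if (fd >= 0) {
            close(fd);
            return 2;
        }
        if (errno != EACCES)
            return 1;
    }
    return 0;
}

/* Hypothetical harness: fork nprocs workers and return how many of
 * them finished with a non-zero status, or -1 if a fork() failed. */
int run_concurrent_probes(const char *path, int nprocs, int loops)
{
    assert(path != NULL && nprocs > 0);

    int started = 0;
    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(probe(path, loops));  /* child: run the probe and exit */
        started++;
    }

    int bad = 0;
    for (int p = 0; p < started; p++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            bad++;
    }
    return started == nprocs ? bad : -1;
}
```

Called as run_concurrent_probes("file", 2, 1000) by an ordinary user on a chmod 400 file, a return value of 0 would correspond to the "no output" case above, and anything else to at least one worker seeing a wrong result.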

let me know if you want me to create a bz entry.

cheers,
robin
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv) {
   int i, f;
   struct stat b;

   if ( argc < 2 ) {
      printf( "run as an ordinary user to a file without rw access, like:\n" );
      printf( "  %s /lustre/some/root/owned/file\n", argv[0] );
      printf( "... and then run 2 copies and watch for failures\n" );
      exit(0);
   }

   for ( i=0; i<1000; i++ ) {
      stat( argv[1], &b );  // result is not used
      f = open( argv[1], O_RDWR );
      if ( f < 0 ) {
         if ( errno != EACCES ) { // EACCES is correct. other errors aren't
            printf( "\"%s\" on existing file. loop %d/1000\n", strerror(errno), i );
            exit(1);
         }
      }
      else {
         printf( "rw open succeeded - shouldn't happen - this test needs to be "
                 "run on a file with read-only access (eg. chmod 400, not owned "
                 "by you, or on a read-only Lustre fs).\n" );
         exit(2);
      }
   }
   exit(0);
}

Brian J. Murrell

2008-Oct-30 12:28 UTC

On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
> Hi,

Hi.

> we have a user with simultaneously starting fortran runs that fail
> about 10% of the time because Lustre sometimes returns ENOENT instead
> of EACCES to an open() request on a read-only file.

I can reproduce this on 1.6.6 as well using your reproducer.

Can you file a bug in our bugzilla about it?  Please include your
reproducer program.

b.


Peter Kjellstrom

2008-Oct-30 13:05 UTC

On Thursday 30 October 2008, Brian J. Murrell wrote:
> On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
> > Hi,
>
> Hi.
>
> > we have a user with simultaneously starting fortran runs that fail
> > about 10% of the time because Lustre sometimes returns ENOENT instead
> > of EACCES to an open() request on a read-only file.
>
> I can reproduce this on 1.6.6 as well using your reproducer.

We have also seen this bug on our systems (reported by a user running a
Fortran code). We have servers with both 1.4
(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6
(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) Lustre.

The error is seen against both server versions from a cluster with patchless
1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).

However the error is not seen from another cluster running _patched_ 1.6.5.1
on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

/Peter

> Can you file a bug in our bugzilla about it?  Please include your
> reproducer program.
>
> b.

Robin Humble

2008-Oct-31 00:54 UTC

On Thu, Oct 30, 2008 at 08:28:05AM -0400, Brian J. Murrell wrote:
>On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>> we have a user with simultaneously starting fortran runs that fail
>> about 10% of the time because Lustre sometimes returns ENOENT instead
>> of EACCES to an open() request on a read-only file.
>I can reproduce this on 1.6.6 as well using your reproducer.

thanks for looking into it so quickly.

>Can you file a bug in our bugzilla about it?  Please include your
>reproducer program.

https://bugzilla.lustre.org/show_bug.cgi?id=17545

cheers,
robin

Robin Humble

2008-Nov-03 01:30 UTC

On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
>On Thursday 30 October 2008, Brian J. Murrell wrote:
>> On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>> > we have a user with simultaneously starting fortran runs that fail
>> > about 10% of the time because Lustre sometimes returns ENOENT instead
>> > of EACCES to an open() request on a read-only file.
>>
>> I can reproduce this on 1.6.6 as well using your reproducer.
>
>We have also seen this bug on our systems (reported by a user running a 
>Fortran code). We have servers with both 1.4 
>(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6 
>(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.
>
>The error is seen towards both server versions from a cluster with patchless
>1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).
>
>However the error is not seen from another cluster running _patched_ 1.6.5.1
>on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have
thought it'd boot with a RHEL5 userland!? :-) and you are right - my
openFileMinimal test case runs without problem. ie. 2.6.9 seems a lot
more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the
above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always
die with:
  Fortran runtime error: Stale NFS file handle

      program blah
      implicit none
      integer i
      do i=1,1000
      open(3,file='file',status='old')
      close(3)
      enddo
      stop
      end

so although my cut-down C code reproducer doesn't trigger anything, it
seems Lustre still has issues with the real fortran code. the user's
jobs would probably run ok in this RHEL4 environment though as they
don't run 10 copies at once.
it's a slightly different variant of the bug as well (different error
code), or maybe it's just a totally different bug.

cheers,
robin


>
>/Peter
>
>> Can you file a bug in our bugzilla about it?  Please include your
>> reproducer program.
>>
>> b.
