I am having jobs on a cluster client crash. The job creates a small
text file (using cp) and then immediately tries to use it with another
application. The application fails saying the file doesn't exist.

In the client /var/log/messages, I'm seeing

Sep  4 15:58:17 clus039 kernel: LustreError: 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75792903

which, I'm led to believe, is never meant to occur :)

Any ideas?

# uname -a
Linux clus039 2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1 SMP Wed May 27 19:06:26 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Sep  4 15:54:06 clus039 kernel: LustreError: 14982:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75793756
Sep  4 15:58:17 clus039 kernel: LustreError: 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75792903

Nothing on the MDS or OSS.

--
Dr Stuart Midgley
sdm900 at gmail.com
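For context, the failing sequence boils down to roughly the following sketch; the paths, file name, and application are hypothetical placeholders, and failure -2 corresponds to -ENOENT ("No such file or directory"):

    # Hypothetical reconstruction of the failing job step; all names are placeholders.
    cp template.txt /lustre/work/job/input.txt    # create a small text file on Lustre
    myapp /lustre/work/job/input.txt              # immediately consume it; intermittently
                                                  # fails with "file doesn't exist" (ENOENT)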
Evening

The file was created on the same node it was accessed from.

The error isn't permanent. When the job crashed, I went and started
investigating, and the file was fine.

No, the file is never unlinked.

How do I go about getting a lustre log?

--
Dr Stuart Midgley
sdm900 at gmail.com


On 04/09/2009, at 11:28 PM, Oleg Drokin wrote:

> Hello!
>
> On Sep 4, 2009, at 5:35 AM, Stu Midgley wrote:
>
>> I am having jobs on a cluster client crash. The job creates a small
>> text file (using cp) and then immediately tries to use it with
>> another application. The application fails saying the file doesn't
>> exist.
>
> That's quite strange for such a sequence of actions.
> Is the file created on one node and accessed on another?
> How permanent is the error? (i.e. does it still happen when you
> later access the file again?)
> Is the file unlinked at any time? Could there be a race with unlink,
> by any chance?
>
>> In the client /var/log/messages, I'm seeing
>> Sep  4 15:58:17 clus039 kernel: LustreError:
>> 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode
>> 75792903
>
> There is bug 16377 about this same message, though it is not clear
> what happened there.
> Perhaps you can gather -1 lustre logs from the MDS, from the client
> that creates the file, and from the client that accesses it and gets
> the error, and attach those to bug 16377?
>
> Bye,
>    Oleg
I'm sorry Oleg, but I suspect I will never be able to run this test.

* I don't have a reproducer. At the time I had this problem, I
  started about 200 jobs simultaneously and about 50 failed with this
  problem. I reran those jobs and they worked just fine.

* I will never get a chance to make the FS quiet. We have way too
  much production work on.

If I do get time to fiddle about and reproduce this problem, I'll
create a bug.

--
Dr Stuart Midgley
sdm900 at gmail.com


On 04/09/2009, at 11:46 PM, Oleg Drokin wrote:

> Hello!
>
> On Sep 4, 2009, at 11:31 AM, Stuart Midgley wrote:
>
>> The file was created on the same node it was accessed from.
>
> Hm, interesting.
>
>> The error isn't permanent. When the job crashed, I went and
>> started investigating and the file was fine.
>
> I think I remember a bug like this that shadow(@sun.com) worked on.
> Turned out it is bug 17545, which has somewhat different symptoms,
> though.
>
>> No, the file is never unlinked.
>> How do I go about getting a lustre log?
>
> Make the system (MDS-wise) as idle as possible (ideally only the
> node with problems should be doing anything on lustre).
> On the MDS and a client, do a cat /proc/sys/lnet/debug and remember
> the value.
> echo -1 > /proc/sys/lnet/debug on both the MDS and the client.
> lctl dk > /dev/null
> Run your reproducer, and immediately after the error happens do
> lctl dk > /tmp/lustre.log on both the MDS and client nodes.
> Then restore the /proc/sys/lnet/debug values on the nodes back to
> what they were.
>
> Thanks.
>
> Bye,
>    Oleg
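For anyone following along, Oleg's log-gathering steps amount to roughly the following, run on both the MDS and the client (a sketch; /tmp/lustre.log is just the example path from his mail):

    # Sketch of the debug-log procedure described above.
    old=$(cat /proc/sys/lnet/debug)     # remember the current debug mask
    echo -1 > /proc/sys/lnet/debug      # enable full (-1) debug logging
    lctl dk > /dev/null                 # flush the existing debug buffer
    # ... run the reproducer; immediately after the error occurs:
    lctl dk > /tmp/lustre.log           # dump the debug log to attach to bug 16377
    echo "$old" > /proc/sys/lnet/debug  # restore the original debug mask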
Further to my previous information (a colleague prompted me to add
this): the file was being created in a new directory, and the parent
of that directory would have had a few hundred directories created in
it "simultaneously". That is, the first thing my job does on startup
is create a temporary working directory, and then create this
temporary working file within that directory. A sketch of this
startup pattern follows at the end of this mail.

On Fri, Sep 4, 2009 at 11:31 PM, Stuart Midgley <sdm900 at gmail.com> wrote:

> Evening
>
> The file was created on the same node it was accessed from.
>
> The error isn't permanent. When the job crashed, I went and started
> investigating, and the file was fine.
>
> No, the file is never unlinked.
>
> How do I go about getting a lustre log?
>
>
> --
> Dr Stuart Midgley
> sdm900 at gmail.com
>
>
> On 04/09/2009, at 11:28 PM, Oleg Drokin wrote:
>
>> Hello!
>>
>> On Sep 4, 2009, at 5:35 AM, Stu Midgley wrote:
>>
>>> I am having jobs on a cluster client crash. The job creates a small
>>> text file (using cp) and then immediately tries to use it with
>>> another application. The application fails saying the file doesn't
>>> exist.
>>
>> That's quite strange for such a sequence of actions.
>> Is the file created on one node and accessed on another?
>> How permanent is the error? (i.e. does it still happen when you
>> later access the file again?)
>> Is the file unlinked at any time? Could there be a race with unlink,
>> by any chance?
>>
>>> In the client /var/log/messages, I'm seeing
>>> Sep  4 15:58:17 clus039 kernel: LustreError:
>>> 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode
>>> 75792903
>>
>> There is bug 16377 about this same message, though it is not clear
>> what happened there.
>> Perhaps you can gather -1 lustre logs from the MDS, from the client
>> that creates the file, and from the client that accesses it and gets
>> the error, and attach those to bug 16377?
>>
>> Bye,
>>    Oleg

--
Dr Stuart Midgley
sdm900 at gmail.com
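To make the startup pattern concrete, here is a minimal sketch of the job prologue described above, with hypothetical paths and names; a few hundred copies of this ran at once, all creating directories under the same parent:

    # Hypothetical job prologue; the parent directory and all names are placeholders.
    # About 200 jobs executed this concurrently against the same parent directory.
    mkdir /lustre/work/runs/job.$$                      # new temporary working directory
    cp template.txt /lustre/work/runs/job.$$/input.txt  # create the small working file
    myapp /lustre/work/runs/job.$$/input.txt            # ~50 of 200 jobs failed here with ENOENT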