Hi Lustre users,

We currently have a small problem with jobs running on our cluster and using Lustre. Sometimes we get these errors:

forrtl: No such file or directory
forrtl: severe (29): file not found, unit 213, file ?@/suivi.d000

This does not only happen with forrtl but also sometimes with other files. The job tries to access a file located at ?@/suivi.d000. We have also had errors where it tried to access files as if they were at the root of the filesystem, in this example /suivi.d000.

It looks as if the PWD environment variable were being lost or corrupted.

The funny thing is that when we run the same job again, it works perfectly. We have not managed to reproduce the errors, but they still happen from time to time.

I did not find any Lustre errors in my logs related to these problems.

We're using Lustre 1.8.5 on SLES 11 SP1 nodes, with SLES 10 on the OSS and MDS.

Do you have any clue?

Thanks,

Jay N.
Sebastien Piechurski
2011-Jun-16 13:30 UTC
[Lustre-discuss] Path lost when accessing files
Hi,

This problem is documented in bug 23978 (http://bugzilla.lustre.org/show_bug.cgi?id=23978).

To summarize: the Fortran runtime calls getcwd() to get the full path to a file that was given as a relative path. Lustre sometimes fails to answer this syscall, which then returns an error code and leaves the buffer uninitialized, BUT the Fortran runtime does not test the getcwd() return code and uses the buffer as-is.

The uninitialized buffer is what you see as " @", followed by the relative path.

A patch is currently under inspection.

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of
> styr at free.fr
> Sent: Thursday, 16 June 2011 12:17
> To: lustre-discuss
> Subject: [Lustre-discuss] Path lost when accessing files
On Thursday, June 16, 2011 03:30:38 PM Sebastien Piechurski wrote:
> Hi,
>
> This problem is documented in bug 23978
> (http://bugzilla.lustre.org/show_bug.cgi?id=23978). To summarize: the
> fortran runtime is making a call to getcwd() to get the full path to a
> file which was given as a relative path. Lustre sometimes fail to answer
> to this syscall, which returns a non initialized buffer and an error code,
> BUT the fortran runtime does not test the getcwd() return code, and uses
> the buffer as-is.
>
> The uninitialized buffer is what you see as " @", followed by the relative
> path.
>
> A patch is currently inspected.

Perfectly summarized. I'll just add two things:

1) The patch didn't help :-(

2) There are two workarounds listed in the bz: patch the kernel to retry the getcwd, or build and use an LD_PRELOAD wrapper that retries the getcwd.

/Peter
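For readers who have not built an LD_PRELOAD wrapper before, the second workaround could look roughly like the sketch below. This is not the wrapper attached to the bz, whose details are not given in this thread. It assumes Linux with glibc, asks the kernel directly through syscall(SYS_getcwd, ...), and retries on any failure other than ERANGE. The retry count and the 1 ms pause are arbitrary assumptions, as is the helper name getcwd_retry().

```c
#define _GNU_SOURCE /* for syscall() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define GETCWD_MAX_TRIES 10 /* assumed retry budget, tune as needed */

/* Hypothetical helper: call the getcwd syscall up to max_tries times. */
static long getcwd_retry(char *buf, size_t size, int max_tries)
{
    const struct timespec pause = { 0, 1000000L }; /* 1 ms between tries */
    long rc = -1;

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        rc = syscall(SYS_getcwd, buf, size);
        /* Success, or a too-small buffer that retrying cannot fix. */
        if (rc >= 0 || errno == ERANGE)
            break;
        if (i + 1 < max_tries) {
            int saved = errno;
            nanosleep(&pause, NULL);
            errno = saved; /* keep the syscall's errno for the caller */
        }
    }
    return rc;
}

/* Exported symbol: when preloaded, it is found before libc's getcwd(). */
char *getcwd(char *buf, size_t size)
{
    char *out = buf;
    long rc;
    int saved;

    if (out == NULL) { /* glibc extension: allocate on the caller's behalf */
        if (size == 0)
            size = PATH_MAX;
        out = malloc(size);
        if (out == NULL)
            return NULL;
    }

    rc = getcwd_retry(out, size, GETCWD_MAX_TRIES);
    if (rc >= 0 && out[0] == '/')
        return out;

    /* A result that is not an absolute path is reported as ENOENT. */
    saved = (rc >= 0) ? ENOENT : errno;
    if (buf == NULL)
        free(out);
    errno = saved;
    return NULL;
}
```

Assuming the file is saved as getcwd_retry.c (a made-up name), it could be built with `gcc -shared -fPIC -o libgetcwd_retry.so getcwd_retry.c` and enabled for a job with `LD_PRELOAD=/path/to/libgetcwd_retry.so`. Preloading only affects dynamically linked programs, and a wrapper like this hides the symptom rather than fixing the underlying getcwd failure.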
Thanks Sebastien, I hadn't checked that out. I'll start following that bug.