Ashley Nicholls
2010-Jun-03 17:50 UTC
[Lustre-discuss] Autoconf/Automake problem when running on Lustre 1.8.2 filesystem
Hello all,

A little background info: We have a cluster of fourteen servers (each with 8 cores) running builds for employees at my workplace. We primarily used NFSv3 as a way of providing a single file system for all workstations and servers to share files. Each build is run in a separate directory, so cache coherence shouldn't be an issue.

When the cluster is full, we noticed lots of jobs failing when attempting to copy or move results from one pass of the build to the next (once again, a 'job' is run on a single machine from beginning to end, so coherency shouldn't be an issue). In most cases the error was along the lines of 'File doesn't exist', but if you manually checked, the file was present with all the correct permissions. Having tried NFSv4 (and some other file systems I won't name) with the same results, we decided to implement a coherent clustered file system that would also provide more scalability.

Setup and problem:

After much trial and error I have managed to set up a small Lustre system consisting of one MDS and one OSS. All machines involved are CentOS 5 based and run the 2.6.18-164.11.1.el5_lustre.1.8.2 kernel. This setup appears to work correctly, but builds now fail in a way they didn't before.

One of our builds that uses autoconf started failing with the error:

"Can't locate auto/Autom4te/XFile/msg.al in @INC"

After looking at some of the problems encountered on OpenBSD with automake and a lockless NFS system, I found a suggested workaround: adding the lines 'use Autom4te::Channels qw(msg);' and 'use Automake::Channels qw(msg);' to the respective XFile.pm files. With that change, autoconf becomes more verbose and produces:

"autom4te: cannot lock autom4te.cache/requests with mode 2: Function not implemented"

It also no longer fails and continues to build the configure file.
But now we come to the interesting part: the script runs ./configure directly after running autoconf, and this always fails the first time with the message:

./configure: /bin/sh: bad interpreter: Text file busy

If you manually re-run the script, it works correctly! I have tried the flock and localflock mount options on the Lustre mount, but this doesn't appear to make any difference. After googling around, I can't seem to find anyone who has had a similar problem. Has anyone experienced this before, or have any clues as to how I can go about debugging this?

Thanks,
---------
Ashley Nicholls

Open all hailing frequencies and broadcast in all known languages. Including Welsh. - Arnold J. Rimmer

In my many years I have come to a conclusion that one useless man is a shame, two is a law firm, and three or more is a congress. - John Adams
Andreas Dilger
2010-Jun-03 19:27 UTC
On 2010-06-03, at 11:50, Ashley Nicholls wrote:
> Setup and problem:
> After much trial and error I have managed to setup a small Lustre system consisting of one MDS and one OSS. All machines involved are CentOS 5 based and run the 2.6.18-164.11.1.el5_lustre.1.8.2 kernel. This setup appears to work correctly but now fails in a way that it didn't before.
>
> One of our builds that uses autoconf started failing with the error:
> "Can't locate auto/Autom4te/XFile/msg.al in @INC"
>
> After having a look at some of the problems encountered on OpenBSD with automake and a lockless NFS system a workaround was suggested. By adding the line 'use Autom4te::Channels qw(msg);' and 'use Automake::Channels qw(msg);' to the respective XFile.pm autoconf becomes more verbose and produces:
> "autom4te: cannot lock autom4te.cache/requests with mode 2: Function not implemented"

Your tools are probably using flock to lock the files. You need to mount the clients with "-o flock" to get globally coherent flock (at some performance impact) or "-o localflock" to get local-node-only flock (at minimal performance impact).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
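The autom4te message lines up with Andreas's explanation: "Function not implemented" is the text for ENOSYS, and "mode 2" is LOCK_EX. A minimal Python sketch, assuming you can create a scratch file in the directory under test, probes whether flock(2) works there before a build is started. The function name `flock_supported` and the probe filename `.flock_probe` are hypothetical, not part of any Lustre tooling.

```python
import errno
import fcntl
import os


def flock_supported(directory: str) -> bool:
    """Return True if flock(2) works on files in `directory`.

    Returns False when the kernel reports ENOSYS ("Function not
    implemented"), the error autom4te hit above. Other errors are re-raised.
    """
    # Hypothetical scratch file name used only for this probe.
    path = os.path.join(directory, ".flock_probe")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # LOCK_EX is "mode 2" in the autom4te message; LOCK_NB keeps
            # the probe from hanging if another process holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOSYS:
                return False  # flock not available on this mount
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                return True  # someone else holds it, so locking works
            raise
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.unlink(path)
```

For example, `flock_supported("/mnt/lustre/builds")` (a hypothetical path) should return False on a client mounted without either option. A passing probe on a "-o localflock" mount only shows node-local locking works; it says nothing about coherence across clients.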
Andreas Dilger
2010-Jun-04 17:30 UTC
On 2010-06-04, at 04:07, Ashley Nicholls wrote:
> I have tried flock and localflock, this resolves the locking error message but still fails on running configure with the "text file is busy" message.

This means some executable is in use (on a client) but it is trying to be overwritten/truncated.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
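Because the first ./configure run fails with "Text file busy" (ETXTBSY) and a manual re-run succeeds, one stopgap is to retry the exec. A minimal Python sketch, assuming the build driver launches configure via Python's subprocess rather than from a shell script (where bash itself prints the "bad interpreter" message instead of raising an error), retries only when execve() fails with ETXTBSY. The function name and its `attempts`/`delay` parameters are hypothetical. This masks the symptom; it does not fix whatever still has the file open for writing.

```python
import errno
import subprocess
import time
from typing import Sequence


def run_with_etxtbsy_retry(argv: Sequence[str], attempts: int = 5,
                           delay: float = 0.5) -> subprocess.CompletedProcess:
    """Run argv, retrying when execve() fails with ETXTBSY.

    Any other OSError (e.g. file not found) is raised immediately, and the
    last ETXTBSY is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            # subprocess re-raises the child's execve() error in the parent.
            return subprocess.run(argv, check=False)
        except OSError as exc:
            if exc.errno != errno.ETXTBSY or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff
    raise AssertionError("unreachable")
```

Usage would look like `run_with_etxtbsy_retry(["./configure"])` from the build directory.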
Kent Engström
2010-Jun-05 08:29 UTC
Andreas Dilger <andreas.dilger at oracle.com> writes:

> On 2010-06-04, at 04:07, Ashley Nicholls wrote:
>> I have tried flock and localflock, this resolves the locking error message but still fails on running configure with the "text file is busy" message.
>
> This means some executable is in use (on a client) but it is trying to be overwritten/truncated.

On Lustre 1.8.2, it might also be because of the bug discussed in:
https://bugzilla.lustre.org/show_bug.cgi?id=22492

Cheers,
--
Kent Engström, National Supercomputer Centre
kent at nsc.liu.se, +46 13 28 4444