On 2009-11-26, at 12:40, Goranka Bilalbegovic wrote:
> Recently the cluster I am using for computing has been updated to
> the VMware with the Lustre file system. Cluster uses: Oscar 6.0.3,
> Sun Grid Engine 6.2u3, Nagios, Ganglia, InfiniBand 10 Gb/s. Nodes
> access the file system using Ethernet via the Lustre InfiniBand/
> Ethernet router.
>
> I used to run one type of jobs as:
> ---
> #$ -N name
> #$ -o namesys.out
> #$ -e namesys.err
> #$ -pe mpi 2
> #$ -cwd
> #$ -v LD_LIBRARY_PATH
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x << EOF
> name.in
> name.out
> EOF
> ---
>
> This is for an open source package (written in Fortran plus some C
> utilities), and running it this way was recommended by the authors. It
> was working on the previous version of the cluster, but it does not
> run on the new Lustre filesystem. It starts, but then stays in the
> queue forever.

Without more information it is impossible to know what the problem
is. There shouldn't be any problem with running executables from
Lustre.

General debugging steps that should be followed (not strictly related
to this problem):
- presumably the Lustre filesystem is accessible from within your VM
and is working fine other than this job launch problem?
- try to run the job by hand to see if it really is a Lustre problem
or if it is related to the batch scheduler or something else
- check /var/log/messages to see if there are Lustre (or other) errors
- do "echo t > /proc/sysrq-trigger" to dump the stacks of all
processes on the system, and see where your job is stuck
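
As a rough sketch only, the steps above could be wrapped in a few shell
helpers like the ones below. The function names are made up for
illustration, /proc/mounts and /var/log/messages are just the usual Linux
defaults, and the executable path is passed in as an argument because I
don't know where your code.x lives. The by-hand run assumes mpirun is in
your PATH and drops the SGE machinefile, so it starts both ranks locally.
Writing to /proc/sysrq-trigger needs root and sysrq enabled; the stack
dump goes to the kernel log.

```shell
#!/bin/sh
# Illustrative helpers only; names and defaults are assumptions.

# Step 1: list mount points of type "lustre" from a mount table.
# $1 is the mount table to read (defaults to /proc/mounts).
lustre_mounts() {
    awk '$3 == "lustre" { print $2 }' "${1:-/proc/mounts}"
}

# Step 2: run the job by hand, outside the batch scheduler.
# $1 is the path to the MPI executable; name.in and name.out are the
# two lines the original job script fed to the program on stdin.
run_by_hand() {
    mpirun -np 2 "$1" << EOF
name.in
name.out
EOF
}

# Step 3: show the most recent log lines that mention Lustre.
# $1 is the log file to scan (defaults to /var/log/messages).
lustre_log_errors() {
    grep -i 'lustre' "${1:-/var/log/messages}" | tail -n 20
}

# Step 4: dump the stacks of all tasks to the kernel log and show
# the tail of it. Needs root; the output can be very long.
dump_stacks() {
    echo t > /proc/sysrq-trigger && dmesg | tail -n 200
}
```

For example, run "lustre_mounts" and "lustre_log_errors" on a compute
node first, then "run_by_hand" with the path to code.x from a directory
on Lustre, and only use "dump_stacks" (as root) once the job is
actually hung.
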
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.