thr3ads.net - Lustre discuss - [Lustre-discuss] I/O on cluster with lustre [Nov 2009]

Goranka Bilalbegovic

2009-Nov-26 19:40 UTC

[Lustre-discuss] I/O on cluster with lustre

Hello,

Recently the cluster I use for computing was updated to VMware with the
Lustre file system. The cluster uses Oscar 6.0.3, Sun Grid Engine 6.2u3,
Nagios, Ganglia, and InfiniBand 10 Gb/s. Nodes access the file system
over Ethernet via the Lustre InfiniBand/Ethernet router.

I used to run one type of job as follows:
---
#$ -N name
#$ -o namesys.out
#$ -e namesys.err
#$ -pe mpi 2
#$ -cwd
#$ -v LD_LIBRARY_PATH
mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x << EOF
name.in
name.out
EOF
---

This is for an open source package (written in Fortran plus some C
utilities), and this way of running it was recommended by the authors. It
worked on the previous version of the cluster, but it does not run on the
new Lustre file system. The job starts, but then stays in the queue forever.

Is it possible to run this type of job on Lustre?

Thank you.
Best wishes,
Goranka

Andreas Dilger

2009-Dec-01 00:40 UTC

[Lustre-discuss] I/O on cluster with lustre

On 2009-11-26, at 12:40, Goranka Bilalbegovic wrote:
> Recently the cluster I am using for computing has been updated to
> the VMware with the Lustre file system.  Cluster uses: Oscar 6.0.3,
> Sun Grid Engine 6.2u3, Nagios, Ganglia, InfiniBand 10 Gb/s. Nodes
> access the file system using Ethernet via the Lustre InfiniBand/
> Ethernet router.
>
> I used to run one type of jobs as:
> ---
> #$ -N name
> #$ -o namesys.out
> #$ -e namesys.err
> #$ -pe mpi 2
> #$ -cwd
> #$ -v LD_LIBRARY_PATH
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x << EOF
> name.in
> name.out
> EOF
> ---
>
> This is for an open source package (written in Fortran plus some C  
> utilities) and a such way of running was recommended by authors. It  
> was working on the previous version of the cluster, but it does not  
> run on a new lustre filesystem. It starts, but then stays in the  
> queue forever.

Without more information it is impossible to know what the problem
is.  There shouldn't be any problem with running executables from
Lustre.

General debugging steps that should be followed (not strictly related  
to this problem):
- presumably the Lustre filesystem is accessible from within your VM
   and is working fine other than this job launch problem?
- try to run the job by hand to see if it really is a Lustre problem
   or if it is related to the batch scheduler or something else
- check /var/log/messages to see if there are Lustre (or other) errors
- do "echo t > /proc/sysrq-trigger" to dump the stacks of all processes
   on the system, and see where your job is stuck
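
The steps above could be scripted roughly as follows. This is only a
sketch, not a tested procedure for this cluster: the function names
(lustre_mounts, lustre_log_lines, dump_stacks) are made up for
illustration, the log path is the /var/log/messages mentioned above, and
the stack dump assumes you are root on the node where the job hangs.

```shell
# Hypothetical helpers for the checks listed above (names are illustrative).

# 1. List mount points whose filesystem type is "lustre".
#    Reads /proc/mounts by default; field 3 is the fs type, field 2 the mount point.
lustre_mounts() {
    awk '$3 == "lustre" { print $2 }' "${1:-/proc/mounts}"
}

# 2. Run the job by hand, outside the batch scheduler, using the same
#    mpirun line as in the job script (with a literal -np value instead
#    of $NSLOTS and your own machine file), to see whether it hangs there too.

# 3. Show Lustre/LNet related lines from the system log.
lustre_log_lines() {
    grep -i -E 'lustre|lnet' "${1:-/var/log/messages}"
}

# 4. Dump the stacks of all tasks into the kernel log and show the tail.
#    Needs root; the output can be very long on a busy node.
dump_stacks() {
    echo t > /proc/sysrq-trigger && dmesg | tail -n 200
}
```

After sourcing this, one might call lustre_mounts first (no output would
suggest Lustre is not mounted in the VM), then lustre_log_lines, and only
then dump_stacks while the job is stuck, looking for the code.x processes
in the dump.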

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
