I've been trying out patchless kernels and the attached simple code appears
to trigger a failure in Lustre 1.6.0.1. I couldn't see anything in bugzilla
about it.

Typically I see 4+ open() failures out of 32 on the first run after a
Lustre filesystem is mounted. Often (but not always) the number of failures
decreases to a few or zero on subsequent runs.

e.g. typical output (where no output is success) would be:

% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
open of '/mnt/testfs/rjh/blk016.dat' failed on rank 16, hostname 'x15'
open: No such file or directory
open of '/mnt/testfs/rjh/blk018.dat' failed on rank 18, hostname 'x15'
open: No such file or directory
open of '/mnt/testfs/rjh/blk019.dat' failed on rank 19, hostname 'x15'
open: No such file or directory
open of '/mnt/testfs/rjh/blk022.dat' failed on rank 22, hostname 'x16'
open: No such file or directory
open of '/mnt/testfs/rjh/blk020.dat' failed on rank 20, hostname 'x16'
open: No such file or directory
open of '/mnt/testfs/rjh/blk023.dat' failed on rank 23, hostname 'x16'
open: No such file or directory
% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
open of '/mnt/testfs/rjh/blk014.dat' failed on rank 14, hostname 'x14'
open: No such file or directory
open of '/mnt/testfs/rjh/blk013.dat' failed on rank 13, hostname 'x14'
open: No such file or directory
open of '/mnt/testfs/rjh/blk015.dat' failed on rank 15, hostname 'x14'
open: No such file or directory
% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh

which is 32 MPI processes across 6 nodes, each attempting to open and close
1 file (32 files total) in 1 directory over o2ib.

If I umount and remount the filesystem then the higher rate of errors
occurs again:

cexec :11-16 umount /mnt/testfs
cexec :11-16 /usr/sbin/lustre_rmmod ; cexec :11-16 /usr/sbin/lustre_rmmod
cexec :11-16 mount -t lustre x17ib@o2ib:/testfs /mnt/testfs

Note that the same failures happen over GigE too, but only on larger tests,
e.g. -np 64 or 128, so the extra speed of IB seems to trigger the bug sooner.

If Lustre kernel rpms (e.g. 2.6.9-42.0.10.EL_lustre-1.6.0.1smp) are used
instead of the patchless kernels, then I don't see any failures - tested
out to -np 512.

Patchless 2.6.19.7 and 2.6.21.5 give failures at about the same rate.
Modules for 2.6.19.7 were built using the standard Lustre 1.6.0.1 tarball,
and 2.6.21.5 modules were built using
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/lustre/1.6/lustre-1.6.0.1-ql3.tar.bz2
as that's a lot easier to work with than
https://bugzilla.lustre.org/show_bug.cgi?id=11647

Lustre setup is:
  1 OSS node with 2 OSTs, each md raid0 SAS, 2.6.9-42.0.10.EL_lustre-1.6.0.1smp
  1 MDS node with MDT on a 3G ramdisk, same kernel
  6 client nodes
  no lustre striping, lnet debugging at the default
  all nodes are dual dual-core Xeon x86_64 CentOS4.5
  nodes are booting diskless oneSIS

Another data point is that if I rm all the files in the dir then the test
succeeds more often (up until the time the fs is umounted and remounted),
so something about the unlink/create combination might be the problem. e.g.

% cexec :11 rm /mnt/testfs/rjh/'*'
% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
# the above both succeed.
umount and remount the fs as per above, then:

% /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
open of '/mnt/testfs/rjh/blk014.dat' failed on rank 14, hostname 'x14'
open: No such file or directory
open of '/mnt/testfs/rjh/blk015.dat' failed on rank 15, hostname 'x14'
open: No such file or directory
open of '/mnt/testfs/rjh/blk010.dat' failed on rank 10, hostname 'x13'
open: No such file or directory
...

Please let me know if you'd like me to re-run anything with a different
setup or try different kernels or something...

cheers,
robin

-------------- next part --------------
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <mpi.h>

int main(int nargs, char** argv)
{
    int myRank;
    char fname[128];
    int fp;
    int mpiErr, closeErr;
    char name[64];

    mpiErr = MPI_Init(&nargs, &argv);
    if ( mpiErr ) perror( "MPI_Init" );
    mpiErr = MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    if ( mpiErr ) perror( "MPI_Comm_rank" );
    gethostname(name, sizeof(name));

    sprintf(fname, "%s/blk%03d.dat", argv[1], myRank);
    fp = open(fname, (O_RDWR | O_CREAT | O_TRUNC), 0640);
    if ( fp == -1 ) {
        fprintf(stderr, "open of '%s' failed on rank %d, hostname '%s'\n",
                fname, myRank, name);
        perror("open");
    } else {
        closeErr = close(fp);
        if ( closeErr ) {
            fprintf(stderr, "close of '%s' failed on rank %d\n", fname, myRank);
            perror("close");
        }
    }

    mpiErr = MPI_Finalize();
    if ( mpiErr ) perror( "MPI_Finalize" );
    exit(0);
}
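For completeness, the reproducer above can be built with the MPI compiler
wrapper and run exactly as shown in the report; a minimal sketch, assuming
Open MPI's mpicc lives in the same prefix as the mpirun used above and the
attachment is saved as open.c:

  # build the reproducer (mpicc path assumed to match the mpirun prefix)
  /opt/openmpi/1.2/bin/mpicc -o open open.c
  # run it: one file created and closed per rank in the target directory
  /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh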
Hello Robin,

On Wednesday 27 June 2007 14:34:11 Robin Humble wrote:
> I've been trying out patchless kernels and the attached simple code
> appears to trigger a failure in Lustre 1.6.0.1. I couldn't see anything
> in bugzilla about it.
>
> Typically I see 4+ open() failures out of 32 on the first run after a
> Lustre filesystem is mounted. Often (but not always) the number of
> failures decreases to a few or zero on subsequent runs.
>
> e.g. typical output (where no output is success) would be:
> % /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
> open of '/mnt/testfs/rjh/blk016.dat' failed on rank 16,
[...]
>
> which is 32 MPI processes across 6 nodes, each attempting to open and close
> 1 file (32 files total) in 1 directory over o2ib.
>
> If I umount and remount the filesystem then the higher rate of errors
> occurs again:
> cexec :11-16 umount /mnt/testfs
> cexec :11-16 /usr/sbin/lustre_rmmod ; cexec :11-16 /usr/sbin/lustre_rmmod
> cexec :11-16 mount -t lustre x17ib@o2ib:/testfs /mnt/testfs
>
> Note that the same failures happen over GigE too, but only on larger tests,
> e.g. -np 64 or 128, so the extra speed of IB seems to trigger the bug sooner.
>
> If Lustre kernel rpms (e.g. 2.6.9-42.0.10.EL_lustre-1.6.0.1smp) are used
> instead of the patchless kernels, then I don't see any failures - tested
> out to -np 512.
>
> Patchless 2.6.19.7 and 2.6.21.5 give failures at about the same rate.
> Modules for 2.6.19.7 were built using the standard Lustre 1.6.0.1 tarball,
> and 2.6.21.5 modules were built using
> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/lustre/1.6/lustre-1.6.0.1-ql3.tar.bz2
> as that's a lot easier to work with than
> https://bugzilla.lustre.org/show_bug.cgi?id=11647

For real MPI jobs you will probably also need flock support, but for this
you will need -ql4 (bugs #12802 and #11880).

Could you please test again with a patched 2.6.20 or 2.6.21? So far we
don't need patchless clients, so I don't test this extensively. I also
didn't test 2.6.21 very much. I think I will skip 2.6.21 entirely and,
after adding sanity tests for the 2.6.22 patches, will test that version
more thoroughly.

[...]

Also, did you see anything in the logs (server and clients)?

> Another data point is that if I rm all the files in the dir then the test
> succeeds more often (up until the time the fs is umounted and remounted),
> so something about the unlink/create combination might be the problem. e.g.

When I run "sanity.sh 76" it will probably show the same thing. Test 76 is
also about unlink/create actions, only that it tests the inode cache. No
idea so far where I need to look...

Cheers,
Bernd

PS: I would test this here, but presently we have too few customer systems
in repair to do these tests with ;)

--
Bernd Schubert
Q-Leap Networks GmbH
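As a side note on the flock point above: flock behaviour is normally
selected per client at mount time; a minimal sketch, assuming the
flock/localflock client mount options documented for Lustre 1.6 (the
filesystem and mount point are the ones used in this thread):

  # coherent, cluster-wide flock semantics (slower)
  mount -t lustre -o flock x17ib@o2ib:/testfs /mnt/testfs
  # or client-local flock semantics only (faster, not coherent across nodes)
  mount -t lustre -o localflock x17ib@o2ib:/testfs /mnt/testfs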
Robin,

can you test the patch from bug 12123 and see whether it fixes your
problem? It looks like the same symptoms.

On Wed, 2007-06-27 at 08:34 -0400, Robin Humble wrote:
> I've been trying out patchless kernels and the attached simple code
> appears to trigger a failure in Lustre 1.6.0.1. I couldn't see anything
> in bugzilla about it.
>
> Typically I see 4+ open() failures out of 32 on the first run after a
> Lustre filesystem is mounted. Often (but not always) the number of
> failures decreases to a few or zero on subsequent runs.
>
> e.g. typical output (where no output is success) would be:
> % /opt/openmpi/1.2/bin/mpirun --hostfile hosts -np 32 ./open /mnt/testfs/rjh
> open of '/mnt/testfs/rjh/blk016.dat' failed on rank 16, hostname 'x15'
> open: No such file or directory

--
Alexey Lyashkov <shadow@clusterfs.com>
Beaver team, Cluster filesystem
Hi Bernd and Alexey,

thanks for the prompt response and suggestions.

On Wed, Jun 27, 2007 at 03:16:37PM +0200, Bernd Schubert wrote:
>> If Lustre kernel rpms (e.g. 2.6.9-42.0.10.EL_lustre-1.6.0.1smp) are used
>> instead of the patchless kernels, then I don't see any failures - tested
>> out to -np 512.
>>
>> Patchless 2.6.19.7 and 2.6.21.5 give failures at about the same rate.
>> Modules for 2.6.19.7 were built using the standard Lustre 1.6.0.1 tarball,
>> and 2.6.21.5 modules were built using
>> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/lustre/1.6/lustre-1.6.0.1-ql3.tar.bz2
>> as that's a lot easier to work with than
>> https://bugzilla.lustre.org/show_bug.cgi?id=11647
>
> for real MPI jobs you will probably also need flock support, but for this
> you will need -ql4 (bugs #12802 and #11880).

The test code is actually a distilled part of a real MPI job which doesn't
need flock() - the processes are happy just writing to separate files.
Thanks for the info though.

Alexey Lyashkov wrote:
> can you test the patch from bug 12123 and see whether it fixes your
> problem? It looks like the same symptoms.

patchless:
a modified ql4 with the patch from bugzilla 12123(*) gives no joy :-(
Both patchless kernels 2.6.20 and 2.6.21 were tried and the open()
failures still occur with the same symptoms.

Bernd Schubert wrote:
> Could you please test again with a patched 2.6.20 or 2.6.21? So far we
> don't need patchless clients, so I don't test this extensively.

patched:
clients with a patched 2.6.20 kernel (ql4 + 12123, similar to above) have
no problems. I tried 2.6.20 and 2.6.9-42.0.10.EL_lustre-1.6.0.1smp on the
servers and both worked ok with the patched clients.

Patchless 2.6.20 clients with patched 2.6.20 servers have the familiar
open() failures. As always, the best way to trigger this is by umounting
and re-mounting the fs.

> Also, did you see anything in the logs (server and clients)?

There's nothing unusual logged in dmesg or /var/log/messages anywhere that
I can see - just startup and shutdown messages.

cheers,
robin

(*) the patch didn't apply cleanly but was easy to merge
> patchless:
> a modified ql4 with the patch from bugzilla 12123(*) gives no joy :-(
> Both patchless kernels 2.6.20 and 2.6.21 were tried and the open()
> failures still occur with the same symptoms.

Robin,

can you replicate this bug with lnet.debug set to -1, dump the lustre
debug log with "lctl dk > some-file", and file a bug report in bugzilla?

--
Alexey Lyashkov <shadow@clusterfs.com>
Beaver team, Cluster filesystem
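For reference, the capture being asked for amounts to roughly the following
on each client and server node; a minimal sketch, with the dump filename
purely a placeholder ("some-file" above):

  # enable all lustre/lnet debug flags on this node
  sysctl lnet.debug=-1
  # ... reproduce the open() failures ...
  # dump (and clear) the kernel debug buffer to a per-node file
  lctl dk > /tmp/lustre-debug.$(hostname)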
Hi Alexey,

On Fri, Jun 29, 2007 at 12:36:23PM +0300, Alexey Lyashkov wrote:
> can you replicate this bug with lnet.debug set to -1, dump the lustre
> debug log with "lctl dk > some-file", and file a bug report in bugzilla?

no probs.
https://bugzilla.lustre.org/show_bug.cgi?id=12880

cheers,
robin
Hi Robin,

Thanks for submitting the bug, but the client log looks too short - can
you check it? Is the sysctl variable 'lnet.debug' set to -1?

On Mon, 2007-07-02 at 00:31 -0400, Robin Humble wrote:
> Hi Alexey,
>
> On Fri, Jun 29, 2007 at 12:36:23PM +0300, Alexey Lyashkov wrote:
> > can you replicate this bug with lnet.debug set to -1, dump the lustre
> > debug log with "lctl dk > some-file", and file a bug report in bugzilla?
>
> no probs.
> https://bugzilla.lustre.org/show_bug.cgi?id=12880
>
> cheers,
> robin

--
Alexey Lyashkov <shadow@clusterfs.com>
Beaver team, Cluster filesystem
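One common cause of a truncated dump is the debug ring buffer wrapping
before "lctl dk" is run; a hedged sketch of enlarging it before
reproducing, assuming the debug_mb tunable exposed under /proc/sys/lnet in
this Lustre generation (the 256 MB value is only an example):

  sysctl lnet.debug=-1       # full debug flags
  sysctl lnet.debug_mb=256   # enlarge the per-node debug buffer (example size)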
Hi Alexey,

On Mon, Jul 02, 2007 at 12:10:13PM +0300, Alexey Lyashkov wrote:
> Thanks for submitting the bug, but the client log looks too short - can
> you check it? Is the sysctl variable 'lnet.debug' set to -1?

I'm afraid that's all there is :-/ - the nodes are set to full debug.

Bugzilla has logs where I set lnet debug after the fs was set up, but (see
below) I also tried modprobe'ing lnet, setting it to full debug, and then
mkfs'ing and mounting the rest of lustre. It made no difference.

Below is the script I used to build the fs for this test. Let me know if
you can see a problem with it. The 'check lnet debug' step below reports
(same for every node):

lnet.debug = trace inode super ext2 malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec

cheers,
robin

#!/bin/sh
# resetLustreWithDebug

# tidy
cexec :10-14,16 umount /mnt/testfs
cexec :1 umount /mnt/ost0 /mnt/ost1
cexec :17 umount /mnt/mdt
cexec :1,10-14,16-17 /usr/sbin/lustre_rmmod
cexec :1,10-14,16-17 /usr/sbin/lustre_rmmod

# check
cexec :1,10-14,16-17 'lsmod | wc -l'
cexec :1,10-14,16-17 'mount | grep lustre'

# lnet into debug mode
cexec :1,10-14,16-17 modprobe lnet
cexec :1,10-14,16-17 sysctl lnet.debug=-1

# check
cexec :1,10-14,16-17 sysctl lnet.debug

# mkfs
cexec :17 mkfs.lustre --fsname=testfs --mdt --mgs --reformat /dev/loop0 &
cexec :1 mkfs.lustre --fsname=testfs --ost --mgsnode=x17ib@o2ib --reformat /dev/md0 &
cexec :1 mkfs.lustre --fsname=testfs --ost --mgsnode=x17ib@o2ib --reformat /dev/md1 &
wait

# mount
cexec :17 mount -t lustre /dev/loop0 /mnt/mdt
cexec :1 mount -t lustre /dev/md0 /mnt/ost0
cexec :1 mount -t lustre /dev/md1 /mnt/ost1
cexec :10-14,16 mount -t lustre x17ib@o2ib:/testfs /mnt/testfs

# prep for test
cexec :10 chmod 1777 /mnt/testfs
cexec :10 mkdir /mnt/testfs/rjh
cexec :10 chown rjh.rjh /mnt/testfs/rjh

# then run the test as the user...

# then gather logs with
# ssh x1 lctl dk > dk.x1.oss
# ssh x17 lctl dk > dk.x17.mds
# cexec :10-14,16 lctl dk > dk.nodes