I am having a few problems with Lustre and I can't seem to find the answer on the web, so I wondered if you could help?

Firstly, I would like to give you some information on the kit we have and how I have configured it so far; apologies if I get some of the naming schemes incorrect.

I installed two servers for the MGS/MDT with RHEL 5.4, Lustre 1.8.1.1 and the OFED InfiniBand driver. These servers are connected to a Sun J4200 which has 12 disks configured as one RAID 10 LUN. I used the following command to create the Lustre filesystem:

    mkfs.lustre --fsname=lustre --reformat --mgs --mdt --failnode=lmds02@o2ib0 /dev/md0

This filesystem mounts without a problem with the following command:

    mount -t lustre /dev/md0 /mnt/test/mdt

I then installed a further two servers for the OSS with RHEL 5.4, Lustre 1.8.1.1 and the OFED InfiniBand driver. These servers are connected to a Sun J4400 which has 24 disks configured as 2x RAID 6 (8+2) and 2x RAID 1 for the journals. I used the following command to zero the journal:

    mke2fs -O journal_dev -b 4096 /dev/md0

and then issued the following command to create the filesystem and use the external journal:

    mkfs.lustre --fsname=lustre --reformat --ost --mgsnode=lmds01@o2ib0 --mgsnode=lmds02@o2ib0 --failnode=lmds02@o2ib0 --mkfsoptions="-J device=/dev/md0" /dev/md1

The problem occurs when I try to mount the above filesystem:

    [root@lss01 ~]# mount -t lustre /dev/md1 /mnt/ost1
    mount.lustre: mount /dev/md1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown

I have had a look in /var/log/messages and I get quite a lot of errors regarding connection refused etc.; please see the attached file for more info.

I wondered if the above was InfiniBand related, so I repeated the above configuration using tcp0 only over the GigE network, and this worked without a problem.

Any help you can offer would be much appreciated. This is my first Lustre installation, so it's quite possible I have configured something incorrectly.

Kind Regards
Chris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages
Type: application/octet-stream
Size: 1410055 bytes
Desc: messages
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091112/0d8e385d/attachment-0001.obj

On Thu, 2009-11-12 at 10:37 +0000, Chris Exton wrote:
> I am having a few problems with Lustre and I can't seem to find the
> answer to my problem on the web so I wondered if you could help?

You have networking problems:

Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:1371:lnet_send()) No route to 12345-172.22.13.100@tcp
Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:2571:LNetGet()) error sending GET to 12345-172.22.13.100@tcp: -113
Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:2742:kiblnd_cm_callback()) 172.22.13.100@o2ib: ROUTE ERROR -22
Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 172.22.13.100@o2ib: connection failed

b.

On Thu, Nov 12, 2009 at 12:47:33PM -0500, Brian J. Murrell wrote:
> On Thu, 2009-11-12 at 10:37 +0000, Chris Exton wrote:
> > I am having a few problems with Lustre and I can't seem to find the
> > answer to my problem on the web so I wondered if you could help?
>
> You have networking problems:
>
> Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:1371:lnet_send()) No route to 12345-172.22.13.100@tcp
> Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:2571:LNetGet()) error sending GET to 12345-172.22.13.100@tcp: -113
> Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:2742:kiblnd_cm_callback()) 172.22.13.100@o2ib: ROUTE ERROR -22
> Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 172.22.13.100@o2ib: connection failed

Most likely, this is a configuration issue in the IB fabric. You might look at your subnet manager logs for clues.

Isaac
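
For anyone reading the quoted log lines later, here is a minimal Python sketch that pulls the peer address, the LNet network, and the error code out of each line. It is not part of Lustre; the names `summarize`, `NID_RE`, `CODE_RE` and `ERRNO_NAMES` are made up for illustration, and it assumes the trailing negative numbers are negated Linux errno values (113 is EHOSTUNREACH and 22 is EINVAL on Linux).

```python
import re

# Assumption: the negative numbers in the messages are negated Linux errno values.
ERRNO_NAMES = {22: "EINVAL", 113: "EHOSTUNREACH"}

# A NID is <address>@<network>, e.g. 172.22.13.100@o2ib. Some messages
# prefix it with a process id such as "12345-", which is skipped here.
NID_RE = re.compile(r"(?:\d+-)?(\d{1,3}(?:\.\d{1,3}){3})@([a-z0-9]+)")

# The error code, when present, is the last token on the line.
CODE_RE = re.compile(r"(?:ROUTE ERROR |: )(-\d+)\s*$")

LOG_LINES = [
    "Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:1371:lnet_send()) No route to 12345-172.22.13.100@tcp",
    "Nov 11 11:26:00 lss01 kernel: LustreError: 9238:0:(lib-move.c:2571:LNetGet()) error sending GET to 12345-172.22.13.100@tcp: -113",
    "Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:2742:kiblnd_cm_callback()) 172.22.13.100@o2ib: ROUTE ERROR -22",
    "Nov 11 11:26:09 lss01 kernel: Lustre: 6185:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 172.22.13.100@o2ib: connection failed",
]


def summarize(line):
    """Return (address, network, errno_name) for one log line.

    errno_name is None when the line carries no numeric code or the code
    is not in ERRNO_NAMES; the whole result is None when no NID is found.
    """
    nid = NID_RE.search(line)
    if nid is None:
        return None
    code = CODE_RE.search(line)
    name = ERRNO_NAMES.get(abs(int(code.group(1)))) if code else None
    return nid.group(1), nid.group(2), name


if __name__ == "__main__":
    for entry in LOG_LINES:
        print(summarize(entry))
```

Run against the four lines, this shows the same address, 172.22.13.100, failing on both the tcp and the o2ib networks. One thing that may be worth checking (a guess, not a diagnosis) is whether that address matches what `lctl list_nids` reports for the o2ib network on the MGS node.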