Sebastian Reitenbach
2010-Mar-16 10:12 UTC
[Lustre-discuss] networking problem between OSS and MGS
Hi,

I am trying to set up the network between my Lustre servers. Actually, I am not sure whether what I am trying will work at all, so here it goes. I use SLES 11 x86_64 and Lustre 1.8.2 on all hosts.

As a start, I have a separate MGS (192.168.0.150) in one network, a separate MDS (192.168.1.216) in a second network, and two OSS nodes in a third network. The reason for putting things in different networks is to have a firewall in between that can restrict access to the hosts. The MGS and MDS hosts are each connected with one Gigabit Ethernet interface; the OSS nodes are connected with two Gigabit Ethernet interfaces to the switch in the same network:

  OSS node 1: 192.168.2.21 and 192.168.2.121
  OSS node 2: 192.168.2.22 and 192.168.2.122

The MDT filesystem was already created on the MDS. The filesystems on the OSS nodes were first created the following way:

On OSS1:

  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp /dev/mapper/WBOSS1_part1
  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp /dev/mapper/WBOSS1_part2

On OSS2:

  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp /dev/mapper/WBOSS2_part1
  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp /dev/mapper/WBOSS2_part2

On the OSS hosts I tried to put the following into /etc/modprobe.conf.local:

  options lnet networks=tcp0(eth0,eth1)

However, then only eth0 was used, the interface where the default route is configured. Next I tried the following:

On OSS1:

  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp0 /dev/mapper/WBOSS1_part1
  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp1 /dev/mapper/WBOSS1_part2

On OSS2:

  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp0 /dev/mapper/WBOSS2_part1
  mkfs.lustre --fsname=WB01 --ost --mgsnode=192.168.0.150@tcp1 /dev/mapper/WBOSS2_part2

and configured this in /etc/modprobe.conf.local:

  options lnet networks=tcp0(eth1),tcp1(eth0)

With that, /dev/mapper/WBOSS1_part1 was mountable, but when I tried to mount /dev/mapper/WBOSS1_part2 I got this error message:

  mount.lustre: mount /dev/mapper/WBOSS1_part2 at /lustre/WBOSS2-1 failed: Cannot send after transport endpoint shutdown

and in dmesg on the MGS I see:

  LustreError: 120-3: Refusing connection from 192.168.2.21 for 192.168.0.150@tcp1: No matching NI

where 192.168.2.21 is the IP of eth0 (tcp1). When I switch the default routes, the problem persists. I also tried to configure /etc/modprobe.conf.local the other way around:

  options lnet networks=tcp0(eth0),tcp1(eth1)

Again I was able to mount /dev/mapper/WBOSS1_part1, but not /dev/mapper/WBOSS1_part2, with the same error; it just takes the other Ethernet interface. Also, lctl ping only works for one of the two networks:

  oss1:~ # lctl ping 192.168.0.150@tcp0
  12345-0@lo
  12345-192.168.0.150@tcp
  oss1:~ # lctl ping 192.168.0.150@tcp1
  failed to ping 192.168.0.150@tcp1: Input/output error

Looking further into the documentation, I saw that some kind of load balancing can be set up with the ip2nets parameter in /etc/modprobe.conf.local (I put a sketch of the syntax further below). But as far as I can see, I cannot specify two interfaces to be used to communicate with the MGS. So is the above setup possible at all? I hope I was clear enough with my explanation ;)

When I now try to set up bonding on the OSS hosts instead, which bonding mode would be the preferred one to choose? The manual only talks about the recommended xmit_hash_policy, but does not recommend a bonding mode. The switch is capable of doing 802.3ad trunking.
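Coming back to the "No matching NI" message: my suspicion is that the MGS only brings up an NI on tcp0, since its own modprobe.conf.local defines just that one network, so a connection arriving for a tcp1 NID has nothing to match against. I guess I could confirm which NIDs the MGS actually exports with:

  mgs:~ # lctl list_nids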
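Regarding ip2nets: the syntax I found in the manual would look something like the following for our addresses (the IP patterns are only my guess at how to express our layout, so treat it as a sketch):

  options lnet ip2nets="tcp0(eth0,eth1) 192.168.2.*; tcp0 192.168.[0-1].*"

But since ip2nets is just another way of building the networks list, I expect it to behave like my networks=tcp0(eth0,eth1) attempt above, i.e. effectively only one interface gets used towards the MGS.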
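As for the bonding variant, the configuration I have in mind on the OSS hosts would be roughly the following; I am assuming 802.3ad/LACP here, which is exactly the part I am unsure about:

  options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
  options lnet networks=tcp0(bond0)

with eth0 and eth1 enslaved to bond0 via the usual SLES ifcfg-bond0 setup. If I understand it correctly, each OSS would then appear with a single NID again, which should also avoid the NI mismatch from above.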
And a more general question: what is recommended anyway, bonding the two available interfaces on the OSS servers, or configuring Lustre and the routing to use two separate network interfaces?

regards,
Sebastian