Bjorn Nuyttens
2007-Feb-06 03:00 UTC
[Lustre-discuss] MDS-OST connection refused: No matching NI
This seems to be a common problem when working with multihomed systems. A short description of the situation:

- Lustre 1.4.7.3 (2.6.9-42.0.2.EL_lustre.1.4.7.3smp)
- 5 machines named helios160 through helios164, each with two NICs (networks 10.10.3. and 172.16.0.)
- 4 OSTs: helios160-163
  1 MDS: helios164
  5 clients: helios160-164
- options lnet networks="tcp0" in /etc/modprobe.conf

The OSTs start using lconf, but it "binds" to the wrong interface. It should bind to the 172.16.0. interface instead of the 10.10.3. interface. I have been playing around with the lnet options, but lnet does not bind to the other NIC. So the question is: how do I configure lnet to use the other interface?

[root@helios160 ~]# cat /proc/sys/lnet/nis
nid                      refs peer  max   tx  min
0@lo                        2    0    0    0    0
10.10.3.160@tcp             1    8  256  256  256

Trying to add the correct interface seems to have no effect:

[root@helios160 ~]# lctl
lctl > network tcp
lctl > interface_list
10.10.3.160: (10.10.3.160/255.255.128.0) npeer 0 nroute 0
lctl > add_interface 172.16.0.160
lctl > interface_list
10.10.3.160: (10.10.3.160/255.255.128.0) npeer 0 nroute 0
helios160: (172.16.0.160/255.255.255.0) npeer 0 nroute 0
lctl > quit
[root@helios160 ~]# cat /proc/sys/lnet/nis
nid                      refs peer  max   tx  min
0@lo                        2    0    0    0    0
10.10.3.160@tcp             1    8  256  256  256

Because of that, I get the following error when starting the MDS on 172.16.0.164:

LustreError: Refusing connection from 172.16.0.164 for 172.16.0.160@tcp: No matching NI

PS1: no errors a la "lnet: Unknown parameter"
PS2: /etc/hosts is identical on all systems involved:

127.0.0.1 localhost.localdomain localhost
172.16.0.160 helios160
172.16.0.161 helios161
172.16.0.162 helios162
172.16.0.163 helios163
172.16.0.164 helios164
172.16.0.165 helios165
172.16.0.166 helios166
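The check behind the "No matching NI" message can be reproduced by hand: the NID the peer asks for (172.16.0.160@tcp) is simply not in the local NID list. A minimal sketch in Python, assuming the /proc/sys/lnet/nis layout shown in this message (one header line, then one row per NID with the NID in the first column). The helper names parse_nids and has_nid are made up for illustration and are not part of Lustre.

```python
# Sample text copied from the /proc/sys/lnet/nis listing in this message.
SAMPLE_NIS = """\
nid                      refs peer  max   tx  min
0@lo                        2    0    0    0    0
10.10.3.160@tcp             1    8  256  256  256
"""


def parse_nids(nis_text):
    """Return the NIDs from the first column, skipping the header row."""
    nids = []
    for line in nis_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "nid refs peer ..." header.
        if not fields or fields[0] == "nid":
            continue
        nids.append(fields[0])
    return nids


def has_nid(nis_text, wanted):
    """True if the wanted NID (e.g. '172.16.0.160@tcp') is configured."""
    return wanted in parse_nids(nis_text)


if __name__ == "__main__":
    wanted = "172.16.0.160@tcp"
    if has_nid(SAMPLE_NIS, wanted):
        print("found", wanted)
    else:
        print("missing", wanted, "- configured:", parse_nids(SAMPLE_NIS))
```

On a live node the text would come from reading /proc/sys/lnet/nis instead of the embedded sample; that path is the one used in this thread and may differ on other Lustre versions.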
Brian J. Murrell
2007-Feb-06 06:51 UTC
[Lustre-discuss] MDS-OST connection refused: No matching NI
On Tue, 2007-02-06 at 09:57 +0000, Bjorn Nuyttens wrote:
> This seems to be a common problem when working with multihomed systems. A short
> description of the situation:
> - Lustre 1.4.7.3 (2.6.9-42.0.2.EL_lustre.1.4.7.3smp)
> - 5 machines named helios160 through helios164
> each with two NICs (networks 10.10.3. and 172.16.0.)
> - 4 OSTs: helios160-163
> 1 MDS: helios164
> 5 clients: helios160-164
> - options lnet networks="tcp0" in /etc/modprobe.conf
>
>
> The OSTs start using lconf but it "binds" to the wrong interface. It should bind
> to the 172.16.0. interface instead of the 10.10.3. interface. I have been
> playing around with the lnet options but lnet does not bind on the other NIC. So
> the question is, how to configure lnet to use the other interface?

Section 5.1 of the manual (at https://mail.clusterfs.com/wikis/lustre/LustreDocumentation) covers multi-homed networks pretty well. Specifically, you need to look at the ip2nets or networks lnet options.

b.
Bjorn Nuyttens
2007-Feb-07 01:21 UTC
[Lustre-discuss] Re: MDS-OST connection refused: No matching NI
Brian J. Murrell <brian@...> writes:
> Section 5.1 of the manual (at
> https://mail.clusterfs.com/wikis/lustre/LustreDocumentation) covers
> multi-homed networks pretty well. Specifically you need to look at the
> ip2nets or networks lnet options.
>
> b.
>

Thanks for pointing that out; I was focused on section 3.2.2, "Module parameters".

Solution (in this case): in /etc/modprobe.conf, replace the line

options lnet networks="tcp0"

with

options lnet networks="tcp(eth1)"

FYI, the addresses on helios160:

inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet 10.10.3.160/17 brd 10.10.127.255 scope global eth0
inet 172.16.0.160/24 brd 172.16.0.255 scope global eth1
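With networks="tcp(eth1)" the NID to expect in /proc/sys/lnet/nis is the eth1 address followed by @tcp. A minimal sketch in Python that derives it from the address listing above, assuming lines of the form "inet ADDR/PREFIX ... IFNAME" with the interface name as the last field. The function nid_for_interface is a hypothetical helper for illustration, not a Lustre or lctl facility.

```python
# Address lines copied from the FYI listing in this message.
ADDR_LINES = """\
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet 10.10.3.160/17 brd 10.10.127.255 scope global eth0
inet 172.16.0.160/24 brd 172.16.0.255 scope global eth1
"""


def nid_for_interface(addr_text, ifname, net="tcp"):
    """Return 'ADDR@net' for the first inet address on ifname, else None."""
    for line in addr_text.splitlines():
        fields = line.split()
        # Expect: inet ADDR/PREFIX ... IFNAME
        if len(fields) >= 3 and fields[0] == "inet" and fields[-1] == ifname:
            address = fields[1].split("/")[0]  # drop the prefix length
            return "{}@{}".format(address, net)
    return None


if __name__ == "__main__":
    # Interface named in networks="tcp(eth1)".
    print(nid_for_interface(ADDR_LINES, "eth1"))
```

Comparing this value with the NID the MDS tries to reach (172.16.0.160@tcp in the error message) shows whether the module option selects the intended interface.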