Andreas Dilger
2006-May-19  07:36 UTC
[Lustre-discuss] Problem with multiple network interfaces
On Sep 19, 2005 17:40 +0200, Roland Fehrenbacher wrote:
> I have the following Lustre setup (Vers. 1.4.1): MDS and OSSs have two
> TCP/IP network interfaces, configured with IP addresses in two
> different networks A and B, some clients are in network A, some in
> network B. Using the configuration file below, I assumed I will have
> no problems mounting the LOV on either clients. However, mounting
> works successfully only on network 172.17.0.0, but fails on network
> 172.16.0.0 with the following error message:
>
> [ 173.446741] LustreError: Connected successfully to nid 0xac1003fd on host 172.16.3.253, but they claimed they were nid 0xac1103fd (172.17.3.253); please check your Lustre configuration.
> [ 173.466207] LustreError: 1482:0:(socknal_cb.c:1886:ksocknal_recv_hello()) Connected to nid 0xac1103fd@172.16.3.253 but expecting 0xac1003fd
> [ 173.480957] LustreError: Protocol error connecting to host 172.16.3.253 on port 988: Is it running a compatible version of Lustre?
> [ 173.494856] LustreError: 1482:0:(socknal_cb.c:2107:ksocknal_autoconnect()) Deleting packet type 1 len 168 (0xac100601 172.16.6.1->0xac1003fd 172.16.6.1)
> [ 173.511009] LustreError: 1482:0:(events.c:58:request_out_callback()) @@@ type 8, status 19 req@ffff810071ed4600 x3/t0 o38->mds-beo@MDS_PEER_UUID:12 lens 168/64 ref 2 fl Rpc:/0/0 rc 0/0

How are you mounting the clients? If you are using zconf mounting you
also need to specify the NID of the server, otherwise the name
resolution on the second network interface will be incorrect.

If you use the "server_nid=" mount option you can specify a different
NID for the MDS (the variables are just to make the command line
shorter for this example):

    MDSPRI=DNS name or ipaddr of primary interface (= MDS hostname)
    MDSSEC=DNS name or ipaddr of secondary interface

    mount -t lustre -o server_nid=$MDSPRI $MDSSEC:/mds-beo/client /l

Note that the "-o server_nid=$MDSPRI" option can be specified for all
clients.

Also note that it is also possible to specify an arbitrary NID for TCP
nodes with "--nid {number}" that does not match the hostname IP address.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
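
The NIDs in the first LustreError line are the server's two IPv4
addresses written as 32-bit hex values, which is why the client that
connected to 172.16.3.253 was surprised to be told it had reached
172.17.3.253. A minimal shell sketch for decoding them follows; the
function name nid_to_ip is hypothetical (it is not part of Lustre), and
it assumes the NID is derived directly from an IPv4 address, as the log
line itself indicates for these values.

```shell
# Decode a 32-bit TCP NID printed in hex (as in the LustreError lines)
# into dotted-quad notation. nid_to_ip is a hypothetical helper name.
nid_to_ip() {
    nid=$(( $1 ))   # arithmetic expansion accepts the 0x prefix
    printf '%d.%d.%d.%d\n' \
        $(( (nid >> 24) & 255 )) \
        $(( (nid >> 16) & 255 )) \
        $(( (nid >> 8) & 255 )) \
        $(( nid & 255 ))
}

nid_to_ip 0xac1003fd   # prints 172.16.3.253 (the address the client connected to)
nid_to_ip 0xac1103fd   # prints 172.17.3.253 (the NID the server claimed)
nid_to_ip 0xac100601   # prints 172.16.6.1   (the client itself)
```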
Roland Fehrenbacher
2006-May-19  07:36 UTC
[Lustre-discuss] Problem with multiple network interfaces
>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:

Hi Andreas,

sorry for the late reply, I only got to test this right now.

    Andreas> On Sep 19, 2005 17:40 +0200, Roland Fehrenbacher wrote:
    >> I have the following Lustre setup (Vers. 1.4.1): MDS and OSSs
    >> have two TCP/IP network interfaces, configured with IP
    >> addresses in two different networks A and B, some clients are
    >> in network A, some in network B. Using the configuration file
    >> below, I assumed I will have no problems mounting the LOV on
    >> either clients. However, mounting works successfully only on
    >> network 172.17.0.0, but fails on network 172.16.0.0 with the
    >> following error message:
    >>
    >> [ 173.446741] LustreError: Connected successfully to nid 0xac1003fd on host 172.16.3.253, but they claimed they were nid 0xac1103fd (172.17.3.253); please check your Lustre configuration.
    >> [ 173.466207] LustreError: 1482:0:(socknal_cb.c:1886:ksocknal_recv_hello()) Connected to nid 0xac1103fd@172.16.3.253 but expecting 0xac1003fd
    >> [ 173.480957] LustreError: Protocol error connecting to host 172.16.3.253 on port 988: Is it running a compatible version of Lustre?
    >> [ 173.494856] LustreError: 1482:0:(socknal_cb.c:2107:ksocknal_autoconnect()) Deleting packet type 1 len 168 (0xac100601 172.16.6.1->0xac1003fd 172.16.6.1)
    >> [ 173.511009] LustreError: 1482:0:(events.c:58:request_out_callback()) @@@ type 8, status 19 req@ffff810071ed4600 x3/t0 o38->mds-beo@MDS_PEER_UUID:12 lens 168/64 ref 2 fl Rpc:/0/0 rc 0/0

    Andreas> How are you mounting the clients? If you are using zconf
    Andreas> mounting you also need to specify the NID of the server,
    Andreas> otherwise the name resolution on the second network
    Andreas> interface will be incorrect.

Yes I'm using zconf.

    Andreas> If you use the "server_nid=" mount option you can specify
    Andreas> a different NID for the MDS (the variables are just to
    Andreas> make the command line shorter for this example):

I have now changed my configuration to use a nid = 1 for the MDS, as

    lmc -m config.xml --add net --node ha-beo-2 --nid 1 \
        --hostaddr 172.17.3.253 --hostaddr 172.16.3.253 --nettype tcp

    Andreas> MDSPRI=DNS name or ipaddr of primary interface (= MDS
    Andreas> hostname) MDSSEC=DNS name or ipaddr of secondary
    Andreas> interface

    Andreas> mount -t lustre -o server_nid=$MDSPRI
    Andreas> $MDSSEC:/mds-beo/client /l

I then use the following mount command on the servers that only have
an interface in the network 172.16.0.0:

    $ mount -t lustre -o server_nid=1 172.16.3.253:/mds-beo/client /l

However, it still doesn't work. The mount command hangs forever. I
obtain the following error messages on the client:

[90813.569545] LustreError: 17315:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1127921408, 50s ago) req@ffff81002c7e3e00 x8/t0 o38->mds-beo_UUID@NID_1_UUID:12 lens 168/64 ref 1 fl Rpc:/0/0 rc 0/0
[90913.503202] LustreError: 17129:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -115 connecting 0.0.0.0/1023 -> 172.17.3.253/988
[90913.517931] LustreError: Unexpected error -115 connecting to host 172.17.3.253 on port 988
[90913.527580] LustreError: 17129:0:(socknal_cb.c:2107:ksocknal_autoconnect()) Deleting packet type 1 len 168 (0xac100601 172.16.6.1->0x1 172.16.6.1)
[90913.542758] LustreError: 17129:0:(events.c:58:request_out_callback()) @@@ type 8, status 19 req@ffff81002c010800 x9/t0 o38->mds-beo_UUID@NID_1_UUID:12 lens 168/64 ref 2 fl Rpc:/0/0 rc 0/0
[90913.562183] LustreError: 17315:0:(client.c:815:ptlrpc_expire_one_request()) @@@ timeout (sent at 1127921508, 50s ago) req@ffff81002c010800 x9/t0 o38->mds-beo_UUID@NID_1_UUID:12 lens 168/64 ref 1 fl Rpc:/0/0 rc 0/0

and on the MDS I have

Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31279:0:(socknal.c:1153:ksocknal_create_conn()) New conn nid:0xac100601 172.16.3.253 -> 172.16.6.1/1023 incarnation:0x401d6a401cff2 sched[1]/0
Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31279:0:(socknal.c:1153:ksocknal_create_conn()) skipped 6 similar messages
Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31211:0:(socknal_cb.c:1326:ksocknal_process_receive()) [ffff81001be38000] EOF from 0xac100601 ip 172.16.6.1:1021
Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31211:0:(socknal_cb.c:1326:ksocknal_process_receive()) skipped 6 similar messages
Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31210:0:(socknal_cb.c:1326:ksocknal_process_receive()) [ffff8100542d0000] EOF from 0xac100601 ip 172.16.6.1:1022
Sep 28 17:28:28 ha-beo-2 kernel: Lustre: 31210:0:(socknal_cb.c:1326:ksocknal_process_receive()) skipped 6 similar messages

    Andreas> Note that the "-o server_nid=$MDSPRI" option can be
    Andreas> specified for all clients.

Yes, I'm doing that.

    Andreas> Also note that it is also possible to specify an
    Andreas> arbitrary NID for TCP nodes with "--nid {number}" that
    Andreas> does not match the hostname IP address.

I tried this as explained above.

Thanks,
Roland
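
In the client log above, the failing connect goes to 172.17.3.253 (the
first --hostaddr listed for the MDS) rather than to the 172.16.3.253
address given on the mount command line; on Linux, errno 115 is
EINPROGRESS. A small sketch for pulling the connect destinations out of
a kernel log is shown here. The helper name connect_targets is
hypothetical, and the pattern assumes the message format matches the
ksocknal_lib_connect_sock() lines quoted in this thread.

```shell
# Print the distinct destination addresses of failed socket connects
# read from stdin. connect_targets is a hypothetical helper name; the
# pattern assumes lines of the form "... connecting SRC/PORT -> DST/PORT".
connect_targets() {
    grep 'ksocknal_lib_connect_sock' |
        sed -n 's|.*-> \([0-9.]*\)/[0-9]*.*|\1|p' |
        sort -u
}
```

On a client this could be fed from the kernel log, for example with
"dmesg | connect_targets", to check whether the addresses being tried
are reachable from that client's network.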
Roland Fehrenbacher
2006-May-19  07:36 UTC
[Lustre-discuss] Problem with multiple network interfaces
Hi,
I have the following Lustre setup (Vers. 1.4.1): MDS and OSSs have two
TCP/IP network interfaces, configured with IP addresses in two
different networks A and B, some clients are in network A, some in
network B. Using the configuration file below, I assumed I will have
no problems mounting the LOV on either clients. However, mounting
works successfully only on network 172.17.0.0, but fails on network
172.16.0.0 with the following error message:
[  173.446741] LustreError: Connected successfully to nid 0xac1003fd on host
172.16.3.253, but they claimed they were nid 0xac1103fd (172.17.3.253); please
check your Lustre configuration.
[  173.466207] LustreError: 1482:0:(socknal_cb.c:1886:ksocknal_recv_hello())
Connected to nid 0xac1103fd@172.16.3.253 but expecting 0xac1003fd
[  173.480957] LustreError: Protocol error connecting to host 172.16.3.253 on
port 988: Is it running a compatible version of Lustre?
[  173.494856] LustreError: 1482:0:(socknal_cb.c:2107:ksocknal_autoconnect())
Deleting packet type 1 len 168 (0xac100601 172.16.6.1->0xac1003fd 172.16.6.1)
[  173.511009] LustreError: 1482:0:(events.c:58:request_out_callback()) @@@ type
8, status 19 req@ffff810071ed4600 x3/t0 o38->mds-beo@MDS_PEER_UUID:12 lens
168/64 ref 2 fl Rpc:/0/0 rc 0/0
Any ideas?
Thanks,
Roland
--------------------- Config file --------------------------------------------
# Node definitions
lmc -m config.xml --add net --node ha-beo-2 --nid ha-beo-i-2 \
    --hostaddr 172.17.3.253 --hostaddr 172.16.3.253 --nettype tcp
lmc -m config.xml --add net --node sn-03-1 --nid sn-03-1-i \
    --hostaddr 172.17.3.21 --hostaddr 172.16.3.21 --nettype tcp
lmc -m config.xml --add net --node sn-03-2 --nid sn-03-2-i \
    --hostaddr 172.17.3.22 --hostaddr 172.16.3.22 --nettype tcp
lmc -m config.xml --add net --node client --nid '*' --nettype tcp
# MDS
lmc -m config.xml --format --add mds --node ha-beo-2 --mds mds-beo \
    --fstype ldiskfs --dev /dev/drbd/2
# OSS
lmc -m config.xml --add lov --lov lov-beo --mds mds-beo --stripe_sz 1048576 \
  --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node sn-03-1 --lov lov-beo --ost sn-03-1 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost
lmc -m config.xml --add ost --node sn-03-2 --lov lov-beo --ost sn-03-2 \
    --failover --fstype ldiskfs --dev /dev/vgraid50/ost
# Clients
lmc -m config.xml --add mtpt --node client --path /l \
    --mds mds-beo --lov lov-beo
-------------------------------------------------------------------------------