Robert Pinnow, rise | fx
2009-Nov-13 14:54 UTC
[Lustre-discuss] lustre setup with a several subnets
hey lustre group,

i am currently working on my first lustre setup (centos 5.2, lustre 1.8.1), and as it is slightly more complicated on the network side than the one in the quickstart guide, i am running into a few problems here:

*** 1. some of the oss can not mount their osts with a message to
*** 2. the client hangs and produces errors when trying to mount ..:/lustre

the setup looks like this:

1 MDS/CLIENT, having 10 (TEN) network interfaces (six of the eths used for lustre, the other four bonded as a cifs gateway). as this is only a test setup, the MDS is also the client.

3 OSSs, with 3 OSTs each, each connected directly to the mds via 2 eths.

every interface used for lustre is set up to use a different subnet. all the eths used for lustre are connected directly to the mds/client without a switch.

OST1 on OSS1 via eth0 (192.168.16.101)
OST4 on OSS1 via eth1 (192.168.17.101)
OST7 on OSS1 via eth0 (192.168.16.101)

OST2 on OSS2 via eth0 (192.168.18.101)
OST5 on OSS2 via eth1 (192.168.19.101)
OST8 on OSS2 via eth0 (192.168.18.101)

OST3 on OSS3 via eth0 (192.168.10.101)
OST6 on OSS3 via eth1 (192.168.11.101)
OST9 on OSS3 via eth0 (192.168.10.101)

on the MDS/CLIENT side eth0,1,6,7,8,9 are configured accordingly. a linux ping works for all of them.

interfaces in the modprobe.conf are configured with

options lnet networks=tcp0(eth0),tcp1(eth1)

(and of course all six on the MDS/CLIENT, plus, just in case, the lo and the bond0)

the filesystems were created with the option --mgsnode=IPADDRESS@tcpN according to the list above, so the OSTs know which interface to use (do they know?)
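To make the modprobe.conf line above easier to compare between nodes, here is a minimal sketch that builds it from a network-to-interface mapping. The helper name `lnet_networks_option` is made up for illustration, and only the OSS mapping quoted above is used; the exact MDS mapping is not given in this post, so it is not reproduced here.

```python
def lnet_networks_option(net_to_iface):
    """Build an 'options lnet networks=...' line for modprobe.conf.

    net_to_iface maps an LNET network name (e.g. "tcp0") to the
    interface it should use (e.g. "eth0"). Hypothetical helper,
    not part of Lustre.
    """
    parts = [f"{net}({iface})" for net, iface in net_to_iface.items()]
    return "options lnet networks=" + ",".join(parts)


# the OSS configuration quoted in the post
oss_line = lnet_networks_option({"tcp0": "eth0", "tcp1": "eth1"})
print(oss_line)
```

Generating the line for every node from one table makes it easier to see which network name each subnet ends up with on each side of a link.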
the output from the mds looks like it knows about its interfaces:

-----------------------------
[root@mds ~]# lctl list_nids
192.168.10.100@tcp
192.168.11.100@tcp1
127.0.0.1@tcp2
192.168.1.205@tcp5
192.168.16.100@tcp6
192.168.17.100@tcp7
192.168.18.100@tcp8
192.168.19.100@tcp9
-----------------------------

the three OSTs named here are the ones that i was able to mount on oss3:

-----------------------------
[root@mds ~]# lctl device_list
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.10.100@tcp 95c7ba1b-a923-76f8-63fc-1a44201cf75e 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 3
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
-----------------------------

for all the other OSSs, trying to start the OSTs will produce the following output:

-----------------------------
[root@oss1 ~]# mount -t lustre /dev/vg01/ost1 /mnt/ost1
mount.lustre: mount /dev/vg01/ost1 at /mnt/ost1 failed: Input/output error
Is the MGS running?
-----------------------------

and these dmesg notes:

-----------------------------
Lustre: Added LNI 192.168.16.101@tcp [8/256/0/0]
Lustre: Added LNI 192.168.17.101@tcp1 [8/256/0/0]
Lustre: Accept secure, port 988
usb 6-2: USB disconnect, address 2
Lustre: OBD class driver, http://www.lustre.org/
Lustre: Lustre Version: 1.8.1.1
Lustre: Build Version: 1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting. Commit interval 5 seconds
LDISKFS FS on dm-2, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
LDISKFS FS on dm-2, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LustreError: 11140:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host 192.168.16.100 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11140:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.16.101@tcp->192.168.16.100@tcp
Lustre: 12047:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319237993889793 sent from MGC192.168.16.100@tcp to NID 192.168.16.100@tcp 5s ago has timed out (limit 5s). req@ffff8100590c5400 x1319237993889793/t0 o250->MGS@MGC192.168.16.100@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1258123398 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 12024:0:(obd_mount.c:1085:server_start_targets()) Required registration failed for lustre-OSTffff: -5
LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
LustreError: 12024:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -5
LustreError: 12024:0:(obd_mount.c:1412:server_put_super()) no obd lustre-OSTffff
LustreError: 12024:0:(obd_mount.c:136:server_deregister_mount()) lustre-OSTffff not registered
LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
LDISKFS-fs: mballoc: 0 generated and it took 0
LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Lustre: server umount lustre-OSTffff complete
LustreError: 12024:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-5)
-----------------------------

trying to mount this setup on the CLIENT (with the three available OSTs; 6 osts are still missing) with "mount -t lustre [anyofmyinterfaces]:/lustre /mnt/lustre" will produce a hang until i ctrl+c.
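The oss1 log above asks whether 192.168.16.100@tcp is one of the NIDs of the node it tried to reach. A minimal sketch of that check against the `lctl list_nids` output from the mds follows. The function names are hypothetical, and the code assumes that a bare "tcp" names the same LNET network as "tcp0"; that assumption should be verified against the Lustre manual. It only compares strings and does not talk to Lustre.

```python
def normalize_nid(nid):
    """Return the NID with a bare network name like 'tcp' written as 'tcp0'.

    Assumption: LNET treats 'tcp' and 'tcp0' as the same network.
    """
    addr, _, net = nid.strip().partition("@")
    if net and not net[-1].isdigit():
        net += "0"
    return f"{addr}@{net}"


def find_nid(target, list_nids_output):
    """Check whether target appears in pasted 'lctl list_nids' output.

    Returns (present, same_addr), where same_addr lists the NIDs that
    share the target's IP address but may sit on another network.
    """
    target = normalize_nid(target)
    local = [normalize_nid(line)
             for line in list_nids_output.splitlines() if line.strip()]
    addr = target.split("@")[0]
    same_addr = [nid for nid in local if nid.split("@")[0] == addr]
    return target in local, same_addr


# NIDs reported by the mds in this post
MDS_NIDS = """\
192.168.10.100@tcp
192.168.11.100@tcp1
127.0.0.1@tcp2
192.168.1.205@tcp5
192.168.16.100@tcp6
192.168.17.100@tcp7
192.168.18.100@tcp8
192.168.19.100@tcp9
"""

# the NID that oss1 tried to contact according to its dmesg
present, same_addr = find_nid("192.168.16.100@tcp", MDS_NIDS)
print(present, same_addr)
```

With the values from this post the address is found, but under a different network name than the one oss1 used, which may be worth comparing with the "No matching NI" messages in the mds log.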
a dmesg of the MDS/CLIENT looks like this:

-----------------------------
Lustre: OBD class driver, http://www.lustre.org/
Lustre: Lustre Version: 1.8.1.1
Lustre: Build Version: 1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1
Lustre: Added LNI 192.168.10.100@tcp [8/256/0/0]
Lustre: Added LNI 192.168.11.100@tcp1 [8/256/0/0]
Lustre: Added LNI 127.0.0.1@tcp2 [8/256/0/0]
Lustre: Added LNI 192.168.1.205@tcp5 [8/256/0/0]
Lustre: Added LNI 192.168.16.100@tcp6 [8/256/0/0]
Lustre: Added LNI 192.168.17.100@tcp7 [8/256/0/0]
Lustre: Added LNI 192.168.18.100@tcp8 [8/256/0/0]
Lustre: Added LNI 192.168.19.100@tcp9 [8/256/0/0]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting. Commit interval 5 seconds
LDISKFS FS on dm-0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
LDISKFS FS on dm-0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
Lustre: MGS MGS started
Lustre: MGC192.168.10.100@tcp: Reactivating import
Lustre: Enabling user_xattr
Lustre: MDT lustre-MDT0000 now serving lustre-MDT0000_UUID (lustre-MDT0000/a387fd73-011c-aaa2-4d56-c85cf2b2228c) with recovery enabled
Lustre: 12081:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Lustre: Server lustre-MDT0000 on device /dev/vg00/mds1 has started
Lustre: 11898:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -111 connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11898:0:(acceptor.c:88:lnet_connect_console_error()) Connection to 192.168.10.101@tcp at host 192.168.10.101 on port 988 was refused: check that Lustre is running on that node.
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431495 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 5s ago has timed out (limit 5s). req@ffff8100d7264800 x1319226586431495/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112518 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 35s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
LustreError: 11905:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req@ffff81011d067000 x1319226586431526/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112803 ref 2 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 40s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: 11901:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -111 connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11901:0:(acceptor.c:88:lnet_connect_console_error()) Connection to 192.168.10.101@tcp at host 192.168.10.101 on port 988 was refused: check that Lustre is running on that node.
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431530 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 45s ago has timed out (limit 45s). req@ffff810104d4c800 x1319226586431530/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112833 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 45s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
LustreError: 11905:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req@ffff8100de21d000 x1319226586431535/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112888 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 11905:0:(events.c:66:request_out_callback()) Skipped 2 previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.10.101@tcp
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Lustre: 11898:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11898:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.10.101@tcp at host 192.168.10.101 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431539 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 55s ago has timed out (limit 55s). req@ffff8100de21d400 x1319226586431539/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112918 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 5 previous similar messages
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Lustre: 11899:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11899:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.10.101@tcp at host 192.168.10.101 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.10.101@tcp
eth1: Link is Down
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431545 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 55s ago has timed out (limit 55s). req@ffff8100dc6ca800 x1319226586431545/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258112993 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 5 previous similar messages
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431557 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 55s ago has timed out (limit 55s). req@ffff81011ebda000 x1319226586431557/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258113143 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 11 previous similar messages
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1319226586431581 sent from lustre-OST0000-osc to NID 192.168.10.101@tcp 55s ago has timed out (limit 55s). req@ffff8100da7abc00 x1319226586431581/t0 o8->lustre-OST0000_UUID@192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl 1258113443 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Lustre: MGC127.0.0.1@tcp2: Reactivating import
LustreError: 120-3: Refusing connection from 192.168.16.100 for 192.168.16.100@tcp: No matching NI
LustreError: 11901:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host 192.168.16.100 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.16.100@tcp
LustreError: 120-3: Refusing connection from 192.168.16.100 for 192.168.16.100@tcp: No matching NI
LustreError: 11898:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host 192.168.16.100 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.10.100@tcp->192.168.16.100@tcp
LustreError: 12157:0:(lov_obd.c:988:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 12157:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 12157:0:(ldlm_request.c:1533:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff8101172ef800 umount complete
LustreError: 12157:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-4)
------------------------

any ideas?

THANK YOU!

robert