Hi all,

I have a problem setting up LNET routing. The MDS and OSSes have both IB and GigE networks: 30.9.100.* for IB and 20.9.100.* for GigE. Most of the clients have IB too, but a few of them don't, so I chose one client to act as an LNET router. The configuration is below.

On the MDS and OSSes:
  IB: 30.9.100.*
  GigE: 20.9.100.*
  modprobe.conf:
    options lnet networks="o2ib0(ib0)" routes="tcp0 30.9.0.5@o2ib0"

On the router:
  IB: 30.9.0.5
  GigE: 20.9.0.5
  modprobe.conf:
    options lnet networks="o2ib0(ib0),tcp0(eth1)" forwarding="enabled"

On the GigE client:
  GigE: 20.9.0.2
  modprobe.conf:
    options lnet networks="tcp0(eth1)" routes="o2ib0 20.9.0.5@tcp0"

After LNET was configured, the client can lctl ping the MDS and every OSS. For example:

  client:~ # lctl ping 30.9.100.31@o2ib
  12345-0@lo
  12345-30.9.100.31@o2ib

where 30.9.100.31 is the MDS. But

  mount -t lustre 30.9.100.31@o2ib0:30.9.100.32@o2ib0:/fnfs /mnt

failed. The log says:

Nov 24 10:36:37 cn-fn02 kernel: [502743.285050] Lustre: OBD class driver, http://wiki.whamcloud.com/
Nov 24 10:36:37 cn-fn02 kernel: [502743.285056] Lustre: Lustre Version: 2.1.0
Nov 24 10:36:37 cn-fn02 kernel: [502743.285060] Lustre: Build Version: RC2-g9d71fe8-PRISTINE-2.6.32.12-0.7-default
Nov 24 10:36:37 cn-fn02 kernel: [502743.287057] Lustre: Lustre LU module (ffffffffa17f6d00).
Nov 24 10:36:37 cn-fn02 kernel: [502743.358095] Lustre: Added LNI 20.9.0.2@tcp [8/256/0/180]
Nov 24 10:36:37 cn-fn02 kernel: [502743.358153] Lustre: Accept secure, port 988
Nov 24 10:36:37 cn-fn02 kernel: [502743.423409] Lustre: Lustre OSC module (ffffffffa1a9b800).
Nov 24 10:36:37 cn-fn02 kernel: [502743.438668] Lustre: Lustre LOV module (ffffffffa1b09500).
Nov 24 10:36:37 cn-fn02 kernel: [502743.460108] Lustre: Lustre client module (ffffffffa1ba9a40).
Nov 24 10:36:37 cn-fn02 kernel: [502743.480266] Lustre: 4329:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC30.9.100.31@o2ib->MGC30.9.100.31@o2ib_0 netid 20000: select flavor null
Nov 24 10:36:37 cn-fn02 kernel: [502743.485938] Lustre: MGC30.9.100.31@o2ib: Reactivating import
Nov 24 10:36:37 cn-fn02 kernel: [502743.517528] Lustre: 4329:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import fnfs-MDT0000-mdc-ffff8801b79afc00->30.9.100.31@o2ib netid 20000: select flavor null
Nov 24 10:36:42 cn-fn02 kernel: [502748.508709] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321488 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has timed out for sent delay: [sent 1322102197] [real_sent 0] [current 1322102202] [deadline 5s] [delay 0s] req@ffff88019c603c00 x1386324633321488/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102202 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:37:07 cn-fn02 kernel: [502773.472069] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321491 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102222] [real_sent 1322102222] [current 1322102227] [deadline 5s] [delay 0s] req@ffff88019b092400 x1386324633321491/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102227 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:37:27 cn-fn02 kernel: [502793.442762] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 5s
Nov 24 10:37:27 cn-fn02 kernel: [502793.442802] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321493 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102247] [real_sent 1322102247] [current 1322102247] [deadline 10s] [delay -10s] req@ffff8801b68ebc00 x1386324633321493/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102257 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:02 cn-fn02 kernel: [502828.392144] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321495 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102272] [real_sent 1322102272] [current 1322102282] [deadline 10s] [delay 0s] req@ffff88019c603c00 x1386324633321495/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102282 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:17 cn-fn02 kernel: [502843.369501] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 10s
Nov 24 10:38:17 cn-fn02 kernel: [502843.369561] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321497 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102297] [real_sent 1322102297] [current 1322102297] [deadline 15s] [delay -15s] req@ffff88019b082000 x1386324633321497/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102312 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:38:57 cn-fn02 kernel: [502883.322837] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321499 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102322] [real_sent 1322102322] [current 1322102337] [deadline 15s] [delay 0s] req@ffff8801b7860400 x1386324633321499/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102337 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:07 cn-fn02 kernel: [502893.296214] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 15s
Nov 24 10:39:07 cn-fn02 kernel: [502893.296281] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321501 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102347] [real_sent 1322102347] [current 1322102347] [deadline 20s] [delay -20s] req@ffff8801b7400400 x1386324633321501/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.31@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102367 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:52 cn-fn02 kernel: [502938.234238] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321503 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102372] [real_sent 1322102372] [current 1322102392] [deadline 20s] [delay 0s] req@ffff88019a9d0000 x1386324633321503/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102392 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:39:57 cn-fn02 kernel: [502943.222935] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 20s
Nov 24 10:40:01 cn-fn02 /usr/sbin/cron[4509]: (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 -S ALL 1 1)
Nov 24 10:40:47 cn-fn02 kernel: [502993.149647] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321507 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102422] [real_sent 1322102422] [current 1322102447] [deadline 25s] [delay 0s] req@ffff8801b7583000 x1386324633321507/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102447 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:40:47 cn-fn02 kernel: [502993.149653] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Nov 24 10:41:12 cn-fn02 kernel: [503018.117004] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 25s
Nov 24 10:42:07 cn-fn02 kernel: [503073.041134] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1386324633321512 sent from fnfs-MDT0000-mdc-ffff8801b79afc00 to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102497] [real_sent 1322102497] [current 1322102527] [deadline 30s] [delay 0s] req@ffff88019b08b000 x1386324633321512/t0(0) o-1->fnfs-MDT0000_UUID@30.9.100.32@o2ib:12/10 lens 368/512 e 0 to 1 dl 1322102527 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Nov 24 10:42:07 cn-fn02 kernel: [503073.041140] Lustre: 4401:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Nov 24 10:42:27 cn-fn02 kernel: [503093.011078] Lustre: 4402:0:(import.c:526:import_select_connection()) fnfs-MDT0000-mdc-ffff8801b79afc00: tried all connections, increasing latency to 30s
............

I wonder why the client connects to 20.9.100.31@tcp and 30.9.100.32@o2ib, but not to 30.9.100.31@o2ib? 30.9.100.31 (20.9.100.31 on GigE) is my active MDS; 30.9.100.32 is just a standby.

Thanks a lot!
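
P.S. For anyone skimming the log: a minimal sketch, assuming Python 3 is available, of how the expired requests above could be tallied by the NID they were sent to. It is only an illustration for reading the log, not part of the Lustre tooling. The function name summarize_targets and the SAMPLE text (three messages abbreviated from the log above) are hypothetical. The pattern accepts NIDs written either as addr@net or as "addr at net".

```python
import re
from collections import Counter

# Matches the "to NID <addr>@<net> has <reason>:" part of a
# ptlrpc_expire_one_request() message. The separator may be "@" or " at ".
NID_RE = re.compile(
    r"to NID (\S+)(?:@| at )(\S+) has "
    r"(timed out for [a-z ]+?|failed due to network error):"
)


def summarize_targets(log_text):
    """Count (NID, reason) pairs over all expired-request messages."""
    counts = Counter()
    for addr, net, reason in NID_RE.findall(log_text):
        counts[(f"{addr}@{net}", reason)] += 1
    return counts


# Abbreviated excerpts of three messages from the log in this post.
SAMPLE = """\
... to NID 20.9.100.31@tcp has timed out for sent delay: [sent 1322102197] ...
... to NID 30.9.100.32@o2ib has timed out for slow reply: [sent 1322102222] ...
... to NID 20.9.100.31@tcp has failed due to network error: [sent 1322102247] ...
"""

if __name__ == "__main__":
    for (nid, reason), count in summarize_targets(SAMPLE).items():
        print(f"{nid:20} {reason:30} {count}")
```

Run against the full log, this kind of tally makes it easy to see that every request went either to 20.9.100.31@tcp or to 30.9.100.32@o2ib, and none to 30.9.100.31@o2ib.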