James Robnett
2014-Sep-09 11:04 UTC
Network name o2ib0 collision in two discrete filesystems
I'm having difficulty finding a solution to an LNET issue. We have two Lustre filesystems, separated by about 60 miles, both of which have o2ib0(ib0) and tcp0(eth0) networks defined. Both have IB and TCP clients that work just fine. I'll call them FS1 and FS2.

    FS1-mds@ib0   192.168.1.11
    FS1-mds@eth0  10.1.1.11
    FS2-mds@ib0   192.168.2.11
    FS2-mds@eth0  10.1.2.11

We need a client physically at site 1 to mount the filesystems from both sites. The intent is to mount the local FS1 via o2ib0 and the remote FS2 via tcp0 (accessible over a gigabit link). The mount commands for the client are:

    mount -t lustre 192.168.1.11@o2ib0:/lustre /lustre/FS1
    mount -t lustre 10.1.2.11@tcp0:/lustre /lustre/FS2

If I set this client's modprobe.conf line as

    options lnet networks="o2ib0(ib0),tcp0(eth0)"

then it mounts FS1 without issue but fails on FS2, because it tries to communicate via o2ib0 even though the mount command specifies tcp0. Presumably this is because the client believes it is on both o2ib0 and tcp0, without realizing that o2ib0 at site 1 is a different network from o2ib0 at site 2.

If I set the client's modprobe.conf line as

    options lnet networks="tcp0(eth0),o2ib0(ib0)"

then it mounts FS1 just fine but actually communicates via tcp0 (visible in /proc/sys/lnet/peers), since there is a working network path and tcp0 is first in the list. It also mounts FS2 just fine, as expected.

So I can mount one or the other but not both, or at least not both in the way we need (i.e. IB for site 1 and TCP for site 2).

I had begun looking into setting up an LNET router at site 2, but I suspect that either it won't actually help, or it will help only if I set it up in a way that disturbs the existing o2ib0 and tcp0 clients there.

I also tried briefly to set up an LNET router at site 1 that only knew about tcp0. I put a routes line on the client pointing tcp0 at <lnetIP>@tcp0. The LNET router can see and lctl ping the FS2 MDS, but the client throws an error on startup and doesn't seem to believe there's a route.

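The two outcomes above can be reproduced with a small toy model. This is only a sketch of the observed behavior, not LNet's actual selection code: it assumes the client walks its networks= list in order and uses the first network on which the server also advertises a NID. The function name pick_nid and the "renamed" variant (site 2's IB network called o2ib1) are hypothetical and used purely for illustration.

```python
# Toy model only: NOT LNet's real logic. Assumption: the client walks its
# networks= list in order and picks the first network on which the server
# also advertises a NID, regardless of what the mount command named.

def pick_nid(client_networks, server_nids):
    """Return the server NID the modeled client would use, or None."""
    for net in client_networks:
        for nid in server_nids:
            # A NID has the form "<address>@<network>".
            _addr, _sep, nid_net = nid.rpartition("@")
            if nid_net == net:
                return nid
    return None

# MDS NIDs taken from the addresses listed in the post.
FS1_MDS = ["192.168.1.11@o2ib0", "10.1.1.11@tcp0"]
FS2_MDS = ["192.168.2.11@o2ib0", "10.1.2.11@tcp0"]

# Hypothetical variant: site 2's IB network renamed so the names no longer collide.
FS2_MDS_RENAMED = ["192.168.2.11@o2ib1", "10.1.2.11@tcp0"]

if __name__ == "__main__":
    for order in (["o2ib0", "tcp0"], ["tcp0", "o2ib0"]):
        print(order,
              "FS1 ->", pick_nid(order, FS1_MDS),
              "| FS2 ->", pick_nid(order, FS2_MDS))
    print("renamed FS2 ->", pick_nid(["o2ib0", "tcp0"], FS2_MDS_RENAMED))
```

With o2ib0 listed first, the model sends FS2 traffic to 192.168.2.11@o2ib0, which is not reachable over the site 1 IB fabric; with tcp0 listed first, both filesystems end up on TCP. Only when the two IB networks carry different names does the model give IB for FS1 and TCP for FS2 with a single ordering.
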
I'm beginning to sense that the only real option is to get rid of the IB name collision: run tunefs.lustre at site 2 and change the servers and clients there to use o2ib1 rather than o2ib0, or some other permutation of renaming networks. But maybe (hopefully) I'm missing something with LNET routing.

On a side note, it's mildly confusing that the ordering of the lnet options networks= line takes precedence over the mount command. If that weren't the case, then either modprobe.conf ordering above would work, rather than neither. But maybe there's a case I'm missing that requires the lnet option ordering to take precedence over the mount syntax.

Of course, there's the very real possibility that I'm missing an obvious, simple solution.

James Robnett
NRAO/NM