Klaus Steden
2007-Nov-15 02:23 UTC
[Lustre-discuss] Adding a new client on a different network
Hi there,

I offered to help one of our network engineers test out some equipment by connecting up a new Linux node to our existing CFS and blasting data back and forth (I figure it was about as high-performance a solution as I could provide :-).

There are a couple of wrinkles that I can't quite figure out, and I can't find any hits in Google that seem to fit the bill.

Basically, I've had Lustre working beautifully on a closed network segment (172.16.129.0/24), and want to expand it to include another closed segment, 172.16.128.0/24. I've connected my MDS and OSS nodes into this new segment, connectivity is good, and software/kernels/etc. are all at the same major and minor revision.

However, when I try to connect up the new client, here's what happens:

-- client --
mount -t lustre 172.16.128.252@tcp0:/lustre /mnt/lustre
mount.lustre: mount 172.16.128.252@tcp0:/lustre at /mnt/lustre failed: Cannot send after transport endpoint shutdown
-- client --

And on the MDS side, here's what I see in the output of 'dmesg':

-- mds --
LustreError: 120-3: Refusing connection from 172.16.128.100 for 172.16.128.252@tcp: No matching NI
-- mds --

I was initially using this in my modprobe.conf:

-- modprobe.conf --
options lnet networks=tcp0(eth0,bond0)
-- modprobe.conf --

where 'eth0' is attached to 172.16.129.0/24, and 'bond0' is attached to 172.16.128.0/24.

What's happening here, and where do I look for information on how to fix it?

When I originally assembled the file system, I had only specified nodes on the 172.16.129.0/24 network in the various MGS/MGC parameters.

Any help would be greatly, greatly appreciated!

thanks,
Klaus
Nathan Rutman
2007-Nov-15 03:48 UTC
[Lustre-discuss] Adding a new client on a different network
Your client is trying to talk to 172.16.128.252@tcp, but the server doesn't think that is one of its NIDs.

Some things to try:
cat /proc/sys/lnet/nis on client and servers
lctl ping back and forth
Isaac Huang
2007-Nov-15 09:11 UTC
[Lustre-discuss] Adding a new client on a different network
On Wed, Nov 14, 2007 at 06:23:36PM -0800, Klaus Steden wrote:
[......]
> And on the MDS side, here's what I see in the output of 'dmesg':
>
> -- mds --
> LustreError: 120-3: Refusing connection from 172.16.128.100 for
> 172.16.128.252@tcp: No matching NI
> -- mds --
>
> I was initially using this in my modprobe.conf:
>
> -- modprobe.conf --
> options lnet networks=tcp0(eth0,bond0)
> -- modprobe.conf --

This only gave the MDS one NID: IP-of-eth0@tcp0, i.e. the IP address of the 1st interface specified was used to generate the NID.

> where 'eth0' is attached to 172.16.129.0/24, and 'bond0' is attached to
> 172.16.128.0/24.

In your case, 172.16.129.x@tcp.

> What's happening here, and where do I look for information on how to fix it?

When the client tried to reach the MDS at 172.16.128.252@tcp, the MDS refused the connection since 172.16.128.252@tcp wasn't one of its NIDs.

If they're two separate networks, just give the MDS two NIDs:
options lnet networks='tcp0(eth0),tcp1(bond0)'

And for clients on eth0's network:
options lnet networks='tcp0(eth?)'

And lastly, for clients on bond0's network:
options lnet networks='tcp1(eth?)'

HTH,
Isaac
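The NID derivation Isaac describes can be modelled with a short shell sketch. This is a toy illustration only, not LNet code: `ip_of`, `nids_for` and `has_nid` are hypothetical helper names, the interface-to-address table is hard-coded from the addresses quoted in this thread (eth0's address is kept as the 172.16.129.x placeholder), and it assumes every network in the `networks=` string lists its interfaces explicitly.

```shell
#!/bin/sh
# Toy model of how an lnet "networks=" string maps to NIDs, as described
# in this thread. Not LNet code; all helper names are hypothetical.

# Hard-coded interface -> address table (assumption: values taken from the
# thread; eth0's host part is unknown, so the "x" placeholder is kept).
ip_of() {
    case "$1" in
        eth0)  echo "172.16.129.x" ;;
        bond0) echo "172.16.128.252" ;;
        *)     return 1 ;;
    esac
}

# Print one NID per network: only the first interface listed for a
# network contributes its address.
nids_for() {
    # Turn "net(a,b),net2(c)" into the words "net(a,b)" and "net2(c)".
    for entry in $(printf '%s\n' "$1" | sed 's/),/) /g'); do
        net=${entry%%"("*}        # text before "("
        ifaces=${entry#*"("}      # text after "("
        ifaces=${ifaces%")"}      # drop the trailing ")"
        first=${ifaces%%,*}       # first interface only
        echo "$(ip_of "$first")@$net"
    done
}

# Succeed only if NID $1 appears in the NID list $2 (one NID per line).
has_nid() {
    printf '%s\n' "$2" | grep -Fxq -- "$1"
}

old=$(nids_for 'tcp0(eth0,bond0)')
new=$(nids_for 'tcp0(eth0),tcp1(bond0)')

has_nid '172.16.128.252@tcp0' "$old" || echo "old config: no matching NID"
has_nid '172.16.128.252@tcp1' "$new" && echo "new config: NID found"
```

With the original single-network setting the model yields only the eth0 NID, so a request addressed to the bond0 address finds no match; with two networks the bond0 address gets its own NID on tcp1. On a real node, `lctl list_nids` reports the NIDs actually configured.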
Jody McIntyre
2007-Nov-15 15:52 UTC
[Lustre-discuss] Adding a new client on a different network
Nathan Rutman wrote:
> cat /proc/sys/lnet/nis on client and servers

Minor typo: this should be /proc/sys/lnet/*nids*

Cheers,
Jody
Jody McIntyre
2007-Nov-15 18:10 UTC
[Lustre-discuss] Adding a new client on a different network
Sorry, Thunderbird is stupid. Replying again using mutt.

On Thu, Nov 15, 2007 at 07:52:48AM -0800, Jody McIntyre wrote:
> Nathan Rutman wrote:
> > cat /proc/sys/lnet/nis on client and servers
> Minor typo: this should be /proc/sys/lnet/*nids

Minor typo: this should be /proc/sys/lnet/nids

Cheers,
Jody
Klaus Steden
2007-Nov-15 19:03 UTC
[Lustre-discuss] Adding a new client on a different network
Isaac,

I followed your instructions, and now I can see the client on the second network on the MDS when I run:

lctl> network tcp1
lctl> peer_list
lctl> conn_list

However ... when I attempt to mount, I get this error on the client:

-- client --
mount -t lustre 172.16.128.252@tcp1:/lustre /mnt/lustre
mount.lustre: mount 172.16.128.252@tcp1:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
-- client --

This is what turns up on the MDS (via dmesg):

-- mds --
LustreError: 14098:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error -104 reading HELLO from 172.16.128.100
LustreError: 11b-b: Connection to 172.16.128.100@tcp at host 172.16.128.100 on port 988 was reset: is it running a compatible version of Lustre and is 172.16.128.100@tcp one of its NIDs?
-- mds --

If I initially constructed my file system to use a failover MDS, do I have to specify it in my mount command? Is there a way to query the creation-time flags and options set on a particular file system so I can see if I am indeed attempting to talk to the MGS as well?

thanks,
Klaus
Klaus Steden
2007-Nov-15 19:16 UTC
[Lustre-discuss] Adding a new client on a different network
Hi Jody,

My system shows me only /proc/sys/lnet/nis ... but it looks like the list of NIDs, so I'm assuming it's the file you are referring to?

Klaus
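Since the thread leaves open which of the two file names a given node exposes, a small sketch can print whichever one is readable. `show_lnet_file` is a hypothetical helper name, not a Lustre utility, and the only paths it is given are the two mentioned above.

```shell
#!/bin/sh
# Print the first readable file among the given candidate paths.
# show_lnet_file is a hypothetical helper, not a Lustre utility.
show_lnet_file() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "== $f =="
            cat "$f"
            return 0
        fi
    done
    echo "none of the candidate files is readable: $*" >&2
    return 1
}

# Both paths are the ones mentioned in this thread. The "|| true" keeps the
# script from failing on nodes where neither file exists.
show_lnet_file /proc/sys/lnet/nis /proc/sys/lnet/nids || true
```

Running it on the client and on each server gives the same comparison Nathan suggested, without having to know in advance which name the installed version uses.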