So we have a cluster with an MGT and 2 MDTs. Each has a NID on o2ib and tcp and is dual-connected to 2 MDSs. We created the MGT and MDTs with the following commands:

mkfs.lustre --mgs --reformat --failnode=10.4.200.0@o2ib,10.4.200.1@o2ib --failnode=10.4.200.0@tcp0,10.4.200.1@tcp0 /dev/dm-0
mkfs.lustre --mdt --mgsnode=10.4.200.0@o2ib --fsname=lrc --reformat --failnode=10.4.200.0@o2ib,10.4.200.1@o2ib,10.4.200.0@tcp0,10.4.200.1@tcp0 /dev/dm-1
mkfs.lustre --mdt --mgsnode=10.4.200.0@o2ib --fsname=nano --reformat --failnode=10.4.200.1@o2ib,10.4.200.0@o2ib,10.4.200.1@tcp0,10.4.200.0@tcp0 /dev/dm-2

The host cluster starts and mounts the LUNs just fine. I mount TCP-connected clients with both MGSs called out. The client fails over to the secondary MDS/MGT just fine but keeps failing on the MDT. It just keeps trying the old MDS NIDs:

Lustre: Changing connection for lrc-MDT0000-mdc-ffff8101d57ad400 to 10.4.200.0@o2ib/10.0.200.0@tcp

Ideas?

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
Please disregard. I just realized the difference between a ':' and a ',' when running these commands.

----------------
John White
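For readers who hit the same problem, here is a minimal sketch of the separator difference mentioned above. It assumes the usual Lustre NID-list convention, where ',' joins several NIDs that belong to one node and ':' separates different nodes. The helper name parse_nid_spec and the "regrouped" string are hypothetical and only for illustration; they are not the commands that were actually rerun, and NIDs are written here in addr@network form.

```python
from typing import List


def parse_nid_spec(spec: str) -> List[List[str]]:
    """Split a NID spec into nodes (':'-separated), each a list of NIDs (','-separated).

    Illustrative only: this mirrors the assumed convention, not Lustre's own parser.
    """
    nodes = []
    for node_part in spec.split(":"):
        nids = [nid.strip() for nid in node_part.split(",") if nid.strip()]
        if nids:
            nodes.append(nids)
    return nodes


# The all-comma list from the original post groups as ONE node with four NIDs.
as_posted = "10.4.200.0@o2ib,10.4.200.1@o2ib,10.4.200.0@tcp0,10.4.200.1@tcp0"

# Hypothetical regrouping: each colon-separated group is one node's NIDs.
regrouped = "10.4.200.0@o2ib,10.4.200.0@tcp0:10.4.200.1@o2ib,10.4.200.1@tcp0"

for label, spec in (("as posted", as_posted), ("regrouped", regrouped)):
    nodes = parse_nid_spec(spec)
    print(f"{label}: {len(nodes)} node(s)")
    for i, nids in enumerate(nodes):
        print(f"  node {i}: {', '.join(nids)}")
```

Under that assumption, the all-comma form describes a single node reachable at four addresses, which would be consistent with a client never being told about a second MDS to try.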
On 2009-12-11, at 17:48, John White wrote:
> Please disregard. I just realized the difference between a ':' and
> ',' when running these commands.

If this isn't explained clearly in the documentation, please speak up and it will be fixed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss