Erich Focht
2008-Apr-29 09:55 UTC
[Lustre-discuss] mounting lustre in failover configuration
Hi,

I'm puzzled by the following behavior. An active-passive failover pair of metadata servers has separate MGS and MDT disks and two networks (o2ib and tcp0 (eth0)):

mds1: 10.3.0.230@o2ib  192.168.50.130@tcp0
mds2: 10.3.0.231@o2ib  192.168.50.131@tcp0

MGS and MDT are formatted with the options:
failover.node=10.3.0.231@o2ib,192.168.50.131@tcp0
mgsnode=10.3.0.230@o2ib,192.168.50.130@tcp0,10.3.0.231@o2ib,192.168.50.131@tcp0

The _first_ mount of an OST fails if mds1 is the active metadata server. It succeeds when mds2 is active.

With client mounts I have seen something similar. I could mount clients with
mount -t lustre 10.3.0.231@o2ib:10.3.0.230@o2ib:/lustre /mnt/lustre
but not with
mount -t lustre 10.3.0.230@o2ib:10.3.0.231@o2ib:/lustre /mnt/lustre
when mds1 was the active MDS. This suggests that the active MDS has to be the last one on the list.

Strangely enough, in my current lab setup I can no longer reproduce the client mount behavior.

Did anybody else see this kind of behavior? Are there any reasons for this?

Thanks & best regards,
Erich
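[Editor's note: for readers unfamiliar with how the quoted failover.node/mgsnode options get onto disk, one plausible way to produce them is the mkfs.lustre invocation below. This is a hedged sketch only: the device paths (/dev/mgs_disk, /dev/mdt_disk) and the fsname are placeholders, not taken from the original post.]

```shell
# Sketch of formatting an MGS and MDT with the failover options quoted above.
# --failnode records the failover partner (mds2); --mgsnode (given once per
# node) records both potential MGS locations. Device paths are placeholders.

mkfs.lustre --mgs \
    --failnode=10.3.0.231@o2ib,192.168.50.131@tcp0 \
    /dev/mgs_disk

mkfs.lustre --mdt --fsname=lustre \
    --failnode=10.3.0.231@o2ib,192.168.50.131@tcp0 \
    --mgsnode=10.3.0.230@o2ib,192.168.50.130@tcp0 \
    --mgsnode=10.3.0.231@o2ib,192.168.50.131@tcp0 \
    /dev/mdt_disk
```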
Andreas Dilger
2008-May-03 02:30 UTC
[Lustre-discuss] mounting lustre in failover configuration
On Apr 29, 2008  11:55 +0200, Erich Focht wrote:
> I'm puzzled by the following behavior.
> An active-passive failover pair of metadata servers has separate MGS and
> MDT disks and two networks (o2ib and tcp0 (eth0)):
>
> mds1: 10.3.0.230@o2ib  192.168.50.130@tcp0
> mds2: 10.3.0.231@o2ib  192.168.50.131@tcp0
>
> MGS and MDT are formatted with the options:
> failover.node=10.3.0.231@o2ib,192.168.50.131@tcp0
> mgsnode=10.3.0.230@o2ib,192.168.50.130@tcp0,10.3.0.231@o2ib,192.168.50.131@tcp0
>
> The _first_ mount of an OST fails if mds1 is the active metadata server.
> It succeeds when mds2 is active.
>
> With client mounts I have seen something similar. I could mount clients
> with
> mount -t lustre 10.3.0.231@o2ib:10.3.0.230@o2ib:/lustre /mnt/lustre
> but not with
> mount -t lustre 10.3.0.230@o2ib:10.3.0.231@o2ib:/lustre /mnt/lustre
> when mds1 was the active MDS. This suggests that the active MDS has to be
> the last one on the list.
>
> Strangely enough, in my current lab setup I can no longer reproduce the
> client mount behavior.
>
> Did anybody else see this kind of behavior? Are there any reasons for this?

I believe there is a bug open on this already - the problem is that the parsing of the "failover" line in mount.lustre is broken: it re-uses the same buffer to parse all of the MDS NIDs, so the last one wins. I can't find the bug number offhand, but I believe there was a patch for it already.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
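[Editor's note: the "same buffer re-used, last one wins" failure mode Andreas describes can be illustrated with a small hypothetical sketch. This is not the actual mount.lustre code; it just models parsing a colon-separated NID list into a single variable, which discards every NID but the last.]

```shell
#!/bin/sh
# Hypothetical model of the bug: parse a failover NID list into ONE
# buffer instead of accumulating each NID. Each iteration overwrites
# the previous value, so only the final NID survives.

nids="10.3.0.230@o2ib:10.3.0.231@o2ib"

buf=""
IFS=':'
for nid in $nids; do
    buf="$nid"    # same buffer reused; earlier NIDs are lost here
done
unset IFS

echo "$buf"       # only the last NID remains
```

This would explain the symptom in the original post: with mds1 listed first, only mds2's NID survives parsing, so the mount only works when mds2 happens to be the active server.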