Hi,

I don't understand how Lustre decides which addresses to use. I run Linux-HA (heartbeat) with DRBD. The addresses fail over, the DRBD disk fails over and resyncs, and the mount works (including loading and unloading the Lustre modules). The problem comes when I want everything to work together.

My current modprobe options:

    options lnet networks=tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50

These are the errors I get:

    LustreError: 10f-e: Error parsing 'networks="tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50"'
    LustreError: 110-0: here...............................|---------|
    LustreError: 4527:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed

(along with a bunch of other errors, since the module does not load)

The addresses are available at the time I mount Lustre, but it still fails. I've tried tcp0(eth0:0), which fails with about the same error, and tcp0(eth0,eth1), which works but gives me the wrong addresses (the machine's own).

Have I missed anything here? Do I really need dedicated interfaces that are always active, or is there some way to set the NIDs to the aliased addresses? I do not want to fail the machine addresses over to another server. I'm all out of ideas.

--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
On Tue, 2008-09-23 at 15:06 +0200, Timh Bergström wrote:
> My (current) modprobe:
>
> options lnet networks=tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50

This syntax is incorrect. For some examples of multi-homed configurations, see the manual at
http://manual.lustre.org/manual/LustreManual16_HTML/MoreComplicatedConfigurations.html#50642998_20213

> This is the errors i get:
> LustreError: 10f-e: Error parsing
> 'networks="tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50"'

When you specify "networks", you name the interfaces to use, so you don't need to specify the IP address. I think you are confusing the networks and ip2nets options.

> I've tried with tcp0(eth0:0) which fails with about the same error,
> i've tried tcp0(eth0,eth1) which gives me the wrong addresses (machine
> ones) but works.

What is the topology exactly? Are there two NICs, or one NIC with two addresses? Are the two NICs on the same physical network or on separate physical networks?

b.
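As a hedged illustration of the point in the reply above: the networks option names LNET networks and interfaces only, and each NID is then derived from the address already configured on that interface. The sketch below only builds and prints a corrected options line for the two interfaces from the question. Where the line belongs (for example /etc/modprobe.conf) depends on the distribution and is an assumption here.

```shell
#!/bin/sh
# Sketch only: build the lnet options line from interface names, with no IP
# addresses after the parentheses (addresses are not part of this syntax).
NET_A='tcp0(eth0)'   # first LNET network, bound to eth0
NET_B='tcp1(eth1)'   # second LNET network, bound to eth1
LNET_OPTS="options lnet networks=${NET_A},${NET_B}"

# Print the line for review; copy it into the modprobe configuration by hand.
echo "$LNET_OPTS"
```

After the modules are reloaded, `lctl list_nids` lists the NIDs LNET actually picked. With this line they would be the static machine addresses on eth0 and eth1, not the heartbeat aliases.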
2008/9/23 Brian J. Murrell <Brian.Murrell at sun.com>:

Hi again, and thanks for the quick reply!

> This syntax is incorrect. For some examples of multi-homed
> configurations see the manual at
> http://manual.lustre.org/manual/LustreManual16_HTML/MoreComplicatedConfigurations.html#50642998_20213

Yes, that's the link I've been consulting; perhaps I'm not looking hard enough.

> When you specify "networks" because you specify the interfaces to use,
> you don't need to specify the ip address. I think you are confusing the
> networks and ip2nets options.

The problem is exactly that the physical interfaces are there, but not with the IP addresses I want the MDT to "listen" on (the NIDs). Those are added later by heartbeat as aliases (IPaddr2::10.4.21.50 IPaddr2::10.4.22.50), but before the MDT resource (DRBD) is mounted.

> What is the topology exactly? Are there two nics or one nic with two
> addresses? Are the two nics on the same physical network or separate
> physical networks?

eth0 and eth1 are physical interfaces with statically assigned IPs (for management, supervision, etc.). Heartbeat then adds addresses to these two interfaces if the node is "primary".

If it matters: eth0 and eth1 have separate physical paths to everything, because we want to survive a physical network failure before failing over to another physical server.

As I read the manual, I format my OSTs with more than one --mgsnode option, which makes the OST "know" about both paths to the MDS/MGS server(s). That is, if the first MGS path does not work (physical network failure on side A), it tries the second (physical side B).

What we health-check is the data/disks/server hardware, which tells heartbeat to fail over to server 2. Server 2 then takes over network path A and network path B (on 10.4.[21,22].50), and the OSTs/clients should continue working without noticing.

--
Timh Bergström
Note that you do not normally use IP takeover with Lustre/Heartbeat: you set the failover addresses with the mkfs.lustre command, and Lustre reconnects to the _other_ address when it is disconnected.

In your case, you would have two fixed addresses for each node (without heartbeat; do NOT use the heartbeat virtual IP addresses), and specify both of those failover NIDs (rather than just one).

Lustre 1.6 differs from what many HA/Heartbeat users expect: Lustre _knows_ about the multiple paths/addresses, and simply requires Heartbeat to ensure the target is mounted on exactly one node in the failover pair. It does NOT rely on IP takeover for HA.

Kevin Van Maren
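A minimal sketch of what this advice could look like when formatting an OST, assuming two MGS/MDS nodes that each keep two fixed addresses. All addresses (10.4.21.51, 10.4.22.51, 10.4.21.52, 10.4.22.52), the filesystem name and the device path are hypothetical placeholders, not values from this thread. The script only prints the command so it can be reviewed before anyone runs it.

```shell
#!/bin/sh
# Hypothetical fixed (non-heartbeat) NIDs; one comma-separated pair per node.
MGS1_NIDS='10.4.21.51@tcp0,10.4.22.51@tcp1'   # node 1: eth0 path, eth1 path
MGS2_NIDS='10.4.21.52@tcp0,10.4.22.52@tcp1'   # node 2: eth0 path, eth1 path
FSNAME='testfs'      # placeholder filesystem name
DEVICE='/dev/sdX'    # placeholder OST block device

# One --mgsnode per server in the failover pair, each listing both of its NIDs.
MKFS_CMD="mkfs.lustre --fsname=${FSNAME} --ost --mgsnode=${MGS1_NIDS} --mgsnode=${MGS2_NIDS} ${DEVICE}"

# Dry run: print instead of executing, since mkfs.lustre destroys existing data.
echo "$MKFS_CMD"
```

With the addresses recorded on the target this way, Heartbeat is left with a single job: making sure the MGS/MDT device is mounted on exactly one of the two nodes.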
Thank you, that's the path I've taken since the last message on this list, as I had misunderstood some of the DRBD/HA setups before. However, is using four mgsnode "paths" recommended, or should I use one mgsnode path per node and keep the other for some sort of manual failover?

Regards,
Timh

2008/9/23 Kevin Van Maren <Kevin.Vanmaren at sun.com>:
> In your case, you would have 2 fixed addresses for each node (w/o heartbeat
> - do NOT use the heartbeat virtual IP addresses), and specify both those
> failover NIDs (rather than just 1).
To follow up on this matter: I've now set up HA/DRBD as suggested, formatted the OSTs with two mgsnode directives, and mounted on the clients with both addresses, as ip1@tcp0:ip2@tcp1:/fsname. However, if I fail MGS/MDT 1, it does not recover (in a reasonable time). What kinds of tuning/settings affect this?

//Timh

2008/9/23 Timh Bergström <timh.bergstrom at diino.net>:
> However, using 4 mgsnode "paths", is that recommended or should I use one
> mgspath per node and use the other as some sort of manual failover?
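For reference, a hedged sketch of how a client mount specification is usually assembled when the MGS has a failover partner. In the mount.lustre syntax as documented, NIDs that belong to the same server are separated by commas and alternative servers are separated by colons; this is worth checking against the manual for the release in use. The addresses, filesystem name and mount point below are hypothetical, and the script prints the command without running it.

```shell
#!/bin/sh
# Hypothetical fixed NIDs: comma joins the two network paths of ONE server.
NODE1='10.4.21.51@tcp0,10.4.22.51@tcp1'
NODE2='10.4.21.52@tcp0,10.4.22.52@tcp1'
FSNAME='testfs'            # placeholder filesystem name
MOUNTPOINT='/mnt/testfs'   # placeholder client mount point

# Colon joins the two servers of the failover pair.
MGSSPEC="${NODE1}:${NODE2}"
MOUNT_CMD="mount -t lustre ${MGSSPEC}:/${FSNAME} ${MOUNTPOINT}"

# Dry run: print the command for review.
echo "$MOUNT_CMD"
```

If ip1 and ip2 in the message above are two interfaces of the same server, the colon form would present them as two separate servers. Whether that contributes to the slow recovery described is not established in this thread.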
Hi Timh,

If you're using Linux-HA, you can configure how quickly failover takes place. I have mine set to 90 seconds before the primary is marked dead and the secondary takes over.

When this occurs, any Lustre transactions not yet in flight will block until the ones that were in progress at the time of the failure have either had a chance to complete or have timed out.

I'm not sure how to modify Lustre-specific settings for recovery time, though.

cheers,
Klaus

On 9/25/08 1:54 PM, "Timh Bergström" <timh.bergstrom at diino.net> did etch on stone tablets:
> though, if i fail mgs/mdt 1 it does not recover (in a resonable time),
> what kinds of tuning/settings will affect this?
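The 90-second figure maps naturally onto Heartbeat's deadtime directive. A sketch, assuming a Heartbeat v1-style ha.cf: only the deadtime value comes from the message above, and the other values are illustrative assumptions. The script prints the fragment for review and writes nothing to disk.

```shell
#!/bin/sh
# Build an ha.cf timing fragment as text; nothing is installed or modified.
HA_CF=$(cat <<'EOF'
# seconds between heartbeat packets (assumed value)
keepalive 2
# warn about late heartbeats after this many seconds (assumed value)
warntime 30
# declare the peer dead after 90 seconds of silence (value from the thread)
deadtime 90
# longer grace period right after boot (assumed value)
initdead 120
EOF
)

echo "$HA_CF"
```

A shorter deadtime makes the secondary take over sooner, but it also raises the risk of a false failover when the primary is only briefly unresponsive.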
Hi Klaus,

Thanks. The Linux-HA setup is fairly complete, I think. The problem is the Lustre timeout: the clients do not try the "other" address fast enough. Those are the Lustre timeouts I want to change, along with some recovery options.

OT: Where can I read more about "recovery" in Lustre? I've heard words like replay/recovery in some discussions here, and I'm not sure I know 100% what these mean from a Lustre point of view (the words themselves are crystal clear ;-)). It seems my manual is too old.

Regards,
Timh

2008/9/25 Klaus Steden <klaus.steden at thomson.net>:
> I'm not sure how to modify Lustre-specific settings for recovery time,
> though.
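On the Lustre side, the main knob in the 1.6 series is the global timeout exposed at /proc/sys/lustre/timeout, and the server-side recovery state can be read from recovery_status files under /proc/fs/lustre. The sketch below only reads these files if they exist. The helper name show_file is made up for the example, and the exact paths should be treated as assumptions to verify on the installed release.

```shell
#!/bin/sh
# show_file: hypothetical helper; print a proc file if readable, else say so.
show_file() {
    if [ -r "$1" ]; then
        echo "$1: $(cat "$1")"
    else
        echo "$1: not available"
    fi
}

# Global Lustre timeout in seconds (present once the modules are loaded).
show_file /proc/sys/lustre/timeout

# Recovery state of the MDT, meaningful on the MDS only. If the glob matches
# nothing, the literal pattern is reported as not available.
for f in /proc/fs/lustre/mds/*/recovery_status; do
    show_file "$f"
done
```

A lower timeout makes clients give up on a dead server sooner, but it can also cause spurious timeouts on a busy system. The 1.6 manual describes a persistent change made on the MGS with lctl conf_param (of the form fsname.sys.timeout=40); that too is worth verifying before use.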