My understanding of setting up fail-over is you need some control over
the power so with a script it can turn off a machine by cutting its
power? Is this correct? Is there a way to do fail-over without having
access to the PDU (power strips)?

Thanks
David

--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
On Mon, 2010-08-09 at 12:45 -0500, David Noriega wrote:
> My understanding of setting up fail-over is you need some control over
> the power so with a script it can turn off a machine by cutting its
> power? Is this correct? Is there a way to do fail-over without having
> access to the PDU (power strips)?

Lustre failover in and of itself does not require power control. We do,
however, recommend having power control to prevent double mounts.

Assume that node1 and node2 both serve ost1, and that at a given moment
node1 is active and has it mounted. If node2 thinks that node1 is dead
and wants to take over ost1, and its procedure for doing so dictates
that it MUST power off node1 before it can mount ost1, then you are
guaranteed (to the limit of the reliability of the power control) that
node1 and node2 won't both mount ost1 at the same time, yes? This is
true even if node1 was perfectly functional (and still has the ost
mounted) but node2's determination that node1 was down was faulty.

Without power control, there is a risk that node2 mounts ost1 while
node1 still has it mounted -- MMP aside. MMP is a good belt to have
with your power control suspenders. :-) Since a double-mount has such
serious consequences, you cannot do too much to prevent it.

b.
On Aug 9, 2010, at 11:45 AM, David Noriega <tsk133 at my.utsa.edu> wrote:
> My understanding of setting up fail-over is you need some control over
> the power so with a script it can turn off a machine by cutting its
> power? Is this correct?

It is the recommended configuration because it is simple to understand
and implement. But the only _hard_ requirement is that both nodes can
access the storage.

> Is there a way to do fail-over without having
> access to the PDU (power strips)?

If you have IPMI support, that can be used for power control, instead
of a switched PDU. Depending on the storage, you may be able to do
resource fencing of the disks instead of STONITH. Or you can run
fast-and-loose, without any way to ensure the dead node is really
"dead" and not accessing storage (at your risk). While Lustre has MMP,
it is really more to protect against a mount typo than to guarantee
resource fencing.
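For concreteness, the kind of IPMI power control being described here is
usually driven with ipmitool against the node's BMC/ILOM; the BMC address,
user, and password below are placeholders, not values from this thread:

  # Query and control a node's power state over the LAN:
  ipmitool -I lanplus -H 192.168.6.100 -U admin -P secret chassis power status
  ipmitool -I lanplus -H 192.168.6.100 -U admin -P secret chassis power off
  ipmitool -I lanplus -H 192.168.6.100 -U admin -P secret chassis power on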
Could you describe this resource fencing in more detail? As regards
STONITH, the PDU already has the grubby hands of IT plugged into it,
and I doubt they would be happy if I unplugged them. What about the
network management port or ILOM?

On Mon, Aug 9, 2010 at 1:08 PM, Kevin Van Maren
<Kevin.Van.Maren at oracle.com> wrote:
> If you have IPMI support, that can be used for power control, instead
> of a switched PDU. Depending on the storage, you may be able to do
> resource fencing of the disks instead of STONITH.
David Noriega wrote:
> Could you describe this resource fencing in more detail? As regards
> STONITH, the PDU already has the grubby hands of IT plugged into it,
> and I doubt they would be happy if I unplugged them. What about the
> network management port or ILOM?

Resource fencing is needed to ensure that a node does not take over a
resource (i.e., an OST) while the other node is still accessing it (as
could happen if the node only partly crashes, where it is not
responding to the HA package but is still writing to the disk).

STONITH is a pretty common way to ensure the other node is dead and can
no longer access the resource. If you can't use your switched PDU, then
using the ILOM for IPMI-based power control works. The other common way
to do resource fencing is to use SCSI reserve commands (if supported by
the hardware and the HA package) to ensure exclusive access.

Kevin
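As a rough illustration of the SCSI-reservation style of fencing mentioned
above (assuming the storage honours SCSI-3 persistent reservations and
sg3_utils is installed; the device path and reservation keys are
placeholders), the idea looks something like this. In practice an HA fence
agent such as fence_scsi drives this rather than hand-run commands:

  # Each node registers its own key with the shared LUN:
  sg_persist --out --register --param-sark=0xaaa /dev/mapper/ost1
  # The active node takes an exclusive reservation on the LUN:
  sg_persist --out --reserve --param-rk=0xaaa --prout-type=1 /dev/mapper/ost1
  # Check who currently holds the reservation:
  sg_persist --in --read-reservation /dev/mapper/ost1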
I think I'll go the IPMI route. So reading on STONITH, it's just a
script, so all I would need is a script to run ipmi that tells the
server to power off, right?

Also, while reading through the Lustre manual, it seems some things are
being deleted from the wiki:
http://wiki.lustre.org/index.php?title=Clu_Manager no longer exists,
and I noticed this too when I found the Lustre quick start guide is no
longer available.

Thanks
David
Depends on the HA package you are using. Heartbeat comes with a script
that supports IPMI.

The important thing is that stonith NOT succeed if you don't _know_
that the node is off. So it is absolutely not a 1-line script.

Kevin

David Noriega wrote:
> I think I'll go the IPMI route. So reading on STONITH, it's just a
> script, so all I would need is a script to run ipmi that tells the
> server to power off, right?
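To see which fencing plugins a Heartbeat installation actually ships, and
which parameters a given plugin expects, the stonith(8) utility from
cluster-glue/heartbeat can be queried (assuming it is installed;
external/ipmi is the plugin referred to above):

  # List the available stonith plugin types:
  stonith -L
  # Show the parameters the external/ipmi plugin expects:
  stonith -t external/ipmi -n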
Another question: Is it possible to use CentOS/Red Hat's clustering
software? In the manual it mentions using that for metadata failover
(since having more than one metadata server online isn't possible right
now), so why not use that for all of Lustre? But since the information
is missing, can someone fill in the blanks on setting up metadata
failover?

David
On 8/10/2010 12:03 PM, David Noriega wrote:
> Also, while reading through the Lustre manual, it seems some things are
> being deleted from the wiki:
> http://wiki.lustre.org/index.php?title=Clu_Manager no longer exists,
> and I noticed this too when I found the Lustre quick start guide is no
> longer available.

Lustre quick start guide:
http://www.filibeto.org/sun/lib/blueprints/820-7390.pdf
On 8/10/2010 12:20 PM, David Noriega wrote:
> Another question: Is it possible to use CentOS/Red Hat's clustering
> software?

The main issue, IMHO, is that Lustre today uses the physical
hostname/IP for all of the MDS, OSS, MGS, etc., whereas cluster SW uses
a VIP, so some work needs to be done to make a VIP work for Lustre.

my 2c
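The physical-address point can be seen directly on a server: Lustre
identifies nodes by NID, which is derived from the interface address rather
than from a cluster-managed virtual IP (lctl is part of the standard Lustre
utilities; the sample output is roughly what OSS1 in this thread would show):

  # On an OSS, list the NIDs Lustre is using:
  lctl list_nids
  # e.g.  192.168.5.100@tcp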
On Tuesday, August 10, 2010, Kevin Van Maren wrote:
> Depends on the HA package you are using. Heartbeat comes with a script
> that supports IPMI.

For our installations we even use a modified external/ipmi_ddn stonith
script that uses power-off/status/on to make sure the system is really
reset. The heartbeat/pacemaker script uses the ipmi reset method by
default, but ipmi commands are not required by the spec to succeed. So
ipmitool (used by external/ipmi) might return successfully, but that
does not in any way ensure the node was really reset. I have seen that
rather often in real life already. The default script also supports the
power-off/on method, but it does not check the status either.

So our modified script first powers off, then checks that the node is
really offline, then powers on again, and only then returns
successfully. Unfortunately, that comes at the cost of an increased
fail-over time, as power-off and then power-on need some minimal
downtime in between (ca. 30s), and heartbeat's/pacemaker's stonith does
not support async events (power-off would be sufficient, but once
stonith successfully returns, it is not called again until the next
fencing).

--
Bernd Schubert
DataDirect Networks
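A standalone sketch of that power-off / verify / power-on logic (not the
actual external/ipmi_ddn plugin; the BMC address and credentials are
placeholders) could look roughly like this:

  #!/bin/sh
  # Fence a peer node via its BMC, and only report success once the
  # power state is confirmed off.
  BMC="-I lanplus -H 192.168.6.101 -U admin -P secret"

  ipmitool $BMC chassis power off || exit 1

  # Wait until the BMC actually reports the chassis as off.
  for i in $(seq 1 30); do
      if ipmitool $BMC chassis power status | grep -q "is off"; then
          ipmitool $BMC chassis power on
          exit 0            # node was verifiably down; safe to fail over
      fi
      sleep 2
  done
  exit 1                    # could not confirm power-off: do NOT fail over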
So your script resets the server so there is no fail-over (i.e., the
other server takes over resources from that server)? Or there is
fail-over, but you then manually return resources back to the server
that was reset?

On Tue, Aug 10, 2010 at 1:39 PM, Bernd Schubert
<bs_lists at aakef.fastmail.fm> wrote:
> So our modified script first powers off, then checks that the node is
> really offline, then powers on again, and only then returns
> successfully.
On Tuesday, August 10, 2010, David Noriega wrote:
> So your script resets the server so there is no fail-over (i.e., the
> other server takes over resources from that server)? Or there is
> fail-over, but you then manually return resources back to the server
> that was reset?

Our ddn ipmi stonith script (external/ipmi_ddn in heartbeat/pacemaker
stonith terms) only makes absolutely sure the node was really reset. If
something fails, an error code is reported to pacemaker, and then
pacemaker (*) will not initiate resource fail-over, in order to prevent
split-brain.

As Lustre devices use MMP (multiple-mount protection), that is not
strictly required, in principle. But if something goes wrong, e.g. MMP
was accidentally not enabled, a double mount could come up and that
would cause serious filesystem and data corruption...

Cheers,
Bernd

PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should*
accept stonith error codes, but I have seen it more than once that
heartbeat-v1 ran into split-brain and started resources on both cluster
nodes. That is something where pacemaker does a much better job.

--
Bernd Schubert
DataDirect Networks
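If you want to verify rather than assume that MMP is enabled on a target,
the feature is visible in the backing ldiskfs superblock (dumpe2fs is from
e2fsprogs; the device path below is just an example):

  # "mmp" should appear in the feature list of the backing filesystem:
  dumpe2fs -h /dev/lustre-ost1-dg1/lv1 2>/dev/null | grep -i features
  # mkfs.lustre sets mmp when the target is formatted with a failover node.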
I would recommend the heartbeat with pacemaker setup for the fail-over
control. The configuration may seem complex at the beginning, but after
enough reading (and there are many good sources) it is quite easy to
set up. I have recently set up a Lustre system with 3 OSSs and two MDSs
(DRBD with LVM between them) working as a single HA cluster, and it was
easy enough. Pacemaker allows a single point of administration of the
Lustre system (starting and stopping the filesystem), and there is a
neat GUI for those who want to show something to their managers :)

Best regards,

Wojciech

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
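To give a flavour of what that looks like in pacemaker, a minimal sketch for
one OST plus IPMI fencing might be configured as below. The resource and
node names, device path, mount point, and BMC credentials are all
placeholders, and the parameter names are assumed to follow the stock
external/ipmi plugin:

  # One Lustre OST as a Filesystem resource, preferring oss1:
  crm configure primitive ost1 ocf:heartbeat:Filesystem \
      params device=/dev/lustre-ost1-dg1/lv1 directory=/mnt/ost1 fstype=lustre \
      op monitor interval=120s timeout=300s
  crm configure location ost1-prefers-oss1 ost1 100: oss1

  # STONITH for oss1 via its BMC/ILOM, never run on oss1 itself:
  crm configure primitive fence-oss1 stonith:external/ipmi \
      params hostname=oss1 ipaddr=192.168.6.100 userid=admin passwd=secret interface=lanplus \
      op monitor interval=60s
  crm configure location fence-oss1-not-on-oss1 fence-oss1 -inf: oss1
  crm configure property stonith-enabled=true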
Ok, I've gotten heartbeat set up with the two OSSs, but I do have a
question that isn't answered in the documentation: shouldn't the Lustre
mounts be removed from fstab once they are given to heartbeat, since
when it comes online it will mount the resources, correct?

David
David Noriega wrote:
> Ok, I've gotten heartbeat set up with the two OSSs, but I do have a
> question that isn't answered in the documentation: shouldn't the Lustre
> mounts be removed from fstab once they are given to heartbeat, since
> when it comes online it will mount the resources, correct?

Yes: on the servers, they must either not be there or be marked
"noauto". Once you start running heartbeat, you have given control of
the resource away, and must not mount/umount it yourself (unless you
stop heartbeat on both nodes in the HA pair to get control back).

Kevin
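As a concrete illustration (device path, mount point, and node name are
placeholders), the fstab entry on an OSS and the matching heartbeat v1
haresources line might look like:

  # /etc/fstab on the OSS: present for reference, never auto-mounted
  /dev/lustre-ost1-dg1/lv1  /mnt/ost1  lustre  noauto  0 0

  # /etc/ha.d/haresources: heartbeat mounts the OST on the preferred node
  oss1 Filesystem::/dev/lustre-ost1-dg1/lv1::/mnt/ost1::lustre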
Some info:
MDS/MGS                   192.168.5.104
Passive failover MDS/MGS  192.168.5.105
OSS1                      192.168.5.100
OSS2                      192.168.5.101

I've got some more questions about setting up failover. Besides having
heartbeat set up, what about using tunefs.lustre to set options?

On the MDS/MGS I set the following options:
tunefs.lustre --failnode=192.168.5.105 /dev/lustre-mdt-dg/lv1
Heartbeat works just fine; I can mount on the primary node and then
fail over to the other and back.

Now on the OSSs things get a bit more confusing. Reading these two blog
posts:
http://mergingbusinessandit.blogspot.com/2008/12/implementing-lustre-failover.html
http://jermen.posterous.com/lustre-mds-failover

From these I tried these options:
tunefs.lustre --erase-params --mgsnode=192.168.5.104@tcp0
--mgsnode=192.168.5.105@tcp0 --failover=192.168.5.101@tcp0
-write-params /dev/lustre-ost1-dg1/lv1

I ran that for all four OSTs, changing the failover option on the last
two OSTs to point to OSS1 while the first two point to OSS2.

My understanding is that you mount the OSTs first, then the MDS, but
the OSTs are failing to mount. Are all these options needed? Or is
simply specifying the primary MDS enough for it to find out about the
second MDS?

David
Oops, somehow I changed the target name of all OSTs to lustre-OST0000,
and trying to mount any other OST fails. I've gone and found the 'More
Complicated Configuration' section, which details the usage of
--mgsnode=nid1,nid2, and so using this I think I'll just reformat.
Hi David,

You need to umount your OSTs and MDTs and run
tunefs.lustre --writeconf /dev/<lustre device> on all Lustre OSTs and
MDTs. This will force the Lustre targets to fetch a new configuration
the next time they are mounted. The order of mounting is:
MGT -> MDT -> OSTs

Best regards,

Wojciech
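Spelled out against the devices David posted earlier in the thread (the
mount points are placeholders), that regenerate-and-remount sequence would
look roughly like:

  # 1. Unmount everything: clients, then OSTs, then MDT/MGS.
  # 2. Regenerate the configuration logs on every target:
  tunefs.lustre --writeconf /dev/lustre-mdt-dg/lv1        # on the MDS
  tunefs.lustre --writeconf /dev/lustre-ost1-dg1/lv1      # on each OSS, per OST
  # 3. Remount in order: MGS/MDT first (combined here), then the OSTs:
  mount -t lustre /dev/lustre-mdt-dg/lv1 /mnt/mdt         # on the MDS
  mount -t lustre /dev/lustre-ost1-dg1/lv1 /mnt/ost1      # on each OSS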
That is good to know, but I already started formatting. No issues, as
it hasn't been put into production; I'm just playing with it and
working kinks like this out. Though formatting the OSTs was rather
quick while the MDT is taking some time. Is this normal? 192.168.5.105
is the other (standby) MDS node.

[root@meta1 ~]# mkfs.lustre --reformat --fsname=lustre --mgs --mdt
--failnode=192.168.5.105@tcp0 /dev/lustre-mdt-dg/lv1

   Permanent disk data:
Target:     lustre-MDTffff
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x75
            (MDT MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: failover.node=192.168.5.105@tcp mdt.group_upcall=/usr/sbin/l_getgroups

device size = 2323456MB
2 6 18
formatting backing filesystem ldiskfs on /dev/lustre-mdt-dg/lv1
        target name  lustre-MDTffff
        4k blocks    594804736
        options      -J size=400 -i 4096 -I 512 -q -O dir_index,extents,uninit_groups,mmp -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff -J size=400 -i 4096 -I 512 -q -O dir_index,extents,uninit_groups,mmp -F /dev/lustre-mdt-dg/lv1 594804736

David
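For reference, formatting the targets from scratch with both MGS NIDs and a
failover partner baked in (using the addresses from earlier in the thread;
the OST device path is one of David's examples, and the mount layout is
otherwise an assumption) could look roughly like this:

  # Combined MGS/MDT on the primary MDS, with the standby as failover partner:
  mkfs.lustre --reformat --fsname=lustre --mgs --mdt \
      --failnode=192.168.5.105@tcp0 /dev/lustre-mdt-dg/lv1

  # An OST on OSS1, told about both MGS NIDs and its failover partner OSS2:
  mkfs.lustre --reformat --fsname=lustre --ost \
      --mgsnode=192.168.5.104@tcp0,192.168.5.105@tcp0 \
      --failnode=192.168.5.101@tcp0 /dev/lustre-ost1-dg1/lv1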