Howdy, The Lustre manual recommends heartbeat for handling failover. Pacemaker is the successor of heartbeat version 2. So what's recommended - should we be using pacemaker or stick to heartbeat? - CS.
Carlos Santana wrote:
> The Lustre manual recommends heartbeat for handling failover. Pacemaker is the successor of heartbeat version 2. So what's recommended - should we be using pacemaker or stick to heartbeat?
We haven't tried pacemaker - it should work fine, as lustre failover is simple shared storage. We'd be interested in hearing about any experiences. cliffw
It is very difficult to find relevant documentation for heartbeat 1/2. I just finished configuring a heartbeat system and would not recommend it because of the documentation. (They seem to have removed portions of the heartbeat documentation from the site.) Pacemaker is not a simple solution to configure either. I played briefly with the RH clustering software. It does not directly support any FS type other than the basic ext2/ext3, and wasn't happy with a lustre type. -- Andrew
We recently put heartbeat v1 in production and along the way developed some admin scripts including heartbeat resource agent compliant lustre init scripts, a script to initiate failover/failback and get detailed status, a powerman stonith interface, and various safeguards to ensure MMP is on, devices are present and usable, etc. before starting lustre. If this is of general interest I could post it to a bug for review. Jim
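For readers wondering what pre-start safeguards of the kind Jim mentions might involve, here is a minimal sketch. It is not his script: the function names are hypothetical, and it assumes the header text comes from `dumpe2fs -h <device>` and that MMP shows up as `mmp` on the `Filesystem features:` line.

```python
import os
import stat


def is_block_device(path: str) -> bool:
    """Return True if path exists and is a block device."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        # Missing or inaccessible path: treat the device as unusable.
        return False


def has_mmp(dumpe2fs_header: str) -> bool:
    """Return True if the 'Filesystem features:' line lists mmp.

    Assumption: the text is the header printed by `dumpe2fs -h`.
    """
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return "mmp" in value.split()
    return False


def safe_to_start(device: str, dumpe2fs_header: str) -> bool:
    """Hypothetical gate: refuse to start unless both checks pass."""
    return is_block_device(device) and has_mmp(dumpe2fs_header)
```

A resource script could call `safe_to_start()` before mounting the target and refuse to start if it returns False, so a node never mounts a target that lacks multiple-mount protection.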
Were you able to get monitoring working to detect network failures (pingd)? I have it configured, but haven't been able to get it to trigger a failover when an MDS cannot ping the network. (I tried with 1.0 and 2.0 conf files; I am currently using 2.0.) I have a ticket open with the pacemaker project (no ticket system for the HA stuff...) but no resolution. I am considering writing a script to down the node when the ping fails, but don't like the idea. I would also like to get the hpingd functioning to detect a fiber failure, but there was less available on that solution. -- Andrew
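The "script to down the node when the ping fails" idea could be sketched as below. This is only an illustration under stated assumptions: the class and function names are hypothetical, `ping -c 1 -W <seconds>` assumes the Linux iputils ping, and the action to take (standby, stopping heartbeat, and so on) is deliberately left to the caller. Requiring several consecutive failed rounds is one way to avoid reacting to a brief outage such as a peer rebooting.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping; True if a reply arrived.

    Assumption: Linux iputils ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class ConnectivityWatchdog:
    """Report loss of connectivity only after several failed rounds."""

    def __init__(
        self,
        hosts: Iterable[str],
        threshold: int = 3,
        probe: Callable[[str], bool] = ping_once,
    ) -> None:
        self.hosts = list(hosts)
        self.threshold = threshold
        self.probe = probe
        self.failed_rounds = 0

    def poll(self) -> bool:
        """Run one round; True means every host has been unreachable
        for `threshold` consecutive rounds."""
        if any(self.probe(host) for host in self.hosts):
            # At least one ping target answered: reset the counter.
            self.failed_rounds = 0
        else:
            self.failed_rounds += 1
        return self.failed_rounds >= self.threshold
```

A caller would run `poll()` on a timer and give up the resources only when it returns True. The `probe` parameter exists so the logic can be exercised without sending real pings.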
No. I originally did have it set up like this (a v1 ha.cf snippet):
    # One partner losing contact with both lnet routers or MDS triggers failover.
    #ping_group lnet-router 172.16.10.254 172.16.2.254
    #ping_group tycho-mds1 172.16.10.200 172.16.2.200
    #respawn hacluster /usr/lib64/heartbeat/ipfail
However, I ran into a problem when rebooting the MDS. Apparently if one partner re-establishes contact with the MDS before the other one, it immediately triggers failover. This is with heartbeat-2.1.4. Jim
Are you doing anything if the network fails to one MDS? How about if your fiber path fails?
On network failures: no. On fibre path failures: we configure ldiskfs with errors=panic, so fibre issues or other issues in the storage path will likely cause a panic and trigger failover. We're just getting started with failover so we elected to keep it simple for now. Jim
Hi Jim, It would be great if you can attach the scripts to a Lustre bugzilla bug. Cheers, _Atul
-- Atul Vidwansa, Sun Microsystems Australia Pty Ltd. Web: http://blogs.sun.com/atulvid Email: Atul.Vidwansa at Sun.COM
I have been able to get pingd working for interconnect failover using Heartbeat V2.1.3. It works for OSTs, MDTs and the MGS. You will need lines like this in your ha.cf file, 2 pingd devices shown, but 1 will do:
    ping 172.31.80.240
    ping 172.31.64.1
    respawn root /usr/lib64/heartbeat/pingd -m 200 -d 5s -p /var/run/pingd.pid -h 172.31.80.240 -h 172.31.64.1
You also need to add pingd rsc_location rules to your cib.xml file within the constraints section as shown below, one for each Lustre filesystem:
    <rsc_location id="testfsmds_connected" rsc="testfsmds">
      <rule id="testfsmds_connected_rule" score_attribute="pingd">
        <expression id="testfsmds_connected_rule_expr" attribute="pingd" operation="defined"/>
      </rule>
    </rsc_location>
This has worked well for me for InfiniBand and 10GbE systems. HTH, Bob
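To make the effect of `-m 200` and `score_attribute="pingd"` in Bob's configuration more concrete, here is a rough model. It assumes the usual pingd behaviour, where each node's `pingd` attribute is the number of reachable ping nodes times the multiplier and the rule uses that value as the location score. The function names are hypothetical, and the model ignores stickiness and every other constraint the cluster would also weigh.

```python
from typing import Dict, Optional


def pingd_value(reachable: int, multiplier: int = 200) -> int:
    """Assumed pingd attribute: reachable ping nodes times the -m value."""
    return reachable * multiplier


def preferred_node(
    reachable_by_node: Dict[str, int], multiplier: int = 200
) -> Optional[str]:
    """Pick the node with the highest connectivity score.

    Simplification: a real cluster combines this score with
    stickiness and other constraints before placing a resource.
    """
    if not reachable_by_node:
        return None
    scores = {
        node: pingd_value(count, multiplier)
        for node, count in reachable_by_node.items()
    }
    return max(scores, key=lambda node: scores[node])
```

With two ping nodes configured, a server that reaches both would score 400 and one that reaches only one would score 200, so under this model the resource prefers the better-connected server.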
Lundgren, Andrew wrote:
> Pacemaker is not a simple solution to configure either. I played briefly with the RH clustering software. It does not directly support any FS type other than the basic ext2/ext3, and wasn't happy with a lustre type.
That might be simple to fix, if it is script-based. We submitted a patch aeons ago to the heartbeat guys to add 'ldiskfs' as a supported FS. As I recall, it was a one-line change. cliffw
Hi, OK I have posted it to https://bugzilla.lustre.org/show_bug.cgi?id=20165 (20165: scripts for heartbeat v1 integration). I added example config files from our test cluster. Probably best to redirect questions/comments/criticisms to the bug and I'll respond there. Jim
Jim Garlick wrote:
> OK I have posted it to https://bugzilla.lustre.org/show_bug.cgi?id=20165
Looks very good, thanks bunches. I've added a few extras from the discussion. Did you guys try ipfail, or only pingd? cliffw
On Tue, Jul 14, 2009 at 09:37:54AM -0700, Cliff White wrote:
> Did you guys try ipfail, or only pingd?
We tried ipfail (unsuccessfully), not pingd. I think pingd is a v2-only feature? Our work is entirely with v1, which seemed adequate and also much simpler to understand and get right.
I tried pingd in v2 and was unable to get it working. (I also tried it in pacemaker and ended up opening a ticket that we haven't finished working through yet.) The fiber ping is only around as a v1 tool as far as I can tell. Since I wasn't able to get normal ping to function, I never even tried the FC stuff.

> -----Original Message-----
> From: Jim Garlick [mailto:garlick at llnl.gov]
> Sent: Tuesday, July 14, 2009 12:39 PM
> To: Cliff White
> Cc: Atul Vidwansa; Lundgren, Andrew; lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] failover software - heartbeat
>
> We tried ipfail (unsuccessfully), not pingd.
> I think pingd is a v2 only feature? Our work is entirely with v1,
> which seemed adequate and also much simpler to understand and get right.
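For anyone following along who hasn't used ipfail or pingd: as I understand it, the idea common to both is to probe a few "ping nodes" (routers, gateways) and let a server that has lost connectivity hand its resources to a better-connected peer. Below is a rough Python sketch of that decision only. It is not how heartbeat or pacemaker implement it. The function names and the `gw1`/`gw2` hosts are made up for illustration, and the `ping -c 1 -W` flags assume Linux iputils.

```python
import subprocess


def reachable(host, timeout_s=1):
    """Return True if a single ICMP echo to host succeeds.

    Assumes the Linux iputils `ping` (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def count_reachable(ping_nodes, probe=reachable):
    """Count how many of the configured ping nodes answer the probe."""
    return sum(1 for host in ping_nodes if probe(host))


def should_give_up_resources(local_count, peer_count):
    """Hypothetical rule: fail over only if the peer sees strictly more ping nodes."""
    return local_count < peer_count


if __name__ == "__main__":
    # Hypothetical ping nodes; replace with real gateway addresses.
    nodes = ["gw1", "gw2"]
    print("reachable ping nodes:", count_reachable(nodes))
```

The strict comparison is deliberate: if both servers lose the same ping nodes (say, a shared switch dies), neither one should start moving Lustre targets around. The `probe` argument is there so the counting logic can be exercised without touching the network.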