Christopher Deneen
2009-Apr-22 20:11 UTC
[Lustre-discuss] Has anyone had experience with heartbeat and drdb providing full redundancy on lustre clusters
Trying to get a feel for whether it's worth investing time to implement.
Thomas Roth
2009-Apr-23 12:57 UTC
[Lustre-discuss] Has anyone had experience with heartbeat and drdb providing full redundancy on lustre clusters
Hi,

we are using Heartbeat+DRBD on our MGS/MDT.

DRBD works fine; we have been using it for backing up the MDT, upgrading the Lustre version, and of course for failover.

Heartbeat proves to be much trickier. Since our network is rather shaky, we are suffering from late heartbeats, Heartbeat trying to fail over for no apparent reason, Stonith for no good reason... In addition, if one umounts the MDT, the umount always starts only after a delay of 330 sec, and Heartbeat always gives up on the MDT resource after 20000 ms, no matter what I put into the cib.xml and no matter the Lustre timeouts. So one always needs to force the umount or, more likely, reboot, which is not a problem with a reliable Stonith procedure ;-)

Thomas

--
Thomas Roth, Department: Informationstechnologie, GSI Helmholtzzentrum für Schwerionenforschung GmbH, Planckstraße 1, D-64291 Darmstadt, www.gsi.de
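The mismatch described above (a 330 sec umount delay against a 20000 ms give-up time) can be checked mechanically. The following is a minimal Python sketch, not a fix: it parses a hypothetical CIB fragment and lists operations whose timeout is shorter than the observed umount delay. The resource id, op ids, and the 600s start timeout are assumptions for illustration; only the 20000 ms and 330 sec figures come from the message. Since Thomas reports that edits to cib.xml had no effect, a check like this would mainly confirm whether the configured values match what Heartbeat actually applies.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment, for illustration only. The ids and the start
# timeout are assumptions; the 20000ms stop value mirrors the message above.
CIB_FRAGMENT = """
<primitive id="lustre_mdt" class="ocf" provider="heartbeat" type="Filesystem">
  <operations>
    <op id="lustre_mdt_start" name="start" timeout="600s"/>
    <op id="lustre_mdt_stop" name="stop" timeout="20000ms"/>
  </operations>
</primitive>
"""


def to_seconds(value):
    """Convert a timeout such as '20000ms' or '600s' to seconds.

    Only explicit 'ms' and 's' suffixes are handled; anything else raises
    ValueError rather than guessing a default unit.
    """
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"timeout without explicit unit: {value!r}")


def short_timeouts(cib_xml, required_seconds):
    """Return (op name, timeout in seconds) for ops shorter than required."""
    root = ET.fromstring(cib_xml)
    flagged = []
    for op in root.iter("op"):
        timeout = op.get("timeout")
        if timeout is None:
            continue  # no explicit timeout configured on this op
        seconds = to_seconds(timeout)
        if seconds < required_seconds:
            flagged.append((op.get("name"), seconds))
    return flagged


UMOUNT_DELAY_S = 330  # umount delay reported in the message above
print(short_timeouts(CIB_FRAGMENT, UMOUNT_DELAY_S))
```

With the fragment as written, only the stop operation is flagged, because 20 seconds is far below the 330 seconds the umount needs before it even begins.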
Adam Gandelman
2009-Apr-23 21:05 UTC
[Lustre-discuss] Has anyone had experience with heartbeat and drdb providing full redundancy on lustre clusters
I've got a basic, fully redundant HA Lustre setup in the lab with heartbeat+drbd on both the MDS and the OSSs. So far everything appears to be working fine. In the coming weeks we'll be doing heavy testing and benchmarking and hopefully have some results to post here and/or to the Lustre wiki.

I did run into an unexpected split-brain in one configuration because of umount but, like Thomas said, that's where a good Stonith agent would come into play.

Cya,
Adam
Heiko Schröter
2009-Apr-24 05:59 UTC
[Lustre-discuss] Has anyone had experience with heartbeat and drdbd providing full redundancy on lustre clusters
On Thursday, 23 April 2009 14:57:34 Thomas Roth wrote:
> [...] Heartbeat always gives up on the resource MDT after 20000ms - no matter what I put into the cib.xml, no matter the Lustre timeouts.

We are using heartbeat (including stonith) and drbd for our MDT/MGS system. The procedure and setup have been posted to this list. The system works fine in our scenario and we haven't encountered any problems so far. Total storage capacity is 110 TB (51% full) with 8 OSS RAIDs.

Heartbeat is a beast when it comes to configuring all possible events. This might be your problem: you have to tell heartbeat _explicitly_ what to do when it recovers from _certain_ failovers. There are some values to look for in the config file.

Heiko
Christopher Deneen
2009-Apr-24 12:52 UTC
[Lustre-discuss] Has anyone had experience with heartbeat and drdbd providing full redundancy on lustre clusters
What about a redundant system for the OSSs (assuming some lower-level redundancy for the OSTs, like RAID)?