We are dipping our toes into the waters of Lustre HA using pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each). The four OSSs are broken out into two dual-active pairs running Lustre 1.8.5. Mostly, the water is fine, but we've encountered a few surprises.

1. An 8-client iozone write test in which we write 64 files of 1.7 TB each seems to go well - until the end, at which point iozone seems to finish successfully and begins its "cleanup". That is to say, it starts to remove all 64 large files. At this point, the ll_ost threads go bananas, consuming all available CPU cycles on all 8 cores of each server. This seems to block the corosync "totem" exchange long enough to initiate a "stonith" request.

2. We have found that re-mounting the OSTs, either via the HA agent or manually, often can take a *very* long time - on the order of four or five minutes. We have not figured out why yet. An strace of the mount process has not yielded much. The mount seems to just be waiting for something, but we can't tell what.

We are starting to adjust our HA parameters to compensate for these observations, but we hate to do this in a vacuum and wonder if others have also observed these behaviors and what, if anything, was done to compensate/correct?

Regards,

Charlie Taylor
UF HPC Center
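A sketch of what such a run looks like, assuming iozone's cluster mode driven by a client-list file; the record size, thread distribution, and file name below are illustrative guesses, not the exact command used:

    # clients.txt lists the 8 client nodes (repeated as needed, one line
    # per thread) with a working directory on the Lustre mount and the
    # path to the iozone binary on that node.
    iozone -i 0 -+m clients.txt -t 64 -s 1700g -r 1m
    # 64 write threads spread across the 8 clients, one 1.7 TB file per
    # thread; the "cleanup" at the end of the run is iozone unlinking
    # all 64 files at roughly the same time.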
We're investigating a Pacemaker HA setup here too, so I'm interested in your findings, and I hope I can help a little here.

1. So it seems like totem is not responding on some nodes but is still running on others, if those others take the initiative to stonith. I would investigate bumping up or adding some parameters in corosync.conf. Check out token, token_retransmit, and token_retransmits_before_loss_const (among others; I'm not sure what the complete answer is). They may help get you past spikes in load.

2. This sounds like normal OST recovery. It is taking that time to return the OSTs to a consistent state. Check out:
http://wiki.lustre.org/manual/LustreManual18_HTML/LustreRecovery.html

The Pacemaker page on wiki.lustre.org shows you how to deal with this:
http://wiki.lustre.org/index.php/Using_Pacemaker_with_Lustre

The op start and op stop timeouts should be set to 300 seconds to allow for the recovery process to complete.

It would be helpful to see your resource configuration file, as well as your corosync.conf.

Justin Miller
(812) 855-2719
jupmille at iu.edu
Indiana University - Research Technologies - Data Capacitor

On 5/4/11 1:05 PM, Charles Taylor wrote:
>
> We are dipping our toes into the waters of Lustre HA using
> pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each).
> The four OSSs are broken out into two dual-active pairs running Lustre
> 1.8.5. Mostly, the water is fine but we've encountered a few
> surprises.
>
> 1. An 8-client iozone write test in which we write 64 files of 1.7
> TB each seems to go well - until the end at which point iozone seems
> to finish successfully and begins its "cleanup". That is to say it
> starts to remove all 64 large files. At this point, the ll_ost
> threads go bananas - consuming all available cpu cycles on all 8 cores
> of each server. This seems to block the corosync "totem" exchange
> long enough to initiate a "stonith" request.
>
> 2. We have found that re-mounting the OSTs, either via the HA agent or
> manually, often can take a *very* long time - on the order of four or
> five minutes. We have not figured out why yet. An strace of the
> mount process has not yielded much. The mount seems to just be
> waiting for something but we can't tell what.
>
> We are starting to adjust our HA parameters to compensate for these
> observations but we hate to do this in a vacuum and wonder if others
> have also observed these behaviors and what, if anything, was done to
> compensate/correct?
>
> Regards,
>
> Charlie Taylor
> UF HPC Center
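For concreteness, minimal sketches of the two knobs mentioned above; the resource names, device paths, and values are assumptions chosen to show the shape, not a tested configuration.

In corosync.conf, the timing parameters live in the totem section:

    totem {
            version: 2
            # Longer token timeout and more retransmits before a node is
            # declared lost, to ride out load spikes (values are guesses):
            token: 10000
            token_retransmits_before_loss_const: 10
            # token_retransmit is normally derived from token and can be
            # left unset unless it needs an explicit override.
            # (other totem settings left at their defaults)
    }

And an OST resource in crm shell syntax, with the 300-second start/stop timeouts Justin mentions:

    primitive resOST0001 ocf:heartbeat:Filesystem \
            params device="/dev/mapper/ost0001" directory="/mnt/ost0001" fstype="lustre" \
            op start timeout="300s" \
            op stop timeout="300s" \
            op monitor interval="120s" timeout="60s"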
On May 4, 2011, at 10:05 AM, Charles Taylor wrote:
>
> We are dipping our toes into the waters of Lustre HA using
> pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each).
> The four OSSs are broken out into two dual-active pairs running Lustre
> 1.8.5. Mostly, the water is fine but we've encountered a few
> surprises.
>
> 1. An 8-client iozone write test in which we write 64 files of 1.7
> TB each seems to go well - until the end at which point iozone seems
> to finish successfully and begins its "cleanup". That is to say it
> starts to remove all 64 large files. At this point, the ll_ost
> threads go bananas - consuming all available cpu cycles on all 8 cores
> of each server. This seems to block the corosync "totem" exchange
> long enough to initiate a "stonith" request.

Running oprofile or profile.pl (possibly only included in SGI's respin of perfsuite; the original is at http://perfsuite.ncsa.illinois.edu/) is useful in situations where you have one or more threads consuming a lot of CPU. It should point to what function(s) the offending thread(s) are spending time in. From there, bugzilla/jira or the mailing list should be able to help further.

> 2. We have found that re-mounting the OSTs, either via the HA agent or
> manually, often can take a *very* long time - on the order of four or
> five minutes. We have not figured out why yet. An strace of the
> mount process has not yielded much. The mount seems to just be
> waiting for something but we can't tell what.

Could be bz 18456.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
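A minimal sketch of the oprofile workflow suggested above, assuming the opcontrol interface from oprofile 0.9.x as shipped with RHEL5/6-era distributions; the vmlinux path and sample window are assumptions, and a debuginfo vmlinux is needed for kernel symbol resolution:

    opcontrol --init
    opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
    opcontrol --start
    # ... reproduce the ll_ost CPU spike (e.g. the iozone cleanup phase) ...
    opcontrol --stop
    opreport --symbols --threshold=1   # top functions by sample count
    opcontrol --shutdown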
What was your conclusion? What is a good HA solution with Lustre? I am hoping SNS will be a big push for the next year.

On Wed, May 4, 2011 at 5:16 PM, Jason Rappleye <jason.rappleye at nasa.gov> wrote:
>
> On May 4, 2011, at 10:05 AM, Charles Taylor wrote:
>
>>
>> We are dipping our toes into the waters of Lustre HA using
>> pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each).
>> The four OSSs are broken out into two dual-active pairs running Lustre
>> 1.8.5. Mostly, the water is fine but we've encountered a few
>> surprises.
>>
>> 1. An 8-client iozone write test in which we write 64 files of 1.7
>> TB each seems to go well - until the end at which point iozone seems
>> to finish successfully and begins its "cleanup". That is to say it
>> starts to remove all 64 large files. At this point, the ll_ost
>> threads go bananas - consuming all available cpu cycles on all 8 cores
>> of each server. This seems to block the corosync "totem" exchange
>> long enough to initiate a "stonith" request.
>
> Running oprofile or profile.pl (possibly only included in SGI's respin of perfsuite, original is at http://perfsuite.ncsa.illinois.edu/) is useful in situations where you have one or more threads consuming a lot of CPU. It should point to what function(s) the offending thread(s) are spending time in. From there, bugzilla/jira or the mailing list should be able to help further.
>
>> 2. We have found that re-mounting the OSTs, either via the HA agent or
>> manually, often can take a *very* long time - on the order of four or
>> five minutes. We have not figured out why yet. An strace of the
>> mount process has not yielded much. The mount seems to just be
>> waiting for something but we can't tell what.
>
> Could be bz 18456.
>
> Jason
>
> --
> Jason Rappleye
> System Administrator
> NASA Advanced Supercomputing Division
> NASA Ames Research Center
> Moffett Field, CA 94035