Lisa Giacchetti
2010-Jul-01 16:29 UTC
[Lustre-discuss] best practice for lustre cluster startup
Hello,

I have recently installed a Lustre cluster which is in a test phase now but will potentially be in 24x7 production if it is accepted. I would like input from the list on what the recommendations/best practices are for configuring a Lustre cluster's startup.

Is it advisable to have Lustre on the various server pieces (MGS/MDT/OSSs) start automatically? If not, why not? If you try to start it and there is a very serious problem, will it abort the startup or just continue on blindly?

Again, this is going to need to be a 24x7 service for a compute facility that has global access (i.e. someone is always up and running something). We'd like to be able to at least get the service back up in an automated way if at all possible, and then debug problems when the support staff are awake/available.

Lisa Giacchetti
Kevin Van Maren
2010-Jul-01 17:17 UTC
[Lustre-discuss] best practice for lustre cluster startup
My (personal) opinion:

Lustre clients should always start (mount) automatically.

Lustre servers should have their services started through heartbeat (or another HA package) if failover is possible (be sure to configure stonith). If heartbeat starts automatically, do ensure auto-failback is NOT enabled: fail the resources back manually after you verify the rebooted server is healthy. Whether heartbeat itself starts automatically seems to be a matter of preference.

While unlikely, it is possible for an issue to cause Lustre to not start successfully, resulting in a node crash or some other problem that prevents a login. So if it does start automatically, you'll want to be prepared to reboot without Lustre (e.g. into single-user mode).

Kevin
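For illustration, a minimal heartbeat v1 sketch of the arrangement Kevin describes; the node names, devices, and mount points are invented placeholders, and any stonith entry must be filled in to match your own power-control hardware:

    # /etc/ha.d/ha.cf (excerpt)
    node            oss01 oss02
    auto_failback   off     # fail resources back by hand after checking the node
    # add a stonith_host entry for whatever power-control hardware you have,
    # so a hung peer is fenced before its targets are taken over

    # /etc/ha.d/haresources -- each OST is a Filesystem resource owned by one node
    oss01  Filesystem::/dev/mapper/ost0000::/mnt/ost0000::lustre
    oss02  Filesystem::/dev/mapper/ost0001::/mnt/ost0001::lustre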
Craig Prescott
2010-Jul-01 17:52 UTC
[Lustre-discuss] best practice for lustre cluster startup
Hi Lisa,

We don't start the services automatically on our servers. We don't have so many Lustre servers that this is a big problem (17 total), and it is pretty rare for one of them to go down unexpectedly.

If one of our Lustre server nodes does go down unexpectedly, we fsck the associated OSTs/MDT before starting up Lustre services again. I think you will want to do the same.

We do the fsck from the command line and look at the output. If there were no filesystem modifications (this is the usual case), we then start the Lustre services interactively. If there were modifications from fsck, we'll generally fsck it again and verify there were no further modifications. If 'fsck -f -p' fails, we'll fsck interactively or just go whole hog and 'fsck -f -y'.

I imagine you could achieve an "automated startup following failure" at least most of the time with an init script that does an 'fsck -f -p' on the associated OSTs/MDT if the node is coming back up from a crash or power outage. If fsck made no modifications, your init script could mount the storage. If 'fsck -f -p' bails out, you might send out an "I need help" email or something.

Cheers,
Craig Prescott
UF HPC Center

We once ran a cluster with lustre
We bought from a guy named Buster
It ran for a year
with nary a tear
A complaint we could not muster
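As a rough, untested sketch of the init-script idea Craig describes (the device names and notification address are made up; adapt before use):

    #!/bin/sh
    # Preen-mode fsck each target; mount only if fsck made no changes at all.
    TARGETS="/dev/mapper/ost0000 /dev/mapper/ost0001"
    for dev in $TARGETS; do
        fsck -f -p "$dev"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            # no modifications: safe to bring the target back into service
            mount -t lustre "$dev" "/mnt/$(basename "$dev")"
        else
            # fsck fixed something or bailed out: leave it down and ask for help
            echo "fsck of $dev exited with status $rc; not mounting" | \
                mail -s "Lustre target on $(hostname) needs attention" admins@example.com
        fi
    done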
Robin Humble
2010-Jul-01 18:21 UTC
[Lustre-discuss] best practice for lustre cluster startup
On Thu, Jul 01, 2010 at 11:17:31AM -0600, Kevin Van Maren wrote:
> My (personal) opinion:
>
> Lustre clients should always start (mount) automatically.

yup

> Lustre servers should have their services started through heartbeat (or
> other HA package), if failover is possible (be sure to configure stonith).

IMHO that's a bad idea. Servers should not start automatically.

My objections to automated mount/failover are not Lustre related, but apply to all the layers underneath. As Kevin well knows, mptsas drivers can and do and have screwed up majorly, and I'm sure other drivers have too. md is far from smart, and disks break in such an infinite number of weird and wonderful ways that no driver or OS can reasonably be expected to deal with them all :-/

If you have the simple setup of singly-attached storage and a Lustre server just crashed, then why wouldn't it just crash again? We have had that happen. Automated startup seems silly in this case - especially if you don't know what the problem was to start with. The worst case is if the hardware started corrupting data and crashed the machine: is it really a good idea to reboot, remount, continue corrupting data some more, and then keep rebooting until dawn?

If you have a more elaborate Lustre setup with HA failover pairs then the above still applies, and additionally there are inherent races in both nodes of a pair trying to mount a set of disks if you do not have a third, impartial member participating in a failover quorum - not a common HA setup for Lustre, although it probably should be. If a software RAID is assembled on both machines at the same time because of an HA race, then it's likely data will be lost. Lustre MMP should save you from multi-mounting the OST, but obviously not from corruption if the underlying RAID is pre-trashed.

Overall, without diagnosing why a machine crashed, I fail to see how an automated reboot or failover can possibly be a safe course of action.

cheers,
robin
Andreas Dilger
2010-Jul-02 07:01 UTC
[Lustre-discuss] best practice for lustre cluster startup
On 2010-07-01, at 11:52, Craig Prescott <prescott at hpc.ufl.edu> wrote:
> We do the fsck from the command line and look at the output. If there
> were no filesystem modifications (this is the usual case), we then start
> the Lustre services interactively.

Note that if you are not running with writeback cache enabled on the disks, then you shouldn't have to run an fsck on the filesystems after a crash. That should only be needed if the storage is faulty, or if it is using writeback cache without mirroring and battery backup.

> If there were modifications from
> fsck, we'll generally fsck it again and verify there were no further
> modifications. If 'fsck -f -p' fails, we'll fsck interactively or just
> go whole hog and 'fsck -f -y'.

It's always a good idea to run fsck in a manner that logs the output, either under 'script' or a similar tool.

> I imagine you could achieve an "automated startup following failure" at
> least most of the time with an init script that does an 'fsck -f -p' on
> the associated OSTs/MDT if the node is coming back up from a crash or
> power outage.

Note that if you do this you should run fsck under the control of the HA manager, to avoid both nodes running fsck at the same time. The Lustre-patched e2fsck will refuse to do this if you have MMP enabled (which is done automatically if the Lustre filesystems are formatted with failover enabled, but can also be enabled manually afterward). Also note that if you are using software RAID or LVM, it should likewise only be configured under the control of the HA manager.

> We once ran a cluster with lustre
> We bought from a guy named Buster
> It ran for a year
> with nary a tear
> A complaint we could not muster

Awesome. :-)
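Two of the suggestions above as concrete commands; the device name is a placeholder, and 'tune2fs -O mmp' assumes an e2fsprogs build with MMP support (e.g. the Lustre-patched one):

    # keep a transcript of the fsck output for later review
    script -c "e2fsck -f -p /dev/mapper/ost0000" /var/log/e2fsck-ost0000.log

    # enable MMP after the fact on a target formatted without failover
    tune2fs -O mmp /dev/mapper/ost0000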
Peter Grandi
2010-Jul-03 21:02 UTC
[Lustre-discuss] best practice for lustre cluster startup
[ ... ]

>> We do the fsck from the command line and look at the output.
>> If there were no filesystem modifications (this is the usual
>> case), we then start the Lustre services interactively.

> Note that if you are not running with writeback cache enabled
> on the disks, then you shouldn't have to run an fsck on the
> filesystems after a crash.

This seems to me extremely bad advice, based on these rather extraordinarily optimistic assumptions:

> That should only be needed if the storage is faulty, or if it
> is using writeback cache without mirroring and battery backup.

This reminds me of the immortal statement "as far as we know, in our datacenter we never had an undetected error". How do you know whether "storage is faulty", or that none of the many other reasons why metadata can get corrupted ever happened?

'fsck' does metadata auditing and garbage collection, and a full scan, at least every now and then, is essential to give some confidence that no hidden problem has been eating the metadata. And if there is a way to at least sample-check data integrity (e.g. run 'gzip -t' on a subset of compressed files), I would run that periodically too.

Experience with storage systems induces distrust, never mind CERN's experiences:

http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

Admittedly "happy go lucky", as the investment banks have shown in the past several years with derivatives, can be a profitable strategy (until it blows up :->).

[ ... ]
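A possible form of the spot check Peter suggests; the path and sample size are arbitrary:

    # test-decompress a random sample of 100 gzip files and report failures
    find /lustre -name '*.gz' -type f | shuf -n 100 | \
        while read f; do gzip -t "$f" || echo "CORRUPT: $f"; done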
Andreas Dilger
2010-Jul-04 05:00 UTC
[Lustre-discuss] best practice for lustre cluster startup
On 2010-07-03, at 15:02, pg_lus at lus.for.sabi.co.UK wrote:
>> Note that if you are not running with writeback cache enabled
>> on the disks, then you shouldn't have to run an fsck on the
>> filesystems after a crash.
>
> This seems to me extremely bad advice, based on these rather
> extraordinarily optimistic assumptions:
>
>> That should only be needed if the storage is faulty, or if it
>> is using writeback cache without mirroring and battery backup.
>
> This reminds me of the immortal statement "as far as we know, in
> our datacenter we never had an undetected error".

I think my record speaks for itself in terms of advocating running fsck on filesystems on a regular basis. I think you are making assumptions about what my statement does or does not say. What it says is that you shouldn't need to run fsck after a crash, if the crash didn't involve, e.g., a RAID controller failure or the loss of a writeback cache. It doesn't say that you should never run fsck, and in fact I always recommend a full fsck in case of RAID failure or if the filesystem has detected inconsistencies.

My point was that if there are uptime requirements, then running a full fsck after an unplanned outage of one node is probably a bad use of time. It would be better to run a full fsck on ALL of the filesystems during scheduled maintenance windows, since they can be run in parallel and wouldn't take longer than fsck of a single node.

I have also written the lvcheck tool to run fsck on LVM snapshots via cron on a regular basis, so that you don't need to wait for a crash before finding out whether your hardware is faulty.

> a full scan, at least every now and then, is essential to give some
> confidence that no hidden problem has been eating the metadata.

I've been a staunch advocate among the ext4 developers for keeping the periodic fsck at mount time, to catch those sites that never fsck on their own. If that bothers people because of the unexpected delay in startup, I point them at the lvcheck script so they can check a snapshot and reset the fsck counters before they expire.
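The snapshot approach is roughly the following; this is only an outline of the idea, not the lvcheck tool itself, and the volume group, snapshot size, and names are invented:

    #!/bin/sh
    # check a read-only snapshot so the OST itself stays in service
    lvcreate -s -L 10G -n ost0000_check /dev/vg_oss/ost0000
    # -n: report problems without fixing anything; a snapshot of a live target
    # may show journal-replay noise, so interpret the output with care
    script -c "e2fsck -fn /dev/vg_oss/ost0000_check" /var/log/lvcheck-ost0000.log
    lvremove -f /dev/vg_oss/ost0000_check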