Alexander Oltu
2012-Jan-13 11:01 UTC
[Lustre-discuss] OSS and MDS resilience to power failures
Hi, We are going to have a Lustre setup with around 1000 clients. Due to power cabling constraints, we are not able to power the OSS and MDS servers from a UPS. The question, therefore, is: how resilient is Lustre to OSS and/or MDS power failures? We have quite a few thunderstorms here during the summer, with short power interruptions. I understand that ext4 is a journaling filesystem and should be more or less robust against power interruptions, but real-world practice can be different. I believe that some of you have experience running OSS and MDS servers without a UPS and can shed some light on this topic. Thank you, Alex.
Wojciech Turek
2012-Jan-13 11:13 UTC
[Lustre-discuss] OSS and MDS resilience to power failures
Make sure that you use RAID controllers with cache protected by battery backup and, if you use redundant controllers, that the cache mirroring feature is enabled. ldiskfs (ext4) should recover after power failures with no problems as long as the back-end storage recovers cleanly too. Best regards, Wojciech
Alexander Oltu
2012-Jan-13 12:04 UTC
[Lustre-discuss] OSS and MDS resilience to power failures
Wojciech, thanks! I forgot to mention that our disk storage will be on UPS. Alex. -- Alexander Oltu, System Engineer, Parallab, Uni Computing, HIB, Thormøhlensgt. 55, N-5008 Bergen, Norway, phone: +47 55584144
Wojciech Turek
2012-Jan-13 12:56 UTC
[Lustre-discuss] OSS and MDS resilience to power failures
In that case, just make sure that RAID controller cache coherency is enabled; this applies if you have shared storage with redundant controllers. Wojciech
Christopher J. Morrone
2012-Jan-20 23:29 UTC
[Lustre-discuss] OSS and MDS resilience to power failures
We have never had battery backup on our OSS nodes, and we have been successful in that mode. Years ago, powering off an OSS or MDS uncleanly was very dangerous. A lot of work went into fixing ext/ldiskfs, and we have been reasonably successful at surviving power outages in production for a few years now. In all of our development and testing work, unclean MDS/OSS power-offs are our standard practice (partly because cleanly shutting down an active server has historically been next to impossible...). So we very frequently validate that this is still reasonably safe. However, we have seen so many ext/ldiskfs bugs over the years that we have decided to make "fsck.ldiskfs -p" (a quicker "preen" fsck) standard practice at boot time before starting an MDT or OST. That at least provides us with a partial sanity check. Honestly, we would probably prefer to do a full fsck every time, but the time that takes is not acceptable as standard practice. I would warn that if you are using large LUNs, there may be a regression, for which we have just opened LU-1015: http://jira.whamcloud.com/browse/LU-1015 But evaluating that is still in the early stages. Chris
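The boot-time practice Chris describes (run "fsck.ldiskfs -p" before starting an MDT or OST) could be sketched roughly as below. This is only an illustration, not the script used at his site. It assumes that fsck.ldiskfs follows the usual e2fsck exit-code bitmask (0 = clean, 1 = errors corrected, higher bits = problems remain or a reboot is advised) and that the target is started with "mount -t lustre"; the function names, device path, and mount point are hypothetical.

```python
import subprocess

# Assumption: fsck.ldiskfs uses the e2fsck exit-code bitmask, where
# 0 = no errors and 1 = errors corrected; any other bit means the
# filesystem needs attention (or a reboot) before it is mounted.
FSCK_OK_MASK = 0x1


def preen_command(device):
    """Build the quick 'preen' check command mentioned in the thread."""
    return ["fsck.ldiskfs", "-p", device]


def safe_to_mount(exit_code):
    """True only if fsck reported a clean or fully repaired filesystem."""
    return (exit_code & ~FSCK_OK_MASK) == 0


def start_target(device, mountpoint, run=subprocess.run):
    """Preen the backing device, then start the MDT/OST by mounting it.

    'run' is injectable so the logic can be exercised without real devices.
    Returns False (and does not mount) if the preen check was not clean.
    """
    result = run(preen_command(device))
    if not safe_to_mount(result.returncode):
        return False
    run(["mount", "-t", "lustre", device, mountpoint], check=True)
    return True
```

A call such as start_target("/dev/sdb", "/mnt/ost0") (hypothetical paths) would leave the target unmounted whenever the preen pass cannot repair everything, which is the point at which an administrator would run a full fsck by hand.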