Hello list,

does anyone have more background info on what happened there?

Regards
Heiko


HLRN News
---------

Since Mon Aug 18, 2008 12:00 the HLRN-II complex Berlin has been open for users again.

During the maintenance it turned out that the Lustre file system holding the users' $WORK and $TMPDIR was damaged completely. The file system had to be reconstructed from scratch. All user data in $WORK are lost.

We hope that this event remains an exception. SGI apologizes for this event.

/Bka

=======================================================================
This is an announcement for all HLRN Users
Hi there,

I got the following background information from Juergen Kreuels at SGI:

"It turned out that a bad disk (which did NOT report itself as being bad) killed the Lustre file system, leading to data corruption because inode areas were on that disk. It was finally decided to remake the whole FS, and only during that action did we finally (after nearly 48 h) find that bad drive.

It had nothing to do with the Lustre FS itself. Lustre had been the victim of a HW failure on a RAID6 LUN."

I hope that this helps.

PJones

Heiko Schroeter wrote:
> does anyone have more background info on what happened there?
Oh damn, I'm always afraid of silent data corruption due to bad hard disks. We have already had this issue as well; fortunately we found the disk before taking the system into production.

Will lustre-2.0 use the ZFS checksum feature?

Thanks,
Bernd

On Wednesday 20 August 2008 19:08:34 Peter Jones wrote:
> "It turned out that a bad disk (which did NOT report itself as being bad)
> killed the Lustre file system, leading to data corruption because inode
> areas were on that disk. [...] Lustre had been the victim of a HW failure
> on a RAID6 LUN."

--
Bernd Schubert
Q-Leap Networks GmbH
On Wednesday, 20 August 2008 19:08:34, you wrote:
> I got the following background information from Juergen Kreuels at SGI [...]

Hello,

thank you very much for this info. Good to know that Lustre is not the cause. Not so good is that a silent disk failure can corrupt the whole file system, because we use plenty of RAIDs in our setup ...

Regards
Heiko
This is a big nasty issue, particularly for HPC applications where performance matters so much.

How does one even begin to benchmark the performance overhead of a parallel filesystem with checksumming? I am having nightmares over the ways vendors will try to play games with performance numbers.

My suspicion is that whenever a parallel filesystem with checksumming is available and works, all the end users will just turn it off anyway, because the applications will run twice as fast without it, regardless of what the benchmarks say... leaving us back at the same problem.

On Wed, Aug 20, 2008 at 07:12:10PM +0200, Bernd Schubert wrote:
> Oh damn, I'm always afraid of silent data corruption due to bad hard disks.
> [...]
> Will lustre-2.0 use the ZFS checksum feature?

--
--------------------------------------------------------------------------
Troy Benjegerdes                 'da hozer'                 hozer at hozed.org

Someone asked me why I work on this free (http://www.gnu.org/philosophy/) software stuff and not get a real job. Charles Schulz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Schulz
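As a rough illustration of what a like-for-like measurement could look like, here is a minimal sketch that times the same streaming write on one client with wire checksums enabled and then disabled. Everything in it is an assumption for illustration: the mount point and file size are made up, and the per-OSC "checksums" tunable is the one Andreas Dilger describes later in this thread for Lustre 1.6.5 clients.

  #!/bin/sh
  # Hypothetical sketch: compare client write throughput with Lustre wire
  # checksums on vs. off. Assumes a Lustre client mounted at /mnt/lustre
  # (made-up path) and root access to the per-OSC /proc tunables.

  TESTFILE=/mnt/lustre/checksum_bench.$$

  set_checksums() {
      # 1 = enable, 0 = disable, applied to every OSC on this client
      for f in /proc/fs/lustre/osc/*/checksums; do
          echo "$1" > "$f"
      done
  }

  run_write() {
      # Write 4 GiB in 1 MiB chunks; dd prints the elapsed time and rate
      dd if=/dev/zero of="$TESTFILE" bs=1M count=4096 conv=fsync 2>&1 | tail -1
      rm -f "$TESTFILE"
  }

  echo "checksums ON:";  set_checksums 1; run_write
  echo "checksums OFF:"; set_checksums 0; run_write
  set_checksums 1   # put the safer default back

Note that a toggle like this only exercises the client-to-OST wire checksums, not end-to-end integrity from application buffer to disk, which is exactly the distinction raised further down in the thread.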
On Aug 21, 2008, at 10:22 AM, Troy Benjegerdes wrote:

> This is a big nasty issue, particularly for HPC applications where
> performance matters so much.
>
> How does one even begin to benchmark the performance overhead of a
> parallel filesystem with checksumming? I am having nightmares over the
> ways vendors will try to play games with performance numbers.

True.

> My suspicion is that whenever a parallel filesystem with checksumming is
> available and works, all the end users will just turn it off anyway,
> because the applications will run twice as fast without it, regardless
> of what the benchmarks say... leaving us back at the same problem.

I don't think this will be a problem. On current systems it may be the case that the checksummed filesystem becomes CPU bound. I think the OSTs will be bailed out by CPU speeds going up faster than disk speeds; you just need to limit the number of OSTs per OSS.

Where I could see it being a problem is on the client side. That assumes that writes and reads are competing with the application for cycles. So far on our clusters I see applications do either compute or I/O on a thread/rank, not both, freeing up the allocated CPUs for I/O. Then again, maybe I should ask our users why they don't do any async I/O. It probably depends.

My 2 cents.
On Aug 21, 2008 10:55 -0400, Brock Palen wrote:
> On Aug 21, 2008, at 10:22 AM, Troy Benjegerdes wrote:
>> How does one even begin to benchmark the performance overhead of a
>> parallel filesystem with checksumming? I am having nightmares over the
>> ways vendors will try to play games with performance numbers.
>
> True.

Actually, Lustre 1.6.5 does checksumming by default, and that is how we do our benchmarking. Some customers will turn it off because the overhead hurts them; new customers may not even notice it. Also, for many workloads the data integrity is much more important than the speed.

> I don't think this will be a problem. On current systems it may be the
> case that the checksummed filesystem becomes CPU bound. I think the OSTs
> will be bailed out by CPU speeds going up faster than disk speeds; you
> just need to limit the number of OSTs per OSS.

I agree that CPU speeds will almost certainly cover this in the future.

> Where I could see it being a problem is on the client side. That assumes
> that writes and reads are competing with the application for cycles. So
> far on our clusters I see applications do either compute or I/O on a
> thread/rank, not both, freeing up the allocated CPUs for I/O.

Yes, that is our experience also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> Actually, Lustre 1.6.5 does checksumming by default, and that is how
> we do our benchmarking. Some customers will turn it off because the
> overhead hurts them; new customers may not even notice it. Also, for
> many workloads the data integrity is much more important than the speed.

I went digging in CVS HEAD for 'checksum', and it wasn't clear to me whether this is end-to-end (from the file write all the way to disk) or just an option for network RPCs. Is there some design or architecture document on the checksumming? All I could find was some references to the kerberos5 RPC checksums.
Really? You sure? I just set up a new 1.6.5.1 filesystem this week:

[root@nyx003 ~]# cat /proc/fs/lustre/llite/nobackup-0000010037e27c00/checksum_pages
0

I am curious to test whether they were on. The throughput of my MPI_File_write() of a large file was less than I expected, but it looked like the OSTs were CPU bound (two x4500s).

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734) 936-1985

On Aug 21, 2008, at 2:59 PM, Andreas Dilger wrote:
> Actually, Lustre 1.6.5 does checksumming by default, and that is how
> we do our benchmarking. [...]
On Aug 21, 2008 15:59 -0400, Brock Palen wrote:
> Really? You sure? I just set up a new 1.6.5.1 filesystem this week:
>
> [root@nyx003 ~]# cat /proc/fs/lustre/llite/nobackup-0000010037e27c00/checksum_pages
> 0

This is for keeping checksums of the pages in client memory. It is off by default, but we've used it in the past when trying to diagnose memory corruption on the clients.

What you want to check is /proc/fs/lustre/osc/*/checksums. The file .../checksum_type allows changing the checksum type: either CRC32 (the only option for OSTs < 1.6.5) or Adler32 (the default if the OST supports it).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
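To tie the two answers together, here is a minimal sketch of how a client administrator might inspect these settings. The /proc paths are the ones named in this thread; the "adler" token in the last loop is an assumption, and the exact contents of checksum_type can differ between Lustre versions.

  # Debug-only option: checksum pages cached in client memory (off by default)
  cat /proc/fs/lustre/llite/*/checksum_pages

  # Per-OSC wire checksums: 1 = enabled (the 1.6.5 default), 0 = disabled
  cat /proc/fs/lustre/osc/*/checksums

  # Show the active checksum algorithm for each OSC
  cat /proc/fs/lustre/osc/*/checksum_type

  # Assumed syntax for switching every OSC on this client to Adler32;
  # the accepted token names may vary by version
  for f in /proc/fs/lustre/osc/*/checksum_type; do
      echo adler > "$f"
  done

Either way, these tunables appear to cover the client-to-OST transfers rather than an end-to-end, ZFS-style check, which is the open question in this thread.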