Hello list,

does anyone have more background info on what happened there?

Regards
Heiko


HLRN News
---------

Since Mon Aug 18, 2008 12:00 the HLRN-II complex Berlin has been open for users again.

During the maintenance it turned out that the Lustre file system holding the users' $WORK and $TMPDIR was damaged completely. The file system had to be reconstructed from scratch. All user data in $WORK are lost.

We hope that this event remains an exception. SGI apologizes for this event.

/Bka

=======================================================================
This is an announcement for all HLRN Users
Hi there,

I got the following background information from Juergen Kreuels at SGI:

"It turned out that a bad disk (which did NOT report itself as being bad) killed the Lustre file system, leading to data corruption because inode areas were on that disk. It was finally decided to remake the whole FS, and only during that action did we finally (after nearly 48 h) find that bad drive.

It had nothing to do with the Lustre FS itself. Lustre had been the victim of a HW failure on a RAID6 LUN."

I hope that this helps.

PJones

Heiko Schroeter wrote:
> does anyone have more background info on what happened there?
Oh damn, I'm always afraid of silent data corruption due to bad hard disks. We have already had this issue as well; fortunately we found the disk before taking the system into production.

Will lustre-2.0 use the ZFS checksum feature?

Thanks,
Bernd

On Wednesday 20 August 2008 19:08:34 Peter Jones wrote:
> "It turned out that a bad disk (which did NOT report itself as being bad)
> killed the Lustre file system, leading to data corruption because inode
> areas were on that disk. [...] Lustre had been the victim of a HW failure
> on a RAID6 LUN."

--
Bernd Schubert
Q-Leap Networks GmbH
On Wednesday, 20 August 2008 19:08:34, you wrote:
> I got the following background information from Juergen Kreuels at SGI [...]

Hello,

thank you very much for this info. Good to know that Lustre is not the cause. Not so good is that a silent disk failure can corrupt the whole file system, because we use plenty of RAIDs in our setup ...

Regards
Heiko
This is a big nasty issue, particularly for HPC applications where performance matters so much.

How does one even begin to benchmark the performance overhead of a parallel filesystem with checksumming? I am having nightmares over the ways vendors will try to play games with performance numbers.

My suspicion is that whenever a parallel filesystem with checksumming is available and works, all the end users will just turn it off anyway, because the applications will run twice as fast without it, regardless of what the benchmarks say... leaving us back at the same problem.

On Wed, Aug 20, 2008 at 07:12:10PM +0200, Bernd Schubert wrote:
> Oh damn, I'm always afraid of silent data corruption due to bad hard disks.
> [...]
> Will lustre-2.0 use the ZFS checksum feature?

--
--------------------------------------------------------------------------
Troy Benjegerdes                 'da hozer'                 hozer at hozed.org

Someone asked me why I work on this free (http://www.gnu.org/philosophy/) software stuff and not get a real job. Charles Schulz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Schulz
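As a rough illustration of what a like-for-like measurement could look like, here is a minimal sketch that times the same streaming write on one client with wire checksums enabled and then disabled. Everything in it is an assumption for illustration: the mount point and file size are made up, and the per-OSC "checksums" tunable is the one Andreas Dilger describes later in this thread for Lustre 1.6.5 clients.

  #!/bin/sh
  # Hypothetical sketch: compare client write throughput with Lustre wire
  # checksums on vs. off. Assumes a Lustre client mounted at /mnt/lustre
  # (made-up path) and root access to the per-OSC /proc tunables.

  TESTFILE=/mnt/lustre/checksum_bench.$$

  set_checksums() {
      # 1 = enable, 0 = disable, applied to every OSC on this client
      for f in /proc/fs/lustre/osc/*/checksums; do
          echo "$1" > "$f"
      done
  }

  run_write() {
      # Write 4 GiB in 1 MiB chunks; dd prints the elapsed time and rate
      dd if=/dev/zero of="$TESTFILE" bs=1M count=4096 conv=fsync 2>&1 | tail -1
      rm -f "$TESTFILE"
  }

  echo "checksums ON:";  set_checksums 1; run_write
  echo "checksums OFF:"; set_checksums 0; run_write
  set_checksums 1   # put the safer default back

Note that a toggle like this only exercises the client-to-OST wire checksums, not end-to-end integrity from application buffer to disk, which is exactly the distinction raised further down in the thread.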
On Aug 21, 2008, at 10:22 AM, Troy Benjegerdes wrote:

> This is a big nasty issue, particularly for HPC applications where
> performance matters so much.
>
> How does one even begin to benchmark the performance overhead of a
> parallel filesystem with checksumming? I am having nightmares over the
> ways vendors will try to play games with performance numbers.

True.

> My suspicion is that whenever a parallel filesystem with checksumming is
> available and works, all the end users will just turn it off anyway,
> because the applications will run twice as fast without it, regardless
> of what the benchmarks say... leaving us back at the same problem.

I don't think this will be a problem. On current systems it may be the case that the checksummed filesystem becomes CPU bound. I think the OSTs will be bailed out by CPU speeds going up faster than disk speeds; you just need to limit the number of OSTs per OSS.

Where I could see it being a problem is on the client side. That assumes that writes and reads are competing with the application for cycles. So far on our clusters I see applications do either compute or I/O on a thread/rank, not both, freeing up the allocated CPUs for I/O. Then again, maybe I should ask our users why they don't do any async I/O. It probably depends.

My 2 cents.
On Aug 21, 2008 10:55 -0400, Brock Palen wrote:
> On Aug 21, 2008, at 10:22 AM, Troy Benjegerdes wrote:
>> How does one even begin to benchmark the performance overhead of a
>> parallel filesystem with checksumming? I am having nightmares over the
>> ways vendors will try to play games with performance numbers.
>
> True.

Actually, Lustre 1.6.5 does checksumming by default, and that is how we do our benchmarking. Some customers will turn it off because the overhead hurts them; new customers may not even notice it. Also, for many workloads the data integrity is much more important than the speed.

> I don't think this will be a problem. On current systems it may be the
> case that the checksummed filesystem becomes CPU bound. I think the OSTs
> will be bailed out by CPU speeds going up faster than disk speeds; you
> just need to limit the number of OSTs per OSS.

I agree that CPU speeds will almost certainly cover this in the future.

> Where I could see it being a problem is on the client side. That assumes
> that writes and reads are competing with the application for cycles. So
> far on our clusters I see applications do either compute or I/O on a
> thread/rank, not both, freeing up the allocated CPUs for I/O.

Yes, that is our experience also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
> Actually, Lustre 1.6.5 does checksumming by default, and that is how
> we do our benchmarking. Some customers will turn it off because the
> overhead hurts them; new customers may not even notice it. Also, for
> many workloads the data integrity is much more important than the speed.

I went digging in CVS HEAD for 'checksum', and it wasn't clear to me whether this is end-to-end (from the file write all the way to disk) or just an option for network RPCs. Is there some design or architecture document on the checksumming? All I could find was some references to the kerberos5 RPC checksums.
Really? You sure? I just set up a new 1.6.5.1 filesystem this week:

[root@nyx003 ~]# cat /proc/fs/lustre/llite/nobackup-0000010037e27c00/checksum_pages
0

I am curious to test whether they were on. The throughput of my MPI_File_write() of a large file was less than I expected, but it looked like the OSTs were CPU bound (two x4500s).

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734) 936-1985

On Aug 21, 2008, at 2:59 PM, Andreas Dilger wrote:
> Actually, Lustre 1.6.5 does checksumming by default, and that is how
> we do our benchmarking. [...]
On Aug 21, 2008 15:59 -0400, Brock Palen wrote:
> Really? You sure? I just set up a new 1.6.5.1 filesystem this week:
>
> [root@nyx003 ~]# cat /proc/fs/lustre/llite/nobackup-0000010037e27c00/checksum_pages
> 0

This is for keeping checksums of the pages in client memory. It is off by default, but we've used it in the past when trying to diagnose memory corruption on the clients.

What you want to check is /proc/fs/lustre/osc/*/checksums. The file .../checksum_type allows changing the checksum type: either CRC32 (the only option for OSTs < 1.6.5) or Adler32 (the default if the OST supports it).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
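To tie the two answers together, here is a minimal sketch of how a client administrator might inspect these settings. The /proc paths are the ones named in this thread; the "adler" token in the last loop is an assumption, and the exact contents of checksum_type can differ between Lustre versions.

  # Debug-only option: checksum pages cached in client memory (off by default)
  cat /proc/fs/lustre/llite/*/checksum_pages

  # Per-OSC wire checksums: 1 = enabled (the 1.6.5 default), 0 = disabled
  cat /proc/fs/lustre/osc/*/checksums

  # Show the active checksum algorithm for each OSC
  cat /proc/fs/lustre/osc/*/checksum_type

  # Assumed syntax for switching every OSC on this client to Adler32;
  # the accepted token names may vary by version
  for f in /proc/fs/lustre/osc/*/checksum_type; do
      echo adler > "$f"
  done

Either way, these tunables appear to cover the client-to-OST transfers rather than an end-to-end, ZFS-style check, which is the open question in this thread.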