thr3ads.net - CentOS - [CentOS] S.M.A.R.T [Aug 2008]

Mag Gam

2008-Aug-30 08:08 UTC

[CentOS] S.M.A.R.T

At my physics lab we have 30 servers with 1TB disk packs. I am in need
of monitoring for disk failures. I have been reading about SMART and
it seems it can help. However, I am not sure what to look for if a
drive is about to fail. Any thoughts about this? Is anyone using this
method to predetermine disk failures?

TIA

Richard Karhuse

2008-Aug-30 08:57 UTC

[CentOS] S.M.A.R.T

On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam <magawake at gmail.com> wrote:
> At my physics lab we have 30 servers with 1TB disk packs. I am in need
> of monitoring for disk failures. I have been reading about SMART and
> it seems it can help. However, I am not sure what to look for if a
> drive is about to fail. Any thoughts about this? Is anyone using this
> method to predetermine disk failures?
>

Here are a few references from my archives w.r.t. SMART ...

Hope they help ...

-rak-

===
http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml

Google Releases Paper on Disk Reliability

"The Google engineers just published a paper on Failure Trends in a Large
Disk Drive Population <http://labs.google.com/papers/disk_failures.pdf>.
Based on a study of 100,000 disk drives over 5 years they find some
interesting stuff. To quote from the abstract: 'Our analysis identifies
several parameters from the drive's self monitoring facility (SMART) that
correlate highly with failures. Despite this high correlation, we conclude
that models based on SMART parameters alone are unlikely to be useful for
predicting individual drive failures. Surprisingly, we found that
temperature and activity levels were much less correlated with drive
failures than previously reported.'"

http://hardware.slashdot.org/hardware/07/02/21/004233.shtml

Everything You Know About Disks Is Wrong

"Google's wasn't the best storage paper at FAST '07
<http://www.usenix.org/events/fast07/>. Another, more provocative paper
looking at real-world results from 100,000 disk drives got the 'Best Paper'
award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures
in the real world: What does an MTTF of 1,000,000 hours mean to you?
<http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html>
The paper crushes a number of (what we now know to be) myths about disks
such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability
(spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good
summary of the paper's key points <http://storagemojo.com/?p=383>."

http://www.linuxjournal.com/article/6983?from=50&comments_per_page=50

Monitoring Hard Disks with SMART

By Bruce Allen <http://www.linuxjournal.com/user/801273> on Thu, 2004-01-01
02:00. SysAdmin <http://www.linuxjournal.com/taxonomy/term/8>

One of your hard disks might be trying to tell you it's not long for this
world. Install software that lets you know when to replace it.

It's a given that all disks eventually die, and it's easy to see why. The
platters in a modern disk drive rotate more than a hundred times per second,
maintaining submicron tolerances between the disk heads and the magnetic
media that store data. Often they run 24/7 in dusty, overheated
environments, thrashing on heavily loaded or poorly managed machines. So,
it's not surprising that experienced users are all too familiar with the
symptoms of a dying disk. Strange things start happening. Inscrutable kernel
error messages cover the console and then the system becomes unstable and
locks up. Often, entire days are lost repeating recent work, re-installing
the OS and trying to recover data. Even if you have a recent backup, sudden
disk failure is a minor catastrophe.

http://smartmontools.sourceforge.net/

smartmontools Home Page

Welcome! This is the home page for the smartmontools package.

Mag Gam

2008-Aug-30 10:07 UTC

[CentOS] S.M.A.R.T

Rak,

Thanks! The Google paper is intense. I was hoping to get some
practical examples, with commands or scripts, to better monitor my SMART
environment.
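
For readers looking for the same kind of practical starting point, here is a
minimal sketch in Python, not a definitive implementation. It assumes the
smartmontools package is installed and parses the attribute table printed by
`smartctl -A <device>`. The list of watched attributes is an assumption chosen
for illustration: reallocation and pending-sector counts are the kind of
parameters the Google paper reports as correlated with failure, but attribute
names vary by vendor and a clean result is no guarantee of a healthy drive.
The sample text is made up so the parser can be tried without a real disk.

```python
import subprocess

# Attributes whose non-zero raw counts are commonly treated as warning signs.
# This selection is an assumption for illustration, not a rule from the thread.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def find_warnings(attrs):
    """Return 'name=value' strings for watched attributes with a non-zero raw value."""
    return [f"{name}={attrs[name]}" for name in WATCHED if attrs.get(name, 0) > 0]


def check_device(device):
    """Run smartctl on one device (usually needs root) and return any warnings."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return find_warnings(parse_attributes(result.stdout))


# Made-up sample in the layout smartctl prints; not data from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   036   045   000    Old_age   Always       -       36 (0 15 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(find_warnings(parse_attributes(SAMPLE)))
```

On a real server you would call `check_device("/dev/sda")` (the device path is
an example) from cron and mail yourself any non-empty result. The smartd
daemon that ships with smartmontools can do this kind of periodic monitoring
and notification without a custom script.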



