At my physics lab we have 30 servers with 1TB disk packs. I am in need of monitoring for disk failures. I have been reading about SMART and it seems it can help. However, I am not sure what to look for if a drive is about to fail. Any thoughts about this? Is anyone using this method to predetermine disk failures? TIA
On Sat, Aug 30, 2008 at 4:08 AM, Mag Gam <magawake at gmail.com> wrote:
> At my physics lab we have 30 servers with 1TB disk packs. I am in need
> of monitoring for disk failures. I have been reading about SMART and
> it seems it can help. However, I am not sure what to look for if a
> drive is about to fail. Any thoughts about this? Is anyone using this
> method to predetermine disk failures?

Here are a few references from my archives w.r.t. SMART ...

Hope they help ...

-rak-

===

http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml

Google Releases Paper on Disk Reliability

"The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population <http://labs.google.com/papers/disk_failures.pdf>. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

http://hardware.slashdot.org/hardware/07/02/21/004233.shtml

Everything You Know About Disks Is Wrong

"Google's wasn't the best storage paper at FAST '07 <http://www.usenix.org/events/fast07/>. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? <http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html> The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points <http://storagemojo.com/?p=383>."

http://www.linuxjournal.com/article/6983?from=50&comments_per_page=50

Monitoring Hard Disks with SMART

By Bruce Allen <http://www.linuxjournal.com/user/801273> on Thu, 2004-01-01 02:00. SysAdmin <http://www.linuxjournal.com/taxonomy/term/8>

One of your hard disks might be trying to tell you it's not long for this world. Install software that lets you know when to replace it.

It's a given that all disks eventually die, and it's easy to see why. The platters in a modern disk drive rotate more than a hundred times per second, maintaining submicron tolerances between the disk heads and the magnetic media that store data. Often they run 24/7 in dusty, overheated environments, thrashing on heavily loaded or poorly managed machines. So, it's not surprising that experienced users are all too familiar with the symptoms of a dying disk. Strange things start happening. Inscrutable kernel error messages cover the console and then the system becomes unstable and locks up. Often, entire days are lost repeating recent work, re-installing the OS and trying to recover data. Even if you have a recent backup, sudden disk failure is a minor catastrophe.

http://smartmontools.sourceforge.net/

smartmontools Home Page

Welcome! This is the home page for the smartmontools package.
Rak,

Thanks! The Google paper is intense. I was hoping to get some practical usage with command or scripts to better monitor my SMART environment.

On Sat, Aug 30, 2008 at 4:57 AM, Richard Karhuse <rkarhuse at gmail.com> wrote:
> Here are a few references from my archives w.r.t. SMART ...
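
On the practical side asked about above, here is a minimal sketch, not a tested monitoring setup. It assumes the smartmontools package is installed and that `smartctl -A <device>` prints the usual ten-column attribute table. The watch list (`WATCHED`), the function names, and the sample text are all made up for illustration; which attributes matter, and at what values, varies by drive model, and per the Google abstract quoted above, SMART values alone are unlikely to predict every individual failure.

```python
import subprocess

# Attributes often treated as warning signs when their raw value is nonzero.
# This watch list is an assumption for illustration; adjust it for your drives.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Made-up sample text in the shape of `smartctl -A` output (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4821
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(report):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def warnings_for(report):
    """List watched attributes whose raw value is above zero."""
    attrs = parse_attributes(report)
    return [f"{name} = {attrs[name]}" for name in WATCHED if attrs.get(name, 0) > 0]


def check_device(device):
    """Run smartctl on one device (needs root and smartmontools installed)."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return warnings_for(result.stdout)
```

Calling `check_device("/dev/sda")` from a cron job on each server and mailing any non-empty result would be one way to use it; the device name is only an example. The smartd daemon that ships with smartmontools can do this kind of periodic checking and mailing on its own, which is probably the better route once you know which attributes you care about.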