Hi Folks,

I believe the word has gone around already: Google engineers have published a paper on disk reliability. It might supplement the ZFS FMA integration and, well, all the numerous debates on spares etc. over here.

To quote /.:

"The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

Link to the paper is http://labs.google.com/papers/disk_failures.pdf

This message posted from opensolaris.org
On 18/2/07 4:56, "Akhilesh Mritunjai" <mritun+opensolaris at gmail.com> wrote:
> Hi Folks
>
> I believe that the word would have gone around already, Google engineers have
> published a paper on disk reliability. It might supplement the ZFS FMA
> integration and well - all the numerous debates on spares etc etc over here.
>
> To quote /.
>
> "The Google engineers just published a paper on Failure Trends in a Large Disk
> Drive Population. Based on a study of 100,000 disk drives over 5 years they
> find some interesting stuff. To quote from the abstract: 'Our analysis
> identifies several parameters from the drive's self monitoring facility
> (SMART) that correlate highly with failures. Despite this high correlation, we
> conclude that models based on SMART parameters alone are unlikely to be useful
> for predicting individual drive failures. Surprisingly, we found that
> temperature and activity levels were much less correlated with drive failures
> than previously reported.'"
>
> Link to the paper is http://labs.google.com/papers/disk_failures.pdf

There was another similar paper (written at CMU) given at the same conference:

<http://www.cs.cmu.edu/~bianca/fast07.pdf>

Cheers,
Chris
Akhilesh Mritunjai wrote:
> I believe that the word would have gone around already, Google engineers have published a paper on disk reliability. It might supplement the ZFS FMA integration and well - all the numerous debates on spares etc etc over here.

Good paper. They validate the old saying, "complex systems fail in complex ways." We've also done some internal (Sun) studies which cast doubt on the ability of SMART to predict failures.

> To quote /.
>
> "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
>
> Link to the paper is http://labs.google.com/papers/disk_failures.pdf

As for the spares debate, that is easy: use spares :-)

-- richard
Richard Elling wrote:
> Akhilesh Mritunjai wrote:
>> I believe that the word would have gone around already, Google
>> engineers have published a paper on disk reliability. It might
>> supplement the ZFS FMA integration and well - all the numerous
>> debates on spares etc etc over here.
>
> Good paper. They validate the old saying, "complex systems fail in
> complex ways." We've also done some internal (Sun) studies which cast
> doubt on the ability of SMART to predict failures.

... which is why we were never really fans of turning it on.
Richard Elling <Richard.Elling at Sun.COM> wrote:
> > Link to the paper is http://labs.google.com/papers/disk_failures.pdf
>
> As for the spares debate, that is easy: use spares :-)

What they failed to mention is that you need to access the whole disk frequently enough to give SMART a chance to work.

Jörg

--
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
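To make the point about touching the whole disk concrete, here is a minimal Python sketch of a full sequential read pass. It is only an illustration, not something posted in this thread: the function name `scan_whole_device` and the chunk size are made up for the example, it relies on `os.pread` (Unix only), and reads that are served from the operating system's cache may never reach the platters, so a real scrub (for example a ZFS scrub) is likely the better tool.

```python
import os


def scan_whole_device(path, chunk_size=1 << 20):
    """Read every byte of `path` once and report where reads failed.

    Hypothetical helper for illustration. Returns (size_in_bytes, bad_offsets),
    where bad_offsets lists the start offset of each chunk that raised an
    I/O error. `path` may be a regular file or, with sufficient privileges,
    a block device.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end yields the size for files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                os.pread(fd, length, offset)
            except OSError:
                # Remember the failing chunk and carry on with the next one.
                bad_offsets.append(offset)
            offset += length
    finally:
        os.close(fd)
    return size, bad_offsets
```

Something like `size, bad = scan_whole_device("/dev/rdsk/...")` run from cron would exercise every sector on a schedule; the device path here is deliberately left elided.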
It turns out that even rather poor prediction accuracy is good enough to make a big difference (10x) in the failure probability of a RAID system. See Gordon Hughes & Joseph Murray, "Reliability and Security of RAID Storage Systems and D2D Archives Using SATA Disk Drives", ACM Transactions on Storage, Vol. 1, No. 1, December 2004, Pages 95-107.
Joerg Schilling wrote:
> What they missed to say is that you need to access the whole disk
> frequently enough in order to give SMART the ability to work.

I thought modern disks could be instructed to do "offline scanning", using any idle time available.

"""
[root at yolco video]# smartctl -a /dev/hda
...
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
...
"""

--
Jesus Cea Avion
jcea at argo.es  http://www.argo.es/~jcea/
jabber / xmpp:jcea at jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
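For anyone who wants to check this setting across many drives, the following is a small Python sketch that looks for the "Auto Offline Data Collection" line in `smartctl -a` output such as the one quoted above. The function names are hypothetical, the parsing assumes the report wording shown in that output, and `read_smart_report` assumes smartmontools is installed and that the caller has enough privileges to query the device.

```python
import re
import subprocess

# Matches the status line as it appears in the report quoted above;
# whitespace (including line breaks) may separate the label and the value.
_AUTO_OFFLINE = re.compile(r"Auto Offline Data Collection:\s*(Enabled|Disabled)")


def auto_offline_enabled(report):
    """Return True or False for the reported setting, or None if absent."""
    match = _AUTO_OFFLINE.search(report)
    if match is None:
        return None
    return match.group(1) == "Enabled"


def read_smart_report(device):
    """Run `smartctl -a DEVICE` and return its standard output as text.

    smartctl uses its exit status as a bit mask, so a non-zero status is
    not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# The relevant excerpt from the report posted above.
SAMPLE = """General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
"""
```

With this, `auto_offline_enabled(read_smart_report("/dev/hda"))` would report the setting for the drive from the example. If it comes back False, `smartctl -o on` on the device is the usual way to switch automatic offline testing on.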
Hello Jesus,

Wednesday, February 21, 2007, 5:54:35 AM, you wrote:

JC> Joerg Schilling wrote:
>> What they missed to say is that you need to access the whole disk
>> frequently enough in order to give SMART the ability to work.

JC> I thought modern disks could be instructed to do "offline scanning",
JC> using any idle time available.

It was also mentioned in the paper.

--
Best regards,
 Robert    mailto:rmilkowski at task.gda.pl
           http://milek.blogspot.com
Jesus Cea <jcea at argo.es> wrote:
> Joerg Schilling wrote:
> > What they missed to say is that you need to access the whole disk
> > frequently enough in order to give SMART the ability to work.
>
> I thought modern disks could be instructed to do "offline scanning",
> using any idle time available.
>
> """
> [root at yolco video]# smartctl -a /dev/hda
> ...
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection:
> Enabled.

For my Maxtor drives, I had to manually rewrite some bad blocks after about a year in use. There were unrecoverable read errors that went away completely after I wrote zeroes to the affected blocks.

Once I started writing every sector of a new disk, with the disk in its installed position and at operating temperature, I found that the disks no longer failed after a year.

Jörg
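The "write zeroes to the affected blocks" step can be sketched in Python as below. This is an illustration under stated assumptions rather than the procedure actually used above: `zero_blocks` and `demo_on_scratch_file` are hypothetical names, the 512-byte block size is an assumption about the drive, and the operation destroys whatever data is stored in the target blocks, so it should only be pointed at blocks whose contents are already lost or restorable from redundancy.

```python
import os
import tempfile


def zero_blocks(path, first_block, count, block_size=512):
    """Overwrite `count` blocks starting at `first_block` with zero bytes.

    Hypothetical helper for illustration. DESTRUCTIVE: the previous contents
    of the blocks are gone afterwards. Returns the number of bytes written.
    """
    if first_block < 0 or count <= 0 or block_size <= 0:
        raise ValueError("first_block must be >= 0, count and block_size > 0")
    zeros = bytes(block_size)
    with open(path, "r+b") as dev:
        dev.seek(first_block * block_size)
        for _ in range(count):
            dev.write(zeros)
        # Push the data out of Python's and the kernel's buffers.
        dev.flush()
        os.fsync(dev.fileno())
    return count * block_size


def demo_on_scratch_file():
    """Exercise zero_blocks on a throwaway file instead of a real disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.img")
        with open(path, "wb") as f:
            f.write(b"\xff" * 2048)  # four 512-byte blocks of 0xFF
        zero_blocks(path, first_block=1, count=2)
        with open(path, "rb") as f:
            return f.read()
```

`demo_on_scratch_file()` returns four blocks in which only the middle two have been zeroed, which is a safe way to check the offset arithmetic before touching a device.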