mcclnx mcc
2009-Aug-04 11:27 UTC
[CentOS] [Q} how can O.S. predicate a disk going to failure??
We have CentOS 4.x on a Dell server. One of its virtual disks consists of 4 disks configured as RAID 5 (plus one more disk as a hot spare). I saw this in /var/log/messages:

Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2094 Predictive Failure reported: Physical Disk 1:5 Controller 0, Connector 1
Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2051 Physical disk degraded: Physical Disk 1:5 Controller 0, Connector 1

I used DELL OPMN to check and found that "disk 1:5" is still "online", but with a predictive failure flagged.

I also used DELL OPMN to check the virtual disk, and it shows "online", not "degraded".

My questions are:

1. Is this disk really "degraded" or not?
2. How can the O.S. predict that a disk is going to fail?
3. Do I need to replace this disk now?

Thanks.
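For anyone who wants to watch for these events automatically, here is a minimal Python sketch that pulls the event ID, message, and disk location out of the two log lines quoted above. The regular expression is an assumption modelled only on those two lines; other Server Administrator messages may be laid out differently, and `parse_storage_event` is a hypothetical helper name.

```python
import re

# Pattern modelled on the two log lines quoted in the post above.
# Assumption: other storage events may not follow this exact layout.
EVENT_RE = re.compile(
    r"Storage Service EventID: (?P<event_id>\d+)\s+"
    r"(?P<message>[^:]+):\s+"
    r"Physical Disk (?P<disk>\d+:\d+)\s+"
    r"Controller (?P<controller>\d+), Connector (?P<connector>\d+)"
)


def parse_storage_event(line):
    """Return a dict describing the event, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["event_id"] = int(event["event_id"])
    return event


log_lines = [
    "Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2094 "
    "Predictive Failure reported: Physical Disk 1:5 Controller 0, Connector 1",
    "Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2051 "
    "Physical disk degraded: Physical Disk 1:5 Controller 0, Connector 1",
]

events = [parse_storage_event(line) for line in log_lines]
for event in events:
    print(event["event_id"], event["message"], "-> disk", event["disk"])
```

In practice the lines would be read from /var/log/messages rather than from a hard-coded list, and a match could trigger an alert.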
Mogens Kjaer
2009-Aug-04 11:38 UTC
[CentOS] [Q} how can O.S. predicate a disk going to failure??
On 08/04/2009 01:27 PM, mcclnx mcc wrote:
...
> 1. Is this disk really "degraded" or not?

Disks themselves aren't in degraded mode. The RAID system will run in degraded mode when the disk eventually fails.

> 2. How can the O.S. predict that a disk is going to fail?

The disk's SMART feature tells it so.

> 3. Do I need to replace this disk now?

That would be a good idea. The disk could fail in 5 minutes or in 5 months; you can't tell.

Mogens

-- 
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Mobile: +45 22 12 53 25
Email: mk at crc.dk Homepage: http://www.crc.dk
John Doe
2009-Aug-04 13:42 UTC
[CentOS] [Q} how can O.S. predicate a disk going to failure??
From: mcclnx mcc <mcclnx at yahoo.com.tw>
> 2. How can the O.S. predict that a disk is going to fail?

As I understand it, disks can tolerate a certain number of bad sectors thanks to some hidden spare space. When a sector fails, the disk marks it as bad and then maps it to a sector from the hidden space. This spare space is not infinite, so after a few months or years there may be no spare sectors left. At that point the disk says to the OS "I will soon fail!", as in "I am running out of spare sectors, so I won't be able to cope with further bad sectors".

Not sure if this is the case for all vendors...

JD
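The remapping idea described above can be sketched as a toy model. This is only an illustration of the reasoning, not how any particular drive firmware works: the `SparePool` class, the pool size, and the warning level are all hypothetical.

```python
class SparePool:
    """Toy model of a drive's reserve of spare sectors (all numbers hypothetical)."""

    def __init__(self, total_spares, warn_below):
        self.remaining = total_spares
        self.warn_below = warn_below
        self.remapped = {}  # bad sector -> index of the spare that replaced it

    def remap(self, bad_sector):
        """Map a failed sector to a spare; return False once the pool is empty."""
        if bad_sector in self.remapped:
            return True  # already remapped earlier, nothing more to do
        if self.remaining == 0:
            return False  # no spares left: the drive can no longer hide the error
        self.remaining -= 1
        self.remapped[bad_sector] = len(self.remapped)
        return True

    def predicts_failure(self):
        """True once the reserve has dropped below the warning level."""
        return self.remaining < self.warn_below


pool = SparePool(total_spares=8, warn_below=3)
for sector in (1200, 88, 40517, 9, 7731, 310):
    pool.remap(sector)
    print(sector, "remapped; spares left:", pool.remaining,
          "predictive failure:", pool.predicts_failure())
```

The point of the model is that the warning fires while spares still remain, which is why a drive can report a predicted failure and still be online.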
Kwan Lowe
2009-Aug-04 13:48 UTC
[CentOS] [Q} how can O.S. predicate a disk going to failure??
2009/8/4 mcclnx mcc <mcclnx at yahoo.com.tw>:
[snip]
> my questions are:
>
> 1. Is this disk really "degraded" or not?
>
> 2. How can the O.S. predict that a disk is going to fail?
>
> 3. Do I need to replace this disk now?

I understand that the drive electronics can check things such as the time it takes to read a sector, the number of failures per read, the number of retries before success, etc. This information gets processed and reported to the OS via SMART, as some others have replied.

But I really just want to say that within one day of getting SMART errors, my disk failed.
Brian Mathis
2009-Aug-04 13:58 UTC
[CentOS] [Q} how can O.S. predicate a disk going to failure??
Disks are cheap, your data is not. Replace the disk without hesitation.
mcclnx mcc wrote:
> 1. Is this disk really "degraded" or not?

Depends on your point of view; to me it would be. I remember two situations with "predictive" failure on HP Smart Arrays a few years ago where the drives were practically dead but the controller kept using them, dragging performance down by something like 90%. The drives were detected as about to fail, but there was no way to remove or disable the disk from the array remotely, so we had to send someone on site to yank the disk and force the array to rebuild. HP later said a firmware update should fix the issue; we never got around to upgrading before we migrated off those systems onto a real SAN.

> 2. How can the O.S. predict that a disk is going to fail?

In this case it's not the OS. It's the controller that keeps track of a bunch of internal counters on the disk and perhaps even scrubs it every so often. If the number of soft errors exceeds a threshold, it triggers the predictive failure logic.

> 3. Do I need to replace this disk now?

Based on my past experience, yes. For comparison, any enterprise storage array's support contract will trigger an immediate replacement if the array detects that condition.

nate
2009/8/4 mcclnx mcc <mcclnx at yahoo.com.tw>:
[snip]
> 2. How can the O.S. predict that a disk is going to fail?

There are several possibilities. The most likely are that the drive's predict-fail bit flipped, or that the number of sectors available for reallocation dropped below the manufacturer's recommended threshold. The following links have additional information on drive failures and on how to use SMART data to look at drive health:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
http://prefetch.net/articles/diskdrives.smart.html

Thanks,
- Ryan

-- 
http://prefetch.net
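As a rough illustration of the threshold check described above, the Python sketch below parses an ATA-style attribute table of the kind printed by `smartctl -A` and lists pre-fail attributes whose normalized value has reached the vendor threshold. Several things here are assumptions: the sample table is made up for illustration (it is not output from the server in this thread), SCSI/SAS drives report health differently and do not expose this table, and disks behind a RAID controller may need extra smartctl options to be reachable at all. The helper names `parse_attributes` and `failing_prefail` are hypothetical.

```python
# Made-up sample in the layout of `smartctl -A` for an ATA disk.
# The numbers are illustrative only, not real measurements.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 412
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
"""


def parse_attributes(text):
    """Parse attribute rows into dicts, skipping the header and blank lines."""
    attrs = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "prefail": fields[6] == "Pre-fail",
            "raw": fields[9],
        })
    return attrs


def failing_prefail(attrs):
    """Names of pre-fail attributes whose normalized value is at or below threshold."""
    return [a["name"] for a in attrs if a["prefail"] and a["value"] <= a["thresh"]]


attributes = parse_attributes(SAMPLE)
print("Pre-fail attributes at or below threshold:", failing_prefail(attributes))
```

To use it against a real disk, the text would come from running smartctl (for example through `subprocess`) instead of from the `SAMPLE` string.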