Matty
2007-Feb-11 17:56 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
Howdy,

On one of my Solaris 10 11/06 servers, I am getting numerous errors
similar to the following:

Feb 11 09:30:23 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:30:23 rx      Error for Command: write(10)    Error Level: Retryable
Feb 11 09:30:23 rx scsi:        Requested Block: 58458343       Error Block: 58458343
Feb 11 09:30:23 rx scsi:        Vendor: SEAGATE                 Serial Number: 0404A72YCG
Feb 11 09:30:23 rx scsi:        Sense Key: Hardware Error
Feb 11 09:30:23 rx scsi:        ASC: 0x19 (defect list error), ASCQ: 0x0, FRU: 0x2

Feb 11 09:32:18 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:32:18 rx      Error for Command: write(10)    Error Level: Retryable
Feb 11 09:32:18 rx scsi:        Requested Block: 58696759       Error Block: 58696501
Feb 11 09:32:18 rx scsi:        Vendor: SEAGATE                 Serial Number: 0404A72YCG
Feb 11 09:32:18 rx scsi:        Sense Key: Media Error
Feb 11 09:32:18 rx scsi:        ASC: 0xc (write error - auto reallocation failed), ASCQ: 0x2, FRU: 0x1

Assuming I am reading the error messages correctly, it looks like the
disk drive (c2t2d0) has used up all of the spare sectors used to
reallocate bad sectors. If this is the case, is there a reason Solaris
doesn't offline the drive? This would allow ZFS to evict the faulty
disk from my pool, and kick in the spare disk drive I have configured:

$ zpool status -v
  pool: rz2pool
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat Feb 10 18:46:54 2007
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL

errors: No known data errors

Thanks for any insight,
- Ryan

--
UNIX Administrator
http://prefetch.net
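A quick way to double-check that the sd1 instance in the syslog output
really is c2t2d0, and to see the cumulative error counters the drive has
racked up, is iostat's per-device error report. This is just a sketch;
the device name is taken from the zpool status output above:

    iostat -En c2t2d0

The report includes the vendor, product, and serial number, which can be
matched against the Serial Number field in the WARNING messages, along
with the soft/hard/transport and media error counts for the device.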
Joe Little
2007-Feb-11 21:17 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Matty <matty91 at gmail.com> wrote:
> Assuming I am reading the error messages correctly, it looks like the
> disk drive (c2t2d0) has used up all of the spare sectors used to
> reallocate bad sectors. If this is the case, is there a reason Solaris
> doesn't offline the drive? This would allow ZFS to evict the faulty
> disk from my pool, and kick in the spare disk drive I have configured.
[...]

We've seen the same thing with sil3124 and Marvell chipsets and SATA
drives, but when the errors come up the pool wedges and the hot spare is
never automatically utilized. I'm not sure whether FMA and friends
actually use the spare and start a resilver automatically. It appears to
still be a manual effort.
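One way to see how far the telemetry actually gets (a sketch, assuming a
stock Solaris 10 11/06 fault manager) is to compare the ereport log
against the list of diagnosed faults:

    fmdump -e        # one line per error report (ereport) fmd has received
    fmdump -eV       # the same reports in full detail, including the device path
    fmadm faulty     # resources fmd has actually diagnosed as faulty

If ereports keep arriving but fmadm faulty stays empty, fmd is receiving
the telemetry yet no diagnosis engine is retiring the disk, which matches
the behaviour described above.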
Robert Milkowski
2007-Feb-11 23:16 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,

Sunday, February 11, 2007, 6:56:14 PM, you wrote:

M> Howdy,

M> On one of my Solaris 10 11/06 servers, I am getting numerous errors
M> similar to the following:

AFAIK nothing has been integrated yet to do it. A hot spare will kick in
automatically only when ZFS can't open a device; other than that, you
are in manual mode for now.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
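In concrete terms, the manual procedure looks roughly like the following
for the pool in the original post. The device names are simply the ones
from the zpool status output, so treat this as a sketch rather than a
recipe:

    zpool replace rz2pool c2t2d0 c2t3d0   # attach the configured spare to the failing disk
    zpool status rz2pool                  # watch the resilver progress
    zpool detach rz2pool c2t2d0           # once resilvered, drop the bad disk, keep the spare

Detaching the original device after the resilver completes promotes the
spare to a permanent member of the raidz2 vdev; detaching the spare
instead would return it to the spares list.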
Matty
2007-Feb-12 00:44 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> AFAIK nothing has been integrated yet to do it. A hot spare will kick in
> automatically only when ZFS can't open a device; other than that, you
> are in manual mode for now.

Yikes! Does anyone from the ZFS / storage team happen to know when work
will complete to detect and replace failed disk drives? If hot spares
don't actually kick in to replace failed drives, is there any value in
using them?

Thanks,
- Ryan

--
UNIX Administrator
http://prefetch.net
Robert Milkowski
2007-Feb-12 12:06 UTC
[zfs-discuss] Re[2]: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,

Monday, February 12, 2007, 1:44:13 AM, you wrote:

M> Yikes! Does anyone from the ZFS / storage team happen to know when
M> work will complete to detect and replace failed disk drives? If hot
M> spares don't actually kick in to replace failed drives, is there any
M> value in using them?

Of course there is. Nevertheless, I completely agree that hot spare
support in ZFS is "somewhat" lacking.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Eric Schrock
2007-Feb-21 18:23 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On Sun, Feb 11, 2007 at 07:44:13PM -0500, Matty wrote:
> Yikes! Does anyone from the ZFS / storage team happen to know when
> work will complete to detect and replace failed disk drives? If hot
> spares don't actually kick in to replace failed drives, is there any
> value in using them?

This has been on my plate for far too long, so feel free to blame me ;-)
I have a prototype of this in a workspace; I'd expect it in Nevada
sometime in the next month or two. Note that the initial SERD values are
going to be pulled essentially from thin air, and will be rather
pessimistic. Now that we actually have structured ereport data, we'll be
able to do some more complex analysis once we have a body of failure
data to work with.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
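For anyone who wants to check whether their build already carries a ZFS
diagnosis engine, fmstat lists every module fmd has loaded. The
zfs-diagnosis module name below is an assumption about what a given
build ships, so it may simply be absent on older bits:

    fmstat                    # per-module event and case statistics for all loaded engines
    fmstat -m zfs-diagnosis   # module-private statistics, if the module is present

A missing module, or a solve count that stays at zero while ereports
accumulate, lines up with the manual behaviour described earlier in this
thread.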