Matty
2007-Feb-11 17:56 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
Howdy,
On one of my Solaris 10 11/06 servers, I am getting numerous errors
similar to the following:
Feb 11 09:30:23 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:30:23 rx      Error for Command: write(10)       Error Level: Retryable
Feb 11 09:30:23 rx scsi:        Requested Block: 58458343          Error Block: 58458343
Feb 11 09:30:23 rx scsi:        Vendor: SEAGATE                    Serial Number: 0404A72YCG
Feb 11 09:30:23 rx scsi:        Sense Key: Hardware Error
Feb 11 09:30:23 rx scsi:        ASC: 0x19 (defect list error), ASCQ: 0x0, FRU: 0x2

Feb 11 09:32:18 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:32:18 rx      Error for Command: write(10)       Error Level: Retryable
Feb 11 09:32:18 rx scsi:        Requested Block: 58696759          Error Block: 58696501
Feb 11 09:32:18 rx scsi:        Vendor: SEAGATE                    Serial Number: 0404A72YCG
Feb 11 09:32:18 rx scsi:        Sense Key: Media Error
Feb 11 09:32:18 rx scsi:        ASC: 0xc (write error - auto reallocation failed), ASCQ: 0x2, FRU: 0x1
Assuming I am reading the error message correctly, it looks like the
disk drive (c2t2d0) has used up all of the spare sectors reserved for
reallocating bad sectors. If this is the case, is there a reason
Solaris doesn't offline the drive? This would allow ZFS to evict the
faulty disk from my pool and kick in the spare disk drive I have
configured:
$ zpool status -v
  pool: rz2pool
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat Feb 10 18:46:54 2007
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL

errors: No known data errors
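
As an aside, the drive's cumulative error counters can be checked from
the sd driver with iostat (a rough sketch, assuming standard Solaris 10
syntax and the same device name as above):

  # per-device soft/hard/transport error counters and the media error
  # breakdown accumulated since boot
  $ iostat -En c2t2d0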
Thanks for any insight,
- Ryan
--
UNIX Administrator
http://prefetch.net
Joe Little
2007-Feb-11 21:17 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Matty <matty91 at gmail.com> wrote:
> [...]
> Assuming I am reading the error message correctly, it looks like the
> disk drive (c2t2d0) has used up all of the spare sectors reserved for
> reallocating bad sectors. If this is the case, is there a reason
> Solaris doesn't offline the drive? This would allow ZFS to evict the
> faulty disk from my pool and kick in the spare disk drive I have
> configured.
> [...]

We've seen the same thing with SiI3124 or Marvell chipsets and SATA
drives, but when the errors come up the pool wedges and the hot spare is
never automatically utilized. I'm not sure if FMA and friends actually
use the spare and start a resilver automatically. It appears to still be
a manual effort.
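
One way to check whether FMA has actually diagnosed the drive, rather
than just logging retryable errors, is to look at the fault and error
logs (a sketch, using the standard Solaris FMA commands):

  # resources fmd currently considers faulty
  $ fmadm faulty

  # diagnosed faults, and the raw error reports they were built from
  $ fmdump
  $ fmdump -e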
Robert Milkowski
2007-Feb-11 23:16 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,
Sunday, February 11, 2007, 6:56:14 PM, you wrote:
M> Howdy,
M> On one of my Solaris 10 11/06 servers, I am getting numerous errors
M> similar to the following:
AFAIK nothing has been integrated yet to do it.
A hot spare will kick in automatically only when ZFS can't open a device;
other than that, you are in manual mode for now.
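
For now the manual path would look something like this (a sketch only;
standard zpool syntax, using the device names from the original post):

  # take the failing disk out of service, then pull in the configured spare
  $ zpool offline rz2pool c2t2d0
  $ zpool replace rz2pool c2t2d0 c2t3d0

  # after the bad disk has been physically swapped, resilver onto the
  # replacement; the spare should then return to (or can be detached
  # back to) the AVAIL list
  $ zpool replace rz2pool c2t2d0
  $ zpool detach rz2pool c2t3d0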
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Matty
2007-Feb-12 00:44 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> AFAIK nothing has been integrated yet to do it.
> A hot spare will kick in automatically only when ZFS can't open a device;
> other than that, you are in manual mode for now.

Yikes! Does anyone from the ZFS / storage team happen to know when work
will complete to detect and replace failed disk drives? If hot spares
don't actually kick in to replace failed drives, is there any value in
using them?

Thanks,
- Ryan

--
UNIX Administrator
http://prefetch.net
Robert Milkowski
2007-Feb-12 12:06 UTC
[zfs-discuss] Re[2]: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,

Monday, February 12, 2007, 1:44:13 AM, you wrote:

M> Yikes! Does anyone from the ZFS / storage team happen to know when
M> work will complete to detect and replace failed disk drives? If hot
M> spares don't actually kick in to replace failed drives, is there any
M> value in using them?

Of course there is. Nevertheless, I completely agree that hot spare
support in ZFS is "somewhat" lacking.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Eric Schrock
2007-Feb-21 18:23 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On Sun, Feb 11, 2007 at 07:44:13PM -0500, Matty wrote:
> Yikes! Does anyone from the ZFS / storage team happen to know when
> work will complete to detect and replace failed disk drives? If hot
> spares don't actually kick in to replace failed drives, is there any
> value in using them?

This has been on my plate for far too long, so feel free to blame me ;-)
I have a prototype of this in a workspace; I'd expect it in Nevada
sometime in the next month or two. Note that the initial SERD values are
going to be pulled essentially from thin air, and rather pessimistic.
Now that we actually have the structured ereport data, we'll be able to
do some more complex analysis once we have a body of failure data to
work with.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
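
For context, the ereports Eric mentions are the raw telemetry records
that fmd collects, and a SERD engine typically declares a fault once it
has seen N of them within a given time window. On a running system they
can be inspected with the standard FMA tools (a sketch):

  # raw ereports collected by fmd (the input a SERD engine would count)
  $ fmdump -e
  $ fmdump -eV     # full detail for each ereport

  # statistics for the loaded diagnosis modules
  $ fmstat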