Matty
2007-Feb-11 17:56 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
Howdy,
On one of my Solaris 10 11/06 servers, I am getting numerous errors
similar to the following:
Feb 11 09:30:23 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:30:23 rx      Error for Command: write(10)       Error Level: Retryable
Feb 11 09:30:23 rx scsi:        Requested Block: 58458343          Error Block: 58458343
Feb 11 09:30:23 rx scsi:        Vendor: SEAGATE                    Serial Number: 0404A72YCG
Feb 11 09:30:23 rx scsi:        Sense Key: Hardware Error
Feb 11 09:30:23 rx scsi:        ASC: 0x19 (defect list error), ASCQ: 0x0, FRU: 0x2

Feb 11 09:32:18 rx scsi: WARNING: /pci@b,2000/scsi@2,1/sd@2,0 (sd1):
Feb 11 09:32:18 rx      Error for Command: write(10)       Error Level: Retryable
Feb 11 09:32:18 rx scsi:        Requested Block: 58696759          Error Block: 58696501
Feb 11 09:32:18 rx scsi:        Vendor: SEAGATE                    Serial Number: 0404A72YCG
Feb 11 09:32:18 rx scsi:        Sense Key: Media Error
Feb 11 09:32:18 rx scsi:        ASC: 0xc (write error - auto reallocation failed), ASCQ: 0x2, FRU: 0x1
Assuming I am reading the error message correctly, it looks like the
disk drive (c2t2d0) has used up all of the spare sectors reserved for
reallocating bad sectors. If this is the case, is there a reason
Solaris doesn't offline the drive? This would allow ZFS to evict the
faulty disk from my pool and kick in the spare disk drive I have
configured:
$ zpool status -v
  pool: rz2pool
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat Feb 10 18:46:54 2007
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL

errors: No known data errors
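
As an aside, the drive's cumulative error counters can be checked from
the sd driver with iostat (a rough sketch, assuming standard Solaris 10
syntax and the same device name as above):

  # per-device soft/hard/transport error counters and the media error
  # breakdown accumulated since boot
  $ iostat -En c2t2d0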
Thanks for any insight,
- Ryan
--
UNIX Administrator
http://prefetch.net
Joe Little
2007-Feb-11 21:17 UTC
[zfs-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Matty <matty91 at gmail.com> wrote:
> [...]
> Assuming I am reading the error message correctly, it looks like the
> disk drive (c2t2d0) has used up all of the spare sectors reserved for
> reallocating bad sectors. If this is the case, is there a reason
> Solaris doesn't offline the drive? This would allow ZFS to evict the
> faulty disk from my pool and kick in the spare disk drive I have
> configured.
> [...]

We've seen the same thing with SiI3124 or Marvell chipsets and SATA
drives, but when the errors come up the pool wedges and the hot spare is
never automatically utilized. I'm not sure if FMA and friends actually
use the spare and start a resilver automatically. It appears to still be
a manual effort.
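
One way to check whether FMA has actually diagnosed the drive, rather
than just logging retryable errors, is to look at the fault and error
logs (a sketch, using the standard Solaris FMA commands):

  # resources fmd currently considers faulty
  $ fmadm faulty

  # diagnosed faults, and the raw error reports they were built from
  $ fmdump
  $ fmdump -e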
Robert Milkowski
2007-Feb-11 23:16 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,
Sunday, February 11, 2007, 6:56:14 PM, you wrote:
M> Howdy,
M> On one of my Solaris 10 11/06 servers, I am getting numerous errors
M> similar to the following:
AFAIK nothing has been integrated yet to do it.
A hot spare will kick in automatically only when ZFS can't open a device;
other than that, you are in manual mode for now.
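
For now the manual path would look something like this (a sketch only;
standard zpool syntax, using the device names from the original post):

  # take the failing disk out of service, then pull in the configured spare
  $ zpool offline rz2pool c2t2d0
  $ zpool replace rz2pool c2t2d0 c2t3d0

  # after the bad disk has been physically swapped, resilver onto the
  # replacement; the spare should then return to (or can be detached
  # back to) the AVAIL list
  $ zpool replace rz2pool c2t2d0
  $ zpool detach rz2pool c2t3d0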
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Matty
2007-Feb-12 00:44 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On 2/11/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> AFAIK nothing has been integrated yet to do it.
> A hot spare will kick in automatically only when ZFS can't open a device;
> other than that, you are in manual mode for now.

Yikes! Does anyone from the ZFS / storage team happen to know when work
will complete to detect and replace failed disk drives? If hot spares
don't actually kick in to replace failed drives, is there any value in
using them?

Thanks,
- Ryan

--
UNIX Administrator
http://prefetch.net
Robert Milkowski
2007-Feb-12 12:06 UTC
[zfs-discuss] Re[2]: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
Hello Matty,

Monday, February 12, 2007, 1:44:13 AM, you wrote:

M> Yikes! Does anyone from the ZFS / storage team happen to know when
M> work will complete to detect and replace failed disk drives? If hot
M> spares don't actually kick in to replace failed drives, is there any
M> value in using them?

Of course there is. Nevertheless, I completely agree that hot spare
support in ZFS is "somewhat" lacking.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Eric Schrock
2007-Feb-21 18:23 UTC
[zfs-discuss] Re: [storage-discuss] Why doesn't Solaris remove a faulty disk from operation?
On Sun, Feb 11, 2007 at 07:44:13PM -0500, Matty wrote:
> Yikes! Does anyone from the ZFS / storage team happen to know when
> work will complete to detect and replace failed disk drives? If hot
> spares don't actually kick in to replace failed drives, is there any
> value in using them?

This has been on my plate for far too long, so feel free to blame me ;-)
I have a prototype of this in a workspace; I'd expect it in Nevada
sometime in the next month or two. Note that the initial SERD values are
going to be pulled essentially from thin air, and rather pessimistic.
Now that we actually have the structured ereport data, we'll be able to
do some more complex analysis once we have a body of failure data to
work with.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
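
For context, the ereports Eric mentions are the raw telemetry records
that fmd collects, and a SERD engine typically declares a fault once it
has seen N of them within a given time window. On a running system they
can be inspected with the standard FMA tools (a sketch):

  # raw ereports collected by fmd (the input a SERD engine would count)
  $ fmdump -e
  $ fmdump -eV     # full detail for each ereport

  # statistics for the loaded diagnosis modules
  $ fmstat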