While testing a zpool with a different storage adapter using my "blkdev"
device, I did a test which made a disk unavailable -- all attempts to read
from it report EIO.  I expected my configuration (a 3 disk test, with 2
disks in a RAIDZ and a hot spare) to work such that the hot spare would
automatically be activated.  But I'm finding that ZFS does not behave this
way -- if only some I/Os fail, then the hot spare is activated, but if ZFS
decides that the label is gone, it makes no attempt to recruit a hot spare.

I had added FMA notification to my blkdev driver -- it will post
device.no_response or device.invalid_state ereports (per the
ddi_fm_ereport_post() man page) in certain failure scenarios.  I *suspect*
the problem is in the FMA notification for zfs-retire, where the event is
not being interpreted in a way that lets the ZFS retire agent figure out
that the drive is toasted.  Of course, this is just an educated guess on my
part.  I'm no ZFS nor FMA expert here.

Am I missing something here?  Under what conditions can I expect hot spares
to be recruited?

My zpool status showing the results is below.

	- Garrett

> pfexec zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing
        or invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            c2t3d0    ONLINE       0     0     0
            c2t3d1    UNAVAIL      9   132     0  experienced I/O failures
        spares
          c2t3d2      AVAIL

errors: No known data errors
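For reference, posting such a device ereport from a driver typically
follows the pattern shown in the ddi_fm_ereport_post(9F) man page --
roughly the sketch below.  This is illustrative only, not the actual
blkdev source; blk_post_device_ereport() is a made-up name, and the
resulting events show up in fmdump under the ereport.io.* class tree.

    #include <sys/types.h>
    #include <sys/systm.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>
    #include <sys/ddifm.h>
    #include <sys/fm/protocol.h>
    #include <sys/fm/util.h>
    #include <sys/fm/io/ddi.h>

    /*
     * Post a device-level ereport such as "device.no_response" or
     * "device.inval_state" against this driver instance.
     */
    static void
    blk_post_device_ereport(dev_info_t *dip, const char *detail)
    {
            char class[FM_MAX_CLASS];
            uint64_t ena;

            /* detail is e.g. DDI_FM_DEVICE_NO_RESPONSE */
            (void) snprintf(class, sizeof (class), "%s.%s",
                DDI_FM_DEVICE, detail);

            /* generate an Error Numeric Association for this event */
            ena = fm_ena_generate(0, FM_ENA_FMT1);

            ddi_fm_ereport_post(dip, class, ena, DDI_NOSLEEP,
                FM_VERSION, DATA_TYPE_UINT8, FM_EREPORT_VERS0,
                NULL);
    }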
On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote:
>
> Am I missing something here?  Under what conditions can I expect hot
> spares to be recruited?

Hot spares are activated by the zfs-retire agent in response to a
list.suspect event containing one of the following faults:

	fault.fs.zfs.vdev.io
	fault.fs.zfs.vdev.checksum
	fault.fs.zfs.device

The last of these (fault.fs.zfs.device) is what is diagnosed when a label
is corrupted.  What software are you running?  Have you confirmed that you
are getting one of these faults?  What does 'fmdump -V' show?  Does doing a
'zpool replace c2t3d1 c2t3d2' by hand succeed?

- Eric

--
Eric Schrock, Fishworks			http://blogs.sun.com/eschrock
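For reference, the retire agent receives these faults inside a
list.suspect event delivered to its fmdo_recv callback.  A rough sketch of
matching on the three fault classes above follows; names are assumed and
this is not the actual zfs-retire source.

    #include <fm/fmd_api.h>
    #include <sys/fm/protocol.h>

    /*
     * fmdo_recv callback: fmd delivers list.suspect events here; the
     * agent walks the enclosed fault list and only acts on ZFS faults.
     */
    static void
    retire_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl,
        const char *class)
    {
            nvlist_t **faults;
            uint_t nfaults, f;

            if (nvlist_lookup_nvlist_array(nvl, FM_SUSPECT_FAULT_LIST,
                &faults, &nfaults) != 0)
                    return;

            for (f = 0; f < nfaults; f++) {
                    if (fmd_nvl_class_match(hdl, faults[f],
                        "fault.fs.zfs.vdev.io") ||
                        fmd_nvl_class_match(hdl, faults[f],
                        "fault.fs.zfs.vdev.checksum") ||
                        fmd_nvl_class_match(hdl, faults[f],
                        "fault.fs.zfs.device")) {
                            /*
                             * Fault the affected vdev and, if a spare
                             * is available, kick off the replacement.
                             */
                    }
            }
    }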
On 04/ 5/10 05:28 AM, Eric Schrock wrote:
> On Apr 5, 2010, at 3:38 AM, Garrett D'Amore wrote:
>
>> Am I missing something here?  Under what conditions can I expect hot
>> spares to be recruited?
>>
> Hot spares are activated by the zfs-retire agent in response to a
> list.suspect event containing one of the following faults:
>
> 	fault.fs.zfs.vdev.io
> 	fault.fs.zfs.vdev.checksum
> 	fault.fs.zfs.device
>
> The last of these (fault.fs.zfs.device) is what is diagnosed when a
> label is corrupted.  What software are you running?  Have you confirmed
> that you are getting one of these faults?  What does 'fmdump -V' show?
> Does doing a 'zpool replace c2t3d1 c2t3d2' by hand succeed?
>

I see ereport.fs.zfs.io_failure, and ereport.fs.zfs.probe_failure.  Also,
ereport.io.service.lost and ereport.io.device.inval_state.  There is
indeed a fault.fs.zfs.device in the list as well.  Clearly ZFS thinks the
device is unavailable (which is accurate).

And "pfexec zpool replace testpool c2t3d1 c2t3d2" works fine, as shown
here:

gdamore@tabasco{33}> pfexec zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: testpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
        exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Mon Apr  5 08:39:57 2010
config:

        NAME           STATE     READ WRITE CKSUM
        testpool       DEGRADED     0     0     0
          raidz1-0     DEGRADED     0     0     0
            c2t3d0     ONLINE       0     0     0
            spare-1    DEGRADED     0     0     0
              c2t3d1   UNAVAIL      9   132     0  cannot open
              c2t3d2   ONLINE       0     0     0  20.8M resilvered
        spares
          c2t3d2       INUSE     currently in use

errors: No known data errors
gdamore@tabasco{34}>

Everything seems to be correct *except* that ZFS isn't automatically doing
the replace operation with the hot spare.

It feels to me like this is possibly a ZFS bug -- perhaps ZFS is expecting
a specific set of FMA faults that only sd delivers?  (Recall this is with
a different target device.)

	- Garrett

> - Eric
>
> --
> Eric Schrock, Fishworks		http://blogs.sun.com/eschrock
>
On Apr 5, 2010, at 11:43 AM, Garrett D'Amore wrote:
>
> I see ereport.fs.zfs.io_failure, and ereport.fs.zfs.probe_failure.
> Also, ereport.io.service.lost and ereport.io.device.inval_state.  There
> is indeed a fault.fs.zfs.device in the list as well.

The ereports are not interesting, only the fault.  In FMA, ereports
contribute to diagnosis, but faults are the only thing that is presented
to the user and retire agents.

> Everything seems to be correct *except* that ZFS isn't automatically
> doing the replace operation with the hot spare.
>
> It feels to me like this is possibly a ZFS bug -- perhaps ZFS is
> expecting a specific set of FMA faults that only sd delivers?  (Recall
> this is with a different target device.)

Yes, it may be a bug.  You will have to step through the zfs retire agent
to see what goes wrong when it receives the list.suspect event.  This code
path is tested many, many times every day, so it's not as obvious as "this
doesn't work."  The ZFS retire agent subscribes only to ZFS faults.  The
underlying driver or other telemetry has no bearing on the diagnosis or
associated action.

- Eric
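For reference, the subscription side of such an fmd module looks roughly
like the following sketch, which pairs with the retire_recv() sketch shown
earlier in the thread.  This uses the generic fmd module API with
illustrative names; in practice a module's subscriptions may instead be
declared in its .conf file.

    #include <fm/fmd_api.h>

    /* receive callback from the earlier sketch */
    static void retire_recv(fmd_hdl_t *, fmd_event_t *, nvlist_t *,
        const char *);

    static const fmd_hdl_ops_t retire_ops = {
            retire_recv,    /* fmdo_recv */
            NULL,           /* fmdo_timeout */
            NULL,           /* fmdo_close */
            NULL,           /* fmdo_stats */
            NULL,           /* fmdo_gc */
    };

    static const fmd_hdl_info_t retire_info = {
            "Example Retire Agent", "1.0", &retire_ops, NULL
    };

    void
    _fmd_init(fmd_hdl_t *hdl)
    {
            if (fmd_hdl_register(hdl, FMD_API_VERSION, &retire_info) != 0)
                    return;

            /*
             * Only ZFS fault classes are delivered to this module.
             * Driver-level ereports (ereport.io.*) never reach it
             * directly; they only feed the diagnosis engine upstream.
             */
            fmd_hdl_subscribe(hdl, "fault.fs.zfs.*");
    }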