Brian H. Nelson
2008-Feb-12 15:44 UTC
[zfs-discuss] ZFS keeps trying to open a dead disk: lots of logging
This is Solaris 10U3 w/127111-05.

It appears that one of the disks in my zpool died yesterday. I got several SCSI errors finally ending with 'device not responding to selection'. That seems to be all well and good. ZFS figured it out and the pool is degraded:

maxwell /var/adm >zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        DEGRADED     0     0     0
          raidz1     DEGRADED     0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   UNAVAIL  1.88K 17.98     0  cannot open

errors: No known data errors

My question is why does ZFS keep attempting to open the dead device? At least that's what I assume is happening. About every minute, I get eight of these entries in the messages log:

Feb 12 10:15:54 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 12 10:15:54 maxwell        disk not responding to selection

I also got a number of these thrown in for good measure:

Feb 11 22:21:58 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 11 22:21:58 maxwell        SYNCHRONIZE CACHE command failed (5)

Since the disk died last night (at about 11:20pm EST), I now have over 15K similar entries in my log. What gives? Is this expected behavior? If ZFS knows the device is having problems, why does it not just leave it alone and wait for user intervention?

Also, I noticed that the 'action' says to attach the device and 'zpool online' it. Am I correct in assuming that a 'zpool replace' is what would really be needed, as the data on the disk will be outdated?

Thanks,
-Brian

--
---------------------------------------------------
Brian H. Nelson                Youngstown State University
System Administrator           Media and Academic Computing
bnelson[at]cis.ysu.edu
---------------------------------------------------
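
For reference, here is a minimal sketch of the two recovery paths that 'action' text contrasts, using the device names from the status output above; the second target (c2t3d0) is purely illustrative and not a device on this system:

   # If the same physical disk comes back and only missed some writes,
   # onlining it should let ZFS resilver just the stale data:
   zpool online pool1 c2t2d0
   zpool status pool1                  # watch the resilver

   # If the disk is really dead, 'zpool replace' rebuilds onto a new drive.
   # With a single device argument, the new disk is assumed to be in the
   # same slot (c2t2d0):
   zpool replace pool1 c2t2d0
   # ...or resilver onto a disk at a different target:
   zpool replace pool1 c2t2d0 c2t3d0
   zpool status pool1                  # resilver progress; DEGRADED clears when done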
Brian H. Nelson
2008-Feb-12 17:16 UTC
[zfs-discuss] Need help with a dead disk (was: ZFS keeps trying to open a dead disk: lots of logging)
Ok. I think I answered my own question. ZFS _didn't_ realize that the disk was bad/stale. I power-cycled the failed drive (external) to see if it would come back up and/or run diagnostics on it. As soon as I did that, ZFS put the disk ONLINE and started using it again! Observe:

bash-3.00# zpool status
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE   2.11K 20.09     0

errors: No known data errors

Now I _really_ have a problem. I can't offline the disk myself:

bash-3.00# zpool offline pool1 c2t2d0
cannot offline c2t2d0: no valid replicas

I don't understand why, as 'zpool status' says all the other drives are OK.

What's worse, if I just power off the drive in question (trying to get back to where I started), the zpool hangs completely! I let it go for about 7 minutes thinking maybe there was some timeout, but still nothing. Any command that would access the zpool (including 'zpool status') hangs. The only way to fix it is to power the external disk back on, upon which everything starts working like nothing happened. Nothing gets logged other than lots of these, and only while the drive is powered off:

Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 12 11:49:32 maxwell        disk not responding to selection
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 12 11:49:32 maxwell        offline or reservation conflict
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 12 11:49:32 maxwell        i/o to invalid geometry

What's going on here? What can I do to make ZFS let go of the bad drive? This is a production machine and I'm getting concerned. I _really_ don't like the fact that ZFS is using a suspect drive, but I can't seem to make it stop!

Thanks,
-Brian

Brian H. Nelson wrote:
> [...]

--
---------------------------------------------------
Brian H. Nelson                Youngstown State University
System Administrator           Media and Academic Computing
bnelson[at]cis.ysu.edu
---------------------------------------------------
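
Not an answer to the offline problem, but one way to at least check whether the data now being served off the suspect c2t2d0 is intact would be to force a pass over every block in the pool (a sketch only; a scrub will add I/O load to a production machine):

   zpool scrub pool1
   zpool status pool1     # shows scrub progress and any checksum errors it finds
   # a fresh batch of errors charged to c2t2d0 would argue for 'zpool replace'
   # as soon as the drive can actually be swapped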
Here's a bit more info. The drive appears to have failed at 22:19 EST, but it wasn't until 1:30 EST the next day that the system finally decided that it was bad. (Why?) Here's some relevant log stuff (with lots of repeated 'device not responding' errors removed). I don't know if it will be useful:

Feb 11 22:19:09 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 11 22:19:09 maxwell        SCSI transport failed: reason 'incomplete': retrying command
Feb 11 22:19:10 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 11 22:19:10 maxwell        disk not responding to selection
...
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4 (isp0):
Feb 11 22:21:08 maxwell        SCSI Cable/Connection problem.
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.notice]  Hardware/Firmware error.
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4 (isp0):
Feb 11 22:21:08 maxwell        Fatal error, resetting interface, flg 16
...
(Why did this take so long?)
Feb 12 01:30:05 maxwell scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/pci@2/SUNW,isptwo@4/sd@2,0 (sd32):
Feb 12 01:30:05 maxwell        offline
...
Feb 12 01:30:22 maxwell fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Feb 12 01:30:22 maxwell EVENT-TIME: Tue Feb 12 01:30:22 EST 2008
Feb 12 01:30:22 maxwell PLATFORM: SUNW,Ultra-250, CSN: -, HOSTNAME: maxwell
Feb 12 01:30:22 maxwell SOURCE: zfs-diagnosis, REV: 1.0
Feb 12 01:30:22 maxwell EVENT-ID: 7f48f376-2eb1-ccaf-afc5-e56f5bf4576f
Feb 12 01:30:22 maxwell DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
Feb 12 01:30:22 maxwell AUTO-RESPONSE: No automated response will occur.
Feb 12 01:30:22 maxwell IMPACT: Fault tolerance of the pool may be compromised.
Feb 12 01:30:22 maxwell REC-ACTION: Run 'zpool status -x' and replace the bad device.

One thought I had was to unconfigure the bad disk with cfgadm. Would that force the system back into the 'offline' response?

Thanks,
-Brian

Brian H. Nelson wrote:
> [...]
--
---------------------------------------------------
Brian H. Nelson                Youngstown State University
System Administrator           Media and Academic Computing
bnelson[at]cis.ysu.edu
---------------------------------------------------
Hmm... this won't help you, but I think I'm having similar problems with an iSCSI target device. If I offline the target, zfs hangs for just over 5 minutes before it realises the device is unavailable, and even then it doesn't report the problem until I repeat the zpool status command.

What I see here every time is:
- iSCSI device disconnected
- zpool status, and all file i/o, appears to hang for 5 mins
- zpool status then finishes (reporting pools ok), and i/o carries on
- immediately running zpool status again correctly shows the device as faulty and the pool as degraded

It seems either ZFS or the Solaris driver stack has a problem when devices go offline. Both of us have seen zpool status hang for huge amounts of time when there's a problem with a drive. Not something that inspires confidence in a raid system.
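
A rough sketch of commands that might help show whether that five-minute stall is coming from the driver/FMA layer rather than from ZFS itself, run while the target is disconnected:

   iostat -En                 # per-device soft/hard/transport error counters
   fmdump -e | tail -20       # recent error telemetry (ereports) with timestamps
   tail -f /var/adm/messages  # watch the retries being logged in real time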
bnelson at cis.ysu.edu said:
> One thought I had was to unconfigure the bad disk with cfgadm. Would that
> force the system back into the 'offline' response?

In my experience (X4100 internal drive), that will make ZFS stop trying to use it. It's also a good idea to do this before you hot-unplug the bad drive to replace it with a new one.

Regards,
Marion
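
As a sketch, that sequence might look like the following for the c2t2d0 drive from earlier in the thread; the attachment point name is a guess, and 'cfgadm -al' will show the real one for the controller in question:

   cfgadm -al                               # list attachment points and their states
   cfgadm -c unconfigure c2::dsk/c2t2d0     # have the OS stop using the dead disk
   # ...physically swap the drive, then:
   cfgadm -c configure c2::dsk/c2t2d0
   zpool replace pool1 c2t2d0               # resilver onto the new disk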
We are encountering the same issue. Essentially, ZFS has trouble stopping access to a dead drive. We are testing out Solaris/ZFS and this has become a very serious issue for us. Any help/fix for this would be greatly appreciated.

Regarding 'cfgadm --unconfigure ...': the recommendation seems to be to stop access to the drive before using the cfgadm command. We cannot do that, because that would mean shutting down the entire filesystem just for a failed hard drive. I think that defeats the purpose of raidz2. Does the above command work when ZFS is still trying to access the disk?

Thanks
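
One possible workaround sketch for the raidz2 case, assuming a spare disk is already attached at another target; the pool and device names here are entirely hypothetical:

   zpool replace tank c1t5d0 c1t9d0   # rebuild the failed member onto the spare disk
   zpool status tank                  # the failed disk is detached from the config
                                      # once the resilver completes

Whether ZFS will accept this while the dead disk is still hanging the bus is, of course, the open question in this thread.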