On 07/05/2015 10:50, Slawa Olhovchenkov wrote:
> On Thu, May 07, 2015 at 09:41:43AM +0100, Steven Hartland wrote:
>
>> On 07/05/2015 09:07, Slawa Olhovchenkov wrote:
>>> I have a zpool of 12 vdevs (zmirrors).
>>> One disk in one vdev went out of service and stopped serving requests:
>>>
>>> dT: 1.036s  w: 1.000s
>>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>>     0      0      0      0    0.0      0      0    0.0    0.0| ada0
>>>     0      0      0      0    0.0      0      0    0.0    0.0| ada1
>>>     1      0      0      0    0.0      0      0    0.0    0.0| ada2
>>>     0      0      0      0    0.0      0      0    0.0    0.0| ada3
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da0
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da1
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da2
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da3
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da4
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da5
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da6
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da7
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da8
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da9
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da10
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da11
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da12
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da13
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da14
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da15
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da16
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da17
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da18
>>>    24      0      0      0    0.0      0      0    0.0    0.0| da19
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da20
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da21
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da22
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da23
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da24
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da25
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da26
>>>     0      0      0      0    0.0      0      0    0.0    0.0| da27
>>>
>>> As a result, ZFS operations on this pool have stopped too.
>>> `zpool list -v` doesn't work.
>>> `zpool detach tank da19` doesn't work.
>>> Applications using this pool are stuck in the `zfs` wchan and
>>> cannot be killed.
>>>
>>> # camcontrol tags da19 -v
>>> (pass19:isci0:0:3:0): dev_openings 7
>>> (pass19:isci0:0:3:0): dev_active 25
>>> (pass19:isci0:0:3:0): allocated 25
>>> (pass19:isci0:0:3:0): queued 0
>>> (pass19:isci0:0:3:0): held 0
>>> (pass19:isci0:0:3:0): mintags 2
>>> (pass19:isci0:0:3:0): maxtags 255
>>>
>>> How can I cancel these 24 requests?
>>> Why don't these requests time out (3 hours already)?
>>> How can I force-detach this disk? (I already tried `camcontrol
>>> reset` and `camcontrol rescan`.)
>>> Why doesn't ZFS (or GEOM) time out the requests and reroute them
>>> to da18?
>>>
>> If they are in mirrors, in theory you can just pull the disk; isci will
>> report to cam and cam will report to ZFS, which should all recover.
> Yes, a zmirror with da18.
> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
A single low-level request can only be handled by one device; if that
device returns an error then ZFS will use the other device, but not
until then.
>
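To illustrate the point (a hypothetical sketch, not ZFS code -- `side_a`
and `side_b` are stand-ins for da19 and da18): the fallback to the other
mirror side is driven by an *error return*, so a request that simply never
completes keeps the whole pipeline waiting:

```shell
# Sketch of the mirror read path described above (assumed names, not ZFS).
side_a() { return 1; }                 # the bad side returns an error
side_b() { echo "data from da18"; }    # the healthy mirror side
read_mirror() {
    side_a || side_b    # fallback runs only when side_a *fails*;
}                       # a request that hangs forever never triggers it
read_mirror             # prints "data from da18"
```

In your situation da19 is doing the third thing: neither succeeding nor
erroring, so the `||` branch is never reached and the pool just waits.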
>> With regards to not timing out this could be a default issue, but
>> having
> I understand, there is no universally acceptable timeout for all cases:
> a good disk, a good saturated disk, tape, a tape library, a failed disk,
> etc.
> In my case it is a failed disk. This model has already failed before
> (another specimen, with the same symptoms).
>
> Maybe there is some trick to cancel/abort all the requests in the queue
> and remove the disk from the system?
Unlikely tbh; pulling the disk, however, should work.
>
>> a very quick look that's not obvious in the code as
>> isci_io_request_construct etc do indeed set a timeout when
>> CAM_TIME_INFINITY hasn't been requested.
>>
>> The sysctl hw.isci.debug_level may be able to provide more information,
>> but be aware this can be spammy.
> I have already had this situation; which commands are of interest after
> setting hw.isci.debug_level?
I'm afraid I'm not familiar with isci; possibly someone else who is
can chime in.
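As an aside, a stuck device like da19 stands out in batch gstat output by
its non-zero L(q) (first column). A rough sketch of filtering for that --
the sample lines here are trimmed from your paste, and the exact column
layout is whatever your gstat emits:

```shell
# Print the device name (last field) of any gstat row whose queue
# length L(q) -- the first field -- is non-zero.
sample='   0      0      0      0    0.0      0      0    0.0    0.0| da18
  24      0      0      0    0.0      0      0    0.0    0.0| da19'
printf '%s\n' "$sample" | awk '$1 > 0 { print $NF }'   # prints "da19"
```

Handy for spotting which member of a large pool is holding requests
without eyeballing all 30-odd rows.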
Regards
Steve