thr3ads.net - freebsd stable - zfs, cam sticking on failed disk [May 2015]

Slawa Olhovchenkov

2015-May-07 12:44 UTC

zfs, cam sticking on failed disk

On Thu, May 07, 2015 at 01:35:05PM +0100, Steven Hartland wrote:
> 
> 
> On 07/05/2015 13:05, Slawa Olhovchenkov wrote:
> > On Thu, May 07, 2015 at 01:00:40PM +0100, Steven Hartland wrote:
> >
> >>
> >> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
> >>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
> >>>
> >>>>>>> How can I cancel these 24 requests?
> >>>>>>> Why don't these requests time out (3 hours already)?
> >>>>>>> How can I force-detach this disk? (I have already tried
> >>>>>>> `camcontrol reset`, `camcontrol rescan`).
> >>>>>>> Why doesn't ZFS (or geom) time out the request and
> >>>>>>> reroute it to da18?
> >>>>>>>
> >>>>>> If they are in mirrors, in theory you can just pull the disk,
> >>>>>> isci will report to cam and cam will report to ZFS which
> >>>>>> should all recover.
> >>>>> Yes, zmirror with da18.
> >>>>> I am surprised that ZFS doesn't use da18. The whole zpool is
> >>>>> fully stuck.
> >>>> A single low level request can only be handled by one device,
> >>>> if that device returns an error then ZFS will use the other
> >>>> device, but not until.
> >>> Why aren't the next requests routed to da18?
> >>> The current request is stuck on da19 (unlikely, but understood),
> >>> but why is the whole pool stuck?
> >> It's still waiting for the request from the failed device to
> >> complete. As far as ZFS currently knows there is nothing wrong
> >> with the device as it's had no failures.
> > Can you explain some more?
> > One request is waiting, understood.
> > Then I issue the next request. It needs some information from the
> > vdev with the failed disk. The failed disk is busier (long queue),
> > so why isn't the request routed to the mirror disk? Or, for
> > metadata, to a less busy vdev?
> As no error has been reported to ZFS, due to the stalled IO, there is no 
> failed vdev.
I see that the device isn't failed (for both the OS and ZFS).
I am not talking about a 'failed vdev'. I am talking about a
'busy vdev' or 'busy device'.
> Yes in theory new requests should go to the other vdev, but there could 
> be some dependency issues preventing that such as a syncing TXG.
Currently this pool should not have any write activity (from applications).
What about going to the other (mirror) device in the same vdev?
Same dependency?

Steven Hartland

2015-May-07 12:46 UTC

On 07/05/2015 13:44, Slawa Olhovchenkov wrote:
> On Thu, May 07, 2015 at 01:35:05PM +0100, Steven Hartland wrote:
>
>>
>> On 07/05/2015 13:05, Slawa Olhovchenkov wrote:
>>> On Thu, May 07, 2015 at 01:00:40PM +0100, Steven Hartland wrote:
>>>
>>>> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>>>>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>>>>
>>>>>>>>> How can I cancel these 24 requests?
>>>>>>>>> Why don't these requests time out (3 hours already)?
>>>>>>>>> How can I force-detach this disk? (I have already tried
>>>>>>>>> `camcontrol reset`, `camcontrol rescan`).
>>>>>>>>> Why doesn't ZFS (or geom) time out the request and
>>>>>>>>> reroute it to da18?
>>>>>>>>>
>>>>>>>> If they are in mirrors, in theory you can just pull the disk,
>>>>>>>> isci will report to cam and cam will report to ZFS which
>>>>>>>> should all recover.
>>>>>>> Yes, zmirror with da18.
>>>>>>> I am surprised that ZFS doesn't use da18. The whole zpool is
>>>>>>> fully stuck.
>>>>>> A single low level request can only be handled by one device,
>>>>>> if that device returns an error then ZFS will use the other
>>>>>> device, but not until.
>>>>> Why aren't the next requests routed to da18?
>>>>> The current request is stuck on da19 (unlikely, but understood),
>>>>> but why is the whole pool stuck?
>>>> It's still waiting for the request from the failed device to
>>>> complete. As far as ZFS currently knows there is nothing wrong
>>>> with the device as it's had no failures.
>>> Can you explain some more?
>>> One request is waiting, understood.
>>> Then I issue the next request. It needs some information from the
>>> vdev with the failed disk. The failed disk is busier (long queue),
>>> so why isn't the request routed to the mirror disk? Or, for
>>> metadata, to a less busy vdev?
>> As no error has been reported to ZFS, due to the stalled IO, there is
>> no failed vdev.
> I see that the device isn't failed (for both the OS and ZFS).
> I am not talking about a 'failed vdev'. I am talking about a
> 'busy vdev' or 'busy device'.
>
>> Yes in theory new requests should go to the other vdev, but there could
>> be some dependency issues preventing that such as a syncing TXG.
> Currently this pool should not have any write activity (from applications).
> What about going to the other (mirror) device in the same vdev?
> Same dependency?

Yes, if there's an outstanding TXG, then I believe all IO will stall.
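
The behaviour described in this thread can be illustrated with a toy Python model. This is not ZFS code, and the names `Result`, `Disk`, and `mirror_read` are made up for the sketch. It assumes, as stated above, that a low-level read is handed to one mirror child and only moves to the other child after an explicit error, so a request that never completes never triggers the fallback.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Result(Enum):
    OK = auto()       # device completed the request
    ERROR = auto()    # device reported a failure
    PENDING = auto()  # device accepted the request but never completed it


@dataclass
class Disk:
    """Hypothetical mirror child whose reads always end the same way."""
    name: str
    behaviour: Result

    def read(self, block):
        return self.behaviour


def mirror_read(children, block):
    """Read from the first child; try the next one only after an ERROR.

    Returns (status, names_of_children_tried).
    """
    tried = []
    for disk in children:
        tried.append(disk.name)
        result = disk.read(block)
        if result is Result.OK:
            return "ok", tried
        if result is Result.PENDING:
            # No error was ever reported, so there is nothing to react to.
            return "stalled", tried
        # Result.ERROR: fall through to the next mirror child.
    return "failed", tried


# da19 stalls without reporting an error: da18 is never consulted.
print(mirror_read([Disk("da19", Result.PENDING), Disk("da18", Result.OK)], 0))
# da19 reports an error: the read is retried on da18.
print(mirror_read([Disk("da19", Result.ERROR), Disk("da18", Result.OK)], 0))
```

In this model the stalled case returns `('stalled', ['da19'])`, while the error case returns `('ok', ['da19', 'da18'])`, which matches the "but not until" point made earlier in the thread.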
