On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>
>>>>> How can I cancel these 24 requests?
>>>>> Why don't these requests time out (3 hours already)?
>>>>> How can I force a detach of this disk? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
>>>>> Why doesn't ZFS (or geom) time the request out and reroute it to da18?
>>>>>
>>>> If they are in mirrors, in theory you can just pull the disk; isci will
>>>> report to cam and cam will report to ZFS, which should all recover.
>>> Yes, a zmirror with da18.
>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
>> A single low-level request can only be handled by one device; if that
>> device returns an error then ZFS will use the other device, but not until then.
> Why aren't the next requests routed to da18?
> The current request being stuck on da19 is unlikely but understandable, but why
> is the whole pool stuck?

It's still waiting for the request to the failed device to complete. As far as
ZFS currently knows there is nothing wrong with the device, as it has had no
failures.

You didn't say which FreeBSD version you were running?

    Regards
    Steve
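A side note for anyone debugging a similar wedge: the depth of the stuck queue
on the suspect disk and the kernel threads blocked behind it can usually be
inspected from userland. The commands below are only a sketch (da19 is the
stuck disk from this thread; exact fields and output vary by release):

    camcontrol tags da19 -v     # CAM's view: dev_active shows commands still outstanding on the device
    gstat -f da19               # GEOM's view: L(q) is the number of BIOs queued and not yet completed
    procstat -kk -a | grep zio  # kernel stacks: ZFS threads sleeping while they wait on the stuck zio

If dev_active stays non-zero and L(q) never drains, the requests really are
parked in the controller/driver rather than in ZFS, which matches the behaviour
described above.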
On Thu, May 07, 2015 at 01:00:40PM +0100, Steven Hartland wrote:
>
> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
> > On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
> >
> >>>>> How can I cancel these 24 requests?
> >>>>> Why don't these requests time out (3 hours already)?
> >>>>> How can I force a detach of this disk? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
> >>>>> Why doesn't ZFS (or geom) time the request out and reroute it to da18?
> >>>>>
> >>>> If they are in mirrors, in theory you can just pull the disk; isci will
> >>>> report to cam and cam will report to ZFS, which should all recover.
> >>> Yes, a zmirror with da18.
> >>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
> >> A single low-level request can only be handled by one device; if that
> >> device returns an error then ZFS will use the other device, but not until then.
> > Why aren't the next requests routed to da18?
> > The current request being stuck on da19 is unlikely but understandable, but why
> > is the whole pool stuck?
>
> It's still waiting for the request to the failed device to complete. As
> far as ZFS currently knows there is nothing wrong with the device, as it
> has had no failures.

Can you explain some more? One request waiting, that I understand. Then I issue
the next request. Some information is needed from the vdev with the failed disk.
The failed disk is busier (its queue is longer), so why isn't the request routed
to the mirror disk? Or, for metadata, to a less busy vdev?

> You didn't say which FreeBSD version you were running?

10-STABLE, r281264.
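For reference, the reason ZFS keeps waiting rather than failing the disk is
that nothing below it has returned an error, and the only thing that will
eventually intervene is the ZFS "deadman" watchdog. On 10-STABLE that watchdog
is controlled by a handful of sysctls; the names below are from memory and
worth confirming with `sysctl -d` on the exact revision (r281264 here):

    sysctl vfs.zfs.deadman_enabled        # whether the deadman acts at all
    sysctl vfs.zfs.deadman_synctime_ms    # how long an outstanding I/O may run before it is considered hung
    sysctl vfs.zfs.deadman_checktime_ms   # how often the hang check runs

Until that timer expires, or the device/driver returns an error, ZFS considers
da19 healthy, so it will not reissue the outstanding I/O to da18, and anything
in the pool that depends on that I/O (a transaction group sync, for example)
backs up behind it.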
On May 7, 2015, at 8:00 AM, Steven Hartland <killing at multiplay.co.uk> wrote:
> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>
>>>>>> How can I cancel these 24 requests?
>>>>>> Why don't these requests time out (3 hours already)?
>>>>>> How can I force a detach of this disk? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
>>>>>> Why doesn't ZFS (or geom) time the request out and reroute it to da18?
>>>>>>
>>>>> If they are in mirrors, in theory you can just pull the disk; isci will
>>>>> report to cam and cam will report to ZFS, which should all recover.
>>>> Yes, a zmirror with da18.
>>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
>>> A single low-level request can only be handled by one device; if that
>>> device returns an error then ZFS will use the other device, but not until then.
>> Why aren't the next requests routed to da18?
>> The current request being stuck on da19 is unlikely but understandable, but why
>> is the whole pool stuck?
>
> It's still waiting for the request to the failed device to complete. As far as ZFS currently knows there is nothing wrong with the device, as it has had no failures.

Maybe related to this, but if the drive stalls indefinitely, is that what leads
to the "panic: I/O to pool 'poolname' appears to be hung on vdev guid GUID-ID at
'/dev/somedevice'."?

I have a 6-disk RAIDZ2 pool that is used for nightly rsync backups from various
systems. I believe one of the drives is a bit temperamental. Very occasionally I
discover the backup has failed and the machine has actually panicked because of
this drive, with a panic message like the above. The panic backtrace includes
references to vdev_deadman, which sounds like some sort of dead man's
switch/watchdog.

It's a bit counter-intuitive that a single drive wandering off into la-la land
can not only cause an entire ZFS pool to wedge but, worse still, panic the whole
machine.

If I'm understanding this thread correctly, part of the problem is that an I/O
that never completes is not the same thing as a failure to ZFS, and hence ZFS
cannot call upon the various resources in the pool and mechanisms at its
disposal to correct for it. Is that accurate? I would think that never-ending
I/O requests are a type of failure ZFS could sustain. It seems from the "hung on
vdev" panic that it does detect the situation, though the resolution (panic) is
not ideal. :-)

Cheers,

Paul.
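On the panic itself: if the sysctl names above are right for your release, the
deadman's action can be disabled (via sysctl if it is read-write on your
release, otherwise as a loader tunable in /boot/loader.conf), which trades the
panic for a pool that simply stays wedged until the drive is dealt with. A
sketch, to be verified before use:

    sysctl vfs.zfs.deadman_enabled=0                          # stop the deadman from firing when I/O appears hung
    echo 'vfs.zfs.deadman_enabled=0' >> /etc/sysctl.conf      # keep the setting across reboots

Whether that is an improvement depends on the workload; on an unattended backup
box a silent hang may be harder to notice than a panic and reboot.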