Casper.Dik at Sun.COM
2006-Apr-05 18:34 UTC
[zfs-discuss] Hung mirror severely impacts performance
One half of a mirror (one of a pair of D1000s) has gone away in a rather awkward fashion: all SCSI bus requests end in a timeout. This seems to slow all I/O to a crawl. I would expect ZFS to just give up on that half of the mirror and continue working until I reattach the other half. How can I fix this without driving to the office? Rebooting does NOT help.

Casper
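Short of driving in, one thing that might help (untested in this exact hung-bus failure mode) is to manually offline the disks on the dead side so ZFS stops issuing I/O to them. A rough sketch, where the pool name "tank" and disk "c1t0d0" are placeholders for the real names:

    # Offline a disk behind the hung D1000 so ZFS stops waiting on it
    zpool offline tank c1t0d0
    zpool status tank          # the device should now show as OFFLINE

    # Once the enclosure is reattached, bring the disk back and let it resilver
    zpool online tank c1t0d0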
Richard Elling
2006-Apr-05 19:09 UTC
[zfs-discuss] Hung mirror severely impacts performance
On Wed, 2006-04-05 at 20:34 +0200, Casper.Dik at sun.com wrote:

> One half of a mirror (one of a pair of D1000s) has gone away in a
> rather awkward fashion: all SCSI bus requests end in a timeout.

Several possibilities here, almost all are mechanical. Alas, one of the deficiencies in the D1000 design is that it doesn't have very good remote management capabilities.

> This seems to slow all I/O to a crawl.

By default, sd timeouts are 60 seconds with 5 retries. But it really does depend on the exact fault. For example, if you pull a disk drive (a common technique for people who think a missing drive is the same as a dead drive) then we detect that rather quickly and return EIO (or some error?) quickly. In other words, it can take a long time to know that an iop didn't succeed.

> I would expect ZFS to just give up on that half of the mirror and
> continue working until I reattach the other half.

I haven't yet studied how ZFS determines that a whole drive is bad. For some other LVMs, they try additional iops which will run into the same timeout/retry delays.

> How can I fix this without driving to the office?

I'm surprised you don't have a minion yet :-)

 -- richard
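For what it's worth, that sd timeout can be inspected, and lowered with care, on a running system. This is only a sketch; sd_io_time is the documented global I/O timeout tunable in the sd driver, but verify the name and default against your Solaris release before poking it:

    # Inspect the sd driver's global I/O timeout (seconds; default 0x3c = 60)
    echo "sd_io_time/D" | mdb -k

    # Lower it on the live kernel (use with care; lasts only until reboot)
    echo "sd_io_time/W 0t10" | mdb -kw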
Casper.Dik at Sun.COM
2006-Apr-05 19:17 UTC
[zfs-discuss] Hung mirror severely impacts performance
> By default, sd timeouts are 60 seconds with 5 retries. But it really
> does depend on the exact fault. For example, if you pull a disk drive
> (a common technique for people who think a missing drive is the same
> as a dead drive) then we detect that rather quickly and return EIO
> (or some error?) quickly. In other words, it can take a long time to
> know that an iop didn't succeed.

It knows the drives are faulted and lists them.

Yet it (or something) continues to access them.

Casper
Torrey McMahon
2006-Apr-05 20:32 UTC
[zfs-discuss] Hung mirror severely impacts performance
Casper.Dik at sun.com wrote:

>> By default, sd timeouts are 60 seconds with 5 retries. But it really
>> does depend on the exact fault. For example, if you pull a disk drive
>> (a common technique for people who think a missing drive is the same
>> as a dead drive) then we detect that rather quickly and return EIO
>> (or some error?) quickly. In other words, it can take a long time to
>> know that an iop didn't succeed.
>
> It knows the drives are faulted and lists them.
>
> Yet it (or something) continues to access them.

cfgadm -c unconfigure cX (???)
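i.e. something along these lines, with the controller and attachment-point names below purely illustrative; check cfgadm -al for the real ones on the box:

    # List attachment points to find the controller behind the dead D1000
    cfgadm -al

    # Unconfigure the whole SCSI controller so the sd driver stops retrying it
    cfgadm -c unconfigure c1

    # Or unconfigure a single disk on that bus
    cfgadm -c unconfigure c1::dsk/c1t0d0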
On Wed, Apr 05, 2006 at 09:17:28PM +0200, Casper.Dik at sun.com wrote:

> It knows the drives are faulted and lists them.
>
> Yet it (or something) continues to access them.

This is quite strange. If 'zpool status' displays the drives as faulted, this means that their internal vdev state is VDEV_STATE_CANT_OPEN. This means that vdev_is_dead() should return TRUE. My impression from looking at the vdev_disk code is that this should bypass any actual I/O.

Can you use the DTrace io provider to determine where the I/O attempts are coming from?

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
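A possible starting point for that, as a sketch: args[1] in the io provider is the devinfo_t, and aggregating on the kernel stack should show whether the retries originate in ZFS or somewhere else:

    # Count physical I/O by issuing process, target device and kernel stack
    dtrace -n 'io:::start
    {
            @[execname, args[1]->dev_statname, stack()] = count();
    }'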
grant beattie
2006-Apr-06 00:42 UTC
[zfs-discuss] Hung mirror severely impacts performance
On Wed, Apr 05, 2006 at 12:09:34PM -0700, Richard Elling wrote:

> Several possibilities here, almost all are mechanical. Alas, one of
> the deficiencies in the D1000 design is that it doesn't have very
> good remote management capabilities.
>
> > This seems to slow all I/O to a crawl.
>
> By default, sd timeouts are 60 seconds with 5 retries. But it really
> does depend on the exact fault. For example, if you pull a disk drive
> (a common technique for people who think a missing drive is the same
> as a dead drive) then we detect that rather quickly and return EIO
> (or some error?) quickly. In other words, it can take a long time to
> know that an iop didn't succeed.

I saw something similar to what Casper describes when removing a cable or disks from a D1000, yet, if I recall correctly, ZFS did not detach the mirror and merrily kept on trying to access the unavailable disks. All I/O to the entire pool stopped.

I'll try to reproduce this soon and report my findings.

grant.
On Thu, Apr 06, 2006 at 10:42:00AM +1000, grant beattie wrote:

> I saw something similar to what Casper describes when removing a
> cable or disks from a D1000, yet, if I recall correctly, ZFS did not
> detach the mirror and merrily kept on trying to access the
> unavailable disks. All I/O to the entire pool stopped.

Not detaching the mirror is expected. With the upcoming hot spare support, we would have automatically spared in an available hot spare if one had been configured.

Merrily trying to access the unavailable disk is bad. A dead vdev should shun any I/O sent to it and return immediately. Note that "dead" here implies that it was physically removed (as in your case), and not "arbitrarily unresponsive". We don't yet have the FMA diagnosis engine capable of analyzing the latter and determining that a disk is "dead". But it's coming.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
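Once that hot spare support ships, the configuration side is expected to look roughly like the following; the pool name "tank" and disk "c3t0d0" are placeholders, and the syntax should be checked against the zpool(1M) man page of the build you are running:

    # Add a disk to the pool as a hot spare
    zpool add tank spare c3t0d0

    # 'zpool status' should then list it under a "spares" section
    zpool status tank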