Ross
2008-Aug-28 08:08 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion.

I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity; what it can't do is guarantee data availability. The problem is, the way ZFS is marketed, people expect it to be able to do just that.

This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.

- ZFS timeouts to improve pool availability when no timely response is received from storage drivers.

And my reason for asking for these is that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors:

Aug 2008: AMD SB600 - System hang
- http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008: Supermicro SAT2-MV8 - System hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang
- http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008: iSCSI - ZFS hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007: Supermicro SAT2-MV8 - system hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007: Fibre channel
- http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible, as it's going to affect the perception of ZFS as a reliable system.

The common factor in all of these is that either the Solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred on hardware that definitely works fine under Windows), so I'm wondering: is there anything that can be done to prevent either type of lockup in these situations?

Firstly, for the OS: if a storage component (hardware or driver) fails for a non-essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but certainly in some of the cases above the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated such that the OS (and hence ZFS) can report a problem with a particular driver without hanging the entire system?

Please note: I know work is being done on FMA to handle all kinds of bugs; I'm not talking about that. It seems to me that FMA involves proper detection and reporting of bugs, which involves knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler, something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used for many, many types of devices, and that means there are a whole host of drivers being used, and a lot of scope for bugs in those drivers.
I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error checking code at a layer above the drivers, in such a way that it's able to trap major problems without hanging the OS? i.e. update ZFS/Solaris so they can handle storage layer bugs gracefully without downing the entire system.

My second suggestion is to ask if ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a "WAITING" status flag for drives, so that if one isn't responding quickly, ZFS can flag it as "WAITING" and attempt to read or write the same data from elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools hanging when a single drive fails.

The ZFS update I feel is particularly appropriate. ZFS already uses checksumming since it doesn't trust drivers or hardware to always return the correct data. But ZFS then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere; why shouldn't a timeout do the same thing?

Ross

This message posted from opensolaris.org
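[Editor's note: purely as an illustration of the "WAITING" idea above, here is a minimal userspace C sketch. It is not ZFS code; the deadline, state names, and simulated latencies are all invented for the example.]

/*
 * Toy model of the proposal: an admin-set per-pool deadline after which a
 * read is re-issued to another replica while the slow device is merely
 * flagged WAITING, not faulted.  All names and numbers are hypothetical.
 */
#include <stdio.h>

typedef enum { DEV_ONLINE, DEV_WAITING, DEV_FAULTED } dev_state_t;

typedef struct {
	const char	*name;
	dev_state_t	state;
	int		sim_latency_ms;	/* simulated response time */
} vdev_t;

/* Pretend read: "completes" only if the device answers within the deadline. */
static int
try_read(vdev_t *dv, int deadline_ms)
{
	if (dv->state != DEV_ONLINE)
		return (-1);			/* skip flagged devices */
	if (dv->sim_latency_ms > deadline_ms) {
		dv->state = DEV_WAITING;	/* flag it, don't fault it */
		printf("%s: no answer in %d ms, flagged WAITING\n",
		    dv->name, deadline_ms);
		return (-1);
	}
	printf("%s: read satisfied in %d ms\n", dv->name, dv->sim_latency_ms);
	return (0);
}

int
main(void)
{
	int pool_deadline_ms = 2000;		/* admin-tunable, per pool */
	vdev_t mirror[2] = {
		{ "disk-A", DEV_ONLINE, 30000 },	/* hung in firmware retries */
		{ "disk-B", DEV_ONLINE, 8 },
	};

	/* Walk the mirror children until one answers inside the deadline. */
	for (int i = 0; i < 2; i++)
		if (try_read(&mirror[i], pool_deadline_ms) == 0)
			break;
	return (0);
}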
Bob Friesenhahn
2008-Aug-28 16:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Ross wrote:
> I believe ZFS should apply the same tough standards to pool
> availability as it does to data integrity. A bad checksum makes ZFS
> read the data from elsewhere, why shouldn't a timeout do the same
> thing?

A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two.

If the system or device is simply overwhelmed with work, then you would not want the system to go haywire and make the problems much worse.

Which of these do you prefer?

o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost.

o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Eric Schrock
2008-Aug-28 16:52 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross, thanks for the feedback. A couple points here -

A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older Nevada builds that didn't have these fixes. Anything from the pre-2008 (or pre-s10u5) timeframe should be taken with a grain of salt.

There is a fix in the immediate future to prevent I/O timeouts from hanging other parts of the system - namely administrative commands and other pool activity. So I/O to that particular pool will hang, but you'll still be able to run your favorite ZFS commands, and it won't impact the ability of other pools to run.

We have some good ideas on how to improve the retry logic. There is a flag in Solaris, B_FAILFAST, that tells the drive not to try too hard getting the data. However, it can return failure when trying harder would produce the correct results. Currently, we try the first I/O with B_FAILFAST, and if that fails we immediately retry without the flag. The idea is to elevate the retry logic to a higher level, so when a read from one side of a mirror fails with B_FAILFAST, instead of immediately retrying the same device without the failfast flag, we push the error higher up the stack and issue another B_FAILFAST I/O to the other half of the mirror. Only if both fail with failfast do we try a more thorough request (though with ditto blocks we may try another vdev altogether). This should improve I/O error latency for a subset of failure scenarios, and biasing reads away from degraded (but not faulty) devices should also improve response time. The tricky part is incorporating this into the FMA diagnosis engine, as devices may fail B_FAILFAST requests for a variety of non-fatal reasons.

Finally, imposing additional timeouts in ZFS is a bad idea. ZFS is designed to be a generic storage consumer. It can be layered on top of directly attached disks, SSDs, SAN devices, iSCSI targets, files, and basically anything else. As such, it doesn't have the necessary context to know what constitutes a reasonable timeout. This is explicitly delegated to the underlying storage subsystem. If a storage subsystem is timing out for excessive periods of time when B_FAILFAST is set, then that's a bug in the storage subsystem, and working around it in ZFS with yet another set of tunables is not practical. It will be interesting to see if this is an issue after the retry logic is modified as described above.

Hope that helps,

- Eric
--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
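[Editor's note: a small userspace C sketch of the retry ordering Eric describes -- every mirror child gets a fail-fast attempt before any child gets the slow, thorough retry. The outcomes are simulated and none of these function names come from the ZFS source.]

/*
 * Sketch: first pass issues cheap "failfast" reads to every child;
 * only if all of those fail does the second, patient pass begin.
 */
#include <stdio.h>
#include <stdbool.h>

#define	NCHILDREN	2

/* Simulated outcome of a failfast vs. thorough read on each child. */
static bool failfast_ok[NCHILDREN] = { false, true };
static bool thorough_ok[NCHILDREN] = { true,  true };

static bool
child_read(int child, bool failfast)
{
	bool ok = failfast ? failfast_ok[child] : thorough_ok[child];

	printf("child %d, %s read: %s\n", child,
	    failfast ? "failfast" : "thorough", ok ? "ok" : "error");
	return (ok);
}

static bool
mirror_read(void)
{
	/* Pass 1: impatient attempts on every child. */
	for (int c = 0; c < NCHILDREN; c++)
		if (child_read(c, true))
			return (true);

	/* Pass 2: only now pay for the full-retry attempts. */
	for (int c = 0; c < NCHILDREN; c++)
		if (child_read(c, false))
			return (true);

	return (false);		/* redundancy exhausted; ditto blocks next */
}

int
main(void)
{
	printf("read %s\n", mirror_read() ? "succeeded" : "failed");
	return (0);
}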
Miles Nordin
2008-Aug-28 18:17 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> Finally, imposing additional timeouts in ZFS is a bad idea.
    es> [...] As such, it doesn't have the necessary context to know
    es> what constitutes a reasonable timeout.

you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same.

The former case eliminates the need for the mirror policies in SVM, which Ian requested a few hours ago for the situation where half the mirror is a slow iSCSI target for geographic redundancy and half is faster/local. Some care would have to be taken for targets shared by ZFS and some other initiator, but I'm not sure the care would really be that difficult to take, or that the oscillations induced by failing to take it would really be particularly harmful compared to unsupervised contention for a device.

The latter quickly notices drives that have been pulled, or, for Richard's ``overwhelmingly dominant'' case, drives which are stalled for 30 seconds pending their report of an unrecovered read.

Developing meaningful performance statistics for drives and a tool for displaying them would be useful in itself, not just for stopping freezes and preventing a failing drive from degrading performance a thousandfold. Issuing reads to redundant devices is cheap compared to freezing. The policy with which it's done is highly tunable and should be fun to tune and watch, and the consequence if the policy makes the wrong choice isn't incredibly dire.

This B_FAILFAST architecture captures the situation really poorly. First, it's not implementable in any serious way with near-line drives, or really with any drives with which you're not intimately familiar and in control of firmware/release-engineering, and perhaps not with any drives period. I suspect in practice it's more a controller-level feature, about whether or not you'd like to distrust the device's error report and start resetting busses and channels and mucking everything up trying to recover from some kind of ``weirdness''. It's not an answer to the known problem of drives stalling for 30 seconds when they start to fail.

First and a half, when it's not implemented, the system degrades to doubling your timeout pointlessly. A driver-level block cache of UNC's would probably have more value toward this speed/read-aggressiveness tradeoff than the whole B_FAILFAST architecture---just cache known unrecoverable read sectors, and refuse to issue further I/O for them until a timeout of 3 - 10 minutes passes. I bet this would speed up most failures tremendously, and without burdening upper layers with retry logic.

Second, B_FAILFAST entertains the fantasy that I/O's are independent, while what happens in practice is that the drive hits a UNC on one I/O, and won't entertain any further I/O's no matter what flags the request has on it or how many times you ``reset'' things.
Maybe you could try to rescue B_FAILFAST by putting clever statistics into the driver to compare the drive's performance to its recent past, as I suggested ZFS do, and admit no B_FAILFAST requests to the queues of drives that have suddenly slowed down, just fail them immediately without even trying. I submit this queueing and statistic collection is actually _better_ managed by ZFS than by the driver, because ZFS can compare a whole floating-point statistic across a whole vdev, while even a driver which is fancier than we ever dreamed is still playing poker with only 1 bit of input: ``I'll call,'' or ``I'll fold.'' ZFS can see all the cards and get better results while being stupider and requiring less clever poker-guessing than would be required by a hypothetical driver B_FAILFAST implementation that actually worked.
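[Editor's note: a toy C version of the sibling comparison Miles argues for -- demote a mirror child to write-only once its smoothed read latency is an order of magnitude worse than its best sibling, and emit an event. The 10x threshold, field names, and latencies are invented for illustration.]

#include <stdio.h>

#define	NCHILDREN	3

typedef struct {
	const char	*name;
	double		avg_ms;		/* smoothed recent read latency */
	int		read_ok;	/* still eligible for normal reads? */
} child_t;

static void
update_read_policy(child_t *c, int n)
{
	double best = c[0].avg_ms;

	for (int i = 1; i < n; i++)
		if (c[i].avg_ms < best)
			best = c[i].avg_ms;

	for (int i = 0; i < n; i++) {
		int was = c[i].read_ok;

		/* roughly "an order of magnitude slower" than the best */
		c[i].read_ok = (c[i].avg_ms <= best * 10.0);
		if (was && !c[i].read_ok)
			printf("event: %s demoted to write-only "
			    "(%.1f ms vs best %.1f ms)\n",
			    c[i].name, c[i].avg_ms, best);
	}
}

int
main(void)
{
	child_t mirror[NCHILDREN] = {
		{ "local-A", 6.0,   1 },
		{ "local-B", 7.5,   1 },
		{ "iscsi-C", 900.0, 1 },	/* slow remote half */
	};

	update_read_policy(mirror, NCHILDREN);
	return (0);
}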
Eric Schrock
2008-Aug-28 18:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote:
> you're right in terms of fixed timeouts, but there's no reason it
> can't compare the performance of redundant data sources, and if one
> vdev performs an order of magnitude slower than another set of vdevs
> with sufficient redundancy, stop issuing reads except scrubs/healing
> to the underperformer (issue writes only), and pass an event to FMA.

Yep, latency would be a useful metric to add to mirroring choices. The current logic is rather naive (round-robin) and could easily be enhanced. Making diagnoses based on this is much trickier, particularly at the ZFS level. A better option would be to leverage the SCSI FMA work going on to do a more intimate diagnosis at the scsa level. Also, the problem you are trying to solve - timing out the first I/O that takes a long time - is not captured well by the type of hysteresis you would need to perform in order to do this diagnosis. It certainly can be done, but is much better suited to diagnosing a failing drive over time, not aborting a transaction in response to immediate failure.

> This B_FAILFAST architecture captures the situation really poorly.

I don't think you understand how this works. Imagine two I/Os, just with different sd timeouts and retry logic - that's B_FAILFAST. It's quite simple, and independent of any hardware implementation.

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
Bob Friesenhahn
2008-Aug-28 18:59 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> you're right in terms of fixed timeouts, but there's no reason it
> can't compare the performance of redundant data sources, and if one
> vdev performs an order of magnitude slower than another set of vdevs
> with sufficient redundancy, stop issuing reads except scrubs/healing
> to the underperformer (issue writes only), and pass an event to FMA.

You are saying that I can't split my mirrors between a local disk in Dallas and a remote disk in New York accessed via iSCSI? Why don't you want me to be able to do that? ZFS already backs off from writing to slow vdevs.

> ZFS can also compare the performance of a drive to itself over time,
> and if the performance suddenly decreases, do the same.

While this may be useful for reads, I would hate to disable redundancy just because a device is currently slow.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Ross Smith
2008-Aug-28 19:34 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi guys,

Bob, my thought was to have this timeout as something that can be optionally set by the administrator on a per pool basis. I'll admit I was mainly thinking about reads and hadn't considered the write scenario, but even having thought about that it's still a feature I'd like. After all, this would be a timeout set by the administrator based on the longest delay they can afford for that storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected; as far as I'm concerned that disk is faulty. I'd be quite happy for the array to drop to a degraded mode based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated. It's good to hear you're working on this, and I love the idea of doing a B_FAILFAST read on both halves of the mirror.

I do have a question though. From what you're saying, the response time can't be consistent across all hardware, so you're once again at the mercy of the storage drivers. Do you know how long B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 seconds I would still consider that too slow I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but I don't see how this particular timeout does anything other than help ZFS when it's dealing with such a diverse range of media. I agree that ZFS can't know itself what should be a valid timeout, but that's exactly why this needs to be an optional administrator set parameter. The administrator of a storage array who wants to set this certainly knows what a valid timeout is for them, and these timeouts are likely to be several orders of magnitude larger than the standard response times. I would configure very different values for my SATA drives than for my iSCSI connections, but in each case I would be happier knowing that ZFS has more of a chance of catching bad drivers or unexpected scenarios.

I very much doubt hardware raid controllers would wait 3 minutes for a drive to return a response; they will have their own internal timeouts to know when a drive has failed, and while ZFS is dealing with very different hardware I can't help but feel it should have that same approach to management of its drives.

However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out.

Ross
Miles Nordin
2008-Aug-28 19:40 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> I don't think you understand how this works. Imagine two
    es> I/Os, just with different sd timeouts and retry logic - that's
    es> B_FAILFAST. It's quite simple, and independent of any
    es> hardware implementation.

AIUI the main timeout to which we should be subject, at least for nearline drives, is about 30 seconds long and is decided by the drive's firmware, not the driver, and can't be negotiated in any way that's independent of the hardware implementation, although sometimes there are dependent ways to negotiate it. The driver could also decide through ``retry logic'' to time out the command sooner, before the drive completes it, but this won't do much good because the drive won't accept a second command until ITS timeout expires.

which leads to the second problem, that we're talking about timeouts for individual I/O's, not marking whole devices. A ``fast'' timeout of even 1 second could cause a 100- or 1000-fold decrease in performance, which could end up being equivalent to a freeze depending on the type of load on the filesystem.
Eric Schrock
2008-Aug-28 20:05 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:
> Personally, if a SATA disk wasn't responding to any requests after 2
> seconds I really don't care if an error has been detected, as far as
> I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the disk, or the bus was reset, or...

> I do have a question though. From what you're saying, the response
> time can't be consistent across all hardware, so you're once again at
> the mercy of the storage drivers. Do you know how long B_FAILFAST
> takes to return a response on iSCSI? If that's over 1-2 seconds I
> would still consider that too slow I'm afraid.

Its main function is how it deals with retryable errors. If the drive responds with a retryable error, or any error at all, it won't attempt to retry again. If you have a device that is taking arbitrarily long to respond to successful commands (or to notice that a command won't succeed), it won't help you.

> I understand that Sun in general don't want to add fault management to
> ZFS, but I don't see how this particular timeout does anything other
> than help ZFS when it's dealing with such a diverse range of media. I
> agree that ZFS can't know itself what should be a valid timeout, but
> that's exactly why this needs to be an optional administrator set
> parameter. The administrator of a storage array who wants to set this
> certainly knows what a valid timeout is for them, and these timeouts
> are likely to be several orders of magnitude larger than the standard
> response times. I would configure very different values for my SATA
> drives than for my iSCSI connections, but in each case I would be
> happier knowing that ZFS has more of a chance of catching bad drivers
> or unexpected scenarios.

The main problem with exposing tunables like this is that they have a direct correlation to service actions, and mis-diagnosing failures costs everybody (admin, companies, Sun, etc) lots of time and money. Once you expose such a tunable, it will be impossible to trust any FMA diagnosis, because you won't be able to know whether it was a mistaken tunable.

A better option would be to not use this to perform FMA diagnosis, but instead work it into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier.

As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such "best effort RAS" is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer.

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
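[Editor's note: for illustration only, a C sketch of the bookkeeping Eric outlines, using Welford's online mean/variance per mirror child to (a) prefer the lower-latency child and (b) call an in-flight read overdue past mean plus three standard deviations. The 3-sigma test and all names are assumptions, not anything in ZFS.]

#include <stdio.h>
#include <math.h>

typedef struct {
	const char	*name;
	long		n;
	double		mean;
	double		m2;	/* running sum of squared deviations */
} child_stat_t;

/* Welford's online update with one latency sample, in milliseconds. */
static void
stat_update(child_stat_t *s, double ms)
{
	double d = ms - s->mean;

	s->n++;
	s->mean += d / s->n;
	s->m2 += d * (ms - s->mean);
}

static double
stat_stddev(const child_stat_t *s)
{
	return (s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0.0);
}

/* (a) which child should get the next read? */
static int
prefer_child(const child_stat_t *a, const child_stat_t *b)
{
	return (a->mean <= b->mean ? 0 : 1);
}

/* (b) has this outstanding read been in flight suspiciously long? */
static int
read_overdue(const child_stat_t *s, double in_flight_ms)
{
	return (in_flight_ms > s->mean + 3.0 * stat_stddev(s));
}

int
main(void)
{
	child_stat_t a = { "c1t0d0", 0, 0, 0 }, b = { "c1t1d0", 0, 0, 0 };
	double a_lat[] = { 6, 7, 6.5, 8, 6 };
	double b_lat[] = { 6, 90, 120, 85, 110 };

	for (int i = 0; i < 5; i++) {
		stat_update(&a, a_lat[i]);
		stat_update(&b, b_lat[i]);
	}
	printf("prefer child %d; 500 ms on %s overdue: %s\n",
	    prefer_child(&a, &b), b.name,
	    read_overdue(&b, 500) ? "yes" : "no");
	return (0);
}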
Miles Nordin
2008-Aug-28 20:31 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:

    bf> If the system or device is simply overwhelmed with work, then
    bf> you would not want the system to go haywire and make the
    bf> problems much worse.

None of the decisions I described it making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that?

    bf> You are saying that I can't split my mirrors between a local
    bf> disk in Dallas and a remote disk in New York accessed via
    bf> iSCSI?

nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation.

The performance-statistic logic should influence read scheduling immediately, and generate events which are fed to FMA; then FMA can mark devices faulty. There's no need for both to make the same decision at the same time. If the events aren't useful for diagnosis, ZFS could not bother generating them, or fmd could ignore them in its diagnosis. I suspect they *would* be useful, though. I'm imagining the read rescheduling would happen very quickly, quicker than one would want a round-trip from FMA, in much less than a second. That's why it would have to compare devices to others in the same vdev, and to themselves over time, rather than use fixed timeouts or punt to haphazard driver and firmware logic.

    bf> o System waits substantial time for devices to (possibly)
    bf> recover in order to ensure that subsequently written data has
    bf> the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written, unless you are calling fsync, and maybe not even then if there's a slog. I said clearly that you read only one half of the mirror, but write to both. But you're right that the trick probably won't work perfectly---eventually dead devices need to be faulted. The idea is that normal write caching will buy you orders of magnitude longer time in which to make a better decision before anyone notices.

Experience here is that ``waits substantial time'' usually means ``freezes for hours and gets rebooted''. There's no need to be abstract: we know what happens when a drive starts taking 1000x - 2000x longer than usual to respond to commands, and we know that this is THE common online failure mode for drives. That's what started the thread. so, think about this: hanging for an hour trying to write to a broken device may block other writes to devices which are still working, until the patiently-waiting data is eventually lost in the reboot.

    bf> o System immediately ignores slow devices and switches to
    bf> non-redundant non-fail-safe non-fault-tolerant
    bf> may-lose-your-data mode. When system is under intense load,
    bf> it automatically switches to the may-lose-your-data mode.

nobody's proposing a system which silently rocks back and forth between faulted and online. That's not what we have now, and no such system would naturally arise. If FMA marked a drive faulty based on performance statistics, that drive would get retired permanently and hot-spare-replaced. Obviously false positives are bad, just as obviously as freezes/reboots are bad.

It's not my idea to use FMA in this way.
This is how FMA was pitched, and the excuse for leaving good exception handling out of ZFS for two years. so, where's the beef?
Ian Collins
2008-Aug-28 21:15 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Eric Schrock writes:
> A better option would be to not use this to perform FMA diagnosis, but
> instead work it into the mirror child selection code. This has already
> been alluded to before, but it would be cool to keep track of latency
> over time, and use this to both a) prefer one drive over another when
> selecting the child and b) proactively timeout/ignore results from one
> child and select the other if it's taking longer than some historical
> standard deviation. This keeps away from diagnosing drives as faulty,
> but does allow ZFS to make better choices and maintain response times.
> It shouldn't be hard to keep track of the average and/or standard
> deviation and use it for selection; proactively timing out the slow I/Os
> is much trickier.

This would be a good solution to the remote iSCSI mirror configuration. I've been working through this situation with a client (we have been comparing ZFS with Cleversafe) and we'd love to be able to get the read performance of the local drives from such a pool.

> As others have mentioned, things get more difficult with writes. If I
> issue a write to both halves of a mirror, should I return when the first
> one completes, or when both complete? One possibility is to expose this
> as a tunable, but any such "best effort RAS" is a little dicey because
> you have very little visibility into the state of the pool in this
> scenario - "is my data protected?" becomes a very difficult question to
> answer.

One solution (again, to be used with a remote mirror) is the three way mirror. If two devices are local and one remote, data is safe once the two local writes return. I guess the issue then changes from "is my data safe" to "how safe is my data". I would be reluctant to deploy a remote mirror device without local redundancy, so this probably won't be an uncommon setup. There would have to be an acceptable window of risk when local data isn't replicated.

Ian
Ian Collins
2008-Aug-28 21:21 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin writes:
>>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
>     bf> You are saying that I can't split my mirrors between a local
>     bf> disk in Dallas and a remote disk in New York accessed via
>     bf> iSCSI?
>
> nope, you've misread. I'm saying reads should go to the local disk
> only, and writes should go to both. See SVM's 'metaparam -r'. I
> suggested that unlike the SVM feature it should be automatic, because
> by so being it becomes useful as an availability tool rather than just
> performance optimisation.

So on a server with a read workload, how would you know if the remote volume was working?

Ian
Bob Friesenhahn
2008-Aug-28 21:27 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> None of the decisions I described it making based on performance
> statistics are ``haywire''---I said it should funnel reads to the
> faster side of the mirror, and do this really quickly and
> unconservatively. What's your issue with that?

From what I understand, this is partially happening now based on average service time. If I/O is backed up for a device, then the other device is preferred.

However it is good to keep in mind that if data is never read, then it is never validated and corrected. It is good for ZFS to read data sometimes.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
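[Editor's note: a trivial C sketch of Bob's point -- bias reads heavily toward the preferred mirror child, but still send the occasional read to the slow one so its blocks keep being checksum-verified. The 1-in-16 ratio is an arbitrary example, not anything ZFS does.]

#include <stdio.h>
#include <stdlib.h>

/* Pick child 0 (preferred/fast) most of the time, child 1 occasionally. */
static int
pick_child(void)
{
	return ((rand() % 16) == 0 ? 1 : 0);
}

int
main(void)
{
	int counts[2] = { 0, 0 };

	srand(1);
	for (int i = 0; i < 1000; i++)
		counts[pick_child()]++;
	printf("fast child: %d reads, slow child: %d reads\n",
	    counts[0], counts[1]);
	return (0);
}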
Bill Sommerfeld
2008-Aug-28 21:46 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
> A better option would be to not use this to perform FMA diagnosis, but
> instead work it into the mirror child selection code. This has already
> been alluded to before, but it would be cool to keep track of latency
> over time, and use this to both a) prefer one drive over another when
> selecting the child and b) proactively timeout/ignore results from one
> child and select the other if it's taking longer than some historical
> standard deviation. This keeps away from diagnosing drives as faulty,
> but does allow ZFS to make better choices and maintain response times.
> It shouldn't be hard to keep track of the average and/or standard
> deviation and use it for selection; proactively timing out the slow I/Os
> is much trickier.

tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network.

it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance).

I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience.

- Bill
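[Editor's note: a C sketch of the RFC 2988-style estimator Bill points at, recast for device latency. The gains (alpha = 1/8, beta = 1/4) and K = 4 follow the RFC; the millisecond floor and the application to disk I/O are assumptions for illustration.]

#include <stdio.h>
#include <math.h>

typedef struct {
	double	srtt;	/* smoothed latency */
	double	rttvar;	/* smoothed mean deviation */
	double	rto;	/* "overdue" threshold */
	int	seeded;
} lat_est_t;

/* Feed one completed-I/O latency sample into the estimator. */
static void
lat_sample(lat_est_t *e, double r_ms)
{
	const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0, k = 4.0;
	const double floor_ms = 10.0;	/* placeholder granularity */

	if (!e->seeded) {
		e->srtt = r_ms;
		e->rttvar = r_ms / 2.0;
		e->seeded = 1;
	} else {
		e->rttvar = (1.0 - beta) * e->rttvar + beta * fabs(e->srtt - r_ms);
		e->srtt = (1.0 - alpha) * e->srtt + alpha * r_ms;
	}
	e->rto = e->srtt + (k * e->rttvar > floor_ms ? k * e->rttvar : floor_ms);
}

int
main(void)
{
	lat_est_t e = { 0 };
	double samples[] = { 8.0, 9.0, 7.5, 8.2, 30.0, 8.1 };
	int n = sizeof (samples) / sizeof (samples[0]);

	for (int i = 0; i < n; i++) {
		lat_sample(&e, samples[i]);
		printf("sample %5.1f ms -> srtt %5.1f, overdue after %6.1f ms\n",
		    samples[i], e.srtt, e.rto);
	}
	return (0);
}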
Richard Elling
2008-Aug-28 23:24 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Bill Sommerfeld wrote:
> On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work it into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.
>
> tcp has to solve essentially the same problem: decide when a response is
> "overdue" based only on the timing of recent successful exchanges in a
> context where it's difficult to make assumptions about "reasonable"
> expected behavior of the underlying network.
>
> it tracks both the smoothed round trip time and the variance, and
> declares a response overdue after (SRTT + K * variance).
>
> I think you'd probably do well to start with something similar to what's
> described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
> experience.

I think this is a good place to start. In general, we can see 3 orders of magnitude range for magnetic disk I/Os, 4 orders of magnitude for power managed disks. With that range, I don't see the variance being small, at least for magnetic disks. SSDs will have a much smaller variance, in general. For lopsided mirrors, such as magnetic disk mirrored to SSD, or Bob's Dallas vs New York paths, we should be able to automatically steer towards the faster side.

However, a comprehensive solution must also deal with top-level vdev usage, which can be very different than the physical vdevs. We can use driver-level FMA for the physical vdevs, but ultimately ZFS will need to be able to make decisions based on the response time across the top-level vdevs. This can be implemented in two phases, of course.

I've got some lopsided mirror TNF data, so we could fairly easily try some algorithms... I'll whip it into shape for further analysis.
 -- richard
Nicolas Williams
2008-Aug-29 15:48 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
> Which of these do you prefer?
>
> o System waits substantial time for devices to (possibly) recover in
>   order to ensure that subsequently written data has the least
>   chance of being lost.
>
> o System immediately ignores slow devices and switches to
>   non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
>   mode. When system is under intense load, it automatically
>   switches to the may-lose-your-data mode.

Given how long a resilver might take, waiting some time for a device to come back makes sense. Also, if a cable was taken out, or drive tray powered off, then you'll see lots of drives timing out, and then the better thing to do is to wait (heuristic: not enough spares to recover).

Nico
--
Nicolas Williams
2008-Aug-29 16:02 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote:
> As others have mentioned, things get more difficult with writes. If I
> issue a write to both halves of a mirror, should I return when the first
> one completes, or when both complete? One possibility is to expose this
> as a tunable, but any such "best effort RAS" is a little dicey because
> you have very little visibility into the state of the pool in this
> scenario - "is my data protected?" becomes a very difficult question to
> answer.

Depending on the amount of redundancy left one might want the writes to continue. E.g., a 3-way mirror with one vdev timing out or going extra slow, or Richard's lopsided mirror example.

The value of "best effort RAS" might make a useful property for mirrors and RAIDZ-2. If because of some slow vdev you've got less redundancy for recent writes, but still have enough (for some value of "enough"), and still have full redundancy for older writes, well, that's not so bad.

Something like:

% # require successful writes to at least two mirrors and wait no more
% # than 15 seconds for the 3rd.
% zpool create mypool mirror ... mirror ... mirror ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=15s mypool

and for known-to-be-lopsided mirrors:

% # require successful writes to at least two mirrors and don't wait for
% # the slow vdevs
% zpool create mypool mirror ... mirror ... mirror -slow ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=0s mypool

?

Nico
--
Richard Elling
2008-Aug-29 18:07 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Nicolas Williams wrote:
> On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
>> Which of these do you prefer?
>>
>> o System waits substantial time for devices to (possibly) recover in
>>   order to ensure that subsequently written data has the least
>>   chance of being lost.
>>
>> o System immediately ignores slow devices and switches to
>>   non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
>>   mode. When system is under intense load, it automatically
>>   switches to the may-lose-your-data mode.
>
> Given how long a resilver might take, waiting some time for a device to
> come back makes sense. Also, if a cable was taken out, or drive tray
> powered off, then you'll see lots of drives timing out, and then the
> better thing to do is to wait (heuristic: not enough spares to recover).

argv! I didn't even consider switches. Ethernet switches often use spanning-tree algorithms to converge on the topology. I'm not sure what SAN switches use. We have the following problem with highly available clusters which use switches in the interconnect:

+ Solaris Cluster interconnect timeout defaults to 10 seconds
+ STP can take > 30 seconds to converge

So, if you use Ethernet switches in the interconnect, you need to disable STP on the ports used for interconnects or risk unnecessary cluster reconfigurations. Normally, this isn't a problem as the people who tend to build HA clusters also tend to read the docs which point this out. Still, a few slip through every few months. As usual, Solaris Cluster gets blamed, though it really is a systems engineering problem.

Can we expect a similar attention to detail for ZFS implementers? I'm afraid not :-(. I'm not confident we can be successful with sub-minute reconfiguration, so the B_FAILFAST may be the best we could do for the general case. That isn't so bad; in fact we use failfasts rather extensively for Solaris Clusters, too.
 -- richard
Miles Nordin
2008-Aug-29 21:14 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> The main problem with exposing tunables like this is that they
    es> have a direct correlation to service actions, and
    es> mis-diagnosing failures costs everybody (admin, companies,
    es> Sun, etc) lots of time and money. Once you expose such a
    es> tunable, it will be impossible to trust any FMA diagnosis,

Yeah, I tend to agree that the constants shouldn't be tunable, because I hoped Sun would become a disciplined collection-point for experience to set the constants, discipline meaning the constants are only adjusted in response to bad diagnosis not ``preference,'' and in a direction that improves diagnosis for everyone, not for ``the site''. I'm not yet won over to the idea that statistical FMA diagnosis constants shouldn't exist. I think drives can't diagnose themselves for shit, and I think drivers these days are diagnosees, not diagnosers. But clearly a confusingly-bad diagnosis is much worse than diagnosis that's bad in a simple way.

    es> If I issue a write to both halves of a mirror, should
    es> I return when the first one completes, or when both complete?

well, if it's not a synchronous write, you return before you've written either half of the mirror, so it's only an issue for O_SYNC/ZIL writes, true? BTW what does ZFS do right now for synchronous writes to mirrors, wait for all, wait for two, or wait for one?

    es> any such "best effort RAS" is a little dicey because you have
    es> very little visibility into the state of the pool in this
    es> scenario - "is my data protected?" becomes a very difficult
    es> question to answer.

I think it's already difficult. For example, a pool will say ONLINE while it's resilvering, won't it? I might be wrong. Take a pool that can only tolerate one failure. Is the difference between replacing an ONLINE device (still redundant) and replacing an OFFLINE device (not redundant until resilvered) captured? Likewise, should a pool with a spare in use really be marked DEGRADED both before the spare resilvers and after? The answers to the questions aren't important so much as that you have to think about the answers---what should they be, what are they now---which means ``is my data protected?'' is already a difficult question to answer.

Also there were recently fixed bugs with DTL. The status of each device's DTL, even the existence and purpose of the DTL, isn't well-exposed to the admin, and is relevant to answering the ``is my data protected?'' question---indirect means of inspecting it like tracking the status of resilvering seem too wallpapered given that the bug escaped notice for so long. I agree with the problem 100% and don't wish to worsen it, just disagree that it's a new one.

    re> 3 orders of magnitude range for magnetic disk I/Os, 4 orders
    re> of magnitude for power managed disks.

I would argue for power management a fixed timeout. The time to spin up doesn't have anything to do with the io/s you got before the disk spun down. There's no reason to disguise the constant for which we secretly wish inside some fancy math for deriving it just because writing down constants feels bad. unless you _know_ the disk is spinning up through some in-band means, and want to compare its spinup time to recorded measurements of past spinups.

This is a good case for pointing out there are two sets of rules:

* 'metaparam -r' rules

  + not invoked at all if there's no redundancy.

  + very complicated

    - involve sets of disks, not one disk.
      comparison of statistic among disks within a vdev (definitely),
      and comparison of individual disks to themselves over time
      (possibly).

    - complicated output: rules return a set of disks per vdev, not a
      yay-or-nay diagnosis per disk. And there are two kinds of output
      decision:

      o for n-way mirrors, select anywhere from 1 to n disks. for
        example, a three-way mirror with two fast local mirrors, one
        slow remote iSCSI mirror, should split reads among the two
        local disks. for raidz and raidz2 they can eliminate 0, 1 (or
        2) disks from the read-us set. It's possible to issue all the
        reads and take the first sufficient set to return as Anton
        suggested, but I imagine 4-device raidz2 vdevs will be common,
        which could some day perform as well as a 2-device mirror.

      o also, decide when to stop waiting on an existing read and
        re-issue it. so the decision is not only about future reads,
        but has to cancel already-issued reads, possibly replacing the
        B_FAILFAST mechanism so there will be a second uncancellable
        round of reads once the first round exhausts all redundancy.

      o that second decision needs to be made thousands of times per
        second without a lot of CPU overhead

  + small consequence if the rules deliver false-positives, just
    reduced performance (which is the same with the TCP fast-retransmit
    rules Bill mentioned)

  + large consequence for false-negatives (system freeze), so one can't
    really say, ``we won't bother doing it for raidz2 because it's too
    complicated.'' The rules are NOT just about optimizing performance.

  + at least partly in kernel

* diagnosis rules

  + should it be invoked for single-device vdevs? Does ZFS diagnosis
    already consider that a device in an unredundant vdev should be
    FAULTED less aggressively (ex., never for CKSUM errors)? this is
    arguable.

  + diagnosis is strictly per-disk and should compare disks only to
    themselves, or to cultural memory of The Typical Disk in the form
    of untunable constants, never others in the same vdev

  + three possible verdicts per disk:

    - all's good

    - warn the sysadmin about this disk but keep writing to it

    - fault this disk in ZFS. no further I/O, not even writes, and
      start rebuilding it onto a spare

    Erik points out that false positives are expensive in BOTH cases,
    not just the second, because even the warning can initiate
    expensive repair procedures and reduce trust in FMA diagnoses. so,
    there should probably be only two verdicts, good and fault. If the
    statistics are extractable, more aggressive sysadmins can devise
    their own warning rules and competitively try to predict the
    future. The owners of large clusters might be better at crafting
    warning rules than Sun, but their results won't be general.

  + potentially complicated, but might be really simple, like ``an I/O
    takes more than three minutes to complete.''

  + A more complicated but still somewhat simple hypothetical rule:
    ``one I/O hasn't returned completion or failure after 10 minutes,
    OR at least one I/O originally issued to the driver from within
    each of three separate four-minute-long buckets within the last 40
    minutes takes 1000 times longer than usual or more than 120
    seconds, whichever is larger (three slow I/O's in recent past)''
    These might be really bad rules. my point is that variance, or some
    more complicated statistic than addition and buckets, might be good
    for diagnosing bad disks but isn't necessarily required, while for
    the 'metaparam -r' rules it IS required.
    for diagnosing bad disks, a big bag of traditional-AI rules might
    be better than statistical/machine-learning rules, and will be
    easier for less-sophisticated developers to modify according to
    experience and futuristic hardware. ex., a power-managed disk
    spinning up takes less than x seconds and should not be spinning
    down more often than every y minutes. SAN fabric disconnection
    should reconnect within z seconds, and unannounced outages don't
    need to be tolerated silently without intervention more than once
    per day. and so on. It may even be possible to generate negative
    fault events, like ``disk IS replying, not silent, and it says
    Not-ready-coming-ready, so don't fault it for 1 minute.'' The
    option of creating this kind of hairy mess of special-case
    layer-violating codified-tradition rules is the advantage I
    perceived in tolerating the otherwise disgusting hairy bolt-on
    shared-lib-spaghetti mess that is FMA. But for the 'metaparam -r'
    rules OTOH, variance/machine-learning is probably the only
    approach.

  + rules are in userland, can be more expensive CPU-wise, and return
    feedback to the kernel only a couple times a minute, not per-I/O
    like the 'metaparam -r' reissue rules.

I guess I'm changing my story slightly. I *would* want ZFS to collect drive performance statistics and report them to FMA, but I wouldn't suggest reporting the _decision_ outputs of the 'metaparam -r'-replacement engine to FMA, only the raw stats. and, of course, ``reporting'' is tricky for the diagnosis case because of the bolted-on separation of FMA. You can't usefully report ``the I/O took 3 hours to complete'' because you've now waited three hours to get the report, and the completed I/O has a normal driver error attached to it, so no fancy statistical decisions are any longer needed. Instead, you have to make polled reports to userland a couple times a minute, containing the list of incomplete outstanding I/O's, along with averages and variances and whatever else.
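[Editor's note: a toy C evaluation of the bucketed rule Miles hypothesizes above -- fault a drive once three distinct four-minute buckets inside a 40-minute window each contain a slow I/O. Every constant is just his example restated, not a recommendation, and nothing here comes from FMA or ZFS.]

#include <stdio.h>

#define	BUCKET_SECS	(4 * 60)
#define	NBUCKETS	10	/* 40 minutes of history */

typedef struct {
	long	slow_bucket[NBUCKETS];	/* bucket index that saw a slow I/O, or -1 */
} diag_t;

static void
diag_init(diag_t *d)
{
	for (int i = 0; i < NBUCKETS; i++)
		d->slow_bucket[i] = -1;
}

/* Record a slow I/O observed at time `now` (seconds since start). */
static void
diag_slow_io(diag_t *d, long now)
{
	long bucket = now / BUCKET_SECS;

	d->slow_bucket[bucket % NBUCKETS] = bucket;
}

/* Should the drive be faulted, judged at time `now`? */
static int
diag_fault(const diag_t *d, long now)
{
	long oldest = now / BUCKET_SECS - (NBUCKETS - 1);
	int hits = 0;

	for (int i = 0; i < NBUCKETS; i++)
		if (d->slow_bucket[i] >= 0 && d->slow_bucket[i] >= oldest)
			hits++;
	return (hits >= 3);
}

int
main(void)
{
	diag_t d;

	diag_init(&d);
	diag_slow_io(&d, 100);		/* bucket 0 */
	diag_slow_io(&d, 700);		/* bucket 2 */
	printf("after two slow I/Os: fault=%d\n", diag_fault(&d, 800));
	diag_slow_io(&d, 1500);		/* bucket 6 */
	printf("after three slow I/Os: fault=%d\n", diag_fault(&d, 1600));
	return (0);
}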
Bob Friesenhahn
2008-Aug-29 21:49 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, 29 Aug 2008, Miles Nordin wrote:
> I guess I'm changing my story slightly. I *would* want ZFS to collect
> drive performance statistics and report them to FMA, but I wouldn't

Your email *totally* blew my limited buffer size, but this little bit remained for me to look at. It left me wondering how ZFS would know if the device is a drive. How can ZFS maintain statistics for a "drive" if it is perhaps not a drive at all? ZFS does not require that the device be a "drive". Isn't ZFS the wrong level to be managing the details of a device?

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Miles Nordin
2008-Aug-29 23:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> if you use Ethernet switches in the interconnect, you need to re> disable STP on the ports used for interconnects or risk re> unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as ``edge'''' is good enough, less risky for the WAN, and pretty ubiquitously supported with non-EOL switches. The network guys will know this (assuming you have network guys) and do something like this: sw: can you disable STP for me? net: No? sw: <jumping up and down screaming> net: um,...i mean, Why? sw: [....] net: oh, that. Ok, try it now. sw: thanks for disabling STP for me. net: i uh,.. whatever. No problem! re> Can we expect a similar attention to detail for ZFS re> implementers? I''m afraid not :-(. well....you weren''t really ``expecting'''' it of the sun cluster implementers. You just ran into it by surprise in the form of an Issue. so, can you expect ZFS implementers to accept that running ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they didn''t already know? So far they seem receptive to arcane advice like ``make this config change in your SAN controller to let it use the NVRAM cache more aggressively, and stop using EMC PowerPath unless <blah>.'''' so, Yes? I think you can also expect them to wait longer than 40 seconds before declaring a system is frozen and rebooting it, though. ``Let''s `patiently wait'' forever because we think, based on our uncertainty, that FSPF might take several hours to converge'''' is the alternative that strikes me as unreasonable. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080829/dc156765/attachment.bin>
Richard Elling
2008-Aug-30 00:26 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>     re> if you use Ethernet switches in the interconnect, you need to
>     re> disable STP on the ports used for interconnects or risk
>     re> unnecessary cluster reconfigurations.
>
> RSTP/802.1w plus setting the ports connected to Solaris as ``edge'''' is
> good enough, less risky for the WAN, and pretty ubiquitously supported
> with non-EOL switches.  The network guys will know this (assuming you
> have network guys) and do something like this:
>
> sw: can you disable STP for me?
> net: No?
> sw: <jumping up and down screaming>
> net: um,...i mean, Why?
> sw: [....]
> net: oh, that.  Ok, try it now.
> sw: thanks for disabling STP for me.
> net: i uh,.. whatever.  No problem!
>

Precisely, this is not a problem that is usually solved unilaterally.

>     re> Can we expect a similar attention to detail for ZFS
>     re> implementers?  I''m afraid not :-(.
>
> well....you weren''t really ``expecting'''' it of the sun cluster
> implementers.  You just ran into it by surprise in the form of an
> Issue.

Rather, cluster implementers tend to RTFM.  I know few ZFSers who have
RTFM, and do not expect many to do so... such is life.

> so, can you expect ZFS implementers to accept that running
> ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they
> didn''t already know?

No, I expect them to see a "problem" caused by network reconfiguration
and blame ZFS.  Indeed, this is what occasionally happens with Solaris
Cluster -- but only occasionally, and it is solved via RTFM.

> So far they seem receptive to arcane advice like
> ``make this config change in your SAN controller to let it use the
> NVRAM cache more aggressively, and stop using EMC PowerPath unless
> <blah>.''''  so, Yes?
>

I have no idea what you are trying to say here.

> I think you can also expect them to wait longer than 40 seconds before
> declaring a system is frozen and rebooting it, though.
>

Current [s]sd driver timeouts are 60 seconds with 3-5 retries by
default.  We''ve had those timeouts for many, many years now and do
provide highly available services on such systems.  The B_FAILFAST
change did improve the availability of systems and similar tricks have
improved service availability for Solaris Clusters.  Refer to Eric''s
post for more details of this minefield.  NB some bugids one should
research before filing new bugs here are:

CR 4713686: sd/ssd driver should have an additional target specific timeout
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4713686
CR 4500536 introduces B_FAILFAST
http://bugs.opensolaris.org/view_bug.do?bug_id=4500536

> ``Let''s `patiently wait'' forever because we think, based on our
> uncertainty, that FSPF might take several hours to converge'''' is the
> alternative that strikes me as unreasonable.
>

AFAICT, nobody is making such a proposal.  Did I miss a post?
 -- richard
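[Editor''s note: a quick back-of-the-envelope on the [s]sd defaults
Richard quotes above (60-second command timeout, 3-5 retries).  The
figures are only as good as those defaults; this is not taken from
driver source.]

-----8<-----
# Worst-case delay before the driver gives up on one command,
# assuming a 60s timeout and 3-5 retries as quoted above.
timeout_s = 60
for retries in (3, 4, 5):
    worst_case = timeout_s * retries
    print(f"{retries} retries -> up to {worst_case}s "
          f"({worst_case / 60:.0f} min) for a single command")
# 3 retries -> 180s, which happens to line up with the ~3-minute pool
# hangs and the iscsi_rx_max_window default of 180 seconds that come up
# later in this thread.
-----8<-----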
Ross
2008-Aug-30 07:55 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Wow, some great comments on here now, even a few people agreeing with
me which is nice :D

I''ll happily admit I don''t have the in depth understanding of storage
many of you guys have, but since the idea doesn''t seem pie-in-the-sky
crazy, I''m going to try to write up all my current thoughts on how this
could work after reading through all the replies.

1. Track disk response times
- ZFS should track the average response time of each disk.
- This should be used internally for performance tweaking, so faster
  disks are favoured for reads.  This works particularly well for lop
  sided mirrors.
- I''d like to see this information (and the number of timeouts) in the
  output of zpool status, so administrators can see if any one device
  is performing badly.

2. New parameters
- ZFS should gain two new parameters:
  - A timeout value for the pool.
  - An option to enable that timeout for writes too (off by default).
- Still to be decided is whether that timeout is set manually, or
  automatically based on the information gathered in 1.
- Do we need a pool timeout (based on the timeout of the slowest device
  in the pool), or will individual device timeouts work better?
- I''ve decided that having this off by default for writes is probably
  better for ZFS.  It addresses some people''s concerns about writing to
  a degraded pool, and puts data integrity ahead of availability, which
  seems to fit better with ZFS'' goals.  I''d still like it myself for
  writes; I can live with a pool running degraded for 2 minutes while
  the problem is diagnosed.
- With that said, could the write timeout default to on when you have a
  slog device?  After all, the data is safely committed to the slog,
  and should remain there until it''s written to all devices.  Bob, you
  seemed the most concerned about writes, would that be enough
  redundancy for you to be happy to have this on by default?  If not,
  I''d still be ok having it off by default, we could maybe just include
  it in the evil tuning guide suggesting that this could be turned on
  by anybody who has a separate slog device.

3. How it would work (a rough sketch in code follows further below)
- If a read times out for any device, ZFS should immediately issue
  reads to all other devices holding that data.  The first response
  back will be used.
- Timeouts should be logged so the information can be used by
  administrators or FMA to help diagnose failing drives, but they
  should not count as a device failure on their own.
- Some thought is needed as to how this algorithm works on busy pools.
  When reads are queuing up, we need to avoid false positives and avoid
  adding extra load on the pool.  Would it be a possibility that
  instead of checking the response time for an individual request, this
  timeout is used to check if no responses at all have been received
  from a device for that length of time?  That still sounds reasonable
  for finding stuck devices, and should still work reliably on a busy
  pool.
- For reads, the pool does not need to go degraded, the device is
  simply flagged as "WAITING".
- When enabled for writes, these will be going to all devices, so there
  are no alternate devices to try.  This means any write timeout will
  be used to put the pool into a degraded mode.  This should be
  considered a temporary state with the drive in "WAITING" status, as
  while the pool itself is degraded (due to missing the writes for that
  drive), the drive is not yet offline.  At this point the system is
  simply keeping itself running while waiting for a proper error
  response from either the drive or from FMA.
  If the drive eventually returns the missing response, it can be
  resilvered with any data it missed.  If the drive doesn''t return a
  response, FMA should eventually fault it, and the drive can be taken
  offline and replaced with a hot spare.  At all times the
  administrator can see what is going on using zpool status, with the
  appropriate pool and drive status visible.
- Please bear in mind that although I''m using the word ''degraded''
  above, this is not necessarily the case for dual parity pools, I just
  don''t know the proper term to use for a dual parity raid set where a
  single drive has failed.
- If this is just a one off glitch and the device comes back online,
  the resilver shouldn''t take long as ZFS just needs to send the data
  that was missed (which will still be stored in the ZIL).
- If many devices timeout at once due to a bad controller, cable
  pulled, power failure, etc, all the affected devices will be flagged
  as "WAITING" and if too many have gone for the pool to stay
  operational, ZFS should switch the entire pool to the ''wait'' state
  while it waits for FMA, etc to return a proper response, after which
  it should react according to the proper failmode property for the
  pool.

4. Food for thought
- While I like nico''s idea for lop sided mirrors, I''m not sure any
  tweaking is needed.  I was thinking about whether these timeouts
  could improve performance for such a mirror, but I think a better
  option there is simply to use plenty of local write cache.  A ton of
  flash memory for the ZIL would be the ideal, but if it''s a
  particularly lopsided mirror you could use something as simple as a
  pair of locally mirrored drives.  Reads should be biased to the local
  devices anyway, so you''re making the best possible use of the slow
  half by caching all writes and then streaming everything to the
  remote device.  Of course, things will slow down when the ZIL fills
  as it now has to write to both halves together, but if that happens
  you probably need to rethink your solution anyway as your slow mirror
  isn''t keeping up with your demands.
- I see all this as working on top of the B_FAILFAST modes mentioned in
  the thread.  I don''t see that it has to impact ZFS'' current fault
  management at all.  It''s an extra layer that sits on top, simply with
  the aim of keeping the pool responsive while the lower level fault
  management does its job.  Of course, it''s also able to provide extra
  information for things like FMA, but it''s not intended as a way of
  diagnosing faults itself.
- Thinking about this further, if anything this seems to me like it
  should improve ZFS'' reliability for writes.  It puts the pool into a
  temporary ''degraded'' state much faster if there is a doubt over
  whether any writes have been committed to disk.  It will still be
  queuing writes for that device, exactly as it would have done before,
  but the pool and the administrator now have earlier information as to
  the status.  If a second device enters the "WAITING" state, the whole
  pool immediately switches to ''wait'' mode until a proper diagnosis
  happens.  This happens much earlier than could be the case with FMA
  alone doing the diagnostic, and means that ZFS stops accepting data
  much earlier while it waits to see what the problem is.  For async
  writes it reduces the amount of data that has been accepted by ZFS
  but not committed to storage.  For sync writes I don''t see that it
  has much effect, these writes would have been waiting for the bad
  device anyway.

Well, I think that''s everything.
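[Editor''s sketch of the read path proposed in point 3: time out, flag
the slow child "WAITING", re-issue the read to the other children
holding the data, and take whichever answer comes back first.  The
names, the threading model and the timeout value are all invented for
illustration; this is not how the ZFS I/O pipeline is structured.]

-----8<-----
# Minimal user-land sketch of "timeout, mark WAITING, re-issue the read".
import concurrent.futures as cf

READ_TIMEOUT = 2.0            # hypothetical per-device timeout, seconds
WAITING = set()               # devices currently flagged "WAITING"

def read_with_reissue(block, children, read_fn):
    """children: device names; read_fn(device, block) -> data (may be slow)."""
    with cf.ThreadPoolExecutor(max_workers=len(children)) as pool:
        first, rest = children[0], children[1:]
        future = pool.submit(read_fn, first, block)
        try:
            return future.result(timeout=READ_TIMEOUT)
        except cf.TimeoutError:
            WAITING.add(first)    # not faulted, just not trusted right now
            # Re-issue to every other child holding the data and use the
            # first result; the original read is left to finish (or fail)
            # in its own time, which is why this demo takes a few seconds.
            futures = [pool.submit(read_fn, dev, block) for dev in rest]
            done, _ = cf.wait([future] + futures, return_when=cf.FIRST_COMPLETED)
            return next(iter(done)).result()

if __name__ == "__main__":
    import time
    def fake_read(dev, block):
        time.sleep(6.0 if dev == "remote-iscsi" else 0.01)   # a stuck device
        return f"{block} from {dev}"
    print(read_with_reissue("blk#42", ["remote-iscsi", "local-a", "local-b"],
                            fake_read))
    print("WAITING devices:", WAITING)
-----8<-----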
To me it seems to address most of the problems raised here and there
should be performance benefits to using it in some situations.  It will
definitely have a beneficial effect in ensuring that ZFS maintains a
good pool response time for any kind of failure, and it looks to me
like it would work well for any kind of storage device.

I''m looking forward to seeing what holes can be knocked in these ideas
with the next set of replies :)

Ross
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2008-Aug-30 15:59 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Sat, 30 Aug 2008, Ross wrote:

> while the problem is diagnosed. - With that said, could the write
> timeout default to on when you have a slog device?  After all, the
> data is safely committed to the slog, and should remain there until
> it''s written to all devices.  Bob, you seemed the most concerned
> about writes, would that be enough redundancy for you to be happy to
> have this on by default?  If not, I''d still be ok having it off by
> default, we could maybe just include it in the evil tuning guide
> suggesting that this could be turned on by anybody who has a
> separate slog device.

It is my impression that the slog device is only used for synchronous
writes.  Depending on the system, this could be just a small fraction
of the writes.

In my opinion, ZFS''s primary goal is to avoid data loss, or consumption
of wrong data.  Availability is a lesser goal.

If someone really needs maximum availability then they can go to triple
mirroring or some other maximally redundant scheme.  ZFS should do its
best to continue moving forward as long as some level of redundancy
exists.  There could be an option to allow moving forward with no
redundancy at all.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Ross Smith
2008-Aug-30 19:32 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Triple mirroring you say?  That''d be me then :D

The reason I really want to get ZFS timeouts sorted is that our long
term goal is to mirror that over two servers too, giving us a pool
mirrored across two servers, each of which is actually a zfs iscsi
volume hosted on triply mirrored disks.

Oh, and we''ll have two sets of online off-site backups running raid-z2,
plus a set of off-line backups too.

All in all I''m pretty happy with the integrity of the data, wouldn''t
want to use anything other than ZFS for that now.  I''d just like to get
the availability working a bit better, without having to go back to
buying raid controllers.  We have big plans for that too; once we get
the iSCSI / iSER timeout issue sorted our long term availability goals
are to have the setup I mentioned above hosted out from a pair of
clustered Solaris NFS / CIFS servers.

Failover time on the cluster is currently in the order of 5-10 seconds,
if I can get the detection of a bad iSCSI link down under 2 seconds
we''ll essentially have a worst case scenario of < 15 seconds downtime.
Downtime that low means it''s effectively transparent for our users as
all of our applications can cope with that seamlessly, and I''d really
love to be able to do that this calendar year.

Anyway, getting back on topic, it''s a good point about moving forward
while redundancy exists.  I think the flag for specifying the write
behavior should have that as the default, with the optional setting
being to allow the pool to continue accepting writes while the pool is
in a non redundant state.

Ross


> Date: Sat, 30 Aug 2008 10:59:19 -0500
> From: bfriesen at simple.dallas.tx.us
> To: myxiplx at hotmail.com
> CC: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>
> On Sat, 30 Aug 2008, Ross wrote:
> > while the problem is diagnosed. - With that said, could the write
> > timeout default to on when you have a slog device? After all, the
> > data is safely committed to the slog, and should remain there until
> > it''s written to all devices. Bob, you seemed the most concerned
> > about writes, would that be enough redundancy for you to be happy to
> > have this on by default? If not, I''d still be ok having it off by
> > default, we could maybe just include it in the evil tuning guide
> > suggesting that this could be turned on by anybody who has a
> > separate slog device.
>
> It is my impression that the slog device is only used for synchronous
> writes. Depending on the system, this could be just a small fraction
> of the writes.
>
> In my opinion, ZFS''s primary goal is to avoid data loss, or
> consumption of wrong data. Availability is a lesser goal.
>
> If someone really needs maximum availability then they can go to
> triple mirroring or some other maximally redundant scheme. ZFS should
> do its best to continue moving forward as long as some level of
> redundancy exists. There could be an option to allow moving forward
> with no redundancy at all.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Johan Hartzenberg
2008-Aug-31 11:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins <ian at ianshome.com> wrote:

> Miles Nordin writes:
>
> > suggested that unlike the SVM feature it should be automatic, because
> > by so being it becomes useful as an availability tool rather than just
> > performance optimisation.
>
> So on a server with a read workload, how would you know if the remote
> volume was working?

Even reads induce writes (last access time, if nothing else)

My question: If a pool becomes non-redundant (eg due to a timeout,
hotplug removal, bad data returned from device, or for whatever
reason), do we want the affected pool/vdev/system to hang?  Generally
speaking I would say that this is what currently happens with other
solutions.

Conversely: Can the current situation be improved by allowing a device
to be taken out of the pool for writes - eg be placed in read-only
mode?  I would assume it is possible to modify the CoW system /
functions which allocate blocks for writes to ignore certain devices,
at least temporarily.

This would also lay a groundwork for allowing devices to be removed
from a pool - eg:
Step 1: Make the device read-only.
Step 2: touch every allocated block on that device (causing it to be
        copied to some other disk).
Step 3: remove it from the pool for reads as well and finally remove it
        from the pool permanently.

  _hartz
Richard Elling
2008-Aug-31 19:09 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote:
> Triple mirroring you say?  That''d be me then :D
>
> The reason I really want to get ZFS timeouts sorted is that our long
> term goal is to mirror that over two servers too, giving us a pool
> mirrored across two servers, each of which is actually a zfs iscsi
> volume hosted on triply mirrored disks.
>
> Oh, and we''ll have two sets of online off-site backups running
> raid-z2, plus a set of off-line backups too.
>
> All in all I''m pretty happy with the integrity of the data, wouldn''t
> want to use anything other than ZFS for that now.  I''d just like to
> get the availability working a bit better, without having to go back
> to buying raid controllers.  We have big plans for that too; once we
> get the iSCSI / iSER timeout issue sorted our long term availability
> goals are to have the setup I mentioned above hosted out from a pair
> of clustered Solaris NFS / CIFS servers.
>
> Failover time on the cluster is currently in the order of 5-10
> seconds, if I can get the detection of a bad iSCSI link down under 2
> seconds we''ll essentially have a worst case scenario of < 15 seconds
> downtime.

I don''t think this is possible for a stable system.  2 second failure
detection for IP networks is troublesome for a wide variety of reasons.
Even with Solaris Clusters, we can show consistent failover times for
NFS services on the order of a minute (2-3 client retry intervals,
including backoff).  But getting to consistent sub-minute failover for
a service like NFS might be a bridge too far, given the current
technology and the amount of customization required to "make it
work"^TM.

> Downtime that low means it''s effectively transparent for our users as
> all of our applications can cope with that seamlessly, and I''d really
> love to be able to do that this calendar year.

I think most people (traders are a notable exception) and applications
can deal with larger recovery times, as long as human-intervention is
not required.
 -- richard
Ross Smith
2008-Sep-02 11:37 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.  (A toy sketch of the latency-tracking idea
quoted below follows after this message.)

> From: ian at ianshome.com
> To: eric.schrock at sun.com
> CC: myxiplx at hotmail.com; zfs-discuss at opensolaris.org
> Subject: Re: Availability: ZFS needs to handle disk removal / driver failure better
> Date: Fri, 29 Aug 2008 09:15:41 +1200
>
> Eric Schrock writes:
>
> > A better option would be to not use this to perform FMA diagnosis, but
> > instead work into the mirror child selection code.  This has already
> > been alluded to before, but it would be cool to keep track of latency
> > over time, and use this to both a) prefer one drive over another when
> > selecting the child and b) proactively timeout/ignore results from one
> > child and select the other if it''s taking longer than some historical
> > standard deviation.  This keeps away from diagnosing drives as faulty,
> > but does allow ZFS to make better choices and maintain response times.
> > It shouldn''t be hard to keep track of the average and/or standard
> > deviation and use it for selection; proactively timing out the slow I/Os
> > is much trickier.
>
> This would be a good solution to the remote iSCSI mirror configuration.
> I''ve been working though this situation with a client (we have been
> comparing ZFS with Cleversafe) and we''d love to be able to get the read
> performance of the local drives from such a pool.
>
> > As others have mentioned, things get more difficult with writes.  If I
> > issue a write to both halves of a mirror, should I return when the first
> > one completes, or when both complete?  One possibility is to expose this
> > as a tunable, but any such "best effort RAS" is a little dicey because
> > you have very little visibility into the state of the pool in this
> > scenario - "is my data protected?" becomes a very difficult question to
> > answer.
>
> One solution (again, to be used with a remote mirror) is the three way
> mirror.  If two devices are local and one remote, data is safe once the two
> local writes return.  I guess the issue then changes from "is my data safe"
> to "how safe is my data".  I would be reluctant to deploy a remote mirror
> device without local redundancy, so this probably won''t be an uncommon
> setup.  There would have to be an acceptable window of risk when local data
> isn''t replicated.
>
> Ian
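[Editor''s sketch of the mirror-child-selection idea Eric describes in
the quote above: keep a running latency average per child and prefer
the fastest one for reads.  The class names and the smoothing constant
are invented; this is not the actual ZFS mirror code.]

-----8<-----
# Toy latency-weighted mirror child selection.
class Child:
    def __init__(self, name, alpha=0.2):
        self.name, self.alpha = name, alpha
        self.avg_latency = None        # exponentially weighted moving average

    def record(self, latency):
        if self.avg_latency is None:
            self.avg_latency = latency
        else:
            self.avg_latency += self.alpha * (latency - self.avg_latency)

def pick_read_child(children):
    """Prefer the child with the lowest observed average latency;
    give any child with no history yet a chance first."""
    unknown = [c for c in children if c.avg_latency is None]
    if unknown:
        return unknown[0]
    return min(children, key=lambda c: c.avg_latency)

if __name__ == "__main__":
    local_a, local_b, remote = Child("local-a"), Child("local-b"), Child("remote-iscsi")
    for lat in (0.008, 0.011, 0.009): local_a.record(lat)
    for lat in (0.010, 0.009, 0.012): local_b.record(lat)
    for lat in (0.220, 0.250, 0.300): remote.record(lat)
    print(pick_read_child([local_a, local_b, remote]).name)  # a local disk wins
-----8<-----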
Ross
2008-Sep-06 06:48 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks,

Well, there haven''t been any more comments knocking holes in this idea,
so I''m wondering now if I should log this as an RFE?

Is this something others would find useful?

Ross
-- 
This message posted from opensolaris.org
Richard Elling
2008-Sep-06 15:14 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote:
> Hey folks,
>
> Well, there haven''t been any more comments knocking holes in this idea,
> so I''m wondering now if I should log this as an RFE?
>

go for it!

> Is this something others would find useful?
>

Yes.  But remember that this has a very limited scope.  Basically it
will apply to mirrors, not raidz.  Some people may find that to be
uninteresting.  Implementing something simple, like a preferred side,
would be an easy first step (ala VxVM''s preferred plex).
 -- richard
Ross
2008-Nov-27 12:33 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hmm... I logged this CR ages ago, but now I''ve come to find it in the
bug tracker I can''t see it anywhere.

I actually logged three CR''s back to back, the first appears to have
been created ok, but two have just disappeared.  The one I created ok
is: http://bugs.opensolaris.org/view_bug.do?bug_id=6766364

There should be two other CR''s created within a few minutes of that,
one for disabling caching on CIFS shares, and one regarding this ZFS
availability discussion.  Could somebody at Sun let me know what''s
happened to these please.
-- 
This message posted from opensolaris.org
James C. McPherson
2008-Nov-27 12:41 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 27 Nov 2008 04:33:54 -0800 (PST)
Ross <myxiplx at googlemail.com> wrote:

> Hmm... I logged this CR ages ago, but now I''ve come to find it in
> the bug tracker I can''t see it anywhere.
>
> I actually logged three CR''s back to back, the first appears to have
> been created ok, but two have just disappeared.  The one I created ok
> is: http://bugs.opensolaris.org/view_bug.do?bug_id=6766364
>
> There should be two other CR''s created within a few minutes of that,
> one for disabling caching on CIFS shares, and one regarding this ZFS
> availability discussion.  Could somebody at Sun let me know what''s
> happened to these please.

Hi Ross,
I can''t find the ZFS one you mention.  The CIFS one is
http://bugs.opensolaris.org/view_bug.do?bug_id=6766126.  It''s been
marked as ''incomplete'' so you should contact the R.E. - Alan M. Wright
(at sun dot com, etc) to find out what further info is required.

hth,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
Ross
2008-Nov-27 13:07 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thanks James, I''ve e-mailed Alan and submitted this one again.
-- 
This message posted from opensolaris.org
Bernard Dugas
2008-Nov-27 15:11 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hello,

Thank you for this very interesting thread !

I want to confirm that Synchronous Distributed Storage is a main goal
when using ZFS !  The target architecture is 1 local drive, and 2 (or
more) remote iSCSI targets, with ZFS being the iSCSI initiator.

The system is sized so that the local disk can handle all needed
performance with a good margin, as can each of the iSCSI targets
through large enough Ethernet fibers.  I need any network problem not
to slow down reads from the local disk, and writes to be stopped only
if no remote target is available after a time-out.

I also made a comment on that subject in :
http://blogs.sun.com/roch/entry/using_zfs_as_a_network

To myxiplx : we call a "Sleeping Failure" a failure of 1 part that is
hidden by redundancy but not detected by monitoring.  These are the
most dangerous...

Would anybody be interested in supporting an opensource "projectseed"
called MiSCSI ?  This is for Multicast iSCSI, so that only 1 write from
the initiator is propagated by the network to all subscribed targets,
with dynamic subscribing and "resilvering" being delegated to the
remote targets.  I would even prefer this behaviour to already exist in
ZFS :-)

Please send me any comment if interested, i may send a draft RFP...

Best regards !
-- 
This message posted from opensolaris.org
Ross
2008-Nov-27 15:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Well, you''re not alone in wanting to use ZFS and iSCSI like that, and
in fact my change request suggested that this is exactly one of the
things that could be addressed:

"The idea is really a two stage RFE, since just the first part would
have benefits.  The key is to improve ZFS availability, without
affecting its flexibility, bringing it on par with traditional raid
controllers.

A. Track response times, allowing for lop sided mirrors, and better
failure detection.  Many people have requested this since it would
facilitate remote live mirrors.

B. Use response times to timeout devices, dropping them to an interim
failure mode while waiting for the official result from the driver.
This would prevent redundant pools hanging when waiting for a single
device."

Unfortunately if your links tend to drop, you really need both parts.
However, if this does get added to ZFS, all you would then need is
standard monitoring on the ZFS pool.  That would notify you when any
device fails and the pool goes to a degraded state, making it easy to
spot when either the remote mirrors or local storage are having
problems.  I''d have thought it would make monitoring much simpler.

And if this were possible, I would hope that you could configure iSCSI
devices to automatically reconnect and resilver too, so the system
would be self repairing once faults are corrected, but I haven''t gone
so far as to test that yet.
-- 
This message posted from opensolaris.org
Bernard Dugas
2008-Nov-27 16:45 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> Well, you''re not alone in wanting to use ZFS and
> iSCSI like that, and in fact my change request
> suggested that this is exactly one of the things that
> could be addressed:

Thank you !  Yes, this was also to tell you that you are not alone :-)

I agree completely with you on your technical points !
-- 
This message posted from opensolaris.org
Richard Elling
2008-Nov-28 05:05 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote:
> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and
> in fact my change request suggested that this is exactly one of the
> things that could be addressed:
>
> "The idea is really a two stage RFE, since just the first part would
> have benefits.  The key is to improve ZFS availability, without
> affecting its flexibility, bringing it on par with traditional raid
> controllers.
>
> A. Track response times, allowing for lop sided mirrors, and better
> failure detection.

I''ve never seen a study which shows, categorically, that disk or
network failures are preceded by significant latency changes.  How do
we get "better failure detection" from such measurements?

> Many people have requested this since it would facilitate remote live
> mirrors.
>

At a minimum, something like VxVM''s preferred plex should be reasonably
easy to implement.

> B. Use response times to timeout devices, dropping them to an interim
> failure mode while waiting for the official result from the driver.
> This would prevent redundant pools hanging when waiting for a single
> device."
>

I don''t see how this could work except for mirrored pools.  Would that
carry enough market to be worthwhile?
 -- richard

> Unfortunately if your links tend to drop, you really need both parts.
> However, if this does get added to ZFS, all you would then need is
> standard monitoring on the ZFS pool.  That would notify you when any
> device fails and the pool goes to a degraded state, making it easy to
> spot when either the remote mirrors or local storage are having
> problems.  I''d have thought it would make monitoring much simpler.
>
> And if this were possible, I would hope that you could configure iSCSI
> devices to automatically reconnect and resilver too, so the system
> would be self repairing once faults are corrected, but I haven''t gone
> so far as to test that yet.
>
Ross Smith
2008-Nov-28 07:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <Richard.Elling at sun.com> wrote:
> Ross wrote:
>>
>> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and in
>> fact my change request suggested that this is exactly one of the things
>> that could be addressed:
>>
>> "The idea is really a two stage RFE, since just the first part would have
>> benefits.  The key is to improve ZFS availability, without affecting its
>> flexibility, bringing it on par with traditional raid controllers.
>>
>> A. Track response times, allowing for lop sided mirrors, and better
>> failure detection.
>
> I''ve never seen a study which shows, categorically, that disk or network
> failures are preceded by significant latency changes.  How do we get
> "better failure detection" from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there''s
going to be a sudden, and very large change in latency.  Sure, FMA will
catch most cases, but we''ve already shown that there are some cases
where it doesn''t work too well (and I would argue that''s always going
to be possible when you are relying on so many different types of
driver).  This is there to ensure that ZFS can handle *all* cases.

>> Many people have requested this since it would facilitate remote live
>> mirrors.
>>
>
> At a minimum, something like VxVM''s preferred plex should be reasonably
> easy to implement.
>
>> B. Use response times to timeout devices, dropping them to an interim
>> failure mode while waiting for the official result from the driver.  This
>> would prevent redundant pools hanging when waiting for a single device."
>>
>
> I don''t see how this could work except for mirrored pools.  Would that
> carry enough market to be worthwhile?
> -- richard

I have to admit, I''ve not tested this with a raided pool, but since all
ZFS commands hung when my iSCSI device went offline, I assumed that you
would get the same effect of the pool hanging if a raid-z2 pool is
waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn''t want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to
be sure.
Richard Elling
2008-Nov-28 16:12 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote:
> On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <Richard.Elling at sun.com> wrote:
>
>> Ross wrote:
>>
>>> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and in
>>> fact my change request suggested that this is exactly one of the things
>>> that could be addressed:
>>>
>>> "The idea is really a two stage RFE, since just the first part would have
>>> benefits.  The key is to improve ZFS availability, without affecting its
>>> flexibility, bringing it on par with traditional raid controllers.
>>>
>>> A. Track response times, allowing for lop sided mirrors, and better
>>> failure detection.
>>>
>> I''ve never seen a study which shows, categorically, that disk or network
>> failures are preceded by significant latency changes.  How do we get
>> "better failure detection" from such measurements?
>>
>
> Not preceded by as such, but a disk or network failure will certainly
> cause significant latency changes.  If the hardware is down, there''s
> going to be a sudden, and very large change in latency.  Sure, FMA
> will catch most cases, but we''ve already shown that there are some
> cases where it doesn''t work too well (and I would argue that''s always
> going to be possible when you are relying on so many different types
> of driver).  This is there to ensure that ZFS can handle *all* cases.
>

I think that there is some confusion about FMA.  The value of FMA is
diagnosis.  If there was no FMA, then driver timeouts would still
exist.  Where FMA is useful is diagnosing the problem such that we know
that the fault is in the SAN and not the RAID array, for example.  From
the device driver level, all sd knows is that an I/O request to a
device timed out.  Similarly, all ZFS could know is what sd tells it.

>>> Many people have requested this since it would facilitate remote live
>>> mirrors.
>>>
>> At a minimum, something like VxVM''s preferred plex should be reasonably
>> easy to implement.
>>
>>> B. Use response times to timeout devices, dropping them to an interim
>>> failure mode while waiting for the official result from the driver.  This
>>> would prevent redundant pools hanging when waiting for a single device."
>>>
>> I don''t see how this could work except for mirrored pools.  Would that
>> carry enough market to be worthwhile?
>> -- richard
>>
>
> I have to admit, I''ve not tested this with a raided pool, but since
> all ZFS commands hung when my iSCSI device went offline, I assumed
> that you would get the same effect of the pool hanging if a raid-z2
> pool is waiting for a response from a device.  Mirrored pools do work
> particularly well with this since it gives you the potential to have
> remote mirrors of your data, but if you had a raid-z2 pool, you still
> wouldn''t want that hanging if a single device failed.
>

zpool commands hanging is CR6667208, and has been fixed in b100.
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

> I will go and test the raid scenario though on a current build, just to be sure.
>

Please.
 -- richard
Ross Smith
2008-Dec-02 11:31 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks, I''ve just followed up on this, testing iSCSI with a raided
pool, and it still appears to be struggling when a device goes offline.

>>> I don''t see how this could work except for mirrored pools.  Would that
>>> carry enough market to be worthwhile?
>>> -- richard
>>
>> I have to admit, I''ve not tested this with a raided pool, but since
>> all ZFS commands hung when my iSCSI device went offline, I assumed
>> that you would get the same effect of the pool hanging if a raid-z2
>> pool is waiting for a response from a device.  Mirrored pools do work
>> particularly well with this since it gives you the potential to have
>> remote mirrors of your data, but if you had a raid-z2 pool, you still
>> wouldn''t want that hanging if a single device failed.
>
> zpool commands hanging is CR6667208, and has been fixed in b100.
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667208
>
>> I will go and test the raid scenario though on a current build, just to be
>> sure.
>
> Please.
> -- richard

I''ve just created a pool using three snv_103 iscsi Targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.

To test the server, while transferring files from a windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shutdown, but once that happened the windows copy halted
with the error: "The specified network name is no longer available."

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all three
devices are online.  A minute later, I can open the share again, and
start another copy.  Thirty seconds after that, zpool status finally
reports that the iscsi device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there''s a bug in the initiator, but it''s
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981

What was great was that as soon as the iSCSI initiator reconnected, ZFS
started resilvering.  What might not be so great is the fact that all
three devices are showing that they''ve been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using ''zpool clear'' or replace the device with
        ''zpool replace''.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  179K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       5 9.88K     0  311M resilvered
            c2t600144F04934119E000050569675FF00d0  ONLINE       0     0     0  179K resilvered

errors: No known data errors

It''s proving a little hard to know exactly what''s happening when, since
I''ve only got a few seconds to log times, and there are delays with
each step.
However, I ran another test using robocopy and was able to observe the
behaviour a little more closely:

Test 2: Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - "The specified network name is no longer available"
         - zpool status shows all three drives as online
         - zpool iostat appears to have hung, taking much longer than
           the 30s specified to return a result
         - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
           much simultaneously
         - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this, but
I haven''t learnt that yet.  My guess as to what''s happening would be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
  essentially hung
- CIFS times out (I suspect this is on the client side with around a
  30s timeout, but I can''t find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn''t appear
  to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
  with the remaining two drives, CIFS carries on working, iostat
  carries on working.  "zpool status" however is still out of date.
- zpool status eventually catches up, and reports that the drive has
  gone offline.

So, if my guesses are right, I see several problems here:

1. ZFS could still benefit from the timeout I''ve suggested to keep the
   pool active.  I''ve now shown this benefits raidz pools as well as
   mirrors, and with problems other people have reported, we''ve shown
   that at least two drivers have problems that this would mitigate.
2. I would guess that the timeout needs to be under 30 seconds to
   prevent problems with CIFS clients, I need to find some
   documentation on this, and find some way to prove it''s a client
   timeout and not a problem with the CIFS server.  (A sketch of
   picking such a timeout follows after this message.)
3. zpool iostat is still blocked by a hung device (there may be an
   existing bug for this, it rings a bell).
4. zpool status still reports out of date information.
5. When iSCSI targets finally do come back online, ZFS is resilvering
   all of them (again, this rings a bell, Miles might have reported
   something similar).

And while I don''t know the code at all, I really can''t understand how
ZFS can be serving files out from a pool, but zpool status doesn''t know
what''s going on.  ZFS physically can''t work unless it knows which
drives it is and isn''t writing to.  Why can''t you just use this
knowledge for zpool status?

Ross
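[Editor''s sketch for point 2: one way a per-device timeout could be
derived automatically from tracked response times while staying below
the assumed (and undocumented) ~30s CIFS client timeout.  The formula
and every constant here are invented purely for illustration.]

-----8<-----
# Pick a per-device timeout from recent latency history, capped below
# an assumed client-side CIFS timeout of ~30 seconds.
import statistics

CEILING = 25.0        # stay under the assumed ~30s CIFS client timeout
FLOOR = 2.0           # never time a device out faster than this

def device_timeout(recent_latencies, k=10.0):
    """recent_latencies: seconds for recently completed I/Os on one device."""
    if not recent_latencies:
        return CEILING                       # no history yet: be patient
    mean = statistics.fmean(recent_latencies)
    stdev = statistics.pstdev(recent_latencies)
    candidate = k * (mean + 3.0 * stdev)     # well clear of normal jitter
    return min(max(candidate, FLOOR), CEILING)

if __name__ == "__main__":
    local_disk = [0.005, 0.008, 0.006, 0.012, 0.007]
    busy_iscsi = [0.050, 0.300, 0.120, 0.800, 0.250]
    print(device_timeout(local_disk))   # the 2s floor for a quick local disk
    print(device_timeout(busy_iscsi))   # around 10s for a jittery iSCSI link
-----8<-----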
Ross
2008-Dec-02 12:12 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Incidentally, while I''ve reported this again as an RFE, I still haven''t
seen a CR number for this.  Could somebody from Sun check if it''s been
filed please.

thanks,

Ross
-- 
This message posted from opensolaris.org
Ross Smith
2008-Dec-02 16:43 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Richard,

Thanks, I''ll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though, I don''t think it likes it if
the iscsi targets aren''t available during boot.  Again, that rings a
bell, so I''ll go see if that''s another known bug.

Changing that setting on the fly didn''t seem to help, if anything
things are worse this time around.  I changed the timeout to 15
seconds, but didn''t restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:            0xb4            =       0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            15

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:

# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so.  After that, the CIFS copy
timed out.  After the CIFS copy timed out, I tried immediately
restarting it.  It took a few more seconds, but restarted no problem.
Within a few seconds of that restarting, iostat recovered, and format
returned its result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using ''zpool clear'' or replace the device with
        ''zpool replace''.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E000050569675FF00d0  ONLINE       0   200     0  24K resilvered

errors: No known data errors

real    3m51.774s
user    0m0.015s
sys     0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
        exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using ''zpool online''.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  DEGRADED     0     0     0
          raidz1                                   DEGRADED     0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E000050569675FF00d0  UNAVAIL      3 5.80K     0  cannot open

errors: No known data errors

real    0m0.272s
user    0m0.029s
sys     0m0.169s

On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling <Richard.Elling at sun.com> wrote:
......
> iSCSI timeout is set to 180 seconds in the client code.  The only way
> to change is to recompile it, or use mdb.  Since you have this test rig
> setup, and I don''t, do you want to experiment with this timeout?
> The variable is actually called "iscsi_rx_max_window" so if you do
>     echo iscsi_rx_max_window/D | mdb -k
> you should see "180"
> Change it using something like:
>     echo iscsi_rx_max_window/W0t30 | mdb -kw
> to set it to 30 seconds.
> -- richard
Miles Nordin
2008-Dec-02 18:35 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "rs" == Ross Smith <myxiplx at googlemail.com> writes:rs> 4. zpool status still reports out of date information. I know people are going to skim this message and not hear this. They''ll say ``well of course zpool status says ONLINE while the pool is hung. ZFS is patiently waiting. It doesn''t know anything is broken yet.'''' but you are NOT saying it''s out of date because it doesn''t say OFFLINE the instant you power down an iSCSI target. You''re saying: rs> - After 3 minutes, the iSCSI drive goes offline. rs> The pool carries on with the remaining two drives, CIFS rs> carries on working, iostat carries on working. "zpool status" rs> however is still out of date. rs> - zpool status eventually rs> catches up, and reports that the drive has gone offline. so, there is a ~30sec window when it''s out of date. When you say ``goes offline'''' in the first bullet, you''re saying ``ZFS must have marked it offline internally, because the pool unfroze.'''' but you found that even after it ``goes offline'''' ''zpool status'' still reports it ONLINE. The question is, what the hell is ''zpool status'' reporting? not the status, apparently. It''s supposed to be a diagnosis tool. Why should you have to second-guess it and infer the position of ZFS''s various internal state machines through careful indirect observation, ``oops, CIFS just came back,'''' or ``oh sometihng must have changed because zpool iostat isn''t hanging any more''''? Why not have a tool that TELLS you plainly what''s going on? ''zpool status'' isn''t. Is it trying to oversimplify things, to condescend to the sysadmin or hide ZFS''s rough edges? Are there more states for devices that are being compressed down to ONLINE OFFLINE DEGRADED FAULTED? Is there some tool in zdb or mdb that is like ''zpool status -simonsez''? I already know sometimes it''ll report everything as ONLINE but refuse ''zpool offline ... <device>'' with ''no valid replicas'', so I think, yes there are ``secret states'''' for devices? Or is it trying to do too many things with one output format? rs> 5. When iSCSI targets finally do come back online, ZFS is rs> resilvering all of them (again, this rings a bell, Miles might rs> have reported something similar). my zpool status is so old it doesn''t say ``xxkB resilvered'''' so I''ve no indication which devices are the source vs. target of the resilver. What I found was, the auto-resilver isn''t sufficient. If you wait for it to complete, then ''zpool scrub'', you''ll get thousands of CKSUM errors on the dirty device, so the resilver isn''t covering all the dirtyness. Also ZFS seems to forget about the need to resilver if you shut down the machine, bring back the missing target, and boot---it marks everything ONLINE and then resilvers as you hit the dirty data, counting CKSUM errors. This has likely been fixed between b71 and b101. It''s easy to test: (a) shut down one iSCSI target, (b) write to the pool, (c) bring the iSCSI target back, (d) wait for auto-resilver to finish, (e) ''zpool scrub'', (f) look for CKSUM errors. I suspect you''re more worried about your own problems though---I''ll try to retest it soon. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081202/7a60ae97/attachment.bin>
Ross
2008-Dec-02 19:55 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Miles,

It''s probably a bad sign that although that post came through as
anonymous in my e-mail, I recognised your style before I got half way
through your post :)

I agree, the zpool status being out of date is weird, I''ll dig out the
bug number for that at some point as I''m sure I''ve mentioned it before.
It looks to me like there are two separate pieces of code that work out
the status of the pool.  There''s the stuff ZFS uses internally to run
the pool, and then there''s a completely separate piece that does the
reporting to the end user.

I agree that it could be a case of oversimplifying things.  There''s no
denying the ease of admin is one of ZFS'' strengths, but I think the
whole zpool status thing needs looking at again.  Neither the way the
command freezes, nor the out of date information make any sense to me.

And yes, I''m aware of the problems you''ve reported with resilvering.
That''s on my list of things to test with this.  I''ve already done a
quick test of running a scrub after the resilver (which appeared ok at
first glance), and tomorrow I''ll be testing the reboot status too.
-- 
This message posted from opensolaris.org
Miles Nordin
2008-Dec-02 20:35 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "r" == Ross <myxiplx at googlemail.com> writes:r> style before I got half way through your post :) [...status r> problems...] could be a case of oversimplifying things. yeah I was a bit inappropriate, but my frustration comes from the (partly paranoid) imagining of how the idea ``we need to make it simple'''' might have spooled out through a series of design meetings to a culturally-insidious mind-blowing condescention toward the sysadmin. ``simple'''', to me, means that a ''status'' tool does not read things off disks, and does not gather a bunch of scraps to fabricate a pretty (``simple''''?) fantasy-world at invocation which is torn down again when it exits. The Linux status tools are pretty-printing wrappers around ''cat /proc/$THING/status''. That, is SIMPLE! And, screaming monkeys though they often are, the college kids writing Linux are generally disciplined enough not to grab a bunch of locks and then go to sleep for minutes when delivering things from /proc. I love that. The other, broken, idea of ``simple'''' is what I come to Unix to avoid. And yes, this is a religious argument. Just because it spans decades of experience and includes ideas of style doesn''t mean it should be dismissed as hocus-pocus. And I don''t like all these binary config files either. Not even Mac OS X is pulling that baloney any more. r> There''s no denying the ease of admin is one of ZFS'' strengths, I deny it! It is not simple to start up ''format'' and ''zpool iostat'' and RoboCopy on another machine because you cannot trust the output of the status command. And getting visibility into something by starting a bunch of commands in different windows and watching when which one unfreezes is hilarious, not simple. r> the problems you''ve reported with resilvering. I think we were watching this bug: http://bugs.opensolaris.org/view_bug.do?bug_id=6675685 so that ought to be fixed in your test system but not in s10u6. but it might not be completely fixed yet: http://bugs.opensolaris.org/view_bug.do?bug_id=6747698 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081202/2a365522/attachment.bin>
Toby Thain
2008-Dec-02 23:06 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On 2-Dec-08, at 3:35 PM, Miles Nordin wrote:

>>>>>> "r" == Ross <myxiplx at googlemail.com> writes:
>
>     r> style before I got half way through your post :) [...status
>     r> problems...] could be a case of oversimplifying things.
> ...
> And yes, this is a religious argument.  Just because it spans decades
> of experience and includes ideas of style doesn''t mean it should be
> dismissed as hocus-pocus.  And I don''t like all these binary config
> files either.  Not even Mac OS X is pulling that baloney any more.

OS X never used binary config files; it standardised on XML property
lists for the new subsystems (plus a lot of good old fashioned UNIX
config).  Perhaps you are thinking of Mac OS 9 and earlier (resource
forks).

--Toby
Ross
2008-Dec-03 15:20 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ok, I''ve done some more testing today and I almost don''t know where to
start.  I''ll begin with the good news for Miles :)

- Rebooting doesn''t appear to cause ZFS to lose the resilver status
  (but see 1. below)
- Resilvering appears to work fine, once complete I never saw any
  checksum errors when scrubbing the pool.
- Reconnecting iscsi drives causes zfs to automatically online the pool
  and automatically begin resilvering.

And now the bad news:

1. While rebooting doesn''t seem to cause the resilver to lose its
   status, something''s causing it problems.  I saw it restart several
   times.
2. With iscsi, you can''t reboot with sendtargets enabled, static
   discovery still seems to be the order of the day.
3. There appears to be a disconnect between what iscsiadm knows and
   what ZFS knows about the status of the devices (a small monitoring
   sketch follows after this message).

And I have confirmation of some of my earlier findings too:

4. iSCSI still has a 3 minute timeout, during which time your pool will
   hang, no matter how many redundant drives you have available.
5. zpool status can still hang when a device goes offline, and when it
   finally recovers, it will then report out of date information.  This
   could be Bug 6667199, but I''ve not seen anybody reporting the
   incorrect information part of this.
6. After one drive goes offline, during the resilver process, zpool
   status shows that information is being resilvered on the good
   drives.  Does anybody know why this happens?
7. Although ZFS will automatically online a pool when iscsi devices
   come online, CIFS shares are not automatically remounted.

I also have a few extra notes about a couple of those:

1 - resilver losing status
==============
Regarding the resilver restarting, I''ve seen it reported that "zpool
status" can cause this when run as admin, but I''m not convinced that''s
the cause.  Same for the rebooting problem.  I was able to run "zpool
status" dozens of times as an admin, but only two or three times did I
see the resilver restart.  Also, after rebooting, I could see that the
resilver was showing that it was 66% complete, but then a second later
it restarted.  Now, none of this is conclusive.  I really need to test
with a much larger dataset to get an idea of what''s really going on,
but there''s definitely something weird happening here.

3 - disconnect between iscsiadm and ZFS
========================
I repeated my test of offlining an iscsi target, this time checking
iscsiadm to see when it disconnected.  What I did was wait until
iscsiadm reported 0 connections to the target, and then started a CIFS
file copy and ran "zpool status".  Zpool status hung as expected, and a
minute or so later, the CIFS copy failed.  It seems that although
iscsiadm was aware that the target was offline, ZFS did not yet know
about it.  As expected, a minute or so later, zpool status completed
(returning incorrect results), and I could then run the CIFS copy fine.

5 - zpool status hanging and reporting incorrect information
==================================
When an iSCSI device goes offline, if you immediately run zpool status,
it hangs for 3-4 minutes.  Also, when it finally completes, it gives
incorrect information, reporting all the devices as online.  If you
immediately re-run zpool status, it completes rapidly and will now
correctly show the offline devices.
-- 
This message posted from opensolaris.org
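[Editor''s sketch for point 3: a crude user-land watchdog that polls
both views and logs whenever the initiator has lost all connections to
a target while zpool still claims the pool is healthy.  The parsing of
the command output is naive and assumed from memory of the Solaris
initiator (`Connections:'' lines from `iscsiadm list target'', an ``is
healthy'''' line from `zpool status -x''); the poll interval and pool name
are arbitrary.  Treat it as illustration, not a supported tool.]

-----8<-----
# Poll iscsiadm and zpool status and flag any disagreement between them.
import subprocess, time

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def iscsi_connections_ok():
    # "iscsiadm list target" prints a "Connections: N" line per target
    out = run(["iscsiadm", "list", "target"])
    counts = [int(line.split(":")[1]) for line in out.splitlines()
              if "Connections:" in line]
    return bool(counts) and all(c > 0 for c in counts)

def pool_claims_healthy(pool):
    # "zpool status -x <pool>" reports "... is healthy" when all is well
    return "is healthy" in run(["zpool", "status", "-x", pool])

if __name__ == "__main__":
    POOL = "iscsipool"          # the pool name used in the tests above
    while True:
        if not iscsi_connections_ok() and pool_claims_healthy(POOL):
            print(time.strftime("%H:%M:%S"),
                  "initiator has lost a target but zpool status disagrees")
        time.sleep(30)
-----8<-----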
Miles Nordin
2008-Dec-03 22:20 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "r" == Ross <myxiplx at googlemail.com> writes:rs> I don''t think it likes it if the iscsi targets aren''t rs> available during boot. from my cheatsheet: -----8<----- ok boot -m milestone=none [boots. enter root password for maintenance.] bash-3.00# /sbin/mount -o remount,rw / [<-- otherwise iscsiadm won''t update /etc/iscsi/*] bash-3.00# /sbin/mount /usr bash-3.00# /sbin/mount /var bash-3.00# /sbin/mount /tmp bash-3.00# iscsiadm remove discovery-address 10.100.100.135 bash-3.00# iscsiadm remove discovery-address 10.100.100.138 bash-3.00# iscsiadm remove discovery-address 10.100.100.138 iscsiadm: unexpected OS error iscsiadm: Unable to complete operation [<-- good. it''s gone.] bash-3.00# sync bash-3.00# lockfs -fa bash-3.00# reboot -----8<----- rs> # time zpool status [...] rs> real 3m51.774s so, this hang may happen in fewer situations, but it is not fixed. r> 6. After one drive goes offline, during the resilver process, r> zpool status shows that information is being resilvered on the r> good drives. Does anybody know why this happens? I don''t know why. I''ve seen that, too, though. For me it''s always been relatively short, <1min. I wonder if there are three kinds of scrub-like things, not just two (resilvers and scrubs), and ''zpool status'' is ``simplifying'''' for us again? r> 7. Although ZFS will automatically online a pool when iscsi r> devices come online, CIFS shares are not automatically r> remounted. For me, even plain filesystems are not all remounted. ZFS tries to mount them in the wrong order, so it would mount /a/b/c, then try to mount /a/b and complain ``directory not empty''''. I''m not sure why it mounts things in the right order at boot/import, but in haphazard order after one of these auto-onlines. Then NFS exporting didn''t work either. To fix, I have to ''zfs umount /a/b/c'', but then there is a b/c directory inside filesystem /a, so I have to ''rmdir /a/b/c'' by hand because the ''... set mountpoint'' koolaid creates the directories but doesn''t remove them. Then ''zfs mount -a'' and ''zfs share -a''. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081203/ccdba521/attachment.bin>