Ross
2008-Aug-28 08:08 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion.

I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity; what it can't do is guarantee data availability. The problem is, the way ZFS is marketed, people expect it to be able to do just that.

This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.

- ZFS timeouts to improve pool availability when no timely response is received from storage drivers.

And my reason for asking for these is that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors:

Aug 2008: AMD SB600 - System hang
- http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008: Supermicro SAT2-MV8 - System hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang
- http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008: iSCSI - ZFS hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007: Supermicro SAT2-MV8 - system hang
- http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007: Fibre channel
- http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible, as it's going to affect the perception of ZFS as a reliable system.

The common factor in all of these is that either the Solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred on hardware that definitely works fine under Windows), so I'm wondering: is there anything that can be done to prevent either type of lockup in these situations?

Firstly, for the OS: if a storage component (hardware or driver) fails for a non-essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but certainly in some of the cases above the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated such that the OS (and hence ZFS) can report a problem with a particular driver without hanging the entire system?

Please note: I know work is being done on FMA to handle all kinds of bugs; I'm not talking about that. It seems to me that FMA involves proper detection and reporting of bugs, which involves knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler, something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system that it's being used for many, many types of devices, and that means there are a whole host of drivers being used, and a lot of scope for bugs in those drivers.
I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error checking code at a layer above the drivers, in such a way that it's able to trap major problems without hanging the OS? i.e. update ZFS/Solaris so they can handle storage layer bugs gracefully without downing the entire system.

My second suggestion is to ask if ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a "WAITING" status flag for drives, so that if one isn't responding quickly, ZFS can flag it as "WAITING" and attempt to read or write the same data from elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools hanging when a single drive fails.

The ZFS update I feel is particularly appropriate. ZFS already uses checksumming since it doesn't trust drivers or hardware to always return the correct data. But ZFS then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere; why shouldn't a timeout do the same thing?

Ross

This message posted from opensolaris.org
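[Editor's note: purely as an illustration of the "WAITING" idea above, here is a minimal userspace C sketch. It is not ZFS code; the deadline, state names, and simulated latencies are all invented for the example.]

/*
 * Toy model of the proposal: an admin-set per-pool deadline after which a
 * read is re-issued to another replica while the slow device is merely
 * flagged WAITING, not faulted.  All names and numbers are hypothetical.
 */
#include <stdio.h>

typedef enum { DEV_ONLINE, DEV_WAITING, DEV_FAULTED } dev_state_t;

typedef struct {
	const char	*name;
	dev_state_t	state;
	int		sim_latency_ms;	/* simulated response time */
} vdev_t;

/* Pretend read: "completes" only if the device answers within the deadline. */
static int
try_read(vdev_t *dv, int deadline_ms)
{
	if (dv->state != DEV_ONLINE)
		return (-1);			/* skip flagged devices */
	if (dv->sim_latency_ms > deadline_ms) {
		dv->state = DEV_WAITING;	/* flag it, don't fault it */
		printf("%s: no answer in %d ms, flagged WAITING\n",
		    dv->name, deadline_ms);
		return (-1);
	}
	printf("%s: read satisfied in %d ms\n", dv->name, dv->sim_latency_ms);
	return (0);
}

int
main(void)
{
	int pool_deadline_ms = 2000;		/* admin-tunable, per pool */
	vdev_t mirror[2] = {
		{ "disk-A", DEV_ONLINE, 30000 },	/* hung in firmware retries */
		{ "disk-B", DEV_ONLINE, 8 },
	};

	/* Walk the mirror children until one answers inside the deadline. */
	for (int i = 0; i < 2; i++)
		if (try_read(&mirror[i], pool_deadline_ms) == 0)
			break;
	return (0);
}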
Bob Friesenhahn
2008-Aug-28 16:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Ross wrote:
> I believe ZFS should apply the same tough standards to pool
> availability as it does to data integrity. A bad checksum makes ZFS
> read the data from elsewhere, why shouldn't a timeout do the same
> thing?

A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two.

If the system or device is simply overwhelmed with work, then you would not want the system to go haywire and make the problems much worse.

Which of these do you prefer?

o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost.

o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Eric Schrock
2008-Aug-28 16:52 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross, thanks for the feedback. A couple points here -

A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older Nevada builds that didn't have these fixes. Anything from the pre-2008 (or pre-s10u5) timeframe should be taken with a grain of salt.

There is a fix in the immediate future to prevent I/O timeouts from hanging other parts of the system - namely administrative commands and other pool activity. So I/O to that particular pool will hang, but you'll still be able to run your favorite ZFS commands, and it won't impact the ability of other pools to run.

We have some good ideas on how to improve the retry logic. There is a flag in Solaris, B_FAILFAST, that tells the drive not to try too hard getting the data. However, it can return failure when trying harder would produce the correct results. Currently, we try the first I/O with B_FAILFAST, and if that fails we immediately retry without the flag. The idea is to elevate the retry logic to a higher level, so when a read from one side of a mirror fails with B_FAILFAST, instead of immediately retrying the same device without the failfast flag, we push the error higher up the stack and issue another B_FAILFAST I/O to the other half of the mirror. Only if both fail with failfast do we try a more thorough request (though with ditto blocks we may try another vdev altogether). This should improve I/O error latency for a subset of failure scenarios, and biasing reads away from degraded (but not faulty) devices should also improve response time. The tricky part is incorporating this into the FMA diagnosis engine, as devices may fail B_FAILFAST requests for a variety of non-fatal reasons.

Finally, imposing additional timeouts in ZFS is a bad idea. ZFS is designed to be a generic storage consumer. It can be layered on top of directly attached disks, SSDs, SAN devices, iSCSI targets, files, and basically anything else. As such, it doesn't have the necessary context to know what constitutes a reasonable timeout. This is explicitly delegated to the underlying storage subsystem. If a storage subsystem is timing out for excessive periods of time when B_FAILFAST is set, then that's a bug in the storage subsystem, and working around it in ZFS with yet another set of tunables is not practical. It will be interesting to see if this is an issue after the retry logic is modified as described above.

Hope that helps,

- Eric
--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
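[Editor's note: a small userspace C sketch of the retry ordering Eric describes -- every mirror child gets a fail-fast attempt before any child gets the slow, thorough retry. The outcomes are simulated and none of these function names come from the ZFS source.]

/*
 * Sketch: first pass issues cheap "failfast" reads to every child;
 * only if all of those fail does the second, patient pass begin.
 */
#include <stdio.h>
#include <stdbool.h>

#define	NCHILDREN	2

/* Simulated outcome of a failfast vs. thorough read on each child. */
static bool failfast_ok[NCHILDREN] = { false, true };
static bool thorough_ok[NCHILDREN] = { true,  true };

static bool
child_read(int child, bool failfast)
{
	bool ok = failfast ? failfast_ok[child] : thorough_ok[child];

	printf("child %d, %s read: %s\n", child,
	    failfast ? "failfast" : "thorough", ok ? "ok" : "error");
	return (ok);
}

static bool
mirror_read(void)
{
	/* Pass 1: impatient attempts on every child. */
	for (int c = 0; c < NCHILDREN; c++)
		if (child_read(c, true))
			return (true);

	/* Pass 2: only now pay for the full-retry attempts. */
	for (int c = 0; c < NCHILDREN; c++)
		if (child_read(c, false))
			return (true);

	return (false);		/* redundancy exhausted; ditto blocks next */
}

int
main(void)
{
	printf("read %s\n", mirror_read() ? "succeeded" : "failed");
	return (0);
}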
Miles Nordin
2008-Aug-28 18:17 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> Finally, imposing additional timeouts in ZFS is a bad idea.
    es> [...] As such, it doesn't have the necessary context to know
    es> what constitutes a reasonable timeout.

you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same.

The former case eliminates the need for the mirror policies in SVM, which Ian requested a few hours ago for the situation where half the mirror is a slow iSCSI target for geographic redundancy and half is faster/local. Some care would have to be taken for targets shared by ZFS and some other initiator, but I'm not sure the care would really be that difficult to take, or that the oscillations induced by failing to take it would really be particularly harmful compared to unsupervised contention for a device.

The latter quickly notices drives that have been pulled, or, for Richard's ``overwhelmingly dominant'' case, drives which are stalled for 30 seconds pending their report of an unrecovered read.

Developing meaningful performance statistics for drives and a tool for displaying them would be useful in itself, not just for stopping freezes and preventing a failing drive from degrading performance a thousandfold. Issuing reads to redundant devices is cheap compared to freezing. The policy with which it's done is highly tunable and should be fun to tune and watch, and the consequence if the policy makes the wrong choice isn't incredibly dire.

This B_FAILFAST architecture captures the situation really poorly. First, it's not implementable in any serious way with near-line drives, or really with any drives with which you're not intimately familiar and in control of firmware/release-engineering, and perhaps not with any drives period. I suspect in practice it's more a controller-level feature, about whether or not you'd like to distrust the device's error report and start resetting busses and channels and mucking everything up trying to recover from some kind of ``weirdness''. It's not an answer to the known problem of drives stalling for 30 seconds when they start to fail.

First and a half, when it's not implemented, the system degrades to doubling your timeout pointlessly. A driver-level block cache of UNC's would probably have more value toward this speed/read-aggressiveness tradeoff than the whole B_FAILFAST architecture---just cache known unrecoverable read sectors, and refuse to issue further I/O for them until a timeout of 3 - 10 minutes passes. I bet this would speed up most failures tremendously, and without burdening upper layers with retry logic.

Second, B_FAILFAST entertains the fantasy that I/O's are independent, while what happens in practice is that the drive hits a UNC on one I/O, and won't entertain any further I/O's no matter what flags the request has on it or how many times you ``reset'' things.
Maybe you could try to rescue B_FAILFAST by putting clever statistics into the driver to compare the drive's performance to its recent past, as I suggested ZFS do, and admit no B_FAILFAST requests to the queues of drives that have suddenly slowed down, just fail them immediately without even trying. I submit this queueing and statistic collection is actually _better_ managed by ZFS than by the driver, because ZFS can compare a whole floating-point statistic across a whole vdev, while even a driver which is fancier than we ever dreamed is still playing poker with only 1 bit of input: ``I'll call,'' or ``I'll fold.'' ZFS can see all the cards and get better results while being stupider and requiring less clever poker-guessing than would be required by a hypothetical driver B_FAILFAST implementation that actually worked.
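[Editor's note: a toy C version of the sibling comparison Miles argues for -- demote a mirror child to write-only once its smoothed read latency is an order of magnitude worse than its best sibling, and emit an event. The 10x threshold, field names, and latencies are invented for illustration.]

#include <stdio.h>

#define	NCHILDREN	3

typedef struct {
	const char	*name;
	double		avg_ms;		/* smoothed recent read latency */
	int		read_ok;	/* still eligible for normal reads? */
} child_t;

static void
update_read_policy(child_t *c, int n)
{
	double best = c[0].avg_ms;

	for (int i = 1; i < n; i++)
		if (c[i].avg_ms < best)
			best = c[i].avg_ms;

	for (int i = 0; i < n; i++) {
		int was = c[i].read_ok;

		/* roughly "an order of magnitude slower" than the best */
		c[i].read_ok = (c[i].avg_ms <= best * 10.0);
		if (was && !c[i].read_ok)
			printf("event: %s demoted to write-only "
			    "(%.1f ms vs best %.1f ms)\n",
			    c[i].name, c[i].avg_ms, best);
	}
}

int
main(void)
{
	child_t mirror[NCHILDREN] = {
		{ "local-A", 6.0,   1 },
		{ "local-B", 7.5,   1 },
		{ "iscsi-C", 900.0, 1 },	/* slow remote half */
	};

	update_read_policy(mirror, NCHILDREN);
	return (0);
}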
Eric Schrock
2008-Aug-28 18:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote:
> you're right in terms of fixed timeouts, but there's no reason it
> can't compare the performance of redundant data sources, and if one
> vdev performs an order of magnitude slower than another set of vdevs
> with sufficient redundancy, stop issuing reads except scrubs/healing
> to the underperformer (issue writes only), and pass an event to FMA.

Yep, latency would be a useful metric to add to mirroring choices. The current logic is rather naive (round-robin) and could easily be enhanced. Making diagnoses based on this is much trickier, particularly at the ZFS level. A better option would be to leverage the SCSI FMA work going on to do a more intimate diagnosis at the scsa level. Also, the problem you are trying to solve - timing out the first I/O that takes a long time - is not captured well by the type of hysteresis you would need to perform in order to do this diagnosis. It certainly can be done, but is much better suited to diagnosing a failing drive over time, not aborting a transaction in response to immediate failure.

> This B_FAILFAST architecture captures the situation really poorly.

I don't think you understand how this works. Imagine two I/Os, just with different sd timeouts and retry logic - that's B_FAILFAST. It's quite simple, and independent of any hardware implementation.

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
Bob Friesenhahn
2008-Aug-28 18:59 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> you're right in terms of fixed timeouts, but there's no reason it
> can't compare the performance of redundant data sources, and if one
> vdev performs an order of magnitude slower than another set of vdevs
> with sufficient redundancy, stop issuing reads except scrubs/healing
> to the underperformer (issue writes only), and pass an event to FMA.

You are saying that I can't split my mirrors between a local disk in Dallas and a remote disk in New York accessed via iSCSI? Why don't you want me to be able to do that? ZFS already backs off from writing to slow vdevs.

> ZFS can also compare the performance of a drive to itself over time,
> and if the performance suddenly decreases, do the same.

While this may be useful for reads, I would hate to disable redundancy just because a device is currently slow.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Ross Smith
2008-Aug-28 19:34 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi guys,

Bob, my thought was to have this timeout as something that can be optionally set by the administrator on a per pool basis. I'll admit I was mainly thinking about reads and hadn't considered the write scenario, but even having thought about that it's still a feature I'd like. After all, this would be a timeout set by the administrator based on the longest delay they can afford for that storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected; as far as I'm concerned that disk is faulty. I'd be quite happy for the array to drop to a degraded mode based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated. It's good to hear you're working on this, and I love the idea of doing a B_FAILFAST read on both halves of the mirror.

I do have a question though. From what you're saying, the response time can't be consistent across all hardware, so you're once again at the mercy of the storage drivers. Do you know how long B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 seconds I would still consider that too slow I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but I don't see how this particular timeout does anything other than help ZFS when it's dealing with such a diverse range of media. I agree that ZFS can't know itself what should be a valid timeout, but that's exactly why this needs to be an optional administrator set parameter. The administrator of a storage array who wants to set this certainly knows what a valid timeout is for them, and these timeouts are likely to be several orders of magnitude larger than the standard response times. I would configure very different values for my SATA drives than for my iSCSI connections, but in each case I would be happier knowing that ZFS has more of a chance of catching bad drivers or unexpected scenarios.

I very much doubt hardware raid controllers would wait 3 minutes for a drive to return a response; they will have their own internal timeouts to know when a drive has failed, and while ZFS is dealing with very different hardware I can't help but feel it should have that same approach to management of its drives.

However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out.

Ross
Miles Nordin
2008-Aug-28 19:40 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> I don't think you understand how this works. Imagine two
    es> I/Os, just with different sd timeouts and retry logic - that's
    es> B_FAILFAST. It's quite simple, and independent of any
    es> hardware implementation.

AIUI the main timeout to which we should be subject, at least for nearline drives, is about 30 seconds long and is decided by the drive's firmware, not the driver, and can't be negotiated in any way that's independent of the hardware implementation, although sometimes there are dependent ways to negotiate it. The driver could also decide through ``retry logic'' to time out the command sooner, before the drive completes it, but this won't do much good because the drive won't accept a second command until ITS timeout expires.

which leads to the second problem, that we're talking about timeouts for individual I/O's, not marking whole devices. A ``fast'' timeout of even 1 second could cause a 100- or 1000-fold decrease in performance, which could end up being equivalent to a freeze depending on the type of load on the filesystem.
Eric Schrock
2008-Aug-28 20:05 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:
> Personally, if a SATA disk wasn't responding to any requests after 2
> seconds I really don't care if an error has been detected, as far as
> I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the disk, or the bus was reset, or...

> I do have a question though. From what you're saying, the response
> time can't be consistent across all hardware, so you're once again at
> the mercy of the storage drivers. Do you know how long B_FAILFAST
> takes to return a response on iSCSI? If that's over 1-2 seconds I
> would still consider that too slow I'm afraid.

Its main function is how it deals with retryable errors. If the drive responds with a retryable error, or any error at all, it won't attempt to retry again. If you have a device that is taking arbitrarily long to respond to successful commands (or to notice that a command won't succeed), it won't help you.

> I understand that Sun in general don't want to add fault management to
> ZFS, but I don't see how this particular timeout does anything other
> than help ZFS when it's dealing with such a diverse range of media. I
> agree that ZFS can't know itself what should be a valid timeout, but
> that's exactly why this needs to be an optional administrator set
> parameter. The administrator of a storage array who wants to set this
> certainly knows what a valid timeout is for them, and these timeouts
> are likely to be several orders of magnitude larger than the standard
> response times. I would configure very different values for my SATA
> drives than for my iSCSI connections, but in each case I would be
> happier knowing that ZFS has more of a chance of catching bad drivers
> or unexpected scenarios.

The main problem with exposing tunables like this is that they have a direct correlation to service actions, and mis-diagnosing failures costs everybody (admin, companies, Sun, etc) lots of time and money. Once you expose such a tunable, it will be impossible to trust any FMA diagnosis, because you won't be able to know whether it was a mistaken tunable.

A better option would be to not use this to perform FMA diagnosis, but instead work it into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier.

As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such "best effort RAS" is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer.

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
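[Editor's note: for illustration only, a C sketch of the bookkeeping Eric outlines, using Welford's online mean/variance per mirror child to (a) prefer the lower-latency child and (b) call an in-flight read overdue past mean plus three standard deviations. The 3-sigma test and all names are assumptions, not anything in ZFS.]

#include <stdio.h>
#include <math.h>

typedef struct {
	const char	*name;
	long		n;
	double		mean;
	double		m2;	/* running sum of squared deviations */
} child_stat_t;

/* Welford's online update with one latency sample, in milliseconds. */
static void
stat_update(child_stat_t *s, double ms)
{
	double d = ms - s->mean;

	s->n++;
	s->mean += d / s->n;
	s->m2 += d * (ms - s->mean);
}

static double
stat_stddev(const child_stat_t *s)
{
	return (s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0.0);
}

/* (a) which child should get the next read? */
static int
prefer_child(const child_stat_t *a, const child_stat_t *b)
{
	return (a->mean <= b->mean ? 0 : 1);
}

/* (b) has this outstanding read been in flight suspiciously long? */
static int
read_overdue(const child_stat_t *s, double in_flight_ms)
{
	return (in_flight_ms > s->mean + 3.0 * stat_stddev(s));
}

int
main(void)
{
	child_stat_t a = { "c1t0d0", 0, 0, 0 }, b = { "c1t1d0", 0, 0, 0 };
	double a_lat[] = { 6, 7, 6.5, 8, 6 };
	double b_lat[] = { 6, 90, 120, 85, 110 };

	for (int i = 0; i < 5; i++) {
		stat_update(&a, a_lat[i]);
		stat_update(&b, b_lat[i]);
	}
	printf("prefer child %d; 500 ms on %s overdue: %s\n",
	    prefer_child(&a, &b), b.name,
	    read_overdue(&b, 500) ? "yes" : "no");
	return (0);
}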
Miles Nordin
2008-Aug-28 20:31 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:

    bf> If the system or device is simply overwhelmed with work, then
    bf> you would not want the system to go haywire and make the
    bf> problems much worse.

None of the decisions I described it making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that?

    bf> You are saying that I can't split my mirrors between a local
    bf> disk in Dallas and a remote disk in New York accessed via
    bf> iSCSI?

nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation.

The performance-statistic logic should influence read scheduling immediately, and generate events which are fed to FMA; then FMA can mark devices faulty. There's no need for both to make the same decision at the same time. If the events aren't useful for diagnosis, ZFS could not bother generating them, or fmd could ignore them in its diagnosis. I suspect they *would* be useful, though. I'm imagining the read rescheduling would happen very quickly, quicker than one would want a round-trip from FMA, in much less than a second. That's why it would have to compare devices to others in the same vdev, and to themselves over time, rather than use fixed timeouts or punt to haphazard driver and firmware logic.

    bf> o System waits substantial time for devices to (possibly)
    bf> recover in order to ensure that subsequently written data has
    bf> the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written, unless you are calling fsync, and maybe not even then if there's a slog. I said clearly that you read only one half of the mirror, but write to both. But you're right that the trick probably won't work perfectly---eventually dead devices need to be faulted. The idea is that normal write caching will buy you orders of magnitude longer time in which to make a better decision before anyone notices.

Experience here is that ``waits substantial time'' usually means ``freezes for hours and gets rebooted''. There's no need to be abstract: we know what happens when a drive starts taking 1000x - 2000x longer than usual to respond to commands, and we know that this is THE common online failure mode for drives. That's what started the thread. so, think about this: hanging for an hour trying to write to a broken device may block other writes to devices which are still working, until the patiently-waiting data is eventually lost in the reboot.

    bf> o System immediately ignores slow devices and switches to
    bf> non-redundant non-fail-safe non-fault-tolerant
    bf> may-lose-your-data mode. When system is under intense load,
    bf> it automatically switches to the may-lose-your-data mode.

nobody's proposing a system which silently rocks back and forth between faulted and online. That's not what we have now, and no such system would naturally arise. If FMA marked a drive faulty based on performance statistics, that drive would get retired permanently and hot-spare-replaced. Obviously false positives are bad, just as obviously as freezes/reboots are bad.

It's not my idea to use FMA in this way.
This is how FMA was pitched, and the excuse for leaving good exception handling out of ZFS for two years. so, where's the beef?
Ian Collins
2008-Aug-28 21:15 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Eric Schrock writes:
> A better option would be to not use this to perform FMA diagnosis, but
> instead work it into the mirror child selection code. This has already
> been alluded to before, but it would be cool to keep track of latency
> over time, and use this to both a) prefer one drive over another when
> selecting the child and b) proactively timeout/ignore results from one
> child and select the other if it's taking longer than some historical
> standard deviation. This keeps away from diagnosing drives as faulty,
> but does allow ZFS to make better choices and maintain response times.
> It shouldn't be hard to keep track of the average and/or standard
> deviation and use it for selection; proactively timing out the slow I/Os
> is much trickier.

This would be a good solution to the remote iSCSI mirror configuration. I've been working through this situation with a client (we have been comparing ZFS with Cleversafe) and we'd love to be able to get the read performance of the local drives from such a pool.

> As others have mentioned, things get more difficult with writes. If I
> issue a write to both halves of a mirror, should I return when the first
> one completes, or when both complete? One possibility is to expose this
> as a tunable, but any such "best effort RAS" is a little dicey because
> you have very little visibility into the state of the pool in this
> scenario - "is my data protected?" becomes a very difficult question to
> answer.

One solution (again, to be used with a remote mirror) is the three way mirror. If two devices are local and one remote, data is safe once the two local writes return. I guess the issue then changes from "is my data safe" to "how safe is my data". I would be reluctant to deploy a remote mirror device without local redundancy, so this probably won't be an uncommon setup. There would have to be an acceptable window of risk when local data isn't replicated.

Ian
Ian Collins
2008-Aug-28 21:21 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin writes:
>>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
>     bf> You are saying that I can't split my mirrors between a local
>     bf> disk in Dallas and a remote disk in New York accessed via
>     bf> iSCSI?
>
> nope, you've misread. I'm saying reads should go to the local disk
> only, and writes should go to both. See SVM's 'metaparam -r'. I
> suggested that unlike the SVM feature it should be automatic, because
> by so being it becomes useful as an availability tool rather than just
> performance optimisation.

So on a server with a read workload, how would you know if the remote volume was working?

Ian
Bob Friesenhahn
2008-Aug-28 21:27 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:
> None of the decisions I described it making based on performance
> statistics are ``haywire''---I said it should funnel reads to the
> faster side of the mirror, and do this really quickly and
> unconservatively. What's your issue with that?

From what I understand, this is partially happening now based on average service time. If I/O is backed up for a device, then the other device is preferred.

However it is good to keep in mind that if data is never read, then it is never validated and corrected. It is good for ZFS to read data sometimes.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
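[Editor's note: a trivial C sketch of Bob's point -- bias reads heavily toward the preferred mirror child, but still send the occasional read to the slow one so its blocks keep being checksum-verified. The 1-in-16 ratio is an arbitrary example, not anything ZFS does.]

#include <stdio.h>
#include <stdlib.h>

/* Pick child 0 (preferred/fast) most of the time, child 1 occasionally. */
static int
pick_child(void)
{
	return ((rand() % 16) == 0 ? 1 : 0);
}

int
main(void)
{
	int counts[2] = { 0, 0 };

	srand(1);
	for (int i = 0; i < 1000; i++)
		counts[pick_child()]++;
	printf("fast child: %d reads, slow child: %d reads\n",
	    counts[0], counts[1]);
	return (0);
}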
Bill Sommerfeld
2008-Aug-28 21:46 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
> A better option would be to not use this to perform FMA diagnosis, but
> instead work it into the mirror child selection code. This has already
> been alluded to before, but it would be cool to keep track of latency
> over time, and use this to both a) prefer one drive over another when
> selecting the child and b) proactively timeout/ignore results from one
> child and select the other if it's taking longer than some historical
> standard deviation. This keeps away from diagnosing drives as faulty,
> but does allow ZFS to make better choices and maintain response times.
> It shouldn't be hard to keep track of the average and/or standard
> deviation and use it for selection; proactively timing out the slow I/Os
> is much trickier.

tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network.

it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance).

I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience.

- Bill
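[Editor's note: a C sketch of the RFC 2988-style estimator Bill points at, recast for device latency. The gains (alpha = 1/8, beta = 1/4) and K = 4 follow the RFC; the millisecond floor and the application to disk I/O are assumptions for illustration.]

#include <stdio.h>
#include <math.h>

typedef struct {
	double	srtt;	/* smoothed latency */
	double	rttvar;	/* smoothed mean deviation */
	double	rto;	/* "overdue" threshold */
	int	seeded;
} lat_est_t;

/* Feed one completed-I/O latency sample into the estimator. */
static void
lat_sample(lat_est_t *e, double r_ms)
{
	const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0, k = 4.0;
	const double floor_ms = 10.0;	/* placeholder granularity */

	if (!e->seeded) {
		e->srtt = r_ms;
		e->rttvar = r_ms / 2.0;
		e->seeded = 1;
	} else {
		e->rttvar = (1.0 - beta) * e->rttvar + beta * fabs(e->srtt - r_ms);
		e->srtt = (1.0 - alpha) * e->srtt + alpha * r_ms;
	}
	e->rto = e->srtt + (k * e->rttvar > floor_ms ? k * e->rttvar : floor_ms);
}

int
main(void)
{
	lat_est_t e = { 0 };
	double samples[] = { 8.0, 9.0, 7.5, 8.2, 30.0, 8.1 };
	int n = sizeof (samples) / sizeof (samples[0]);

	for (int i = 0; i < n; i++) {
		lat_sample(&e, samples[i]);
		printf("sample %5.1f ms -> srtt %5.1f, overdue after %6.1f ms\n",
		    samples[i], e.srtt, e.rto);
	}
	return (0);
}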
Richard Elling
2008-Aug-28 23:24 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Bill Sommerfeld wrote:
> On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work it into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow I/Os
>> is much trickier.
>
> tcp has to solve essentially the same problem: decide when a response is
> "overdue" based only on the timing of recent successful exchanges in a
> context where it's difficult to make assumptions about "reasonable"
> expected behavior of the underlying network.
>
> it tracks both the smoothed round trip time and the variance, and
> declares a response overdue after (SRTT + K * variance).
>
> I think you'd probably do well to start with something similar to what's
> described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
> experience.

I think this is a good place to start. In general, we can see 3 orders of magnitude range for magnetic disk I/Os, 4 orders of magnitude for power managed disks. With that range, I don't see the variance being small, at least for magnetic disks. SSDs will have a much smaller variance, in general. For lopsided mirrors, such as magnetic disk mirrored to SSD, or Bob's Dallas vs New York paths, we should be able to automatically steer towards the faster side.

However, a comprehensive solution must also deal with top-level vdev usage, which can be very different than the physical vdevs. We can use driver-level FMA for the physical vdevs, but ultimately ZFS will need to be able to make decisions based on the response time across the top-level vdevs. This can be implemented in two phases, of course.

I've got some lopsided mirror TNF data, so we could fairly easily try some algorithms... I'll whip it into shape for further analysis.
 -- richard
Nicolas Williams
2008-Aug-29 15:48 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
> Which of these do you prefer?
>
> o System waits substantial time for devices to (possibly) recover in
>   order to ensure that subsequently written data has the least
>   chance of being lost.
>
> o System immediately ignores slow devices and switches to
>   non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
>   mode. When system is under intense load, it automatically
>   switches to the may-lose-your-data mode.

Given how long a resilver might take, waiting some time for a device to come back makes sense. Also, if a cable was taken out, or drive tray powered off, then you'll see lots of drives timing out, and then the better thing to do is to wait (heuristic: not enough spares to recover).

Nico
--
Nicolas Williams
2008-Aug-29 16:02 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote:
> As others have mentioned, things get more difficult with writes. If I
> issue a write to both halves of a mirror, should I return when the first
> one completes, or when both complete? One possibility is to expose this
> as a tunable, but any such "best effort RAS" is a little dicey because
> you have very little visibility into the state of the pool in this
> scenario - "is my data protected?" becomes a very difficult question to
> answer.

Depending on the amount of redundancy left one might want the writes to continue. E.g., a 3-way mirror with one vdev timing out or going extra slow, or Richard's lopsided mirror example.

The value of "best effort RAS" might make a useful property for mirrors and RAIDZ-2. If because of some slow vdev you've got less redundancy for recent writes, but still have enough (for some value of "enough"), and still have full redundancy for older writes, well, that's not so bad.

Something like:

% # require successful writes to at least two mirrors and wait no more
% # than 15 seconds for the 3rd.
% zpool create mypool mirror ... mirror ... mirror ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=15s mypool

and for known-to-be-lopsided mirrors:

% # require successful writes to at least two mirrors and don't wait for
% # the slow vdevs
% zpool create mypool mirror ... mirror ... mirror -slow ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=0s mypool

?

Nico
--
Richard Elling
2008-Aug-29 18:07 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Nicolas Williams wrote:
> On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
>> Which of these do you prefer?
>>
>> o System waits substantial time for devices to (possibly) recover in
>>   order to ensure that subsequently written data has the least
>>   chance of being lost.
>>
>> o System immediately ignores slow devices and switches to
>>   non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
>>   mode. When system is under intense load, it automatically
>>   switches to the may-lose-your-data mode.
>
> Given how long a resilver might take, waiting some time for a device to
> come back makes sense. Also, if a cable was taken out, or drive tray
> powered off, then you'll see lots of drives timing out, and then the
> better thing to do is to wait (heuristic: not enough spares to recover).

argv! I didn't even consider switches. Ethernet switches often use spanning-tree algorithms to converge on the topology. I'm not sure what SAN switches use. We have the following problem with highly available clusters which use switches in the interconnect:

+ Solaris Cluster interconnect timeout defaults to 10 seconds
+ STP can take > 30 seconds to converge

So, if you use Ethernet switches in the interconnect, you need to disable STP on the ports used for interconnects or risk unnecessary cluster reconfigurations. Normally, this isn't a problem as the people who tend to build HA clusters also tend to read the docs which point this out. Still, a few slip through every few months. As usual, Solaris Cluster gets blamed, though it really is a systems engineering problem.

Can we expect a similar attention to detail for ZFS implementers? I'm afraid not :-(. I'm not confident we can be successful with sub-minute reconfiguration, so the B_FAILFAST may be the best we could do for the general case. That isn't so bad; in fact we use failfasts rather extensively for Solaris Clusters, too.
 -- richard
Miles Nordin
2008-Aug-29 21:14 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

    es> The main problem with exposing tunables like this is that they
    es> have a direct correlation to service actions, and
    es> mis-diagnosing failures costs everybody (admin, companies,
    es> Sun, etc) lots of time and money. Once you expose such a
    es> tunable, it will be impossible to trust any FMA diagnosis,

Yeah, I tend to agree that the constants shouldn't be tunable, because I hoped Sun would become a disciplined collection-point for experience to set the constants, discipline meaning the constants are only adjusted in response to bad diagnosis not ``preference,'' and in a direction that improves diagnosis for everyone, not for ``the site''. I'm not yet won over to the idea that statistical FMA diagnosis constants shouldn't exist. I think drives can't diagnose themselves for shit, and I think drivers these days are diagnosees, not diagnosers. But clearly a confusingly-bad diagnosis is much worse than diagnosis that's bad in a simple way.

    es> If I issue a write to both halves of a mirror, should
    es> I return when the first one completes, or when both complete?

well, if it's not a synchronous write, you return before you've written either half of the mirror, so it's only an issue for O_SYNC/ZIL writes, true? BTW what does ZFS do right now for synchronous writes to mirrors, wait for all, wait for two, or wait for one?

    es> any such "best effort RAS" is a little dicey because you have
    es> very little visibility into the state of the pool in this
    es> scenario - "is my data protected?" becomes a very difficult
    es> question to answer.

I think it's already difficult. For example, a pool will say ONLINE while it's resilvering, won't it? I might be wrong. Take a pool that can only tolerate one failure. Is the difference between replacing an ONLINE device (still redundant) and replacing an OFFLINE device (not redundant until resilvered) captured? Likewise, should a pool with a spare in use really be marked DEGRADED both before the spare resilvers and after? The answers to the questions aren't important so much as that you have to think about the answers---what should they be, what are they now---which means ``is my data protected?'' is already a difficult question to answer.

Also there were recently fixed bugs with DTL. The status of each device's DTL, even the existence and purpose of the DTL, isn't well-exposed to the admin, and is relevant to answering the ``is my data protected?'' question---indirect means of inspecting it like tracking the status of resilvering seem too wallpapered given that the bug escaped notice for so long. I agree with the problem 100% and don't wish to worsen it, just disagree that it's a new one.

    re> 3 orders of magnitude range for magnetic disk I/Os, 4 orders
    re> of magnitude for power managed disks.

I would argue for power management a fixed timeout. The time to spin up doesn't have anything to do with the io/s you got before the disk spun down. There's no reason to disguise the constant for which we secretly wish inside some fancy math for deriving it just because writing down constants feels bad. unless you _know_ the disk is spinning up through some in-band means, and want to compare its spinup time to recorded measurements of past spinups.

This is a good case for pointing out there are two sets of rules:

* 'metaparam -r' rules

  + not invoked at all if there's no redundancy.

  + very complicated

    - involve sets of disks, not one disk.
      comparison of statistic among disks within a vdev (definitely),
      and comparison of individual disks to themselves over time
      (possibly).

    - complicated output: rules return a set of disks per vdev, not a
      yay-or-nay diagnosis per disk. And there are two kinds of output
      decision:

      o for n-way mirrors, select anywhere from 1 to n disks. for
        example, a three-way mirror with two fast local mirrors, one
        slow remote iSCSI mirror, should split reads among the two
        local disks. for raidz and raidz2 they can eliminate 0, 1 (or
        2) disks from the read-us set. It's possible to issue all the
        reads and take the first sufficient set to return as Anton
        suggested, but I imagine 4-device raidz2 vdevs will be common,
        which could some day perform as well as a 2-device mirror.

      o also, decide when to stop waiting on an existing read and
        re-issue it. so the decision is not only about future reads,
        but has to cancel already-issued reads, possibly replacing the
        B_FAILFAST mechanism so there will be a second uncancellable
        round of reads once the first round exhausts all redundancy.

      o that second decision needs to be made thousands of times per
        second without a lot of CPU overhead

  + small consequence if the rules deliver false-positives, just
    reduced performance (which is the same with the TCP fast-retransmit
    rules Bill mentioned)

  + large consequence for false-negatives (system freeze), so one can't
    really say, ``we won't bother doing it for raidz2 because it's too
    complicated.'' The rules are NOT just about optimizing performance.

  + at least partly in kernel

* diagnosis rules

  + should it be invoked for single-device vdevs? Does ZFS diagnosis
    already consider that a device in an unredundant vdev should be
    FAULTED less aggressively (ex., never for CKSUM errors)? this is
    arguable.

  + diagnosis is strictly per-disk and should compare disks only to
    themselves, or to cultural memory of The Typical Disk in the form
    of untunable constants, never others in the same vdev

  + three possible verdicts per disk:

    - all's good

    - warn the sysadmin about this disk but keep writing to it

    - fault this disk in ZFS. no further I/O, not even writes, and
      start rebuilding it onto a spare

    Erik points out that false positives are expensive in BOTH cases,
    not just the second, because even the warning can initiate
    expensive repair procedures and reduce trust in FMA diagnoses. so,
    there should probably be only two verdicts, good and fault. If the
    statistics are extractable, more aggressive sysadmins can devise
    their own warning rules and competitively try to predict the
    future. The owners of large clusters might be better at crafting
    warning rules than Sun, but their results won't be general.

  + potentially complicated, but might be really simple, like ``an I/O
    takes more than three minutes to complete.''

  + A more complicated but still somewhat simple hypothetical rule:
    ``one I/O hasn't returned completion or failure after 10 minutes,
    OR at least one I/O originally issued to the driver from within
    each of three separate four-minute-long buckets within the last 40
    minutes takes 1000 times longer than usual or more than 120
    seconds, whichever is larger (three slow I/O's in recent past)''
    These might be really bad rules. my point is that variance, or some
    more complicated statistic than addition and buckets, might be good
    for diagnosing bad disks but isn't necessarily required, while for
    the 'metaparam -r' rules it IS required.
    for diagnosing bad disks, a big bag of traditional-AI rules might
    be better than statistical/machine-learning rules, and will be
    easier for less-sophisticated developers to modify according to
    experience and futuristic hardware. ex., a power-managed disk
    spinning up takes less than x seconds and should not be spinning
    down more often than every y minutes. SAN fabric disconnection
    should reconnect within z seconds, and unannounced outages don't
    need to be tolerated silently without intervention more than once
    per day. and so on. It may even be possible to generate negative
    fault events, like ``disk IS replying, not silent, and it says
    Not-ready-coming-ready, so don't fault it for 1 minute.'' The
    option of creating this kind of hairy mess of special-case
    layer-violating codified-tradition rules is the advantage I
    perceived in tolerating the otherwise disgusting hairy bolt-on
    shared-lib-spaghetti mess that is FMA. But for the 'metaparam -r'
    rules OTOH, variance/machine-learning is probably the only
    approach.

  + rules are in userland, can be more expensive CPU-wise, and return
    feedback to the kernel only a couple times a minute, not per-I/O
    like the 'metaparam -r' reissue rules.

I guess I'm changing my story slightly. I *would* want ZFS to collect drive performance statistics and report them to FMA, but I wouldn't suggest reporting the _decision_ outputs of the 'metaparam -r'-replacement engine to FMA, only the raw stats. and, of course, ``reporting'' is tricky for the diagnosis case because of the bolted-on separation of FMA. You can't usefully report ``the I/O took 3 hours to complete'' because you've now waited three hours to get the report, and the completed I/O has a normal driver error attached to it, so no fancy statistical decisions are any longer needed. Instead, you have to make polled reports to userland a couple times a minute, containing the list of incomplete outstanding I/O's, along with averages and variances and whatever else.
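[Editor's note: a toy C evaluation of the bucketed rule Miles hypothesizes above -- fault a drive once three distinct four-minute buckets inside a 40-minute window each contain a slow I/O. Every constant is just his example restated, not a recommendation, and nothing here comes from FMA or ZFS.]

#include <stdio.h>

#define	BUCKET_SECS	(4 * 60)
#define	NBUCKETS	10	/* 40 minutes of history */

typedef struct {
	long	slow_bucket[NBUCKETS];	/* bucket index that saw a slow I/O, or -1 */
} diag_t;

static void
diag_init(diag_t *d)
{
	for (int i = 0; i < NBUCKETS; i++)
		d->slow_bucket[i] = -1;
}

/* Record a slow I/O observed at time `now` (seconds since start). */
static void
diag_slow_io(diag_t *d, long now)
{
	long bucket = now / BUCKET_SECS;

	d->slow_bucket[bucket % NBUCKETS] = bucket;
}

/* Should the drive be faulted, judged at time `now`? */
static int
diag_fault(const diag_t *d, long now)
{
	long oldest = now / BUCKET_SECS - (NBUCKETS - 1);
	int hits = 0;

	for (int i = 0; i < NBUCKETS; i++)
		if (d->slow_bucket[i] >= 0 && d->slow_bucket[i] >= oldest)
			hits++;
	return (hits >= 3);
}

int
main(void)
{
	diag_t d;

	diag_init(&d);
	diag_slow_io(&d, 100);		/* bucket 0 */
	diag_slow_io(&d, 700);		/* bucket 2 */
	printf("after two slow I/Os: fault=%d\n", diag_fault(&d, 800));
	diag_slow_io(&d, 1500);		/* bucket 6 */
	printf("after three slow I/Os: fault=%d\n", diag_fault(&d, 1600));
	return (0);
}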
Bob Friesenhahn
2008-Aug-29 21:49 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, 29 Aug 2008, Miles Nordin wrote:
> I guess I'm changing my story slightly. I *would* want ZFS to collect
> drive performance statistics and report them to FMA, but I wouldn't

Your email *totally* blew my limited buffer size, but this little bit remained for me to look at. It left me wondering how ZFS would know if the device is a drive. How can ZFS maintain statistics for a "drive" if it is perhaps not a drive at all? ZFS does not require that the device be a "drive". Isn't ZFS the wrong level to be managing the details of a device?

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Miles Nordin
2008-Aug-29 23:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> if you use Ethernet switches in the interconnect, you need to re> disable STP on the ports used for interconnects or risk re> unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as ``edge'''' is good enough, less risky for the WAN, and pretty ubiquitously supported with non-EOL switches. The network guys will know this (assuming you have network guys) and do something like this: sw: can you disable STP for me? net: No? sw: <jumping up and down screaming> net: um,...i mean, Why? sw: [....] net: oh, that. Ok, try it now. sw: thanks for disabling STP for me. net: i uh,.. whatever. No problem! re> Can we expect a similar attention to detail for ZFS re> implementers? I''m afraid not :-(. well....you weren''t really ``expecting'''' it of the sun cluster implementers. You just ran into it by surprise in the form of an Issue. so, can you expect ZFS implementers to accept that running ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they didn''t already know? So far they seem receptive to arcane advice like ``make this config change in your SAN controller to let it use the NVRAM cache more aggressively, and stop using EMC PowerPath unless <blah>.'''' so, Yes? I think you can also expect them to wait longer than 40 seconds before declaring a system is frozen and rebooting it, though. ``Let''s `patiently wait'' forever because we think, based on our uncertainty, that FSPF might take several hours to converge'''' is the alternative that strikes me as unreasonable. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080829/dc156765/attachment.bin>
Richard Elling
2008-Aug-30 00:26 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>     re> if you use Ethernet switches in the interconnect, you need to
>     re> disable STP on the ports used for interconnects or risk
>     re> unnecessary cluster reconfigurations.
>
> RSTP/802.1w plus setting the ports connected to Solaris as ``edge'''' is
> good enough, less risky for the WAN, and pretty ubiquitously supported
> with non-EOL switches.  The network guys will know this (assuming you
> have network guys) and do something like this:
>
> sw: can you disable STP for me?
> net: No?
> sw: <jumping up and down screaming>
> net: um,...i mean, Why?
> sw: [....]
> net: oh, that.  Ok, try it now.
> sw: thanks for disabling STP for me.
> net: i uh,.. whatever.  No problem!
>

Precisely, this is not a problem that is usually solved unilaterally.

>     re> Can we expect a similar attention to detail for ZFS
>     re> implementers?  I''m afraid not :-(.
>
> well....you weren''t really ``expecting'''' it of the sun cluster
> implementers.  You just ran into it by surprise in the form of an
> Issue.

Rather, cluster implementers tend to RTFM.  I know few ZFSers who have
RTFM, and do not expect many to do so... such is life.

> so, can you expect ZFS implementers to accept that running
> ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they
> didn''t already know?

No, I expect them to see a "problem" caused by network reconfiguration
and blame ZFS.  Indeed, this is what occasionally happens with Solaris
Cluster -- but only occasionally, and it is solved via RTFM.

> So far they seem receptive to arcane advice like
> ``make this config change in your SAN controller to let it use the
> NVRAM cache more aggressively, and stop using EMC PowerPath unless
> <blah>.''''  so, Yes?
>

I have no idea what you are trying to say here.

> I think you can also expect them to wait longer than 40 seconds before
> declaring a system is frozen and rebooting it, though.
>

Current [s]sd driver timeouts are 60 seconds with 3-5 retries by
default.  We''ve had those timeouts for many, many years now and do
provide highly available services on such systems.  The B_FAILFAST
change did improve the availability of systems and similar tricks have
improved service availability for Solaris Clusters.  Refer to Eric''s
post for more details of this minefield.  NB some bugids one should
research before filing new bugs here are:

CR 4713686: sd/ssd driver should have an additional target specific timeout
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4713686
CR 4500536 introduces B_FAILFAST
http://bugs.opensolaris.org/view_bug.do?bug_id=4500536

> ``Let''s `patiently wait'' forever because we think, based on our
> uncertainty, that FSPF might take several hours to converge'''' is the
> alternative that strikes me as unreasonable.
>

AFAICT, nobody is making such a proposal.  Did I miss a post?
 -- richard
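[Editor''s note: a quick back-of-the-envelope on the [s]sd defaults
Richard quotes above (60-second command timeout, 3-5 retries).  The
figures are only as good as those defaults; this is not taken from
driver source.]

-----8<-----
# Worst-case delay before the driver gives up on one command,
# assuming a 60s timeout and 3-5 retries as quoted above.
timeout_s = 60
for retries in (3, 4, 5):
    worst_case = timeout_s * retries
    print(f"{retries} retries -> up to {worst_case}s "
          f"({worst_case / 60:.0f} min) for a single command")
# 3 retries -> 180s, which happens to line up with the ~3-minute pool
# hangs and the iscsi_rx_max_window default of 180 seconds that come up
# later in this thread.
-----8<-----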
Ross
2008-Aug-30 07:55 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Wow, some great comments on here now, even a few people agreeing with
me which is nice :D

I''ll happily admit I don''t have the in depth understanding of storage
many of you guys have, but since the idea doesn''t seem pie-in-the-sky
crazy, I''m going to try to write up all my current thoughts on how this
could work after reading through all the replies.

1. Track disk response times
- ZFS should track the average response time of each disk.
- This should be used internally for performance tweaking, so faster
  disks are favoured for reads.  This works particularly well for lop
  sided mirrors.
- I''d like to see this information (and the number of timeouts) in the
  output of zpool status, so administrators can see if any one device
  is performing badly.

2. New parameters
- ZFS should gain two new parameters:
  - A timeout value for the pool.
  - An option to enable that timeout for writes too (off by default).
- Still to be decided is whether that timeout is set manually, or
  automatically based on the information gathered in 1.
- Do we need a pool timeout (based on the timeout of the slowest device
  in the pool), or will individual device timeouts work better?
- I''ve decided that having this off by default for writes is probably
  better for ZFS.  It addresses some people''s concerns about writing to
  a degraded pool, and puts data integrity ahead of availability, which
  seems to fit better with ZFS'' goals.  I''d still like it myself for
  writes; I can live with a pool running degraded for 2 minutes while
  the problem is diagnosed.
- With that said, could the write timeout default to on when you have a
  slog device?  After all, the data is safely committed to the slog,
  and should remain there until it''s written to all devices.  Bob, you
  seemed the most concerned about writes, would that be enough
  redundancy for you to be happy to have this on by default?  If not,
  I''d still be ok having it off by default, we could maybe just include
  it in the evil tuning guide suggesting that this could be turned on
  by anybody who has a separate slog device.

3. How it would work (a rough sketch in code follows further below)
- If a read times out for any device, ZFS should immediately issue
  reads to all other devices holding that data.  The first response
  back will be used.
- Timeouts should be logged so the information can be used by
  administrators or FMA to help diagnose failing drives, but they
  should not count as a device failure on their own.
- Some thought is needed as to how this algorithm works on busy pools.
  When reads are queuing up, we need to avoid false positives and avoid
  adding extra load on the pool.  Would it be a possibility that
  instead of checking the response time for an individual request, this
  timeout is used to check if no responses at all have been received
  from a device for that length of time?  That still sounds reasonable
  for finding stuck devices, and should still work reliably on a busy
  pool.
- For reads, the pool does not need to go degraded, the device is
  simply flagged as "WAITING".
- When enabled for writes, these will be going to all devices, so there
  are no alternate devices to try.  This means any write timeout will
  be used to put the pool into a degraded mode.  This should be
  considered a temporary state with the drive in "WAITING" status, as
  while the pool itself is degraded (due to missing the writes for that
  drive), the drive is not yet offline.  At this point the system is
  simply keeping itself running while waiting for a proper error
  response from either the drive or from FMA.
  If the drive eventually returns the missing response, it can be
  resilvered with any data it missed.  If the drive doesn''t return a
  response, FMA should eventually fault it, and the drive can be taken
  offline and replaced with a hot spare.  At all times the
  administrator can see what is going on using zpool status, with the
  appropriate pool and drive status visible.
- Please bear in mind that although I''m using the word ''degraded''
  above, this is not necessarily the case for dual parity pools, I just
  don''t know the proper term to use for a dual parity raid set where a
  single drive has failed.
- If this is just a one off glitch and the device comes back online,
  the resilver shouldn''t take long as ZFS just needs to send the data
  that was missed (which will still be stored in the ZIL).
- If many devices timeout at once due to a bad controller, cable
  pulled, power failure, etc, all the affected devices will be flagged
  as "WAITING" and if too many have gone for the pool to stay
  operational, ZFS should switch the entire pool to the ''wait'' state
  while it waits for FMA, etc to return a proper response, after which
  it should react according to the proper failmode property for the
  pool.

4. Food for thought
- While I like nico''s idea for lop sided mirrors, I''m not sure any
  tweaking is needed.  I was thinking about whether these timeouts
  could improve performance for such a mirror, but I think a better
  option there is simply to use plenty of local write cache.  A ton of
  flash memory for the ZIL would be the ideal, but if it''s a
  particularly lopsided mirror you could use something as simple as a
  pair of locally mirrored drives.  Reads should be biased to the local
  devices anyway, so you''re making the best possible use of the slow
  half by caching all writes and then streaming everything to the
  remote device.  Of course, things will slow down when the ZIL fills
  as it now has to write to both halves together, but if that happens
  you probably need to rethink your solution anyway as your slow mirror
  isn''t keeping up with your demands.
- I see all this as working on top of the B_FAILFAST modes mentioned in
  the thread.  I don''t see that it has to impact ZFS'' current fault
  management at all.  It''s an extra layer that sits on top, simply with
  the aim of keeping the pool responsive while the lower level fault
  management does its job.  Of course, it''s also able to provide extra
  information for things like FMA, but it''s not intended as a way of
  diagnosing faults itself.
- Thinking about this further, if anything this seems to me like it
  should improve ZFS'' reliability for writes.  It puts the pool into a
  temporary ''degraded'' state much faster if there is a doubt over
  whether any writes have been committed to disk.  It will still be
  queuing writes for that device, exactly as it would have done before,
  but the pool and the administrator now have earlier information as to
  the status.  If a second device enters the "WAITING" state, the whole
  pool immediately switches to ''wait'' mode until a proper diagnosis
  happens.  This happens much earlier than could be the case with FMA
  alone doing the diagnostic, and means that ZFS stops accepting data
  much earlier while it waits to see what the problem is.  For async
  writes it reduces the amount of data that has been accepted by ZFS
  but not committed to storage.  For sync writes I don''t see that it
  has much effect, these writes would have been waiting for the bad
  device anyway.

Well, I think that''s everything.
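[Editor''s sketch of the read path proposed in point 3: time out, flag
the slow child "WAITING", re-issue the read to the other children
holding the data, and take whichever answer comes back first.  The
names, the threading model and the timeout value are all invented for
illustration; this is not how the ZFS I/O pipeline is structured.]

-----8<-----
# Minimal user-land sketch of "timeout, mark WAITING, re-issue the read".
import concurrent.futures as cf

READ_TIMEOUT = 2.0            # hypothetical per-device timeout, seconds
WAITING = set()               # devices currently flagged "WAITING"

def read_with_reissue(block, children, read_fn):
    """children: device names; read_fn(device, block) -> data (may be slow)."""
    with cf.ThreadPoolExecutor(max_workers=len(children)) as pool:
        first, rest = children[0], children[1:]
        future = pool.submit(read_fn, first, block)
        try:
            return future.result(timeout=READ_TIMEOUT)
        except cf.TimeoutError:
            WAITING.add(first)    # not faulted, just not trusted right now
            # Re-issue to every other child holding the data and use the
            # first result; the original read is left to finish (or fail)
            # in its own time, which is why this demo takes a few seconds.
            futures = [pool.submit(read_fn, dev, block) for dev in rest]
            done, _ = cf.wait([future] + futures, return_when=cf.FIRST_COMPLETED)
            return next(iter(done)).result()

if __name__ == "__main__":
    import time
    def fake_read(dev, block):
        time.sleep(6.0 if dev == "remote-iscsi" else 0.01)   # a stuck device
        return f"{block} from {dev}"
    print(read_with_reissue("blk#42", ["remote-iscsi", "local-a", "local-b"],
                            fake_read))
    print("WAITING devices:", WAITING)
-----8<-----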
To me it seems to address most of the problems raised here and there
should be performance benefits to using it in some situations.  It will
definitely have a beneficial effect in ensuring that ZFS maintains a
good pool response time for any kind of failure, and it looks to me
like it would work well for any kind of storage device.

I''m looking forward to seeing what holes can be knocked in these ideas
with the next set of replies :)

Ross
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2008-Aug-30 15:59 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Sat, 30 Aug 2008, Ross wrote:

> while the problem is diagnosed. - With that said, could the write
> timeout default to on when you have a slog device?  After all, the
> data is safely committed to the slog, and should remain there until
> it''s written to all devices.  Bob, you seemed the most concerned
> about writes, would that be enough redundancy for you to be happy to
> have this on by default?  If not, I''d still be ok having it off by
> default, we could maybe just include it in the evil tuning guide
> suggesting that this could be turned on by anybody who has a
> separate slog device.

It is my impression that the slog device is only used for synchronous
writes.  Depending on the system, this could be just a small fraction
of the writes.

In my opinion, ZFS''s primary goal is to avoid data loss, or consumption
of wrong data.  Availability is a lesser goal.

If someone really needs maximum availability then they can go to triple
mirroring or some other maximally redundant scheme.  ZFS should do its
best to continue moving forward as long as some level of redundancy
exists.  There could be an option to allow moving forward with no
redundancy at all.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Ross Smith
2008-Aug-30 19:32 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Triple mirroring you say?  That''d be me then :D

The reason I really want to get ZFS timeouts sorted is that our long
term goal is to mirror that over two servers too, giving us a pool
mirrored across two servers, each of which is actually a zfs iscsi
volume hosted on triply mirrored disks.

Oh, and we''ll have two sets of online off-site backups running raid-z2,
plus a set of off-line backups too.

All in all I''m pretty happy with the integrity of the data, wouldn''t
want to use anything other than ZFS for that now.  I''d just like to get
the availability working a bit better, without having to go back to
buying raid controllers.  We have big plans for that too; once we get
the iSCSI / iSER timeout issue sorted our long term availability goals
are to have the setup I mentioned above hosted out from a pair of
clustered Solaris NFS / CIFS servers.

Failover time on the cluster is currently in the order of 5-10 seconds,
if I can get the detection of a bad iSCSI link down under 2 seconds
we''ll essentially have a worst case scenario of < 15 seconds downtime.
Downtime that low means it''s effectively transparent for our users as
all of our applications can cope with that seamlessly, and I''d really
love to be able to do that this calendar year.

Anyway, getting back on topic, it''s a good point about moving forward
while redundancy exists.  I think the flag for specifying the write
behavior should have that as the default, with the optional setting
being to allow the pool to continue accepting writes while the pool is
in a non redundant state.

Ross


> Date: Sat, 30 Aug 2008 10:59:19 -0500
> From: bfriesen at simple.dallas.tx.us
> To: myxiplx at hotmail.com
> CC: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>
> On Sat, 30 Aug 2008, Ross wrote:
> > while the problem is diagnosed. - With that said, could the write
> > timeout default to on when you have a slog device? After all, the
> > data is safely committed to the slog, and should remain there until
> > it''s written to all devices. Bob, you seemed the most concerned
> > about writes, would that be enough redundancy for you to be happy to
> > have this on by default? If not, I''d still be ok having it off by
> > default, we could maybe just include it in the evil tuning guide
> > suggesting that this could be turned on by anybody who has a
> > separate slog device.
>
> It is my impression that the slog device is only used for synchronous
> writes. Depending on the system, this could be just a small fraction
> of the writes.
>
> In my opinion, ZFS''s primary goal is to avoid data loss, or
> consumption of wrong data. Availability is a lesser goal.
>
> If someone really needs maximum availability then they can go to
> triple mirroring or some other maximally redundant scheme. ZFS should
> do its best to continue moving forward as long as some level of
> redundancy exists. There could be an option to allow moving forward
> with no redundancy at all.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Johan Hartzenberg
2008-Aug-31 11:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins <ian at ianshome.com> wrote:

> Miles Nordin writes:
>
> > suggested that unlike the SVM feature it should be automatic, because
> > by so being it becomes useful as an availability tool rather than just
> > performance optimisation.
>
> So on a server with a read workload, how would you know if the remote
> volume was working?

Even reads induce writes (last access time, if nothing else)

My question: If a pool becomes non-redundant (eg due to a timeout,
hotplug removal, bad data returned from device, or for whatever
reason), do we want the affected pool/vdev/system to hang?  Generally
speaking I would say that this is what currently happens with other
solutions.

Conversely: Can the current situation be improved by allowing a device
to be taken out of the pool for writes - eg be placed in read-only
mode?  I would assume it is possible to modify the CoW system /
functions which allocate blocks for writes to ignore certain devices,
at least temporarily.

This would also lay a groundwork for allowing devices to be removed
from a pool - eg:
Step 1: Make the device read-only.
Step 2: touch every allocated block on that device (causing it to be
        copied to some other disk).
Step 3: remove it from the pool for reads as well and finally remove it
        from the pool permanently.

  _hartz
Richard Elling
2008-Aug-31 19:09 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote:
> Triple mirroring you say?  That''d be me then :D
>
> The reason I really want to get ZFS timeouts sorted is that our long
> term goal is to mirror that over two servers too, giving us a pool
> mirrored across two servers, each of which is actually a zfs iscsi
> volume hosted on triply mirrored disks.
>
> Oh, and we''ll have two sets of online off-site backups running
> raid-z2, plus a set of off-line backups too.
>
> All in all I''m pretty happy with the integrity of the data, wouldn''t
> want to use anything other than ZFS for that now.  I''d just like to
> get the availability working a bit better, without having to go back
> to buying raid controllers.  We have big plans for that too; once we
> get the iSCSI / iSER timeout issue sorted our long term availability
> goals are to have the setup I mentioned above hosted out from a pair
> of clustered Solaris NFS / CIFS servers.
>
> Failover time on the cluster is currently in the order of 5-10
> seconds, if I can get the detection of a bad iSCSI link down under 2
> seconds we''ll essentially have a worst case scenario of < 15 seconds
> downtime.

I don''t think this is possible for a stable system.  2 second failure
detection for IP networks is troublesome for a wide variety of reasons.
Even with Solaris Clusters, we can show consistent failover times for
NFS services on the order of a minute (2-3 client retry intervals,
including backoff).  But getting to consistent sub-minute failover for
a service like NFS might be a bridge too far, given the current
technology and the amount of customization required to "make it
work"^TM.

> Downtime that low means it''s effectively transparent for our users as
> all of our applications can cope with that seamlessly, and I''d really
> love to be able to do that this calendar year.

I think most people (traders are a notable exception) and applications
can deal with larger recovery times, as long as human-intervention is
not required.
 -- richard
Ross Smith
2008-Sep-02 11:37 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.  (A toy sketch of the latency-tracking idea
quoted below follows after this message.)

> From: ian at ianshome.com
> To: eric.schrock at sun.com
> CC: myxiplx at hotmail.com; zfs-discuss at opensolaris.org
> Subject: Re: Availability: ZFS needs to handle disk removal / driver failure better
> Date: Fri, 29 Aug 2008 09:15:41 +1200
>
> Eric Schrock writes:
>
> > A better option would be to not use this to perform FMA diagnosis, but
> > instead work into the mirror child selection code.  This has already
> > been alluded to before, but it would be cool to keep track of latency
> > over time, and use this to both a) prefer one drive over another when
> > selecting the child and b) proactively timeout/ignore results from one
> > child and select the other if it''s taking longer than some historical
> > standard deviation.  This keeps away from diagnosing drives as faulty,
> > but does allow ZFS to make better choices and maintain response times.
> > It shouldn''t be hard to keep track of the average and/or standard
> > deviation and use it for selection; proactively timing out the slow I/Os
> > is much trickier.
>
> This would be a good solution to the remote iSCSI mirror configuration.
> I''ve been working though this situation with a client (we have been
> comparing ZFS with Cleversafe) and we''d love to be able to get the read
> performance of the local drives from such a pool.
>
> > As others have mentioned, things get more difficult with writes.  If I
> > issue a write to both halves of a mirror, should I return when the first
> > one completes, or when both complete?  One possibility is to expose this
> > as a tunable, but any such "best effort RAS" is a little dicey because
> > you have very little visibility into the state of the pool in this
> > scenario - "is my data protected?" becomes a very difficult question to
> > answer.
>
> One solution (again, to be used with a remote mirror) is the three way
> mirror.  If two devices are local and one remote, data is safe once the two
> local writes return.  I guess the issue then changes from "is my data safe"
> to "how safe is my data".  I would be reluctant to deploy a remote mirror
> device without local redundancy, so this probably won''t be an uncommon
> setup.  There would have to be an acceptable window of risk when local data
> isn''t replicated.
>
> Ian
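[Editor''s sketch of the mirror-child-selection idea Eric describes in
the quote above: keep a running latency average per child and prefer
the fastest one for reads.  The class names and the smoothing constant
are invented; this is not the actual ZFS mirror code.]

-----8<-----
# Toy latency-weighted mirror child selection.
class Child:
    def __init__(self, name, alpha=0.2):
        self.name, self.alpha = name, alpha
        self.avg_latency = None        # exponentially weighted moving average

    def record(self, latency):
        if self.avg_latency is None:
            self.avg_latency = latency
        else:
            self.avg_latency += self.alpha * (latency - self.avg_latency)

def pick_read_child(children):
    """Prefer the child with the lowest observed average latency;
    give any child with no history yet a chance first."""
    unknown = [c for c in children if c.avg_latency is None]
    if unknown:
        return unknown[0]
    return min(children, key=lambda c: c.avg_latency)

if __name__ == "__main__":
    local_a, local_b, remote = Child("local-a"), Child("local-b"), Child("remote-iscsi")
    for lat in (0.008, 0.011, 0.009): local_a.record(lat)
    for lat in (0.010, 0.009, 0.012): local_b.record(lat)
    for lat in (0.220, 0.250, 0.300): remote.record(lat)
    print(pick_read_child([local_a, local_b, remote]).name)  # a local disk wins
-----8<-----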
Ross
2008-Sep-06 06:48 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks,

Well, there haven''t been any more comments knocking holes in this idea,
so I''m wondering now if I should log this as an RFE?

Is this something others would find useful?

Ross
-- 
This message posted from opensolaris.org
Richard Elling
2008-Sep-06 15:14 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote:
> Hey folks,
>
> Well, there haven''t been any more comments knocking holes in this idea,
> so I''m wondering now if I should log this as an RFE?
>

go for it!

> Is this something others would find useful?
>

Yes.  But remember that this has a very limited scope.  Basically it
will apply to mirrors, not raidz.  Some people may find that to be
uninteresting.  Implementing something simple, like a preferred side,
would be an easy first step (ala VxVM''s preferred plex).
 -- richard
Ross
2008-Nov-27 12:33 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hmm... I logged this CR ages ago, but now I''ve come to find it in the
bug tracker I can''t see it anywhere.

I actually logged three CR''s back to back, the first appears to have
been created ok, but two have just disappeared.  The one I created ok
is: http://bugs.opensolaris.org/view_bug.do?bug_id=6766364

There should be two other CR''s created within a few minutes of that,
one for disabling caching on CIFS shares, and one regarding this ZFS
availability discussion.  Could somebody at Sun let me know what''s
happened to these please.
-- 
This message posted from opensolaris.org
James C. McPherson
2008-Nov-27 12:41 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 27 Nov 2008 04:33:54 -0800 (PST)
Ross <myxiplx at googlemail.com> wrote:

> Hmm... I logged this CR ages ago, but now I''ve come to find it in
> the bug tracker I can''t see it anywhere.
>
> I actually logged three CR''s back to back, the first appears to have
> been created ok, but two have just disappeared.  The one I created ok
> is: http://bugs.opensolaris.org/view_bug.do?bug_id=6766364
>
> There should be two other CR''s created within a few minutes of that,
> one for disabling caching on CIFS shares, and one regarding this ZFS
> availability discussion.  Could somebody at Sun let me know what''s
> happened to these please.

Hi Ross,
I can''t find the ZFS one you mention.  The CIFS one is
http://bugs.opensolaris.org/view_bug.do?bug_id=6766126.  It''s been
marked as ''incomplete'' so you should contact the R.E. - Alan M. Wright
(at sun dot com, etc) to find out what further info is required.

hth,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
Ross
2008-Nov-27 13:07 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thanks James, I''ve e-mailed Alan and submitted this one again.
-- 
This message posted from opensolaris.org
Bernard Dugas
2008-Nov-27 15:11 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hello,

Thank you for this very interesting thread !

I want to confirm that Synchronous Distributed Storage is a main goal
when using ZFS !  The target architecture is 1 local drive, and 2 (or
more) remote iSCSI targets, with ZFS being the iSCSI initiator.

The system is sized so that the local disk can handle all needed
performance with a good margin, as can each of the iSCSI targets
through large enough Ethernet fibers.  I need any network problem not
to slow down reads from the local disk, and writes to be stopped only
if no remote target is available after a time-out.

I also made a comment on that subject in :
http://blogs.sun.com/roch/entry/using_zfs_as_a_network

To myxiplx : we call a "Sleeping Failure" a failure of 1 part that is
hidden by redundancy but not detected by monitoring.  These are the
most dangerous...

Would anybody be interested in supporting an opensource "projectseed"
called MiSCSI ?  This is for Multicast iSCSI, so that only 1 write from
the initiator is propagated by the network to all subscribed targets,
with dynamic subscribing and "resilvering" being delegated to the
remote targets.  I would even prefer this behaviour to already exist in
ZFS :-)

Please send me any comment if interested, i may send a draft RFP...

Best regards !
-- 
This message posted from opensolaris.org
Ross
2008-Nov-27 15:29 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Well, you''re not alone in wanting to use ZFS and iSCSI like that, and
in fact my change request suggested that this is exactly one of the
things that could be addressed:

"The idea is really a two stage RFE, since just the first part would
have benefits.  The key is to improve ZFS availability, without
affecting its flexibility, bringing it on par with traditional raid
controllers.

A. Track response times, allowing for lop sided mirrors, and better
failure detection.  Many people have requested this since it would
facilitate remote live mirrors.

B. Use response times to timeout devices, dropping them to an interim
failure mode while waiting for the official result from the driver.
This would prevent redundant pools hanging when waiting for a single
device."

Unfortunately if your links tend to drop, you really need both parts.
However, if this does get added to ZFS, all you would then need is
standard monitoring on the ZFS pool.  That would notify you when any
device fails and the pool goes to a degraded state, making it easy to
spot when either the remote mirrors or local storage are having
problems.  I''d have thought it would make monitoring much simpler.

And if this were possible, I would hope that you could configure iSCSI
devices to automatically reconnect and resilver too, so the system
would be self repairing once faults are corrected, but I haven''t gone
so far as to test that yet.
-- 
This message posted from opensolaris.org
Bernard Dugas
2008-Nov-27 16:45 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> Well, you''re not alone in wanting to use ZFS and
> iSCSI like that, and in fact my change request
> suggested that this is exactly one of the things that
> could be addressed:

Thank you !  Yes, this was also to tell you that you are not alone :-)

I agree completely with you on your technical points !
-- 
This message posted from opensolaris.org
Richard Elling
2008-Nov-28 05:05 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote:
> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and
> in fact my change request suggested that this is exactly one of the
> things that could be addressed:
>
> "The idea is really a two stage RFE, since just the first part would
> have benefits.  The key is to improve ZFS availability, without
> affecting its flexibility, bringing it on par with traditional raid
> controllers.
>
> A. Track response times, allowing for lop sided mirrors, and better
> failure detection.

I''ve never seen a study which shows, categorically, that disk or
network failures are preceded by significant latency changes.  How do
we get "better failure detection" from such measurements?

> Many people have requested this since it would facilitate remote live
> mirrors.
>

At a minimum, something like VxVM''s preferred plex should be reasonably
easy to implement.

> B. Use response times to timeout devices, dropping them to an interim
> failure mode while waiting for the official result from the driver.
> This would prevent redundant pools hanging when waiting for a single
> device."
>

I don''t see how this could work except for mirrored pools.  Would that
carry enough market to be worthwhile?
 -- richard

> Unfortunately if your links tend to drop, you really need both parts.
> However, if this does get added to ZFS, all you would then need is
> standard monitoring on the ZFS pool.  That would notify you when any
> device fails and the pool goes to a degraded state, making it easy to
> spot when either the remote mirrors or local storage are having
> problems.  I''d have thought it would make monitoring much simpler.
>
> And if this were possible, I would hope that you could configure iSCSI
> devices to automatically reconnect and resilver too, so the system
> would be self repairing once faults are corrected, but I haven''t gone
> so far as to test that yet.
>
Ross Smith
2008-Nov-28 07:03 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <Richard.Elling at sun.com> wrote:
> Ross wrote:
>>
>> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and in
>> fact my change request suggested that this is exactly one of the things
>> that could be addressed:
>>
>> "The idea is really a two stage RFE, since just the first part would have
>> benefits.  The key is to improve ZFS availability, without affecting its
>> flexibility, bringing it on par with traditional raid controllers.
>>
>> A. Track response times, allowing for lop sided mirrors, and better
>> failure detection.
>
> I''ve never seen a study which shows, categorically, that disk or network
> failures are preceded by significant latency changes.  How do we get
> "better failure detection" from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there''s
going to be a sudden, and very large change in latency.  Sure, FMA will
catch most cases, but we''ve already shown that there are some cases
where it doesn''t work too well (and I would argue that''s always going
to be possible when you are relying on so many different types of
driver).  This is there to ensure that ZFS can handle *all* cases.

>> Many people have requested this since it would facilitate remote live
>> mirrors.
>>
>
> At a minimum, something like VxVM''s preferred plex should be reasonably
> easy to implement.
>
>> B. Use response times to timeout devices, dropping them to an interim
>> failure mode while waiting for the official result from the driver.  This
>> would prevent redundant pools hanging when waiting for a single device."
>>
>
> I don''t see how this could work except for mirrored pools.  Would that
> carry enough market to be worthwhile?
> -- richard

I have to admit, I''ve not tested this with a raided pool, but since all
ZFS commands hung when my iSCSI device went offline, I assumed that you
would get the same effect of the pool hanging if a raid-z2 pool is
waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn''t want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to
be sure.
Richard Elling
2008-Nov-28 16:12 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote:
> On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <Richard.Elling at sun.com> wrote:
>
>> Ross wrote:
>>
>>> Well, you''re not alone in wanting to use ZFS and iSCSI like that, and in
>>> fact my change request suggested that this is exactly one of the things
>>> that could be addressed:
>>>
>>> "The idea is really a two stage RFE, since just the first part would have
>>> benefits.  The key is to improve ZFS availability, without affecting its
>>> flexibility, bringing it on par with traditional raid controllers.
>>>
>>> A. Track response times, allowing for lop sided mirrors, and better
>>> failure detection.
>>>
>> I''ve never seen a study which shows, categorically, that disk or network
>> failures are preceded by significant latency changes.  How do we get
>> "better failure detection" from such measurements?
>>
>
> Not preceded by as such, but a disk or network failure will certainly
> cause significant latency changes.  If the hardware is down, there''s
> going to be a sudden, and very large change in latency.  Sure, FMA
> will catch most cases, but we''ve already shown that there are some
> cases where it doesn''t work too well (and I would argue that''s always
> going to be possible when you are relying on so many different types
> of driver).  This is there to ensure that ZFS can handle *all* cases.
>

I think that there is some confusion about FMA.  The value of FMA is
diagnosis.  If there was no FMA, then driver timeouts would still
exist.  Where FMA is useful is diagnosing the problem such that we know
that the fault is in the SAN and not the RAID array, for example.  From
the device driver level, all sd knows is that an I/O request to a
device timed out.  Similarly, all ZFS could know is what sd tells it.

>>> Many people have requested this since it would facilitate remote live
>>> mirrors.
>>>
>> At a minimum, something like VxVM''s preferred plex should be reasonably
>> easy to implement.
>>
>>> B. Use response times to timeout devices, dropping them to an interim
>>> failure mode while waiting for the official result from the driver.  This
>>> would prevent redundant pools hanging when waiting for a single device."
>>>
>> I don''t see how this could work except for mirrored pools.  Would that
>> carry enough market to be worthwhile?
>> -- richard
>>
>
> I have to admit, I''ve not tested this with a raided pool, but since
> all ZFS commands hung when my iSCSI device went offline, I assumed
> that you would get the same effect of the pool hanging if a raid-z2
> pool is waiting for a response from a device.  Mirrored pools do work
> particularly well with this since it gives you the potential to have
> remote mirrors of your data, but if you had a raid-z2 pool, you still
> wouldn''t want that hanging if a single device failed.
>

zpool commands hanging is CR6667208, and has been fixed in b100.
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

> I will go and test the raid scenario though on a current build, just to be sure.
>

Please.
 -- richard
Ross Smith
2008-Dec-02 11:31 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks, I''ve just followed up on this, testing iSCSI with a raided
pool, and it still appears to be struggling when a device goes offline.

>>> I don''t see how this could work except for mirrored pools.  Would that
>>> carry enough market to be worthwhile?
>>> -- richard
>>
>> I have to admit, I''ve not tested this with a raided pool, but since
>> all ZFS commands hung when my iSCSI device went offline, I assumed
>> that you would get the same effect of the pool hanging if a raid-z2
>> pool is waiting for a response from a device.  Mirrored pools do work
>> particularly well with this since it gives you the potential to have
>> remote mirrors of your data, but if you had a raid-z2 pool, you still
>> wouldn''t want that hanging if a single device failed.
>
> zpool commands hanging is CR6667208, and has been fixed in b100.
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667208
>
>> I will go and test the raid scenario though on a current build, just to be
>> sure.
>
> Please.
> -- richard

I''ve just created a pool using three snv_103 iscsi Targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.

To test the server, while transferring files from a windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shutdown, but once that happened the windows copy halted
with the error: "The specified network name is no longer available."

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all three
devices are online.  A minute later, I can open the share again, and
start another copy.  Thirty seconds after that, zpool status finally
reports that the iscsi device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there''s a bug in the initiator, but it''s
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981

What was great was that as soon as the iSCSI initiator reconnected, ZFS
started resilvering.  What might not be so great is the fact that all
three devices are showing that they''ve been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using ''zpool clear'' or replace the device with
        ''zpool replace''.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  179K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       5 9.88K     0  311M resilvered
            c2t600144F04934119E000050569675FF00d0  ONLINE       0     0     0  179K resilvered

errors: No known data errors

It''s proving a little hard to know exactly what''s happening when, since
I''ve only got a few seconds to log times, and there are delays with
each step.
However, I ran another test using robocopy and was able to observe the
behaviour a little more closely:

Test 2: Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - "The specified network name is no longer available"
         - zpool status shows all three drives as online
         - zpool iostat appears to have hung, taking much longer than
           the 30s specified to return a result
         - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
           much simultaneously
         - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this, but
I haven''t learnt that yet.  My guess as to what''s happening would be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
  essentially hung
- CIFS times out (I suspect this is on the client side with around a
  30s timeout, but I can''t find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn''t appear
  to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
  with the remaining two drives, CIFS carries on working, iostat
  carries on working.  "zpool status" however is still out of date.
- zpool status eventually catches up, and reports that the drive has
  gone offline.

So, if my guesses are right, I see several problems here:

1. ZFS could still benefit from the timeout I''ve suggested to keep the
   pool active.  I''ve now shown this benefits raidz pools as well as
   mirrors, and with problems other people have reported, we''ve shown
   that at least two drivers have problems that this would mitigate.
2. I would guess that the timeout needs to be under 30 seconds to
   prevent problems with CIFS clients, I need to find some
   documentation on this, and find some way to prove it''s a client
   timeout and not a problem with the CIFS server.  (A sketch of
   picking such a timeout follows after this message.)
3. zpool iostat is still blocked by a hung device (there may be an
   existing bug for this, it rings a bell).
4. zpool status still reports out of date information.
5. When iSCSI targets finally do come back online, ZFS is resilvering
   all of them (again, this rings a bell, Miles might have reported
   something similar).

And while I don''t know the code at all, I really can''t understand how
ZFS can be serving files out from a pool, but zpool status doesn''t know
what''s going on.  ZFS physically can''t work unless it knows which
drives it is and isn''t writing to.  Why can''t you just use this
knowledge for zpool status?

Ross
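[Editor''s sketch for point 2: one way a per-device timeout could be
derived automatically from tracked response times while staying below
the assumed (and undocumented) ~30s CIFS client timeout.  The formula
and every constant here are invented purely for illustration.]

-----8<-----
# Pick a per-device timeout from recent latency history, capped below
# an assumed client-side CIFS timeout of ~30 seconds.
import statistics

CEILING = 25.0        # stay under the assumed ~30s CIFS client timeout
FLOOR = 2.0           # never time a device out faster than this

def device_timeout(recent_latencies, k=10.0):
    """recent_latencies: seconds for recently completed I/Os on one device."""
    if not recent_latencies:
        return CEILING                       # no history yet: be patient
    mean = statistics.fmean(recent_latencies)
    stdev = statistics.pstdev(recent_latencies)
    candidate = k * (mean + 3.0 * stdev)     # well clear of normal jitter
    return min(max(candidate, FLOOR), CEILING)

if __name__ == "__main__":
    local_disk = [0.005, 0.008, 0.006, 0.012, 0.007]
    busy_iscsi = [0.050, 0.300, 0.120, 0.800, 0.250]
    print(device_timeout(local_disk))   # the 2s floor for a quick local disk
    print(device_timeout(busy_iscsi))   # around 10s for a jittery iSCSI link
-----8<-----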
Ross
2008-Dec-02 12:12 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Incidentally, while I''ve reported this again as an RFE, I still haven''t
seen a CR number for this.  Could somebody from Sun check if it''s been
filed please.

thanks,

Ross
-- 
This message posted from opensolaris.org
Ross Smith
2008-Dec-02 16:43 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Richard,

Thanks, I''ll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though, I don''t think it likes it if
the iscsi targets aren''t available during boot.  Again, that rings a
bell, so I''ll go see if that''s another known bug.

Changing that setting on the fly didn''t seem to help, if anything
things are worse this time around.  I changed the timeout to 15
seconds, but didn''t restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:            0xb4            =       0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:            15

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:

# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so.  After that, the CIFS copy
timed out.  After the CIFS copy timed out, I tried immediately
restarting it.  It took a few more seconds, but restarted no problem.
Within a few seconds of that restarting, iostat recovered, and format
returned its result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using ''zpool clear'' or replace the device with
        ''zpool replace''.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E000050569675FF00d0  ONLINE       0   200     0  24K resilvered

errors: No known data errors

real    3m51.774s
user    0m0.015s
sys     0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
        exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using ''zpool online''.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsipool                                  DEGRADED     0     0     0
          raidz1                                   DEGRADED     0     0     0
            c2t600144F04933FF6C00005056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB300005056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E000050569675FF00d0  UNAVAIL      3 5.80K     0  cannot open

errors: No known data errors

real    0m0.272s
user    0m0.029s
sys     0m0.169s

On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling <Richard.Elling at sun.com> wrote:
......
> iSCSI timeout is set to 180 seconds in the client code.  The only way
> to change is to recompile it, or use mdb.  Since you have this test rig
> setup, and I don''t, do you want to experiment with this timeout?
> The variable is actually called "iscsi_rx_max_window" so if you do
>     echo iscsi_rx_max_window/D | mdb -k
> you should see "180"
> Change it using something like:
>     echo iscsi_rx_max_window/W0t30 | mdb -kw
> to set it to 30 seconds.
> -- richard
Miles Nordin
2008-Dec-02 18:35 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "rs" == Ross Smith <myxiplx at googlemail.com> writes:rs> 4. zpool status still reports out of date information. I know people are going to skim this message and not hear this. They''ll say ``well of course zpool status says ONLINE while the pool is hung. ZFS is patiently waiting. It doesn''t know anything is broken yet.'''' but you are NOT saying it''s out of date because it doesn''t say OFFLINE the instant you power down an iSCSI target. You''re saying: rs> - After 3 minutes, the iSCSI drive goes offline. rs> The pool carries on with the remaining two drives, CIFS rs> carries on working, iostat carries on working. "zpool status" rs> however is still out of date. rs> - zpool status eventually rs> catches up, and reports that the drive has gone offline. so, there is a ~30sec window when it''s out of date. When you say ``goes offline'''' in the first bullet, you''re saying ``ZFS must have marked it offline internally, because the pool unfroze.'''' but you found that even after it ``goes offline'''' ''zpool status'' still reports it ONLINE. The question is, what the hell is ''zpool status'' reporting? not the status, apparently. It''s supposed to be a diagnosis tool. Why should you have to second-guess it and infer the position of ZFS''s various internal state machines through careful indirect observation, ``oops, CIFS just came back,'''' or ``oh sometihng must have changed because zpool iostat isn''t hanging any more''''? Why not have a tool that TELLS you plainly what''s going on? ''zpool status'' isn''t. Is it trying to oversimplify things, to condescend to the sysadmin or hide ZFS''s rough edges? Are there more states for devices that are being compressed down to ONLINE OFFLINE DEGRADED FAULTED? Is there some tool in zdb or mdb that is like ''zpool status -simonsez''? I already know sometimes it''ll report everything as ONLINE but refuse ''zpool offline ... <device>'' with ''no valid replicas'', so I think, yes there are ``secret states'''' for devices? Or is it trying to do too many things with one output format? rs> 5. When iSCSI targets finally do come back online, ZFS is rs> resilvering all of them (again, this rings a bell, Miles might rs> have reported something similar). my zpool status is so old it doesn''t say ``xxkB resilvered'''' so I''ve no indication which devices are the source vs. target of the resilver. What I found was, the auto-resilver isn''t sufficient. If you wait for it to complete, then ''zpool scrub'', you''ll get thousands of CKSUM errors on the dirty device, so the resilver isn''t covering all the dirtyness. Also ZFS seems to forget about the need to resilver if you shut down the machine, bring back the missing target, and boot---it marks everything ONLINE and then resilvers as you hit the dirty data, counting CKSUM errors. This has likely been fixed between b71 and b101. It''s easy to test: (a) shut down one iSCSI target, (b) write to the pool, (c) bring the iSCSI target back, (d) wait for auto-resilver to finish, (e) ''zpool scrub'', (f) look for CKSUM errors. I suspect you''re more worried about your own problems though---I''ll try to retest it soon. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081202/7a60ae97/attachment.bin>
Ross
2008-Dec-02 19:55 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Miles,

It''s probably a bad sign that although that post came through as
anonymous in my e-mail, I recognised your style before I got half way
through your post :)

I agree, the zpool status being out of date is weird, I''ll dig out the
bug number for that at some point as I''m sure I''ve mentioned it before.
It looks to me like there are two separate pieces of code that work out
the status of the pool.  There''s the stuff ZFS uses internally to run
the pool, and then there''s a completely separate piece that does the
reporting to the end user.

I agree that it could be a case of oversimplifying things.  There''s no
denying the ease of admin is one of ZFS'' strengths, but I think the
whole zpool status thing needs looking at again.  Neither the way the
command freezes, nor the out of date information make any sense to me.

And yes, I''m aware of the problems you''ve reported with resilvering.
That''s on my list of things to test with this.  I''ve already done a
quick test of running a scrub after the resilver (which appeared ok at
first glance), and tomorrow I''ll be testing the reboot status too.
-- 
This message posted from opensolaris.org
Miles Nordin
2008-Dec-02 20:35 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "r" == Ross <myxiplx at googlemail.com> writes:r> style before I got half way through your post :) [...status r> problems...] could be a case of oversimplifying things. yeah I was a bit inappropriate, but my frustration comes from the (partly paranoid) imagining of how the idea ``we need to make it simple'''' might have spooled out through a series of design meetings to a culturally-insidious mind-blowing condescention toward the sysadmin. ``simple'''', to me, means that a ''status'' tool does not read things off disks, and does not gather a bunch of scraps to fabricate a pretty (``simple''''?) fantasy-world at invocation which is torn down again when it exits. The Linux status tools are pretty-printing wrappers around ''cat /proc/$THING/status''. That, is SIMPLE! And, screaming monkeys though they often are, the college kids writing Linux are generally disciplined enough not to grab a bunch of locks and then go to sleep for minutes when delivering things from /proc. I love that. The other, broken, idea of ``simple'''' is what I come to Unix to avoid. And yes, this is a religious argument. Just because it spans decades of experience and includes ideas of style doesn''t mean it should be dismissed as hocus-pocus. And I don''t like all these binary config files either. Not even Mac OS X is pulling that baloney any more. r> There''s no denying the ease of admin is one of ZFS'' strengths, I deny it! It is not simple to start up ''format'' and ''zpool iostat'' and RoboCopy on another machine because you cannot trust the output of the status command. And getting visibility into something by starting a bunch of commands in different windows and watching when which one unfreezes is hilarious, not simple. r> the problems you''ve reported with resilvering. I think we were watching this bug: http://bugs.opensolaris.org/view_bug.do?bug_id=6675685 so that ought to be fixed in your test system but not in s10u6. but it might not be completely fixed yet: http://bugs.opensolaris.org/view_bug.do?bug_id=6747698 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081202/2a365522/attachment.bin>
Toby Thain
2008-Dec-02 23:06 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On 2-Dec-08, at 3:35 PM, Miles Nordin wrote:

>>>>>> "r" == Ross <myxiplx at googlemail.com> writes:
>
>     r> style before I got half way through your post :) [...status
>     r> problems...] could be a case of oversimplifying things.
> ...
> And yes, this is a religious argument.  Just because it spans decades
> of experience and includes ideas of style doesn''t mean it should be
> dismissed as hocus-pocus.  And I don''t like all these binary config
> files either.  Not even Mac OS X is pulling that baloney any more.

OS X never used binary config files; it standardised on XML property
lists for the new subsystems (plus a lot of good old fashioned UNIX
config).  Perhaps you are thinking of Mac OS 9 and earlier (resource
forks).

--Toby
Ross
2008-Dec-03 15:20 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ok, I''ve done some more testing today and I almost don''t know where to
start.  I''ll begin with the good news for Miles :)

- Rebooting doesn''t appear to cause ZFS to lose the resilver status
  (but see 1. below)
- Resilvering appears to work fine, once complete I never saw any
  checksum errors when scrubbing the pool.
- Reconnecting iscsi drives causes zfs to automatically online the pool
  and automatically begin resilvering.

And now the bad news:

1. While rebooting doesn''t seem to cause the resilver to lose its
   status, something''s causing it problems.  I saw it restart several
   times.
2. With iscsi, you can''t reboot with sendtargets enabled, static
   discovery still seems to be the order of the day.
3. There appears to be a disconnect between what iscsiadm knows and
   what ZFS knows about the status of the devices (a small monitoring
   sketch follows after this message).

And I have confirmation of some of my earlier findings too:

4. iSCSI still has a 3 minute timeout, during which time your pool will
   hang, no matter how many redundant drives you have available.
5. zpool status can still hang when a device goes offline, and when it
   finally recovers, it will then report out of date information.  This
   could be Bug 6667199, but I''ve not seen anybody reporting the
   incorrect information part of this.
6. After one drive goes offline, during the resilver process, zpool
   status shows that information is being resilvered on the good
   drives.  Does anybody know why this happens?
7. Although ZFS will automatically online a pool when iscsi devices
   come online, CIFS shares are not automatically remounted.

I also have a few extra notes about a couple of those:

1 - resilver losing status
==============
Regarding the resilver restarting, I''ve seen it reported that "zpool
status" can cause this when run as admin, but I''m not convinced that''s
the cause.  Same for the rebooting problem.  I was able to run "zpool
status" dozens of times as an admin, but only two or three times did I
see the resilver restart.  Also, after rebooting, I could see that the
resilver was showing that it was 66% complete, but then a second later
it restarted.  Now, none of this is conclusive.  I really need to test
with a much larger dataset to get an idea of what''s really going on,
but there''s definitely something weird happening here.

3 - disconnect between iscsiadm and ZFS
========================
I repeated my test of offlining an iscsi target, this time checking
iscsiadm to see when it disconnected.  What I did was wait until
iscsiadm reported 0 connections to the target, and then started a CIFS
file copy and ran "zpool status".  Zpool status hung as expected, and a
minute or so later, the CIFS copy failed.  It seems that although
iscsiadm was aware that the target was offline, ZFS did not yet know
about it.  As expected, a minute or so later, zpool status completed
(returning incorrect results), and I could then run the CIFS copy fine.

5 - zpool status hanging and reporting incorrect information
==================================
When an iSCSI device goes offline, if you immediately run zpool status,
it hangs for 3-4 minutes.  Also, when it finally completes, it gives
incorrect information, reporting all the devices as online.  If you
immediately re-run zpool status, it completes rapidly and will now
correctly show the offline devices.
-- 
This message posted from opensolaris.org
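[Editor''s sketch for point 3: a crude user-land watchdog that polls
both views and logs whenever the initiator has lost all connections to
a target while zpool still claims the pool is healthy.  The parsing of
the command output is naive and assumed from memory of the Solaris
initiator (`Connections:'' lines from `iscsiadm list target'', an ``is
healthy'''' line from `zpool status -x''); the poll interval and pool name
are arbitrary.  Treat it as illustration, not a supported tool.]

-----8<-----
# Poll iscsiadm and zpool status and flag any disagreement between them.
import subprocess, time

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def iscsi_connections_ok():
    # "iscsiadm list target" prints a "Connections: N" line per target
    out = run(["iscsiadm", "list", "target"])
    counts = [int(line.split(":")[1]) for line in out.splitlines()
              if "Connections:" in line]
    return bool(counts) and all(c > 0 for c in counts)

def pool_claims_healthy(pool):
    # "zpool status -x <pool>" reports "... is healthy" when all is well
    return "is healthy" in run(["zpool", "status", "-x", pool])

if __name__ == "__main__":
    POOL = "iscsipool"          # the pool name used in the tests above
    while True:
        if not iscsi_connections_ok() and pool_claims_healthy(POOL):
            print(time.strftime("%H:%M:%S"),
                  "initiator has lost a target but zpool status disagrees")
        time.sleep(30)
-----8<-----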
Miles Nordin
2008-Dec-03 22:20 UTC
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "r" == Ross <myxiplx at googlemail.com> writes:rs> I don''t think it likes it if the iscsi targets aren''t rs> available during boot. from my cheatsheet: -----8<----- ok boot -m milestone=none [boots. enter root password for maintenance.] bash-3.00# /sbin/mount -o remount,rw / [<-- otherwise iscsiadm won''t update /etc/iscsi/*] bash-3.00# /sbin/mount /usr bash-3.00# /sbin/mount /var bash-3.00# /sbin/mount /tmp bash-3.00# iscsiadm remove discovery-address 10.100.100.135 bash-3.00# iscsiadm remove discovery-address 10.100.100.138 bash-3.00# iscsiadm remove discovery-address 10.100.100.138 iscsiadm: unexpected OS error iscsiadm: Unable to complete operation [<-- good. it''s gone.] bash-3.00# sync bash-3.00# lockfs -fa bash-3.00# reboot -----8<----- rs> # time zpool status [...] rs> real 3m51.774s so, this hang may happen in fewer situations, but it is not fixed. r> 6. After one drive goes offline, during the resilver process, r> zpool status shows that information is being resilvered on the r> good drives. Does anybody know why this happens? I don''t know why. I''ve seen that, too, though. For me it''s always been relatively short, <1min. I wonder if there are three kinds of scrub-like things, not just two (resilvers and scrubs), and ''zpool status'' is ``simplifying'''' for us again? r> 7. Although ZFS will automatically online a pool when iscsi r> devices come online, CIFS shares are not automatically r> remounted. For me, even plain filesystems are not all remounted. ZFS tries to mount them in the wrong order, so it would mount /a/b/c, then try to mount /a/b and complain ``directory not empty''''. I''m not sure why it mounts things in the right order at boot/import, but in haphazard order after one of these auto-onlines. Then NFS exporting didn''t work either. To fix, I have to ''zfs umount /a/b/c'', but then there is a b/c directory inside filesystem /a, so I have to ''rmdir /a/b/c'' by hand because the ''... set mountpoint'' koolaid creates the directories but doesn''t remove them. Then ''zfs mount -a'' and ''zfs share -a''. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081203/ccdba521/attachment.bin>