I watched both the YouTube video http://www.youtube.com/watch?v=CN6iDzesEs0 and the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit".

In the first one it is obvious that the app stops working when they smash the drives; they have to physically detach the drive before the array reconstruction begins. I'm not the only one who noticed it; from the comments on YouTube:

"It appears that ZFS didn't recover after each drive failure until he unplugged the failed drive? Or was it coincidence that he unplugged the drive just as ZFS started recovering?"
Reply:
"Yep. its a bug in solaris. But if you try and tell a sun person that, they get really pissy."

In the second video the focus is on the drive when the guy smashes it; I don't see any reason why they would not let you see the app while he smashed the drive. The focus comes back to the running app right after he detached the hard drive.
--
This message posted from opensolaris.org
On 23-Nov-08, at 12:21 PM, Scara Maccai wrote:

> I watched both the youtube video
> http://www.youtube.com/watch?v=CN6iDzesEs0
> and the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit".
>
> In the first one it is obvious that the app stops working when they
> smash the drives; they have to physically detach the drive before
> the array reconstruction begins.
> I'm not the only one who noticed it; from the comments on YouTube:
>
> "It appears that ZFS didn't recover after each drive failure until
> he unplugged the failed drive? Or was it coincidence that he
> unplugged the drive just as ZFS started recovering?"
> Reply:
> "Yep. its a bug in solaris. But if you try and tell a sun person
> that, they get really pissy."

Why would it be assumed to be a bug in Solaris? Seems more likely on balance to be a problem in the error reporting path or a controller/firmware weakness.

I'm pretty sure the first 2 versions of this demo I saw were executed perfectly - and in a packed auditorium (Moscow? and Russians are the toughest crowd). No smoke, no mirrors.

--T

> In the second video the focus is on the drive when the guy smashes
> it; I don't see any reason why they would not let you see the app
> while he smashed the drive.
> The focus comes back to the running app right after he detached the
> hard drive.
> Why would it be assumed to be a bug in Solaris? Seems more likely on
> balance to be a problem in the error reporting path or a controller/
> firmware weakness.

Weird: they would use a controller/firmware that doesn't work? Bad call...

> I'm pretty sure the first 2 versions of this demo I saw were executed
> perfectly - and in a packed auditorium (Moscow? and Russians are the
> toughest crowd). No smoke, no mirrors.

I still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...
--
This message posted from opensolaris.org
On 24-Nov-08, at 10:40 AM, Scara Maccai wrote:

>> Why would it be assumed to be a bug in Solaris? Seems more likely on
>> balance to be a problem in the error reporting path or a controller/
>> firmware weakness.
>
> Weird: they would use a controller/firmware that doesn't work? Bad
> call...

Seems to me a sledgehammer would produce fairly random failure modes. How would you pre-test?!

--T

>> I'm pretty sure the first 2 versions of this demo I saw were executed
>> perfectly - and in a packed auditorium (Moscow? and Russians are the
>> toughest crowd). No smoke, no mirrors.
>
> I still don't understand why even the one on http://www.opensolaris.com/,
> "ZFS - A Smashing Hit", doesn't show the app running at the moment the
> HD is smashed... weird...
On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote:
> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...

ZFS is primarily about protecting your data: correctness, at the expense of everything else if necessary. It happens to be very fast under most circumstances, but if a disk vanishes as if a sledgehammer hit it, ZFS will wait on the device driver to decide it's dead. Device drivers are generally the same way, choosing correctness over speed. Thus, ZFS can take a while to notice that a disk is gone and do something about it---but in the meantime, it won't make any promises it can't keep.

This is to be regarded as a Good Thing. If a disk fails and ZFS throws away all of my data as a result, I'm not going to be happy; if a disk fails and ZFS takes 30 seconds to notice, I'm still happy with that.

That said, there have been several threads about wanting configurable device timeouts handled at the ZFS level rather than the device driver level. Perhaps this will be implemented at some point... but in the meantime I prefer correctness to availability.

Will
Will Murnane wrote:
> On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote:
>> Still don't understand why even the one on http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn't show the app running at the moment the HD is smashed... weird...

Sorry this is OT, but is it just me, or does it only seem proper to have Gallagher do this? ;)

./C
> if a disk vanishes as if a sledgehammer hit it, ZFS will wait on the
> device driver to decide it's dead.

OK, I see it.

> That said, there have been several threads about wanting configurable
> device timeouts handled at the ZFS level rather than the device driver
> level.

Uh, so I can configure timeouts at the device level? I didn't know that.
--
This message posted from opensolaris.org
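(For what it's worth, the timeouts being discussed here live in the disk drivers rather than in ZFS. A rough sketch of where one such knob sits, assuming the Solaris sd target driver; the name and default vary by driver and release, so treat the value below as purely illustrative:

  # /etc/system (illustrative only; takes effect after a reboot)
  # per-command timeout used by the sd driver, in seconds (default 60)
  set sd:sd_io_time = 10

The retry count the driver applies on top of that timeout is a separate driver variable, not something ZFS controls, which is why a pool can sit through several retry cycles before an error ever reaches it.)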
"C. Bergstr?m" wrote:> Will Murnane wrote: > > On Mon, Nov 24, 2008 at 10:40, Scara Maccai <troiai at yahoo.it> wrote: > > > >> Still don''t understand why even the one on > http://www.opensolaris.com/, "ZFS - A Smashing Hit", doesn''t > show the app running in the moment the HD is smashed... weird... > >> > Sorry this is OT, but is it just me or does is only seem > proper to have > Gallagher do this? ;)Absolutely not. Under no circumstances should you attempt to create a striped ZFS pool on a watermelon, nor on any other type of epigynous berry. If you try, you will certainly rind up with a mess, if not a core dump. And let me tell you, that''s the pits. --Joe
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:tt> Why would it be assumed to be a bug in Solaris? Seems more tt> likely on balance to be a problem in the error reporting path tt> or a controller/ firmware weakness. It''s not really an assumption. It''s been discussed in here a lot, and we know why it''s happening. It''s just a case of ``it''s a feature not a bug'''' combined with ``somebody else''s problem.'''' The error-reporting path you mention is inside Solaris, so I have a little trouble decoding your statement. I wish drives had a failure-aware QoS with a split queue for aggressive-retry cdb''s and deadline cdb''s. This would make the B_FAILFAST primitive the Solaris developers seem to believe in actually mean something. Solaris is supposed to have a B_FAILFAST option for block I/O that ZFS could start using to capture vdev-level knowledge like ``don''t try too hard to read this block from one device, because we can get it faster by asking another device.'''' In the real world B_FAILFAST is IMO quite silly and, not exactly useless but at best deceptive to the higher-layer developer, because even IF the drive could be told to fail faster than 30 seconds by some future fancier sd driver, there would still be some fail-slow cdbs hitting the drive, and the two can''t be parallelized. Sending a fail-slow cdb to a drive freezes the drive for up to 30 seconds * <n>, where <n> is the multiplier of some cargo-cult state machine built into the host adapter driver involving ``bus resets'''' and other such stuff. All the B_FAILFAST cdbs queued behind the fail-slow may as well forget the flag becasue the drive''s busy with the slow cdb. If you have a very few of these retryable cdbs peppered into your transaction stream, which are expected to take 10 - 100ms each but actually take one or two MINUTES each, the drive will be so slow it''d be more expressive to mark it dead. What will probably happen in $REALITY is, the sysadmin will declare his machine ``frozen without a panic message'''' and reboot it, losing any write-cached data which, if not for this idiocy, could have been committed to other drives in a redundant vdev, as well as rebooting the rest of the system unrelated to this stuck zpool. However, it''s inappropriate for a driver to actually report ``drive dead'''' in this scenario, because the drive is NOT dead. The drive-failure-statistic papers posted in here say that drives usually fail with a bunch of contiguous or clumped-together unreadable sectors. You can still get most of the data off them with dd_rescue or ''dd if=baddrive of=gooddrive bs=512 conv=noerror,sync'', if you wait about a week. About four hours of that week is spent copying data and the rest spent aggressively ``retrying''''. An instantanious manual command, ``I insist this drive is failed. Mark it failed instantly, without leaving me stuck in bogus state machines for two minutes or two hours,'''' would be a huge improvement, but I think graceful automatic behavior is not too much to wish for because this isn''t some strange circumstance. This is *the way drives usually fail*. SCSI drives have all kinds of retry-tuning in the ``mode pages'''' in a standardized format. Even 5.25" 10MB SCSI drives had these pages. One of NetApp''s papers said they don''t even let their SCSI/FC drives do their own bad-block reallocation. They do all that in host software. so there are a lot of secret tuning knobs, and they''re AIUI largely standardized across manufacturers and through the years. 
ATA drives, AIUI, don't have the pages, but some WD gamer drives have some goofy DOS RAID-tuner tool. But even what SCSI drives offer isn't enough to provide the architecture ZFS seems to dream of.

What's really needed to provide ZFS developers' expectations of B_FAILFAST is QoS inside the drive firmware. Drives need to have split queues, with an aggressive-retry queue and a deadline-service queue. While retrying a stuck cdb in the aggressive queue, they must keep servicing the deadline queue. I've never heard of anything like this existing in a real drive. I think it's beyond the programming skill of an electrical engineer, and it may be too constraining for them because drives seem to do spastic head-seeks and sometimes partway spin themselves down and back up during a retry cycle.

ZFS still seems to have this taxonomic-arcania view of drives that they are "failing operations" or the drive itself is "failed". It belongs to the driver's realm to decide whether it's the whole drive or just the "operation" which is failing, because that's how the square peg fits snugly into its square hole. One of the NetApp papers mentions they have proprietary statistical heuristics for when to ignore a drive for a little while and use redundant drives instead, and when to fail a drive and call autosupport. And they log drive behavior really explicitly and unambiguously, separate from "controller" failure, which is why they're able to write the paper at all. I'm in favour of heuristics, but most of the ZFS developers seem to think the issue lies with every driver in Solaris being not up to its promised standards. I still think the ZFS approach is wrong and the NetApp approach right.

* I think if SATA is to be supported, then the fantasy that drives can be configured to return failure early should be cast off forever.

* I don't think ZFS will match the availability behavior of NetApp or even of Areca/PERC/RAID-on-a-card until it includes vdev-level handling of slow devices. This means vdev-level timers inside ZFS, above the block driver level, driving error-recovery decisions.

* I think a pool read/write that takes longer than other drives in a redundant vdev, or longer than other cdb's took on the same drive, should be re-dispatched to fetch redundant data. I think this should happen with really tight tolerance and should be stateful, such that a mirror could have a remote iSCSI component and a local component, and only the local component would be used for reads.

* If a drive is taking 30 seconds to perform every cdb, but is still present and the driver refuses to mark it bad, ZFS needs to be able to mark it bad on its own, so that it no longer blocks synchronous writes, and so hot-spare replacement can start to get the pool back up to policy's redundancy expectation. If we're designing systems with multiple controllers to avoid a "single point of failure" then it's not okay to punt and say, well, this isn't our problem because we're waiting patiently on the controller to do something sane. The short-term decisions require vdev-level knowledge which doesn't exist inside the driver, but arguably marking drives failed does not require vdev-level knowledge and could be done in the driver rather than ZFS. I still think this is wrong.
Based on our experience so far with controller drivers, they aren't very good, and controller chips are rather short-lived so they're never going to be very good, and the drivers are often proprietary so the work has to be redone inside ZFS just to have a bit of software freedom again. A practical modern storage design is robust against bugs in the controller driver, bugs exercised by combinations of drive firmware and controllers or by doing strange things with cables.

If this won't go inside ZFS, then people will reasonably want some pseudodevice like an SVM soft partition or a multipath layer to protect them from failing controller drivers. They might want a way to manually, and instantly, without waiting on stupid state machines, mark the device failed, crack ZFS and the controller driver apart so they're not locked in some deadly embrace of failure that requires rebooting. If we agree there is a need for multipath to a single device, why can we not agree that we expect protection from failures of a controller or its driver even when we don't have multipath but have laid out our vdevs with enough redundancy to tolerate controller failure? In practice, I think drives that become really slow instead of failing outright are the real problem, but bringing in multipath and controller redundancy shows what is to my view the taxonomic hypocrisy of wanting to keep this out of ZFS.

* Management commands like export, status, detach, offline, replace must either (a) never block waiting for I/O: use kernel state only, and do disk writes asynchronously, reporting failure through inspection commands that the user polls like 'zpool status'. This world is possible---we don't expect the mirror to be in sync before 'zpool attach' returns, though we could. Or (b) sleep interruptably, and include a more drastic version that doesn't block, so normally you type 'zpool offline' and when the prompt returns without error, you know that all your labels are updated. But if you don't get a prompt back, you can ^C and 'zpool offline -f'. Not being able to get rid of a drive without access to the drive you want to get rid of is as ridiculous as "keyboard not found. Press F1 to continue." Even square-peg square-hole taxonomists ought to agree on this one.

And I don't like getting "no valid replicas" errors in situations that ZFS will tolerate if you force it by rebooting or by hot-unplugging the device---there should be clear delineation of which pool states are acceptable and which are not, and I should be able to explore all the acceptable states by moving the pool through them manually. If I can't 'zpool offline' a device, and _instantly_ if I insist on it, then the pool should not mount at boot without that device. I shouldn't have to involve rebooting in my testing, or else it feels like fisher price wallpapered crap. I sometimes run my dishwasher with the door open for a half second when I become suspicious of it. The sky doesn't fall. But these days it seems like people believe any interlock anywhere, even a preposterous invented one, is as sacred as the one on a microwave or a UV oven.

Oh, and when possible ZFS should not forget its knowledge of inconsistencies across a reboot, and should for example continue interrupted resilvers like SVM did.
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:

>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>
> tt> Why would it be assumed to be a bug in Solaris? Seems more
> tt> likely on balance to be a problem in the error reporting path
> tt> or a controller/ firmware weakness.
>
> It's not really an assumption. It's been discussed in here a lot, and
> we know why it's happening. It's just a case of "it's a feature not
> a bug" combined with "somebody else's problem."
>
> The error-reporting path you mention is inside Solaris, so I have a
> little trouble decoding your statement.

Not all of it is!

I don't see how anyone could confidently correlate "behaviour after sledgehammer impact" with a specific fault in Solaris, without doing a lot more investigation than "watching a YouTube video". Perhaps this has already been narrowed down to a specific root cause within Solaris - I just didn't see enough data in the OP's post to indicate that.

But I bow to your far more extensive experience...

--Toby
Toby Thain wrote:
> On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
>> The error-reporting path you mention is inside Solaris, so I have a
>> little trouble decoding your statement.
>
> Not all of it is!
>
> I don't see how anyone could confidently correlate "behaviour after
> sledgehammer impact" with a specific fault in Solaris, without doing
> a lot more investigation than "watching a YouTube video". Perhaps
> this has already been narrowed down to a specific root cause within
> Solaris - I just didn't see enough data in the OP's post to indicate
> that.

We could add strain sensors to disk drives which, when the strain was suddenly too great, would register an ASC/ASCQ 75/00 "DEVICE WAS HIT BY A HAMMER", and then we could add the e-report to sd and register an "io-hammer-event" FMA diagnosis engine which would be registered with ZFS to offline the device :-)

But seriously, it really does depend on the failure mode of the device, and I'm not sure people have studied the hammer case very closely. In the worst case, the device would be selectable, but not responding to data requests, which would lead through the device retry logic and can take minutes. If the (USB) device simply disappeared, it would be indistinguishable from a hot-plug event and that logic would take over, which results in a faster diagnosis. I suppose it will depend on the device and your aim.
-- richard
> In the worst case, the device would be selectable, but not responding
> to data requests, which would lead through the device retry logic and
> can take minutes.

That's what I didn't know: that a driver could take minutes (hours???) to decide that a device is not working anymore.

Now comes another question: how can one assume that a drive failure won't take one hour to be acknowledged by the driver? That is: what good is a failover strategy if it takes one hour to start? I'm grateful that the system doesn't write until it knows what is going on, but it shouldn't take that long.
--
This message posted from opensolaris.org
Scara Maccai wrote:
>> In the worst case, the device would be selectable, but not responding
>> to data requests, which would lead through the device retry logic and
>> can take minutes.
>
> That's what I didn't know: that a driver could take minutes (hours???)
> to decide that a device is not working anymore.

For Solaris, sd driver, there are, by default, 60 second timeouts with 5 retries. For the ssd driver, 3 retries. But sometimes additional tests are made to try to verify that the disk is really not working properly, which will cause more of these. Again, it depends on the failure mode.

> Now comes another question: how can one assume that a drive failure
> won't take one hour to be acknowledged by the driver? That is: what
> good is a failover strategy if it takes one hour to start? I'm grateful
> that the system doesn't write until it knows what is going on, but it
> shouldn't take that long.

AFAIK, there are no cases where the timeouts would result in an hour delay before making a decision. Usually, the policy is made in advance, as in the zpool failmode property.
-- richard
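(For readers who haven't met it, the failmode property Richard mentions is a per-pool setting that governs what happens to I/O once the pool has lost access to a device and has no redundancy left to fall back on. A quick illustration, assuming a pool named "tank"; see zpool(1M) on your release for the exact semantics:

  # show the current setting; the default is "wait"
  zpool get failmode tank

  # "continue" returns EIO to new writes but still allows reads from
  # healthy devices; "panic" crash-dumps the host, useful for clusters
  zpool set failmode=continue tank

Note that failmode only applies after the driver has given up on the device; it does not shorten the driver-level timeouts and retries described above.)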
But that's exactly the problem Richard: AFAIK.

Can you state that absolutely, categorically, there is no failure mode out there (caused by hardware faults, or bad drivers) that will lock a drive up for hours? You can't, obviously, which is why we keep saying that ZFS should have this kind of timeout feature.

For once I agree with Miles, I think he's written a really good writeup of the problem here. My simple view on it would be this:

Drives are only aware of themselves as an individual entity. Their job is to save & restore data to themselves, and drivers are written to minimise any chance of data loss. So when a drive starts to fail, it makes complete sense for the driver and hardware to be very, very thorough about trying to read or write that data, and to only fail as a last resort.

I'm not at all surprised that drives take 30 seconds to time out, nor that they could slow a pool for hours. That's their job. They know nothing else about the storage, they just have to do their level best to do as they're told, and will only fail if they absolutely can't store the data.

The raid controller on the other hand (NetApp / ZFS, etc) knows all about the pool. It knows if you have half a dozen good drives online, it knows if there are hot spares available, and it *should* also know how quickly the drives under its care usually respond to requests.

ZFS is perfectly placed to spot when a drive is starting to fail, and to take the appropriate action to safeguard your data. It has far more information available than a single drive ever will, and should be designed accordingly.

Expecting the firmware and drivers of individual drives to control the failure modes of your redundant pool is just crazy imo. You're throwing away some of the biggest benefits of using multiple drives in the first place.
--
This message posted from opensolaris.org
I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action.

We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
> But that's exactly the problem Richard: AFAIK.
>
> Can you state that absolutely, categorically, there is no failure mode
> out there (caused by hardware faults, or bad drivers) that will lock a
> drive up for hours? You can't, obviously, which is why we keep saying
> that ZFS should have this kind of timeout feature.
> ...
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think to my idea of ZFS supporting a "waiting for a response" status for disks as an interim solution that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead, Jim" answer, and I agree that long term the FMA approach will pay dividends. But I still feel this is a good short term solution, and one that would also complement your long term plans.

My justification for this is that it seems to me that you can split disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with FMA doing the extra work to fault drives), so it's just the second that needs immediate attention, and for the life of me I can't think of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum time ZFS will wait for a response from a device before putting it in a "waiting" status. The second would be optional and is the maximum time ZFS will wait before faulting a device (at which point it's replaced by a hot spare).

The reason I think this will work well with the FMA work is that you can implement this now and have a real improvement in ZFS availability. Then, as the other work starts bringing better modeling for drive timeouts, the parameters can be either removed, or set automatically by ZFS.

Long term I guess there's also the potential to remove the second setting if you felt FMA etc. ever got reliable enough, but personally I would always want to have the final fail delay set. I'd maybe set it to a long value such as 1-2 minutes to give FMA etc. a fair chance to find the fault. But I'd be much happier knowing that the system will *always* be able to replace a faulty device within a minute or two, no matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is still vital. The idea is purely to let ZFS keep the pool active by removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single disk, and I would imagine that FMA also has the same limitation - it's only going to be looking at a single item and trying to determine whether it's faulty or not. Because of that, FMA is going to be designed to be very careful to avoid false positives, and will likely take its time to reach an answer in some situations. ZFS however has the benefit of knowing more about the pool, and in the vast majority of situations it should be possible for ZFS to read or write from other devices while it's waiting for an 'official' result from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:
> I think we (the ZFS team) all generally agree with you. The current
> nevada code is much better at handling device failures than it was
> just a few months ago.
> ...
PS. I think this also gives you a chance at making the whole problem much simpler. Instead of the hard question of "is this faulty", you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag wouldn't be better as "possibly faulty". That way you could use it with checksum errors too, possibly with settings as simple as "errors per minute" or "error percentage". As with the timeouts, you could have it off by default (or provide sensible defaults), and let administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty regardless of the type of fault, and take immediate proactive action to safeguard data (generally long before the device is actually faulted). A device triggering any of these flags would be enough for ZFS to start reading from (or writing to) other devices first, and should you get multiple failures, or problems on a non-redundant pool, you always just revert back to ZFS's current behaviour.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:
> I think we (the ZFS team) all generally agree with you. The current
> nevada code is much better at handling device failures than it was
> just a few months ago.
> ...
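(To make the proposal above concrete: none of these properties exist in ZFS today. The sketch below simply writes out the hypothetical knobs Ross lists as if they were ordinary pool properties, purely as illustration of how they might be set:

  # hypothetical syntax; these are NOT real zpool properties
  zpool set zfs-auto-device-timeout=5s tank
  zpool set zfs-auto-device-checksum-fail-limit-epm=20 tank
  zpool set zfs-auto-device-fail-delay=120s tank

Whether such tunables belong in ZFS or in FMA is exactly what the rest of the thread goes on to argue about.)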
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok

I think you're missing "won't write".

There's clearly a difference between "get data from a different copy", which you can fix by retrying against a different part of the redundant data, and writing data: the data which can't be written must be kept until the drive is faulted.

Casper
No, I count that as "doesn't return data ok", but my post wasn't very clear at all on that. Even for a write, the disk will return something to indicate that the action has completed, so that can also be covered by just those two scenarios, and right now ZFS can lock the whole pool up if it's waiting for that response.

My idea is simply to allow the pool to continue operation while waiting for the drive to fault, even if that's a faulty write. It just means that the rest of the operations (reads and writes) can keep working for the minute (or three) it takes for FMA and the rest of the chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of the pool, with just the outstanding writes for the drive left waiting. Then as soon as the device is faulted, the hot spare can kick in, and the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in this. For dual parity pools there's a massive benefit as your pool stays available, and your data is still well protected.

Ross

On Tue, Nov 25, 2008 at 10:44 AM, <Casper.Dik at sun.com> wrote:
>> My justification for this is that it seems to me that you can split
>> disk behavior into two states:
>> - returns data ok
>> - doesn't return data ok
>
> I think you're missing "won't write".
>
> There's clearly a difference between "get data from a different copy",
> which you can fix by retrying against a different part of the redundant
> data, and writing data: the data which can't be written must be kept
> until the drive is faulted.
>
> Casper
> My idea is simply to allow the pool to continue operation while
> waiting for the drive to fault, even if that's a faulty write. It
> just means that the rest of the operations (reads and writes) can keep
> working for the minute (or three) it takes for FMA and the rest of the
> chain to flag a device as faulty.

Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a single disk (that's only around 110 MB/s of sustained writes).

Casper
Hmm, true. The idea doesn't work so well if you have a lot of writes, so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log on the rest of the pool? Or temporarily stored somewhere else in the pool? Would it be an option to allow up to a certain amount of writes to be cached in this way while waiting for FMA, and only suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all writes to the log? As a further enhancement, might it be possible to commit writes to the working drives, and just leave the writes for the bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need the administrator to specify the behavior, as I can see several options depending on the raid level and that pool's priorities for data availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare comes online
slog - writes are cached to the slog device until full, then the pool reverts to default behavior (could this be the default with slog devices present?)
pool - writes are cached to the pool itself, up to a set maximum, and are written to the device or spare as soon as possible. This assumes a single parity pool with the other devices available. If the upper limit is reached, or another device goes faulty, the pool reverts to default behaviour.

Storing directly to the rest of the pool would probably want to be off by default on single parity pools, but I would imagine that it could be on by default on dual parity pools.

Would that be enough to allow writes to continue in most circumstances while the pool waits for FMA?

Ross

On Tue, Nov 25, 2008 at 10:55 AM, <Casper.Dik at sun.com> wrote:
>> My idea is simply to allow the pool to continue operation while
>> waiting for the drive to fault, even if that's a faulty write.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.
>
> Casper
On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

> Hey Jeff,
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?
> ...
>
> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay
>
> The first sets whether to use this feature, and configures the maximum
> time ZFS will wait for a response from a device before putting it in a
> "waiting" status.

The shortcomings of timeouts have been discussed on this list before. How do you tell the difference between a drive that is dead and a path that is just highly loaded? I seem to recall the argument strongly made in the past that making decisions based on a timeout alone can provoke various undesirable cascade effects.

> The second would be optional and is the maximum
> time ZFS will wait before faulting a device (at which point it's
> replaced by a hot spare).
>
> The reason I think this will work well with the FMA work is that you
> can implement this now and have a real improvement in ZFS
> availability. Then, as the other work starts bringing better modeling
> for drive timeouts, the parameters can be either removed, or set
> automatically by ZFS.
> ... it should be possible for ZFS to read or
> write from other devices while it's waiting for an 'official' result
> from any one faulty component.

Sounds good - devil, meet details, etc.

--Toby
> The shortcomings of timeouts have been discussed on this list before.
> How do you tell the difference between a drive that is dead and a path
> that is just highly loaded?

A path that is dead is either returning bad data, or isn't returning anything. A highly loaded path is by definition reading & writing lots of data. I think you're assuming that these are file level timeouts, when this would actually need to be much lower level.

> Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many of which will need looking at by somebody a lot more technical than myself. Despite that I still think this is a discussion worth having. So far I don't think I've seen any situation where this would make things worse than they are now, and I can think of plenty of cases where it would be a huge improvement.

Of course, it also probably means a huge amount of work to implement. I'm just hoping that it's not prohibitively difficult, and that the ZFS team see the benefits as being worth it.
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts. Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that bypassed the timeout. We could have waited it out,
> but 60 seconds is an eternity on stage.

I'm sorry, I didn't mean to sound offensive. Anyway, I think people should know that their drives can stall the system for minutes, "despite" ZFS. I mean: there is a lot written about how great ZFS is at recovery when a drive fails, but there's nothing regarding this problem. I know now it's not ZFS's fault; but I wonder how many people set up their drives with ZFS assuming that "as soon as something goes bad, ZFS will fix it".

Is there any way to test these cases other than smashing the drive with a hammer? Having a failover policy where the failover can't be tested sounds scary...
--
This message posted from opensolaris.org
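(One way to experiment without sacrificing hardware is a throwaway pool built on file-backed vdevs; it won't reproduce every firmware failure mode, but it does let you watch how a pool reacts to a device going away and coming back. A rough sketch, assuming a scratch directory with enough free space:

  # build a small mirrored test pool on file-backed vdevs
  mkfile 128m /var/tmp/d1 /var/tmp/d2
  zpool create testpool mirror /var/tmp/d1 /var/tmp/d2

  # take one side away, watch the pool degrade, then bring it back
  zpool offline testpool /var/tmp/d1
  zpool status testpool
  zpool online testpool /var/tmp/d1

  # clean up
  zpool destroy testpool

For the "drive goes silent" case discussed in this thread, pulling the power or data cable on a disk you don't care about is still the closest approximation, since offlining is an orderly administrative event rather than a timeout.)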
Ross Smith wrote:
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok
>
> And for the state where it's not returning data, you can again split
> that in two:
> - returns wrong data
> - doesn't return data

The state in discussion in this thread is "the I/O requested by ZFS hasn't finished after 60, 120, 180, 3600, etc. seconds". The pool is waiting (for device timeouts) to distinguish between the first two states. More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably impatient.

For the first state, the data is either correct (verified by the ZFS checksums, or ESUCCESS on write) or incorrect and retried.

> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:
zpool-device-timeout
zpool-device-timeout-fail-delay
with specific per-VDEV overrides possible:
vdev-device-timeout and vdev-device-fail-delay

This would allow but not require slower VDEVs to be tuned specifically for that case without hindering the default pool behavior on the local fast disks. Specifically, consider where I'm using mirrored VDEVs with one half over iSCSI, and want the iSCSI retry logic to still apply. Writes that failed while the iSCSI link is down would have to be resilvered, but at least reads would switch to the local devices faster.

Set them to the default magic "0" value to have the system use the current behavior of relying on the device drivers to report failures. Set to a number (in ms probably) and the pool would consider an I/O that takes longer than that as "returns invalid data".

When the FMA work discussed below is done, these could be augmented by the pool's "best heuristic guess" as to what the proper timeouts should be, which could be saved in (kstat?) vdev-device-autotimeout. If you set the timeout to the magic "-1" value, the pool would use vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take a long time to be given a deadline (now + (vdev-device-timeout ?: (zpool-device-timeout ?: forever)))* and to consider the I/O complete with whatever data has returned after that deadline: if that's a bunch of 0's in a read, it would have a bad checksum; a partially-completed write would have to be committed somewhere else.

Unfortunately, I'm not enough of a programmer to implement this.

--Joe

* with the -1 magic, it would be a slightly more complicated calculation.
On Tue, Nov 25, 2008 at 11:55:17AM +0100, Casper.Dik at Sun.COM wrote:
>> My idea is simply to allow the pool to continue operation while
>> waiting for the drive to fault, even if that's a faulty write. It
>> just means that the rest of the operations (reads and writes) can keep
>> working for the minute (or three) it takes for FMA and the rest of the
>> chain to flag a device as faulty.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.

If we're talking isolated, or even clumped-but-relatively-few bad sectors, then having a short timeout for writes and remapping should be possible to do without running out of memory to cache those writes. But...

...writes to bad sectors will happen when txgs flush, and depending on how bad sector remapping is done (say, by picking a new block address and changing the blkptrs that referred to the old one) that might mean redoing large chunks of the txg in the next one, which might mean that fsync() could be delayed an additional 5 seconds or so.

And even if that's not the case, writes to mirrors are supposed to be synchronous, so one would think that bad block remapping should be synchronous also, thus there must be a delay on writes to bad blocks no matter what -- though that delay could be tuned to be no more than a few seconds.

That points to a possibly decent heuristic on writes: vdev-level timeouts that result in bad block remapping, but if the queue of outstanding bad block remappings grows too large -> treat the disk as faulted and degrade the pool. Sounds simple, but it needs to be combined at a higher layer with information from other vdevs. Unplugging a whole jbod shouldn't necessarily fault all the vdevs on it -- perhaps it should cause pool operation to pause until the jbod is plugged back in... which should then cause those outstanding bad block remappings to be rolled back since they weren't bad blocks after all. That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out accidentally. What should be the failure mode of a jbod disappearing due to a pulled cable (or power supply failure)? A pause in operation (hangs)? Or faulting of all affected vdevs, and if you're mirrored across different jbods, incurring the need to re-silver later, with degraded operation for hours on end? I bet answers will vary. The best answer is to provide enough redundancy (multiple power supplies, multi-pathing, ...) to make such situations less likely, but that's not a complete answer.

Nico
--
On Tue, 25 Nov 2008, Ross Smith wrote:
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?

A stable and sane system never comes with "two brains". It is wrong to put this sort of logic into ZFS when ZFS is already depending on FMA to make the decisions and Solaris already has an infrastructure to handle faults. The more appropriate solution is that this feature should be in FMA.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Scara Maccai wrote:
> Anyway, I think people should know that their drives can stall the
> system for minutes, "despite" ZFS. ...
> Is there any way to test these cases other than smashing the drive
> with a hammer? Having a failover policy where the failover can't be
> tested sounds scary...

It is with this idea in mind that I wrote part of Chapter 1 of the book Designing Enterprise Solutions with Sun Cluster 3.0. For convenience, I also published chapter 1 as a Sun BluePrint Online article.
http://www.sun.com/blueprints/1101/clstrcomplex.pdf

False positives are very expensive in highly available systems, so we really do want to avoid them. One thing that we can do, and I've already (again[1]) started down the path to document, is to show where and how the various (common) timeouts are in the system. Once you know how sd, cmdk, dbus, and friends work you can make better decisions on where to look when the behaviour is not as you expect. But this is a very tedious path because there are so many different failure modes and real-world devices can react ambiguously when they fail.

[1] we developed a method to benchmark cluster dependability. The description of the benchmark was published in several papers, but is now available in the new IEEE book on Dependability Benchmarking. This is really the first book of its kind and the first steps toward making dependability benchmarks more mainstream. Anyway, the work done for that effort included methods to improve failure detection and handling, so we have a detailed understanding of those things for SPARC, in lab form. Expanding that work to cover the random-device-bought-at-Frys will be a substantial undertaking. Co-conspirators welcome.
-- richard
I disagree Bob, I think this is a very different function to that which
FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to handle
device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware. You might as well say that FMA should deal
with that too, since it's responsible for all hardware failures.

The role of ZFS is to manage the pool; availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 25 Nov 2008, Ross Smith wrote:
>>
>> Good to hear there's work going on to address this.
>>
>> What did you guys think to my idea of ZFS supporting a "waiting for a
>> response" status for disks as an interim solution that allows the pool
>> to continue operation while it's waiting for FMA or the driver to
>> fault the drive?
>
> A stable and sane system never comes with "two brains". It is wrong to put
> this sort of logic into ZFS when ZFS is already depending on FMA to make the
> decisions and Solaris already has an infrastructure to handle faults. The
> more appropriate solution is that this feature should be in FMA.
>
> Bob
> ======================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
On Tue, 25 Nov 2008, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliability of the pool while still using FMA to
> handle device failures?

If FMA does not currently have knowledge of the redundancy model but
needs it to make well-informed decisions, then it should be updated to
incorporate this information. FMA sees all the hardware in the system,
including devices used for UFS and other types of filesystems, and even
tape devices. It is able to see hardware at a much more detailed level
than ZFS does. ZFS only sees an abstracted view of the hardware. If an
HBA or part of the backplane fails, FMA should be able to determine the
failing area (at least as far out as it can see based on available
paths), whereas all ZFS knows is that it is having difficulty getting
there from here.

> The flip side of the argument is that ZFS already checks the data
> returned by the hardware. You might as well say that FMA should deal
> with that too since it's responsible for all hardware failures.

If bad data is returned, then I assume that there is a peg to FMA's
error statistics counters.

> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.

Too much complexity tends to clog up the works and keep other areas of
ZFS from being enhanced expediently. ZFS would soon become a chunk of
source code that no mortal could understand, and as such it would be
put under "maintenance" with no more hope of moving forward and no
ability to address new requirements.

A rational system really does not want to have multiple brains.
Otherwise some parts of the system will think that the device is fine
while other parts believe that it has failed. None of us want to deal
with an insane system like that.

There is also the matter of fault isolation. If a drive can not be
reached, is it because the drive failed, or because an HBA supporting
multiple drives failed, or a cable got pulled? This sort of information
is extremely important for large, reliable systems.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
It's hard to tell exactly what you are asking for, but this sounds
similar to how ZFS already works. If ZFS decides that a device is
pathologically broken (as evidenced by vdev_probe() failure), it knows
that FMA will come back and diagnose the drive as faulty (because we
generate a probe_failure ereport). So ZFS pre-emptively short-circuits
all I/O and treats the drive as faulted, even though the diagnosis
hasn't come back yet. We can only do this for errors that have a 1:1
correspondence with faults.

- Eric

On Tue, Nov 25, 2008 at 04:10:13PM +0000, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliability of the pool while still using FMA to
> handle device failures?
>
> The flip side of the argument is that ZFS already checks the data
> returned by the hardware. You might as well say that FMA should deal
> with that too since it's responsible for all hardware failures.
>
> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.
>
>
> On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
> > On Tue, 25 Nov 2008, Ross Smith wrote:
> >>
> >> Good to hear there's work going on to address this.
> >>
> >> What did you guys think to my idea of ZFS supporting a "waiting for a
> >> response" status for disks as an interim solution that allows the pool
> >> to continue operation while it's waiting for FMA or the driver to
> >> fault the drive?
> >
> > A stable and sane system never comes with "two brains". It is wrong to put
> > this sort of logic into ZFS when ZFS is already depending on FMA to make the
> > decisions and Solaris already has an infrastructure to handle faults. The
> > more appropriate solution is that this feature should be in FMA.
> >
> > Bob
> > ======================================
> > Bob Friesenhahn
> > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> >
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
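A toy illustration of the behaviour Eric describes -- probe the device,
post an ereport for FMA, and short-circuit further I/O immediately when
the probe itself fails -- is sketched below. It is not the actual ZFS
code path; all names (toy_vdev_t, toy_probe(), toy_post_probe_failure())
are invented, and the real logic lives in the kernel around vdev_probe().

/*
 * Illustrative only -- not ZFS source.  When the probe itself fails,
 * post the ereport for FMA but fault the device right away instead of
 * waiting for the diagnosis to come back.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    bool faulted;
} toy_vdev_t;

/* Stand-in for vdev_probe(): try to read and write the device labels. */
static bool
toy_probe(toy_vdev_t *vd)
{
    (void)vd;
    return false;              /* pretend the label I/O failed outright */
}

/* Stand-in for posting a probe-failure ereport for FMA to diagnose. */
static void
toy_post_probe_failure(toy_vdev_t *vd)
{
    printf("probe failure ereport posted for %s\n", vd->name);
}

static void
toy_handle_io_error(toy_vdev_t *vd)
{
    if (!toy_probe(vd)) {
        /*
         * The error corresponds 1:1 with a fault, so short-circuit
         * further I/O now; FMA's diagnosis will only confirm it.
         */
        toy_post_probe_failure(vd);
        vd->faulted = true;
    }
    /*
     * Otherwise keep the device open and let FMA decide whether the
     * accumulated error telemetry amounts to a fault.
     */
}

int
main(void)
{
    toy_vdev_t vd = { "c1t0d0", false };   /* placeholder device name */
    toy_handle_io_error(&vd);
    printf("%s faulted: %s\n", vd.name, vd.faulted ? "yes" : "no");
    return 0;
}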
>>>>> "rs" == Ross Smith <myxiplx at googlemail.com> writes:
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

rs> I disagree Bob, I think this is a very different function to
rs> that which FMA provides.

I see two problems.

(1) FMA doesn't seem to work very well, and was used as an excuse to
    keep proper exception handling out of ZFS for a couple of years, so
    I'm sort of... skeptical whenever it's brought up as a panacea.

(2) The FMA model of collecting telemetry, taking it into user space,
    chin-strokingly contemplating it for a while, then decreeing a
    diagnosis, is actually a rather limited one. I can think of two
    kinds of limit:

    (a) You're diagnosing the pool FMA is running on. FMA is on the
        root pool, but the root pool won't unfreeze until FMA diagnoses
        it. In practice it's much worse, because problems in one pool's
        devices can freeze all of ZFS, even other pools. Or if the
        system is NFS-rooted and also exporting ZFS filesystems over
        NFS, maybe all of NFS freezes? Problems like that, knocking out
        FMA. Diagnosis in kernel is harder to knock out.

    (b) Calls are sleeping uninterruptibly in the path that returns
        events to FMA. ``Call down into the controller driver, wait for
        it to return success or failure, then count the event and call
        back to FMA as appropriate. If something's borked, FMA will
        eventually return a diagnosis.'' This plan is useless if the
        controller just freezes. FMA never sees anything. You are
        analyzing faults, yes, but you can only do it with hindsight.
        When do you do the FMA callback? To implement this timeout,
        you'd have to do a callback before and after each I/O, which is
        obviously too expensive.

Likewise, when FMA returns the diagnosis, are you prepared to act on
it? Or are you busy right now, and you're going to act on it just as
soon as that controller returns success or failure? You can't abstract
the notion of time out of your diagnosis. Trying to compartmentalize it
interferes with working it into low-level event loops in a way that's
sometimes needed. It's not a matter of where things taxonomically
belong, where it feels clean to put some functionality in your
compartmentalized layered tower. Certain things just aren't achievable
from certain places.

nw> If we're talking isolated, or even clumped-but-relatively-few
nw> bad sectors, then having a short timeout for writes and
nw> remapping should be possible

I'm not sure I understand the state machine for the remapping plan,
but... I think your idea is: try to write to some spot on the disk. If
it takes too long, cancel the write, and try writing somewhere else
instead. Then do bad-block remapping: fix up all the pointers for the
new location, mark the spot that took too long as poisonous, all that.

I don't think it'll work.

First, you can't cancel the write. Once you dispatch a write that
hangs, you've locked up, at a minimum, the drive trying to write. You
don't get the option of remapping and writing elsewhere, because the
drive has stopped listening to you. Likely, you've also locked up the
bus (if the drive's on PATA or SCSI), or maybe the whole controller.
(This is, IMHO, the best reason for laying out a RAID to survive a
controller failure -- interaction with a bad drive could freeze a whole
controller.)

Even if you could cancel the write, when do you cancel it?
If you can learn your drive and controller so well that you convince
them to ignore you for 10 seconds instead of two minutes when they hit
a block they can't write, you've got approximately the same problem,
because you don't know where the poison sectors are. You'll probably
hit another one. Even a ten-second write means the drive's performance
is shot by almost three orders of magnitude -- it's not workable.

Finally, this approach interferes with diagnosis. The drives have their
own retry state machine. If you start muddling all this ad-hoc stuff on
top of it, you can't tell the difference between drive failures,
cabling problems, and controller failures. You end up with normal
thermal-recalibration events being treated as some kind of ``spurious
late read'' and inventing all these strange unexplained failure terms,
which makes it impossible to write a paper like the NetApp or Google
papers on UNCs we used to cite in here all the time, because your
failure statistics no longer correspond to a single layer of the
storage stack and can't be compared to others' statistics.

Also, remember that we suspect and wish to tolerate drives that operate
many standard deviations outside their specification, even when they're
not broken or suspect or about to break. There are two reasons. First,
we think they might do it. Second, otherwise you can't collect
performance statistics you can compare with others'. That's why the
added failure handling I suggested is only to ignore drives -- either
for a little while, or permanently. Merely ignoring a drive, without
telling the drive you're ignoring it, doesn't interfere with collecting
statistics from it.

The two queues inside the drive (retryable and deadline) would let you
do this bad-block remapping, but no drive implements it, and it's
probably impossible to implement because of the sorts of things drives
do while ``retrying''. I described the drive-QoS idea to explain why
this B_FAILFAST-ish plan of supervising the drive's recovery behavior,
or any plan involving ``cancelling'' CDBs, is never going to work.

Here is one variant of this remapping plan I think could work, which
somewhat preserves the existing storage-stack layering (a rough sketch
of it appears after this message):

 * Add a timeout to B_FAILFAST CDBs above the controller driver -- a
   short one, like a couple of seconds.

 * When a drive is busy on a non-B_FAILFAST transaction for longer than
   the B_FAILFAST timeout, walk through the CDB queue and instantly
   fail all the B_FAILFAST transactions, without even sending them to
   the drive.

 * When a drive blows a B_FAILFAST timeout, admit no more B_FAILFAST
   transactions until it successfully completes a non-B_FAILFAST
   transaction. If the drive is marked timeout-blown, and no
   transactions are queued for it, wait 60 seconds and then make up a
   fake transaction for it, like ``read one sector in the middle of the
   disk.''

I like the vdev-layer ideas better than the block-layer ideas, though.

nw> What should be the failure mode of a jbod disappearing due to
nw> a pulled cable (or power supply failure)? A pause in
nw> operation (hangs)? Or faulting of all affected vdevs, and if
nw> you're mirrored across different jbods, incurring the need to
nw> re-silver later, with degraded operation for hours on end?

The resilvering should only include things written during the outage,
so the degraded operation will last some time proportional to the
outage. Resilvering is already supposed to work this way. The argument,
I think, will be over the idea of auto-onlining things.
My opinion: if you are dealing with failure by deciding to return
success to fsync() with fewer copies of the data written, then getting
back to normal should require either a spare rebuild or manually
issuing 'zpool clear'. Certain kinds of rocking behavior -- like
changes to the mirror round-robin, or delaying writes of non-fsync()
data -- are okay, but rocking back and forth between redundancy states
automatically during normal operation is probably unacceptable.

The counter-opinion, I suppose, might be that we get more MTTDL by
writing as quickly as possible to as many places as possible, so
auto-onlining is good. But I don't think so.
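The B_FAILFAST variant sketched in the preceding message boils down to
a small per-device state machine. A rough illustration follows; every
name, field, and timeout in it (toy_disk_t, failfast_admit(), and so
on) is invented, and this is not how sd/ssd or any other Solaris driver
actually implements B_FAILFAST.

/*
 * Rough sketch of the proposed B_FAILFAST variant: a short timeout for
 * failfast CDBs above the controller driver, instant failure of queued
 * failfast CDBs while the drive is stuck, and no failfast admission
 * until a normal command completes again.  Invented names throughout.
 */
#include <stdbool.h>
#include <stdio.h>

#define FAILFAST_TIMEOUT_MS  2000   /* "a couple of seconds" */
#define PROBE_RETRY_MS      60000   /* idle re-probe interval */

typedef struct {
    bool timeout_blown;            /* drive blew a failfast timeout */
    unsigned busy_ms;              /* how long the current CDB has been out */
} toy_disk_t;

/* Should this failfast request be sent to the drive at all? */
static bool
failfast_admit(const toy_disk_t *d)
{
    if (d->timeout_blown)
        return false;              /* fail instantly, don't even queue it */
    if (d->busy_ms > FAILFAST_TIMEOUT_MS)
        return false;              /* drive stuck on a normal CDB: fail now */
    return true;
}

/* A non-failfast command completed successfully: trust the drive again. */
static void
normal_cdb_completed(toy_disk_t *d)
{
    d->timeout_blown = false;
    d->busy_ms = 0;
}

/* If the drive is marked timeout-blown and has been idle long enough,
 * issue a fake one-sector read to see whether it has recovered. */
static bool
should_probe(const toy_disk_t *d, unsigned idle_ms)
{
    return d->timeout_blown && idle_ms >= PROBE_RETRY_MS;
}

int
main(void)
{
    toy_disk_t d = { true, 0 };    /* drive has already blown a timeout */

    printf("admit failfast read? %s\n", failfast_admit(&d) ? "yes" : "no");
    printf("probe after 60s idle? %s\n", should_probe(&d, 60000) ? "yes" : "no");

    normal_cdb_completed(&d);      /* a normal CDB finally succeeded */
    printf("admit failfast read now? %s\n", failfast_admit(&d) ? "yes" : "no");
    return 0;
}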
On Wed, 26 Nov 2008, Miles Nordin wrote:
>
> (2) The FMA model of collecting telemetry, taking it into user space,
>     chin-strokingly contemplating it for a while, then decreeing a
>     diagnosis, is actually a rather limited one. I can think of two
>     kinds of limit:
>
>     (a) You're diagnosing the pool FMA is running on. FMA is on the
>         root pool, but the root pool won't unfreeze until FMA
>         diagnoses it.

I did not have time to read most of your lengthy thesis, but I agree
that FMA is useless if the motherboard catches fire.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
> (2) The FMA model of collecting telemetry, taking it into user space,
>     chin-strokingly contemplating it for a while, then decreeing a
>     diagnosis, is actually a rather limited one. I can think of two
>     kinds of limit:

As mentioned previously, this is not an accurate description of what's
going on. FMA allows diagnosis to happen at the detector when the
telemetry is conclusive and cross-domain or predictive analysis isn't
required. This is exactly what ZFS does on recent Nevada builds. If a
drive is pathologically broken (i.e. a reopen fails, or reads and
writes to the label fail), it will *immediately* fail the drive and not
wait for any further diagnosis from FMA. For drives that randomly fail
I/Os or take a long time, but otherwise respond to basic requests, ZFS
is often in no better position to perform a diagnosis in the kernel.
And as of build 101, ZFS behaves much better in these circumstances by
not aggressively retrying commands before exhausting all other options.

Are you running your experiments on build 101 or later? And what
experiments are you running? Drawing conclusions from previous
experience or reports is basically pointless given the amount of change
that has occurred recently (Jeff's putback wasn't nicknamed "SPA 3.0"
for nothing). While there are no doubt more rough edges, we have
incorporated much of the previous feedback into new behavior that
should provide a much improved experience.

- Eric

P.S. I'm also not sure that B_FAILFAST behaves in the way you think it
does. My reading of sd.c seems to imply that much of what you suggest
is actually how it currently behaves, but you should probably bring up
the issue on storage-discuss, where you will find more experts in this
area.

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
>>>>> "es" == Eric Schrock <eric.schrock at sun.com> writes:

es> Are you running your experiments on build 101 or later?

No. Aside from that quick one for copies=2, I'm pretty bad about
running well-designed experiments, and I do have old builds. I need to
buy more hardware.

It's hard to know how to get the most stable system. I bet it'll be a
year before this b101 stuff makes it into stable Solaris, yet the
bleeding-edge improvements are all stability-related, so for mostly-ZFS
jobs maybe it's better to run SXCE than sol10 in production. I suppose
I should be happy about that, since it means more people will have some
source. :)

es> P.S. I'm also not sure that B_FAILFAST behaves in the way you
es> think it does. My reading of sd.c seems to imply that much of
es> what you suggest is actually how it currently behaves,

Yeah, I got a private email referring me to the spec for
PSARC/2002/126, which already included both pieces I hoped for (killing
queued CDBs, and statefully tracking each device as failed/good), so I
take back what I said about B_FAILFAST being useless -- it should be
able to help the ZFS availability problems we've seen.

The PSARC case says B_FAILFAST is implemented in the ``disk driver'',
which, AIUI, is above the controller, just as I hoped, but there is
more than one ``disk driver'', so the B_FAILFAST stuff is not factored
out into one spot the way a vdev-level system would be, but rather
punted downwards and copy-pasted into sd, ssd, dad, ..., so whatever
experience you get with it isn't necessarily portable to disks with a
different kind of attachment.

I still think the vdev-layer logic could make better decisions by using
more than the 1 bit of information per device, but maybe 1-bit
B_FAILFAST is enough to make me accept the shortfall as an
arguable-feature rather than a unanimous-bug. Also, if it can fix my
(1) and (2) with FMA, then maybe the gap between B_FAILFAST and real
NetApp-like drive diagnosis can be closed partly in userspace the way
developers seem to want.

The problems this doesn't cover are write-related:

 * What should we do about implicit and explicit fsync()s where all the
   data is already on stable storage, but not with full redundancy --
   one device won't finish writing? I think there should not be
   transparent recovery from this, though maybe others disagree. But
   pool-level failmode doesn't settle the issue:

   (a) _When_ will you take the failure action (if failmode != wait)?
       The property says *what* to do, not *when* to do it.

   (b) There isn't any vdev-level failure, only device-level, so it's
       not appropriate to consult the failmode property in the first
       place -- the situation is different. The question is, do we keep
       trying, or do we transition the device to FAULTED and the vdev
       to DEGRADED so that fsync()s can proceed without that device and
       hot-spare resilver kicks in?

   (c) Inside the time interval between when the device starts writing
       slowly and when you take the (b) action, how well can you
       isolate the failure? For example, can you ensure that read-only
       access remains instantaneous, even though atime updates involve
       writing, even though these 5-second txg flushes are blocked, and
       even though the admin might (gasp!) type 'zpool status' -- or
       even a label-writing command like 'zpool attach'? Or will one of
       those three things cause a pool-wide or ZFS-wide hang that
       blocks read access which could theoretically work?

 * Commands like zpool attach, detach, replace, offline, export:

   (a) should not be uninterruptibly hangable.
   (b) Problems in one pool should not spill over into another.

   (c) And finally, they should be forcible even when they can't write
       everything they'd like to, so that rebooting isn't a necessary
       move in certain kinds of failure-recovery pool gymnastics.

I expect there's some quiet work on this in b101 also -- at least
someone said 'zpool status' isn't supposed to hang anymore? So I'll
have to try it out, but B_FAILFAST isn't enough to settle the whole
issue, even modulo the marginal performance improvement that more
ambitiously wacky schemes might promise us.