I'm running ZFS on a test server against a bunch of drives in an Apple XRAID (configured in JBOD mode). It works pretty well, except that when I yank one of the drives, ZFS hangs -- presumably it's waiting for a response from the XRAID.

Is there any way to set the device-failure timeout with ZFS?

Thanks,
-Luke
Luke Scharf wrote:
> I'm running ZFS on a test server against a bunch of drives in an Apple
> XRAID (configured in JBOD mode). It works pretty well, except that
> when I yank one of the drives, ZFS hangs -- presumably it's waiting
> for a response from the XRAID.
>
> Is there any way to set the device-failure timeout with ZFS?

In general, ZFS doesn't manage device timeouts; the lower-layer drivers do. Timeout management depends on which OS, OS version, and HBA you use. A fairly extreme example is Solaris using parallel SCSI and the sd driver, which defaults to a 60-second timeout and 5 retries. In more recent Solaris NV builds, FMA has been enhanced with an io-retire module that can make better decisions about whether a device is behaving well.
 -- richard

> Thanks,
> -Luke

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
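For anyone wanting to tune at the driver layer Richard describes, on Solaris the sd driver's timeout and retry behavior can be adjusted via /etc/system. The tunable names below (sd_io_time, sd_retry_count) vary by Solaris release and driver, so treat this as an illustrative fragment and verify the names against your platform's documentation before rebooting with it:

```
* /etc/system fragment -- illustrative only; tunable names and defaults
* vary by Solaris release and HBA driver.
set sd:sd_io_time = 20        * per-command timeout in seconds (default 60)
set sd:sd_retry_count = 3     * retries before the command fails (default 5)
```

Changes to /etc/system take effect only after a reboot, and shortening these values trades faster failure detection for a higher chance of giving up on a merely slow device.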
Richard Elling wrote:
> In general, ZFS doesn't manage device timeouts; the lower-layer
> drivers do. Timeout management depends on which OS, OS version, and
> HBA you use. A fairly extreme example is Solaris using parallel SCSI
> and the sd driver, which defaults to a 60-second timeout and 5
> retries. In more recent Solaris NV builds, FMA has been enhanced with
> an io-retire module that can make better decisions about whether a
> device is behaving well.

What, ZFS isn't the whole kernel? ;-)

I can Google/RTFM from here. Thanks!
-Luke
To my mind it's a big limitation of ZFS that it relies on the driver timeouts. The driver has no knowledge of what kind of configuration the disks are in, and generally any kind of data loss is bad, so it's not surprising that long timeouts are the norm as the driver does its very best to avoid data loss. ZFS, however, knows full well whether a device is in a protected pool (RAID-Z or mirrored), and really has no reason to hang operations on the entire pool when one device is not responding.

I've seen this with iSCSI devices, and I've seen plenty of reports of other people experiencing ZFS hangs -- including hangs of the admin tools, which makes error reporting and monitoring difficult too. When dealing with redundant devices, ZFS needs either its own timeouts or a more intelligent way of handling this kind of scenario.

This message posted from opensolaris.org
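As a sketch of the idea above -- a filesystem-level timeout that exploits redundancy instead of hanging the pool -- here is a minimal simulation. The device names, delays, and the mirrored_read helper are all hypothetical illustrations of the concept, not ZFS internals:

```python
import concurrent.futures
import time

def read_block(device, delay, data):
    """Simulate a read from one side of a mirror; 'delay' models a hung
    or slow device (a real implementation would issue an actual I/O)."""
    time.sleep(delay)
    return data

def mirrored_read(timeout=0.2):
    """Try the primary side of a hypothetical mirror; if it doesn't
    answer within 'timeout', satisfy the read from the other side
    instead of blocking all operations on the pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        # "disk0" is hung: it won't respond for 2 seconds.
        primary = pool.submit(read_block, "disk0", 2.0, b"A")
        try:
            return primary.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # Redundancy means the same data is available elsewhere,
            # so fall back to the healthy side rather than waiting.
            mirror = pool.submit(read_block, "disk1", 0.01, b"A")
            return mirror.result()

print(mirrored_read())  # prints b'A'
```

The point of the sketch is only that a layer which knows about redundancy can bound its own wait and fail over, whereas a disk driver, seeing a single device, can only retry.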