Maurice Volaski
2010-Jul-16 00:01 UTC
[zfs-discuss] Making a zvol unavailable to iSCSI trips up ZFS
I've been experimenting with a two-system setup in snv_134 where each system exports a zvol via COMSTAR iSCSI. One system imports both its own zvol and the one from the other system and puts them together in a ZFS mirror.

I manually faulted the zvol on one system by physically removing some drives. What I expect to happen is that ZFS will fault the zvol pool and the iSCSI stack will detect this and fault the target. Then ZFS for the mirrored pool will detect a failed device and report it. Throughout all this the system should operate normally, perhaps with small delays as it waits on failed devices.

That isn't what happens.

The removed drives were detected and the zvol zpool was faulted. This eventually resulted in iSCSI "device is busy too long" errors, and that sounds about right so far.

But the top-level mirror, which is acting as an NFS share, suddenly vanished from its NFS client! That is, the failure of a zvol tied to iSCSI seems to poison other parts of the OS, causing NFS to fail. Isn't that odd?

At the same time, zpool status on the mirrored pool detected nothing wrong. Eventually, it did detect errors on the failed device in the mirror, but oddly it didn't offline it as the logs claimed it would. Instead, it seems that I/O stopped altogether. Also, it appears that the iSCSI timeout errors are taking way longer than what I have them set for, and even after they have timed out, ZFS ignores that and keeps trying.

Somehow, I eventually got the pool to unmount and export, but when I tried to import it, the same thing happened. First, the iSCSI errors seem to ignore the timeout parameters and instead take an arbitrarily long time, even longer than the defaults. Second, ZFS won't give up on trying to import the pool even though iSCSI is reporting to it that a device has failed. That is, ZFS hangs when trying to import pools that contain a failed device. The pool is set to continue on failure, however. And technically, with just one device in the mirror failed, it really isn't failed, just degraded.

These are my iSCSI parameters:
recv-login-rsp-timeout=6
conn-login-max=3
polling-login-delay=2

--
Maurice Volaski, maurice.volaski at einstein.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
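The expectation stated in the post above (a two-way mirror that loses one side should be degraded, not failed) can be written down as a toy model. This is only a sketch of the expected outcome, not ZFS code: the function name mirror_state and the plain string states are assumptions made for illustration, and real vdev state handling is considerably more involved.

```python
# Toy model of the expectation in the post above. This is NOT ZFS source
# code; mirror_state() and the string states are hypothetical names.

def mirror_state(children):
    """Return the expected state of a mirror from its children's states.

    A mirror stays usable as long as at least one child is healthy.
    """
    if not children:
        raise ValueError("a mirror needs at least one child device")
    healthy = sum(1 for state in children if state == "ONLINE")
    if healthy == len(children):
        return "ONLINE"      # every side of the mirror is healthy
    if healthy > 0:
        return "DEGRADED"    # some redundancy lost, data still available
    return "FAULTED"         # no healthy side left


if __name__ == "__main__":
    # Local zvol healthy, remote iSCSI-backed zvol failed.
    print(mirror_state(["ONLINE", "FAULTED"]))
```

In the scenario reported here, the local zvol is still healthy and only the iSCSI-backed side has failed, so this model gives DEGRADED, which is the state the poster expected zpool status to show instead of a hang.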
Maurice Volaski
2010-Jul-19 17:36 UTC
[zfs-discuss] Making a zvol unavailable to iSCSI trips up ZFS
This is now CR 6970210.

>I've been experimenting with a two system setup in snv_134 where
>each system exports a zvol via COMSTAR iSCSI. One system imports
>both its own zvol and the one from the other system and puts them
>together in a ZFS mirror.
>
>I manually faulted the zvol on one system by physically removing
>some drives. What I expect to happen is that ZFS will fault the zvol
>pool and the iSCSI stack will detect this and fault the target. Then
>ZFS for the mirrored pool will detect a failed device and report it.
>Throughout all this the system should operate normally, perhaps with
>small delays as it waits on failed devices.
>
>That isn't what happens.
>
>The removed drives were detected and the zvol zpool was faulted.
>This eventually resulted in iSCSI "device is busy too long" errors,
>and that sounds about right so far.
>
>But the top-level mirror, which is acting as an NFS share, suddenly
>vanished from its NFS client! That is, the failure of a zvol tied to
>iSCSI seems to poison other parts of the OS causing the NFS to fail.
>Isn't that odd?
>
>At the same time, zpool status on the mirrored pool detected nothing
>wrong. Eventually, it did detect errors on the failed device in the
>mirror, but oddly it didn't offline it as the logs claimed it would.
>Instead, it seems that I/O stopped altogether. Also, it appears that
>the iSCSI timeout errors are taking way longer than what I have them
>set for and even after they have timed out, ZFS is ignoring that and
>still keeps trying.
>
>Somehow, I eventually got the pool to unmount and export, but when I
>tried to import it, the same thing is happening. First, the iSCSI
>errors seem to be ignoring the parameters to timeout and are instead
>taking an arbitrarily long time, even longer than the defaults.
>Second, ZFS won't give up on trying to import the pool even though
>iSCSI is reporting to it that a device has failed. That is, ZFS gets
>hung when trying to import pools that contain a failed device. The
>pool is set to continue on failure, however. And technically, with
>just one device in the mirror failed, it really isn't failed, just
>degraded.
>
>These are my iSCSI parameters:
>recv-login-rsp-timeout=6
>conn-login-max=3
>polling-login-delay=2

--
Maurice Volaski, maurice.volaski at einstein.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University