I've recently upgraded my x4500 to Nevada build 97, and am having problems with the iSCSI target.

Background: this box is used to serve NFS underlying a VMware ESX environment (zfs filesystem-type datasets) and presents iSCSI targets (zfs zvol datasets) for a Windows host and to act as zoneroots for Solaris 10 hosts. For optimal random-read performance, I've configured a single zfs pool of mirrored VDEVs of all 44 disks (+2 boot disks, +2 spares = 48).

Before the upgrade, the box was flaky under load: all I/O to the ZFS pool would occasionally stop.

Since the upgrade, that hasn't happened, and the NFS clients are quite happy. The iSCSI initiators are not.

The Windows initiator is running the Microsoft iSCSI Initiator v2.0.6 on Windows 2003 SP2 x64 Enterprise Edition. When that system reboots, it is not able to connect to its iSCSI targets. No devices are found until I restart the iscsitgt process on the x4500, at which point the initiator reconnects and finds everything. I notice that the x4500 maintains an active TCP connection (according to netstat -an | grep 3260) to the Windows box through the reboot and for a long time afterwards. The initiator starts a second connection, but the target does not appear to let go of the old one. At this point, every time I reboot the Windows system I have to `pkill iscsitgtd`.

The Solaris system is running S10 Update 4. Every once in a while (twice today, and not correlated with the pkills above) the system reports that all of the iSCSI disks are unavailable. Nothing I've tried short of a reboot of the whole host brings them back. All of the zones on the system remount their zoneroots read-only (and give I/O errors when read or zlogin'd to).

There is a set of TCP connections from the zonehost to the x4500 that remains even after disabling the iscsi_initiator service. No process is holding them as far as pfiles can tell.

Does this sound familiar to anyone?
Any suggestions on what I can do to troubleshoot further? I have a kernel dump from the zonehost and a snoop capture of the wire for the Windows host (but it's big). I'll be opening a bug too.

Thanks,
--Joe
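For reference, the stale-session check described above looks like this. The sample netstat lines below are invented for illustration (addresses and counters are made up); on the live x4500 the input would be the real `netstat -an` output:

```shell
# Hypothetical excerpt of `netstat -an` on the x4500; the real check
# pipes live netstat output through the same grep.
sample='192.168.0.5.3260   192.168.0.20.49152  64240  0 64240  0 ESTABLISHED
192.168.0.5.2049   192.168.0.30.1023   64240  0 64240  0 ESTABLISHED'

# Keep only iSCSI (port 3260) sessions; a session to an initiator that
# has since rebooted lingers here until iscsitgtd is restarted.
printf '%s\n' "$sample" | grep '\.3260'
```

When a stale ESTABLISHED session to a rebooted initiator shows up in that output, the workaround is the `pkill iscsitgtd` mentioned above: the daemon drops the old session on restart and the initiator can log in again.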
tim szeto
2008-Sep-16 21:50 UTC
[zfs-discuss] [storage-discuss] iscsi target problems on snv_97
Moore, Joe wrote:
> Does this sound familiar to anyone? Any suggestions on what I can do
> to troubleshoot further? I have a kernel dump from the zonehost and a
> snoop capture of the wire for the Windows host (but it's big).

I believe the problem you're seeing might be related to a deadlock condition (CR 6745310). If you run pstack on the iSCSI target daemon, you might find a bunch of zombie threads. The fix is putback to snv-99; give snv-99 a try.

-Tim

> I'll be opening a bug too.
>
> Thanks,
> --Joe
> _______________________________________________
> storage-discuss mailing list
> storage-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/storage-discuss
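The pstack check Tim describes can be sketched as follows. The excerpt below is invented to illustrate the pattern (real pstack output from a hung iscsitgtd will differ in detail); on the live box the input would come from `pstack $(pgrep -x iscsitgtd)`:

```shell
# Invented stand-in for `pstack $(pgrep -x iscsitgtd)` output; the
# deadlock in CR 6745310 reportedly leaves threads stuck as zombies.
sample_pstack='----------- lwp# 1 / thread# 1 -----------
 feef4b47 cond_wait (8047e00, 8047e20)
----------- lwp# 7 / thread# 7 -----------
 zombie
----------- lwp# 8 / thread# 8 -----------
 zombie'

# Count zombie threads; a pile of these is the deadlock signature.
printf '%s\n' "$sample_pstack" | grep -c zombie   # prints 2 for this sample
```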
Moore, Joe
2008-Sep-17 12:58 UTC
[zfs-discuss] [storage-discuss] iscsi target problems on snv_97
> I believe the problem you're seeing might be related to a deadlock
> condition (CR 6745310). If you run pstack on the iSCSI target daemon,
> you might find a bunch of zombie threads. The fix is putback to
> snv-99; give snv-99 a try.

Yes, a pstack of the core I've generated from iscsitgtd does have a number of zombie threads.

I'm afraid I can't make heads nor tails of the bug report at http://bugs.opensolaris.org/view_bug.do?bug_id=6658836, nor its duplicate 6745310, nor any of the related bugs (all are "unavailable" except for 6676298, and the stack trace reported in that bug doesn't look anything like mine).

As far as I can tell, snv-98 is the latest build, from Sep 10 according to http://dlc.sun.com/osol/on/downloads/. So snv-99 should be out next week, correct?

Anything I can do in the meantime? Do I need to BFU to the latest nightly build? Or would just taking the iscsitgtd from that build suffice?

--Joe
tim szeto
2008-Sep-17 14:41 UTC
[zfs-discuss] [storage-discuss] iscsi target problems on snv_97
Moore, Joe wrote:
> Yes, a pstack of the core I've generated from iscsitgtd does have a
> number of zombie threads.
>
> As far as I can tell, snv-98 is the latest build, from Sep 10
> according to http://dlc.sun.com/osol/on/downloads/. So snv-99 should
> be out next week, correct?

snv-99 should be out next week.

> Anything I can do in the meantime? Do I need to BFU to the latest
> nightly build? Or would just taking the iscsitgtd from that build
> suffice?

You could try snv-98. You don't need to BFU, just get the latest iscsitgtd.

-Tim
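Tim's suggestion, sketched with stand-in paths so the steps can be shown safely: on the real box the file would be the installed iscsitgtd (commonly under /usr/sbin) and the copy would be bracketed with `svcadm disable iscsitgt` / `svcadm enable iscsitgt` — the exact path and service name are assumptions to verify against your build.

```shell
# Stand-in directories; on the real box "live" would be the installed
# location and "new" the unpacked snv-98 archive (assumed layout).
live=$(mktemp -d); new=$(mktemp -d)
echo snv97-daemon > "$live/iscsitgtd"
echo snv98-daemon > "$new/iscsitgtd"

cp "$live/iscsitgtd" "$live/iscsitgtd.snv97"   # keep a rollback copy
cp "$new/iscsitgtd" "$live/iscsitgtd"          # drop in the newer daemon
cat "$live/iscsitgtd"
```

Keeping the rollback copy matters here: if the snv-98 daemon misbehaves against the snv-97 kernel bits, one `cp` back restores the old binary.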