We're currently evaluating ZFS prior to (hopefully) rolling it out across our server room, and have managed to lock up a server after connecting to an iSCSI target and then changing the IP address of the target.

Basically we have two test Solaris servers running, and I followed the instructions in the post below to share a zpool on Server1 using the iSCSI Target, and then import that into a new zpool on Server2 (the rough command sequence is sketched at the end of this message):
http://blogs.sun.com/chrisg/date/20070418

Everything appeared to work fine until I moved the servers to a new network (while powered on), which changed their IP addresses. The server running the iSCSI Target is still fine: it still has its IP address, and from another machine I can see that the iSCSI target is still visible.

However, Server2 was not as happy with the move. As far as I can tell, all ZFS commands locked up on it. I couldn't run "zfs list", "zpool list", "zpool status" or "zpool iostat". Every single one locked up, and I couldn't even find a way to stop them. Now, I've seen a few posts about ZFS commands locking up, but this is very concerning for something we're considering using in a production system.

Anyway, with Server2 well and truly locked up, I restarted it, hoping that would clear the problem (figuring ZFS would simply mark the device as offline), but found that the server can't even boot. For the past hour it has simply spammed the following message to the screen:

"NOTICE: iscsi connection(27) unable to connect to target iqn.1986-03.com.sun:02:3d882af1-91cc-6d9e-9f19-edfa095fca6d"

Now that I wasn't expecting. This volume isn't a boot volume, the server doesn't need either ZFS or iSCSI to boot, and I don't think I even saved any data on that drive. I have found a post reporting a similar message, which described a ten-minute boot delay with a working iSCSI volume; however, I can't find anything to say what happens if the iSCSI volume is no longer there:
http://forum.java.sun.com/thread.jspa?threadID=5243777&messageID=10004717

So, I have quite a few questions:

1. Does anybody know how I can recover from this, or am I going to have to wipe my test server and start again?

2. How vulnerable are the ZFS admin tools to locking up like this?

3. How vulnerable is the iSCSI client to locking up like this during boot?

4. Is there any way we can disconnect the iSCSI share while ZFS is locked up like this? What could I have tried to regain control of my server before rebooting?

5. If I can get the server booted, is there any way to redirect an iSCSI volume so it's pointing at the new IP address? (I was expecting to simply do a "zpool replace" once ZFS reported the drive as missing.)

And finally, does anybody know why "zpool status" should lock up like this? I'm really not happy that the ZFS admin tools seem so fragile. At the very least I would have expected "zpool status" to be able to list the devices attached to the pools and report that they are timing out or erroring, and to be able to use the other ZFS tools to forcibly remove failed drives as needed. Anything less means I'm risking my whole system should ZFS find something it doesn't like.

I admit I'm a Solaris newbie, but surely something designed as a robust filesystem also needs robust management tools?
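For reference, the sequence I followed from that blog post was roughly the following; the pool and volume names, the IP address and the device name are illustrative, not our exact configuration:

    # Server1: create a ZFS volume and share it as an iSCSI target
    zfs create -V 20g tank/iscsivol
    zfs set shareiscsi=on tank/iscsivol
    iscsitadm list target                 # note the target's IQN

    # Server2: discover the target and build a pool on the new LUN
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.0.10:3260
    devfsadm -i iscsi                     # create device nodes for the LUN
    format                                # note the new disk's cXtYdZ name
    zpool create tank2 c2t01000003BAd0    # placeholder; use the name format reports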
Victor Engle
2008-Feb-05 17:13 UTC
[zfs-discuss] ZFS hang and boot hang when iSCSI device removed
I don't think this is so much a ZFS problem as an iSCSI initiator problem. Are you using static configs or SendTargets discovery? There are many reports of SendTargets discovery misbehaving in the storage-discuss forum.

To recover:

1. Boot into single user from CD
2. Mount the root slice on /a
3. rm /a/etc/iscsi/*
4. Reboot
5. Configure iSCSI static discovery for the new target IPs

A nice trick mentioned by David Weibel previously on storage-discuss is to use discovery addresses to provide all the info you need to create the static configs. Just add the discovery addresses, but don't enable SendTargets. Then run "iscsiadm list discovery-address -v". The initiator will log in to the discovery address, issue a SendTargets "all" request, and print the results on stdout. Use the results to create the static configs, and then enable static discovery. A rough command sequence is sketched below.

Good Luck,
Vic
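P.S. To make the steps concrete, here is a minimal sketch; the root slice, discovery address and target IQN below are placeholders, not taken from your setup:

    # from the single-user CD boot: mount the root slice, clear the initiator configs
    mount /dev/dsk/c0t0d0s0 /a        # example root slice; use yours
    rm /a/etc/iscsi/*
    reboot

    # after reboot: add the new discovery address, but leave SendTargets disabled
    iscsiadm add discovery-address 192.168.1.20:3260

    # log in to the discovery address and print the targets it reports
    iscsiadm list discovery-address -v

    # create a static config from the reported IQN, address and port...
    iscsiadm add static-config iqn.1986-03.com.sun:02:example-target,192.168.1.20:3260

    # ...then enable static discovery only
    iscsiadm modify discovery --static enable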
Vic Engle
2008-Feb-05 21:54 UTC
[zfs-discuss] ZFS hang and boot hang when iSCSI device removed
Note - this is the same reply I sent via email several hours ago; it never posted to the thread here, so it may end up appearing twice.
Yes, I've learnt that I get the e-mail reply a long while before it appears on the boards. I'm not entirely sure how these boards are run; it's certainly odd for somebody used to forums rather than mailing lists, but they do seem to work eventually :)

Thanks for the help Vic, will try to get back into that server this morning.