We're currently evaluating ZFS prior to (hopefully) rolling it out across our server room, and have managed to lock up a server after connecting to an iSCSI target and then changing the IP address of the target.

Basically we have two test Solaris servers running, and I followed the instructions in the post below to share a zpool on Server1 using the iSCSI Target, and then import that into a new zpool on Server2 (the rough command sequence is sketched at the end of this message):
http://blogs.sun.com/chrisg/date/20070418

Everything appeared to work fine until I moved the servers to a new network (while powered on), which changed their IP addresses. The server running the iSCSI Target is still fine: it still has its IP address, and from another machine I can see that the iSCSI target is still visible.

However, Server2 was not as happy with the move. As far as I can tell, all ZFS commands locked up on it. I couldn't run "zfs list", "zpool list", "zpool status" or "zpool iostat". Every single one locked up, and I couldn't even find a way to stop them. Now, I've seen a few posts about ZFS commands locking up, but this is very concerning for something we're considering using in a production system.

Anyway, with Server2 well and truly locked up, I restarted it, hoping that would clear the problem (figuring ZFS would simply mark the device as offline), but found that the server can't even boot. For the past hour it has simply spammed the following message to the screen:

"NOTICE: iscsi connection(27) unable to connect to target iqn.1986-03.com.sun:02:3d882af1-91cc-6d9e-9f19-edfa095fca6d"

Now that I wasn't expecting. This volume isn't a boot volume, the server doesn't need either ZFS or iSCSI to boot, and I don't think I even saved any data on that drive. I have found a post reporting a similar message, which described a ten-minute boot delay with a working iSCSI volume; however, I can't find anything to say what happens if the iSCSI volume is no longer there:
http://forum.java.sun.com/thread.jspa?threadID=5243777&messageID=10004717

So, I have quite a few questions:

1. Does anybody know how I can recover from this, or am I going to have to wipe my test server and start again?

2. How vulnerable are the ZFS admin tools to locking up like this?

3. How vulnerable is the iSCSI client to locking up like this during boot?

4. Is there any way we can disconnect the iSCSI share while ZFS is locked up like this? What could I have tried to regain control of my server before rebooting?

5. If I can get the server booted, is there any way to redirect an iSCSI volume so it's pointing at the new IP address? (I was expecting to simply do a "zpool replace" once ZFS reported the drive as missing.)

And finally, does anybody know why "zpool status" should lock up like this? I'm really not happy that the ZFS admin tools seem so fragile. At the very least I would have expected "zpool status" to be able to list the devices attached to the pools and report that they are timing out or erroring, and to be able to use the other ZFS tools to forcibly remove failed drives as needed. Anything less means I'm risking my whole system should ZFS find something it doesn't like.

I admit I'm a Solaris newbie, but surely something designed as a robust filesystem also needs robust management tools?
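For reference, the sequence I followed from that blog post was roughly the following; the pool and volume names, the IP address and the device name are illustrative, not our exact configuration:

    # Server1: create a ZFS volume and share it as an iSCSI target
    zfs create -V 20g tank/iscsivol
    zfs set shareiscsi=on tank/iscsivol
    iscsitadm list target                 # note the target's IQN

    # Server2: discover the target and build a pool on the new LUN
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.0.10:3260
    devfsadm -i iscsi                     # create device nodes for the LUN
    format                                # note the new disk's cXtYdZ name
    zpool create tank2 c2t01000003BAd0    # placeholder; use the name format reports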
Victor Engle
2008-Feb-05 17:13 UTC
[zfs-discuss] ZFS hang and boot hang when iSCSI device removed
I don't think this is so much a ZFS problem as an iSCSI initiator problem. Are you using static configs or SendTargets discovery? There are many reports of SendTargets discovery misbehaving in the storage-discuss forum.

To recover:

1. Boot into single user from CD
2. Mount the root slice on /a
3. rm /a/etc/iscsi/*
4. Reboot
5. Configure iSCSI static discovery for the new target IPs

A nice trick mentioned by David Weibel previously on storage-discuss is to use discovery addresses to provide all the info you need to create the static configs. Just add the discovery addresses, but don't enable SendTargets. Then run "iscsiadm list discovery-address -v". The initiator will log in to the discovery address, issue a SendTargets "all" request, and print the results on stdout. Use the results to create the static configs, and then enable static discovery. A rough command sequence is sketched below.

Good Luck,
Vic
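P.S. To make the steps concrete, here is a minimal sketch; the root slice, discovery address and target IQN below are placeholders, not taken from your setup:

    # from the single-user CD boot: mount the root slice, clear the initiator configs
    mount /dev/dsk/c0t0d0s0 /a        # example root slice; use yours
    rm /a/etc/iscsi/*
    reboot

    # after reboot: add the new discovery address, but leave SendTargets disabled
    iscsiadm add discovery-address 192.168.1.20:3260

    # log in to the discovery address and print the targets it reports
    iscsiadm list discovery-address -v

    # create a static config from the reported IQN, address and port...
    iscsiadm add static-config iqn.1986-03.com.sun:02:example-target,192.168.1.20:3260

    # ...then enable static discovery only
    iscsiadm modify discovery --static enable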
Vic Engle
2008-Feb-05 21:54 UTC
[zfs-discuss] ZFS hang and boot hang when iSCSI device removed
Note - this is the same reply I sent via email several hours ago; it never posted to the thread here, so it may end up appearing twice.
Yes, I've learnt that I get the e-mail reply a long while before it appears on the boards. I'm not entirely sure how these boards are run; it's certainly odd for somebody used to forums rather than mailing lists, but they do seem to work eventually :)

Thanks for the help Vic, will try to get back into that server this morning.