thr3ads.net - zfs discuss - [zfs-discuss] How to diagnose zfs - iscsi

Jacob Ritorto

2008-Nov-07 17:11 UTC

[zfs-discuss] How to diagnose zfs - iscsi - nfs hang

I have a PC server running Solaris 10 5/08 which seems to frequently become
unable to share zfs filesystems via the shareiscsi and sharenfs options.  It
appears, from the outside, to be hung -- all clients just freeze, and while
they're able to ping the host, they're not able to transfer nfs or iSCSI
data.  They're in the same subnet and I've found no network problems thus
far.

After hearing so much about the Marvell problems I'm beginning to wonder if
they're the culprit, though they're supposed to be fixed in 127128-11, which
is the kernel I'm running.

I have an exact hardware duplicate of this machine running Nevada b91 (iirc)
that doesn't exhibit this problem.

There's nothing in /var/adm/messages and I'm not sure where else to begin.

Would someone please help me in diagnosing this failure?  

thx
jake
-- 
This message posted from opensolaris.org

Brent Jones

2008-Nov-07 20:17 UTC

I saw this in Nevada b87, where for whatever reason, CIFS and NFS would
completely hang and no longer serve requests (I don't use iSCSI, so I
can't confirm whether that hung too).
The server was responsive, SSH was fine and could execute commands,
clients could ping it and reach it, but CIFS and NFS were essentially
hung.
Intermittently, the system would recover and resume offering shares;
no triggering events could be correlated.
Since upgrading to newer builds, I haven't seen similar issues.

-- 
Brent Jones
brent at servuhome.net

Jacob Ritorto

2008-Nov-10 18:58 UTC

Thanks for the reply and corroboration, Brent.  I just liveupgraded the machine
from Solaris 10 u5 to Solaris 10 u6, which purports to have fixed all known
issues with the Marvell device, and am still experiencing the hang.  So I guess
this set of facts would imply one of:

1) they missed one, or
2) it's not a Marvell related problem.


Not sure where else to look for information about this.  Without further info, I
guess I'm essentially forced to ditch production Solaris and stick with
Nevada.  But that'd be a very blind, dismissive action on my part and
I'd really rather find out what's at play here.

A little more background/tangent:  The other filer we're running with
this exact same feature set (simultaneous iSCSI and NFS sharing out of the same
zpool), in production, is at Nevada b91 and it has never exhibited this flaw.
My intention was to install an officially supported Solaris release on the new
filer and zfs send everything from the old Nevada box to the new Solaris box to
get to a position where I could purchase Sun support.   But now I'm
obviously thinking that I can't do it.  We have like $12000 worth of
Sun contracts here but haven't added the PC filers in yet because
they're on Nevada and thus, I assumed, unsupportable.  Is that correct?
Or can I put a Nevada PC on Sun support?  (Yes, it's on the HCL.)
(Sorry for the seemingly OT question here, but I do need to find out how to get
Sun support on my zfs box, so it's at least *arguably* on-topic :)


One last thing I noticed was that the zfs version in Solaris 10 u6 is higher
than that in u5.  Any chance that an upgrade of my zpool would enable the new
features that would address this issue?

thx
jake
-- 
This message posted from opensolaris.org

Thomas Garner

2008-Nov-10 20:06 UTC

Are these machines 32-bit by chance?  I ran into similar seemingly
unexplainable hangs, which Marc correctly diagnosed and have since not
reappeared:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-August/049994.html

Thomas

Jacob Ritorto

2008-Nov-10 21:01 UTC

It's a 64-bit dual-processor, 4-core Xeon kit.  16GB RAM.
Supermicro-Marvell SATA boards featuring the same S-ATA chips as the Sun x4500.
-- 
This message posted from opensolaris.org

Jacob Ritorto

2008-Nov-10 21:07 UTC

FWIW: 

root@Erika~[17]16:01# kstat vmem::heap
module: vmem                            instance: 1     
name:   heap                            class:    vmem
        alloc                           25055
        contains                        0
        contains_search                 0
        crtime                          0
        fail                            0
        free                            7187
        lookup                          356
        mem_import                      0
        mem_inuse                       1966219264
        mem_total                       1627733884928
        populate_fail                   0
        populate_wait                   0
        search                          14135
        snaptime                        11240.434350896
        vmem_source                     0
        wait                            0
-- 
This message posted from opensolaris.org
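
A note on reading the figures above: the two counters that matter for judging heap pressure are mem_inuse and mem_total.  The following is a minimal sketch, assuming that their ratio is a fair proxy for kernel heap utilization (that reading is an assumption, not something stated in the thread).  The helper name heap_usage is hypothetical; kstat and awk are the only real tools involved.

```shell
#!/bin/sh
# heap_usage: read `kstat vmem::heap` output on stdin and print how much
# of the kernel heap arena is in use.  (Hypothetical helper, not part of
# Solaris.)
heap_usage() {
    awk '
        $1 == "mem_inuse" { inuse = $2 }
        $1 == "mem_total" { total = $2 }
        END {
            if (total > 0)
                printf "heap in use: %.2f%% (%.0f of %.0f bytes)\n",
                       100 * inuse / total, inuse, total
            else
                print "mem_total not found"
        }'
}

# Sample values copied from the kstat output posted above.
# On a live system:  kstat vmem::heap | heap_usage
heap_usage <<'EOF'
        mem_inuse                       1966219264
        mem_total                       1627733884928
EOF
```

With the values posted above this reports roughly 0.12% in use, so on that reading the heap on this box is nowhere near exhausted.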

Blake Irvin

2008-Dec-03 13:21 UTC

I'm having a very similar issue.  Just updated to 10 u6 and upgraded my
zpools.  They are fine (all 3-way mirrors), but I've lost the machine
around 12:30am two nights in a row.

I'm booting ZFS root pools, if that makes any difference.

I also don't see anything in dmesg, nothing on the console either.

I'm going to go back to the logs today to see what was going on around
midnight on these occasions.  I know there are some built-in cronjobs that run
around that time - perhaps one of them is the culprit.

What I'd really like is a way to force a core dump when the machine
hangs like this.  scat is a very nifty tool for debugging such things - but
I'm not getting a core or panic or anything :(
-- 
This message posted from opensolaris.org

Jacob Ritorto

2008-Dec-03 13:49 UTC

Update:  It would appear that the bug I was complaining about nearly a 
year ago is still at play here:  
http://opensolaris.org/jive/thread.jspa?threadID=49372&tstart=0

Unfortunate Solution:  Ditch Solaris 10 and run Nevada.  The nice folks 
in the OpenSolaris project fixed the problem a long time ago.

This means that I can't have Sun support until Nevada becomes a
real product, but it's better than having a silent failure every time
6GB crosses the wire.  My big question is why won't they fix it in
Solaris 10?  Sun's depriving themselves of my support revenue stream and
I'm stuck with an unsupportable box as my core filer.  Bad situation on
so many levels..  If it weren't for the stellar quality of the Nevada
builds (b91 uptime=132 days now with no problems), I'd not be sleeping
much at night..  Imagine my embarrassment had I taken the high road and
spent the $$$ for a Thumper for this purpose..

Tim

2008-Dec-03 14:32 UTC

Can't you just run OpenSolaris?  They've got support contracts for that,
and the bug should be fixed in 2008.11.

--Tim

Blake

2008-Dec-03 16:02 UTC

I think my problem is actually different - I'm not using iSCSI at all.
I will update if I find otherwise.

And yes, I do think there is support available for OpenSolaris now:
<http://www.sun.com/service/opensolaris/faq.xml>

Blake




max at bruningsystems.com

2008-Dec-03 19:50 UTC

Hi Blake,

Blake Irvin wrote:
> I'm having a very similar issue.  Just updated to 10 u6 and upgraded my
> zpools.  They are fine (all 3-way mirrors), but I've lost the machine
> around 12:30am two nights in a row.
>
> What I'd really like is a way to force a core dump when the machine
> hangs like this.  scat is a very nifty tool for debugging such things -
> but I'm not getting a core or panic or anything :(

You can force a dump.  Here are the steps:

Before the system is hung:

# mdb -K -F   <-- this will load kmdb and drop into it

Don't worry if your system now seems hung.
Type, carefully, with no typos:

:c   <-- and carriage-return.  You should get your prompt back

Now, when the system is hung, type F1-a (that's function key F1 and the
"a" key together).
This should put you into kmdb.  Now, type (again, no typos):

$<systemdump

This should give you a panic dump, followed by reboot (unless your
system is hard-hung).

max
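
A forced panic dump is only useful if a dump device is configured and savecore can retrieve the image after the reboot, so it may be worth checking that before the next hang.  The following is a minimal sketch, assuming the standard Solaris dumpadm utility is on the PATH; the function name show_dump_config is hypothetical, and the guard is there only so the script does not fail on a non-Solaris host.

```shell
#!/bin/sh
# show_dump_config: print the current crash-dump configuration if dumpadm
# is available; otherwise say so instead of failing.
# (Hypothetical helper; dumpadm itself is the real Solaris command.)
show_dump_config() {
    if type dumpadm >/dev/null 2>&1; then
        # With no arguments, dumpadm prints the current settings
        # (dump content, dump device, savecore directory).
        dumpadm
    else
        echo "dumpadm not found: not a Solaris host?"
    fi
}

show_dump_config
```

Run it as root on the filer before loading kmdb; if the settings look wrong, dumpadm is also the tool that changes them.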

Blake Irvin

2008-Dec-03 20:38 UTC

Thanks - however, the machine hangs and doesn't even accept console
input when this occurs.  I can't get into the kernel debugger in these
cases.

I've enabled the deadman timer instead.  I'm also using the automatic
snapshot service to get a look at things like /var/adm/sa/sa** files
that get overwritten after a hard reset.

I'm just going to stay up late tonight and see what happens :)

Blake



-- 
This message posted from opensolaris.org
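
For readers unfamiliar with the deadman timer mentioned above: on Solaris it is normally enabled by adding the line "set snooping=1" to /etc/system and rebooting.  The thread does not say how Blake enabled it, so treat that as an assumption.  The sketch below checks a file for that exact line; the function name deadman_enabled is hypothetical, and the match is deliberately strict (no extra whitespace, no commented-out lines).

```shell
#!/bin/sh
# deadman_enabled FILE: succeed if FILE contains the exact line
# "set snooping=1".  Lines starting with "*" are comments in /etc/system
# and do not match.  (Hypothetical helper.)
deadman_enabled() {
    grep '^set snooping=1$' "$1" >/dev/null 2>&1
}

# /etc/system is the real location on Solaris; pass another path as the
# first argument to check a copy elsewhere.
file=${1:-/etc/system}
if deadman_enabled "$file"; then
    echo "deadman timer enabled in $file"
else
    echo "deadman timer not enabled in $file (add: set snooping=1, then reboot)"
fi
```

The setting only takes effect at boot, so a positive result here means the timer will be armed after the next reboot, not necessarily that it is armed now.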

max at bruningsystems.com

2008-Dec-03 20:48 UTC

Hi Blake,

Blake Irvin wrote:
> Thanks - however, the machine hangs and doesn't even accept console
> input when this occurs.  I can't get into the kernel debugger in these
> cases.

Are you directly on the console, or is the console on a serial port?  If
you are running over X windows, the input might still get in, but X may
not be displaying.
If keyboard input is not getting in, your machine is probably wedged at a
high level interrupt, which sounds doubtful based on your problem
description.

> I've enabled the deadman timer instead.  I'm also using the automatic
> snapshot service to get a look at things like /var/adm/sa/sa** files
> that get overwritten after a hard reset.

If the deadman timer does not trigger, the clock is almost certainly
running, and your machine is almost certainly accepting keyboard input.

Good luck,
max

Blake Irvin

2008-Dec-03 21:47 UTC

I am directly on the console.  cde-login is disabled, so I'm dealing
with direct entry.

> Are you directly on the console, or is the console on a serial port?
> If you are running over X windows, the input might still get in, but
> X may not be displaying.
> If keyboard input is not getting in, your machine is probably wedged
> at a high level interrupt, which sounds doubtful based on your
> problem description.

Out of curiosity, why do you say that?  I'm no expert on interrupts, so
I'm curious.  It DOES seem that keyboard entry is ignored in this
situation, since I see no results from ctrl-c, for example (I had left the
console running 'tail -f /var/adm/messages').  I'm not saying you are
wrong, but if I should be examining interrupt issues, I'd like to know
(I have 3 hard disk controllers in the box, for example...)

> If the deadman timer does not trigger, the clock is almost certainly
> running, and your machine is almost certainly accepting keyboard input.

That's good to know.  I just enabled deadman after the last freeze, so
it will be a bit before I can test this (hope I don't have to).

thanks!
Blake
-- 
This message posted from opensolaris.org

max at bruningsystems.com

2008-Dec-03 22:11 UTC

Hi Blake,

Blake Irvin wrote:
> I am directly on the console.  cde-login is disabled, so I'm dealing
> with direct entry.
>
>> Are you directly on the console, or is the console on a serial port?
>> If you are running over X windows, the input might still get in, but
>> X may not be displaying.
>> If keyboard input is not getting in, your machine is probably wedged
>> at a high level interrupt, which sounds doubtful based on your
>> problem description.
>
> Out of curiosity, why do you say that?  I'm no expert on interrupts,
> so I'm curious.  It DOES seem that keyboard entry is ignored in this
> situation, since I see no results from ctrl-c, for example (I had left
> the console running 'tail -f /var/adm/messages').  I'm not saying you
> are wrong, but if I should be examining interrupt issues, I'd like to
> know (I have 3 hard disk controllers in the box, for example...)

Typing ctrl-c and having a process killed because of it are two different
actions.  The interpretation of ctrl-c as a kill character is done in a
streams module (ldterm, I believe).  This is not done at the device
interrupt handler.  I doubt you need to examine interrupts.  I was only
saying that you could try what I recommended to get a dump.  The F1-a is
handled at the driver during interrupt handling, so it should get
processed.
I have done this many times, so I am sure it works.

Blake

2008-Dec-04 02:05 UTC

Thanks Max and Chris.  I don't really want the problem to occur again, of
course, but I'll be prepared if it does.

On Wed, Dec 3, 2008 at 6:46 PM, Chris Siebenmann <cks at cs.toronto.edu>
wrote:
> You write:
> | > If keyboard input is not getting in, your machine is probably wedged
> | > at a high level interrupt, which sounds doubtful based on your
> | > problem description.
> | Out of curiosity, why do you say that?  I'm no expert on interrupts, so
> | I'm curious.  It DOES seem that keyboard entry is ignored in this
> | situation, since I see no results from ctrl-c, for example (I had left
> | the console running 'tail -f /var/adm/messages').  I'm not saying you
> | are wrong, but if I should be examining interrupt issues, I'd like to
> | know (I have 3 hard disk controllers in the box, for example...)
>
>  ^C handling requires a great deal of high-level kernel infrastructure
> to be working, far beyond basic interrupt handling. To get much visible
> reaction in a situation where nothing is producing output, for example,
> the system has to be able to get all the way to running your shell so
> that it can notice that tail has died and print the shell prompt. By
> contrast, if the console echoes '^C', you have a fair amount of
> interrupt handling.
>
>  The Solaris kernel debugger hooks in to the system at a fairly low
> level (I believe significantly lower than all of the things that have
> to be working to even echo '^C', much less get all the way to executing
> user-level code). Thus, you can get into it and force-crash your system
> even if it is otherwise fairly dead, so I think that trying is well
> worth it in your situation.
>
> ---
>        "I shall clasp my hands together and bow to the corners of the
> world."
>                        Number Ten Ox, "Bridge of Birds"
> Chris Siebenmann
> cks at cs.toronto.edu
