I have a PC server running Solaris 10 5/08 which seems to frequently become unable to share zfs filesystems via the shareiscsi and sharenfs options. It appears, from the outside, to be hung -- all clients just freeze, and while they're able to ping the host, they're not able to transfer NFS or iSCSI data. They're in the same subnet and I've found no network problems thus far.

After hearing so much about the Marvell problems I'm beginning to wonder if they're the culprit, though they're supposed to be fixed in 127128-11, which is the kernel I'm running.

I have an exact hardware duplicate of this machine running Nevada b91 (iirc) that doesn't exhibit this problem.

There's nothing in /var/adm/messages and I'm not sure where else to begin.

Would someone please help me in diagnosing this failure?

thx
jake
--
This message posted from opensolaris.org
On Fri, Nov 7, 2008 at 9:11 AM, Jacob Ritorto <jacob.ritorto at gmail.com> wrote:
> I have a PC server running Solaris 10 5/08 which seems to frequently become unable to share zfs filesystems via the shareiscsi and sharenfs options. It appears, from the outside, to be hung -- all clients just freeze, and while they're able to ping the host, they're not able to transfer nfs or iSCSI data.

I saw this in Nevada b87, where for whatever reason CIFS and NFS would completely hang and no longer serve requests (I don't use iSCSI, so I can't confirm whether that had hung too). The server was responsive, SSH was fine and could execute commands, clients could ping it and reach it, but CIFS and NFS were essentially hung.

Intermittently, the system would recover and resume offering shares; I could not correlate this with any triggering event.

Since upgrading to newer builds, I haven't seen similar issues.

--
Brent Jones
brent at servuhome.net
Thanks for the reply and corroboration, Brent. I just liveupgraded the machine from Solaris 10 u5 to Solaris 10 u6, which purports to have fixed all known issues with the Marvell device, and am still experiencing the hang. So I guess this set of facts would imply one of:

1) they missed one, or
2) it's not a Marvell-related problem.

Not sure where else to look for information about this. Without further info, I guess I'm essentially forced to ditch production Solaris and stick with Nevada. But that'd be a very blind, dismissive action on my part and I'd really rather find out what's at play here.

A little more background / tangent: The other filer we're running with this exact same feature set (simultaneous iSCSI and NFS sharing out of the same zpool), in production, is at Nevada b91 and it has never exhibited this flaw. My intention was to install an officially supported Solaris release on the new filer and zfs send everything from the old Nevada box to the new Solaris box to get to a position where I could purchase Sun support. But now I'm obviously thinking that I can't do it.

We have like $12000 worth of Sun contracts here but haven't added the PC filers in yet because they're on Nevada and thus, I assumed, unsupportable. Is that correct? Or can I put a Nevada PC on Sun support? (Yes, it's on the HCL.) (Sorry for the seemingly OT question here, but I do need to find out how to get Sun support on my zfs box, so it's at least *arguably* on-topic :)

One last thing I noticed was that the zfs version in Solaris 10 u6 is higher than that in u5. Any chance that an upgrade of my zpool would enable new features that would address this issue?

thx
jake
Are these machines 32-bit by chance? I ran into similar seemingly unexplainable hangs, which Marc correctly diagnosed and which have not reappeared since:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-August/049994.html

Thomas
It's a 64-bit, dual-processor, 4-core Xeon kit. 16GB RAM. Supermicro-Marvell SATA boards featuring the same S-ATA chips as the Sun x4500.
FWIW:

root at Erika~[17]16:01#kstat vmem::heap
module: vmem                            instance: 1
name:   heap                            class:    vmem
        alloc                           25055
        contains                        0
        contains_search                 0
        crtime                          0
        fail                            0
        free                            7187
        lookup                          356
        mem_import                      0
        mem_inuse                       1966219264
        mem_total                       1627733884928
        populate_fail                   0
        populate_wait                   0
        search                          14135
        snaptime                        11240.434350896
        vmem_source                     0
        wait                            0
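One way to read the numbers above, assuming the concern behind the 32-bit question is kernel heap exhaustion, is to compare mem_inuse with mem_total. The sketch below is not from the thread: `heap_pct` is a hypothetical helper name, and it expects the parseable `module:instance:name:statistic<TAB>value` lines that `kstat -p` prints rather than the tabular form pasted above.

```shell
#!/bin/sh
# Sketch: what percentage of the kernel heap arena is in use?
# heap_pct (hypothetical name) reads "kstat -p" style lines on stdin.
heap_pct() {
    awk '
        # The statistic name is the last colon-separated part of field 1.
        { n = split($1, key, ":"); stat[key[n]] = $2 }
        END {
            if (stat["mem_total"] > 0)
                printf "%.2f\n", 100 * stat["mem_inuse"] / stat["mem_total"]
            else
                exit 1    # mem_total missing or zero: nothing to report
        }'
}

# The two values are copied from the kstat output in this message.
printf 'vmem:1:heap:mem_inuse\t1966219264\nvmem:1:heap:mem_total\t1627733884928\n' | heap_pct
```

With the figures from this box the sample call prints 0.12, i.e. about 0.12% of the heap arena in use at the time of the snapshot. On a live Solaris system the equivalent would be `kstat -p vmem::heap | heap_pct`; a single snapshot taken while the machine is healthy says little about its state during a hang.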
I'm having a very similar issue. Just updated to 10 u6 and upgraded my zpools. They are fine (all 3-way mirrors), but I've lost the machine around 12:30am two nights in a row.

I'm booting ZFS root pools, if that makes any difference.

I also don't see anything in dmesg, and nothing on the console either.

I'm going to go back to the logs today to see what was going on around midnight on these occasions. I know there are some built-in cron jobs that run around that time - perhaps one of them is the culprit.

What I'd really like is a way to force a core dump when the machine hangs like this. scat is a very nifty tool for debugging such things - but I'm not getting a core or panic or anything :(
Update: It would appear that the bug I was complaining about nearly a year ago is still at play here: http://opensolaris.org/jive/thread.jspa?threadID=49372&tstart=0

Unfortunate Solution: Ditch Solaris 10 and run Nevada. The nice folks in the OpenSolaris project fixed the problem a long time ago.

This means that I can't have Sun support until Nevada becomes a real product, but it's better than having a silent failure every time 6GB crosses the wire. My big question is why won't they fix it in Solaris 10? Sun's depriving themselves of my support revenue stream and I'm stuck with an unsupportable box as my core filer. Bad situation on so many levels. If it weren't for the stellar quality of the Nevada builds (b91 uptime=132 days now with no problems), I'd not be sleeping much at night. Imagine my embarrassment had I taken the high road and spent the $$$ for a Thumper for this purpose.
On Wed, Dec 3, 2008 at 7:49 AM, Jacob Ritorto <jacob.ritorto at gmail.com> wrote:
> Unfortunate Solution: Ditch Solaris 10 and run Nevada. The nice folks
> in the OpenSolaris project fixed the problem a long time ago.
>
> This means that I can't have Sun support until Nevada becomes a
> real product, but it's better than having a silent failure every time
> 6GB crosses the wire.

Can't you just run OpenSolaris? They've got support contracts for that, and the bug should be fixed in 2008.11.

--Tim
I think my problem is actually different - I'm not using iSCSI at all. I will update if I find otherwise.

And yes, I do think there is support available for OpenSolaris now:

<http://www.sun.com/service/opensolaris/faq.xml>

Blake

On Wed, Dec 3, 2008 at 9:32 AM, Tim <tim at tcsac.net> wrote:
> Can't you just run opensolaris? They've got support contracts for that, and
> the bug should be fixed in 2008.11.
>
> --Tim
max at bruningsystems.com
2008-Dec-03 19:50 UTC
[zfs-discuss] How to diagnose zfs - iscsi - nfs hang
Hi Blake,

Blake Irvin wrote:
> I'm having a very similar issue. Just updated to 10 u6 and upgrade my zpools. They are fine (all 3-way mirors), but I've lost the machine around 12:30am two nights in a row.
>
> What I'd really like is a way to force a core dump when the machine hangs like this. scat is a very nifty tool for debugging such things - but I'm not getting a core or panic or anything :(

You can force a dump. Here are the steps:

Before the system is hung:

# mdb -K -F    <-- this will load kmdb and drop into it

Don't worry if your system now seems hung. Type, carefully, with no typos:

:c    <-- and carriage-return. You should get your prompt back

Now, when the system is hung, type F1-a (that's function key F1 and the "a" key together). This should put you into kmdb. Now type (again, no typos):

$<systemdump

This should give you a panic dump, followed by a reboot (unless your system is hard-hung).

max
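A forced dump is only useful if savecore picks it up after the reboot, so it may be worth checking the dump configuration while the machine is still healthy. The following is a minimal sketch and not part of Max's procedure: `savecore_dir` is a hypothetical helper name, and it assumes that `dumpadm`, run without arguments, reports lines labelled "Savecore directory:" and "Savecore enabled:".

```shell
#!/bin/sh
# Sketch: read a dumpadm report on stdin; print the savecore directory and
# succeed only if savecore is enabled. Assumes the "label: value" layout
# described in the lead-in; savecore_dir is a hypothetical name.
savecore_dir() {
    awk -F': ' '
        /Savecore directory/ { dir = $2 }
        /Savecore enabled/   { enabled = $2 }
        END {
            if (enabled == "yes" && dir != "") { print dir; exit 0 }
            exit 1
        }'
}

# Only meaningful on a system that has dumpadm; skipped elsewhere.
if command -v dumpadm >/dev/null 2>&1; then
    if dir=$(dumpadm 2>/dev/null | savecore_dir); then
        echo "crash dumps will be saved under $dir"
    else
        echo "savecore does not appear to be enabled (or dumpadm needs root)"
    fi
fi
```

After a forced dump and reboot, the saved unix.N and vmcore.N files in that directory are what a tool such as scat or mdb would be pointed at.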
Thanks - however, the machine hangs and doesn't even accept console input when this occurs. I can't get into the kernel debugger in these cases.

I've enabled the deadman timer instead. I'm also using the automatic snapshot service to get a look at things like /var/adm/sa/sa** files that get overwritten after a hard reset.

I'm just going to stay up late tonight and see what happens :)

Blake
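Blake does not say how he enabled the deadman timer. The usual way on Solaris is a `set snooping=1` line in /etc/system followed by a reboot, and the sketch below assumes that method. `deadman_enabled` is a hypothetical helper name; it only inspects the file text and cannot tell whether the running kernel was booted with the setting.

```shell
#!/bin/sh
# Sketch: succeed if /etc/system-style text on stdin contains an active
# "set snooping=1" line. Assumes the deadman timer was enabled that way;
# deadman_enabled is a hypothetical name.
deadman_enabled() {
    # Drop comment lines (they begin with "*" or "#"), then look for the
    # setting, allowing optional whitespace around "=".
    grep -v '^[[:space:]]*[*#]' |
        grep -Eq '^[[:space:]]*set[[:space:]]+snooping[[:space:]]*=[[:space:]]*1[[:space:]]*$'
}

# Only meaningful where /etc/system exists; skipped elsewhere.
if [ -r /etc/system ]; then
    if deadman_enabled < /etc/system; then
        echo "deadman timer is set in /etc/system"
    else
        echo "no active 'set snooping=1' line found in /etc/system"
    fi
fi
```

A change to /etc/system takes effect only at the next boot, so a freeze that happens before the reboot would not trip the timer.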
max at bruningsystems.com
2008-Dec-03 20:48 UTC
[zfs-discuss] How to diagnose zfs - iscsi - nfs hang
Hi Blake,

Blake Irvin wrote:
> Thanks - however, the machine hangs and doesn't even accept console input when this occurs. I can't get into the kernel debugger in these cases.

Are you directly on the console, or is the console on a serial port? If you are running over X windows, the input might still get in, but X may not be displaying. If keyboard input is not getting in, your machine is probably wedged at a high level interrupt, which sounds doubtful based on your problem description.

> I've enabled the deadman timer instead. I'm also using the automatic snapshot service to get a look at things like /var/adm/sa/sa** files that get overwritten after a hard reset.

If the deadman timer does not trigger, the clock is almost certainly running, and your machine is almost certainly accepting keyboard input.

Good luck,
max

> I'm just going to stay up late tonight and see what happens :)
>
> Blake
I am directly on the console. cde-login is disabled, so I'm dealing with direct entry.

> Are you directly on the console, or is the console on a serial port? If
> you are running over X windows, the input might still get in, but X may
> not be displaying. If keyboard input is not getting in, your machine is
> probably wedged at a high level interrupt, which sounds doubtful based
> on your problem description.

Out of curiosity, why do you say that? I'm no expert on interrupts, so I'm curious. It DOES seem that keyboard entry is ignored in this situation, since I see no results from ctrl-c, for example (I had left the console running 'tail -f /var/adm/messages'). I'm not saying you are wrong, but if I should be examining interrupt issues, I'd like to know (I have 3 hard disk controllers in the box, for example...)

> If the deadman timer does not trigger, the clock is almost certainly
> running, and your machine is almost certainly accepting keyboard input.

That's good to know. I just enabled deadman after the last freeze, so it will be a bit before I can test this (hope I don't have to).

thanks!
Blake

> Good luck,
> max
max at bruningsystems.com
2008-Dec-03 22:11 UTC
[zfs-discuss] How to diagnose zfs - iscsi - nfs hang
Hi Blake,

Blake Irvin wrote:
> I am directly on the console. cde-login is disabled, so i'm dealing
> with direct entry.
>
> Out of curiosity, why do you say that? I'm no expert on interrupts,
> so I'm curious. It DOES seem that keyboard entry is ignored in this
> situation, since I see no results from ctrl-c, for example (I had left
> the console running 'tail -f /var/adm/messages'). I'm not saying you
> are wrong, but if I should be examining interrupt issues, I'd like to
> know (I have 3 hard disk controllers in the box, for example...)

Typing ctrl-c and having a process killed because of it are 2 different actions. The interpretation of ctrl-c as a kill character is done in a streams module (ldterm, I believe). This is not done at the device interrupt handler. I doubt you need to examine interrupts. I was only saying that you could try what I recommended to get a dump. The F1-a is handled at the driver during interrupt handling, so it should get processed. I have done this many times, so I am sure it works.

>> If the deadman timer does not trigger, the clock is
>> almost certainly running, and your machine is
>> almost certainly accepting keyboard input.
>
> That's good to know. I just enabled deadman after the last freeze, so
> it will be a bit before I can test this (hope I don't have to).
Thanks Max and Chris. I don't really want the problem to occur again, of course, but I'll be prepared if it does.

On Wed, Dec 3, 2008 at 6:46 PM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
> You write:
> | > If keyboard input is not getting in, your machine is probably wedged
> | > at a high level interrupt, which sounds doubtful based on your
> | > problem description.
> | Out of curiosity, why do you say that? I'm no expert on interrupts, so
> | I'm curious. It DOES seem that keyboard entry is ignored in this
> | situation, since I see no results from ctrl-c, for example (I had left
> | the console running 'tail -f /var/adm/messages'). I'm not saying you
> | are wrong, but if I should be examining interrupt issues, I'd like to
> | know (I have 3 hard disk controllers in the box, for example...)
>
> ^C handling requires a great deal of high-level kernel infrastructure
> to be working, far beyond basic interrupt handling. To get much visible
> reaction in a situation where nothing is producing output, for example,
> the system has to be able to get all the way to running your shell so
> that it can notice that tail has died and print the shell prompt. By
> contrast, if the console echoes '^C', you have a fair amount of
> interrupt handling.
>
> The Solaris kernel debugger hooks in to the system at a fairly low
> level (I believe significantly lower than all of the things that have
> to be working to even echo '^C', much less get all the way to executing
> user-level code). Thus, you can get into it and force-crash your system
> even if it is otherwise fairly dead, so I think that trying is well
> worth it in your situation.
>
> ---
> "I shall clasp my hands together and bow to the corners of the
> world."
> Number Ten Ox, "Bridge of Birds"
> Chris Siebenmann
> cks at cs.toronto.edu