Jorgen Lundman
2008-Aug-11 02:04 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

Admittedly we are not having much luck with the x4500s.

This time it was the new x4500, running Solaris 10 5/08. Drive
"/pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):" stopped
responding, and even after a hard reset, it would simply repeat
"retryable", "reset", and "fatal" messages forever.

So we were unable to log in on the console. Again we ended up with the
problem of knowing which HDD is actually the broken one. It turns out to
be drive #40. (Has anyone got a map we can print? Since we couldn't boot
it, any Unix commands needed to map are a bit useless, nor do we have a
"hd" utility.)

That a HDD died in the first month of operation is understandable, but
does it really have to take the whole server with it? Not to mention
stop it from booting. Eventually the NOC staff guessed the correct drive
from the blinking of the LEDs (no LED was red), and we were able to boot.

Log outputs:

Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 670675 kern.info] NOTICE:
  marvell88sx5: device on port 3 reset: device disconnected or device error
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: device reset
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: link lost
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: link established
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 812950 kern.warning] WARNING:
  marvell88sx5: error on port 3:
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device error
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device disconnected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device connected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  EDMA self disabled
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.warning] WARNING:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):
Aug 11 08:47:59 x4500-02.unix   Error for Command: read    Error Level: Retryable
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Requested Block: 439202   Error Block: 439202
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Vendor: ATA   Serial Number:
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Sense Key: No Additional Sense
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

scrub: resilver in progress, 10.27% done, 2h14m to go

Perhaps not related, but equally annoying:

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 08:16:32.3925 64da6f29-4dda-44aa-e9ca-ad7054aaeaa1 ZFS-8000-D3
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3
# fmdump -v -u 086e6170-e4c7-c66b-c908-e37840db7e96
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3
^C^Z^\

Alas, "kill -9" does not kill fmdump either, and it appears to lock the
server as well. I will avoid that command for now, as it definitely
hangs the server every time. Hard reset done again.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Frank Leers
2008-Aug-11 02:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 7:04 PM, Jorgen Lundman wrote:

> So unable to login on console. Again we ended up with the problem of
> knowing which HDD that actually is broken. Turns out to be drive #40.
> (Has anyone got a map we can print? Since we couldn't boot it, any Unix
> commands needed to map are a bit useless, nor do we have a "hd" utility).

The 'hd' utility on the tools and drivers CD produces the attached
output on thumper.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hd_output.png
Type: image/png
Size: 101071 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080810/a9630b6f/attachment.png>
Jorgen Lundman
2008-Aug-11 02:26 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> The 'hd' utility on the tools and drivers CD produces the attached
> output on thumper.

Clearly I need to find and install this utility, but even then, it
seems to just add "yet another way" to number the drives.

The message I get from the kernel is:

"/pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):"

And I need to get the answer "40". The "hd" output additionally gives me
"sdar" .... ?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Ian Collins
2008-Aug-11 02:28 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman writes:

> So unable to login on console. Again we ended up with the problem of
> knowing which HDD that actually is broken. Turns out to be drive #40.
> (Has anyone got a map we can print? Since we couldn't boot it, any Unix
> commands needed to map are a bit useless, nor do we have a "hd" utility).

See http://www.sun.com/servers/x64/x4500/arch-wp.pdf page 21.

Ian
Jorgen Lundman
2008-Aug-11 02:39 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> See http://www.sun.com/servers/x64/x4500/arch-wp.pdf page 21.
> Ian

Referring to page 20? That does show the drive order, just like the
diagram on the box, but not how to map the kernel message to the drive
slot number.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Brent Jones
2008-Aug-11 03:06 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Sun, Aug 10, 2008 at 7:39 PM, Jorgen Lundman <lundman at gmo.jp> wrote:

> Referring to Page 20? That does show the drive order, just like it does
> on the box, but not how to map them from the kernel message to drive
> slot number.

Does the SATA controller show any information in its log (if you go into
the controller BIOS, if there is one)?

Seeing more reports of full system hangs caused by an unresponsive drive
makes me very concerned about bringing a 4500 into our environment :(

--
Brent Jones
brent at servuhome.net
Jorgen Lundman
2008-Aug-11 03:14 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> Does the SATA controller show any information in its log (if you go
> into the controller BIOS, if there is one)?

Not that I can see. Rebooting the new x4500 for the 6th time now, as it
keeps hanging on I/O. (The box is 100% idle, but any I/O commands like
zpool/zfs/fmdump will just hang.) I have absolutely no idea why it hangs
now; we have pulled out the replacement drive to see if it stays up (in
case it is a drive-channel problem).

The most disappointing aspect of all this is the incredibly poor support
we have had from our vendor (compared to the NetApp support we have had
in the past). I would have thought that being the biggest ISP in Japan
would mean we'd be interesting to Sun, even if just a little bit. I
suspect we are among the first to try the x4500 here as well.

Anyway, it has almost rebooted, so I need to go remount everything.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Frank Leers
2008-Aug-11 03:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 7:26 PM, Jorgen Lundman wrote:

> Clearly I need to find and install this utility, but even then, that
> seems to just add "yet another way" to number the drives.
>
> And I need to get the answer "40". The "hd" output additionally gives
> me "sdar" .... ?

...yeah, when run on a thumper that is booted into Linux. I attached it
to show you the drive positions. Go get it and run it on your
installation of S10.

-frank
Frank Leers
2008-Aug-11 03:25 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 8:14 PM, Jorgen Lundman wrote:

> The most disappointing aspects of all this, is the incredibly poor
> support we have had from our vendor (compared to NetApp support that
> we have had in the past). I would have thought being the biggest ISP
> in Japan would mean we'd be interesting to Sun, even if just a little
> bit. I suspect we are one the first to try x4500 here as well.

Nope, Tokyo Tech in your neighborhood has a boatload... 50 or so, IIRC.

http://www.sun.com/blueprints/0507/820-2187.pdf

Have you opened up a case with Sun?
Jorgen Lundman
2008-Aug-11 03:56 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> Anyway, it has almost rebooted, so I need to go remount everything.

Not that it wants to stay up for longer than ~20 mins before hanging
again, in that all I/O hangs, including "nfsd".

I thought this might be related:

http://sunsolve.sun.com/search/document.do?assetkey=1-66-233341-1

# /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 "vendor 0x11ab device 0x6081"
pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
 Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller

But it claims to be resolved for our version:

SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

Perhaps I should see if there are any recommended patches for Sol 10 5/08?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
James C. McPherson
2008-Aug-11 04:13 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> Not that it wants to stay up for longer than ~20 mins, then hangs. In
> that all IO hangs, including "nfsd".
>
> Perhaps I should see if there are any recommended patches for Sol 10 5/08?

One question to ask is: are you seeing the same messages on your system
that are shown in that Sunsolve doc? Not just the write errors, but the
whole sequence.

Can you force a crash dump when the system hangs? If you can, then you
could provide that to the support engineer who has accepted the call
you've already logged with Sun's support organisation.

You _did_ log a call, didn't you?

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp	http://www.jmcp.homeunix.com/blog
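For anyone who needs to do the same on a hard-hung Solaris 10/x86 box:
the usual preparation is to confirm the dump device with dumpadm while
the system is still healthy, and to arrange a way to panic the box once
it stops responding. The /etc/system tunable names below are quoted from
memory as a sketch, not verified against this exact release; check them
with your support engineer before relying on them.

```
# While the system is healthy, confirm where a panic dump would go,
# and that savecore is enabled:
dumpadm

# Candidate /etc/system entries (take effect after a reboot).
* NOTE: tunable names are an assumption from memory -- verify first.
* Panic (and therefore dump) when the service processor injects an NMI:
set pcplusmp:apic_panic_on_nmi=1
* Deadman timer: panic instead of hanging silently if the clock
* interrupt stops seeing progress:
set snooping=1
```

Either route ends in a panic rather than a silent hang, which is exactly
what you want when the alternative is a hard reset with no dump.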
Jorgen Lundman
2008-Aug-11 05:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
James C. McPherson wrote:

> One question to ask is: are you seeing the same messages on your
> system that are shown in that Sunsolve doc? Not just the write
> errors, but the whole sequence.

Unfortunately, I get no messages at all. I/O just stops. But login
shells are fine, as long as I don't issue commands that query zfs/zpool
in any way. Nothing on the console, in dmesg, or in the various log
files. I just booted with "-k" since it happens so frequently. Most
likely it is not related to that bug. Having to do hard resets (well,
from the ILOM) doesn't feel good.

> Can you force a crash dump when the system hangs? If you can, then
> you could provide that to the support engineer who has accepted the
> call you've already logged with Sun's support organisation.
>
> You _did_ log a call, didn't you?

A crash dump will come next time (30 mins or so), and we can only log a
call with our vendor, who, if they feel like it, will push it to Sun.
Although, since we do have SunSolve logins, can we bypass the middleman,
avoid the whole translation fiasco, and log directly with Sun?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Jonathan Loran
2008-Aug-11 06:00 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> # /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 "vendor 0x11ab device 0x6081"
> pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
>  Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller
>
> But it claims resolved for our version:
>
> SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc
>
> Perhaps I should see if there are any recommended patches for Sol 10 5/08?

Jorgen,

For Sol 10, you need to get the IDR patch for the Marvell controllers.
Given the crummy support you're getting, you may have problems getting
it. (Can anyone on this list help Jorgen?) From recent posts on this
list, I don't think there's an official patch yet, but if there is, get
that instead. This should greatly improve matters for you.

Jon
Jorgen Lundman
2008-Aug-11 06:08 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
So it does appear that it is zpool that hangs, possibly during
resilvering (we lost a HDD at midnight, which is what started all this).

After boot:

x4500-02:~# zpool status -x
  pool: zpool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool
        will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 11.10% done, 2h11m to go
config:

        NAME             STATE     READ WRITE CKSUM
        zpool1           DEGRADED     0     0     0
          raidz1         ONLINE       0     0     0
            [snip]
            c7t3d0       ONLINE       0     0     0
            replacing    UNAVAIL      0     0     0  insufficient replicas
              c8t3d0s0/o UNAVAIL      0     0     0  cannot open
              c8t3d0     UNAVAIL      0     0     0  cannot open
          raidz1         ONLINE       0     0     0

You can run zpool for about 4-5 minutes, then the commands start to
hang. For example, I tried to issue:

# zpool offline zpool1 c8t3d0

.. and the system stops responding.

# mdb -k
> ::ps!grep pool
R    732    722    732    662      0 0x4a004000 ffffffffb92a8030 zpool
> ffffffffb92a8030::walk thread|::findstack -v
stack pointer for thread fffffe85285d07e0: fffffe800283fc40
[ fffffe800283fc40 _resume_from_idle+0xf8() ]
  fffffe800283fc70 swtch+0x12a()
  fffffe800283fc90 cv_wait+0x68()
  fffffe800283fcc0 spa_config_enter+0x50()
  fffffe800283fce0 spa_vdev_enter+0x2a()
  fffffe800283fd10 vdev_offline+0x29()
  fffffe800283fd40 zfs_ioc_vdev_offline+0x58()
  fffffe800283fd80 zfsdev_ioctl+0x13e()
  fffffe800283fd90 cdev_ioctl+0x1d()
  fffffe800283fdb0 spec_ioctl+0x50()
  fffffe800283fde0 fop_ioctl+0x25()
  fffffe800283fec0 ioctl+0xac()
  fffffe800283ff10 sys_syscall32+0x101()

Similarly, nfsd:

> ::ps!grep nfsd
R    548      1    548    548      1 0x42000000 ffffffffb92ad6d0 nfsd
> ffffffffb92ad6d0::walk thread|::findstack -v
stack pointer for thread ffffffff9af8e540: fffffe8001046cc0
[ fffffe8001046cc0 _resume_from_idle+0xf8() ]
  fffffe8001046cf0 swtch+0x12a()
  fffffe8001046d40 cv_wait_sig_swap_core+0x177()
  fffffe8001046d50 cv_wait_sig_swap+0xb()
  fffffe8001046da0 cv_waituntil_sig+0xd7()
  fffffe8001046e50 poll_common+0x420()
  fffffe8001046ec0 pollsys+0xbe()
  fffffe8001046f10 sys_syscall32+0x101()

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Jorgen Lundman
2008-Aug-11 13:13 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
ok, so I tried installing 138053-02, and unmounting/unsharing for the
entire resilvering process; meanwhile, onsite support decided to replace
the mainboard for some reason (not that I was full of confidence here)
... and, between the two, it has actually been up for 2 hours and has a
clean "zpool status".

Going to get some sleep, and really hope it has been fixed. Thank you to
everyone who helped.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Weldon S Godfrey 3
2008-Aug-11 14:16 UTC
[zfs-discuss] 32 bit NFS clients with 64 bit ZFS server okay?
Are there any known issues with having 32-bit OS clients use NFS to
access an NFS server running a 64-bit OS and exporting a > 2 TB
filesystem? Are there any issues with using NFS v3 rather than NFS v4?

Thanks!

Weldon
Casper.Dik at Sun.COM
2008-Aug-11 14:34 UTC
[zfs-discuss] 32 bit NFS clients with 64 bit ZFS server okay?
> Are there any known issues with having 32 bit OS clients using NFS to
> access a NFS server using a 64 bit OS exporting > 2TB filesystem? Are
> there any issues with using NFS v3 over NFS v4?

The problems are not about the size of the data; it's how 32-bit clients
use the data returned. The sticky point is the use of > 32-bit offsets
for directory entries.

Casper
Frank Fischer
2008-Aug-12 14:05 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
James, one question: do you know if, and if so in which version of,
OpenSolaris this issue is solved? We have the exact same problems using
a Supermicro X7DBE with two Supermicro AOC-SAT2-MV8 controllers (we are
on snv_79).

Thanks,
Frank