Jorgen Lundman
2008-Aug-11 02:04 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

Admittedly we are not having much luck with the x4500s.

This time it was the new x4500, running Solaris 10 5/08. Drive
"/pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):" stopped
responding, and even after a hard reset, it would simply repeat
"retryable", "reset", and "fatal" messages forever.

So we were unable to log in on the console. Again we ended up with the
problem of knowing which HDD is actually the broken one. It turns out to
be drive #40. (Has anyone got a map we can print? Since we couldn't boot
it, any Unix commands needed to map are a bit useless, nor do we have a
"hd" utility.)

That a HDD died in the first month of operation is understandable, but
does it really have to take the whole server with it? Not to mention
stop it from booting. Eventually the NOC staff guessed the correct drive
from the blinking of the LEDs (no LED was red), and we were able to boot.

Log outputs:

Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 670675 kern.info] NOTICE:
  marvell88sx5: device on port 3 reset: device disconnected or device error
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: device reset
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: link lost
Aug 11 08:47:59 x4500-02.unix sata: [ID 801593 kern.notice] NOTICE:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1:
Aug 11 08:47:59 x4500-02.unix   port 3: link established
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 812950 kern.warning] WARNING:
  marvell88sx5: error on port 3:
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device error
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device disconnected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  device connected
Aug 11 08:47:59 x4500-02.unix marvell88sx: [ID 517869 kern.info]  EDMA self disabled
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.warning] WARNING:
  /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):
Aug 11 08:47:59 x4500-02.unix   Error for Command: read    Error Level: Retryable
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Requested Block: 439202   Error Block: 439202
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Vendor: ATA   Serial Number:
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   Sense Key: No Additional Sense
Aug 11 08:47:59 x4500-02.unix scsi: [ID 107833 kern.notice]   ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

scrub: resilver in progress, 10.27% done, 2h14m to go

Perhaps not related, but equally annoying:

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 08:16:32.3925 64da6f29-4dda-44aa-e9ca-ad7054aaeaa1 ZFS-8000-D3
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3
# fmdump -v -u 086e6170-e4c7-c66b-c908-e37840db7e96
TIME                 UUID                                 SUNW-MSG-ID
Aug 11 09:08:18.7834 086e6170-e4c7-c66b-c908-e37840db7e96 ZFS-8000-D3
^C^Z^\

Alas, "kill -9" does not kill fmdump either, and it appears to lock the
server as well. I will avoid that command for now, as it definitely
hangs the server every time. Hard reset done again.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Frank Leers
2008-Aug-11 02:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 7:04 PM, Jorgen Lundman wrote:

> So unable to login on console. Again we ended up with the problem of
> knowing which HDD that actually is broken. Turns out to be drive #40.
> (Has anyone got a map we can print? Since we couldn't boot it, any Unix
> commands needed to map are a bit useless, nor do we have a "hd" utility).

The 'hd' utility on the tools and drivers CD produces the attached
output on thumper.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hd_output.png
Type: image/png
Size: 101071 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080810/a9630b6f/attachment.png>
Jorgen Lundman
2008-Aug-11 02:26 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> The 'hd' utility on the tools and drivers CD produces the attached
> output on thumper.

Clearly I need to find and install this utility, but even then, it
seems to just add "yet another way" to number the drives.

The message I get from the kernel is:

"/pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@3,0 (sd30):"

And I need to get the answer "40". The "hd" output additionally gives me
"sdar" .... ?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Ian Collins
2008-Aug-11 02:28 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman writes:

> So unable to login on console. Again we ended up with the problem of
> knowing which HDD that actually is broken. Turns out to be drive #40.
> (Has anyone got a map we can print? Since we couldn't boot it, any Unix
> commands needed to map are a bit useless, nor do we have a "hd" utility).

See http://www.sun.com/servers/x64/x4500/arch-wp.pdf page 21.

Ian
Jorgen Lundman
2008-Aug-11 02:39 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> See http://www.sun.com/servers/x64/x4500/arch-wp.pdf page 21.
> Ian

Referring to page 20? That does show the drive order, just like the
diagram on the box, but not how to map the kernel message to the drive
slot number.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Brent Jones
2008-Aug-11 03:06 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Sun, Aug 10, 2008 at 7:39 PM, Jorgen Lundman <lundman at gmo.jp> wrote:

> Referring to Page 20? That does show the drive order, just like it does
> on the box, but not how to map them from the kernel message to drive
> slot number.

Does the SATA controller show any information in its log (if you go into
the controller BIOS, if there is one)?

Seeing more reports of full system hangs caused by an unresponsive drive
makes me very concerned about bringing a 4500 into our environment :(

--
Brent Jones
brent at servuhome.net
Jorgen Lundman
2008-Aug-11 03:14 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
> Does the SATA controller show any information in its log (if you go
> into the controller BIOS, if there is one)?

Not that I can see. Rebooting the new x4500 for the 6th time now, as it
keeps hanging on I/O. (The box is 100% idle, but any I/O commands like
zpool/zfs/fmdump will just hang.) I have absolutely no idea why it hangs
now; we have pulled out the replacement drive to see if it stays up (in
case it is a drive-channel problem).

The most disappointing aspect of all this is the incredibly poor support
we have had from our vendor (compared to the NetApp support we have had
in the past). I would have thought that being the biggest ISP in Japan
would mean we'd be interesting to Sun, even if just a little bit. I
suspect we are among the first to try the x4500 here as well.

Anyway, it has almost rebooted, so I need to go remount everything.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Frank Leers
2008-Aug-11 03:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 7:26 PM, Jorgen Lundman wrote:

> Clearly I need to find and install this utility, but even then, that
> seems to just add "yet another way" to number the drives.
>
> And I need to get the answer "40". The "hd" output additionally gives
> me "sdar" .... ?

...yeah, when run on a thumper that is booted into Linux. I attached it
to show you the drive positions. Go get it and run it on your
installation of S10.

-frank
Frank Leers
2008-Aug-11 03:25 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
On Aug 10, 2008, at 8:14 PM, Jorgen Lundman wrote:

> The most disappointing aspects of all this, is the incredibly poor
> support we have had from our vendor (compared to NetApp support that
> we have had in the past). I would have thought being the biggest ISP
> in Japan would mean we'd be interesting to Sun, even if just a little
> bit. I suspect we are one the first to try x4500 here as well.

Nope, Tokyo Tech in your neighborhood has a boatload... 50 or so, IIRC.

http://www.sun.com/blueprints/0507/820-2187.pdf

Have you opened up a case with Sun?
Jorgen Lundman
2008-Aug-11 03:56 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> Anyway, it has almost rebooted, so I need to go remount everything.

Not that it wants to stay up for longer than ~20 mins before hanging
again, in that all I/O hangs, including "nfsd".

I thought this might be related:

http://sunsolve.sun.com/search/document.do?assetkey=1-66-233341-1

# /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 "vendor 0x11ab device 0x6081"
pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
 Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller

But it claims to be resolved for our version:

SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

Perhaps I should see if there are any recommended patches for Sol 10 5/08?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
James C. McPherson
2008-Aug-11 04:13 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> Not that it wants to stay up for longer than ~20 mins, then hangs. In
> that all IO hangs, including "nfsd".
>
> Perhaps I should see if there are any recommended patches for Sol 10 5/08?

One question to ask is: are you seeing the same messages on your system
that are shown in that Sunsolve doc? Not just the write errors, but the
whole sequence.

Can you force a crash dump when the system hangs? If you can, then you
could provide that to the support engineer who has accepted the call
you've already logged with Sun's support organisation.

You _did_ log a call, didn't you?

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp	http://www.jmcp.homeunix.com/blog
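For anyone who needs to do the same on a hard-hung Solaris 10/x86 box:
the usual preparation is to confirm the dump device with dumpadm while
the system is still healthy, and to arrange a way to panic the box once
it stops responding. The /etc/system tunable names below are quoted from
memory as a sketch, not verified against this exact release; check them
with your support engineer before relying on them.

```
# While the system is healthy, confirm where a panic dump would go,
# and that savecore is enabled:
dumpadm

# Candidate /etc/system entries (take effect after a reboot).
* NOTE: tunable names are an assumption from memory -- verify first.
* Panic (and therefore dump) when the service processor injects an NMI:
set pcplusmp:apic_panic_on_nmi=1
* Deadman timer: panic instead of hanging silently if the clock
* interrupt stops seeing progress:
set snooping=1
```

Either route ends in a panic rather than a silent hang, which is exactly
what you want when the alternative is a hard reset with no dump.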
Jorgen Lundman
2008-Aug-11 05:18 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
James C. McPherson wrote:

> One question to ask is: are you seeing the same messages on your
> system that are shown in that Sunsolve doc? Not just the write
> errors, but the whole sequence.

Unfortunately, I get no messages at all. I/O just stops. But login
shells are fine, as long as I don't issue commands that query zfs/zpool
in any way. Nothing on the console, in dmesg, or in the various log
files. I just booted with "-k" since it happens so frequently. Most
likely it is not related to that bug. Having to do hard resets (well,
from the ILOM) doesn't feel good.

> Can you force a crash dump when the system hangs? If you can, then
> you could provide that to the support engineer who has accepted the
> call you've already logged with Sun's support organisation.
>
> You _did_ log a call, didn't you?

A crash dump will come next time (30 mins or so), and we can only log a
call with our vendor, who, if they feel like it, will push it to Sun.
Although, since we do have SunSolve logins, can we bypass the middleman,
avoid the whole translation fiasco, and log directly with Sun?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Jonathan Loran
2008-Aug-11 06:00 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
Jorgen Lundman wrote:

> # /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 "vendor 0x11ab device 0x6081"
> pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
>  Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller
>
> But it claims resolved for our version:
>
> SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc
>
> Perhaps I should see if there are any recommended patches for Sol 10 5/08?

Jorgen,

For Sol 10, you need to get the IDR patch for the Marvell controllers.
Given the crummy support you're getting, you may have problems getting
it. (Can anyone on this list help Jorgen?) From recent posts on this
list, I don't think there's an official patch yet, but if there is, get
that instead. This should greatly improve matters for you.

Jon
Jorgen Lundman
2008-Aug-11 06:08 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
So it does appear that it is zpool that hangs, possibly during
resilvering (we lost a HDD at midnight, which is what started all this).

After boot:

x4500-02:~# zpool status -x
  pool: zpool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool
        will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 11.10% done, 2h11m to go
config:

        NAME             STATE     READ WRITE CKSUM
        zpool1           DEGRADED     0     0     0
          raidz1         ONLINE       0     0     0
            [snip]
            c7t3d0       ONLINE       0     0     0
            replacing    UNAVAIL      0     0     0  insufficient replicas
              c8t3d0s0/o UNAVAIL      0     0     0  cannot open
              c8t3d0     UNAVAIL      0     0     0  cannot open
          raidz1         ONLINE       0     0     0

You can run zpool for about 4-5 minutes, then the commands start to
hang. For example, I tried to issue:

# zpool offline zpool1 c8t3d0

.. and the system stops responding.

# mdb -k
> ::ps!grep pool
R    732    722    732    662      0 0x4a004000 ffffffffb92a8030 zpool
> ffffffffb92a8030::walk thread|::findstack -v
stack pointer for thread fffffe85285d07e0: fffffe800283fc40
[ fffffe800283fc40 _resume_from_idle+0xf8() ]
  fffffe800283fc70 swtch+0x12a()
  fffffe800283fc90 cv_wait+0x68()
  fffffe800283fcc0 spa_config_enter+0x50()
  fffffe800283fce0 spa_vdev_enter+0x2a()
  fffffe800283fd10 vdev_offline+0x29()
  fffffe800283fd40 zfs_ioc_vdev_offline+0x58()
  fffffe800283fd80 zfsdev_ioctl+0x13e()
  fffffe800283fd90 cdev_ioctl+0x1d()
  fffffe800283fdb0 spec_ioctl+0x50()
  fffffe800283fde0 fop_ioctl+0x25()
  fffffe800283fec0 ioctl+0xac()
  fffffe800283ff10 sys_syscall32+0x101()

Similarly, nfsd:

> ::ps!grep nfsd
R    548      1    548    548      1 0x42000000 ffffffffb92ad6d0 nfsd
> ffffffffb92ad6d0::walk thread|::findstack -v
stack pointer for thread ffffffff9af8e540: fffffe8001046cc0
[ fffffe8001046cc0 _resume_from_idle+0xf8() ]
  fffffe8001046cf0 swtch+0x12a()
  fffffe8001046d40 cv_wait_sig_swap_core+0x177()
  fffffe8001046d50 cv_wait_sig_swap+0xb()
  fffffe8001046da0 cv_waituntil_sig+0xd7()
  fffffe8001046e50 poll_common+0x420()
  fffffe8001046ec0 pollsys+0xbe()
  fffffe8001046f10 sys_syscall32+0x101()

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Jorgen Lundman
2008-Aug-11 13:13 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
ok, so I tried installing 138053-02, and unmounting/unsharing for the
entire resilvering process; meanwhile, onsite support decided to replace
the mainboard for some reason (not that I was full of confidence here)
... and, between the two, it has actually been up for 2 hours and has a
clean "zpool status".

Going to get some sleep, and really hope it has been fixed. Thank you to
everyone who helped.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Weldon S Godfrey 3
2008-Aug-11 14:16 UTC
[zfs-discuss] 32 bit NFS clients with 64 bit ZFS server okay?
Are there any known issues with having 32-bit OS clients use NFS to
access an NFS server running a 64-bit OS and exporting a > 2 TB
filesystem? Are there any issues with using NFS v3 rather than NFS v4?

Thanks!

Weldon
Casper.Dik at Sun.COM
2008-Aug-11 14:34 UTC
[zfs-discuss] 32 bit NFS clients with 64 bit ZFS server okay?
> Are there any known issues with having 32 bit OS clients using NFS to
> access a NFS server using a 64 bit OS exporting > 2TB filesystem? Are
> there any issues with using NFS v3 over NFS v4?

The problems are not about the size of the data; it's how 32-bit clients
use the data returned. The sticky point is the use of > 32-bit offsets
for directory entries.

Casper
Frank Fischer
2008-Aug-12 14:05 UTC
[zfs-discuss] x4500 dead HDD, hung server, unable to boot.
James, one question: do you know if, and if so in which version of,
OpenSolaris this issue is solved? We have the exact same problems using
a Supermicro X7DBE with two Supermicro AOC-SAT2-MV8 controllers (we are
on snv_79).

Thanks,
Frank