The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was "constantly busy", and since our x4500 has always died miserably in the past when an HDD dies, they wanted to replace it before it actually failed.

The usual was done: HDD replaced, resilvering started and ran for about 50 minutes. Then the system hung, same as always; all ZFS-related commands just hang and do nothing. The system is otherwise fine and completely idle.

The vendor for some reason decided to fsck the root fs, not sure why as it is mounted with "logging", and also decided it would be best to do so from a CD-ROM boot.

Anyway, that was 12 hours ago and the x4500 is still down. I think they have it at the single-user prompt resilvering again. (I also noticed they'd decided to break the mirror of the root disks for some very strange reason.) It still shows:

    raidz1              DEGRADED     0     0     0
      c0t1d0            ONLINE       0     0     0
      replacing         UNAVAIL      0     0     0  insufficient replicas
        c1t1d0s0/o      OFFLINE      0     0     0
        c1t1d0          UNAVAIL      0     0     0  cannot open

So I am pretty sure it'll hang again sometime soon. What is interesting is that this is x4500-02, while all our previous troubles mailed to the list were about our first x4500. The hardware is physically different, but an identical model. Solaris 10 5/08.

Anyway, I think they want to boot the CD-ROM to fsck root again for some reason, but since customers have already been without their mail for 12 hours, they can go a little longer, I guess.

What I was really wondering: has there been any progress or patches regarding the system always hanging whenever an HDD dies (or is replaced, it seems)? It really is rather frustrating.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
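[For readers following the sequence above, the disk swap itself normally boils down to a handful of commands. The sketch below is only illustrative; the pool name "zpool1" and the cfgadm attachment point are guesses, not what the vendor actually ran on this x4500.]

    # Illustrative replacement sequence; pool name and slot path are assumptions.
    zpool status -v zpool1              # identify the faulted/busy disk
    zpool offline zpool1 c1t1d0         # take the suspect disk out of service
    cfgadm -c unconfigure sata1/1       # detach the slot before pulling the drive
    # ...physically swap the drive...
    cfgadm -c configure sata1/1         # bring the new drive back online
    zpool replace zpool1 c1t1d0         # kick off the resilver onto the new disk
    zpool status zpool1                 # watch resilver progress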
I'm not an authority, but on my 'vanilla' filer, using the same controller chipset as the Thumper, I've been in really good shape since moving to ZFS boot on 10/08 and doing 'zpool upgrade' and 'zfs upgrade' on all my mirrors (three 3-way). I'd been having similar troubles to yours in the past.

My system is pretty puny next to yours, but it's been reliable now for slightly over a month.

On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman <lundman at gmo.jp> wrote:
> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
> was "constantly busy", and since our x4500 has always died miserably in
> the past when a HDD dies, they wanted to replace it before the HDD
> actually died.
> [rest of original message snipped]
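[The upgrade Blake mentions is just the stock version bump; a minimal sketch, with the caveat that upgraded pools can no longer be imported on older releases.]

    zpool upgrade          # show the current on-disk version of each pool
    zpool upgrade -a       # upgrade every pool to the newest supported version
    zfs upgrade -a         # upgrade all ZFS filesystems likewise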
Thanks for your reply,

While the savecore is working its way up the chain to (hopefully) Sun, the vendor asked us not to use the box, so we moved x4500-02's load over to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when it is fixed is the way to go.

The savecore had the usual info, that everything is blocked waiting on locks:

     601* threads trying to get a mutex (598 user, 3 kernel)
          longest sleeping 10 minutes 13.52 seconds earlier
     115* threads trying to get an rwlock (115 user, 0 kernel)
    1678  total threads in allthreads list (1231 user, 447 kernel)
      10  thread_reapcnt
       0  lwp_reapcnt
    1688  nthread

    thread              pri pctcpu  idle        PID   wchan              command
    0xfffffe8000137c80   60 0.000    -9m44.88s     0  0xfffffe84d816cdc8 sched
    0xfffffe800092cc80   60 0.000    -9m44.52s     0  0xffffffffc03c6538 sched
    0xfffffe8527458b40   59 0.005    -1m41.38s  1217  0xffffffffb02339e0 /usr/lib/nfs/rquotad
    0xfffffe8527b534e0   60 0.000    -5m4.79s    402  0xfffffe84d816cdc8 /usr/lib/nfs/lockd
    0xfffffe852578f460   60 0.000    -4m59.79s   402  0xffffffffc0633fc8 /usr/lib/nfs/lockd
    0xfffffe8532ad47a0   60 0.000   -10m4.40s    623  0xfffffe84bde48598 /usr/lib/nfs/nfsd
    0xfffffe8532ad3d80   60 0.000   -10m9.10s    623  0xfffffe84d816ced8 /usr/lib/nfs/nfsd
    0xfffffe8532ad3360   60 0.000   -10m3.77s    623  0xfffffe84d816cde0 /usr/lib/nfs/nfsd
    0xfffffe85341e9100   60 0.000   -10m6.85s    623  0xfffffe84bde48428 /usr/lib/nfs/nfsd
    0xfffffe85341e8a40   60 0.000   -10m4.76s    623  0xfffffe84d816ced8 /usr/lib/nfs/nfsd

    SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
    680

    scl_writer = 0xfffffe8000185c80   <- locking thread

    thread 0xfffffe8000185c80
    ==== kernel thread: 0xfffffe8000185c80  PID: 0 ====
    cmd: sched
    t_wchan: 0xfffffffffbc8200a  sobj: condition var (from genunix:bflush+0x4d)
    t_procp: 0xfffffffffbc22dc0(proc_sched)
    p_as: 0xfffffffffbc24a20(kas)  zone: global
    t_stk: 0xfffffe8000185c80  sp: 0xfffffe8000185aa0  t_stkbase: 0xfffffe8000181000
    t_pri: 99(SYS)  pctcpu: 0.000000
    t_lwp: 0x0  psrset: 0  last CPU: 0
    idle: 44943 ticks (7 minutes 29.43 seconds)
    start: Tue Jan 27 23:44:21 2009
    age: 674 seconds (11 minutes 14 seconds)
    tstate: TS_SLEEP - awaiting an event
    tflg:   T_TALLOCSTK - thread structure allocated from stk
    tpflg:  none set
    tsched: TS_LOAD - thread is in memory
            TS_DONT_SWAP - thread/LWP should not be swapped
    pflag:  SSYS - system resident process
    pc:      0xfffffffffb83616f unix:_resume_from_idle+0xf8 resume_return
    startpc: 0xffffffffeff889e0 zfs:spa_async_thread+0x0

    unix:_resume_from_idle+0xf8 resume_return()
    unix:swtch+0x12a()
    genunix:cv_wait+0x68()
    genunix:bflush+0x4d()
    genunix:ldi_close+0xbe()
    zfs:vdev_disk_close+0x6a()
    zfs:vdev_close+0x13()
    zfs:vdev_raidz_close+0x26()
    zfs:vdev_close+0x13()
    zfs:vdev_reopen+0x1d()
    zfs:spa_async_reopen+0x5f()
    zfs:spa_async_thread+0xc8()
    unix:thread_start+0x8()
    -- end of kernel thread's stack --

Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same
> controller chipset as the thumper, I've been in really good shape
> since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs
> upgrade' to all my mirrors (3 3-way). I'd been having similar
> troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for
> slightly over a month.
> [earlier quoted text snipped]

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
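[For completeness, much the same picture can be pulled out of a savecore dump with the stock mdb instead of SolarisCAT. A rough sketch, assuming the dump was saved as unix.0/vmcore.0 in the default /var/crash/<hostname> directory:]

    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0
    > ::status                            # summary of how/why the dump was taken
    > ::threadlist -v                     # all kernel threads with stacks, like tlist
    > 0xfffffe8000185c80::findstack -v    # stack of the thread holding the lock
    > $q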
On Tue, Jan 27, 2009 at 9:28 PM, Jorgen Lundman <lundman at gmo.jp> wrote:
> Thanks for your reply,
>
> While the savecore is working its way up the chain to (hopefully) Sun,
> the vendor asked us not to use it, so we moved x4500-02 to use x4500-04
> and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when fixed
> is the way to go.
>
> The savecore had the usual info, that everything is blocked waiting on
> locks:

I assume you've changed the failmode to continue already?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
> I assume you've changed the failmode to continue already?
>
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

This appears to be new to 10/08, so that is another vote to upgrade. Also interesting that the default is "wait", since it almost behaves like it. Not sure why it would block the "zpool", "zfs" and "df" commands as well though?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
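[For reference, the property behind that link is queried and set as below, on a pool that is at version 10 or later (i.e. 10/08); "zpool1" is just a placeholder name.]

    zpool get failmode zpool1             # default is "wait": I/O blocks until the device recovers
    zpool set failmode=continue zpool1    # return EIO to new I/O instead of blocking
    zpool set failmode=panic zpool1       # or panic the box so it can fail over quickly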
I've been told we got a BugID: "3-way deadlock happens in ufs filesystem on zvol when writing ufs log", but I can not view the BugID yet (presumably due to my account's weak credentials).

Perhaps it isn't something we are doing wrong; that would be a nice change.

Lund

Jorgen Lundman wrote:
>> I assume you've changed the failmode to continue already?
>>
>> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
>
> This appears to be new to 10/08, so that is another vote to upgrade.
> Also interesting that the default is "wait", since it almost behaves
> like it. Not sure why it would block "zpool", "zfs" and "df" commands as
> well though?
>
> Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
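[For context, the configuration named in that bug title, a logging UFS sitting on a zvol, is built roughly as below; the pool, volume name, size and mount point are made up purely for illustration.]

    zfs create -V 10g zpool1/ufsvol                 # carve out a zvol from the pool
    newfs /dev/zvol/rdsk/zpool1/ufsvol              # put a UFS filesystem on it
    mkdir -p /export/ufsvol
    mount -F ufs -o logging /dev/zvol/dsk/zpool1/ufsvol /export/ufsvol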