The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
was "constantly busy", and since our x4500 has always died miserably in
the past when an HDD dies, they wanted to replace it before the HDD
actually failed.
The usual was done, HDD replaced, resilvering started and ran for about
50 minutes. Then the system hung, same as always: all ZFS-related
commands would just hang and do nothing. The system is otherwise fine and
completely idle.
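(For completeness, the swap itself was nothing exotic; roughly the usual
sequence, sketched from memory, so the cfgadm attachment point and pool
name below are placeholders rather than the exact ones used:

    # release the old disk from the OS, swap the drive, re-attach it
    cfgadm -c unconfigure sata1/1
    cfgadm -c configure sata1/1

    # point ZFS at the new disk and watch the resilver
    zpool replace zpool1 c1t1d0
    zpool status -v zpool1
)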
The vendor for some reason decided to fsck root-fs; not sure why, as it
is mounted with "logging". They also decided it would be best to do so
from a CD-ROM boot.
Anyway, that was 12 hours ago and the x4500 is still down. I think they
have it at the single-user prompt, resilvering again. (I also noticed
they'd decided to break the mirror of the root disks, for some very
strange reason.) It still shows:
  raidz1              DEGRADED     0     0     0
    c0t1d0            ONLINE       0     0     0
    replacing         UNAVAIL      0     0     0  insufficient replicas
      c1t1d0s0/o      OFFLINE      0     0     0
      c1t1d0          UNAVAIL      0     0     0  cannot open
So I am pretty sure it'll hang again sometime soon. What is interesting,
though, is that this is on x4500-02, and all our previous troubles mailed
to the list were about our first x4500. The hardware is all separate, but
identical in configuration. Solaris 10 5/08.
Anyway, I think they want to boot from CD-ROM to fsck root again for some
reason, but since customers have been without their mail for 12 hours,
they can go a little longer, I guess.
What I was really wondering is: has there been any progress, or patches,
regarding the system always hanging whenever an HDD dies (or is replaced,
it seems)? It really is rather frustrating.
Lund
--
Jorgen Lundman | <lundman at lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)
I'm not an authority, but on my 'vanilla' filer, using the same
controller chipset as the thumper, I've been in really good shape since
moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs upgrade'
to all my mirrors (3 3-way). I'd been having similar troubles to yours
in the past.

My system is pretty puny next to yours, but it's been reliable now for
slightly over a month.

On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman <lundman at gmo.jp> wrote:
> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
> was "constantly busy", and since our x4500 has always died miserably in
> the past when a HDD dies, they wanted to replace it before the HDD
> actually died.
> [...]
Thanks for your reply,

While the savecore is working its way up the chain to (hopefully) Sun,
the vendor asked us not to use the machine, so we moved x4500-02's
load over to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08
on x4500-02 once it is fixed is the way to go.
The savecore had the usual info, that everything is blocked waiting on
locks:
601* threads trying to get a mutex (598 user, 3 kernel)
longest sleeping 10 minutes 13.52 seconds earlier
115* threads trying to get an rwlock (115 user, 0 kernel)
1678 total threads in allthreads list (1231 user, 447 kernel)
10 thread_reapcnt
0 lwp_reapcnt
1688 nthread
thread               pri  pctcpu  idle        PID   wchan               command
0xfffffe8000137c80    60   0.000  -9m44.88s     0   0xfffffe84d816cdc8  sched
0xfffffe800092cc80    60   0.000  -9m44.52s     0   0xffffffffc03c6538  sched
0xfffffe8527458b40    59   0.005  -1m41.38s  1217   0xffffffffb02339e0  /usr/lib/nfs/rquotad
0xfffffe8527b534e0    60   0.000  -5m4.79s    402   0xfffffe84d816cdc8  /usr/lib/nfs/lockd
0xfffffe852578f460    60   0.000  -4m59.79s   402   0xffffffffc0633fc8  /usr/lib/nfs/lockd
0xfffffe8532ad47a0    60   0.000  -10m4.40s   623   0xfffffe84bde48598  /usr/lib/nfs/nfsd
0xfffffe8532ad3d80    60   0.000  -10m9.10s   623   0xfffffe84d816ced8  /usr/lib/nfs/nfsd
0xfffffe8532ad3360    60   0.000  -10m3.77s   623   0xfffffe84d816cde0  /usr/lib/nfs/nfsd
0xfffffe85341e9100    60   0.000  -10m6.85s   623   0xfffffe84bde48428  /usr/lib/nfs/nfsd
0xfffffe85341e8a40    60   0.000  -10m4.76s   623   0xfffffe84d816ced8  /usr/lib/nfs/nfsd
SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
680
scl_writer = 0xfffffe8000185c80 <- locking thread
thread 0xfffffe8000185c80
==== kernel thread: 0xfffffe8000185c80  PID: 0 ====
cmd: sched
t_wchan: 0xfffffffffbc8200a sobj: condition var (from genunix:bflush+0x4d)
t_procp: 0xfffffffffbc22dc0(proc_sched)
p_as: 0xfffffffffbc24a20(kas)
zone: global
t_stk: 0xfffffe8000185c80  sp: 0xfffffe8000185aa0  t_stkbase: 0xfffffe8000181000
t_pri: 99(SYS) pctcpu: 0.000000
t_lwp: 0x0 psrset: 0 last CPU: 0
idle: 44943 ticks (7 minutes 29.43 seconds)
start: Tue Jan 27 23:44:21 2009
age: 674 seconds (11 minutes 14 seconds)
tstate: TS_SLEEP - awaiting an event
tflg: T_TALLOCSTK - thread structure allocated from stk
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SSYS - system resident process
pc: 0xfffffffffb83616f unix:_resume_from_idle+0xf8 resume_return
startpc: 0xffffffffeff889e0 zfs:spa_async_thread+0x0
unix:_resume_from_idle+0xf8 resume_return()
unix:swtch+0x12a()
genunix:cv_wait+0x68()
genunix:bflush+0x4d()
genunix:ldi_close+0xbe()
zfs:vdev_disk_close+0x6a()
zfs:vdev_close+0x13()
zfs:vdev_raidz_close+0x26()
zfs:vdev_close+0x13()
zfs:vdev_reopen+0x1d()
zfs:spa_async_reopen+0x5f()
zfs:spa_async_thread+0xc8()
unix:thread_start+0x8()
-- end of kernel thread's stack --
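(If you do not have SolarisCAT handy, roughly the same picture can be
pulled from the dump with mdb; a minimal sketch, assuming the usual
unix.0/vmcore.0 pair from savecore and the thread address shown above:

    # open the saved crash dump
    mdb unix.0 vmcore.0

    # list all threads with their sleep state and wait channel
    > ::threadlist -v

    # then walk the stack of the thread holding the lock
    > 0xfffffe8000185c80::findstack -v
)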
Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same
> controller chipset as the thumper, I've been in really good shape
> since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs
> upgrade' to all my mirrors (3 3-way). I'd been having similar
> troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for
> slightly over a month.
--
Jorgen Lundman | <lundman at lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)
On Tue, Jan 27, 2009 at 9:28 PM, Jorgen Lundman <lundman at gmo.jp> wrote:
> Thanks for your reply,
>
> While the savecore is working its way up the chain to (hopefully) Sun,
> the vendor asked us not to use the machine, so we moved x4500-02's
> load over to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08
> on x4500-02 once it is fixed is the way to go.
>
> The savecore had the usual info, that everything is blocked waiting on
> locks:
> [...]

I assume you've changed the failmode to continue already?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
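(For reference, a minimal sketch of what that looks like; the pool name
"zpool1" is just a placeholder, and the failmode property only exists on
pools at the 10/08 version or later:

    # show the current setting -- the default is "wait"
    zpool get failmode zpool1

    # "continue" returns EIO on new writes but keeps allowing reads
    # from the remaining healthy devices
    zpool set failmode=continue zpool1
)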
> I assume you've changed the failmode to continue already?
>
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

This appears to be new to 10/08, so that is another vote to upgrade.
Also interesting that the default is "wait", since it almost behaves
like it. Not sure why it would block "zpool", "zfs" and "df" commands as
well, though?

Lund

--
Jorgen Lundman | <lundman at lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)
I've been told we got a BugID: "3-way deadlock happens in ufs filesystem
on zvol when writing ufs log", but I cannot view the BugID yet
(presumably due to my account's weak credentials).

Perhaps it isn't something we do wrong; that would be a nice change.

Lund

Jorgen Lundman wrote:
>> I assume you've changed the failmode to continue already?
>>
>> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
>
> This appears to be new to 10/08, so that is another vote to upgrade.
> Also interesting that the default is "wait", since it almost behaves
> like it. Not sure why it would block "zpool", "zfs" and "df" commands as
> well, though?
>
> Lund

--
Jorgen Lundman | <lundman at lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)