The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was "constantly busy", and since our x4500 has always died miserably in the past when an HDD dies, they wanted to replace it before it actually failed.

The usual was done: HDD replaced, resilvering started and ran for about 50 minutes. Then the system hung, same as always; all ZFS-related commands just hang and do nothing. The system is otherwise fine and completely idle.

The vendor for some reason decided to fsck the root fs, not sure why as it is mounted with "logging", and also decided it would be best to do so from a CD-ROM boot.

Anyway, that was 12 hours ago and the x4500 is still down. I think they have it at the single-user prompt resilvering again. (I also noticed they'd decided to break the mirror of the root disks for some very strange reason.) It still shows:

    raidz1              DEGRADED     0     0     0
      c0t1d0            ONLINE       0     0     0
      replacing         UNAVAIL      0     0     0  insufficient replicas
        c1t1d0s0/o      OFFLINE      0     0     0
        c1t1d0          UNAVAIL      0     0     0  cannot open

So I am pretty sure it'll hang again sometime soon. What is interesting is that this is x4500-02, while all our previous troubles mailed to the list were about our first x4500. The hardware is physically different, but an identical model. Solaris 10 5/08.

Anyway, I think they want to boot the CD-ROM to fsck root again for some reason, but since customers have already been without their mail for 12 hours, they can go a little longer, I guess.

What I was really wondering: has there been any progress or patches regarding the system always hanging whenever an HDD dies (or is replaced, it seems)? It really is rather frustrating.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
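[For readers following the sequence above, the disk swap itself normally boils down to a handful of commands. The sketch below is only illustrative; the pool name "zpool1" and the cfgadm attachment point are guesses, not what the vendor actually ran on this x4500.]

    # Illustrative replacement sequence; pool name and slot path are assumptions.
    zpool status -v zpool1              # identify the faulted/busy disk
    zpool offline zpool1 c1t1d0         # take the suspect disk out of service
    cfgadm -c unconfigure sata1/1       # detach the slot before pulling the drive
    # ...physically swap the drive...
    cfgadm -c configure sata1/1         # bring the new drive back online
    zpool replace zpool1 c1t1d0         # kick off the resilver onto the new disk
    zpool status zpool1                 # watch resilver progress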
I'm not an authority, but on my 'vanilla' filer, using the same controller chipset as the Thumper, I've been in really good shape since moving to ZFS boot on 10/08 and doing 'zpool upgrade' and 'zfs upgrade' on all my mirrors (three 3-way). I'd been having similar troubles to yours in the past.

My system is pretty puny next to yours, but it's been reliable now for slightly over a month.

On Tue, Jan 27, 2009 at 12:19 AM, Jorgen Lundman <lundman at gmo.jp> wrote:
> The vendor wanted to come in and replace an HDD in the 2nd X4500, as it
> was "constantly busy", and since our x4500 has always died miserably in
> the past when a HDD dies, they wanted to replace it before the HDD
> actually died.
> [rest of original message snipped]
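[The upgrade Blake mentions is just the stock version bump; a minimal sketch, with the caveat that upgraded pools can no longer be imported on older releases.]

    zpool upgrade          # show the current on-disk version of each pool
    zpool upgrade -a       # upgrade every pool to the newest supported version
    zfs upgrade -a         # upgrade all ZFS filesystems likewise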
Thanks for your reply,

While the savecore is working its way up the chain to (hopefully) Sun, the vendor asked us not to use the box, so we moved x4500-02's load over to x4500-04 and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when it is fixed is the way to go.

The savecore had the usual info, that everything is blocked waiting on locks:

     601* threads trying to get a mutex (598 user, 3 kernel)
          longest sleeping 10 minutes 13.52 seconds earlier
     115* threads trying to get an rwlock (115 user, 0 kernel)
    1678  total threads in allthreads list (1231 user, 447 kernel)
      10  thread_reapcnt
       0  lwp_reapcnt
    1688  nthread

    thread              pri pctcpu  idle        PID   wchan              command
    0xfffffe8000137c80   60 0.000    -9m44.88s     0  0xfffffe84d816cdc8 sched
    0xfffffe800092cc80   60 0.000    -9m44.52s     0  0xffffffffc03c6538 sched
    0xfffffe8527458b40   59 0.005    -1m41.38s  1217  0xffffffffb02339e0 /usr/lib/nfs/rquotad
    0xfffffe8527b534e0   60 0.000    -5m4.79s    402  0xfffffe84d816cdc8 /usr/lib/nfs/lockd
    0xfffffe852578f460   60 0.000    -4m59.79s   402  0xffffffffc0633fc8 /usr/lib/nfs/lockd
    0xfffffe8532ad47a0   60 0.000   -10m4.40s    623  0xfffffe84bde48598 /usr/lib/nfs/nfsd
    0xfffffe8532ad3d80   60 0.000   -10m9.10s    623  0xfffffe84d816ced8 /usr/lib/nfs/nfsd
    0xfffffe8532ad3360   60 0.000   -10m3.77s    623  0xfffffe84d816cde0 /usr/lib/nfs/nfsd
    0xfffffe85341e9100   60 0.000   -10m6.85s    623  0xfffffe84bde48428 /usr/lib/nfs/nfsd
    0xfffffe85341e8a40   60 0.000   -10m4.76s    623  0xfffffe84d816ced8 /usr/lib/nfs/nfsd

    SolarisCAT(vmcore.0/10X)> tlist sobj locks | grep nfsd | wc -l
    680

    scl_writer = 0xfffffe8000185c80   <- locking thread

    thread 0xfffffe8000185c80
    ==== kernel thread: 0xfffffe8000185c80  PID: 0 ====
    cmd: sched
    t_wchan: 0xfffffffffbc8200a  sobj: condition var (from genunix:bflush+0x4d)
    t_procp: 0xfffffffffbc22dc0(proc_sched)
    p_as: 0xfffffffffbc24a20(kas)  zone: global
    t_stk: 0xfffffe8000185c80  sp: 0xfffffe8000185aa0  t_stkbase: 0xfffffe8000181000
    t_pri: 99(SYS)  pctcpu: 0.000000
    t_lwp: 0x0  psrset: 0  last CPU: 0
    idle: 44943 ticks (7 minutes 29.43 seconds)
    start: Tue Jan 27 23:44:21 2009
    age: 674 seconds (11 minutes 14 seconds)
    tstate: TS_SLEEP - awaiting an event
    tflg:   T_TALLOCSTK - thread structure allocated from stk
    tpflg:  none set
    tsched: TS_LOAD - thread is in memory
            TS_DONT_SWAP - thread/LWP should not be swapped
    pflag:  SSYS - system resident process
    pc:      0xfffffffffb83616f unix:_resume_from_idle+0xf8 resume_return
    startpc: 0xffffffffeff889e0 zfs:spa_async_thread+0x0

    unix:_resume_from_idle+0xf8 resume_return()
    unix:swtch+0x12a()
    genunix:cv_wait+0x68()
    genunix:bflush+0x4d()
    genunix:ldi_close+0xbe()
    zfs:vdev_disk_close+0x6a()
    zfs:vdev_close+0x13()
    zfs:vdev_raidz_close+0x26()
    zfs:vdev_close+0x13()
    zfs:vdev_reopen+0x1d()
    zfs:spa_async_reopen+0x5f()
    zfs:spa_async_thread+0xc8()
    unix:thread_start+0x8()
    -- end of kernel thread's stack --

Blake wrote:
> I'm not an authority, but on my 'vanilla' filer, using the same
> controller chipset as the thumper, I've been in really good shape
> since moving to zfs boot in 10/08 and doing 'zpool upgrade' and 'zfs
> upgrade' to all my mirrors (3 3-way). I'd been having similar
> troubles to yours in the past.
>
> My system is pretty puny next to yours, but it's been reliable now for
> slightly over a month.
> [earlier quoted text snipped]

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
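[For completeness, much the same picture can be pulled out of a savecore dump with the stock mdb instead of SolarisCAT. A rough sketch, assuming the dump was saved as unix.0/vmcore.0 in the default /var/crash/<hostname> directory:]

    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0
    > ::status                            # summary of how/why the dump was taken
    > ::threadlist -v                     # all kernel threads with stacks, like tlist
    > 0xfffffe8000185c80::findstack -v    # stack of the thread holding the lock
    > $q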
On Tue, Jan 27, 2009 at 9:28 PM, Jorgen Lundman <lundman at gmo.jp> wrote:
> Thanks for your reply,
>
> While the savecore is working its way up the chain to (hopefully) Sun,
> the vendor asked us not to use it, so we moved x4500-02 to use x4500-04
> and x4500-05. But perhaps moving to Sol 10 10/08 on x4500-02 when fixed
> is the way to go.
>
> The savecore had the usual info, that everything is blocked waiting on
> locks:

I assume you've changed the failmode to continue already?

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
> I assume you've changed the failmode to continue already?
>
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

This appears to be new to 10/08, so that is another vote to upgrade. Also interesting that the default is "wait", since it almost behaves like it. Not sure why it would block the "zpool", "zfs" and "df" commands as well though?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
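[For reference, the property behind that link is queried and set as below, on a pool that is at version 10 or later (i.e. 10/08); "zpool1" is just a placeholder name.]

    zpool get failmode zpool1             # default is "wait": I/O blocks until the device recovers
    zpool set failmode=continue zpool1    # return EIO to new I/O instead of blocking
    zpool set failmode=panic zpool1       # or panic the box so it can fail over quickly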
I've been told we got a BugID: "3-way deadlock happens in ufs filesystem on zvol when writing ufs log", but I can not view the BugID yet (presumably due to my account's weak credentials).

Perhaps it isn't something we are doing wrong; that would be a nice change.

Lund

Jorgen Lundman wrote:
>> I assume you've changed the failmode to continue already?
>>
>> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
>
> This appears to be new to 10/08, so that is another vote to upgrade.
> Also interesting that the default is "wait", since it almost behaves
> like it. Not sure why it would block "zpool", "zfs" and "df" commands as
> well though?
>
> Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
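[For context, the configuration named in that bug title, a logging UFS sitting on a zvol, is built roughly as below; the pool, volume name, size and mount point are made up purely for illustration.]

    zfs create -V 10g zpool1/ufsvol                 # carve out a zvol from the pool
    newfs /dev/zvol/rdsk/zpool1/ufsvol              # put a UFS filesystem on it
    mkdir -p /export/ufsvol
    mount -F ufs -o logging /dev/zvol/dsk/zpool1/ufsvol /export/ufsvol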