On Saturday the X4500 system panicked, and rebooted. For some reason the
/export/saba1 UFS partition was corrupt, and needed "fsck". This is why
it did not come back online. /export/saba1 is mounted "logging,noatime",
so fsck should never (-ish) be needed.

SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc

/export/saba1 on /dev/zvol/dsk/zpool1/saba1
read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
on Sat Jul 5 08:48:54 2008


One possible related bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138


What would be the best solution? Go back to the latest Solaris 10 and pass
it on to Sun support, or find a patch for this problem?


Panic dump follows:

-rw-r--r--   1 root     root        2529300 Jul  5 08:48 unix.2
-rw-r--r--   1 root     root    10133225472 Jul  5 09:10 vmcore.2

# mdb unix.2 vmcore.2
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc
pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba uhci s1394 qlc fctl
nca lofs zfs random cpc crypto fcip fcp logindmux nsctl sdbc ptm sv ii
sppp rdc nfs ]

> $c
vpanic()
vcmn_err+0x28(3, fffffffff783ade0, ffffff001e737aa8)
real_panic_v+0xf7(0, fffffffff783ade0, ffffff001e737aa8)
ufs_fault_v+0x1d0(fffffffed0bfb980, fffffffff783ade0, ffffff001e737aa8)
ufs_fault+0xa0()
dqput+0xce(ffffffff1db26ef0)
dqrele+0x48(ffffffff1db26ef0)
ufs_trans_dqrele+0x6f(ffffffff1db26ef0)
ufs_idle_free+0x16d(ffffff04f17b1e00)
ufs_idle_some+0x152(3f60)
ufs_thread_idle+0x1a1()
thread_start+8()

> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc2fc10  1b    0    0  60   no    no t-0    ffffff001e737c80 sched
  1 fffffffec3a0a000  1f    1    0  -1   no    no t-0    ffffff001e971c80 (idle)
  2 fffffffec3a02ac0  1f    0    0  -1   no    no t-1    ffffff001e9dbc80 (idle)
  3 fffffffec3d60580  1f    0    0  -1   no    no t-1    ffffff001ea50c80 (idle)

> ::panicinfo
             cpu                0
          thread ffffff001e737c80
         message dqput: dqp->dq_cnt == 0
             rdi fffffffff783ade0
             rsi ffffff001e737aa8
             rdx fffffffff783ade0
             rcx ffffff001e737aa8
              r8 fffffffff783ade0
              r9                0
             rax                3
             rbx                0
             rbp ffffff001e737900
             r10 fffffffffbc26fb0
             r11 ffffff001e737c80
             r12 fffffffff783ade0
             r13 ffffff001e737aa8
             r14                3
             r15 fffffffff783ade0
          fsbase                0
          gsbase fffffffffbc26fb0
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb83c860
              cs               30
          rflags              246
             rsp ffffff001e7378b8
              ss               38
          gdt_hi                0
          gdt_lo         e00001ef
          idt_hi                0
          idt_lo         77c00fff
             ldt                0
            task               70
             cr0         8005003b
             cr2         fee7d650
             cr3          2c00000
             cr4              6f8

> ::msgbuf
quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs /export/zero1)
quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs /export/zero1)

panic[cpu0]/thread=ffffff001e737c80:
dqput: dqp->dq_cnt == 0

ffffff001e737930 genunix:vcmn_err+28 ()
ffffff001e737980 ufs:real_panic_v+f7 ()
ffffff001e7379e0 ufs:ufs_fault_v+1d0 ()
ffffff001e737ad0 ufs:ufs_fault+a0 ()
ffffff001e737b00 ufs:dqput+ce ()
ffffff001e737b30 ufs:dqrele+48 ()
ffffff001e737b70 ufs:ufs_trans_dqrele+6f ()
ffffff001e737bc0 ufs:ufs_idle_free+16d ()
ffffff001e737c10 ufs:ufs_idle_some+152 ()
ffffff001e737c60 ufs:ufs_thread_idle+1a1 ()
ffffff001e737c70 unix:thread_start+8 ()

syncing file systems...

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
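(For anyone following the recovery step above: a minimal sketch of the
commands it boils down to, assuming the raw device path simply mirrors
the block device in the mount output; untested against this particular
setup.)

   # confirm the options the filesystem is actually mounted with
   mount | grep saba1

   # force a full check of the UFS on the zvol even if the superblock is
   # marked clean; /dev/zvol/rdsk/zpool1/saba1 is assumed to be the raw
   # counterpart of /dev/zvol/dsk/zpool1/saba1 shown above
   fsck -F ufs -o f /dev/zvol/rdsk/zpool1/saba1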
Jorgen Lundman wrote:
> On Saturday the X4500 system panicked, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed "fsck". This is why
> it did not come back online. /export/saba1 is mounted "logging,noatime",
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul 5 08:48:54 2008
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138

Yes, that bug is possibly related. However, the panic stacks listed in it
do not match yours.

> What would be the best solution? Go back to the latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?

Since the panic stack only ever goes through ufs, you should
log a call with Sun support.

...

> > ::msgbuf
> quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs
> /export/zero1)
> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
> /export/zero1)
>
> panic[cpu0]/thread=ffffff001e737c80:
> dqput: dqp->dq_cnt == 0
>
> ffffff001e737930 genunix:vcmn_err+28 ()
> ffffff001e737980 ufs:real_panic_v+f7 ()
> ffffff001e7379e0 ufs:ufs_fault_v+1d0 ()
> ffffff001e737ad0 ufs:ufs_fault+a0 ()
> ffffff001e737b00 ufs:dqput+ce ()
> ffffff001e737b30 ufs:dqrele+48 ()
> ffffff001e737b70 ufs:ufs_trans_dqrele+6f ()
> ffffff001e737bc0 ufs:ufs_idle_free+16d ()
> ffffff001e737c10 ufs:ufs_idle_some+152 ()
> ffffff001e737c60 ufs:ufs_thread_idle+1a1 ()
> ffffff001e737c70 unix:thread_start+8 ()

Although.... given the entry in the msgbuf, perhaps
you might want to fix up your quota settings on that
particular filesystem.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp          http://www.jmcp.homeunix.com/blog
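(To spell out the quota check being suggested: a rough sketch with the
standard UFS quota tools, using the filesystem name from the msgbuf; the
username is only a placeholder.)

   # report current usage and limits for every user on the filesystem
   repquota -v /export/zero1

   # rebuild the quotas file from actual usage if it looks suspect after
   # the panic (best done while the filesystem is quiet)
   quotacheck -v /export/zero1

   # adjust an individual user's limits
   edquota someuser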
> Since the panic stack only ever goes through ufs, you should
> log a call with Sun support.

We do have support, but they only speak Japanese, and I'm still quite
poor at it. But I have started the process of having it translated and
passed along to the next person. It is always fun to see what it becomes
at the other end. Meanwhile, I like to research and see if it is an
already-known problem, rather than just sit around and wait.

>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
>> /export/zero1)
>
> Although.... given the entry in the msgbuf, perhaps
> you might want to fix up your quota settings on that
> particular filesystem.
>

Customers pay for a certain amount of disk quota and, being users, always
stay close to the edge. Those messages are as constant as precipitation in
the rainy season.

Are you suggesting they indicate a problem beyond the user being out of
space?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
Jorgen Lundman wrote:
>> Since the panic stack only ever goes through ufs, you should
>> log a call with Sun support.
>
> We do have support, but they only speak Japanese, and I'm still quite
> poor at it. But I have started the process of having it translated and
> passed along to the next person. It is always fun to see what it becomes
> at the other end. Meanwhile, I like to research and see if it is an
> already-known problem, rather than just sit around and wait.

That sounds like a learning opportunity :-)

>>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
>>> /export/zero1)
>>
>> Although.... given the entry in the msgbuf, perhaps
>> you might want to fix up your quota settings on that
>> particular filesystem.
>
> Customers pay for a certain amount of disk quota and, being users, always
> stay close to the edge. Those messages are as constant as precipitation in
> the rainy season.
>
> Are you suggesting they indicate a problem beyond the user being out of
> space?

I don't know, I'm not a UFS expert (heck, I'm not an expert
on _anything_). Have you investigated putting your paying
customers onto zfs and managing quotas with zfs properties
instead of ufs?


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp          http://www.jmcp.homeunix.com/blog
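(For reference, the property-based approach mentioned here works per
filesystem rather than per user, so it generally means one dataset per
customer. A rough sketch, with made-up dataset names and sizes:)

   # one dataset per customer, each with its own quota property
   zfs create zpool1/export/customer1
   zfs set quota=10G zpool1/export/customer1

   # review what is set and how much is used
   zfs get quota,used,available zpool1/export/customer1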
> I don't know, I'm not a UFS expert (heck, I'm not an expert
> on _anything_). Have you investigated putting your paying
> customers onto zfs and managing quotas with zfs properties
> instead of ufs?

Yep, we spent about 6 weeks during the trial period of the X4500 trying to
find a way for ZFS to replace the current NetApps. The history of this
mailing list should have it, and thanks to everyone who helped. But it was
just not possible.

Perhaps now it can be done, using mirror-mounts, but the 50-odd servers
hanging off the X4500 don't all support them, so it would still not be
feasible. Unless there has been some advancement in ZFS in the last 6
months I am not aware of... like user quotas?

Thanks for your assistance.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
Today we had another panic, at least it was during work time :) Just a
shame the 999GB UFS takes 80+ mins to fsck. (Yes, it is mounted 'logging'.)

panic[cpu3]/thread=ffffff001e70dc80:
free: freeing free block, dev:0xb600000024, block:13144, ino:1737885,
fs:/export/saba1

ffffff001e70d500 genunix:vcmn_err+28 ()
ffffff001e70d550 ufs:real_panic_v+f7 ()
ffffff001e70d5b0 ufs:ufs_fault_v+1d0 ()
ffffff001e70d6a0 ufs:ufs_fault+a0 ()
ffffff001e70d770 ufs:free+38f ()
ffffff001e70d830 ufs:indirtrunc+260 ()
ffffff001e70dab0 ufs:ufs_itrunc+738 ()
ffffff001e70db60 ufs:ufs_trans_itrunc+128 ()
ffffff001e70dbf0 ufs:ufs_delete+3b0 ()
ffffff001e70dc60 ufs:ufs_thread_delete+da ()
ffffff001e70dc70 unix:thread_start+8 ()

syncing file systems...

panic[cpu3]/thread=ffffff001e70dc80:
panic sync timeout

dumping to /dev/dsk/c6t0d0s1, offset 65536, content: kernel

> $c
vpanic()
vcmn_err+0x28(3, fffffffff783a128, ffffff001e70d678)
real_panic_v+0xf7(0, fffffffff783a128, ffffff001e70d678)
ufs_fault_v+0x1d0(ffffff04facf65c0, fffffffff783a128, ffffff001e70d678)
ufs_fault+0xa0()
free+0x38f(ffffff001e70d8d0, a6a7358, 2000, 89)
indirtrunc+0x260(ffffff001e70d8d0, a6a42b8, ffffffffffffffff, 0, 89)
ufs_itrunc+0x738(ffffff0550b9fde0, 0, 81, fffffffec0594db0)
ufs_trans_itrunc+0x128(ffffff0550b9fde0, 0, 81, fffffffec0594db0)
ufs_delete+0x3b0(fffffffed20e2a00, ffffff0550b9fde0, 1)
ufs_thread_delete+0xda(ffffffff64704840)
thread_start+8()

> ::panicinfo
             cpu                3
          thread ffffff001e70dc80
         message free: freeing free block, dev:0xb600000024, block:13144, ino:1737885, fs:/export/saba1
             rdi fffffffff783a128
             rsi ffffff001e70d678
             rdx fffffffff783a128
             rcx ffffff001e70d678
              r8 fffffffff783a128
              r9                0
             rax                3
             rbx                0
             rbp ffffff001e70d4d0
             r10 fffffffec3d40580
             r11 ffffff001e70dc80
             r12 fffffffff783a128
             r13 ffffff001e70d678
             r14                3
             r15 fffffffff783a128
          fsbase                0
          gsbase fffffffec3d40580
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb83c860
              cs               30
          rflags              246
             rsp ffffff001e70d488
              ss               38
          gdt_hi                0
          gdt_lo         800001ef
          idt_hi                0
          idt_lo         70000fff
             ldt                0
            task               70
             cr0         8005003b
             cr2         fed0e010
             cr3          2c00000
             cr4              6f8


Jorgen Lundman wrote:
> On Saturday the X4500 system panicked, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed "fsck". This is why
> it did not come back online. /export/saba1 is mounted "logging,noatime",
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul 5 08:48:54 2008
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138
>
> What would be the best solution? Go back to the latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
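(One extra data point that might help the support case: the inode number
in the second panic, ino:1737885, can usually be mapped back to a pathname
with ncheck. A rough sketch, using the raw device path assumed earlier;
the result is only reliable if the filesystem is quiet or unmounted while
it runs.)

   # find the pathname(s) referencing inode 1737885 on the damaged UFS
   ncheck -F ufs -i 1737885 /dev/zvol/rdsk/zpool1/saba1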