On Saturday the X4500 system panicked, and rebooted. For some reason the
/export/saba1 UFS partition was corrupt, and needed "fsck". This is why
it did not come back online. /export/saba1 is mounted "logging,noatime",
so fsck should never (-ish) be needed.

SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc

/export/saba1 on /dev/zvol/dsk/zpool1/saba1
read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
on Sat Jul 5 08:48:54 2008


One possible related bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138


What would be the best solution? Go back to the latest Solaris 10 and pass
it on to Sun support, or find a patch for this problem?


Panic dump follows:

-rw-r--r--   1 root     root        2529300 Jul  5 08:48 unix.2
-rw-r--r--   1 root     root    10133225472 Jul  5 09:10 vmcore.2

# mdb unix.2 vmcore.2
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc
pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba uhci s1394 qlc fctl
nca lofs zfs random cpc crypto fcip fcp logindmux nsctl sdbc ptm sv ii
sppp rdc nfs ]

> $c
vpanic()
vcmn_err+0x28(3, fffffffff783ade0, ffffff001e737aa8)
real_panic_v+0xf7(0, fffffffff783ade0, ffffff001e737aa8)
ufs_fault_v+0x1d0(fffffffed0bfb980, fffffffff783ade0, ffffff001e737aa8)
ufs_fault+0xa0()
dqput+0xce(ffffffff1db26ef0)
dqrele+0x48(ffffffff1db26ef0)
ufs_trans_dqrele+0x6f(ffffffff1db26ef0)
ufs_idle_free+0x16d(ffffff04f17b1e00)
ufs_idle_some+0x152(3f60)
ufs_thread_idle+0x1a1()
thread_start+8()

> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc2fc10  1b    0    0  60   no    no t-0    ffffff001e737c80 sched
  1 fffffffec3a0a000  1f    1    0  -1   no    no t-0    ffffff001e971c80 (idle)
  2 fffffffec3a02ac0  1f    0    0  -1   no    no t-1    ffffff001e9dbc80 (idle)
  3 fffffffec3d60580  1f    0    0  -1   no    no t-1    ffffff001ea50c80 (idle)

> ::panicinfo
             cpu                0
          thread ffffff001e737c80
         message dqput: dqp->dq_cnt == 0
             rdi fffffffff783ade0
             rsi ffffff001e737aa8
             rdx fffffffff783ade0
             rcx ffffff001e737aa8
              r8 fffffffff783ade0
              r9                0
             rax                3
             rbx                0
             rbp ffffff001e737900
             r10 fffffffffbc26fb0
             r11 ffffff001e737c80
             r12 fffffffff783ade0
             r13 ffffff001e737aa8
             r14                3
             r15 fffffffff783ade0
          fsbase                0
          gsbase fffffffffbc26fb0
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb83c860
              cs               30
          rflags              246
             rsp ffffff001e7378b8
              ss               38
          gdt_hi                0
          gdt_lo         e00001ef
          idt_hi                0
          idt_lo         77c00fff
             ldt                0
            task               70
             cr0         8005003b
             cr2         fee7d650
             cr3          2c00000
             cr4              6f8

> ::msgbuf
quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs /export/zero1)
quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs /export/zero1)

panic[cpu0]/thread=ffffff001e737c80:
dqput: dqp->dq_cnt == 0

ffffff001e737930 genunix:vcmn_err+28 ()
ffffff001e737980 ufs:real_panic_v+f7 ()
ffffff001e7379e0 ufs:ufs_fault_v+1d0 ()
ffffff001e737ad0 ufs:ufs_fault+a0 ()
ffffff001e737b00 ufs:dqput+ce ()
ffffff001e737b30 ufs:dqrele+48 ()
ffffff001e737b70 ufs:ufs_trans_dqrele+6f ()
ffffff001e737bc0 ufs:ufs_idle_free+16d ()
ffffff001e737c10 ufs:ufs_idle_some+152 ()
ffffff001e737c60 ufs:ufs_thread_idle+1a1 ()
ffffff001e737c70 unix:thread_start+8 ()

syncing file systems...

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
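(For anyone following the recovery step above: a minimal sketch of the
commands it boils down to, assuming the raw device path simply mirrors
the block device in the mount output; untested against this particular
setup.)

   # confirm the options the filesystem is actually mounted with
   mount | grep saba1

   # force a full check of the UFS on the zvol even if the superblock is
   # marked clean; /dev/zvol/rdsk/zpool1/saba1 is assumed to be the raw
   # counterpart of /dev/zvol/dsk/zpool1/saba1 shown above
   fsck -F ufs -o f /dev/zvol/rdsk/zpool1/saba1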
Jorgen Lundman wrote:
> On Saturday the X4500 system panicked, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed "fsck". This is why
> it did not come back online. /export/saba1 is mounted "logging,noatime",
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul 5 08:48:54 2008
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138

Yes, that bug is possibly related. However, the panic stacks listed in it
do not match yours.

> What would be the best solution? Go back to the latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?

Since the panic stack only ever goes through ufs, you should
log a call with Sun support.

...

> > ::msgbuf
> quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs
> /export/zero1)
> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
> /export/zero1)
>
> panic[cpu0]/thread=ffffff001e737c80:
> dqput: dqp->dq_cnt == 0
>
> ffffff001e737930 genunix:vcmn_err+28 ()
> ffffff001e737980 ufs:real_panic_v+f7 ()
> ffffff001e7379e0 ufs:ufs_fault_v+1d0 ()
> ffffff001e737ad0 ufs:ufs_fault+a0 ()
> ffffff001e737b00 ufs:dqput+ce ()
> ffffff001e737b30 ufs:dqrele+48 ()
> ffffff001e737b70 ufs:ufs_trans_dqrele+6f ()
> ffffff001e737bc0 ufs:ufs_idle_free+16d ()
> ffffff001e737c10 ufs:ufs_idle_some+152 ()
> ffffff001e737c60 ufs:ufs_thread_idle+1a1 ()
> ffffff001e737c70 unix:thread_start+8 ()

Although.... given the entry in the msgbuf, perhaps
you might want to fix up your quota settings on that
particular filesystem.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp          http://www.jmcp.homeunix.com/blog
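(To spell out the quota check being suggested: a rough sketch with the
standard UFS quota tools, using the filesystem name from the msgbuf; the
username is only a placeholder.)

   # report current usage and limits for every user on the filesystem
   repquota -v /export/zero1

   # rebuild the quotas file from actual usage if it looks suspect after
   # the panic (best done while the filesystem is quiet)
   quotacheck -v /export/zero1

   # adjust an individual user's limits
   edquota someuser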
> Since the panic stack only ever goes through ufs, you should
> log a call with Sun support.

We do have support, but they only speak Japanese, and I'm still quite
poor at it. But I have started the process of having it translated and
passed along to the next person. It is always fun to see what it becomes
at the other end. Meanwhile, I like to research and see if it is an
already-known problem, rather than just sit around and wait.

>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
>> /export/zero1)
>
> Although.... given the entry in the msgbuf, perhaps
> you might want to fix up your quota settings on that
> particular filesystem.
>

Customers pay for a certain amount of disk quota and, being users, always
stay close to the edge. Those messages are as constant as precipitation in
the rainy season.

Are you suggesting they indicate a problem beyond the user being out of
space?

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
Jorgen Lundman wrote:
>> Since the panic stack only ever goes through ufs, you should
>> log a call with Sun support.
>
> We do have support, but they only speak Japanese, and I'm still quite
> poor at it. But I have started the process of having it translated and
> passed along to the next person. It is always fun to see what it becomes
> at the other end. Meanwhile, I like to research and see if it is an
> already-known problem, rather than just sit around and wait.

That sounds like a learning opportunity :-)

>>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
>>> /export/zero1)
>>
>> Although.... given the entry in the msgbuf, perhaps
>> you might want to fix up your quota settings on that
>> particular filesystem.
>
> Customers pay for a certain amount of disk quota and, being users, always
> stay close to the edge. Those messages are as constant as precipitation in
> the rainy season.
>
> Are you suggesting they indicate a problem beyond the user being out of
> space?

I don't know, I'm not a UFS expert (heck, I'm not an expert
on _anything_). Have you investigated putting your paying
customers onto zfs and managing quotas with zfs properties
instead of ufs?


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp          http://www.jmcp.homeunix.com/blog
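(For reference, the property-based approach mentioned here works per
filesystem rather than per user, so it generally means one dataset per
customer. A rough sketch, with made-up dataset names and sizes:)

   # one dataset per customer, each with its own quota property
   zfs create zpool1/export/customer1
   zfs set quota=10G zpool1/export/customer1

   # review what is set and how much is used
   zfs get quota,used,available zpool1/export/customer1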
> I don't know, I'm not a UFS expert (heck, I'm not an expert
> on _anything_). Have you investigated putting your paying
> customers onto zfs and managing quotas with zfs properties
> instead of ufs?

Yep, we spent about 6 weeks during the trial period of the X4500 trying to
find a way for ZFS to replace the current NetApps. The history of this
mailing list should have it, and thanks to everyone who helped. But it was
just not possible.

Perhaps now it can be done, using mirror-mounts, but the 50-odd servers
hanging off the X4500 don't all support them, so it would still not be
feasible. Unless there has been some advancement in ZFS in the last 6
months I am not aware of... like user quotas?

Thanks for your assistance.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
Today we had another panic, at least it was during work time :) Just a
shame the 999GB UFS takes 80+ mins to fsck. (Yes, it is mounted 'logging'.)

panic[cpu3]/thread=ffffff001e70dc80:
free: freeing free block, dev:0xb600000024, block:13144, ino:1737885,
fs:/export/saba1

ffffff001e70d500 genunix:vcmn_err+28 ()
ffffff001e70d550 ufs:real_panic_v+f7 ()
ffffff001e70d5b0 ufs:ufs_fault_v+1d0 ()
ffffff001e70d6a0 ufs:ufs_fault+a0 ()
ffffff001e70d770 ufs:free+38f ()
ffffff001e70d830 ufs:indirtrunc+260 ()
ffffff001e70dab0 ufs:ufs_itrunc+738 ()
ffffff001e70db60 ufs:ufs_trans_itrunc+128 ()
ffffff001e70dbf0 ufs:ufs_delete+3b0 ()
ffffff001e70dc60 ufs:ufs_thread_delete+da ()
ffffff001e70dc70 unix:thread_start+8 ()

syncing file systems...

panic[cpu3]/thread=ffffff001e70dc80:
panic sync timeout

dumping to /dev/dsk/c6t0d0s1, offset 65536, content: kernel

> $c
vpanic()
vcmn_err+0x28(3, fffffffff783a128, ffffff001e70d678)
real_panic_v+0xf7(0, fffffffff783a128, ffffff001e70d678)
ufs_fault_v+0x1d0(ffffff04facf65c0, fffffffff783a128, ffffff001e70d678)
ufs_fault+0xa0()
free+0x38f(ffffff001e70d8d0, a6a7358, 2000, 89)
indirtrunc+0x260(ffffff001e70d8d0, a6a42b8, ffffffffffffffff, 0, 89)
ufs_itrunc+0x738(ffffff0550b9fde0, 0, 81, fffffffec0594db0)
ufs_trans_itrunc+0x128(ffffff0550b9fde0, 0, 81, fffffffec0594db0)
ufs_delete+0x3b0(fffffffed20e2a00, ffffff0550b9fde0, 1)
ufs_thread_delete+0xda(ffffffff64704840)
thread_start+8()

> ::panicinfo
             cpu                3
          thread ffffff001e70dc80
         message free: freeing free block, dev:0xb600000024, block:13144, ino:1737885, fs:/export/saba1
             rdi fffffffff783a128
             rsi ffffff001e70d678
             rdx fffffffff783a128
             rcx ffffff001e70d678
              r8 fffffffff783a128
              r9                0
             rax                3
             rbx                0
             rbp ffffff001e70d4d0
             r10 fffffffec3d40580
             r11 ffffff001e70dc80
             r12 fffffffff783a128
             r13 ffffff001e70d678
             r14                3
             r15 fffffffff783a128
          fsbase                0
          gsbase fffffffec3d40580
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb83c860
              cs               30
          rflags              246
             rsp ffffff001e70d488
              ss               38
          gdt_hi                0
          gdt_lo         800001ef
          idt_hi                0
          idt_lo         70000fff
             ldt                0
            task               70
             cr0         8005003b
             cr2         fed0e010
             cr3          2c00000
             cr4              6f8


Jorgen Lundman wrote:
> On Saturday the X4500 system panicked, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed "fsck". This is why
> it did not come back online. /export/saba1 is mounted "logging,noatime",
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul 5 08:48:54 2008
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138
>
> What would be the best solution? Go back to the latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500      (cell)
Japan                | +81 (0)3 -3375-1767      (home)
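(One extra data point that might help the support case: the inode number
in the second panic, ino:1737885, can usually be mapped back to a pathname
with ncheck. A rough sketch, using the raw device path assumed earlier;
the result is only reliable if the filesystem is quiet or unmounted while
it runs.)

   # find the pathname(s) referencing inode 1737885 on the damaged UFS
   ncheck -F ufs -i 1737885 /dev/zvol/rdsk/zpool1/saba1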