ontario# uname -a
SunOS ontario 5.10 Generic_118822-23 sun4v sparc SUNW,Sun-Fire-T200
ontario# cat /etc/release
Solaris 10 1/06 s10s_u1wos_18 SPARC
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 08 November 2005

ontario# ls -l
total 2156818
-rw-r--r-- 1 root root 2 Mar 30 17:03 bounds
-rw-r--r-- 1 root root 1479995 Mar 30 17:02 unix.0
-rw-r--r-- 1 root root 1102249984 Mar 30 17:03 vmcore.0
ontario# mdb -k 0
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl emlxs nca lofs nfs ssd sppp logindmux ptm md cpc fcip crypto ipc ]
> $c
dtrace_aggregate+0x1c0(18, 0, 18, 6001e3fffe8, 17420, 0)
dtrace_probe+0x578(0, 60007520eb8, 183b4e0, 183b960, 60006822000, 0)
0x140417c(2a100d71cc0, 0, 60006027c40, 3b, 7, 1)
resume+4(2a100d71cc0, 48d559, 300013a8178, 1855400, 300013a8000, 2a100d71cc0)
sema_p+0x130(60007253d58, 0, 60006027c40, 3b, 7, 1)
biowait+0x6c(60007253c98, 0, 183e800, 300013a8000, 60005025e70, 60007253c98)
default_physio+0x388(7befdf40, 800, 0, 60007253cd8, 7befd760, 60007253cd0)
pread64+0x1e8(c, 26000, 800, 0, 0, 600079cbdf8)
syscall_trap32+0xcc(c, 26000, 800, 2, e7faa000, 800)
> ::status
debugging crash dump vmcore.0 (64-bit) from ontario
operating system: 5.10 Generic_118822-23 (sun4v)
panic message: BAD TRAP: type=28 rp=2a10333d2c0 addr=7bbdb788 mmu_fsr=0
dump content: kernel pages only
>

panic[cpu18]/thread=300039a0fc0: BAD TRAP: type=28 rp=2a10333d2c0 addr=7bbdb788 mmu_fsr=0

vdblite: integer divide zero trap:
addr=0x7bbdb788
pid=2079, pc=0x7bbdb788, sp=0x2a10333cb61, tstate=0x4480001403, context=0x103e
g1-g7: 60012000000, 0, 0, 0, 0, 1c, 300039a0fc0

000002a10333cfe0 unix:die+9c (28, 2a10333d2c0, 7bbdb788, 0, 2a10333d0a0, 10000)
  %l0-3: 0000000000000000 0000000000000028 0000000000000028 0000060006027c40
  %l4-7: 0000000000000000 0000000000000000 000000000000000b 0000000001072c00
000002a10333d0c0 unix:trap+564 (2a10333d2c0, 58, 0, 0, 300013a8000, 300039a0fc0)
  %l0-3: 0000000000000000 000006000607c098 0000000000000028 0000060006027c40
  %l4-7: 0000000000000000 0000000000000000 000000000000000b 00000300013a8180
000002a10333d210 unix:ktl0+64 (0, 0, 60012000000, 410, 0, 0)
  %l0-3: 00000300013a8000 0000000000000090 0000004480001403 000000000101c370
  %l4-7: 0000000000000000 0000000000000000 000000000000000b 000002a10333d2c0
000002a10333d360 0 (18, 0, 18, 6001e3fffe8, 17420, 0)
  %l0-3: 000006000755e480 0000060001d35978 000006001e000000 0000000000000018
  %l4-7: 0000000000000000 0000000000000000 0000000000000007 00000000f0000000
000002a10333d410 dtrace:dtrace_probe+578 (0, 60007520eb8, 183b4e0, 183b960, 60006822000, 0)
  %l0-3: 0000060001d35978 0000060012000000 0000060001755480 000002a10333d560
  %l4-7: 0000000000000012 0000000000000010 0000000000000004 0000060006a4f040
000002a10333d5f0 140417c (2a100d71cc0, 0, 60006027c40, 3b, 7, 1)
  %l0-3: 0000000000000000 00000000018ac018 00000000018588e8 000000000183eb00
  %l4-7: 000002a10333daa8 00000000018b7270 0000000000095cd4 0000000000095cd3
000002a10333d6a0 unix:resume+4 (2a100d71cc0, 48d559, 300013a8178, 1855400, 300013a8000, 2a100d71cc0)
  %l0-3: 0000000000000000 00000000018ac018 00000000018588e8 000000000183eb00
  %l4-7: 000002a10333daa8 00000000018b7270 0000000000095cd4 0000000000095cd3
000002a10333d750 genunix:sema_p+130 (60007253d58, 0, 60006027c40, 3b, 7, 1)
  %l0-3: 0000000000000000 00000000018ac018 00000000018588e8 000000000183eb00
  %l4-7: 000002a10333daa8 00000000018b7270 0000000000095cd4 0000000000095cd3
000002a10333d800 genunix:biowait+6c (60007253c98, 0, 183e800, 300013a8000, 60005025e70, 60007253c98)
  %l0-3: 0000060007253c98 0000060006bbb2b8 00000300013a8310 0000000002000000
  %l4-7: 000002a10333daa8 0000000000000061 000002a10333dac8 0000000000026000
000002a10333d8b0 genunix:default_physio+388 (7befdf40, 800, 0, 60007253cd8, 7befd760, 60007253cd0)
  %l0-3: 0000060007253c98 0000060006bbb2b8 0000000000200000 0000000002000000
  %l4-7: 000002a10333daa8 0000000000000061 000002a10333dac8 0000000000026000
000002a10333d9e0 genunix:pread64+1e8 (c, 26000, 800, 0, 0, 600079cbdf8)
  %l0-3: 0000000080000000 0000000000000000 00000600079b8f00 000002a10333dac8
  %l4-7: 000000000185d908 0000000000000001 0000000000002001 0000000200000000

syncing file systems... 6 1 done
dumping to /dev/dsk/c2t0d0s1, offset 859701248, content: kernel
>

This is my dtrace script.

ontario# more sched.d
#!/usr/sbin/dtrace -Fs

sched:::on-cpu
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @[cpu] = quantize(timestamp - self->ts);
        self->ts = 0;
}

This message posted from opensolaris.org
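For readers new to the script above: it stores a nanosecond timestamp when a thread goes on CPU and, when the thread comes off, adds the elapsed time to a per-CPU quantize() aggregation, which groups values into power-of-two buckets. The following is only a rough shell sketch of that bucketing for non-negative values; the helper name bucket is made up for illustration, and this is not how DTrace implements it.

```shell
# Illustrative only: power-of-two bucketing in the style of quantize().
# "bucket" is a hypothetical helper, not part of DTrace.
# Prints the largest power of two <= the argument (0 for an argument of 0).
bucket() {
    v=$1
    b=0
    if [ "$v" -gt 0 ]; then
        b=1
        # Double the lower bound while the next bucket still fits.
        while [ $(( b * 2 )) -le "$v" ]; do
            b=$(( b * 2 ))
        done
    fi
    echo "$b"
}

bucket 1000    # an on-CPU interval of 1000 ns lands in the 512 row
bucket 4096    # an exact power of two is its own lower bound
```

Each row printed by quantize() counts the values at or above that row's lower bound and below the next one, which is what the helper computes for a single value.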
Eric C. Saxe
2006-Mar-31 02:26 UTC
[dtrace-discuss] Re: dtrace panic - ed my system SF T200
Hi John,

Is there a way you can provide us with the crash dump? If so, could you (privately) send me instructions for retrieving it? I'd be happy to take a look. Clearly, something went very wrong...

Thanks,
-Eric
On Thu, Mar 30, 2006 at 05:12:31PM -0800, Qiang Liu wrote:
> ontario# uname -a
> SunOS ontario 5.10 Generic_118822-23 sun4v sparc SUNW,Sun-Fire-T200
> ontario# cat /etc/release
> Solaris 10 1/06 s10s_u1wos_18 SPARC
> Copyright 2005 Sun Microsystems, Inc. All Rights Reserved.
> Use is subject to license terms.
> Assembled 08 November 2005
>
> ontario# ls -l
> total 2156818
> -rw-r--r-- 1 root root 2 Mar 30 17:03 bounds
> -rw-r--r-- 1 root root 1479995 Mar 30 17:02 unix.0
> -rw-r--r-- 1 root root 1102249984 Mar 30 17:03 vmcore.0
> ontario# mdb -k 0
> Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl emlxs nca
> lofs nfs ssd sppp logindmux ptm md cpc fcip crypto ipc ]
> > $c
> dtrace_aggregate+0x1c0(18, 0, 18, 6001e3fffe8, 17420, 0)
> dtrace_probe+0x578(0, 60007520eb8, 183b4e0, 183b960, 60006822000, 0)
> 0x140417c(2a100d71cc0, 0, 60006027c40, 3b, 7, 1)
> resume+4(2a100d71cc0, 48d559, 300013a8178, 1855400, 300013a8000, 2a100d71cc0)
> sema_p+0x130(60007253d58, 0, 60006027c40, 3b, 7, 1)
> biowait+0x6c(60007253c98, 0, 183e800, 300013a8000, 60005025e70, 60007253c98)
> default_physio+0x388(7befdf40, 800, 0, 60007253cd8, 7befd760, 60007253cd0)
> pread64+0x1e8(c, 26000, 800, 0, 0, 600079cbdf8)
> syscall_trap32+0xcc(c, 26000, 800, 2, e7faa000, 800)

This is a dup of 6274126, fixed in Build 14 of Nevada and (ironically) integrated in the build after the build you have of S10U1: s10s_u1wos_19. (Is this a beta Niagara? I didn't think that FCS Niagaras snuck off without the fix for this bug...)

And for what it's worth, this isn't actually a bug in DTrace -- it's a really nasty VM bug that DTrace happened to trip over. (In particular, the TSB retains a stale entry for 256MB mappings; I can forward my analysis from when I debugged this back in October if people are interested.)

You can upgrade to s10s_u1wos_19 or later, or you can work around the bug by adding the following line to /etc/system:

        set segkmem_lpsize=0x400000

(This reduces the large page size used for the kernel heap from 256MB to 4MB, thereby eliminating the possibility of the 256MB TSB entry altogether.)

- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
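As a quick sanity check on the numbers in the workaround, the hex value can be converted to megabytes with shell arithmetic. This is a minimal sketch assuming a POSIX shell (ksh or bash rather than the old Bourne /bin/sh); the helper name to_mb is invented for illustration.

```shell
# Hypothetical helper: convert a page size in bytes to MB (1 MB = 1048576 bytes).
# Assumes a POSIX shell, where $(( )) accepts hexadecimal constants.
to_mb() {
    echo $(( $1 / 1048576 ))
}

to_mb 0x400000      # the segkmem_lpsize value from the workaround: prints 4
to_mb 0x10000000    # 256MB written in hex, for comparison: prints 256
```

After editing /etc/system, a reboot is needed before the new value is used, since that file is read at boot.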
> This is a dup of 6274126, fixed in Build 14 of Nevada and (ironically)
> integrated in the build after the build you have of S10U1: s10s_u1wos_19.
> (Is this a beta Niagara? I didn't think that FCS Niagaras snuck off without
> the fix for this bug...) And for what it's worth, this isn't actually a
> bug in DTrace -- it's a really nasty VM bug that DTrace happened to trip
> over. (In particular, the TSB retains a stale entry for 256MB mappings;
> I can forward my analysis when I debugged this back in October if people
> are interested.)
>
> You can upgrade to s10s_u1wos_19 or later, or you can work around the bug
> by adding the following line to /etc/system:
>
> set segkmem_lpsize=0x400000
>
> (This reduces the large page size used for the kernel heap from 256MB
> to 4MB, thereby eliminating the possibility of the 256MB TSB entry
> altogether.)

Amazingly, some people have already managed to draw sweeping conclusions about DTrace from this bug (albeit in private to me), so let me be a little clearer:

(1) This was NOT a bug in DTrace; DTrace happened to be tripping over a low-level VM bug that other subsystems also tripped over (i.e., this was not the only manifestation).

(2) This bug affects Niagara ONLY.

(3) Only pre-release Niagaras should be affected; this bug was fixed before Niagara formally shipped, and I believe that all revenue-released Niagaras have the fix.

Sorry to be so strident, but hopefully everyone (now) gets the message about the scope of this problem. And as a warning, if I get any more "so much for DTrace being safe" mails, the next message from me on the subject will be written with caps lock on... ;)

- Bryan
...
>> bug in DTrace -- it's a really nasty VM bug that DTrace happened to trip
>> over. (In particular, the TSB retains a stale entry for 256MB mappings;
>> I can forward my analysis when I debugged this back in October if people
>> are interested.)
...

Yes this was a really horrible bug in the large page support for seg_kmem (the segment driver which backs all kmem_alloc() memory in the VM system).

> Sorry to be so strident, but hopefully everyone (now) gets the message
> about the scope of this problem. And as a warning, if I get any more "so
> much for DTrace being safe" mails, the next message from me on the subject
> will be written with caps lock on... ;)

Agreed. DTrace is no different from any other subsystem in the kernel, in that it depends on lots of stuff to function correctly. When that "stuff" is broken, the system fails. It's simply a reality of writing software.

Arguing that bugs like this make DTrace somehow unsafe is no different than blaming a programmer when his program crashes due to a bug in a shared library his program links to. Such arguments hold no water.

DTrace and ZFS both at their inception introduced usage patterns and demands on the VM system which over-extended its capabilities. This is not their fault; VM has been in break-and-fix mode for fifteen years, and further it lacks a comprehensive test suite to anticipate such problems before they occur. See Bill Moore's blog, http://blogs.sun.com/bill for a nice treatise on the relationship between test suites and resultant code quality.

The VM system is an antiquated design which has been extended well beyond its original design points; additionally it has undergone a lot of code churn in the last year or so to support features required for Niagara. Such factors inevitably conspire to create a few mishaps.

Fortunately we're more than a year now into a complete redesign of VM. Things will get better, but it will take time, because it's a hard problem.

We aspire to have our code be nearly as clean and bug-free as DTrace and ZFS, which are both *amazing* pieces of work. Please don't blame them that our garbage stinks.

- Eric
Hi Bryan,

Thanks so much for the quick reply, I can tell you for sure that this system is a prototype Niagara system (beta Niagara?)

I want to apply the workaround first.

You said the revenue-released Niagara should have the fix, do you mean the fix is in the hardware? If I get a revenue-released Niagara, do I have to upgrade the OS to s10s_u1wos_19 or later?

john
Bryan Cantrill
2006-Mar-31 19:00 UTC
[dtrace-discuss] Re: dtrace panic - ed my system SF T200
Hey John,

> Thanks so much for the quick reply, I can tell you for sure that this
> system is a prototype Niagara system (beta Niagara?)

Phew!

> I want to apply the workaround first.

That'll definitely do the trick.

> You said the revenue-released Niagara should have the fix, do you mean
> the fix is in the hardware? If I get a revenue-released Niagara, do I
> have to upgrade the OS to s10s_u1wos_19 or later?

If you have a revenue-released Niagara, the fix will simply _come_ with the box -- the bits that they're installing at the factory are recent enough to have the fix. (Indeed, as I recall, this bug was one of a handful of last minute showstopper bugs before Niagara shipped.)

Alternatively, you can upgrade your existing (prototype) Niagara to newer S10 bits, or a recent Nevada or OpenSolaris build. Any of them should contain the fix.

- Bryan
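To make the "s10s_u1wos_19 or later" rule concrete, here is a minimal sketch that compares an S10U1 build string against build 19. It assumes a POSIX shell and that the build number is the digits after the last underscore, as in the /etc/release output earlier in the thread; the function name needs_workaround is made up, and Nevada or OpenSolaris build strings are not handled.

```shell
# Hypothetical helper: does this S10U1 build predate the fix in s10s_u1wos_19?
# Returns 0 if the build is older than 19, 1 if it is 19 or later,
# and 2 if the string does not end in a plain build number.
needs_workaround() {
    build=${1##*_}                  # "s10s_u1wos_18" -> "18"
    case $build in
        ''|*[!0-9]*) return 2 ;;    # cannot tell from this string
    esac
    [ "$build" -lt 19 ]
}

if needs_workaround s10s_u1wos_18; then
    echo "build predates the fix: upgrade or set segkmem_lpsize"
fi
```

On a live system the build string could be taken from the first line of /etc/release, for example with awk 'NR==1 {print $4}' /etc/release, assuming the field layout shown in the original post.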