Ryan Johnson
2010-Jul-15 13:51 UTC
[dtrace-discuss] dtrace (userland) seg faults while processing aggregates
Hi all,

Odd behavior here: I'm using the profile probe to sample stack traces in my app, 100+ threads banging hard on a T5440. The script exits after 30 seconds of profiling and seg faults while trying to print the aggregated stack trace counts on exit:

> $ dbx - core
> Corefile specified executable: "/usr/sbin/sparcv9/dtrace"
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.8' in your .dbxrc
> Reading /usr/sbin/sparcv9/dtrace
> core file header read successfully
<snipped long list of .so files>
> program terminated by signal SEGV (no mapping at the fault address)
> 0xffffffff7e8a8ff0: _ndoprnt+0x0004: stx %i2, [%sp + 2223]
> (dbx) bt
> =>[1] _ndoprnt()
>   [2] snprintf()
>   [3] dtrace_uaddr2str()
>   [4] pfprint_uaddr()
>   [5] dt_printf_format()
>   [6] dt_fprinta()
>   [7] dt_aggregate_walk_sorted()
>   [8] dtrace_fprinta()
>   [9] dt_consume_cpu()
>   [10] dtrace_consume()
>   [11] dtrace_work()
>   [12] main()

The problem seems related to the fact that I had specified 'dtrace -x aggsize=256k' -- rerunning with aggsize=2048k makes the problem go away. Note that I didn't see any messages from dtrace about drops, so I can't say whether 256k should have been enough or not. However, the app has a lot of code and deep call stacks, so I could easily imagine 256k not being enough.

Ideas? Should I file a bug report?

Ryan
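For concreteness, here is a minimal sketch of the kind of script described -- profile probe, user stacks counted in an aggregation, exit after 30 seconds. The actual script wasn't posted; the 997 Hz rate, the $target predicate, and the file name are assumptions:

    #!/usr/sbin/dtrace -s
    /* Hypothetical reconstruction -- the original script was not posted. */
    profile-997                     /* sampling rate is an assumption */
    /pid == $target/                /* restrict samples to the traced app */
    {
            @stacks[ustack()] = count();
    }

    tick-30s
    {
            exit(0);        /* aggregation prints on exit -- where the SEGV hits */
    }

Run as, say, 'dtrace -x aggsize=256k -p <pid> -s sample.d'; dtrace prints the aggregated stacks when the script exits, which is exactly the point at which the backtrace above shows the crash.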
Ryan Johnson
2010-Jul-16 15:41 UTC
[dtrace-discuss] dtrace (userland) seg faults while processing aggregates
Update: the bug manifests quite often no matter what aggsize I use. I just got lucky that one time...

Has anyone else seen this?

Ryan

On 7/15/2010 3:51 PM, Ryan Johnson wrote:
> Odd behavior here: I'm using the profile probe to sample stack traces
> in my app, 100+ threads banging hard on a T5440. The script exits
> after 30 seconds of profiling and seg faults while trying to print the
> aggregated stack trace counts on exit:
<snipped remainder of quoted message>
Adam Leventhal
2010-Jul-16 21:22 UTC
[dtrace-discuss] dtrace (userland) seg faults while processing aggregates
Hey Ryan,

Thanks for providing the core file to me offline. This turns out to be a relatively simple bug in libdtrace:

> $C
ffffffff7f7ff6b1 libc.so.1`_ndoprnt+4(ffffffff7f174760, ffffffff7f800e58, ffffffff7f800d38, 0, ff00, 2f000000)
ffffffff7f800481 libc.so.1`snprintf+0x88(ffffffff7f801690, 0, ffffffff7f174760, ffffffff7f800e9c, ffffffff7f801290, c4)
ffffffff7f8005c1 libdtrace.so.1`dtrace_uaddr2str+0x108(1001117e0, 6106, 100521a6c, ffffffff7f801f40, 10d, ffffffff7f801690)
ffffffff7f801691 libdtrace.so.1`pfprint_uaddr+0xd8(1001117e0, 10010b570, ffffffff7fffde54, 6106, 10d, 10)
ffffffff7fffd4d1 libdtrace.so.1`dt_printf_format+0x6f4(ffffffff7fffde55, 100577ed8, 100114d40, ffffffff7fffde55, ffffffff7f276000, ffffffff7f1584bc)
ffffffff7fffd6b1 libdtrace.so.1`dt_fprinta+0x88(0, ffffffff7fffe108, 100114d40, 4, 100577e70, 100577e98)
ffffffff7fffd791 libdtrace.so.1`dt_aggregate_walk_sorted+0x108(1001117e0, ffffffff7f15af84, ffffffff7fffe108, 4c50, 98a, 1013ddb20)
ffffffff7fffd841 libdtrace.so.1`dtrace_fprinta+0x11c(1001117e0, 10010b570, 100114d40, 1, 10010dac8, 1)
ffffffff7fffd941 libdtrace.so.1`dt_consume_cpu+0x78c(1001117e0, 10010b570, 0, ffffffff7f276000, 1, 1009949b8)
ffffffff7fffdb41 libdtrace.so.1`dtrace_consume+0x2f8(1001117e0, 10010b570, 1000044f8, 10000446c, c0, 0)
ffffffff7fffdc01 libdtrace.so.1`dtrace_work+0x108(1001117e0, 10010b570, 1000044f8, 10000446c, 0, 1)
ffffffff7fffdcb1 main+0x1fb0(10010b, 10010b390, 10010b378, 100000, 10010b, 100000)
ffffffff7fffdee1 _start+0x17c(0, 0, 0, 0, 0, 0)

Note how big the frame for dt_printf_format() is:

> ffffffff7fffd4d1-ffffffff7f801691=K
                7fbe40
> (ffffffff7fffd4d1-ffffffff7f801691)%0t1024=D
                8175

That's an 8MB stack frame! Here's the problem in dt_printf_format():

    for (i = 0; i < pfv->pfv_argc; i++, pfd = pfd->pfd_next) {
            ...
            if (pfd->pfd_preflen != 0) {
                    char *tmp = alloca(pfd->pfd_preflen + 1);

                    bcopy(pfd->pfd_prefix, tmp, pfd->pfd_preflen);
                    tmp[pfd->pfd_preflen] = '\0';

                    if ((rval = dt_printf(dtp, fp, tmp)) < 0)
                            return (rval);
            ...

We're allocating unbounded amounts of data off the stack.
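The hazard generalizes: memory from alloca(3C) is released only when the enclosing function returns, so a loop that calls it every iteration pins the sum of all its allocations on the stack at once. Below is a small standalone sketch of the pattern and one bounded alternative (reuse a single heap buffer, grown on demand). The names are made up for illustration; this is not the actual libdtrace fix:

    #include <alloca.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hazardous: each alloca() block lives until the function returns,
     * so n iterations leave n allocations on the stack simultaneously. */
    static void
    print_prefixes_alloca(const char **pfx, size_t n)
    {
            size_t i;

            for (i = 0; i < n; i++) {
                    size_t len = strlen(pfx[i]);
                    char *tmp = alloca(len + 1);    /* accumulates */

                    (void) memcpy(tmp, pfx[i], len);
                    tmp[len] = '\0';
                    (void) fputs(tmp, stdout);
            }
    }

    /* Bounded: one heap buffer, grown only when a longer prefix shows
     * up, freed before returning. Peak usage is the longest prefix. */
    static int
    print_prefixes_heap(const char **pfx, size_t n)
    {
            char *tmp = NULL;
            size_t cap = 0, i;

            for (i = 0; i < n; i++) {
                    size_t len = strlen(pfx[i]);

                    if (len + 1 > cap) {
                            char *p = realloc(tmp, len + 1);
                            if (p == NULL) {
                                    free(tmp);
                                    return (-1);
                            }
                            tmp = p;
                            cap = len + 1;
                    }
                    (void) memcpy(tmp, pfx[i], len);
                    tmp[len] = '\0';
                    (void) fputs(tmp, stdout);
            }
            free(tmp);
            return (0);
    }

    int
    main(void)
    {
            const char *pfx[] = { "aggregated ", "stack ", "counts\n" };

            print_prefixes_alloca(pfx, 3);
            return (print_prefixes_heap(pfx, 3) == 0 ? 0 : 1);
    }

Each pass through the real loop adds another alloca() block that isn't released until dt_printf_format() returns, which is consistent with the 8MB frame shown above.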
You may be able to work around the problem by using ulimit(1) to increase the size of dtrace(1M)'s stack. I've filed the following bug:

  6969970 dt_printf_format() calls alloca(3C) in a loop

Adam

On Fri, Jul 16, 2010 at 05:41:51PM +0200, Ryan Johnson wrote:
> Update: the bug manifests quite often no matter what aggsize I use. I just
> got lucky that one time...
>
> Has anyone else seen this?
<snipped remainder of quoted message>

--
Adam Leventhal, Fishworks                       http://blogs.sun.com/ahl