Fernando Gleiser
2008-Nov-24 19:57 UTC
[dtrace-discuss] debugging a faulty jboss/application
From time to time a jboss process would end up eating all the available CPU and the load average would skyrocket. Once the operators restarted jboss, the system would be normal again (sometimes for weeks) until the next incident.

Since we moved the app from a v440 running Solaris 10 8/07 to a t2000 running Solaris 10 5/08, the problem started happening more frequently (2-3 times a week). The only other modification (besides hardware and OS release) is that we put the app server inside a sparse-zone container with FSS and a 95% cap on CPU usage.

Here's what uptime says:

Mon Nov 24 10:10:00 ARST 2008
10:10am  up 10 day(s), 14:40,  6 users,  load average: 325.52, 320.07, 318.72

(yes, the load average is almost 320, but the server was still usable)

A little output from mpstat shows this:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  531   259  150   46   27   17   43    0    29   99   1   0   0
  1    0   0    3    66    0  124   65   36   10    0    86   99   1   0   0
  2    0   0    0    23    0   34   22   17    9    0     8  100   0   0   0
  3    0   0    0    13    0   13   12    6    9    0     1  100   0   0   0
  4    0   0    0    13    0   14   12    9    9    0     2  100   0   0   0
  5    0   0    0    38    0   69   37   17    8    0    23  100   0   0   0
  6    0   0    0    25    0   40   24   13    8    0     7   95   5   0   0
  7    0   0    0    15    0   17   14   10    9    0     3  100   0   0   0
  8    0   0    0    11    0   10   10    6    8    0     0  100   0   0   0
  9    0   0    0    12    0   11   11    7    9    0     0  100   0   0   0
 10    0   0    0    14    0   13   13    7    9    0     0  100   0   0   0
 11    0   0    0    13    0   12   12    6    9    0     0  100   0   0   0
 12    0   0    0    25    0   37   24   14    8    0    25  100   0   0   0
 13    0   0    0    13    0   13   12    9    9    0     1  100   0   0   0
 14    0   0    0    13    0   12   12    4    9    0     0  100   0   0   0
 15    0   0    0    13    0   12   12    7    9    0     0  100   0   0   0
 16    0   0    0    12    0   11   11    5    7    0     0  100   0   0   0
 17    0   0    0    14    0   13   13    8    7    0     0  100   0   0   0
 18    0   0    0    13    0   12   12    8    9    0     0  100   0   0   0
 19    0   0    1    13    0   12   12    7    8    0     0  100   0   0   0
 20    0   0    7    25    0   35   25   13    9    0     3  100   0   0   0
 21    0   0   72    45    3   78   41   29   11    0    19   99   1   0   0
 22    0   0    3    44    4   71   39   21   11    0   135  100   0   0   0
 23    0   0    3    20    5   16   14    7    9    0     2  100   0   0   0
 24    0   0    3    17    5   13   12    6    9    0     2   99   1   0   0
 25    0   0    1    15    4   10   10    8    8    0     0  100   0   0   0
 26    0   0    8    15    0   17   14    5   10    0     0  100   0   0   0
 27    0   0    0    12    0   11   11    7    8    0     0  100   0   0   0
 28    0   0    1    11    0   10   10    6    8    0     0  100   0   0   0
 29    0   0    2    67    0  126   66   30   11    0    48  100   0   0   0
 30    0   0    0    19    0   25   18   11    8    0   232  100   0   0   0
 31    0   0    1    27    0   47   26   12   10    0    10  100   0   0   0

So 100% CPU in usr mode looks a lot like some kind of infinite loop. I tried to dig into it using dtrace:

profile-1001us
/(execname == "java") && (pid == $1)/
{
        @[jstack()] = count();
}

tick-20s
{
        exit(0);
}

But I got a LOT of these errors:

dtrace: error on enabled probe ID 1 (ID 52070: profile:::profile-1001us): invalid address (0x96a7e000) in action #2

and the stack traces are mostly hex addresses without the names:

  0xf886ca64
  0xf884cec0
  0xf8805d3c
  0xf8805874
  0xf8805c70
  0xf8805c70
  0xf8805d3c
  0xf88380b8
  0xf9509440
  0xf88b7b44
  0xf8805874
  0xf8805874
  0xf8805764
  0xf8805874
  0xf8805874
  0xf8805d3c
  0xf8805874
  0xf8805874
  0xf8805874
  0xf8805874
  0xf8805874
  0xf8805874
  0xf8805764
  0xf8805764
  0xf8805764
  0xf886993c
  0xf9999004
  0xf884cec0
  0xf8805874
  0xf8805d3c
  0xf884cc7c
  0xf994090c
  0xf8838df8
  0xf8800218
  libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x5a0
  libjvm.so`JVM_DoPrivileged+0x500
  libjava.so`Java_java_security_AccessController_doPrivileged__Ljava_security_PrivilegedExceptionAction_2Ljava_security_AccessControlContext_2+0x14
  0xf880e22c
  0xf880e1d0
  0xf89c6470
  0xf9dfc018
  0xf99ee350
  0xf9d74fd8
  0xf8838df8
  0xf8800218
  libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x5a0
  libjvm.so`JVM_DoPrivileged+0x500
  libjava.so`Java_java_security_AccessController_doPrivileged__Ljava_security_PrivilegedExceptionAction_2+0x14
  0xf9467cc8
  0xf9badd44
  9852

So my questions are: any ideas on what to do next? I'm pretty sure it's a jboss/application problem, but I need more data to show the jboss devel people. And what's causing those dtrace errors?

Thanks in advance.
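[For reference, a cleaned-up, runnable version of that script, as a sketch: the script name is hypothetical, and the jstackstrsize setting is an assumption -- the default string space can be too small for deep JVM stacks, which is one reason jstack() frames may come out as bare hex.]

#!/usr/sbin/dtrace -s
/*
 * profile-java.d (hypothetical name): sample on-CPU stacks of the
 * java process whose pid is passed as $1, for 20 seconds.
 */

/* Assumed value; raise it if jstack() output is truncated or frames
 * keep printing as raw addresses. */
#pragma D option jstackstrsize=1024

profile-1001us
/execname == "java" && pid == $1/
{
        /* aggregate by mixed Java/native stack */
        @[jstack()] = count();
}

tick-20s
{
        exit(0);
}

Run as ./profile-java.d <jboss-pid>; aggregations print in ascending order, so the hottest stacks appear last when the script exits.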
Adam Leventhal
2008-Nov-26 22:30 UTC
[dtrace-discuss] debugging a faulty jboss/application
Is java running with the -server option on x86? There's a known bug where jstack() can't capture stacks.

On Nov 24, 2008, at 11:57 AM, Fernando Gleiser wrote:
> [original message quoted in full; trimmed]
-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
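[A quick way to check which VM a running process was started with, as a sketch: the pgrep pattern assumes a single java process on the box, and jinfo availability depends on the JDK version in use.]

# Show the JVM's full argument list; an explicit -server or -client
# flag will appear here:
pargs $(pgrep -x java)

# With a recent-enough JDK, jinfo can report a live process's VM flags:
jinfo -flags <pid>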
William D. Hathaway
2008-Nov-27 02:07 UTC
[dtrace-discuss] debugging a faulty jboss/application
Sorry DTrace isn't working for you here. Try the command-line pstack (Solaris) or jstack (ships with the JDK) tools. I saw a similar situation last year where garbage collection would occasionally go nuts (where "nuts" meant taking several hours to complete a GC cycle). You can use prstat -L to see which java threads are chewing the CPU, then look for those threads in the pstack output, as sketched below.
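[That workflow as a minimal sketch: <pid> and <lwpid> are placeholders; on Solaris the LWP id prstat reports corresponds to the nid= field, printed in hex, in jstack's thread dump.]

# Per-thread microstate accounting; the hottest LWPs sort to the top:
prstat -mLp <pid> 1 1

# Native stack of one hot thread, using pstack's pid/lwpid syntax:
pstack <pid>/<lwpid>

# Full Java-level thread dump; match the hot LWP id against each
# thread's nid= value (hexadecimal) to find its Java stack:
jstack <pid>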
Marcelo Fukushima
2008-Nov-29 00:18 UTC
[dtrace-discuss] debugging a faulty jboss/application
For the stack traces, try running with a different JDK. I was only able to see the stacks properly when I installed JDK 7 (from the OpenJDK builds).

On Wed, Nov 26, 2008 at 8:30 PM, Adam Leventhal <ahl at eng.sun.com> wrote:
> Is java running with the -server option on x86? There's a known bug
> where jstack() can't capture stacks.
> [rest of quoted thread trimmed]
-- 
[]'s
Marcelo Takeshi Fukushima