Hi all,

We have a system on which about 17 OAS JVM processes are running and to which a number of OAS Apache processes connect (used to be local but since moved to another system for testing the impact). However, for reasons we have not yet been able to determine, the system will have thousands (we've seen 10K+) of TCP connections in the TIME_WAIT state, having ramped up to that in seconds. (There's a lot more details to this problem, but this covers the gist of it.)

So I am trying to determine which JVM PIDs are the ones that are receiving the TCP connections that are ending up in TIME_WAIT to help narrow down the problem. Is there a way to do this via DTrace?

Thanks,
Justin

Justin C. Lloyd
Senior Engineer and System Administrator
303-684-4166 Office
720-480-0380 Cell
303-684-4100 Fax
jlloyd at digitalglobe.com
DigitalGlobe, An Imaging and Information Company
Justin Lloyd writes:
> We have a system on which about 17 OAS JVM processes are running and
> to which a number of OAS Apache processes connect (used to be local
> but since moved to another system for testing the impact). However,
> for reasons we have not yet been able to determine, the system will
> have thousands (we've seen 10K+) of TCP connections in the TIME_WAIT
> state, having ramped up to that in seconds. (There's a lot more
> details to this problem, but this covers the gist of it.) So I am
> trying to determine which JVM PIDs are the ones that are receiving
> the TCP connections that are ending up in TIME_WAIT to help narrow
> down the problem. Is there a way to do this via DTrace?

Back up a bit: why is TIME_WAIT a problem?

It just means that your application issued close() before receiving TCP FIN. It typically happens on the "client" side, but not always.

--
James Carlson, Solaris Networking <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757  42.496N   Fax +1 781 442 1677
Hi Justin,

> We have a system on which about 17 OAS JVM processes are running and to

You have a lot of JVMs. Are you on Solaris 10 or Express?

> problem, but this covers the gist of it.) So I am trying to determine
> which JVM PIDs are the ones that are receiving the TCP connections that
> are ending up in TIME_WAIT to help narrow down the problem. Is there a
> way to do this via DTrace?

Basically you can use tcpsnoop from the DTraceToolkit (DTT):
http://www.brendangregg.com/dtrace.html#DTraceToolkit

Read the sample for tcpsnoop. Have a look at tcptop as well.

Hope it helps,
Stefan
We're on Solaris 10. The JVMs are various OAS containers, and that is just one of the two OAS instances running on this server, but the other one only has 4 JVMs, I think.

I've looked at tcpsnoop in the past, so I'll check it out again wrt this issue. Thanks!

> -----Original Message-----
> From: dtrace-discuss-bounces at opensolaris.org
> [mailto:dtrace-discuss-bounces at opensolaris.org] On Behalf Of Stefan Parvu
> Sent: Wednesday, April 11, 2007 1:47 PM
> To: dtrace-discuss at opensolaris.org
> Subject: Re: [dtrace-discuss] Tracing PIDs to TIME_WAIT states?
>
> Hi Justin,
>
> > We have a system on which about 17 OAS JVM processes are running and to
>
> you have a lot of JVMs. Are you on Solaris 10 or Express ?
>
> > problem, but this covers the gist of it.) So I am trying to determine
> > which JVM PIDs are the ones that are receiving the TCP connections that
> > are ending up in TIME_WAIT to help narrow down the problem. Is there a
> > way to do this via DTrace?
>
> Basically you can use tcpsnoop from DTT.
> http://www.brendangregg.com/dtrace.html#DTraceToolkit
>
> Read the sample for tcpsnoop. As well have a look on
> tcptop.
>
> Hope it helps,
> Stefan
TIME_WAIT itself isn't a problem. Having many thousands when we should normally have a few hundred is the problem. :)

> -----Original Message-----
> From: James Carlson [mailto:james.d.carlson at sun.com]
> Sent: Wednesday, April 11, 2007 1:03 PM
> To: Justin Lloyd
> Cc: dtrace-discuss at opensolaris.org
> Subject: Re: [dtrace-discuss] Tracing PIDs to TIME_WAIT states?
>
> Justin Lloyd writes:
> > We have a system on which about 17 OAS JVM processes are running and
> > to which a number of OAS Apache processes connect (used to be local
> > but since moved to another system for testing the impact). However,
> > for reasons we have not yet been able to determine, the system will
> > have thousands (we've seen 10K+) of TCP connections in the TIME_WAIT
> > state, having ramped up to that in seconds. (There's a lot more
> > details to this problem, but this covers the gist of it.) So I am
> > trying to determine which JVM PIDs are the ones that are receiving
> > the TCP connections that are ending up in TIME_WAIT to help narrow
> > down the problem. Is there a way to do this via DTrace?
>
> Back up a bit: why is TIME_WAIT a problem?
>
> It just means that your application issued close() before receiving
> TCP FIN. It typically happens on the "client" side, but not always.
>
> --
> James Carlson, Solaris Networking <james.d.carlson at sun.com>
> Sun Microsystems / 1 Network Drive        71.232W   Vox +1 781 442 2084
> MS UBUR02-212 / Burlington MA 01803-2757  42.496N   Fax +1 781 442 1677
Justin Lloyd wrote:
> TIME_WAIT itself isn't a problem. Having many thousands when we should
> normally have a few hundred is the problem. :)

See who calls socket, and who calls close on a socket....

    #!/usr/sbin/dtrace -s

    /* Count socket creations per executable. */
    syscall::so_socket:entry
    {
            @a[execname, "socket"] = count();
    }

    /* Arm the getf() probe for the duration of open() and close(). */
    syscall::open:entry
    {
            self->trigger = 1;
    }

    syscall::close:entry
    {
            self->trigger = 1;
    }

    /* getf() returns the file_t for the fd; remember its vnode type. */
    fbt::getf:return
    /self->trigger && arg1/
    {
            self->v_type = ((file_t *)arg1)->f_vnode->v_type;
            self->trigger = 0;
    }

    /* v_type 9 is VSOCK. */
    syscall::open:return
    /self->v_type == 9/
    {
            @[execname, "open socket"] = count();
    }

    /* Clear per-thread state so a stale v_type is not counted again. */
    syscall::open:return
    {
            self->trigger = 0;
            self->v_type = 0;
    }

    syscall::close:return
    /self->v_type == 9/
    {
            @[execname, "close socket"] = count();
    }

    syscall::close:return
    {
            self->trigger = 0;
            self->v_type = 0;
    }

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com       http://blogs.sun.com/barts
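Since all of the JVMs presumably share one execname (typically "java"), an aggregation keyed on execname alone cannot tell the 17 processes apart. The following is a minimal sketch of a per-PID variant, not a tested replacement for the script above. It swaps the getf() approach for the built-in fds[] array, which is an assumption: fds[] is only present on Solaris 10 updates and Nevada builds that ship it. The aggregation name @closes and the one-second interval are illustrative choices.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Count close() calls on socket file descriptors, keyed by pid.
 * Assumes the fds[] array is available on this release; fi_fs is
 * the filesystem type of the descriptor, "sockfs" for sockets.
 */
syscall::close:entry
/fds[arg0].fi_fs == "sockfs"/
{
	@closes[pid, execname] = count();
}

/* Once a second, print the per-process counts and start over. */
tick-1sec
{
	printf("%Y\n", walltimestamp);
	printa("  pid %-6d %-16s %@d socket closes\n", @closes);
	trunc(@closes);
}
```

Run it as root while the TIME_WAIT count is ramping up; a PID whose close count jumps at the same moment is a reasonable first suspect, since the side that closes first is the one left holding the TIME_WAIT entry.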
Justin Lloyd wrote:
> We're on Solaris 10. The JVMs are various OAS containers, and that is
> just one of the two OAS instances running on this server, but the other
> one only has 4 JVMs, I think.

Make sure you are all set at the JVM level before going further with DTrace:

- are you using Sun's HotSpot JVM, and with -server or -client?
- Java 1.4.2_xx or Java 5?
- your heap size
- frequency of the GC
- which collector you are using: serial, concurrent, parallel
- are there any external plugins installed in the web layer which redirect the requests to the OAS?
- check the TCP tuning once more
- experiment with the dvm agent, based on your Java version

> I've looked at tcpsnoop in the past, so I'll check it out again wrt this
> issue. Thanks!

Use Bart's sample or tcpsnoop or even socketsnoop.d to discover which PIDs are creating traffic.

Stefan
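As a complement to those tools, here is a rough sketch for the original question of which PIDs are receiving the connections. It counts successful accept() calls per process and prints the totals once a second. The probe name syscall::accept:return is assumed to match the syscall table on this Solaris 10 release, and the aggregation name @accepts is illustrative.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Count accept() calls that returned without an error, per process. */
syscall::accept:return
/errno == 0/
{
	@accepts[pid, execname] = count();
}

/* Once a second, print the per-process counts and start over. */
tick-1sec
{
	printf("%Y\n", walltimestamp);
	printa("  pid %-6d %-16s %@d accepts\n", @accepts);
	trunc(@accepts);
}
```

Comparing the per-second accept counts against the TIME_WAIT total reported by netstat should show whether the ramp-up lines up with one JVM or is spread across all of them.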
Our DBA group is on top of the Oracle JVM issue, working with Oracle support. We (the IS team) are assisting with trying to determine what is happening "under the hood". The thousands of connections are the cause of a deeper problem: dropped connections. Many of the things you mention are being evaluated and tested in our development and test runtimes (e.g. GC, TCP tuning, etc.).

I'm looking over Bart's sample now. I had started writing something similar, looking at so_socket calls, but his is a bit more in-depth.

Thanks,
Justin

> -----Original Message-----
> From: Stefan Parvu [mailto:stefanparvu14 at yahoo.com]
> Sent: Wednesday, April 11, 2007 4:30 PM
> To: Justin Lloyd
> Cc: dtrace-discuss at opensolaris.org
> Subject: Re: [dtrace-discuss] Tracing PIDs to TIME_WAIT states?
>
> Justin Lloyd wrote:
> > We're on Solaris 10. The JVMs are various OAS containers, and that is
> > just one of the two OAS instances running on this server, but the other
> > one only has 4 JVMs, I think.
>
> make sure you are all set on the JVM level before starting to go further
> with DTrace:
>
> - are you using Sun's JVM HotSpot, -server -client
> - Java 1.4.2_xx or Java 5 ?
> - your heap size
> - frequency of the GC
> - what collector are you using: serial, concurrent, parallel
> - is there any external plugins installed in the web layer which
>   redirects the requests to the OAS
> - check once more the TCP tuning
> - experiment with the dvm agent based on your Java version
>
> > I've looked at tcpsnoop in the past, so I'll check it out again wrt this
> > issue. Thanks!
>
> Use Bart's sample or tcpsnoop or even socketsnoop.d to discover what
> pids are creating traffic.
>
> Stefan