Hello,

We had a situation on a T5140 running Solaris 10 where the load averages from prstat were 20-25, 30-40, 60-80. We ran a simple DTrace command to capture jstack() output via the pid provider, using the PID from prstat. That action pushed the 1-minute load average to 120. Not only were we unable to get any output, the system became unresponsive to SSH sessions; Ctrl-C was ignored for at least 30 seconds. Some other simple DTrace one-liners to aggregate syscalls per PID resulted in the same behaviour. Any DTrace action during that time pushed the load up significantly. I would like to believe this is a case of operator error rather than a DTrace issue. Please let us know if we missed something, or whether there are tunables for these situations.

DTrace command used:

    dtrace -n 'pid1234:::entry { @[jstack()] = count(); }'

and some other simple DTrace one-liners to aggregate syscalls per PID.

Thanks
--
This message posted from opensolaris.org
On Thu, May 20, 2010 at 7:46 AM, tester <solaris.identity at gmail.com> wrote:
> We had a situation on a t5140 running solaris 10 where the load average
> from prstat were 20-25,30-40,60-80. We ran a simple dtrace cmd to capture
> the jstack using the pid provider using the PID from prstat. That action
> pushed the 1 minute load average to 120. [snip]
>
> dtrace cmd used
> dtrace -n 'pid1234:::entry{@[jstack()] = count()}'

The pid provider is fairly heavyweight. What's more, you've instrumented every single function in that process (including all libraries). This is going to have a significant performance impact on the application, and probably on the server itself, if the application is the main thing running there.

A much better way to get what you want is the following:

    profile-397
    /pid == 1234/
    {
        @[jstack()] = count();
    }

This will give you a feel for where the application is spending all of its time, but it does so in a far more lightweight manner. Doing it this way has a secondary benefit in that it can identify hot spots inside functions. Your original script will only tell you how often particular functions are entered and says nothing about how much time is spent in those functions.

Chad
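One way to see why the pid provider enabling is so heavy, as Chad describes, is to list (without enabling) every probe the original one-liner matches. This is a sketch only: 1234 stands in for the real PID from prstat, and for a JVM with many libraries loaded the count can easily run to tens of thousands of function-entry probes.

```d
/*
 * List every probe pid1234:::entry would create, one per function
 * entry point in the process and all of its loaded libraries:
 *
 *     dtrace -l -n 'pid1234:::entry' | wc -l
 *
 * By contrast, the profile-based script enables a single timer probe
 * per CPU, regardless of how large the target process is.
 */
profile-397
/pid == 1234/
{
        @[jstack()] = count();
}
```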
Chad: Thanks for the information.
On Thu, May 20, 2010 at 6:46 AM, tester <solaris.identity at gmail.com> wrote:
> We had a situation on a t5140 running solaris 10 where the load average
> from prstat were 20-25,30-40,60-80.

The T5140 presumably has two 8-core x 8-strand processors. On this box, a load average of 128 indicates that there is work for all of the CPU strands. That is, a T5140 with a load average of 128 has a saturated CPU, just like a Netra T1 with one UltraSPARC IIi processor has processor saturation when the load average is 1.

Put another way, a load average of 20 means the processor is 15.6% utilized. Of course this is simplistic - you need to look at other factors, particularly corestat (use google).

--
Mike Gerdts
http://mgerdts.blogspot.com/
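Mike's arithmetic can be checked directly. A minimal sketch, assuming a fully populated T5140 (two UltraSPARC T2 Plus sockets, 8 cores x 8 strands each, so 128 hardware strands):

```python
def load_to_utilization(load_avg, hw_threads):
    """Rough CPU utilization implied by a load average, treating each
    hardware strand as one unit of dispatchable capacity."""
    return load_avg / hw_threads

# 2 sockets x 8 cores x 8 strands = 128 strands on a T5140.
strands = 2 * 8 * 8
print(f"{load_to_utilization(20, strands):.1%}")   # 15.6%
print(f"{load_to_utilization(128, strands):.0%}")  # 100% - saturated
```

This is the same simplification Mike flags: load average counts runnable threads, not how effectively each strand's pipeline is used, which is why he points to corestat for per-core utilization.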
On 05/20/10 08:36, Mike Gerdts wrote:
> The T5140 presumably has two 8 core x 8 strand processors. On this
> box, a load average of 128 indicates that there is work for all of the
> CPU strands. [snip]
>
> Put another way, a load average of 20 means the processor is 15.6%
> utilized.

Perhaps we should consider a new load average that is normalized to the number of CPUs....

- Bart

--
Bart Smaalders
Solaris Kernel Performance
bart.smaalders at oracle.com
http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
Hi,

Sorry to bring up an old thread. I need some general guidance on how to use DTrace under load that is greater than normal. Again, this is on a T2+ system running with a 1-minute load average of 10-15. Running the script below shoots the load up to the 50-60 range and distorts any test.

    profile-397
    /pid == 18900/
    {
        @[jstack(40,1024)] = count();
    }

    tick-5s
    {
        trunc(@,10);
        printa(@);
        trunc(@);
    }

The other symptom is that the system becomes unresponsive. Any thoughts?

Thanks
On Tue, Jun 22, 2010 at 5:00 PM, tester <solaris.identity at gmail.com> wrote:
> Sorry to bring up an old thread. I need some general guidance on how to
> use dtrace during loads that is greater than normal. Again using on T2+
> systems running with a 1 minute load average of 10-15. Running the below
> script shoots up load to 50-60 range and distorts any test.

T2+ implies a multi-processor system with 8 cores x 8 threads. Let's assume you don't have a T5440. A load average of 15 on a dual-processor T2+ system is less than 12% busy (15 / (2*8*8)).

> The other symptoms are the system become unresponsive. Any thoughts?

Is it unresponsive only when dtrace is running, or normally? With recent releases of Solaris, I've found systems to be quite responsive with a load average that is many times higher than the number of CPUs (as seen by mpstat - 128 for the typical dual-processor T5140/T5240). It seems highly unlikely that the problem is related to being short on CPU (again, only at about 12% CPU utilization).

If it is unresponsive or sluggish before you start dtrace, I would guess that one of the following is the case:

- The machine is short on RAM and is paging. Use vmstat to diagnose. Look at the "b" column (blocked on I/O) and paging-related columns such as sr (scan rate). You would see things as being extremely sluggish (e.g. when executing a command) because the disk reads needed to load the commands and related libraries are getting queued behind the I/O requests for paging.

- The network is having troubles. Look for a duplex mismatch or non-zero values in:

    kstat -p e1000g | nawk '$NF != 0 && $0 ~ /(err|drop|fail)/'

- There is some other I/O problem. Does iostat -En show hard errors on any disk? Does "iostat -xzn 1" show svc_time + wsvc_time over 20ms? How many I/Os are queued and active?
Your question is about performance - but you jumped to the conclusion that dtrace would tell you the answer. It may, but there are likely other tools that will be helpful with a lot less effort and less system impact. perf-discuss may be a better list to ask for more help.

--
Mike Gerdts
http://mgerdts.blogspot.com/
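The kstat one-liner Mike suggests keeps only statistics whose value is non-zero and whose line mentions an error condition. The same filter can be sketched in Python for readers without a Solaris box handy; the sample `kstat -p` lines below are invented for illustration, not real output:

```python
import re

def suspicious_kstats(lines):
    """Mimic: kstat -p e1000g | nawk '$NF != 0 && $0 ~ /(err|drop|fail)/'
    Keep lines whose last (value) field is non-zero and that mention
    err, drop, or fail anywhere in the line."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[-1] != "0" and re.search(r"err|drop|fail", line):
            kept.append(line)
    return kept

# Invented sample lines in kstat -p "module:instance:name:stat value" form:
sample = [
    "e1000g:0:mac:ierrors\t0",                 # error stat, but zero - ignore
    "e1000g:0:mac:oerrors\t12",                # non-zero error - report
    "e1000g:0:statistics:tx_no_desc_drop\t3",  # non-zero drop - report
    "e1000g:0:mac:ipackets\t123456",           # not an error stat - ignore
]
print(suspicious_kstats(sample))
```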
Thanks Mike.

> Is it unresponsive only when dtrace is running or normally?

It becomes unresponsive after starting DTrace.

> With recent releases of Solaris, I've found systems to be quite responsive
> with a load average that is many times higher than the number of CPUs
> (as seen by mpstat - 128 for the typical dual processor T5140/T5240).

The system is a T5440, 256G RAM.

> It seems highly unlikely that the problem is related to being short on
> CPU (again, only at about 12% CPU utilization).

vmstat reports more than 95% CPU free. Core utilization is between 2-3%.

> If it is unresponsive or sluggish before you start dtrace, I would
> guess that one of the following is the case:

No. It gets sluggish after DTrace is started.

> - The machine is short on RAM and is paging. [snip]

There's plenty of RAM, ~240G.

> - The network is having troubles. [snip]
> - There is some other I/O problem. [snip]
>
> Your question is performance - but you jumped to the conclusion that
> dtrace would tell you the answer. [snip]

We were checking application performance when we engaged this script to check where the hot spots were; we had to Ctrl-C dtrace because of its behavior.
Even now on an idle server (the same system), here is what I see, although it is not that unresponsive now (very little load to start with).

Before dtrace:
Total: 162 processes, 2058 lwps, load averages: 0.79, 2.87, 2.26
Total: 162 processes, 2058 lwps, load averages: 1.03, 2.89, 2.27

After dtrace:
Total: 161 processes, 2057 lwps, load averages: 20.61, 6.88, 3.61
Total: 161 processes, 2057 lwps, load averages: 38.40, 10.76, 4.93
Total: 161 processes, 2057 lwps, load averages: 35.38, 10.59, 4.91

This time I was able to get some output from the script; otherwise, under load, I have not seen any script output. Now you can imagine the state of the system if the initial load was 10-15.

Please let me know if you need more details.
On Tue, Jun 22, 2010 at 6:22 PM, tester <solaris.identity at gmail.com> wrote:
> Thanks Mike.
>
>> Is it unresponsive only when dtrace is running or normally?
>
> It becomes unresponsive after starting Dtrace.
[snip]
> This time I was able to get some o/p from the script otherwise with load
> I have not seen script o/p. Now you can imagine the state of the system
> if the initial load was 10-15.

Hmmm... it looks like you are just having troubles with the enabled probes being too heavy. My approach would be to try to reduce the amount of data that is being collected. You could try:

- Use a lower frequency in your profile provider. That is, switch profile-397 to something like profile-113.
- Reduce the number of stack frames from 40 to 10.
- Try tracing only one thread at a time.

For the last suggestion, you can probably figure out which are the interesting/busy threads with:

    prstat -mLc -n 10 1 5

You should be able to observe one thread (tid - the part after the / at the end of the line in prstat output) at a time with:

    profile-113
    /pid == 18900 && tid == 123/
    {
        @[jstack(10,1024)] = count();
    }

    tick-5s
    {
        trunc(@,10);
        printa(@);
        trunc(@);
    }

Perhaps if you share the output of "prstat -mLc -n 10 1 5" when the unexpected load exists, something will jump out. For example, if we see a large amount of %sys, it may be worthwhile looking at the syscall provider.

You may also want to take a look at tools that are Java-specific. I haven't had a need to try these, but a quick search reveals http://profiler.netbeans.org/ which looks interesting.

Also, I haven't had much of a chance to compare the performance of jstack vs. ustack. It could be that you are seeing a performance hit due to the JVM saying "256 CPUs? Ahh, you must want 256 garbage collection threads!" Using ustack() instead of jstack() may be lighter weight (I've had good success with ustack(5) at profile-997 on a pretty busy T5240 debugging JNI calls).
I suspect a similar approach may point out heavy GC activity. http://www.c0t0d0s0.org/archives/6617-About-Java,-parallel-garbage-collection-and-processor-sets.html -- Mike Gerdts http://mgerdts.blogspot.com/
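Along the lines of Mike's ustack() remark, a lighter-weight variant of the earlier script might look like the sketch below. The PID and tid values are just the examples from this thread, and note that ustack() reports raw user addresses rather than translated Java method names, so it trades readability for lower per-probe cost:

```d
profile-997
/pid == 18900 && tid == 123/
{
        /* Only five user frames per sample keeps the probe cheap,
         * and avoids jstack()'s Java frame translation entirely. */
        @[ustack(5)] = count();
}

tick-5s
{
        trunc(@, 10);
        printa(@);
        trunc(@);
}
```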
Check the archives for Jon Haslam's post on DTrace being sluggish. The original poster also had a T5440. The reason he gave is that DTrace will allocate 4MB per CPU, and again if you are using aggregations. In your case this means 2GB of allocations. Although you have loads of RAM, it still has to be managed. If your system has been running for a long time, there may not be many large pages left. In the worst case, 2GB is a huge number of 8K pages, and that's a lot of work! So your run queues are growing because DTrace has given the system quite a lot of extra work to do. Jon suggests that you could tune the buffer size down, but that you may begin to see drops as a result.

Your simple profiling script should be fine. Obviously extreme profile frequencies will have a greater probe effect, but I've never considered even 997Hz that extreme (although you do have one of the slowest SPARC machines on the planet)! I think your issue is more to do with initialising DTrace. It may have settled down if you had let it, but I don't blame you for pulling the plug. That isn't what DTrace claims on the tin. It's just that on sane hardware there will be less of an impact.

Phil

p.s. in cases like this, nothing beats a bit of good old vmstat, mpstat, iostat -xnz and (less old) prstat -Lm ... say 12 samples with 5 second interval of each

http://harmanholistix.com
Focusing on the big picture and the detail

On 23 Jun 2010, at 00:22, tester <solaris.identity at gmail.com> wrote:
> Thanks Mike.
>
>> Is it unresponsive only when dtrace is running or normally?
>
> It becomes unresponsive after starting Dtrace.
>
> The system is a T5440. 256G RAM.
> [snip]

_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris.org
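Phil's sizing holds up to a back-of-the-envelope check. A sketch, assuming the 4MB-per-CPU figure he cites for the principal buffer, doubled when an aggregation buffer is also allocated, on a 256-strand T5440:

```python
MB = 1024 * 1024

ncpus = 256            # strands on a T5440: 4 sockets x 8 cores x 8 threads
per_cpu_buf = 4 * MB   # principal buffer per CPU (figure cited by Phil)
per_cpu_agg = 4 * MB   # again for the aggregation buffer

total = ncpus * (per_cpu_buf + per_cpu_agg)
print(total // (1024 * MB), "GB allocated")   # 2 GB allocated
print(total // 8192, "8K pages to manage")    # 262144 8K pages to manage
```

This is why the same one-liner that is harmless on a small box hurts on a 256-strand machine: the fixed per-CPU cost scales with the strand count. The tuning Jon suggests would mean lowering these buffer sizes (DTrace exposes them as the bufsize and aggsize options, e.g. dtrace -b 512k), at the cost of possible drops.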