[Just starting out with DTrace and was hoping to get some guidance.]

I have a "benchmark" program that I monitored with both prstat (prstat -mL -P <PID>) and pfilestat (from the DTrace toolkit). Prstat reports LAT values in the 0.1-0.2% range, but pfilestat reports "waitcpu" values in the 6-10% range. Since those two numbers supposedly represent time waiting for the CPU, I'm wondering why they are so different for the same application (over the same time period).

On a side note, the pfilestat script (I think) assumes that a "read entry" probe will be followed by a "read return" probe. On a per-thread basis, I can see this happening, but if the application has multiple threads/readers/writers, isn't it possible to get interleaved syscalls? For example:

  Thread1 -> read entry
  Thread2 -> write entry
  Thread1 -> read return
  Thread2 -> write return

If so, then shouldn't the pfilestat script be using thread-local variables for timing versus global variables?
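PS. To illustrate what I mean by thread-local timing, here is a rough, untested sketch of my own (not the pfilestat code) that keeps the start time in a self-> variable, so interleaved syscalls from different threads can't clobber each other. It assumes the script is run with -p <PID> so $target is set:

  syscall::read:entry
  /pid == $target/
  {
          /* thread-local start time: each thread gets its own copy */
          self->ts = timestamp;
  }

  syscall::read:return
  /self->ts/
  {
          /* per-thread read latency, summed by thread ID */
          @readtime[tid] = sum(timestamp - self->ts);
          self->ts = 0;
  }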
Brendan Gregg - Sun Microsystems
2008-Jun-26 20:07 UTC
[dtrace-discuss] Pfilestat vs. prstat
G'Day Bob,

On Thu, Jun 26, 2008 at 07:40:28AM -0700, Bob Resendes wrote:
> [Just starting out with DTrace and was hoping to get some guidance.]
>
> I have a "benchmark" program that I monitored with both prstat (prstat -mL -P <PID>) and pfilestat (from the DTrace toolkit). Prstat reports LAT values in the 0.1-0.2% range, but pfilestat reports "waitcpu" values in the 6-10% range. Since those two numbers supposedly represent time waiting for the CPU, I'm wondering why they are so different for the same application (over the same time period).

prstat -mL is giving a by-thread summary, which is going to be better than either prstat -m, pfilestat, or anything else that tries to represent multiple thread info by-process.

> On a side note, the pfilestat script (I think) assumes that a "read entry" probe will be followed by a "read return" probe. On a per-thread basis, I can see this happening, but if the application has multiple threads/readers/writers, isn't it possible to get interleaved syscalls? For example:
>
>   Thread1 -> read entry
>   Thread2 -> write entry
>   Thread1 -> read return
>   Thread2 -> write return
>
> If so, then shouldn't the pfilestat script be using thread-local variables for timing versus global variables?

Try it, and consider a multi-threaded app with a thread that keeps calling sleep() for 10 seconds (or some interval greater than the summary). You'd want your profile:::tick-5s to walk the thread list to account for this properly, but that involves loops - and DTrace doesn't do loops. It doesn't mean this is impossible to solve, just that it'll start getting difficult.

Representing this data by-process presents another challenge - how to present data that will best identify performance issues.

I believe pfilestat was written as a demo of what is possible, and works great in some situations but not others. The docs should explain that - sorry. Perhaps the easiest fix would be to allow pfilestat to take an optional tid as an argument, and document that it should be used when analysing busy multi-threaded apps.

pfilestat is one of the longest of the 230+ scripts; if you are starting out, it may be best to "ls -lrLS" in /Bin and start from the top.

Brendan

--
Brendan [CA, USA]
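PS. A rough, untested sketch of the optional-tid idea (just an illustration, not a patch to pfilestat): pass the thread ID as a second macro argument and fold it into the predicates, so only that one thread is timed:

  /* run as: dtrace -s script.d <PID> <TID> */
  syscall::read:entry
  /pid == $1 && tid == $2/
  {
          self->ts = timestamp;
  }

  syscall::read:return
  /self->ts/
  {
          @["read (ns)"] = sum(timestamp - self->ts);
          self->ts = 0;
  }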
> prstat -mL is giving a by-thread summary, which is going to be better than
> either prstat -m, pfilestat, or anything else that tries to represent
> multiple thread info by-process.

Brendan, thanks for the quick reply. Are you suggesting that a single-threaded application should reflect "closer" numbers? If so, then I'm still curious about the differences I'm seeing. For example, the following are (representative) snapshots of both commands for the same (single-threaded) process:

prstat -mL -p 10997 5
=====================

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 10997 40011     14 6.1 0.0 0.0 0.0 0.0  79 1.1 730  19  3K   0 postgres/1

[Note: LAT = 1.1% here]

./pfilestat 10997
=================

     STATE   FDNUM      Time Filename
      read       4        0% /zones/dbzone05/<snip>
      read      17        0% /zones/dbzone05/<snip>
     write      20        0% /zones/dbzone05/<snip>
      read      19        0% /zones/dbzone05/<snip>
      read      23        0% /zones/dbzone05/<snip>
     write      19        1% /zones/dbzone05/<snip>
   sleep-w       0        2%
   waitcpu       0        4%
   running       0       13%
     sleep       0       76%
[snip]

Total event time (ms): 4999   Total Mbytes/sec: 1

[Note: waitcpu is 4% here.]

I'm fine with the answer that they are measuring things differently. I'm just curious as to what accounts for the 1% vs 4% difference.

Again, I'm not really complaining. I think these scripts are going to be a tremendous help in getting started. I'm just concerned that I'm missing something about the internal workings of DTrace. For example, something like "the sched:::* probes are done in a different context than the application and therefore the numbers don't match".

Bob
Brendan Gregg - Sun Microsystems
2008-Jun-28 03:55 UTC
[dtrace-discuss] Pfilestat vs. prstat
G'Day Bob,

On Fri, Jun 27, 2008 at 05:07:21AM -0700, Bob Resendes wrote:
> > prstat -mL is giving a by-thread summary, which is going to be better than
> > either prstat -m, pfilestat, or anything else that tries to represent
> > multiple thread info by-process.
>
> Brendan, thanks for the quick reply. Are you suggesting that a single-threaded application should reflect "closer" numbers? If so, then I'm still curious about the differences I'm seeing. For example, the following are (representative) snapshots of both commands for the same (single-threaded) process:
>
> prstat -mL -p 10997 5
> =====================
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  10997 40011     14 6.1 0.0 0.0 0.0 0.0  79 1.1 730  19  3K   0 postgres/1
>
> [Note: LAT = 1.1% here]
>
> ./pfilestat 10997
> =================
>      STATE   FDNUM      Time Filename
>       read       4        0% /zones/dbzone05/<snip>
>       read      17        0% /zones/dbzone05/<snip>
>      write      20        0% /zones/dbzone05/<snip>
>       read      19        0% /zones/dbzone05/<snip>
>       read      23        0% /zones/dbzone05/<snip>
>      write      19        1% /zones/dbzone05/<snip>
>    sleep-w       0        2%
>    waitcpu       0        4%
>    running       0       13%
>      sleep       0       76%
> [snip]
> Total event time (ms): 4999   Total Mbytes/sec: 1
>
> [Note: waitcpu is 4% here.]
>
> I'm fine with the answer that they are measuring things differently. I'm just curious as to what accounts for the 1% vs 4% difference.

Yes, I often use other tools as a sanity check of DTrace scripts, and I do worry about small percentage differences - they often lead to bugs or a better understanding of system behaviour...

> Again, I'm not really complaining. I think these scripts are going to be a tremendous help in getting started. I'm just concerned that I'm missing something about the internal workings of DTrace. For example, something like "the sched:::* probes are done in a different context than the application and therefore the numbers don't match".

No, pfilestat is careful to follow the documented sched provider:

http://wikis.sun.com/display/DTrace/sched+Provider

Such as using 'pid' for events in thread context:

  sched:::off-cpu
  /pid == PID/

and 'args[1]->pr_pid' for events that aren't:

  sched:::dequeue
  /args[1]->pr_pid == PID/

However, I did spot a couple of things with pfilestat that might be improved like this:

--- /opt/DTT/Bin/pfilestat      Wed Aug  1 11:01:38 2007
+++ /tmp/pfilestat      Sat Jun 28 03:47:48 2008
@@ -184,7 +184,8 @@
 sched:::on-cpu
 /pid == PID/
 {
-	@logical["waitcpu", (uint64_t)0, ""] = sum(timestamp - last);
+	@logical[probefunc == "resume" ? "running" : "waitcpu",
+	    (uint64_t)0, ""] = sum(timestamp - last);
 	totaltime += timestamp - last;
 	last = timestamp;
 }
@@ -233,7 +234,8 @@
 sched:::enqueue
 /args[1]->pr_pid == PID/
 {
-	@logical["waitcpu", (uint64_t)0, ""] = sum(timestamp - last);
+	@logical[pid == args[1]->pr_pid ? "running" : "sleep",
+	    (uint64_t)0, ""] = sum(timestamp - last);
 	totaltime += timestamp - last;
 	last = timestamp;
 }

If you don't mind, try that out and let us know how it then compares to prstat -m.

cheers,
Brendan

--
Brendan [CA, USA]
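PS. As a sanity check against prstat's LAT, you could also time the enqueue -> on-cpu interval directly. A rough, untested sketch (not part of pfilestat; takes the PID as $1):

  /* enqueue fires in another context, so key by the enqueued thread's lwpid */
  sched:::enqueue
  /args[1]->pr_pid == $1/
  {
          qtime[args[0]->pr_lwpid] = timestamp;
  }

  /* on-cpu fires in thread context, so tid is the current lwpid */
  sched:::on-cpu
  /pid == $1 && qtime[tid]/
  {
          @["waitcpu (ns)"] = sum(timestamp - qtime[tid]);
          qtime[tid] = 0;
  }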
> If you don't mind, try that out and let us know how it then compares to
> prstat -m.

The changes seem to track prstat more closely now (e.g. sleep, running, ...). Thanks for the update.

On the down side, the process I was observing had LAT (from prstat) in the 0-2% range, but I never saw "waitcpu" (from pfilestat) go above 0[1]. Also, I noticed sleep times in the 101-103% range a few times as well.

Don't know if you want to pursue this any further[1], but let me know if you need any more info.

Bob

[1] I won't have access to the "test" system for the rest of the day, so I wasn't able to experiment with trying to drive the LAT values any higher to see if waitcpu would register (> 0).