[Just starting out with DTrace and was hoping to get some guidance.]

I have a "benchmark" program that I monitored with both prstat (prstat -mL -P <PID>) and pfilestat (from the DTrace toolkit). Prstat reports LAT values in the 0.1-0.2% range, but pfilestat reports "waitcpu" values in the 6-10% range. Since those two numbers supposedly represent time waiting for the CPU, I'm wondering why they are so different for the same application (over the same time period).

On a side note, the pfilestat script (I think) assumes that a "read entry" probe will be followed by a "read return" probe. On a per-thread basis, I can see this happening, but if the application has multiple threads/readers/writers, isn't it possible to get interleaved syscalls? For example:

  Thread1 -> read entry
  Thread2 -> write entry
  Thread1 -> read return
  Thread2 -> write return

If so, then shouldn't the pfilestat script be using thread-local variables for timing versus global variables?
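PS. To illustrate what I mean by thread-local timing, here is a rough, untested sketch of my own (not the pfilestat code) that keeps the start time in a self-> variable, so interleaved syscalls from different threads can't clobber each other. It assumes the script is run with -p <PID> so $target is set:

  syscall::read:entry
  /pid == $target/
  {
          /* thread-local start time: each thread gets its own copy */
          self->ts = timestamp;
  }

  syscall::read:return
  /self->ts/
  {
          /* per-thread read latency, summed by thread ID */
          @readtime[tid] = sum(timestamp - self->ts);
          self->ts = 0;
  }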
Brendan Gregg - Sun Microsystems
2008-Jun-26 20:07 UTC
[dtrace-discuss] Pfilestat vs. prstat
G'Day Bob,

On Thu, Jun 26, 2008 at 07:40:28AM -0700, Bob Resendes wrote:
> [Just starting out with DTrace and was hoping to get some guidance.]
>
> I have a "benchmark" program that I monitored with both prstat (prstat -mL -P <PID>) and pfilestat (from the DTrace toolkit). Prstat reports LAT values in the 0.1-0.2% range, but pfilestat reports "waitcpu" values in the 6-10% range. Since those two numbers supposedly represent time waiting for the CPU, I'm wondering why they are so different for the same application (over the same time period).

prstat -mL is giving a by-thread summary, which is going to be better than either prstat -m, pfilestat, or anything else that tries to represent multiple thread info by-process.

> On a side note, the pfilestat script (I think) assumes that a "read entry" probe will be followed by a "read return" probe. On a per-thread basis, I can see this happening, but if the application has multiple threads/readers/writers, isn't it possible to get interleaved syscalls? For example:
>
>   Thread1 -> read entry
>   Thread2 -> write entry
>   Thread1 -> read return
>   Thread2 -> write return
>
> If so, then shouldn't the pfilestat script be using thread-local variables for timing versus global variables?

Try it, and consider a multi-threaded app with a thread that keeps calling sleep() for 10 seconds (or some interval greater than the summary). You'd want your profile:::tick-5s to walk the thread list to account for this properly, but that involves loops - and DTrace doesn't do loops. It doesn't mean this is impossible to solve, just that it'll start getting difficult.

Representing this data by-process presents another challenge - how to present data that will best identify performance issues.

I believe pfilestat was written as a demo of what is possible, and works great in some situations but not others. The docs should explain that - sorry. Perhaps the easiest fix would be to allow pfilestat to take an optional tid as an argument, and document that it should be used when analysing busy multi-threaded apps.

pfilestat is one of the longest of the 230+ scripts; if you are starting out, it may be best to "ls -lrLS" in /Bin and start from the top.

Brendan

--
Brendan [CA, USA]
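PS. A rough, untested sketch of the optional-tid idea (just an illustration, not a patch to pfilestat): pass the thread ID as a second macro argument and fold it into the predicates, so only that one thread is timed:

  /* run as: dtrace -s script.d <PID> <TID> */
  syscall::read:entry
  /pid == $1 && tid == $2/
  {
          self->ts = timestamp;
  }

  syscall::read:return
  /self->ts/
  {
          @["read (ns)"] = sum(timestamp - self->ts);
          self->ts = 0;
  }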
> prstat -mL is giving a by-thread summary, which is going to be better than
> either prstat -m, pfilestat, or anything else that tries to represent
> multiple thread info by-process.

Brendan, thanks for the quick reply. Are you suggesting that a single-threaded application should reflect "closer" numbers? If so, then I'm still curious about the differences I'm seeing. For example, the following are (representative) snapshots of both commands for the same (single-threaded) process:

prstat -mL -p 10997 5
=====================

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 10997 40011     14 6.1 0.0 0.0 0.0 0.0  79 1.1 730  19  3K   0 postgres/1

[Note: LAT = 1.1% here]

./pfilestat 10997
=================

     STATE   FDNUM      Time Filename
      read       4        0% /zones/dbzone05/<snip>
      read      17        0% /zones/dbzone05/<snip>
     write      20        0% /zones/dbzone05/<snip>
      read      19        0% /zones/dbzone05/<snip>
      read      23        0% /zones/dbzone05/<snip>
     write      19        1% /zones/dbzone05/<snip>
   sleep-w       0        2%
   waitcpu       0        4%
   running       0       13%
     sleep       0       76%
[snip]

Total event time (ms): 4999   Total Mbytes/sec: 1

[Note: waitcpu is 4% here.]

I'm fine with the answer that they are measuring things differently. I'm just curious as to what accounts for the 1% vs 4% difference.

Again, I'm not really complaining. I think these scripts are going to be a tremendous help in getting started. I'm just concerned that I'm missing something about the internal workings of DTrace. For example, something like "the sched:::* probes are done in a different context than the application and therefore the numbers don't match".

Bob
Brendan Gregg - Sun Microsystems
2008-Jun-28 03:55 UTC
[dtrace-discuss] Pfilestat vs. prstat
G'Day Bob,

On Fri, Jun 27, 2008 at 05:07:21AM -0700, Bob Resendes wrote:
> > prstat -mL is giving a by-thread summary, which is going to be better than
> > either prstat -m, pfilestat, or anything else that tries to represent
> > multiple thread info by-process.
>
> Brendan, thanks for the quick reply. Are you suggesting that a single-threaded application should reflect "closer" numbers? If so, then I'm still curious about the differences I'm seeing. For example, the following are (representative) snapshots of both commands for the same (single-threaded) process:
>
> prstat -mL -p 10997 5
> =====================
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  10997 40011     14 6.1 0.0 0.0 0.0 0.0  79 1.1 730  19  3K   0 postgres/1
>
> [Note: LAT = 1.1% here]
>
> ./pfilestat 10997
> =================
>      STATE   FDNUM      Time Filename
>       read       4        0% /zones/dbzone05/<snip>
>       read      17        0% /zones/dbzone05/<snip>
>      write      20        0% /zones/dbzone05/<snip>
>       read      19        0% /zones/dbzone05/<snip>
>       read      23        0% /zones/dbzone05/<snip>
>      write      19        1% /zones/dbzone05/<snip>
>    sleep-w       0        2%
>    waitcpu       0        4%
>    running       0       13%
>      sleep       0       76%
> [snip]
> Total event time (ms): 4999   Total Mbytes/sec: 1
>
> [Note: waitcpu is 4% here.]
>
> I'm fine with the answer that they are measuring things differently. I'm just curious as to what accounts for the 1% vs 4% difference.

Yes, I often use other tools as a sanity check of DTrace scripts, and I do worry about small percentage differences - they often lead to bugs or a better understanding of system behaviour...

> Again, I'm not really complaining. I think these scripts are going to be a tremendous help in getting started. I'm just concerned that I'm missing something about the internal workings of DTrace. For example, something like "the sched:::* probes are done in a different context than the application and therefore the numbers don't match".

No, pfilestat is careful to follow the documented sched provider:

http://wikis.sun.com/display/DTrace/sched+Provider

Such as using 'pid' for events in thread context:

  sched:::off-cpu
  /pid == PID/

and 'args[1]->pr_pid' for events that aren't:

  sched:::dequeue
  /args[1]->pr_pid == PID/

However, I did spot a couple of things with pfilestat that might be improved like this:

--- /opt/DTT/Bin/pfilestat      Wed Aug  1 11:01:38 2007
+++ /tmp/pfilestat      Sat Jun 28 03:47:48 2008
@@ -184,7 +184,8 @@
 sched:::on-cpu
 /pid == PID/
 {
-	@logical["waitcpu", (uint64_t)0, ""] = sum(timestamp - last);
+	@logical[probefunc == "resume" ? "running" : "waitcpu",
+	    (uint64_t)0, ""] = sum(timestamp - last);
 	totaltime += timestamp - last;
 	last = timestamp;
 }
@@ -233,7 +234,8 @@
 sched:::enqueue
 /args[1]->pr_pid == PID/
 {
-	@logical["waitcpu", (uint64_t)0, ""] = sum(timestamp - last);
+	@logical[pid == args[1]->pr_pid ? "running" : "sleep",
+	    (uint64_t)0, ""] = sum(timestamp - last);
 	totaltime += timestamp - last;
 	last = timestamp;
 }

If you don't mind, try that out and let us know how it then compares to prstat -m.

cheers,
Brendan

--
Brendan [CA, USA]
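PS. As a sanity check against prstat's LAT, you could also time the enqueue -> on-cpu interval directly. A rough, untested sketch (not part of pfilestat; takes the PID as $1):

  /* enqueue fires in another context, so key by the enqueued thread's lwpid */
  sched:::enqueue
  /args[1]->pr_pid == $1/
  {
          qtime[args[0]->pr_lwpid] = timestamp;
  }

  /* on-cpu fires in thread context, so tid is the current lwpid */
  sched:::on-cpu
  /pid == $1 && qtime[tid]/
  {
          @["waitcpu (ns)"] = sum(timestamp - qtime[tid]);
          qtime[tid] = 0;
  }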
> If you don't mind, try that out and let us know how it then compares to
> prstat -m.

The changes seem to track prstat more closely now (e.g. sleep, running, ...). Thanks for the update.

On the down side, the process I was observing had LAT (from prstat) in the 0-2% range, but I never saw "waitcpu" (from pfilestat) go above 0[1]. Also, I noticed sleep times in the 101-103% range a few times as well.

Don't know if you want to pursue this any further[1], but let me know if you need any more info.

Bob

[1] I won't have access to the "test" system for the rest of the day, so I wasn't able to experiment with trying to drive the LAT values any higher to see if waitcpu would register (> 0).