I am using the timestamp variable in DTrace for timing some events, in particular the duration of a lock (i.e. the time elapsed between the acquisition and release of a mutex lock). I am getting pretty high numbers. But when I time the lock explicitly, using timing functions such as gethrtime(), I get pretty low numbers. Does this mean timestamp doesn't give us accurate measurements? What are the alternatives?

Thanks,
Neelam
On Tue, 2007-11-06 at 16:37 -0800, Neelam wrote:
> I am using the timestamp variable in DTrace for timing some events, in
> particular the duration of a lock (i.e. the time elapsed between the
> acquisition and release of a mutex lock). I am getting pretty high
> numbers. But when I time the lock explicitly, using timing functions
> such as gethrtime(), I get pretty low numbers. Does this mean timestamp
> doesn't give us accurate measurements? What are the alternatives?

We had similar issues with gethrtime() on 64-bit architectures, where the result was truncated to 32 bits. Of course, it is hard to tell whether it is the same problem or not without seeing your D script.

Thanks,
Roman.
gethrtime() is implemented as a "fast trap" system call, but timestamp in DTrace needs to go deeper into the kernel...

I only have the 1st edition of the "Solaris Internals" book, but here is its comparison (on page 47) of a fast trap system call vs. getpid() on a 300 MHz Ultra2:

                   latency (ns)
    getpid()           2138
    gethrtime()         320
    lib C call           93

Since gethrtime() is relatively cheap compared to a normal system call, the time difference may come from the difference in overhead between the two...

Rayson

On Nov 6, 2007 7:37 PM, Neelam <neelam.goyal at gmail.com> wrote:
> I am using the timestamp variable in DTrace for timing some events, in
> particular the duration of a lock (i.e. the time elapsed between the
> acquisition and release of a mutex lock). I am getting pretty high
> numbers. But when I time the lock explicitly, using timing functions
> such as gethrtime(), I get pretty low numbers. Does this mean timestamp
> doesn't give us accurate measurements? What are the alternatives?
Rayson Ho wrote:
> gethrtime() is implemented as a "fast trap" system call, but timestamp
> in DTrace needs to go deeper into the kernel...
>
> I only have the 1st edition of the "Solaris Internals" book, but here
> is its comparison (on page 47) of a fast trap system call vs. getpid()
> on a 300 MHz Ultra2:
>
>                    latency (ns)
>     getpid()           2138
>     gethrtime()         320
>     lib C call           93
>
> Since gethrtime() is relatively cheap compared to a normal system call,
> the time difference may come from the difference in overhead between
> the two...

I'm fairly sure that is not the case -- all the measurement DTrace does actually happens inside the kernel, so the above will not apply. I also believe there's been some confusion here, as there's a kernel *and* a userland version of gethrtime(), and from context, I believe Rayson and Neelam are talking about different ones :-)

I'm inclined to believe that, compared to gethrtime(9F), DTrace instrumentation can be quite "heavy" (again, depending on your specific probe), and may in fact influence the results. Please show us the exact script, so we can see whether this assumption holds.

Michael

> Rayson
>
> On Nov 6, 2007 7:37 PM, Neelam <neelam.goyal at gmail.com> wrote:
>> I am using the timestamp variable in DTrace for timing some events, in
>> particular the duration of a lock (i.e. the time elapsed between the
>> acquisition and release of a mutex lock). I am getting pretty high
>> numbers. But when I time the lock explicitly, using timing functions
>> such as gethrtime(), I get pretty low numbers. Does this mean timestamp
>> doesn't give us accurate measurements? What are the alternatives?

--
Michael Schuster              Sun Microsystems, Inc.
recursion, n: see 'recursion'
Well, I am trying to record the duration of user-level mutex locks. Below is the simple script:

-----------------------------------------------------------------
#!/usr/sbin/dtrace -s

dtrace:::BEGIN
{
        self->begin = timestamp;
        self->mutex_acq = 0;
}

plockstat$1:::mutex-acquire
{
        self->ts[arg0] = timestamp;
        self->mutex_acq++;
        @mutex_acquired["No of mutexes acquired"] = count();
}

plockstat$1:::mutex-release
/self->mutex_acq == 1/
{
        self->tstop[arg0] = timestamp;
        @mutex_released["No of mutexes released"] = count();
        printf("%d outer mutex duration: %d\n", arg0,
            self->tstop[arg0] - self->ts[arg0]);
        @["all_mutex_duration"] = avg(self->tstop[arg0] - self->ts[arg0]);
        self->mutex_acq = 0;
        self->ts[arg0] = 0;
}

plockstat$1:::mutex-release
/self->mutex_acq > 1/
{
        self->mutex_acq--;
        self->tstp[arg0] = timestamp;
        @mutex_released["No of mutexes released"] = count();
        printf("%d mutex duration: %d\n", arg0,
            self->tstp[arg0] - self->ts[arg0]);
}
-----------------------------------------------------------------

You can see two release probes here, because I was trying to find out the duration of the outermost lock when there is nesting.

And for the timing measurement in the C code, I think it must be the user-level gethrtime(). I include <sys/time.h> to use it.

Thanks,
Neelam
On Nov 7, 2007 3:35 PM, Neelam <neelam.goyal at gmail.com> wrote:
> You can see two release probes here, because I was trying to find out
> the duration of the outermost lock when there is nesting.
>
> And for the timing measurement in the C code, I think it must be the
> user-level gethrtime(). I include <sys/time.h> to use it.

Right, so what I said yesterday was correct.

gethrtime() is a fast trap system call, so calling it has very little overhead. But using DTrace to instrument your code is expensive. In fact, going all the way from userland down to the DTrace logic is more expensive than a getpid(2), which is itself already slower than a fast trap system call.

Rayson
So is there any alternative for getting more accurate numbers using DTrace? I would like to use DTrace because my actual workload is much more complex and long-running, so manually instrumenting all possible locks is not feasible. I would still like to use DTrace to trace all the locks!!

Thanks,
Neelam
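One alternative worth sketching here: DTrace's built-in vtimestamp variable counts the current thread's on-CPU nanoseconds minus the time spent in DTrace predicates and actions, so it factors DTrace's own processing out of the measured interval. The trade-off is that it also excludes any time the lock holder spends off-CPU, so it yields on-CPU hold time rather than wall-clock hold time. A minimal sketch against the same plockstat probes ($1 is the target pid, as in Neelam's script):

    #!/usr/sbin/dtrace -s

    plockstat$1:::mutex-acquire
    {
            /* vtimestamp excludes time spent in DTrace itself */
            self->vts[arg0] = vtimestamp;
    }

    plockstat$1:::mutex-release
    /self->vts[arg0]/
    {
            @hold = avg(vtimestamp - self->vts[arg0]);
            self->vts[arg0] = 0;
    }

    END
    {
            printa("avg on-CPU hold time (ns): %@d\n", @hold);
    }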
On Wed, 2007-11-07 at 15:48 -0500, Rayson Ho wrote:
> Right, so what I said yesterday was correct.
>
> gethrtime() is a fast trap system call, so calling it has very little
> overhead. But using DTrace to instrument your code is expensive. In
> fact, going all the way from userland down to the DTrace logic is more
> expensive than a getpid(2), which is itself already slower than a fast
> trap system call.

I have trouble correlating the two statements you've made. Is it that you're suggesting NOT to use DTrace because it is slower than manually inserted timing calls around each pthread_mutex_[un]lock?

Thanks,
Roman.
On Nov 7, 2007 4:17 PM, Roman Shaposhnik <rvs at sun.com> wrote:
> I have trouble correlating the two statements you've made. Is it that
> you're suggesting NOT to use DTrace because it is slower than manually
> inserted timing calls around each pthread_mutex_[un]lock?

Well... where did I say that?? I am helping Neelam understand the problem. You sure don't want to use a stopwatch to measure the latency of a cache miss, do you?? By the same token, if the overhead of getting to DTrace from userland is much greater than that of getting a lock, then picking DTrace is surely wrong...

BTW, this is not the first time I am seeing responses like this. Every time I suggest that DTrace is not the right tool for a certain problem, someone from Sun is sure to come to its defense. *Please* understand that no tool can handle all performance issues!!

Rayson
On Wed, 2007-11-07 at 16:24 -0500, Rayson Ho wrote:
> Well... where did I say that??

I didn't imply you did. I felt confused by your two statements and offered an interpretation of them that seemed to fit, hoping it would be a useful first step toward understanding what you actually had in mind.

> I am helping Neelam understand the problem. You sure don't want to
> use a stopwatch to measure the latency of a cache miss, do you??

No, but I can use DTrace (well, almost -- the CPC provider is still not part of the default setup) ;-)

> By the same token, if the overhead of getting to DTrace from
> userland

Do you have proof that the overhead is as great as you make it look? pthread_* operations are not exactly cheap, even in userspace. Besides, the lock duration is usually a relative value anyway. I ran an experiment once using the rdtsc assembly instruction to manually time some mid-frequency user-space operations with and without DTrace enabled. The difference was there (and I shouldn't say it was negligible), but it wasn't all that impressive. I'd be curious to hear your side of the story.

Thanks,
Roman.

P.S. As for the tools applicable to the problem at hand, one can always use Sun Studio Performance Analyzer if small dilation is of top concern and all that's needed is user-space application analysis.
On Nov 7, 2007 6:22 PM, Roman Shaposhnik <rvs at sun.com> wrote:
> I didn't imply you did. I felt confused by your two statements and
> offered an interpretation of them that seemed to fit, hoping it would
> be a useful first step toward understanding what you actually had in
> mind.

OK :)

> Do you have proof that the overhead is as great as you make it look?
> pthread_* operations are not exactly cheap, even in userspace. Besides,
> the lock duration is usually a relative value anyway.

Agreed. We don't have Neelam's code, so we don't know exactly what it does. There's too much guesswork.

> I ran an experiment once using the rdtsc assembly instruction to
> manually time some mid-frequency user-space operations with and without
> DTrace enabled. The difference was there (and I shouldn't say it was
> negligible), but it wasn't all that impressive. I'd be curious to hear
> your side of the story.

I believe the overhead of firing a probe from userspace into the DTrace framework is on the order of thousands of cycles, but I don't have access to a Solaris system right now, so I can't check. Whether DTrace is good enough or not all depends on how long the lock is held. I am interested in your experiment as well...

Rayson
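One way to put a rough number on that probe cost is to bracket a call that is known to be nearly free with a pid-provider entry/return pair: since gethrtime() itself costs only a few hundred nanoseconds (see the Solaris Internals numbers above), most of the measured interval is then DTrace's own probe processing. A sketch, assuming a test program that calls gethrtime() in a tight loop; the script name probe-cost.d and the binary name ./loop are placeholders:

    #!/usr/sbin/dtrace -s
    /* run as: dtrace -s probe-cost.d -c ./loop */

    pid$target:libc.so.1:gethrtime:entry
    {
            self->t = timestamp;
    }

    pid$target:libc.so.1:gethrtime:return
    /self->t/
    {
            /* the function body is cheap, so the bulk of this
               interval is probe-firing overhead */
            @cost = quantize(timestamp - self->t);
            self->t = 0;
    }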
On Wed, 2007-11-07 at 12:35 -0800, Neelam wrote:
> You can see two release probes here, because I was trying to find out
> the duration of the outermost lock when there is nesting.
>
> And for the timing measurement in the C code, I think it must be the
> user-level gethrtime(). I include <sys/time.h> to use it.

First of all -- your script looks reasonable, so the only possible explanation I could come up with would be: there's a hard-to-reproduce bug in DTrace that makes timestamp disagree with gethrtime() on some 64-bit architectures (we observed it on Solaris Nevada running on a 64-bit Acer laptop). The easiest way to verify whether you're affected by it or not is to run the following set of commands:

    $ cat q.c
    #include <sys/time.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <unistd.h>

    int
    main()
    {
            while (1) {
                    printf("uspace -> %llx\n", gethrtime());
                    sleep(1);
            }
    }

    $ cc [-m64 depending on whether your actual app is 64bit] q.c
    $ dtrace -qn 'pid$target:libc.so:gethrtime:entry { \
        printf("dtrace -> %llx\n", (uint64_t)timestamp); }' -c ./a.out

Let it run for at least a minute and observe the numbers. If the numbers with the dtrace -> and uspace -> prefixes seem to disagree, you've got that problem.

Thanks,
Roman.
On Wed, 2007-11-07 at 18:44 -0500, Rayson Ho wrote:
> I believe the overhead of firing a probe from userspace into the DTrace
> framework is on the order of thousands of cycles, but I don't have
> access to a Solaris system right now, so I can't check. Whether DTrace
> is good enough or not all depends on how long the lock is held. I am
> interested in your experiment as well...

Thousands of cycles sounds like a reasonable assessment, but in my book that doesn't make DTrace much slower for lock monitoring. pthread_mutex_lock would take about 10-30K cycles (depending on whether it is contended or not), so we're talking about ~5% overhead. Again, I don't consider that negligible, but at the same time it is far from catastrophic and in no way rules out using DTrace. IMHO.

Thanks,
Roman.
On Wed, 2007-11-07 at 12:35 -0800, Neelam wrote:
> Well, I am trying to record the duration of user-level mutex locks.
> Below is the simple script:
>
> -----------------------------------------------------------------
> #!/usr/sbin/dtrace -s
>
> dtrace:::BEGIN
> {
>         self->begin = timestamp;
>         self->mutex_acq = 0;
> }
>
> plockstat$1:::mutex-acquire
> {
>         self->ts[arg0] = timestamp;
>         self->mutex_acq++;
>         @mutex_acquired["No of mutexes acquired"] = count();

One extra bit of advice -- get rid of aggregations. They might add to dilation quite significantly.

Thanks,
Roman.
On Wed, Nov 07, 2007 at 06:25:38PM -0800, Roman Shaposhnik wrote:
> On Wed, 2007-11-07 at 12:35 -0800, Neelam wrote:
> > Well, I am trying to record the duration of user-level mutex locks.
> > Below is the simple script:
> >
> > -----------------------------------------------------------------
> > #!/usr/sbin/dtrace -s
> >
> > dtrace:::BEGIN
> > {
> >         self->begin = timestamp;
> >         self->mutex_acq = 0;
> > }
> >
> > plockstat$1:::mutex-acquire
> > {
> >         self->ts[arg0] = timestamp;
> >         self->mutex_acq++;
> >         @mutex_acquired["No of mutexes acquired"] = count();
>
> One extra bit of advice -- get rid of aggregations. They might
> add to dilation quite significantly.

Certainly aggregating by a string is going to hurt; if you just change that to "@mutex_acquired = count()", you should be happier.

And given that you're trying to tune your enabling, you might want to consult both the chapter on the subject in the DTrace documentation, and the slides on the topic in the Advanced DTrace presentation (specifically, look for "Trick: Measuring DTrace").

        - Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems FishWorks.       http://blogs.sun.com/bmc
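For concreteness, the change Bryan suggests looks like this in the acquire probe of Neelam's script (a minimal sketch):

    /* before: DTrace hashes the string key on every probe firing */
    @mutex_acquired["No of mutexes acquired"] = count();

    /* after: unkeyed aggregation, no per-firing string hashing */
    @mutex_acquired = count();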
Roman,

Since you are on the Sun Studio team, and since this thread is related to threads/locks, maybe this is a good place to ask...

Intel has Thread Checker, which is similar to Sun's Thread Analyzer. However, is there a tool on Solaris that is similar to Intel's Thread Profiler??

Rayson

On Nov 7, 2007 9:21 PM, Roman Shaposhnik <rvs at sun.com> wrote:
> Thousands of cycles sounds like a reasonable assessment, but in my book
> that doesn't make DTrace much slower for lock monitoring.
> pthread_mutex_lock would take about 10-30K cycles (depending on whether
> it is contended or not), so we're talking about ~5% overhead. Again,
> I don't consider that negligible, but at the same time it is far from
> catastrophic and in no way rules out using DTrace. IMHO.
Actually, currently I am running DTrace on a microbenchmark, just to validate the numbers before moving to more complex workloads. In the microbenchmark I have a single mutex lock. When I introduced some delay to make the critical section bigger, the difference between the gethrtime() measurement and DTrace was significantly reduced.

And can someone explain more about aggregations? I have a couple of them in my bigger scripts, and I use various strings to differentiate the different numbers.

Thanks for the help,
Neelam
On Wed, 2007-11-07 at 21:52 -0500, Rayson Ho wrote:
> Roman,
>
> Since you are on the Sun Studio team, and since this thread is related
> to threads/locks, maybe this is a good place to ask...
>
> Intel has Thread Checker, which is similar to Sun's Thread Analyzer.
> However, is there a tool on Solaris that is similar to Intel's Thread
> Profiler??

I'm not all that familiar with Intel's Thread Profiler, but from the exposure I got at various conferences it seems to be a tool that lets you detect and fix things like lock contention, resource starvation, false cache sharing and such. If that's what you're interested in, I have two answers. First of all, Sun Studio Performance Analyzer can do some of it for you (you just have to know how to use it ;-)). Plus, we're currently working on Project D-Light (based on DTrace technology), where some of the model problems have to do with lock contention and resource starvation. We're just approaching them from a different (more generalized) direction.

And for the record: I actually disagree with Intel's approach of creating a dedicated performance-analysis tool for each domain. These domains have enough similarities to be exploited in a single framework (hint: in Intel's world, would I need an MPI profiler in addition to Thread Profiler? I guess I would, but what if I'm dealing with a mixed-mode application that is MPI and pthreaded at the same time?).

Thanks,
Roman.
On Nov 7, 2007 11:24 PM, Neelam <neelam.goyal at gmail.com> wrote:
> Actually, currently I am running DTrace on a microbenchmark, just to
> validate the numbers before moving to more complex workloads. In the
> microbenchmark I have a single mutex lock. When I introduced some delay
> to make the critical section bigger, the difference between the
> gethrtime() measurement and DTrace was significantly reduced.
>
> And can someone explain more about aggregations? I have a couple of
> them in my bigger scripts, and I use various strings to differentiate
> the different numbers.

When you do something like:

    @foo["this case here"] = aggfunc1(args1);
    @foo["case two here"] = aggfunc2(args2);

DTrace has to do a substantial amount of work *each time these lines are run* to hash the string into the correct aggregation bucket. A better approach is:

    @case1 = aggfunc1(args1);
    @case2 = aggfunc2(args2);
    ...

    END
    {
            printa("Case one: %@d\n", @case1);
            printa("Case two: %@d\n", @case2);
    }

which removes all of the string processing.

Cheers,
- jonathan