Hi All,

I'm looking for some advice: I want to use DTrace to observe the latency and results of all vnode operations. This is something we want to do in production.

For example, I'd like to run a D script using the fop_read:entry/return FBT probes to save uio_resid on entry, subtract off uio_resid after the vop call returns, and then display the number of bytes transferred. Throw in a little vtimestamp subtraction and I've got latency as well.

The trouble is that most of the fop calls are tail-call optimized, which (as you well know) makes the return probe fire before the (*(vp)->v_op->vop_read)(vp, uiop, ioflag, cr, ct); call happens. I've tried the following:

- I followed the suggestions in the DTrace cookbook for dealing with tail-call optimized routines: http://dtrace.eng/recipes/chasetail.html . After the 100K+ probes are enabled, there is a significant performance hit on the system. (And I'll want to use this on systems that are already having problems.)

- I looked at our production compilers and noticed that they have lower optimization levels (e.g., -xO1) that won't do tail-call optimization, but that seems wrong on several levels: it's (a) compiler-specific and (b) too big a hammer (even just for vnode.c).

- I added a static probe after the vop call. This works and has good observed performance, but all I'm really doing is subverting the tail-call optimization.

Seems like there must be a better way. Any suggestions? Thanks for your help,

Rich
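[For reference: the recipe URL above is internal, so the following is only a sketch of the general tail-chasing technique, not the recipe itself. The idea is to mark the thread in fop_read:entry and catch the tail-called vop implementation with wildcard fbt probes — which is why enabling it lights up 100K+ probes:]

--------------------- cut -------------------------
#!/usr/sbin/dtrace -qs

/* Remember that this thread is inside fop_read. */
fbt::fop_read:entry
{
	self->in = 1;
}

/*
 * The first entry probe to fire afterwards (other than
 * fop_read's own wildcard match) is the tail-called
 * implementation: ufs_read, zfs_read, etc.
 */
fbt:::entry
/ self->in == 1 && probefunc != "fop_read" /
{
	self->in = 2;
	self->func = probefunc;
	self->ts = vtimestamp;
}

/* Its return probe fires when the vop really completes. */
fbt:::return
/ self->in == 2 && probefunc == self->func /
{
	@lat[probefunc] = quantize(vtimestamp - self->ts);
	self->in = 0;
}
--------------------- cut -------------------------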
Did you check out Richard McDougall's VFS tracing script?

http://blogs.sun.com/roller/page/rmc/Weblog?catname=%2FDTrace

/jim
Hi Jim,

Thanks for the pointer. It gives information on what the vop intends to do plus any actual I/O performed, but I'm looking for the results of the vop itself. Here's an example of a script that gives me exactly the output I'm looking for, but only on a system that doesn't optimize the tail call for fop routines.

If I use the following D script:

--------------------- cut -------------------------
#!/usr/sbin/dtrace -Cqs

#include <sys/uio.h>

fop_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
	self->tstart = vtimestamp;
	self->path = ((vnode_t *)arg0)->v_path;
	self->resid = ((uio_t *)arg1)->uio_resid;
	self->uiop = arg1;
	self->offset = ((uio_t *)arg1)->uio_offset;
}

fop_read:return
/ self->resid /
{
	printf("READ of %s at %d for %ld bytes in %d usec\n",
	    stringof(self->path),
	    self->offset,
	    (self->resid - ((uio_t *)self->uiop)->uio_resid),
	    (vtimestamp - self->tstart)/1000);
}
--------------------- cut -------------------------

And do: dd if=/kernel/genunix of=/dev/null bs=1024k count=1

Then I see the following on an Opteron box running Solaris:

READ of /dev/pts/../../devices/pseudo/pts@0:10 at 7838 for 1 bytes in 7 usec
READ of /usr/bin/dd at 0 for 8 bytes in 10 usec
READ of /usr/bin/dd at 0 for 52 bytes in 1 usec
READ of /usr/bin/dd at 52 for 192 bytes in 1 usec
READ of /usr/bin/dd at 244 for 17 bytes in 1 usec
READ of /lib/ld.so.1 at 0 for 52 bytes in 4 usec
READ of /lib/ld.so.1 at 52 for 160 bytes in 1 usec
READ of /kernel/genunix at 0 for 1048576 bytes in 2479 usec

But if I do this on a SPARC box, I get 0 for the bytes transferred and 0 for the latency, because the 'return' probe fires before the vop call has been made.

Thanks,

Rich
Got it Rich, thanks. Cool script. Unfortunately, I have no further suggestions for getting around your specific problem.

Best,
/jim
Roch Bourbonnais - Performance Engineering
2005-Aug-03 10:54 UTC
[dtrace-discuss] Getting around tail-calls...
You could spell out the *fs_read routines of interest. Include the -Z option in case a given FS is not installed:

	ufs_read:entry,
	zfs_read:entry,
	qfs_read:entry,
	vxfs_read:entry,
	nfs3_read:entry,
	nfs4_read:entry...
	/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /

It won't be too ugly.

-r
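[Fleshed out, this suggestion combined with the fop_read script from earlier in the thread might look like the sketch below. The list of *fs_read routines is illustrative; with -Z, probes for file systems that aren't loaded are simply skipped rather than causing an error:]

--------------------- cut -------------------------
#!/usr/sbin/dtrace -CqZs

#include <sys/uio.h>

ufs_read:entry,
zfs_read:entry,
nfs3_read:entry,
nfs4_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
	self->tstart = vtimestamp;
	self->path = ((vnode_t *)arg0)->v_path;
	self->resid = ((uio_t *)arg1)->uio_resid;
	self->uiop = arg1;
	self->offset = ((uio_t *)arg1)->uio_offset;
}

ufs_read:return,
zfs_read:return,
nfs3_read:return,
nfs4_read:return
/ self->resid /
{
	printf("READ of %s at %d for %ld bytes in %d usec\n",
	    stringof(self->path), self->offset,
	    (self->resid - ((uio_t *)self->uiop)->uio_resid),
	    (vtimestamp - self->tstart)/1000);
	self->resid = 0;
}
--------------------- cut -------------------------

[Unlike fop_read, these per-file-system routines are the targets of the tail call, so their return probes fire after the I/O has actually completed.]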
Hi Roch,

I was thinking of something similar recently. I didn't know about the -Z option, which makes this viable. Thanks! (David Robinson and I were discussing various ways to discover what vnode operations might be available. That can get ugly pretty quickly.)

My goal is still to do this from the fop_* level. One of the reasons those calls were added was to give us a file-system-independent place to stand to do just these kinds of things.

I've had performance runs (PerfPIT) done on a build with static probes added to fop_read, fop_write, and fop_getattr. So far, there are no statistically significant regressions in the results. If this doesn't pan out, your suggestion is at the top of my list.

Thanks,

Rich
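[For what it's worth, the static-probe approach described above presumably looks something like the following in the kernel — a hypothetical sketch using the standard SDT DTRACE_PROBE macros from <sys/sdt.h>, not the actual putback:]

--------------------- cut -------------------------
#include <sys/sdt.h>

int
fop_read(vnode_t *vp, uio_t *uiop, int ioflag, cred_t *cr,
    caller_context_t *ct)
{
	int err;

	/*
	 * Storing the result and firing a probe after the call
	 * keeps the compiler from emitting a tail call, so both
	 * this probe and fop_read:return fire after the vop has
	 * actually completed.
	 */
	err = (*(vp)->v_op->vop_read)(vp, uiop, ioflag, cr, ct);
	DTRACE_PROBE4(fop__read__done, vnode_t *, vp, uio_t *, uiop,
	    int, ioflag, int, err);
	return (err);
}
--------------------- cut -------------------------

[This would surface as an sdt probe named fop-read-done, usable without enabling the 100K+ wildcard fbt probes.]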