Hi All,

I'm looking for some advice:

I want to use DTrace to observe the latency and results of all vnode operations. This is something we want to do in production.

For example, I'd like to run a D script using the fop_read:entry/return FBT probes to save uio_resid on entry, subtract off uio_resid after the vop call returns, then display the number of bytes transferred. Throw in a little vtimestamp subtraction and I've got latency also.

The trouble is that most of the fop calls are tail-call optimized, which (as you well know) makes the return probe fire before the (*(vp)->v_op->vop_read)(vp, uiop, ioflag, cr, ct); happens. I've tried the following:

- I followed the suggestions in the DTrace cookbook for dealing with tail-call optimized routines: http://dtrace.eng/recipes/chasetail.html . After the 100K+ probes are enabled, there is a significant performance hit on the system. (And I'll want to use this on systems that are already having problems.)

- I looked at our production compilers and noticed that they have lower levels of optimization (e.g., -xO1) that won't do tail-call optimization, but that seems like the wrong approach on several levels: it's (a) compiler-specific and (b) too big a hammer (even just for vnode.c).

- I added a static probe after the vop call. This works and has good observed performance, but all I'm really doing is subverting the tail-call optimization. Seems like there must be a better way.

Any suggestions? Thanks for your help,

Rich
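The chasetail recipe itself isn't reproduced in this thread, but a workaround of that general shape might look roughly like the sketch below, which also shows where the 100K+ enabled probes come from. This is only an illustration, not the recipe's code: it gates a broad fbt:::return enabling on a thread-local flag plus a stack-depth check, on the assumption that the tail-called fs-specific routine runs (and returns) at the same stack depth at which fop_read was entered.

--------------------- cut -------------------------
#!/usr/sbin/dtrace -Cqs

#include <sys/uio.h>

fbt::fop_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
        self->tstart = vtimestamp;
        self->depth  = stackdepth;      /* frame the tail call will reuse */
        self->resid  = ((uio_t *)arg1)->uio_resid;
        self->uiop   = arg1;
}

/*
 * Broad enabling: this is where the huge probe count (and the
 * performance hit) comes from.  The tail-called *_read routine
 * returns directly to fop_read's caller, so we look for the first
 * non-fop_read return at (or shallower than) fop_read's original
 * stack depth -- an assumption, as noted above.
 */
fbt:::return
/ self->resid && stackdepth <= self->depth && probefunc != "fop_read" /
{
        printf("READ for %ld bytes in %d usec\n",
            (self->resid - ((uio_t *)self->uiop)->uio_resid),
            (vtimestamp - self->tstart) / 1000);
        self->tstart = 0;
        self->depth  = 0;
        self->resid  = 0;
        self->uiop   = 0;
}
--------------------- cut -------------------------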
Did you check out Richard McDougall's VFS tracing script?

http://blogs.sun.com/roller/page/rmc/Weblog?catname=%2FDTrace

/jim
Hi Jim,

Thanks for the pointer. It gives the information on what the vop intends to do plus any actual I/O performed, but I'm looking for the results of the vop itself. Here's an example of a script that gives me exactly the output I'm looking for, but only on a system that doesn't optimize the tail-call for fop routines.

If I use the following D script:
--------------------- cut -------------------------
#!/usr/sbin/dtrace -Cqs

#include <sys/uio.h>

fop_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
        self->tstart = vtimestamp;
        self->path   = ((vnode_t *)arg0)->v_path;
        self->resid  = ((uio_t *)arg1)->uio_resid;
        self->uiop   = arg1;
        self->offset = ((uio_t *)arg1)->uio_offset;
}

fop_read:return
/ self->resid /
{
        printf("READ of %s at %d for %ld bytes in %d usec\n",
            stringof(self->path),
            self->offset,
            (self->resid - ((uio_t *)self->uiop)->uio_resid),
            (vtimestamp - self->tstart)/1000);

        /* clear the thread-locals so a later fop_read whose entry
           clause didn't fire can't reuse stale values */
        self->tstart = 0;
        self->path   = 0;
        self->resid  = 0;
        self->uiop   = 0;
        self->offset = 0;
}
--------------------- cut -------------------------
And do: dd if=/kernel/genunix of=/dev/null bs=1024k count=1
Then I see the following on an Opteron box running Solaris:
READ of /dev/pts/../../devices/pseudo/pts@0:10 at 7838 for 1 bytes in 7 usec
READ of /usr/bin/dd at 0 for 8 bytes in 10 usec
READ of /usr/bin/dd at 0 for 52 bytes in 1 usec
READ of /usr/bin/dd at 52 for 192 bytes in 1 usec
READ of /usr/bin/dd at 244 for 17 bytes in 1 usec
READ of /lib/ld.so.1 at 0 for 52 bytes in 4 usec
READ of /lib/ld.so.1 at 52 for 160 bytes in 1 usec
READ of /kernel/genunix at 0 for 1048576 bytes in 2479 usec
But if I do this on a SPARC box, I get 0 for the bytes transferred and 0 for the latency because the 'return' probe fires before the vop call has been made.
Thanks,
Rich
Got it Rich, thanks. Cool script. Unfortunately, I have no further suggestions on getting around your specific problem.

Best,

/jim
Roch Bourbonnais - Performance Engineering
2005-Aug-03 10:54 UTC
[dtrace-discuss] Getting around tail-calls...
You could spell out the *fs_read routines of interest? Include the -Z option in case a given FS is not installed.

   ufs_read:entry,
   zfs_read:entry,
   qfs_read:entry,
   vxfs_read:entry,
   nfs3_read:entry,
   nfs4_read:entry...
   / execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /

It won't be too ugly.
-r
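A sketch of how Rich's earlier script might be adapted along these lines follows. The *_read entry points are taken from Roch's list and assumed to match the kernel's actual FBT function names (qfs and vxfs are omitted here since their function names depend on those products); -Z lets dtrace start even when some of the named probes don't exist because a filesystem module isn't loaded. Because the fs-specific routine is the target of fop_read's tail call, its return probe fires only after the real work is done.

--------------------- cut -------------------------
#!/usr/sbin/dtrace -CZqs

#include <sys/uio.h>

/* per-filesystem read entry points (names assumed from the list above) */
ufs_read:entry,
zfs_read:entry,
nfs3_read:entry,
nfs4_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
        self->tstart = vtimestamp;
        self->path   = ((vnode_t *)arg0)->v_path;
        self->resid  = ((uio_t *)arg1)->uio_resid;
        self->uiop   = arg1;
        self->offset = ((uio_t *)arg1)->uio_offset;
}

/*
 * These return probes fire after the read has completed, since the
 * fs-specific routine is what fop_read tail-calls into.
 */
ufs_read:return,
zfs_read:return,
nfs3_read:return,
nfs4_read:return
/ self->resid /
{
        printf("READ of %s at %d for %ld bytes in %d usec\n",
            stringof(self->path),
            self->offset,
            (self->resid - ((uio_t *)self->uiop)->uio_resid),
            (vtimestamp - self->tstart)/1000);
        self->tstart = 0;
        self->path   = 0;
        self->resid  = 0;
        self->uiop   = 0;
        self->offset = 0;
}
--------------------- cut -------------------------

Running the same dd test as above should then produce per-filesystem output comparable to the fop_read version.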
Hi Roch,

I was thinking of something similar recently. I didn't know about the -Z option, which makes this viable. Thanks! (David Robinson and I were discussing various ways to discover what vnode operations might be available. That can get ugly pretty quickly.)

My goal is still to do this from the fop_* level. One of the reasons those calls were added was to give us a file-system-independent place to stand to do just these kinds of things. I've had performance runs (PerfPIT) done on a build with static probes added to fop_read, fop_write, and fop_getattr. So far, there are no statistically significant regressions in the results.

If this doesn't pan out, your suggestion is at the top of my list.

Thanks,

Rich
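For illustration, a consumer of such a static probe might look like the sketch below. The probe name fop-read-done, its module, and its argument list are hypothetical placeholders, not what was actually added to the build, and -Z lets the script start even on a kernel where the probe doesn't exist.

--------------------- cut -------------------------
#!/usr/sbin/dtrace -CZqs

#include <sys/uio.h>

/* entry side unchanged: FBT probe on fop_read */
fbt::fop_read:entry
/ execname != "dtrace" && ((uio_t *)arg1)->uio_resid != 0 /
{
        self->tstart = vtimestamp;
        self->resid  = ((uio_t *)arg1)->uio_resid;
        self->uiop   = arg1;
}

/*
 * HYPOTHETICAL static probe fired in fop_read() just after the
 * (*(vp)->v_op->vop_read)() call returns; the name and argument
 * layout are placeholders for whatever the instrumented build adds.
 */
sdt:genunix::fop-read-done
/ self->resid /
{
        @bytes["bytes read via fop_read"] =
            sum(self->resid - ((uio_t *)self->uiop)->uio_resid);
        @lat["fop_read latency (ns)"] =
            quantize(vtimestamp - self->tstart);
        self->tstart = 0;
        self->resid  = 0;
        self->uiop   = 0;
}
--------------------- cut -------------------------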