Brendan Gregg - Sun Microsystems
2006-Sep-19 00:48 UTC
[dtrace-discuss] DTrace Network Provider
G'Day Folks,

I've just created a website to flesh out ideas for a DTrace Network Provider. Let us know what you think.

http://www.opensolaris.org/os/community/dtrace/NetworkProvider/

I'm not currently writing this provider; this began as a discussion with Adam about what a network provider could do, which we thought would be appropriate to discuss on dtrace-discuss. So, this may not actually happen; although it would help if a lot of people really do want it. :-)

We are also interested to hear from Sun's network kernel engineers, to hear how doable this really is.

thanks,

Brendan
[Cc'ing networking-discuss, I suspect followups should go to dtrace-discuss, however]

Brendan Gregg - Sun Microsystems wrote:

> G'Day Folks,
>
> I've just created a website to flesh out ideas for a DTrace Network Provider.
> Let us know what you think.
>
> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/
>
> I'm not currently writing this provider; this began as a discussion with
> Adam about what a network provider could do, which we thought would be
> appropriate to discuss on dtrace-discuss. So, this may not actually happen;
> although it would help if a lot of people really do want it. :-)

+1, I for one certainly want it :)

A couple of questions.

What are the intents regarding ICMP, SCTP etc?

What are the semantics of these where ipf is involved? I assume that a packet blocked by ipf will not fire the probes, for obvious reasons. Would ipf gain probes for packets being dropped/being nat'd, etc?

> We are also interested to hear from Sun's network kernel engineers, to
> hear how doable this really is.
>
> thanks,
>
> Brendan
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org
Brendan Gregg - Sun Microsystems
2006-Sep-19 02:24 UTC
[dtrace-discuss] DTrace Network Provider
On Mon, Sep 18, 2006 at 09:31:48PM -0400, Richard Lowe wrote:

> [Cc'ing networking-discuss, I suspect followups should go to
> dtrace-discuss, however]
>
> Brendan Gregg - Sun Microsystems wrote:
> > G'Day Folks,
> >
> > I've just created a website to flesh out ideas for a DTrace Network
> > Provider. Let us know what you think.
> >
> >   http://www.opensolaris.org/os/community/dtrace/NetworkProvider/
> >
> > I'm not currently writing this provider; this began as a discussion with
> > Adam about what a network provider could do, which we thought would be
> > appropriate to discuss on dtrace-discuss. So, this may not actually happen;
> > although it would help if a lot of people really do want it. :-)
>
> +1, I for one certainly want it :)
>
> A couple of questions.
>
> What are the intents regarding ICMP, SCTP etc?

They should be included!

> What are the semantics of these where ipf is involved? I assume that a
> packet blocked by ipf will not fire the probes, for obvious reasons.
> Would ipf gain probes for packets being dropped/being nat'd, etc?

Hmm, I hadn't thought of ipf. It would be great, for troubleshooting, to see some ipf probes... :)

Brendan
> G'Day Folks,
>
> I've just created a website to flesh out ideas for a DTrace Network Provider.
> Let us know what you think.

Way cool. This would take the place of much of what I use snoop for. Rarely do I care about data that would not be available through this.

What happens when...

- A packet comes in with an 802.1Q tag
-- And a VLAN interface for the corresponding VLAN is plumbed
-- And a VLAN interface for the corresponding VLAN is not plumbed
- A packet comes in with a different ethertype
-- e.g. Jumbo packet and jumbo is not enabled
-- LLT (Veritas Cluster interconnect)
- A packet comes in on an interface that is not plumbed (LLT interfaces are commonly not plumbed)

I look forward to seeing this move forward.

Mike

This message posted from opensolaris.org
Mike Gerdts wrote:

>> G'Day Folks,
>>
>> I've just created a website to flesh out ideas for a DTrace Network Provider.
>> Let us know what you think.
>
> Way cool. This would take the place of much of what I use snoop for. Rarely
> do I care about data that would not be available through this.

Yes, and I think the power of the D programming language will make controlling the output a lot more flexible than snoop.

Chip
hi Brendan

looks great! a few thoughts:

- i was trying to figure out how ipv6 would fit into things, and i guess there's a few issues here. one that occurs to me is the way the ip-related probes pass back an ipv4-specific pointer. i wonder, would another way of getting the packet data back to the probe consumers be to pass back an array as arg1, and have a set of indices defined such that if i want arg1[ETHER] i get a pointer to the ether header that i can parse etc. this would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure about the feasibility of this from the provider side though. i guess another issue with v6 is getting v6 addresses to work as predicates.

- maybe we could augment the instrumentation of the MIB probes to pass back packet data too, this might be useful?

- to investigate performance bottlenecks in depth, combining this work with instrumentation of the Packet Event Framework would perhaps be useful. also, i imagine the Clearview and Nemo folks would have suggestions on instrumenting loopback and adding probes to gld.

- can you think of a way to "pick up" an individual packet and follow it through the stack? i'm thinking a DTrace script that set probes at all the interesting points, then sets variables that uniquely identify the packet at gld, and can then subsequently be used at the other probe points as predicates. i guess this might require "decoding" the packet at each level, in the sense of providing pointers into each protocol header, but it might be worthwhile.

alan

This message posted from opensolaris.org
Alan Maguire writes:

> - i was trying to figure out how ipv6 would fit into things, and i guess
> there's a few issues here. one that occurs to me is the way the ip-related
> probes pass back an ipv4-specific pointer. i wonder, would another way of
> getting the packet data back to the probe consumers be to pass back an
> array as arg1, and have a set of indices defined such that if i want
> arg1[ETHER] i get a pointer to the ether header that i can parse etc. this
> would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure
> about the feasibility of this from the provider side though. i guess
> another issue with v6 is getting v6 addresses to work as predicates.

Only one of those indices could be non-NULL at a time, so that doesn't seem like a good change to me. I think it makes tracing a bit too error-prone.

Having separate probe points for v4 and v6, each with its own parameters, might make more sense, even if there are some commonalities at the transport layer.

Plus, SCTP is notably absent.

I think a bigger issue here is getting lower-level details. The existing proposal seems to center around the sockets API model, but I'm not sure that's really the right direction to go. I think it'd be truer to the dtrace philosophy if it were to expose all the internal state for these protocols. That means having probes that represent IP's forwarding and fanout actions (including drop and ICMP error generation); probes that reflect TCP's window handling (including out-of-window, overlapping segment trimming, and ack generation), state machine, and timers; probes for IGMP activity; probes for sockfs interaction, and so on.

> - maybe we could augment the instrumentation of the MIB probes to pass back
> packet data too, this might be useful?

Indeed. The current design of the MIB probe code looks like it would make this hard to do, but it'd be valuable just the same.

> - can you think of a way to "pick up" an individual packet and follow it
> through the stack? i'm thinking a DTrace script that set probes at all the
> interesting points, then sets variables that uniquely identify the packet
> at gld, and can then subsequently be used at the other probe points as
> predicates. i guess this might require "decoding" the packet at each level,
> in the sense of providing pointers into each protocol header, but it might
> be worthwhile.

Having a "dblk event" provider would likely be very helpful (and might even allow us to retire the existing STREAMS flow tracing), but there are a lot of complications involved here:

- What does dupb() look like at the dtrace probe level? How about copyb()? What happens when someone does allocb(), copies the data, and then does freeb() (as with IPsec)?

- How do you express interest in a particular dblk? I could imagine having something like "follow_dblk(dblk_t *)" and "abandon_dblk(dblk_t *)" added to the dtrace programming model, and having a set of "dblk:<module>:<function>:<name>" probes that fire on the followed dblk(s). Is that expressive enough?

- Is there perhaps a more general principle at play here? Instead of a dblk-follower, should we have an allocated-and-possibly-ref-counted-object-follower?

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
James Carlson wrote:

> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

Agreed. Even if the proposal only deals with the socket API mode of operation, it still does not capture some interactions. For example, I think the tcp::connect-established will not fire based on the proposal if an app does a simultaneous connect (active connect from both sides). This may not be common, but we got a bug report some time back from a customer who has problems in exactly this operation.

> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

One obvious reason why a provider needs to expose the above internal data of protocols is in the proposal web page. The "Solved Problems" list has "Why are outbound connections slow?" but I don't think the proposal can actually give out the answer. The proposal can tell a user that the throughput is low. But why is it low? It may be that the other end is slow in receiving so that TCP is in a zero window probing mode. Or it may be that there is a lot of congestion and packets are being dropped. Or it may be that the driver has a bug such that some packets are stuck in the driver's send queue for a long time (need the ability to trace a packet to detect this). Or it may be that there are a lot of checksum errors (it is not clear exactly the position of the probe in the current proposal to determine if the tcp::receive probe is fired after the checksum is verified or not). Or it may be that there is a bug in the sockfs/TCP interaction such that the data is stuck at the sockfs layer and cannot reach TCP.

So to answer the question "Why are outbound connections slow?", we really need to check the whole stack, from sockfs down to the driver. Simple probes like tcp::send, tcp::receive, ip::send, ip::receive, ... cannot tell us the reason. If the net provider only has these simple probes, I am wondering if we really need a separate provider instead of just using sdt and putting well positioned static probes. Are there some guidelines of when we need a provider versus just using sdt?

-- 
K. Poon.
kacheong.poon at sun.com
Kacheong Poon wrote:

>> That means having probes that represent IP's forwarding and fanout
>> actions (including drop and ICMP error generation); probes that
>> reflect TCP's window handling (including out-of-window, overlapping
>> segment trimming, and ack generation), state machine, and timers;
>> probes for IGMP activity; probes for sockfs interaction, and so on.
>
> One obvious reason why a provider needs to expose the above
> internal data of protocols is in the proposal web page. The
> "Solved Problems" list has "Why are outbound connections slow?"
> but I don't think the proposal can actually give out the answer.
> The proposal can tell a user that the throughput is low. But why
> is it low?

The above requirement applies to IP forwarding throughput as well. I had sent email to Brendan about this, and am waiting for a response. Given a Solaris box that is configured to be a forwarder, I want to find out how long it takes a packet to be forwarded out onto the wire from the time it is received (this would mean time spent in IP and in network driver code). It also would help to do performance bottleneck analysis. Will the DTrace Network Provider be able to provide this info, or is another tool more appropriate for this form of analysis?

> It may be that the other end is slow in receiving
> so that TCP is in a zero window probing mode. Or it may be that
> there is a lot of congestion and packets are being dropped. Or
> it may be that the driver has a bug such that some packets are
> stuck in the driver's send queue for a long time (need the ability
> to trace a packet to detect this). Or it may be that there are a
> lot of checksum errors (it is not clear exactly the position of
> the probe in the current proposal to determine if the tcp::receive
> probe is fired after the checksum is verified or not). Or it may
> be that there is a bug in the sockfs/TCP interaction such that
> the data is stuck at the sockfs layer and cannot reach TCP.
>
> So to answer the question "Why are outbound connections slow?",
> we really need to check the whole stack, from sockfs down to the
> driver. Simple probes like tcp::send, tcp::receive, ip::send,
> ip::receive, ... cannot tell us the reason. If the net provider
> only has these simple probes, I am wondering if we really need a
> separate provider instead of just using sdt and putting well positioned
> static probes. Are there some guidelines of when we need a provider
> versus just using sdt?
G'Day Alan,

On Wed, 20 Sep 2006, Alan Maguire wrote:

> hi Brendan
>
> looks great! a few thoughts:
>
> - i was trying to figure out how ipv6 would fit into things, and i
> guess there's a few issues here. one that occurs to me is the way the
> ip-related probes pass back an ipv4-specific pointer. i wonder, would
> another way of getting the packet data back to the probe consumers be
> to pass back an array as arg1, and have a set of indices defined such
> that if i want arg1[ETHER] i get a pointer to the ether header that i
> can parse etc. this would give consumers a simple test for ipv6 - is
> arg1[IPv6] NULL? not sure about the feasibility of this from the
> provider side though. i guess another issue with v6 is getting v6
> addresses to work as predicates.

Yes, there are indeed a few issues. :)

Interesting - having arg1[name] would be useful to allow easy extension to other headers, such as ARP, ICMP, SCTP, etc. The way I had it in mind was: args[1] has members for all IP info (either v4 or v6), NULL if not used, and a programmer would write,

	net:ip::send /args[2]->ip_ver == 4/ { /* IPv4 processing */ }

	net:ip::send /args[2]->ip_ver == 6/ { /* IPv6 processing */ }

	net:ip::send { /* Process both */ }

... but my idea may not scale as well to other headers.

> - maybe we could augment the instrumentation of the MIB probes to
> pass back packet data too, this might be useful?

Yes! - the mib probes are already fairly well placed, so it would be nice to use them further.

> - to investigate performance bottlenecks in depth, combining this work
> with instrumentation of the Packet Event Framework would perhaps be
> useful. also, i imagine the Clearview and Nemo folks would have
> suggestions on instrumenting loopback and adding probes to gld.

Instrumenting the PEF will be interesting; but an internal Sun networking component wouldn't be the first place I imagine customers would want to trace. First up is standard RFC* style probes, tcp::send, tcp::receive, tcp::connection-established, etc; then comes Sun internals. I haven't looked at instrumenting loopback or gld yet - so I'd be interested to hear from the Clearview/Nemo folks!

> - can you think of a way to "pick up" an individual packet and follow
> it through the stack? i'm thinking a DTrace script that set probes at
> all the interesting points, then sets variables that uniquely identify
> the packet at gld, and can then subsequently be used at the other probe
> points as predicates. i guess this might require "decoding" the packet
> at each level, in the sense of providing pointers into each protocol
> header, but it might be worthwhile.

There should be ways to trace it through parts of the stack, and there are some unique identifiers available; for example, the address of the struct conn_s can serve as a connection identifier after it has been established (eg, ipcl_classify_v4() in ip_tcp_input()). To go entirely through the stack from ethernet received to ethernet sent is going to be some work; but again, I'd like to see this done, but as with the DTrace io provider - it's not the first thing that is needed.

In fact - that actually sounds like a provider that would help internal Sun engineers the most; so what we have on our hands is two providers:

	net:::		Network Provider for customers, RFC* generic details,
			similar to the io provider. Bytes by IP, bytes by port,
			connect time...

	net-priv:::	Network Provider for Sun engineers, private Sun internals.
			layer by layer tracing, IP classifier internals...

cheers,

Brendan
> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

No -- wrong. You are confusing the philosophies of DTrace (zero disabled probe effect, absolute safety, systemic instrumentation etc.) with the motivation for providing Stable providers. Stable providers should export semantics _not_ implementation; if you want to export the implementation, go right ahead -- with an SDT provider of Private stability. (Indeed, we took great care to define our own stability taxonomy to allow exactly this; such a provider needn't be documented or ARC'd.) But understand that Stable providers are intended for those _using_ the system (e.g., Solaris users) not those _implementing_ the system (e.g., you). The vast majority of problems that users have with networking are not related to protocol intricacies like window handling -- they are due to vast amounts of unnecessary work. (As in, "Why is our monitoring script causing three quarters of the traffic!?" or "We shouldn't even be talking to that machine -- that project was supposed to have been decommissioned six weeks ago!")

More than anything, what we (or at least I) learned from DTrace is that if you want to get big wins, you don't make it incrementally faster to do existing work -- you eliminate work entirely by addressing its source higher in the stack of abstraction. It is easiest to find unnecessary work and eliminate it when one has Stable providers that are crisp, simple, and export the understood semantics of the system -- like our existing sched, proc and io providers.

I believe that Brendan's proposal is very much in this mold, and therefore I believe that it is the right direction for a Stable networking provider -- but that should not deter you from adding Private SDT providers to export whatever implementation details you need when instrumenting the system.

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
Brendan Gregg wrote:

> Instrumenting the PEF will be interesting; but an internal Sun networking
> component wouldn't be the first place I imagine customers would want to
> trace. First up is standard RFC* style probes, tcp::send, tcp::receive,
> tcp::connection-established, etc; then comes Sun internals.

It is unclear what is meant by "standard RFC* style probes." Does it mean an API style? But there is no RFC for a standard API as IETF does not standardize APIs. Could you please elaborate on that?

In the proposal, it says that those tcp::send, ..., are placed "close to IP interface." It is unclear what that means, especially as args[2] is about IP header info. For example, a TCP segment can be sent through a tunnel. So what is the IP header info in args[2]? What is meant by "close to?" And does tcp::send include sending ACKs?

The above are examples of why a probe like tcp::send can be confusing. It is too general a concept of tcp sending something.

> net:::	Network Provider for customers, RFC* generic details, similar to
> 	the io provider. Bytes by IP, bytes by port, connect time...

What is the scope of this provider? For example, the current proposal does not answer "Why are outbound connections slow?" So if there are some problems which we assume that customers are interested to solve, could you please list them here?

-- 
K. Poon.
kacheong.poon at sun.com
G'Day James,

On Wed, 20 Sep 2006, James Carlson wrote:
[...]

> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

So long as internal state means RFC standard and well understood state, and not Sun's private implementation state.

> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

I'd question how many of these were simply useful to know had occurred, in which case, mib or kstat should be tracking them (if not already).

> > - can you think of a way to "pick up" an individual packet and follow
> > it through the stack? i'm thinking a DTrace script that set probes at
> > all the interesting points, then sets variables that uniquely
> > identify the packet at gld, and can then subsequently be used at the
> > other probe points as predicates. i guess this might require
> > "decoding" the packet at each level, in the sense of providing
> > pointers into each protocol header, but it might be worthwhile.
>
> Having a "dblk event" provider would likely be very helpful (and might
> even allow us to retire the existing STREAMS flow tracing), but there
> are a lot of complications involved here:

Sure, which would be part of a private provider for Sun engineers.

> - What does dupb() look like at the dtrace probe level? How about
>   copyb()? What happens when someone does allocb(), copies the
>   data, and then does freeb() (as with IPsec)?
>
> - How do you express interest in a particular dblk? I could imagine
>   having something like "follow_dblk(dblk_t *)" and
>   "abandon_dblk(dblk_t *)" added to the dtrace programming model,
>   and having a set of "dblk:<module>:<function>:<name>" probes that
>   fire on the followed dblk(s). Is that expressive enough?
>
> - Is there perhaps a more general principle at play here? Instead
>   of a dblk-follower, should we have an allocated-and-possibly-ref-
>   counted-object-follower?

How many customers want this?

Sure, various Sun engineers want to track the code path that a packet takes through the stack, with timing details. And that's a different provider, a private provider that meets Sun engineers' needs. (I would have thought that fbt already solves most of this, with some clever coding).

The network provider for customers should focus on stable RFC defined and well understood semantics. tcp::send, tcp::receive, tcp::connection-established, etc. That's not to say that other functionality like you describe doesn't exist; it's to say that the focus is a simple and stable interface, like that which the io provider has.

I think that more customers will know RFC793 than they will know your dblks.

Brendan
G'Day Kacheong,

On Thu, 21 Sep 2006, Kacheong Poon wrote:

> Brendan Gregg wrote:
>
> > Instrumenting the PEF will be interesting; but an internal Sun networking
> > component wouldn't be the first place I imagine customers would want to
> > trace. First up is standard RFC* style probes, tcp::send, tcp::receive,
> > tcp::connection-established, etc; then comes Sun internals.
>
> It is unclear what is meant by "standard RFC* style probes."
> Does it mean an API style? But there is no RFC for a standard
> API as IETF does not standardize APIs. Could you please
> elaborate on that?

Show me in the RFCs where dblks are discussed. The provider proposal uses the stable terminology from the RFCs, and events from the RFCs, which customers should find easy to follow.

> In the proposal, it says that those tcp::send, ..., are placed
> "close to IP interface." It is unclear what that means,
> especially as args[2] is about IP header info.

I was trying not to describe the provider by actually writing the code.

> For example, a
> TCP segment can be sent through a tunnel. So what is the IP
> header info in args[2]?

The tunnel IP. People can tunnel traffic over SSH anyway, would you have us add USDT probes into sshd for details there?

> What is meant by "close to?"

For TCP that would be code that is closer to, say, gld than the application.

> And does tcp::send include sending ACKs?

tcp::send does include sending ACKs.

> The above are examples of why a probe like tcp::send can be
> confusing. It is too general a concept of tcp sending
> something.

Sure, there is no one place to put a tcp::send probe in tcp.c code; but nor are there hundreds.

> > net:::	Network Provider for customers, RFC* generic details, similar to
> > 	the io provider. Bytes by IP, bytes by port, connect time...
>
> What is the scope of this provider? For example, the current
> proposal does not answer "Why are outbound connections slow?"

Quoting from the website (maybe you didn't get this far?)

"Why are outbound connections slow? (Connect latency, 1st byte latency, throughput latency)"

These metrics are solvable from the proposal.

> So if there are some problems which we assume that customers
> are interested to solve, could you please list them here?

How about this to start with,

 1. New inbound TCP connections by port
 2. New inbound TCP connections by source IP address
 3. Inbound TCP packets by local port
 4. Inbound TCP packets by remote address
 5. Inbound TCP packets by remote address, for a specific port, eg: port 80 (HTTP)
 6. Sent TCP bytes by local port
 7. TCP bytes by address and port
 8. TCP bytes by address and port, per second
 9. Detect TCP connect() scan by IP address
10. Detect TCP SYN stealth scan by IP address
11. Detect TCP null scan by IP address
12. Detect TCP FIN stealth scan by IP address
13. Detect TCP Xmas scan by IP address
14. Network Utilization by TCP Port
15. Connect Latency
16. 1st Byte Latency
17. Throughput Latency
18. TCP Saturation Drops by IP address

direct from the website.

Brendan
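For illustration only, two of the items above (4 and 15) might look roughly like this in D, assuming the proposed probe names, an args[2] carrying IP addresses, and a stable connection id in arg0; all of these names and argument layouts are strawman assumptions, since the provider does not exist yet:

	/* 4. Inbound TCP packets by remote address (ip_saddr is a guessed member name) */
	net:tcp::receive
	{
		@inpkts[args[2]->ip_saddr] = count();
	}

	/* 15. Connect latency: SYN sent to connection established, keyed by arg0 */
	net:tcp::connect-request
	{
		constart[arg0] = timestamp;
	}

	net:tcp::connect-established
	/constart[arg0]/
	{
		@connlat["connect latency (ns)"] = quantize(timestamp - constart[arg0]);
		constart[arg0] = 0;
	}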
James Carlson wrote:

> Alan Maguire writes:
>> - i was trying to figure out how ipv6 would fit into things, and i guess
>> there's a few issues here. one that occurs to me is the way the ip-related
>> probes pass back an ipv4-specific pointer. i wonder, would another way of
>> getting the packet data back to the probe consumers be to pass back an
>> array as arg1, and have a set of indices defined such that if i want
>> arg1[ETHER] i get a pointer to the ether header that i can parse etc. this
>> would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure
>> about the feasibility of this from the provider side though. i guess
>> another issue with v6 is getting v6 addresses to work as predicates.
>
> Only one of those indices could be non-NULL at a time, so that
> doesn't seem like a good change to me. I think it makes tracing a bit
> too error-prone.
>
> Having separate probe points for v4 and v6, each with its own
> parameters, might make more sense, even if there are some
> commonalities at the transport layer.
>
> Plus, SCTP is notably absent.
>
> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.
>
> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

I disagree. As a user I have absolutely no interest in having to spend quality time with the RFCs to answer simple questions, much less spending time reading through and understanding the implementation. If I were interested in doing the latter, especially, I would be doing it now, using fbt.

-- Rich
G'Day Folks,

Lets summarise where we are at. So far I've got,

	net:tcp::accept-listen
	net:tcp::accept-established
	net:tcp::accept-closed
	net:tcp::accept-refused
	net:tcp::connect-request
	net:tcp::connect-established
	net:tcp::connect-closed
	net:tcp::drop-saturation
	net:tcp::receive
	net:tcp::send
	net:udp::receive
	net:udp::send
	net:ip::receive
	net:ip::send
	net:gld::receive
	net:gld::send

Plus other protocols as needed (ICMP, ARP, SCTP, etc). Their arguments are based on the RFCs.

Does anyone think that these shouldn't be done, or has a technical reason why they can't be done?

There have been dblk events and probes mentioned. Please list the 4-tuple probe names in mind. What would their arguments be? Please provide short examples of their use in scripts.

cheers,

Brendan
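As a rough sketch of their use in a script, and assuming (hypothetically) that args[2] carries the IP header info and args[3] the TCP header info with a local port member, new inbound connections could be summarised with something like:

	net:tcp::accept-established
	{
		/* tcp_lport and ip_saddr are guessed member names, not final */
		@byport[args[3]->tcp_lport] = count();
		@bysrc[args[2]->ip_saddr]   = count();
	}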
peter.memishian at sun.com
2006-Sep-21 09:12 UTC
[dtrace-discuss] Re: DTrace Network Provider
> Lets summarise where we are at. So far I've got,
>
> net:tcp::accept-listen
> net:tcp::accept-established
> net:tcp::accept-closed
> net:tcp::accept-refused
> net:tcp::connect-request
> net:tcp::connect-established
> net:tcp::connect-closed
> net:tcp::drop-saturation
> net:tcp::receive
> net:tcp::send
> net:udp::receive
> net:udp::send
> net:ip::receive
> net:ip::send
> net:gld::receive
> net:gld::send

[ Apologies if these questions were answered earlier. ]

Is there a reason the last two are `gld' rather than `link'? (Or are we explicitly excluding probing monolithic network devices like ce?) Also, when would `send' fire -- when the layer has been asked to send the packet, or when it's actually sent the packet to the next layer?

-- 
meem
>> The above are examples of why a probe like tcp::send can be
>> confusing. It is too general a concept of tcp sending
>> something.
>
> Sure, there is no one place to put a tcp::send probe in tcp.c code; but
> nor are there hundreds.

Maybe this should be 'tcp::sent'.

The TCP implementation decides when it should _try_ to send some of the usable data, and sometimes it manages to do some, sometimes not, sometimes it resends. Not much conceptual here. But at times, TCP manages to get some data one layer below and considers it sent ... but not acknowledged, then later acked.

So maybe

	tcp::sent     : suna was advanced, args: from seq, nbytes
	tcp::resent   : suna not advanced, args: from seq, nbytes
	tcp::acked    : acked pointer was advanced. args: seq, nbytes
	tcp::dupacked : acked pointer was not advanced. args: seq

I'd also like

	tcp::flowctrl : args on|off, waiting on a peer event to transmit.

Timings from sent to acked would give easy access to round trip time. flowctrl would help start to answer whether or not performance is limited by the peer (or at least something below TCP).

-r
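A rough D sketch of that round-trip-time idea, assuming the tcp::sent and tcp::acked probes above plus a connection identifier imagined here as arg0 (none of this is settled); acks that cover older segments will inflate the estimate:

	net:tcp::sent
	{
		lastsend[arg0] = timestamp;
	}

	net:tcp::acked
	/lastsend[arg0]/
	{
		@rtt["approx round trip time (ns)"] = quantize(timestamp - lastsend[arg0]);
		lastsend[arg0] = 0;
	}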
Bryan Cantrill writes:

> > I think a bigger issue here is getting lower-level details. The
> > existing proposal seems to center around the sockets API model, but
> > I'm not sure that's really the right direction to go. I think it'd be
> > truer to the dtrace philosophy if it were to expose all the internal
> > state for these protocols.
>
> No -- wrong. You are confusing the philosophies of DTrace (zero disabled
> probe effect, absolute safety, systemic instrumentation etc.) with the
> motivation for providing Stable providers. Stable providers should export
> semantics _not_ implementation; if you want to export the implementation,
> go right ahead -- with an SDT provider of Private stability.

Note that I said "internal state for these protocols." I did not say "export the implementation." I already expect sdt and fbt to do this part just fine. The two are very different concepts.

> (Indeed, we
> took great care to define our own stability taxonomy to allow exactly
> this; such a provider needn't be documented or ARC'd.) But understand
> that Stable providers are intended for those _using_ the system (e.g.,
> Solaris users) not those _implementing_ the system (e.g., you). The
> vast majority of problems that users have with networking are not related
> to protocol intricacies like window handling -- they are due to vast
> amounts of unnecessary work. (As in, "Why is our monitoring script
> causing three quarters of the traffic!?" or "We shouldn't even be talking
> to that machine -- that project was supposed to have been decommissioned
> six weeks ago!")

And I can certainly see a case for having both "simplified" (gross) trace points as well as tools to make it even simpler for users who don't care about the lower-level details. However, not exposing the protocol-defined state and activity in some stable form seems like a mistake to me. This activity is related to what these protocols do, and not related to how they're designed.

> More than anything, what we (or at least I) learned
> from DTrace is that if you want to get big wins, you don't make it
> incrementally faster to do existing work -- you eliminate work entirely by
> addressing its source higher in the stack of abstraction. It is easiest
> to find unnecessary work and eliminate it when one has Stable providers
> that are crisp, simple, and export the understood semantics of the
> system -- like our existing sched, proc and io providers.

And that's exactly the problem. We apparently disagree on what constitutes a crisp, simple provider in the context of networking. To me, failing to include TCP state transitions and special activity such as packet drops due to reasons _other_ than buffer overflow doesn't make the provider "crisp." It burns it to a crisp. It'd be like omitting sched:unix:setkpdq:enqueue or proc:unix:trap:fault from those providers because they're hard for casual users to understand. They are hard to understand, and almost certainly unimportant for the common sorts of questions that users may want to answer, but that doesn't seem to be a good argument to "dumb down" the interface to the point where it can't be used for non-trivial problem solving.

> I believe that Brendan's proposal is very much in this mold, and therefore
> I believe that it is the right direction for a Stable networking provider --
> but that should not deter you from adding Private SDT providers to export
> whatever implementation details you need when instrumenting the system.

Of course, but I think that still misses the point.
-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
Brendan Gregg writes:

> G'Day James,
>
> On Wed, 20 Sep 2006, James Carlson wrote:
> [...]
> > I think a bigger issue here is getting lower-level details. The
> > existing proposal seems to center around the sockets API model, but
> > I'm not sure that's really the right direction to go. I think it'd be
> > truer to the dtrace philosophy if it were to expose all the internal
> > state for these protocols.
>
> So long as internal state means RFC standard and well understood state,
> and not Sun's private implementation state.

OK, I think we're in sync on that.

> > That means having probes that represent IP's forwarding and fanout
> > actions (including drop and ICMP error generation); probes that
> > reflect TCP's window handling (including out-of-window, overlapping
> > segment trimming, and ack generation), state machine, and timers;
> > probes for IGMP activity; probes for sockfs interaction, and so on.
>
> I'd question how many of these were simply useful to know had occurred, in
> which case, mib or kstat should be tracking them (if not already).

There's indeed some overlap here. The sorts of questions I'd like to be able to answer are:

- this connection seems to have slow retransmits; why is the estimated RTT so large?

- there are a lot of misordered packets; is it possible that we're dropping packets with some sort of pattern?

- this one connection gets stuck in CLOSE_WAIT state; what event drove it there?

> > Having a "dblk event" provider would likely be very helpful (and might
> > even allow us to retire the existing STREAMS flow tracing), but there
> > are a lot of complications involved here:
>
> Sure, which would be part of a private provider for Sun engineers.

Yep; I agree it's a side conversation, and not directly relevant for a stable network provider.

> > - Is there perhaps a more general principle at play here? Instead
> >   of a dblk-follower, should we have an allocated-and-possibly-ref-
> >   counted-object-follower?
>
> How many customers want this?

Besides Sun engineers (I consider us to be customers of our own products ;-}), I suspect it's most useful for third party software designers -- those who create network drivers and applications.

> Sure, various Sun engineers want to track the code path that a packet
> takes through the stack, with timing details. And that's a different
> provider, a private provider that meets Sun engineers' needs. (I would
> have thought that fbt already solves most of this, with some clever
> coding).

I think that can be said of many of the other providers. It's just a matter of how ugly (and unmaintainable) you'd like your D code to become.

> The network provider for customers should focus on stable RFC defined
> and well understood semantics. tcp::send, tcp::receive,
> tcp::connection-established, etc. That's not to say that other
> functionality like you describe doesn't exist; it's to say that the focus
> is a simple and stable interface, like that which the io provider has.
>
> I think that more customers will know RFC793 than they will know
> your dblks.

Sure; as I said, it's a side-conversation. It was sparked by this previous comment:

> > - can you think of a way to "pick up" an individual packet and follow
> > it through the stack? i'm thinking a DTrace script that set probes at

I'm not sure how to do something like that without having some kind of low-level instrumentation.
Doing it at a high level (with a stable "packet" provider, perhaps) seems much less likely to me to be useful, as it's less likely to be related to application problems.

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
Richard Lowe writes:

> > That means having probes that represent IP's forwarding and fanout
> > actions (including drop and ICMP error generation); probes that
> > reflect TCP's window handling (including out-of-window, overlapping
> > segment trimming, and ack generation), state machine, and timers;
> > probes for IGMP activity; probes for sockfs interaction, and so on.
>
> I disagree. As a user I have absolutely no interest in having to spend
> quality time with the RFCs to answer simple questions, much less
> spending time reading through and understanding the implementation. If
> I were interested in doing the latter, especially, I would be doing it
> now, using fbt.

Adding this detail in does not mean that you're required to use it. It just makes it available.

Leaving this detail out means that others who do need it are stuck -- they have to depend on implementation artifacts (and follow those artifacts as they change in patches), or just give up.

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
hi folks

here's a possible scheme for TCP state machine probes (these states are based on the RFC 793-defined TCP state machine).

	net:tcp:state:closed
	net:tcp:state:listen
	net:tcp:state:syn_received
	net:tcp:state:fin_wait_1
	net:tcp:state:fin_wait_2
	net:tcp:state:closing
	net:tcp:state:established
	net:tcp:state:syn_sent
	net:tcp:state:close_wait
	net:tcp:state:time_wait
	net:tcp:state:last_ack

each will pass back the relevant conn_s and tcp_s structures. might also be good to pass back the previous tcp state, i.e. args are:

	prev_state, curr_state, conn_s, tcp_s

having these probes would allow scripts to do things such as watch connection state transitions (obviously). e.g:

	dtrace -n 'net:tcp:state:* { printf("tcp state machine transition from %d to %d\n", arg0, arg1); }'

and also examine connections that stay in particular states for a long time etc.

i guess there's some overlap between these probes and Brendan's suggested probes (e.g. "established" in this scheme == "connect-established"), but maybe it would make sense to split up the provider namespace into "protocol state" and "packet data" probes, i.e.

	net:tcp:state:
	net:tcp:packet:

with the former looking at state events, the latter at packet events (e.g. receive, drop)? would this be useful? (i think the common states users will be interested in are pretty intuitive from the RFC-derived names above, e.g. connection is listen'ing, established, closed). thoughts?

alan

This message posted from opensolaris.org
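Taking the suggested prev_state/curr_state/conn_s/tcp_s argument ordering at face value (it is only a sketch), time spent in each state could be summarised like this; the first clause fires before the second for the same probe, so the entry timestamp is read before it is overwritten:

	net:tcp:state:*
	/entered[arg2]/
	{
		/* arg0 is the previous state, arg2 the conn_s pointer used as a key */
		@statetime[arg0] = quantize(timestamp - entered[arg2]);
	}

	net:tcp:state:*
	{
		entered[arg2] = timestamp;
	}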
G'Day Roch,

On Thu, 21 Sep 2006, Roch wrote:

> > > The above are examples of why a probe like tcp::send can be
> > > confusing. It is too general a concept of tcp sending
> > > something.
> >
> > Sure, there is no one place to put a tcp::send probe in tcp.c code; but
> > nor are there hundreds.
>
> Maybe this should be 'tcp::sent'.
>
> The TCP implementation decides when it should _try_ to send some
> of the usable data, and sometimes it manages to do some, sometimes not,
> sometimes it resends. Not much conceptual here. But at
> times, TCP manages to get some data one layer below and
> considers it sent ... but not acknowledged, then later acked.

Sure; I hit this issue and thought it best that tcp::send be "we sent something with a TCP header" and tcp::receive be "we received something with a TCP header". Should matching on data packets be needed, the options were,

- Match on TCP length or other TCP flags

	net:tcp::send /args[3]->tcp_length > 0/

- Add other probes just for this,

	net:tcp::data-send
	net:tcp::data-receive

- Use the net:socket layer, where data is real data

On one hand, you could say that this idea will over match activity - tcp::send will match resends and tcp::receive duplicate packets, as we are looking at what TCP is doing to make the data I/O happen. However the io provider also over matches activity -- io::start isn't exactly equivalent to the application requesting data I/O, it is what the block I/O driver is doing to make the application I/O happen (which includes rounded up blocks, read-aheads and metadata from the previous file system layers). The io provider keeps it simple - we asked the disk to do something; it doesn't try to interpret whether that was required data or not (as we can do that from other layers).

> So maybe
>
> tcp::sent     : suna was advanced, args: from seq, nbytes
> tcp::resent   : suna not advanced, args: from seq, nbytes
> tcp::acked    : acked pointer was advanced. args: seq, nbytes
> tcp::dupacked : acked pointer was not advanced. args: seq

Cool - those algorithms look good. Would this still make sense as tcp::data-sent, tcp::data-resent, etc?

> I'd also like
>
> tcp::flowctrl : args on|off, waiting on a peer event to transmit.

fair enough.

> Timings from sent to acked would give easy access to round
> trip time.

Exactly - I ran out of time to write that example by matching on TCP flags from args[3].

> flowctrl would help start to answer whether or not
> performance is limited by the peer (or at least something below
> TCP).

Yes!

Brendan
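A tiny sketch of the first option, using the strawman tcp_length member of args[3] to separate data segments from segments carrying no payload (such as pure ACKs):

	net:tcp::send
	{
		@segs[args[3]->tcp_length > 0 ? "data" : "no payload (e.g. pure ACK)"] = count();
	}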
Brendan Gregg wrote:

>> It is unclear what is meant by "standard RFC* style probes."
>> Does it mean an API style? But there is no RFC for a standard
>> API as IETF does not standardize APIs. Could you please
>> elaborate on that?
>
> Show me in the RFCs where dblks are discussed. The provider proposal uses
> the stable terminology from the RFCs, and events from the RFCs, which
> customers should find easy to follow.

I guess you misunderstood my question. I was not asking about dblk tracing. I was asking about what is really meant by "standard RFC* style probes." For example, the proposal does not track congestion/send/receive windows. They are well defined in the TCP standard and are very useful in telling a user what is really going on. OTOH probes like tcp::drop-saturation are not in any TCP standard.

>> In the proposal, it says that those tcp::send, ..., are placed
>> "close to IP interface." It is unclear what that means,
>> especially as args[2] is about IP header info.
>
> I was trying not to describe the provider by actually writing the code.

I guess it is better to understand the issues first before actually trying to write the code...

>> For example, a
>> TCP segment can be sent through a tunnel. So what is the IP
>> header info in args[2]?
>
> The tunnel IP. People can tunnel traffic over SSH anyway, would you have
> us add USDT probes into sshd for details there?

I don't understand this comment. We are discussing a provider for the kernel network stack. IP tunnel is a well defined concept. We are not discussing application protocol tracing. My question is about what is being provided in the args[] list. The list should be well defined. Otherwise, how can a user write scripts using them? This is exactly my suggestion above: we need to understand the issues first.

> Quoting from the website (maybe you didn't get this far?)
>
> "Why are outbound connections slow? (Connect latency, 1st byte latency,
> throughput latency)"
>
> These metrics are solvable from the proposal.

This was exactly my question (I quoted the above question in my email). I don't believe the proposal can answer the why part. It can tell a user that something is slow, but not why. Finding the latency does not tell us why the latency is there...

> How about this to start with,
>
>  1. New inbound TCP connections by port
>  2. New inbound TCP connections by source IP address
>  3. Inbound TCP packets by local port
>  4. Inbound TCP packets by remote address
>  5. Inbound TCP packets by remote address, for a specific port, eg: port 80 (HTTP)
>  6. Sent TCP bytes by local port
>  7. TCP bytes by address and port
>  8. TCP bytes by address and port, per second
>  9. Detect TCP connect() scan by IP address
> 10. Detect TCP SYN stealth scan by IP address
> 11. Detect TCP null scan by IP address
> 12. Detect TCP FIN stealth scan by IP address
> 13. Detect TCP Xmas scan by IP address
> 14. Network Utilization by TCP Port
> 15. Connect Latency
> 16. 1st Byte Latency
> 17. Throughput Latency
> 18. TCP Saturation Drops by IP address
>
> direct from the website.

It is a good idea to have the above list. The next step in defining a network provider is to understand whether it is really needed. For example, the above questions can simply be answered by existing tools like snoop. If we really have to use dtrace, I believe a set of SDT probes is sufficient. So do we need a network provider?
OTOH, if we can look beyond the above list of questions, it may be possible to abstract things out and implement a provider which is more fundamental than what snoop can provide.

-- 
K. Poon.
kacheong.poon at sun.com
On Thu, Sep 21, 2006 at 12:10:05AM +0800, Kacheong Poon wrote:

> ... If the net provider
> only has these simple probes, I am wondering if we really need a
> separate provider instead of just using sdt and put well positioned
> static probes. Are there some guidelines of when we need a provider
> versus just using sdt?

The provider Brendan is proposing will be based on SDT -- just like the many other providers such as io, proc, and sysinfo. Like those other providers it will have a different name and will have higher stability.

In general, SDT probes should be private probes that may be useful for developers, but that end users probably shouldn't rely on. We should make a new provider (based on SDT) when developing coherent, stable, documented instrumentation for a subsystem for which the target is the Solaris user -- as is the case with Brendan's proposal.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Brendan,

> If the network engineers want to trace code path for their own needs,
> then do so with fbt and add some sdt as needed.

That answers my question regarding using dtrace to analyse performance bottlenecks in networking.

> But don't assume that customers care most about seeing code path,
> or have read the 50,000+ lines that make up tcp.c and ip.c. At this point,

Sun engineers who work in the Solaris kernel are customers who would want to use the DTrace Network Provider to track the above code paths (to analyse performance bottlenecks and latencies). There will be Solaris kernel developers in the OpenSolaris community in future who would need the same.

Anyway, I am done trying to justify whether the feature I requested is important or not, and how many customers want it. You seem to have some quite strong opinions about what Solaris networking engineers need.

Thanks,
Sangeeta
peter.memishian at sun.com wrote:

> > Lets summarise where we are at. So far I've got,
> >
> > net:tcp::accept-listen
> > net:tcp::accept-established
> > net:tcp::accept-closed
> > net:tcp::accept-refused
> > net:tcp::connect-request
> > net:tcp::connect-established
> > net:tcp::connect-closed
> > net:tcp::drop-saturation
> > net:tcp::receive
> > net:tcp::send
> > net:udp::receive
> > net:udp::send
> > net:ip::receive
> > net:ip::send
> > net:gld::receive
> > net:gld::send
>
> [ Apologies if these questions were answered earlier. ]
>
> Is there a reason the last two are `gld' rather than `link'? (Or are we
> explicitly excluding probing monolithic network devices like ce?) Also,
> when would `send' fire -- when the layer has been asked to send the
> packet, or when it's actually sent the packet to the next layer?

The 2nd element in the probe tuple is the module the probe resides within. To the best of my knowledge, this cannot be arbitrarily changed. (which doesn't answer how it's gld rather than dld, but...)

-- Rich
G'Day Alan,

On Thu, 21 Sep 2006, Alan Maguire wrote:

> hi folks
>
> here's a possible scheme for TCP state machine probes (these states are based
> on the RFC 793-defined TCP state machine).
>
> net:tcp:state:closed
> net:tcp:state:listen
> net:tcp:state:syn_received
> net:tcp:state:fin_wait_1
> net:tcp:state:fin_wait_2
> net:tcp:state:closing
> net:tcp:state:established
> net:tcp:state:syn_sent
> net:tcp:state:close_wait
> net:tcp:state:time_wait
> net:tcp:state:last_ack

Great! When I see this list, I immediately know what all of the probes mean - and I could start writing scripts straight away. This is what I had in mind when I said it should be RFC based (that and the arguments).

I'm not sure we should be changing the 3rd field - DTrace generally keeps that as the kernel function name (which I believe is the case from SDT). When I get a chance I'll add these to the list (probably change underscore to dash - also following DTrace conventions).

> each will pass back the relevant conn_s and tcp_s structures. might also
> be good to pass back the previous tcp state, i.e. args are:
>
> prev_state, curr_state, conn_s, tcp_s
>
> having these probes would allow scripts to do things such as watch connection
> state transitions (obviously). e.g:
>
> dtrace -n 'net:tcp:state:* { printf("tcp state machine transition from %d to %d\n", arg0, arg1); }'
>
> and also examine connections that stay in particular states for a long time etc.
>
> i guess there's some overlap between these probes and Brendan's
> suggested probes (e.g. "established" in this scheme == "connect-established"),
> but maybe it would make sense to split up the provider namespace into
> "protocol state" and "packet data" probes, i.e.
>
> net:tcp:state:
> net:tcp:packet:

ok, a 3rd field would give us that flexibility. without it we could use a prefix, state-closed, but that may get ugly. Someone may have a better idea to avoid the 3rd field?

I do think splitting up some of the states into client and server pairs will be extremely useful (connection-established/accept-established); as it immediately lets us focus on either client or server traffic.

> with the former looking at state events, the latter at packet events (e.g.
> receive, drop)? would this be useful? (i think the common states users will
> be interested in are pretty intuitive from the RFC-derived names above, e.g.
> connection is listen'ing, established, closed). thoughts?

Yep - I think they are indeed intuitive and useful, and with several terse examples of their usage we may hit on some must-have scripts.

cheers,

Brendan
On Thu, Sep 21, 2006 at 03:31:23PM -0400, Richard Lowe wrote:

> > Is there a reason the last two are `gld' rather than `link'? (Or are we
> > explicitly excluding probing monolithic network devices like ce?) Also,
> > when would `send' fire -- when the layer has been asked to send the
> > packet, or when it's actually sent the packet to the next layer?
>
> The 2nd element in the probe tuple is the module the probe resides
> within. To the best of my knowledge, this cannot be arbitrarily
> changed. (which doesn't answer how it's gld rather than dld, but...)

That's not true. If it were, all of:

	tcp
	udp
	ip

would have to be "ip", because that's where the protocol processing code all lives now in a post-FireEngine world.

Dan
hi Brendan

> I'm not sure we should be changing the 3rd field - DTrace generally keeps
> that as the kernel function name (which I believe is the case from SDT).

ah, right. if it was possible to reuse the third field, it'd be handy for wildcarding all the state probes/all the packet-related probes, though i guess as long as the probes are all prefixed with the same thing, we could wildcard net:tcp::state_*, right?

> When I get a chance I'll add these to the list (probably change underscore
> to dash - also following DTrace conventions).

excellent!

alan

This message posted from opensolaris.org
On Fri, Sep 22, 2006 at 02:29:18AM +0800, Kacheong Poon wrote:

> ... If
> we really have to use dtrace, I believe a set of SDT probes
> is sufficient. So do we need a network provider?

The network provider is a bunch of SDT probes. SDT _is_ the underlying instrumentation provider -- it just has a different name.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Roch wrote:

> So maybe
>
> tcp::sent     : suna was advanced, args: from seq, nbytes
> tcp::resent   : suna not advanced, args: from seq, nbytes

Do you mean tcp::sent is fired when TCP receives an ACK (otherwise suna won't matter)? I think it is a little bit difficult to explain the above. In the tcp::sent case, it means that TCP receives an ACK such that it can now free up those data. But how can the probe distinguish it from a retransmission?

Or do you mean they are fired when TCP calls into IP to send some data? So it is when TCP sends some data to IP. Then suna does not matter, and tcp::resent cannot be conditioned on "suna not advanced", as doing NewReno is a resend and suna is advanced in this case. I guess tcp::resent should fire when TCP does a retransmission.

We also need something like

	tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #, window size
	tcp::rst-sent: fires when a RST is sent, args: src and dst addrs, seq number

We don't need syn or fin sent as they consume 1 seq num and can be put into the tcp::send probe.

> tcp::acked    : acked pointer was advanced. args: seq, nbytes

This is when TCP receives an ACK segment, right? It is not about sending? But why is the ACK number not in the args? I think we should also have the window size and SACK options as args.

> tcp::dupacked : acked pointer was not advanced. args: seq

I think SACK options and dupack count should also be in the args.

> I'd also like
>
> tcp::flowctrl : args on|off, waiting on a peer event to transmit.

Flow control is always on :-) I guess what you mean is when TCP's send window is 0. Maybe it should be tcp::flow-stopped or tcp::zero-win (the latter is well defined). Or do you mean something beyond the TCP protocol flow control?

> Timings from sent to acked would give easy access to round
> trip time.
>
> flowctrl would help start to answer whether or not
> performance is limited by the peer (or at least something below
> TCP).

Zero window does not tell us anything below TCP. So maybe you are referring to other things in tcp::flowctrl. What are they?

-- 
K. Poon.
kacheong.poon at sun.com
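If a flow-control pair of events did exist (the probe name, the on/off flag in arg1 and the connection id in arg0 are all hypothetical here), time spent unable to transmit could be summed per connection along these lines:

	net:tcp::flowctrl
	/arg1 != 0/
	{
		stalled[arg0] = timestamp;
	}

	net:tcp::flowctrl
	/arg1 == 0 && stalled[arg0] != 0/
	{
		@stalltime[arg0] = sum(timestamp - stalled[arg0]);
		stalled[arg0] = 0;
	}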
Alan Maguire wrote:> net:tcp:state:closed > net:tcp:state:listen > net:tcp:state:syn_received > net:tcp:state:fin_wait_1 > net:tcp:state:fin_wait_2 > net:tcp:state:closing > net:tcp:state:established > net:tcp:state:syn_sent > net:tcp:state:close_wait > net:tcp:state:time_wait > net:tcp:state:last_ackState transition is certainly something very important. But I guess you can omit closed as it just means that a tcp_t is allocated. A fbt probe on the tcp_t creation function serves the same purpose.> each will pass back have the relevant conn_s and tcp_s structures. might also > be good to pass back the previous tcp state, i.e args are: > > prev_state, curr_state, conn_s, tcp_sPlus the TCP segment (in some states) which triggers the transition.> i guess there''s some overlap between these probes and Brendan''s > suggested probes (e.g. "established" in this scheme == "connect-established"), > but maybe it would make sense to split up the provider namespace into "protocol state" and "packet data" probes, i.e.> net:tcp:state: > net:tcp:packet:If the TCP segment which triggers the transition is passed back, there is no need to have a "packet" probe. It is more confusing to the user.> with the former looking at state events, the latter at packet events (e.g. > receive, drop)? would this be useful? (i think the common states users will > be interested in are pretty intuitive from the RFC-derived names above, e.g. > connection is listen''ing, established, closed). thoughts?I think we may not need a "drop" probe in TCP, unless it means checksum error (which is actually done in the IP level in our implementation). TCP only drops a segment if it does not pass certain checks. This means that a script can easily be written using the tcp::receive probe and check on the seq number and timestamps option. -- K. Poon. kacheong.poon at sun.com
On Fri, Sep 22, 2006 at 09:30:39AM +0800, Kacheong Poon wrote:> I think we may not need a "drop" probe in TCP, unless it means > checksum error (which is actually done in the IP level in our > implementation). TCP only drops a segment if it does not > pass certain checks. This means that a script can easily be > written using the tcp::receive probe and check on the seq number > and timestamps option. >What we need there is for the whole TCP/IP stack to start using ip_drop_packet() and expand its IPsec-only usage to the wider system. ip_drop_packet FBT probes have helped us a lot in debugging not only IPsec development but also IPsec *deployment*. (The original RFE that drove ipdrop was from a disgruntled punchin user who wanted to know how he FUBARed his config. This was before Paul''s extensive punchctl mods.) Dan
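ip_drop_packet() already exists in the kernel, so its fbt probes can be used
today without waiting for a new provider. A minimal sketch that counts drops
by the code path that reached ip_drop_packet() (fbt probe points and arguments
are private and unstable, so nothing beyond stack()/count() is assumed here):

    fbt::ip_drop_packet:entry
    {
        @drops[stack()] = count();
    }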
Brendan Gregg wrote:> - Match on TCP length or other TCP flags > net:tcp::send /args[3]->tcp_length > 0/Note that TCP header does not have a length field. The length of a segment is determined using the IP header. So it is simpler to write the probe if we exclude the length of a segment.> - Add other probes just for this, > net:tcp::data-send > net:tcp::data-receive > > - Use the net:socket layer, where data is real dataOne important thing a user wants to know is how TCP does the segmentation of data. So it is not that useful to only catch data sending in the socket layer. I think tcp::sent and tcp::resent are fine. We can then add the ack-sent and rst-sent as mentioned in my previous email.> On one hand, you could say that this idea will over match activity - > tcp::send will match resends and tcp::receive duplicate packets, as we > are looking at what TCP is doing to make the data I/O happen.I think it is simpler for tcp::sent to not match retransmission and have a dedicated tcp::resent probe. -- K. Poon. kacheong.poon at sun.com
Kacheong Poon writes:> Brendan Gregg wrote: > > > - Match on TCP length or other TCP flags > > net:tcp::send /args[3]->tcp_length > 0/ > > > Note that TCP header does not have a length field. The > length of a segment is determined using the IP header. > So it is simpler to write the probe if we exclude the > length of a segment.True ... but there should be a simple way to get at the payload (skipping past the options) and compute the payload length. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Dan McDonald wrote:

> What we need there is for the whole TCP/IP stack to start using
> ip_drop_packet() and expand its IPsec-only usage to the wider system.
> ip_drop_packet FBT probes have helped us a lot in debugging not only IPsec
> development but also IPsec *deployment*. (The original RFE that drove ipdrop
> was from a disgruntled punchin user who wanted to know how he FUBARed his
> config. This was before Paul''s extensive punchctl mods.)

I thought there''s an RFE for the above, and aren''t you the RE of it ;-)

-- 

K. Poon.
kacheong.poon at sun.com
It would be extremely useful to me if some of the TCP statistics presented by
netstat -s had probes associated with them, to allow event-driven tracing
analogous to that supported by (for example) the sched provider. It''s great to
look at those stats and see them incrementing, which might give a clue to
where your problem lies, but it would be better to be able to probe on them so
you could pinpoint which connection has the issue.

For example, a problem I look at on a weekly basis is which of several hundred
TCP connections on a machine is suffering retransmission of unacknowledged
octets. Currently this involves cloning a port on the switch, canning all the
traffic with snoop, and post-processing. A net:tcp::retransBytes probe would
make this a 10 second job, particularly when given access to the relevant
tcpinfo_t.

[If my interpretation of Brendan''s stuff is correct, I could see
retransmissions anyway (by probing sends and looking for old sequence numbers
being sent again on a particular connection), but it''d be hard work.]

This message posted from opensolaris.org
> It would be extremely useful to me if some of the TCP statistics presented
> by netstat -s had probes associated with them, to allow event-driven tracing
> analogous to that supported by (for example) the sched provider. It''s great
> to look at those stats and see them incrementing, which might give a clue to
> where your problem lies, but it would be better to be able to probe on them
> so you could pinpoint which connection has the issue.

All the data presented in "MIB" form should have mib probes associated with
them already.

Casper
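For the retransmission example above, that means something along these lines
should already work, assuming the tcpRetransSegs/tcpRetransBytes counters on
the build in question (arg0 of a mib probe is the amount by which the counter
is being incremented):

    mib:::tcpRetransSegs
    {
        @segs = count();
    }

    mib:::tcpRetransBytes
    {
        @bytes = sum(arg0);
    }

What the mib probes cannot easily tell you, as noted later in the thread, is
which connection the retransmission belonged to.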
Alan Maguire wrote:

> net:tcp:state:closed
> net:tcp:state:listen
> net:tcp:state:syn_received
> net:tcp:state:fin_wait_1
> net:tcp:state:fin_wait_2
> net:tcp:state:closing
> net:tcp:state:established
> net:tcp:state:syn_sent
> net:tcp:state:close_wait
> net:tcp:state:time_wait
> net:tcp:state:last_ack
>
> each will pass back have the relevant conn_s and tcp_s structures. might also
> be good to pass back the previous tcp state, i.e args are:
>
> prev_state, curr_state, conn_s, tcp_s

Is the curr_state identical to the last element of the probe name? If so, why
is it needed as an arg? If not, how do they differ?

- Jeremy
Kacheong Poon writes:
 > Brendan Gregg wrote:
 >
 > > - Match on TCP length or other TCP flags
 > >        net:tcp::send /args[3]->tcp_length > 0/
 >
 > Note that TCP header does not have a length field. The
 > length of a segment is determined using the IP header.
 > So it is simpler to write the probe if we exclude the
 > length of a segment.
 >
 > > - Add other probes just for this,
 > >        net:tcp::data-send
 > >        net:tcp::data-receive
 > >
 > > - Use the net:socket layer, where data is real data
 >
 > One important thing a user wants to know is how TCP does
 > the segmentation of data. So it is not that useful to
 > only catch data sending in the socket layer. I think
 > tcp::sent and tcp::resent are fine. We can then add
 > the ack-sent and rst-sent as mentioned in my previous email.
 >
 > > On one hand, you could say that this idea will over match activity -
 > > tcp::send will match resends and tcp::receive duplicate packets, as we
 > > are looking at what TCP is doing to make the data I/O happen.
 >
 > I think it is simpler for tcp::sent to not match retransmission
 > and have a dedicated tcp::resent probe.

I do think that the tcp provider should steer away from a streams/packets
provider and focus on sequence numbers and length. How about this for TCP
send:

	net:tcp::sending : new data posted in the tcp xmit buffer (from above)
	net:tcp::sent    : data was sent down for a first time (suna was advanced).
	net:tcp::acked   : data from xmit buffer can be freed (send completed)

I note we already have mib probes for retransmits and dup acks.

-r
> Is the curr_state identical to the lest element of the > probe name? If so, why is it needed as an arg?good point. i guess passing it back might be useful in the case where users utilize wildcard matching on the state probe, but then again the current state is also accessible via the tcp_s.tcp_state member.>> each will pass back have the relevant conn_s and tcp_s structures. might also >> be good to pass back the previous tcp state, i.e args are: >> >> prev_state, curr_state, conn_s, tcp_s > > > Plus the TCP segment (in some states) which triggers the > transition.that would be very useful i think. so how about probe arguments: prev_state, conn_s, tcp_s, (possibly null) tcp_info_t * ? alan This message posted from opensolaris.org
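A small sketch of how those arguments might be consumed, assuming the
net:tcp:state:* naming and the argument list above (all of this is still
hypothetical, so plain arg0 is used rather than translated args[]):

    net:tcp:state:*
    {
        /* arg0 is assumed to carry the previous TCP state */
        @transitions[arg0, probename] = count();
    }

i.e. a count of transitions keyed on (previous state, new state).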
Kacheong Poon writes: > > tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #, > window size > tcp::rst-sent: fires when a RST is sent, args: src and dst > addrs, seq number > > We don''t need syn or fin sent as they consume 1 seq num and > can be put into the tcp::send probe. I use this to cover the data path: net:tcp::sending : new data posted in the tcp xmit buffer (from above) net:tcp::sent : data was sent down for a first time (suna was advanced). net:tcp::acked : data from xmit buffer can be freed (send competed) I do consider that an incoming ACK completes the TCP sending action. > > > > tcp::acked : acked pointer was advanced. args: seq, nbytes > > > This is when TCP receives an ACK segment, right? It is not > about sending? But why is ACK number not in the args? Your right. I propose this fires when new data is acked, we can use the simple ACK number or the full segment of newly ACKed data. > I think > we should also have the window size and SACK options as args. > I figure all these probes would have a connection argument from which _I hope_ we can trace window information. How about a separate probe that fires when actual SACK info is exchanged ? > > > tcp::dupacked : acked pointer was not advanced. args: seq > So I later realized we had those MIBs for dupacks and restrans. Do we need anything more ? > > > > I''d also like > > > > tcp::flowctrl : args on|off wainting on a peer event to transmit. > > > Flow control is always on :-) I guess what you mean is when > TCP''s send window is 0. Maybe it should be tcp::flow-stopped > or tcp::zero-win (the latter is well defined). Or do you > mean something beyond the TCP protocol flow control? No you''re right. I''m thinking of a probe that fires when there is data to transmits but we need to wait for something from below; as in the peer to send either newly ACKed data or a window update packet before we can proceed. As for naming: tcp::flow : args on|off; waiting on a peer event to transmit. > > > > Timings from sent to acked would give easy access to round > > trip time. > > > > flowctrl would help start to answer whether or not > > performance is limited by peer (or at least something below > > TCP). > > > Zero window does not tell us anything below TCP. So maybe > you are referring to other things in tcp::flowctrl. What > are they? I''m not sure what Zero window refers to. Does it include the congestion window ? If we sent a full windows worth of data and receive no ACK, is that a zero window ? Or does it refer to receiving a packet with 0 window. That may be the point we need to cross to FBT/SDT. But if there is data in the xmit list and we''re waiting for the peer to respond to our previous transmits, I do think of this as being flow-stopped. Is this traceable ? -r > > > -- > > K. Poon. > kacheong.poon at sun.com > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
sowmini.varadhan at Sun.COM
2006-Sep-22 13:38 UTC
[dtrace-discuss] Re: DTrace Network Provider
> > I do think that the tcp provider should stear away from a > streams/packets provider and focus on sequence numbers and > length. How about this For TCP send: > > > net:tcp::sending : new data posted in the tcp xmit buffer (from above) > net:tcp::sent : data was sent down for a first time (suna was advanced). > net:tcp::acked : data from xmit buffer can be freed (send competed) > > I note we already have mib probes for retransmits and dup acks.that would require mib provider to provide info from the tcp/ip header associated with the retransmit/dup-ack? Currently afaict, the mib provider provides very little ancillary info. e.g., for dup-ack, we''d need the seq number that is being acked plus info about current tcp window for that connection? --Sowmini
With the putback of "packet filter hooks" (which will happen in a few days),
you will get many more probes regarding IPFilter. Unfortunately, those probes
are placed around packet filter hook callout entries, not inside IPFilter, so
you will know what is there at the IP->IPFilter interface, but you can''t know
how IPFilter deals with that packet. Perhaps this could be a good RFE.

BTW, I know this since I added those probes; I will post a short howto after
the putback. Darren Reed, who is the tech lead, can give more up-to-date
information.

Also IPsec, IPMP, etc... these add-on features should be traced. The MIB
provider offers the ability to access that network statistical data; it could
be a plus to this new network provider.

Thanks,

-Alex

This message posted from opensolaris.org
It would help if D could have:

    iphdr(ipha_t *)
    tcphdr(tcph_t *)

to pretty-print headers, just like iphdr/tcphdr in mdb, which would save many
engineers'' time.

Thanks,

-Alex

This message posted from opensolaris.org
This blog gives an example of how to spy on TCP state changes:

    http://blogs.sun.com/xltang/entry/spy_the_tcp_state

Hopefully the people in this thread will like it. The author is an ERI QE
engineer, but he has since left.

Thanks,

-Alex

This message posted from opensolaris.org
Yes, the current MIB probe implementation doesn''t bring in any tcp_t, conn_t,
etc. information. You can get them, but it needs smart coding and knowledge of
the details.

-Alex

This message posted from opensolaris.org
Casper.Dik at Sun.COM wrote:
>> It would be extremely useful to me if some of the TCP statistics presented
>> by netstat -s had probes associated with them, to allow event-driven tracing
>> analogous to that supported by (for example) the sched provider. It''s great
>> to look at those stats and see them incrementing, which might give a clue to
>> where your problem lies, but it would be better to be able to probe on them
>> so you could pinpoint which connection has the issue.
>
> All the data presented in "MIB" form should have mib probes associated with
> them already.
>
> Casper

Yes, but how do you determine what packet or connection or whatever triggered
the probe? For instance, you would like to get the tcp pointer for each of the
tcp* mib probes.

-- 
blu

"When I was a boy I was told that anybody could become President; I''m
beginning to believe it." - Clarence Darrow
----------------------------------------------------------------------
Brian Utterback - OP/N1 RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
Roch wrote:> I do think that the tcp provider should stear away from a > streams/packets provider and focus on sequence numbers and > length. How about this For TCP send:There are the send (snxt) and receive (rnxt) seq numbers. So I guess you meant to have a probe fired when they are changed? But when TCP sends a new segment, snxt is advanced. And when TCP receives an in order segment, rnxt is advanced. So it seems that the above is just tcp::sent and tcp::received with the exception that if a probe is only fired when rnxt is advanced, out of order segments will not immediately trigger a probe. So I think tcp::received is better.> net:tcp::sending : new data posted in the tcp xmit buffer (from above)Or should this be the sent probe of the layer above TCP? For example, we can have sockfs::sent.> net:tcp::sent : data was sent down for a first time (suna was advanced).I suppose you meant snxt instead of suna.> net:tcp::acked : data from xmit buffer can be freed (send competed)Or maybe tcp::ack-received, similar to tcp::ack-sent?> I note we already have mib probes for retransmits and dup acks.The current mib probes are very limited, no TCP info available. I think it is better to have dedicated probes for them. -- K. Poon. kacheong.poon at sun.com
Roch wrote:> > I think > > we should also have the window size and SACK options as args. > > > > I figure all these probes would have a connection argument > from which _I hope_ we can trace window information.If the probe is fired when TCP receives an ACK, the conn_s won''t tell us anything about the advertised window in the TCP header nor about SACK. If one of the args is the complete TCP header, we can certainly get info from it. But it is easier to write the D script if the info is handy.> How about a separate probe that fires when actual SACK info > is exchanged ?Since SACK info is exchanged in ACKs, so if we have tcp::ack-sent and tcp::ack-received, I guess the SACK info can be one of the args. Otherwise, two probes are fired for all ACKs sent and received with SACK info.> No you''re right. I''m thinking of a probe that fires when > there is data to transmits but we need to wait for something > from below; as in the peer to send either newly ACKed data > or a window update packet before we can proceed. As for > naming:If it is about the peer, then I guess it is the zero window. If it is about something below TCP, I am not sure what it is. IP does not flow control TCP currently. There is probably something in Crossbow which will do this. So I guess this lower layer flow control probe can be added at that time. But if we are restricting the network provider to standard TCP behavior, this lower layer flow control probe should not be in the network provider.> I''m not sure what Zero window refers to.The peer advertises a zero window so that TCP cannot send any new data.> Does it include the congestion window ?No, it is not about congestion window.> If we sent a full windows worth of data and receive no ACK, > is that a zero window ? Or does it refer to receiving a > packet with 0 window. That may be the point we need to > cross to FBT/SDT.I meant the latter. Now thinking about it, the two proposed probes tcp::sent and tcp::ack-received should be enough to detect both situation above. I guess we don''t really need a separate probe.> But if there is data in the xmit list and we''re waiting for > the peer to respond to our previous transmits, I do think of > this as being flow-stopped. Is this traceable ?This is traceable as tcp::sent is fired but there is no subsequent corresponding tcp::ack-received fires. -- K. Poon. kacheong.poon at sun.com
hi folks

another area we should perhaps cover in the net provider is routing table
updates, e.g.

    net:ip::route-add
    net:ip::route-delete

with relevant stable information culled from the associated ipif_s (interface
structure) and ire_s (internet routing entry) as args.

this might help diagnose routing problems such as route flapping.

alan

This message posted from opensolaris.org
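A sketch of the route-flapping case, assuming the proposed probe names (the
arguments are left out since they are still being discussed):

    net:ip::route-add,
    net:ip::route-delete
    {
        @routes[probename] = count();
    }

    tick-10sec
    {
        printa(@routes);
        trunc(@routes);
    }

i.e. print and reset a count of routing table changes every ten seconds.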
>another area we should perhaps cover into the net provider >is routing table updates, e.g. > >net:ip::route-add >net:ip::route-delete > >with relevant stable information culled from the associated ipif_s >(interface structure) and ire_s (internet routing entry) as args. > >this might help diagnose routing problems such as route flapping.While this is nice to have in dtrace, the "route monitor" command already provides this information. (And fairly easily allowed me to track down the removal of IPv6 routes from tunnels in builds 47/48) Casper
Casper.Dik at Sun.COM wrote:>> another area we should perhaps cover into the net provider >> is routing table updates, e.g. >> >> net:ip::route-add >> net:ip::route-delete >> > > While this is nice to have in dtrace, the "route monitor" command already > provides this information. >Without passing judgment specifically on this one capability, in general there is a good reason for adding functionality to DTrace that may be duplicated in other commands: to bring information into a D program that can be used in combination with other information, i.e. the firing of route-add or route-delete probe gets saved and used in the predicate of another probe clause. Chip
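As a sketch of that kind of combination, again assuming the proposed probes
exist (names are placeholders):

    net:ip::route-delete
    {
        /* remember when the routing table last lost an entry */
        last_change = timestamp;
    }

    net:tcp:state:established
    /last_change && timestamp - last_change < 1000000000/
    {
        /* connections established within 1s of a route removal */
        @established_after_change = count();
    }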
Kacheong Poon writes: > Roch wrote: > > > I do think that the tcp provider should stear away from a > > streams/packets provider and focus on sequence numbers and > > length. How about this For TCP send: > > > There are the send (snxt) and receive (rnxt) seq numbers. > So I guess you meant to have a probe fired when they are > changed? But when TCP sends a new segment, snxt is advanced. > And when TCP receives an in order segment, rnxt is advanced. > So it seems that the above is just tcp::sent and tcp::received > with the exception that if a probe is only fired when rnxt > is advanced, out of order segments will not immediately trigger > a probe. So I think tcp::received is better. > > I do think that receiving new data out of order need to fire a receive probe. Since sending has 3 stages, so do receiving: tcp::sent-new : new data was sent tcp::sent-retrans : already sent data was resent tcp::sent-done : sent data is freed, acked received and we''re done. tcp::received-new : new data was received from below (includes out of order) tcp::received-dup : already received data received again tcp::received-done : sent up toward application (we''re done) > > net:tcp::sending : new data posted in the tcp xmit buffer (from above) > > > Or should this be the sent probe of the layer above TCP? For > example, we can have sockfs::sent. I think that if new data is seen by tcp then it belong in a tcp probe. If we manage a sockfs buffer outside the view of TCP then we could have similar probes to handle that case. But since tcp does handle the pending buffer I still like tcp::sending : new data posted in tcp. And I still like the flow stop probes that fires when we have data in the xmit list and TCP refrains from sending waiting for an ACK or window update. For incoming, we could have similar probes for when we advertise a zero window or open it up. tcp::out-flow : started or stopped tcp::in-flow : started or stopped And we could want a probe that fires when tcp generate a dataless packet tcp::ctrl : pure ack, sack, window update, keepalive... -r > > -- > > K. Poon. > kacheong.poon at sun.com > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
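With the sending side staged like that, a system-wide new-versus-retransmitted
segment count would fall out of a couple of lines of D (a sketch only, since
the probe names above are still proposals):

    net:tcp::sent-new,
    net:tcp::sent-retrans
    {
        @segs[probename] = count();
    }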
i''ve not followed the entire thread but the following triggers would be useful
for myself, and i''m sure many others who want to observe network server
processes:

* socket bind (listen)
* incoming tcp handshake and data
* or udp data
* outgoing tcp data, or udp data
* tcp shutdown

it would be very useful if this were observable per thread, with a master
process handing off reply tasklets to worker threads.

an example task would be to accumulate histograms of the time between udp in
and udp out, per server thread.

i have DNS and RADIUS servers in mind, initially. secondarily, observing SMTP
and IMAP servers would be useful.

tariq
Brendan Gregg - Sun Microsystems
2006-Sep-25 21:45 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Sangeeta, On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: [...]> > But don''t assume that customers care most about seeing code path, > > or have read the 50,000+ lines that make up tcp.c and ip.c. > > At this point,Sun engineers who work in Solaris kernel are customers who > would want to use DTrace Network Provider to track the above code paths > ( to do analyse performance bottlenecks and latencies) There will be > Solaris kernel developers OpenSolaris community in future who would need > the same. > > Anyway, I am done trying to justify whether the feature I requested is > important or not, and how may customers want it. You seem to have some > quite strong opinions about what Soalris networking engineers need.Ok, so you still don''t understand what is happening. When I''m talking about customers, I''m talking about Solaris users - the system administrators, developers and staff who use Solaris servers and would like easy observability tools. The world isn''t made up of kernel network engineers or customers who care about kernel code path latencies. Instead there are Solaris users who run prstat or top from time to time, and would use a similar network summary tool if we provided one. I can''t see the average customer running top in one window and a kernel code path latency tool in another. In fact, I could count on one finger how many customers I''ve met who ran lockstat to profile the kernel, similar information to what you think is so important. I''ve already said that I want to measure code path latencies too, and that I think it should be part of a provider. However, since kernel network engineers understand the kernel network code, then they can get their own statistics from fbt and some sdt, and write their own private provider for their own pet needs. Are they also "customers" of DTrace? Sure, but customers who have the skills to satisfy their own private needs. This thread is about a network provider for all Solaris users. If it also serves the needs of the kernel network engineers, then great; but not at the expense of what the greater Solaris community will find useful. Brendan -- Brendan [CA, USA]
Dan McDonald wrote:
> On Thu, Sep 21, 2006 at 03:31:23PM -0400, Richard Lowe wrote:
>>> Is there a reason the last two are `gld'' rather than `link''? (Or are we
>>> explicitly excluding probing monolithic network devices like ce?) Also,
>>> when would `send'' fire -- when the layer has been asked to send the
>>> packet, or when it''s actually sent the packet to the next layer?
>>>
>> The 2nd element in the probe tuple is the module the probe resides
>> within. To the best of my knowledge, this cannot be arbitrarily
>> changed. (which doesn''t answer how it''s gld rather dld, but...)
>
> That''s not true. If it were, all of:
>
>	tcp
>	udp
>	ip
>
> would have to be "ip", because that''s where the protocol processing code all
> lives now in a post-FireEngine world.
>

While I''m entirely willing to accept that I may be incorrect, I''d very much
like to hear how it''s intended this be done.

-- Rich
> This thread is about a network provider for all Solaris users. If it also > serves the needs of the kernel network engineers, then great; but not at > the expense of what the greater Solaris community will find useful.it would be a very good design if it could serve both roles.
Darren.Reed at Sun.COM
2006-Sep-25 23:00 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:>G''Day Sangeeta, > >On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: >[...] > >>>But don''t assume that customers care most about seeing code path, >>>or have read the 50,000+ lines that make up tcp.c and ip.c. >>> >>At this point,Sun engineers who work in Solaris kernel are customers who >>would want to use DTrace Network Provider to track the above code paths >>( to do analyse performance bottlenecks and latencies) There will be >>Solaris kernel developers OpenSolaris community in future who would need >>the same. >> >>Anyway, I am done trying to justify whether the feature I requested is >>important or not, and how may customers want it. You seem to have some >>quite strong opinions about what Soalris networking engineers need. >> > >Ok, so you still don''t understand what is happening. > >When I''m talking about customers, I''m talking about Solaris users - the >system administrators, developers and staff who use Solaris servers and >would like easy observability tools. > >The world isn''t made up of kernel network engineers or customers who care >about kernel code path latencies. Instead there are Solaris users who >run prstat or top from time to time, and would use a similar network summary >tool if we provided one. I can''t see the average customer running top in >one window and a kernel code path latency tool in another. > > >I''ve already said that I want to measure code path latencies too, and that >I think it should be part of a provider. However, since kernel network >engineers understand the kernel network code, then they can get their own >statistics from fbt and some sdt, and write their own private provider >for their own pet needs. Are they also "customers" of DTrace? Sure, but >customers who have the skills to satisfy their own private needs. >I think you''re underestimating the complexities involved (or maybe I''m underestimating what dtrace can do?) What I''d like to see dtrace be able to do is follow a packet through the kernel, not a thread of execution. fbt is for following execution. sdt is up to how it is used, but most often, it is just random hooks in the code. The difference between using fbt and being able to follow a packet is when a packet is put on a queue, the fbt path is terminated, even though the packet still has places to go and I/O ports to see. If the networking dtrace provider doesn''t deliver the capability to trace a packet through the kernel then it is not delivering a key feature that is needed by many people. Why don''t we just hack up our own dtrace provider? Because it doesn''t help in solving problems with installed systems in the field. In addition, it would take a lot of messing about to do this with fbt or sdt and that doesn''t help anyone. To say that kernel engineers understand the networking code and therefore do not need help in generating stats is not a very ingeneous statement to make and ignores the fact that there are varying levels of capability throughout the organisation, not all of whom are experts. In addition to the examples you''ve presented on the web page, I think it would be of benefit if the following questions could also be answered: - how many PPS is an application responsible for being sent out? - how many PPS is an application responsible for on the receive side? The above two questions respeated for bytes per second, not PPS...or take all of your "questions answered" and repeat for PID. 
...and I think this is the biggest problem with the dtrace networking provider as scoped so far - nothing in it relates to a process. I''m aware that this can be challenging for networking (especially on the receive side) but there are worthwhile questions here to answer from the generic customer perspective. I''d dispute that "Are hackers/crackers port scanning my server? (TCP flag matching by IP address)" is actually a question worth asking. If it is directly connected to the Internet, the answer is "yes" (but maybe not right now.) If there are any other boxes (routers, firewalls) along the way in from the Internet, then it''s not a question you should be using dtrace to find an answer for. If you''re trying to come up with a way to justify that particular dtrace probe, I''d recommend looking for a better example question. Some other probes that might be useful: - RTT calculations from TCP timestamp measurements - TCP window size changes - when a packet is dropped because it is "bad" (checksum, etc) - look through "netstat -s" output as there is quite a large number of stats there that are worthy of being identified with a probe. Darren p.s the HREF for dtrace-discuss at the bottom of http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ is wrong/broken.
Brendan Gregg - Sun Microsystems
2006-Sep-26 00:25 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
G''Day Darren, On Thu, Sep 21, 2006 at 11:12:55AM -0700, Darren.Reed at Sun.COM wrote:> Richard Lowe wrote: > > >[Cc''ing networking-discuss, I suspect followups should go to > >dtrace-discuss, however] > > > >Brendan Gregg - Sun Microsystems wrote: > > > >>G''Day Folks, > >> > >>I''ve just created a website to flesh out ideas for a DTrace Network > >>Provider. > >>Let us know what you think. > >> > >> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > >> > >>I''m not currently writing this provider; this began as a discussion with > >>Adam about what a network provider could do, which we thought would be > >>appropriate to discuss on dtrace-discuss. So, this may not actually > >>happen; > >>although it would help if a lot of people really do want it. :-) > > > > > >+1, I for one certainly want it :) > > > Agreed! :) > > Some work needs to happen on args[2] - it is completely IPv4 specific. > There is no mention of IPv6 header fields such as flow.There have been other ideas, but I was thinking that args[2] be a combination of all IPv* fields, then specific ones can be plucked from statements as follows, net:ip::send /args[2]->ip_ver == 4/ { @foo[args[2]->ip_flags] = count(); } net:ip::send /args[2]->ip_ver == 6/ { @bar[args[2]->ip_flow] = count(); } Which of course allows all IP traffic be matched, and common fields used, net:ip::send { @bytes = sum(args[2]->ip_length); }> For IPv4, it would be good if it were possible to look at IP header > options, or for IPv6, extension headers. Same again for TCP and its > options (timestamp, MSS, SACK, etc.)Yes, I agree.> e.g. What can be done with dtrace to get information about the AH > header as well as the TCP header in an IPv4 or IPv6 packet?Sure, I think AH + ESP header info should be available.> Or in TX, how would you access CIPSO information?I haven''t used this; could someone who has suggest an appropriate way?> Couldn''t we also do with some ICMP probes?Just added to the site now.> And can we mark ICMP packets in such a way with dtrace that we > can measure the time it takes from gld::receive to gld::send for > an ICMP ECHO to be received, processed and then send out again ?I''d have to check what unique ID''s were available for ICMP, but yes - that would be useful; and the same for any such reply (TCP data received to TCP ACK, ...). cheers, Brendan -- Brendan [CA, USA]
Darren.Reed at Sun.COM
2006-Sep-26 00:43 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:>G''Day Darren, > >On Thu, Sep 21, 2006 at 11:12:55AM -0700, Darren.Reed at Sun.COM wrote: > > >>Richard Lowe wrote: >> >> >> >>>[Cc''ing networking-discuss, I suspect followups should go to >>>dtrace-discuss, however] >>> >>>Brendan Gregg - Sun Microsystems wrote: >>> >>> >>> >>>>G''Day Folks, >>>> >>>>I''ve just created a website to flesh out ideas for a DTrace Network >>>>Provider. >>>>Let us know what you think. >>>> >>>> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ >>>> >>>>I''m not currently writing this provider; this began as a discussion with >>>>Adam about what a network provider could do, which we thought would be >>>>appropriate to discuss on dtrace-discuss. So, this may not actually >>>>happen; >>>>although it would help if a lot of people really do want it. :-) >>>> >>>> >>>+1, I for one certainly want it :) >>> >>> >>Agreed! :) >> >>Some work needs to happen on args[2] - it is completely IPv4 specific. >>There is no mention of IPv6 header fields such as flow. >> >> > >There have been other ideas, but I was thinking that args[2] be a >combination of all IPv* fields, then specific ones can be plucked >from statements as follows, > >net:ip::send >/args[2]->ip_ver == 4/ >{ > @foo[args[2]->ip_flags] = count(); >} > >Have you considered using names without the "ip_" ? Such as "flags", "version", "length", etc ? To those of us who work with the data structures, seeing the three letter "ip_" tag implies IPv4. There are also those who believe that "IP is IP", which is both IPv4 and IPv6... Is it possible to ship dtrace with macros? Ie. to do: struct foo { union { struct ip blah; struct ipv6 foo; } fooun; } #define ip_v fooun.blah.ip_v #define ip6_vfc fooun.foo.ip6_vfc or some such?>>And can we mark ICMP packets in such a way with dtrace that we >>can measure the time it takes from gld::receive to gld::send for >>an ICMP ECHO to be received, processed and then send out again ? >> >> > >I''d have to check what unique ID''s were available for ICMP, but yes - >that would be useful; and the same for any such reply (TCP data >received to TCP ACK, ...). > >The combination of source address, destination address, ICMP id and ICMP sequence number should be enough to uniquely identify a particular packet and its reply. Of course the source and destination address swap around on the way out too...is that likely to hamper the lookup by dtrace? Darren
Roch wrote:
> I do think that receiving new data out of order need to
> fire a receive probe. Since sending has 3 stages, so do receiving:
>
> tcp::sent-new : new data was sent
> tcp::sent-retrans : already sent data was resent
> tcp::sent-done : sent data is freed, acked received and we''re done.

I agree that it is easier to have tcp::sent-done instead of using a
tcp::ack-received to detect that it is an ACK which moves suna. But with a
little bit of D script, a user can filter that out easily from the
tcp::ack-received probe.

> tcp::received-new : new data was received from below (includes out of order)
> tcp::received-dup : already received data received again
> tcp::received-done : sent up toward application (we''re done)

The above six probes are centered around data. But how about dup ACKs, window
update, RST, ...? They are useful info too. An interesting case is the zero
window probe. Does it count as new? How about segments completely outside the
receive window?

And similar to my previous comment on tcp::sending, is tcp::received-done
equivalent to, say, sockfs::received or potentially other modules'' received
probes above TCP?

> I think that if new data is seen by tcp then it belong in a
> tcp probe. If we manage a sockfs buffer outside the view of
> TCP then we could have similar probes to handle that case.
> But since tcp does handle the pending buffer I still like
>
> tcp::sending : new data posted in tcp.

It seems that you are suggesting something like a tcp::send-queued probe. But
isn''t it the same as the fbt::tcp_wput:entry probe? And how do you see it
being used?

> And I still like the flow stop probes that fires when
> we have data in the xmit list and TCP refrains from sending
> waiting for an ACK or window update. For incoming, we could
> have similar probes for when we advertise a zero window or
> open it up.
>
> tcp::out-flow : started or stopped
> tcp::in-flow : started or stopped

Let''s focus on the sending side. I expect that this probe will usually be
fired at the same time as a normal tcp::sent probe. The reason is that there
is usually data to be sent but TCP cannot send it because of congestion
control. But we probably don''t want this to happen. So it can be quite
confusing as to exactly when this probe is fired.

> And we could want a probe that fires when tcp generate a
> dataless packet
>
> tcp::ctrl : pure ack, sack, window update, keepalive...

The first three are basically ACKs. I''d say a dedicated tcp::ack-sent is
easier to deal with.

-- 

K. Poon.
kacheong.poon at sun.com
Roch
2006-Sep-26 07:13 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
I think that streams/packet tracing makes the scope of the Network provider overwhelming (it would to me at least). So, I would like to see us make progress on the individual modules tracing and defer packet tracing as a followup. -r Darren.Reed at Sun.COM writes: > Brendan Gregg - Sun Microsystems wrote: > > >G''Day Sangeeta, > > > >On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: > >[...] > > > >>>But don''t assume that customers care most about seeing code path, > >>>or have read the 50,000+ lines that make up tcp.c and ip.c. > >>> > >>At this point,Sun engineers who work in Solaris kernel are customers who > >>would want to use DTrace Network Provider to track the above code paths > >>( to do analyse performance bottlenecks and latencies) There will be > >>Solaris kernel developers OpenSolaris community in future who would need > >>the same. > >> > >>Anyway, I am done trying to justify whether the feature I requested is > >>important or not, and how may customers want it. You seem to have some > >>quite strong opinions about what Soalris networking engineers need. > >> > > > >Ok, so you still don''t understand what is happening. > > > >When I''m talking about customers, I''m talking about Solaris users - the > >system administrators, developers and staff who use Solaris servers and > >would like easy observability tools. > > > >The world isn''t made up of kernel network engineers or customers who care > >about kernel code path latencies. Instead there are Solaris users who > >run prstat or top from time to time, and would use a similar network summary > >tool if we provided one. I can''t see the average customer running top in > >one window and a kernel code path latency tool in another. > > > > > >I''ve already said that I want to measure code path latencies too, and that > >I think it should be part of a provider. However, since kernel network > >engineers understand the kernel network code, then they can get their own > >statistics from fbt and some sdt, and write their own private provider > >for their own pet needs. Are they also "customers" of DTrace? Sure, but > >customers who have the skills to satisfy their own private needs. > > > > I think you''re underestimating the complexities involved (or maybe > I''m underestimating what dtrace can do?) > > What I''d like to see dtrace be able to do is follow a packet through > the kernel, not a thread of execution. fbt is for following execution. > sdt is up to how it is used, but most often, it is just random hooks > in the code. > > The difference between using fbt and being able to follow a packet > is when a packet is put on a queue, the fbt path is terminated, > even though the packet still has places to go and I/O ports to see. > > If the networking dtrace provider doesn''t deliver the capability > to trace a packet through the kernel then it is not delivering a > key feature that is needed by many people. > > Why don''t we just hack up our own dtrace provider? Because it > doesn''t help in solving problems with installed systems in the > field. In addition, it would take a lot of messing about to do > this with fbt or sdt and that doesn''t help anyone. > > To say that kernel engineers understand the networking code and > therefore do not need help in generating stats is not a very > ingeneous statement to make and ignores the fact that there are > varying levels of capability throughout the organisation, not > all of whom are experts. 
> > In addition to the examples you''ve presented on the web page, > I think it would be of benefit if the following questions could > also be answered: > > - how many PPS is an application responsible for being sent out? > - how many PPS is an application responsible for on the receive side? > The above two questions respeated for bytes per second, not PPS...or > take all of your "questions answered" and repeat for PID. > > ...and I think this is the biggest problem with the dtrace networking > provider as scoped so far - nothing in it relates to a process. I''m > aware that this can be challenging for networking (especially on the > receive side) but there are worthwhile questions here to answer from > the generic customer perspective. > > I''d dispute that "Are hackers/crackers port scanning my server? (TCP > flag matching by IP address)" is actually a question worth asking. > If it is directly connected to the Internet, the answer is "yes" > (but maybe not right now.) If there are any other boxes (routers, > firewalls) along the way in from the Internet, then it''s not a > question you should be using dtrace to find an answer for. If > you''re trying to come up with a way to justify that particular > dtrace probe, I''d recommend looking for a better example question. > > Some other probes that might be useful: > - RTT calculations from TCP timestamp measurements > - TCP window size changes > - when a packet is dropped because it is "bad" (checksum, etc) > - look through "netstat -s" output as there is quite a large > number of stats there that are worthy of being identified > with a probe. > > Darren > > p.s the HREF for dtrace-discuss at the bottom of > http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > is wrong/broken. > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
James Carlson
2006-Sep-26 14:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Andrew Gallatin writes:> > Darren.Reed at Sun.COM writes: > > > Some other probes that might be useful: > <....> > > - when a packet is dropped because it is "bad" (checksum, etc) > > Speaking as driver developer for 10Gb/s NICs, it would also be nice to > somehow make snoop capture only these bad packets. I realize this is > outside the scope of dtrace, but while we are wishing for things... :)Really, there''s a wealth of things that snoop and its DLPI cohort doesn''t do right or particularly well. This is just one of many. Among the others are identifying input versus output traffic, noting non-packet-related conditions (e.g., loss of carrier), retaining errored packets (runts, framing, FCS errors), and high-resolution timestamps. (Others have also mentioned a lower-overhead interface for monitoring fast networks, but I''m a little hesitant to put a performance goal in with functional issues. It could be in the list, though.) If we''re going to get into snoop, I''d rather see us cast a wider net than just adding a few minor features. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Tariq Rashid wrote:> i''ve not followed the entire thread but the following triggers would be useful for myself, and i''m sure many others who want to observe network server processes: > > * socket bind (listen)Which layer do you want info from? For example, the above can be done using the tcp::listen probe. But it will give the TCP layer info.> * incoming tcp and handshake handshakeThis can also be done using the TCP state changes probes.> * or udp data > * outgoing tcp data, or udp dataIt will work with the proposed sending side probes.> * tcp shutdownTCP state changes probes will help.> very useful would be if this was observable per thread, with a master process handing off reply tasklets to worker threads. > > an example task would be to accumulate histograms of the time between udp in and udp out, per server thread.It may not be possible. For example, incoming UDP packets are handled by the interrupt threads. At best we can map an incoming packets to the socket via conn_t. But this still does not map to a specific thread of a process. And what do you mean by "time between UDP in and out?" -- K. Poon. kacheong.poon at sun.com
Kacheong Poon
2006-Sep-26 19:28 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:> There have been other ideas, but I was thinking that args[2] be a > combination of all IPv* fields, then specific ones can be plucked > from statements as follows, > > net:ip::send > /args[2]->ip_ver == 4/ > { > @foo[args[2]->ip_flags] = count(); > }As mentioned in a previous email, I think ip::send can be confusing. Firstly, the exact location when this probe is fired needs to be defined. From the reading so far, I suspect that it is intended to be placed right before IP sends it down to the driver. If this is the case, then we probably need other "send" probes. For example, if a packet is to be encrypted, we probably want a probe before/after encryption. And how about a tunnel packet? After so much discussion, could we have an updated proposal on what probes are there now? Thanks. -- K. Poon. kacheong.poon at sun.com
Kacheong Poon writes:> > an example task would be to accumulate histograms of the time between udp in and udp out, per server thread. > > > It may not be possible. For example, incoming UDP packets > are handled by the interrupt threads. At best we can > map an incoming packets to the socket via conn_t. But this > still does not map to a specific thread of a process. And > what do you mean by "time between UDP in and out?"More to the point: there''s just no way to know at this layer which transmitted UDP packet(s), if any, might be related to which received UDP packet(s), if any. If you need to know that, then you need to be tracing _inside_ the application itself, so that you can distinguish between packet-sent-due-to-application-timer from packet-generated-in- response-to-some-input. There''s no guarantee that an application even transmits its replies in the same order in which its queries were received, let alone how many replies might go with a given query. For some really simple protocols, it may be possible to guess "most of the time" based on the packet contents, and that may be good enough for an ad-hoc test script or some application-specific case, but I doubt that''s a good basis for a stable provider. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Sep-26 22:12 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Kacheong,

On Fri, Sep 22, 2006 at 02:29:18AM +0800, Kacheong Poon wrote:
> Brendan Gregg wrote:
>
> >>It is unclear what is meant by "standard RFC* style probes."
> >>Does it mean an API style? But there is no RFC for standard
> >>API as IETF does not standardize API. Could you please
> >>elaborate that?
> >
> >Show me in the RFCs where dblks are discussed. The provider proposal uses
> >the stable terminology from the RFCs, and events from the RFCs, which
> >customers should find easy to follow.
>
> I guess you misunderstood my question. I was not
> asking about dblk tracing. I was asking about what
> is really meant by "standard RFC* style probes."
> For example, the proposal does not track congestion/
> send/receive windows. They are well defined in the
> TCP standard and are very useful in telling a user
> what is really going on. OTOH probes like
> tcp::drop-saturation is not in any TCP standard.

The RFCs provide useful terminology and values, and are a great starting
point. So when a customer sees the probe list, many of the probes use the same
RFC terminology, and should be easier to understand. The same can be said for
the provided arguments.

TCP window details are in the proposal - please check before posting.

> >>For example, a
> >>TCP segment can be sent through a tunnel. So what is the IP
> >>header info in args[2]?
> >
> >The tunnel IP. People can tunnel traffic over SSH anyway, would you have
> >us add USDT probes into sshd for details there?
>
> I don''t understand this comment. We are discussing a
> provider of the kernel network stack. IP tunnel is a well
> defined concept. We are not discussing about application
> protocol tracing. My question is about what is being provided
> in the args[] list. The list should be well defined. Otherwise,
> how can a user write scripts using them? This is exactly
> my suggestion above, we need to understand the issues first.

The provider uses the IP address from the IP header - this follows the
principle of least surprise.

Should something else be encapsulated, be it via IP tunnels or sshd tunnels or
anything else, then that data should be fetched using another means. Since IP
tunnels are well defined, as you say, then perhaps they should be
args[2]->ip_tun_daddr, etc.

As soon as we replace the IP address with the tunnel''d IP address, someone may
make a similar argument for sshd tunnels.

> >Quoting from the website (maybe you didn''t get this far?)
> >
> > "Why are outbound connections slow? (Connect latency, 1st byte latency,
> >  throughput latency)"
> >
> >These metrics are solvable from the proposal.
>
> This was exactly my question (I quoted the above question
> in my email). I don''t believe the proposal can answer the
> why part. It can tell a user that something is slow, but not
> why. Finding the latency does not tell us why the latency
> is there...

Connect latency, 1st byte latency and throughput latency are all measurable
from the proposed provider. They certainly give insights as to why a
connection is slow.

And these are quite similar to the high level latencies provided by the stable
and public io provider.

Brendan

-- 
Brendan
[CA, USA]
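For example, connect latency could be sketched along these lines, assuming
state-change probes that carry some stable per-connection handle as their
first argument (the probe names and arguments here are placeholders, not the
proposal itself):

    net:tcp:state:syn_sent
    {
        start[arg0] = timestamp;
    }

    net:tcp:state:established
    /start[arg0]/
    {
        @connect_ns = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
    }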
Brendan Gregg - Sun Microsystems
2006-Sep-26 22:48 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day,

On Fri, Sep 22, 2006 at 09:19:40AM +0800, Kacheong Poon wrote:
> Roch wrote:
>
> >So maybe
> >
> >tcp::sent : suna was advanced, args: from seq, nbytes
> >tcp::resent : suna not advanced, args: from seq, nbytes

I think these are useful as data-send, data-resend, etc. This gives us
send/receive for all TCP traffic regardless; and then higher level probes for
convenience, such as data-send, connect-established, etc.

> Do you mean tcp::sent is fired when TCP receives an ACK
> (otherwise suna won''t matter)? I think it is a little bit
> difficult to explain the above. In the tcp::sent case, it
> means that TCP receives an ACK such that it can now free
> up those data. But how can the probe distinguish it from
> a retransmission?

Yes, I agree - it may be a little bit difficult. I''ve focused on using "send"
and not "sent" to help with this. ie, "data-send" means that tcp is sending
data, and is a simple SDT probe. "data-sent" would mean that tcp has sent the
data and received the ACK. This can probably still be done, (and if so,
perhaps this should be done) but to begin with - "send" will be both easier
for us to write and users to understand.

> Or do you mean they are fired when TCP calls into IP to send
> some data? So it is when TCP sends some data to IP.

Without writing the code - my thoughts are that this will be both the easiest
and the most appropriate place to probe. It''s when TCP is handing it to the
next layer down (yes, I know it''s part of ip). Which is like the DTrace io
provider - which shows us when the block I/O driver hands the requests to the
next layer down (disk driver).

> Then
> suna does not matter and tcp::resent cannot be conditioned
> on "suna not advanced" as doing NewReno is a resent and suna
> is advanced in this case. I guess tcp::resent should fire
> when TCP does a retransmission. We also need something like
>
> tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #,
>         window size
> tcp::rst-sent: fires when a RST is sent, args: src and dst
>         addrs, seq number
>
> We don''t need syn or fin sent as they consume 1 seq num and
> can be put into the tcp::send probe.
>
> >tcp::acked : acked pointer was advanced. args: seq, nbytes
>
> This is when TCP receives an ACK segment, right? It is not
> about sending? But why is ACK number not in the args? I think
> we should also have the window size and SACK options as args.
>
> >tcp::dupacked : acked pointer was not advanced. args: seq
>
> I think SACK options and dupack count should also be in the
> args.

Do we need separate ack probes? Maybe - I don''t have a strong opinion on this,
with tcp::send/receive matching everything - we have at least one way to
measure them.

Maybe the best way to suggest what the ack probes should be, is to write a
script that uses them. Writing the DTrace examples on the website helped me
fine tune how things worked best.

> >Timings from sent to acked would give easy access to round
> >trip time.

Yes - this script would be the best example of using acks, and is the only
script I didn''t write on the website (it should be under throughput latency -
although the term "round trip time" is probably more appropriate). I began
wondering how best to match acks, then left writing that script for another
time.

cheers,

Brendan

-- 
Brendan
[CA, USA]
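A rough sketch of that ack-matching idea, assuming send and ack probes whose
first argument is the starting sequence number (an entirely hypothetical
argument layout, ignoring retransmissions and the fact that an ACK covers
sequence plus length, which a real script would have to handle):

    net:tcp::data-send
    {
        sent[arg0] = timestamp;
    }

    net:tcp::ack-received
    /sent[arg0]/
    {
        @rtt_ns = quantize(timestamp - sent[arg0]);
        sent[arg0] = 0;
    }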
Brendan Gregg - Sun Microsystems wrote:> TCP window details are in the proposal - please check before posting.There are multiple "windows" in TCP protocol. I don''t think the current proposal actually tracks them.> The provider uses the IP address from the IP header - this follows > principal of least suprise.The above is OK as long as it is very clear that the probe is placed right before IP sends it down to the driver. So the user knows that what the probe args will be. I mentioned this in a previous email.> Should something else be encapsulated, be it via IP tunnels or sshd > tunnels or anything else, then that data should be fetched using > another means. Since IP tunnels well defined as you say, then perhaps > they should be args[2]->ip_tun_daddr, etc.Or we can have multiple probes for different purposes, as suggested in my previous email.> As soon as we replace the IP address with the tunnel''d IP address, > someone may make a similar argument for sshd tunnels.Well, as long as we are talking about a provider for the kernel network stack, I don''t believe the above request will come. It is simply the wrong request. People may ask to put probes at sshd, but it has nothing to do with the kernel network stack. I hope this part is very clear.> Connect latency, 1st byte latency and throughput latency are all > measurable from the proposed provider. They certainly give insights > as to why a connection is slow.The above just tells us that there is a latency. I am not sure how they can tell us why there is a latency. We need more information. As suggested by many people in this discussion thread, we can have more probes which tie to the TCP protocol processing.> And these are quite similar to the high level latencies provided by > the stable and public io provider.Well, we are talking about a network provider. I don''t see why we need to be limited by designs of other providers. Is there a reason? -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems
2006-Sep-26 23:13 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G'Day Kacheong,

On Fri, Sep 22, 2006 at 09:30:39AM +0800, Kacheong Poon wrote:

> Alan Maguire wrote:
>
> > net:tcp:state:closed
> > net:tcp:state:listen
> > net:tcp:state:syn_received
> > net:tcp:state:fin_wait_1
> > net:tcp:state:fin_wait_2
> > net:tcp:state:closing
> > net:tcp:state:established
> > net:tcp:state:syn_sent
> > net:tcp:state:close_wait
> > net:tcp:state:time_wait
> > net:tcp:state:last_ack
>
> State transition is certainly something very important. But
> I guess you can omit closed as it just means that a tcp_t
> is allocated. An fbt probe on the tcp_t creation function
> serves the same purpose.

closed is part of the diagram on p23 of RFC 793, so it is something people
may go looking for.

> > i guess there's some overlap between these probes and Brendan's
> > suggested probes (e.g. "established" in this scheme ==
> > "connect-established"), but maybe it would make sense to split up the
> > provider namespace into "protocol state" and "packet data" probes, i.e.
> >
> > net:tcp:state:
> > net:tcp:packet:

Yes, I've made that differentiation by using a "state-" prefix. I'm also a
little cautious about using the word "packet" at the transport level, as
IP may fragment this further (although I've never really noticed Solaris
doing this for TCP - I'm sure there is a reason in the code somewhere).
Although it's been hard to avoid saying "packet" in some of my examples.

> If the TCP segment which triggers the transition is passed back,
> there is no need to have a "packet" probe. It is more confusing
> to the user.

Yep. tcp::state-* events are about state changes, tcp::send/receive are
about packets (as far as TCP knows, anyway).

> > with the former looking at state events, the latter at packet events
> > (e.g. receive, drop)? would this be useful? (i think the common states
> > users will be interested in are pretty intuitive from the RFC-derived
> > names above, e.g. connection is listen'ing, established, closed).
> > thoughts?
>
> I think we may not need a "drop" probe in TCP, unless it means
> checksum error (which is actually done at the IP level in our
> implementation). TCP only drops a segment if it does not
> pass certain checks. This means that a script can easily be
> written using the tcp::receive probe and check on the seq number
> and timestamps option.

While drops will be more interesting at the NIC driver level, it does look
like TCP can drop SYNs when saturated.

cheers,

Brendan

--
Brendan
[CA, USA]
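To make the state/packet split concrete, a state-transition breakdown
would be a one-clause script against the proposed naming. This is a sketch
only - the net provider and the "state-" probe names are assumptions from
this thread, not an existing interface:

    net:tcp::state-*
    {
            @transitions[probename] = count();
    }

Run until ^C; the aggregation then prints a count per state entered.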
Brendan Gregg - Sun Microsystems wrote:

> Do we need separate ack probes? Maybe - I don't have a strong opinion
> on this, with tcp::send/receive matching everything - we have at least
> one way to measure them.

Yes, it is more for the convenience of writing D scripts.

> Maybe the best way to suggest what the ack probes should be, is to
> write a script that uses them. Writing the DTrace examples on the
> website helped me fine tune how things worked best.

Using a convention suggested by Darren, we can do something like the
following to check for, say, an ACK splitting attack by peers to gain an
unfair advantage:

net:ipv4:tcp:established
/args[1]->tcp_lport == 8080/
{
	tp = args[0];	/* assume args[0] is tcp_t */
}

net:ipv4:tcp:ack-received
/args[0] == tp/
{
	@bytes_acked[inet_ntop(IPV4, args[1]->tcp_faddr)] =
	    quantize(args[1]->tcp_ack - args[0]->tcp_suna);
}

> Yes - this script would be the best example of using acks, and is
> the only script I didn't write on the website (it should be under
> throughput latency - although the term "round trip time" is probably
> more appropriate). I began wondering how best to match acks, then
> left writing that script for another time.

Since TCP also keeps track of the RTT, I guess an easier way of finding
this is to just take a look at that value.

--
K. Poon.
kacheong.poon at sun.com
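If the provider exposed TCP's own RTT estimate as a probe argument, that
last point would reduce to a one-liner. This is purely a sketch: the
ack-received probe follows the naming used in the script above, and the
tcp_rtt_smoothed member is a hypothetical name invented here for
illustration.

    net:ipv4:tcp:ack-received
    {
            /* assumed member: TCP's smoothed RTT estimate */
            @rtt[inet_ntop(IPV4, args[1]->tcp_faddr)] =
                quantize(args[1]->tcp_rtt_smoothed);
    }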
Brendan Gregg - Sun Microsystems wrote:> I''m also a little cautious to use the word packet at the transport > level, as IP may fragment this further (although I''ve never really > noticed Solaris doing this for TCP - I''m sure there is a reason > in the code somewhere). Although It''s been hard to avoid saying > "packet" in some of my examples.Yes, we should not use "packet" in the transport level.> While drops will be more interesting at the NIC driver level, it > does look like TCP can drop SYNs when saturated.I''d suggest we use net:ipv*:tcp:listen-drop, or more specific tcp:listen-drop-q0 and tcp:listen-drop-q for the above. This is implementation specific as there is no such drop in the standard. So I think we can use implementation specific names. -- K. Poon. kacheong.poon at sun.com
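If the implementation-specific listen drop probes suggested above were
added, counting them by listening port would be a natural first script. A
sketch only: the probe names come from the suggestion above, and the args
layout (args[1] as the TCP info with tcp_lport, following the earlier
example in this thread) is an assumption.

    net:ipv*:tcp:listen-drop-q0,
    net:ipv*:tcp:listen-drop-q
    {
            @drops[probename, args[1]->tcp_lport] = count();
    }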
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:38 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G''Day Alex, On Fri, Sep 22, 2006 at 07:24:36AM -0700, Alex Peng wrote:> If D could have: > iphdr(ipha_t *) > tcphdr(tcph_t *) > > to just iphdr/tcphdr as in mdb, which will save many engineer''s time.Yes - the proposal is to have ipha_t and tcphdr available as probe args, which is appropriate as they are RFC based. Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:46 UTC
[dtrace-discuss] Re: DTrace Network Provider
G'Day Kacheong,

On Sat, Sep 23, 2006 at 07:57:13AM +0800, Kacheong Poon wrote:

> Roch wrote:
>
> > net:tcp::sending : new data posted in the tcp xmit buffer (from above)
>
> Or should this be the sent probe of the layer above TCP? For
> example, we can have sockfs::sent.

Sure; an ideal outcome would be to cache the thread info from the sockfs
layer, to assist in tracing data as it's sent down the layers from
application to ethernet. When we begin coding we'll find out how doable
this is.

> > net:tcp::acked : data from xmit buffer can be freed (send completed)
>
> Or maybe tcp::ack-received, similar to tcp::ack-sent?
>
> > I note we already have mib probes for retransmits and dup acks.
>
> The current mib probes are very limited, no TCP info available.
> I think it is better to have dedicated probes for them.

Yes. The mib probes have been handy as signposts in the kernel code for
interesting points. I'm glad they are there. I have used them a little in
the DTraceToolkit, but keep wanting further info (conn_s or tcp_s).

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:59 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Sat, Sep 23, 2006 at 07:57:13AM +0800, Kacheong Poon wrote:

> Roch wrote:
>
> > net:tcp::acked : data from xmit buffer can be freed (send completed)
>
> Or maybe tcp::ack-received, similar to tcp::ack-sent?

Oh - and I've just added ack-receive and ack-send to the website.

Yes, I'm fiddling the letters a little - if data-sent fires after the ACK
has been returned, so that we know it was sent ok, then ack-sent would
mean something stranger. The rules would be:

    data-send  means I'm sending the data.
    ack-send   means I'm sending the ack.
    data-sent  means the data was sent ok.
    ack-sent   means the ack was sent ok.

Which leaves the door ajar to add equivalent "sent" and "received" probes
if they prove more appropriate down the road.

Brendan
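To show how those naming rules would read in a script, here is a sketch of
measuring the gap between handing data to IP (data-send) and knowing it
arrived (data-sent). Everything here is assumed: the probes are only
proposed, args[0] is treated as a unique connection ID per the earlier
discussion, and keying on the connection alone means overlapping sends on
one connection are conflated (last send wins).

    net:tcp::data-send
    {
            send_ts[args[0]] = timestamp;
    }

    net:tcp::data-sent
    /send_ts[args[0]] != 0/
    {
            @completion["data-send to data-sent (ns)"] =
                quantize(timestamp - send_ts[args[0]]);
            send_ts[args[0]] = 0;
    }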
Brendan Gregg - Sun Microsystems
2006-Sep-27 04:29 UTC
[dtrace-discuss] Re: DTrace Network Provider
G'Day Roch,

On Mon, Sep 25, 2006 at 03:49:13PM +0200, Roch wrote:

> Kacheong Poon writes:
>
> I do think that receiving new data out of order needs to
> fire a receive probe. Since sending has 3 stages, so does receiving:
>
> tcp::sent-new     : new data was sent
> tcp::sent-retrans : already sent data was resent
> tcp::sent-done    : sent data is freed, ack received and we're done.
>
> tcp::received-new  : new data was received from below (includes out of order)
> tcp::received-dup  : already received data received again
> tcp::received-done : sent up toward the application (we're done)

To watch these events for a "packet" sounds quite interesting - but SACKs
could make it confusing - we wouldn't get a sent-done for every sent-new.

Maybe I'd adjust them to be:

    tcp::data-sent-start
    tcp::data-sent-retrans
    tcp::data-sent-done

Since they are about moving data; and "start" is already an accepted
DTrace term (from the io provider :).

Brendan

--
Brendan
[CA, USA]
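A quick sketch of why the three-stage split is useful: the ratio of
retransmitted starts to new starts, per destination, is a one-aggregation
script. The probe names follow the adjusted list above and do not exist
yet; the args[2]->ip_daddr member and the inet_ntos() conversion are
likewise assumptions borrowed from the website examples.

    net:tcp::data-sent-start,
    net:tcp::data-sent-retrans
    {
            @sends[inet_ntos(args[2]->ip_daddr), probename] = count();
    }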
Brendan Gregg - Sun Microsystems writes:> > State transition is certainly something very important. But > > I guess you can omit closed as it just means that a tcp_t > > is allocated. A fbt probe on the tcp_t creation function > > serves the same purpose. > > closed is part of the diagram on p23 of RFC 793; so it is something > people may go looking for.I suspect those might be confused people. RFC 793 page 21: TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional because it represents the state when there is no TCB, and therefore, no connection. Briefly the meanings of the states are: page 22: CLOSED - represents no connection state at all. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
James Carlson writes: > Brendan Gregg - Sun Microsystems writes: > > > State transition is certainly something very important. But > > > I guess you can omit closed as it just means that a tcp_t > > > is allocated. A fbt probe on the tcp_t creation function > > > serves the same purpose. > > > > closed is part of the diagram on p23 of RFC 793; so it is something > > people may go looking for. > > I suspect those might be confused people. > > RFC 793 page 21: > > TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional > because it represents the state when there is no TCB, and therefore, > no connection. Briefly the meanings of the states are: > > page 22: > > CLOSED - represents no connection state at all. > Ok, but there is an event which makes a connection transition it into this fictional state. I think this event can match a Dtrace probe, we just need to put that at the end of the list. -r > -- > James Carlson, KISS Network <james.d.carlson at sun.com> > Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 > MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677 > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Terri Molini
2006-Sep-27 14:45 UTC
[dtrace-discuss] WSJ article: Some ''Breakthroughs'' Deserve That Title -- But Definitely Not All
Thought you would enjoy the following WSJ artcle. Best, terri _________________________________________________________ *Some ''Breakthroughs'' Deserve That Title -- But Definitely Not All * The Wall Street Journal, Lee Gomes; September 27, 2006 All companies, but especially those in technology, like few things better than to talk about their "breakthroughs," those great leaps forward that make products out of the formerly impossible. A search by Factiva Consulting Services found that more than 8,600 press releases have been issued over the years with "breakthrough" in the headline, a majority of them by computer and electronics companies. IBM, incidentally, is the year''s breakthrough champion, at least in terms of how often it has headlined the word in its press releases. The company has had eight headline-worthy breakthroughs since January in customer privacy, energy efficiency, speedy billing software, storage, mainframes (twice), computer networking and voice recognition. But to what extent are we experiencing "breakthrough inflation," in which the work that an engineer would consider simply a good day in the lab becomes, in the hands of the PR department, an advance worthy of being shouted about from the rooftops? A breakthrough-themed announcement from one of the biggest Silicon Valley companies shows some of the issues involved in evaluating whether a particular technical development merits "breakthrough" status. It was issued this month by Intel, and it involves optical computing -- that is, using light rather than electrons to transmit digital data. Optical communications are already used throughout the fiber strands of our phone networks, but the electronic devices that make them possible are much too expensive for the insides of a desktop PC, or even for the high-end servers used by businesses. Intel has been working to change that, and it has announced a number of optically oriented breakthroughs in recent years. The theme of all of them has been finding ways to use ordinary low-cost silicon chips to manipulate the laser light used in optical computing. Intel''s latest breakthrough involves, of all things, glue -- specifically, finding a way to attach a layer of indium phosphide, a material long used to generate photons in laser equipment, to the top of a piece of silicon. That step had eluded researchers for years. With it accomplished, you get the breakthrough of a "laser on a chip." Intel says the arrival of such a laser chip will hasten the arrival of the day when, for instance, the different parts of your computer will communicate via laser fibers instead of copper wires. (Light travels farther and generates less "noise" than electrons running over copper.) It will also allow, says Intel, different parts of a computer -- the microprocessor and memory, for instance -- to be much farther apart than they are today, including on the other side of the room from each other. That would help diffuse the heat generated by the rows and rows of servers now found in the computer rooms of even medium-sized businesses. In turn, that could result in lower energy costs. Nearly every big company these days says that the biggest problem its IT shop has is getting enough power to run all the computers the company is buying. Having been inside Intel''s laser labs, I need no persuading that the company is doing important work here, and an Intel spokesman says the development is indeed a "breakthrough" because it shows how real-world optical products can be made with silicon. 
I wonder, though, how many more breakthroughs we will be reading about before optical computing becomes ubiquitous. An Intel spokesman says the laser chip is indeed a "breakthrough" because it shows how real-world optical products can be made with silicon. But being a "breakthrough" implies that you don''t need a lot more like it to get to where you need to go. Intel is the first to admit that many issues remain to be solved before that day is at hand in this case -- including, for instance, getting the hybrid indium-phosphide and silicon chips to produce light at the high temperatures that are normal in computer parts. If every few months produces a breakthrough, we might want to change what we are calling it to something less arresting -- perhaps "research advance." Like many other companies, Intel may also have a penchant for overclaiming even with well-grounded technical accomplishments. For example, the company said its recent work will hasten the arrival of ultrafast fiber Web connections to the home because silicon-based laser components will reduce the cost of optical networks. True enough, but the real expense of fiber to the home is digging up sidewalks and front yards. Unless Intel is also working on silicon-based shoveling technology, that cost is not likely to come down any time soon. There is another category of "breakthrough" in which the word becomes a synonym for a "supremely useful product." One example is Dtrace, a piece of software that is like a computer debugger on steroids. Dtrace was developed by a team of engineers headed by Bryan Cantrill at Sun Microsystems, and earlier this month, it won first place in The Wall Street Journal''s Technology Innovation Awards. It enables engineers to diagnose a computer and say how well it''s running and what might be causing a program to hiccup -- all without interfering with the machine''s operations. If Dtrace won''t change computing, it''s at least available today. Not only is it currently shipping from Sun, it also has been licensed by Apple Computer for new versions of the Macintosh. Apple and Sun together: Now there''s a breakthrough. ### -- The language of truth is unadorned and always simple. Marcellinus Ammianus Terri Molini Sun Microsystems, Inc. Global Communications 408/404-4976 office; x6-9968 408/404-4976 fax 408/406-9021 mobile IM: tmolini -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20060927/c15021d4/attachment.html>
Brendan Gregg - Sun Microsystems
2006-Sep-27 16:39 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G'Day James,

On Wed, Sep 27, 2006 at 09:18:27AM -0400, James Carlson wrote:

> Brendan Gregg - Sun Microsystems writes:
> > > State transition is certainly something very important. But
> > > I guess you can omit closed as it just means that a tcp_t
> > > is allocated. An fbt probe on the tcp_t creation function
> > > serves the same purpose.
> >
> > closed is part of the diagram on p23 of RFC 793; so it is something
> > people may go looking for.
>
> I suspect those might be confused people.
>
> RFC 793 page 21:
>
>    TIME-WAIT, and the fictional state CLOSED.  CLOSED is fictional
>    because it represents the state when there is no TCB, and therefore,
>    no connection.  Briefly the meanings of the states are:
>
> page 22:
>
>    CLOSED - represents no connection state at all.

and page 39,

      TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

  3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

  4.                                                       (Close)
      TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

  5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

  6.  (2 MSL)
      CLOSED

Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK to
CLOSED isn't. And in the source code, uts/common/inet/tcp/tcp.c:

/*
 * The tcp_t is going away. Remove it from all lists and set it
 * to TCPS_CLOSED. The freeing up of memory is deferred until
 * tcp_inactive. This is needed since a thread in tcp_rput might have
 * done a CONN_INC_REF on this structure before it was removed from the
 * hashes.
 */
static void
tcp_closei_local(tcp_t *tcp)
{
[...]
	tcp->tcp_state = TCPS_CLOSED;

CLOSED is an event that happens. Sure, it may not be a state as such that
a TCP connection is in; however, the probe net:tcp::state-closed is
showing a state transition (but not that the connection then remains
CLOSED).

Brendan

--
Brendan
[CA, USA]
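One reason for firing a probe on the transition into CLOSED, even though
the RFC calls the state fictional: it gives scripts a natural
end-of-connection event. A sketch under the proposal's naming - none of
these probes exist yet, and args[0] as a unique connection ID is an
assumption from this thread:

    net:tcp::state-established
    {
            born[args[0]] = timestamp;
    }

    net:tcp::state-closed
    /born[args[0]] != 0/
    {
            @lifetime["connection lifetime (ns)"] =
                quantize(timestamp - born[args[0]]);
            born[args[0]] = 0;
    }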
Brendan Gregg - Sun Microsystems
2006-Sep-27 16:41 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G''Day Roch, On Wed, Sep 27, 2006 at 04:13:54PM +0200, Roch wrote:> James Carlson writes: > > Brendan Gregg - Sun Microsystems writes: > > > > State transition is certainly something very important. But > > > > I guess you can omit closed as it just means that a tcp_t > > > > is allocated. A fbt probe on the tcp_t creation function > > > > serves the same purpose. > > > > > > closed is part of the diagram on p23 of RFC 793; so it is something > > > people may go looking for. > > > > I suspect those might be confused people. > > > > RFC 793 page 21: > > > > TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional > > because it represents the state when there is no TCB, and therefore, > > no connection. Briefly the meanings of the states are: > > > > page 22: > > > > CLOSED - represents no connection state at all. > > > > Ok, but there is an event which makes a connection > transition it into this fictional state. I think this event > can match a Dtrace probe, we just need to put that at the > end of the list.Yep! I should have checked your message before posting. :) Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK > to CLOSED isn''t.Agreed. It depends on what the user is trying to track. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Sep-27 19:11 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
G''Day Darren, On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote:> Brendan Gregg - Sun Microsystems wrote:[...]> >I''ve already said that I want to measure code path latencies too, and that > >I think it should be part of a provider. However, since kernel network > >engineers understand the kernel network code, then they can get their own > >statistics from fbt and some sdt, and write their own private provider > >for their own pet needs. Are they also "customers" of DTrace? Sure, but > >customers who have the skills to satisfy their own private needs. > > > > I think you''re underestimating the complexities involved (or maybe > I''m underestimating what dtrace can do?)It''s complex, sure, but it will make a big difference where that complexity is. For the kernel to do all the work and dump associated events for code path latency, that is probably going to be a complex enhancement. The enabled probe effect may become a unwanted factor in the timing (we''d have to test this), and as it would be coded into the kernel, it would be set in stone - whether that is appropriate for one script or another. However, if the kernel were to dump some rather raw details, we could try to move much of the complexity to a DTrace translater (such as /usr/lib/dtrace/io.d), which can be tweaked as needed; and/or do postprocessing of the details in a userland program - even Perl - to reduce the inline enabled probe effect. Each script could approach the raw data in a different way as appropriate.> What I''d like to see dtrace be able to do is follow a packet through > the kernel, not a thread of execution. fbt is for following execution. > sdt is up to how it is used, but most often, it is just random hooks > in the code. > > The difference between using fbt and being able to follow a packet > is when a packet is put on a queue, the fbt path is terminated, > even though the packet still has places to go and I/O ports to see. > > If the networking dtrace provider doesn''t deliver the capability > to trace a packet through the kernel then it is not delivering a > key feature that is needed by many people.Noone is saying that the provider will never do this. If you are saying that this feature is important to you, then sure - I understand. And if many people means many engineers (which in turn helps them achieve cool things), then I understand that too.> In addition to the examples you''ve presented on the web page, > I think it would be of benefit if the following questions could > also be answered: > > - how many PPS is an application responsible for being sent out? > - how many PPS is an application responsible for on the receive side? > The above two questions respeated for bytes per second, not PPS...or > take all of your "questions answered" and repeat for PID. > > ...and I think this is the biggest problem with the dtrace networking > provider as scoped so far - nothing in it relates to a process. I''m > aware that this can be challenging for networking (especially on the > receive side) but there are worthwhile questions here to answer from > the generic customer perspective. >I completely agree, and this was the first networking issue I wanted DTrace to solve - which led me to write tcpsnoop, tcptop, etc, which appear in the DTraceToolkit. It''s hard - I haven''t thought how best to add it to the provider - but I do want it to be there. My tcpsnoop/tcptop scripts are helpful in the meantime, but they break between Solaris versions (as they are entirerly fbt based). 
One day I hope to write stable versions of tcpsnoop/tcptop.> I''d dispute that "Are hackers/crackers port scanning my server? (TCP > flag matching by IP address)" is actually a question worth asking. > If it is directly connected to the Internet, the answer is "yes" > (but maybe not right now.) If there are any other boxes (routers, > firewalls) along the way in from the Internet, then it''s not a > question you should be using dtrace to find an answer for. If > you''re trying to come up with a way to justify that particular > dtrace probe, I''d recommend looking for a better example question.Ok, - I do think a sensible final solution is that customers run a NIDS (such as snort) somewhere on the network, or use tools to examine their firewall logs. To have a single DTrace script check for all types of scans and give a report will still have some value. A customer finds unexplained network utilization, and can run a quick script to check what it is - rather than install a NIDS if one isn''t available. I''m sure there are better example questions with example DTrace solutions that I can add to this site - I hope some people will start posting suggestions. :)> Some other probes that might be useful: > - RTT calculations from TCP timestamp measurements > - TCP window size changesThese could be scripted from tcp::send and tcp::receive alone; but seperate probes may make sense.> - when a packet is dropped because it is "bad" (checksum, etc)Yep - we need a tcp::drop-checksum or some such.> - look through "netstat -s" output as there is quite a large > number of stats there that are worthy of being identified > with a probe.Sure. So long as we really do want to probe it, and that the kstat value isn''t sufficient. Thinking of sample scripts that use them should help here.> p.s the HREF for dtrace-discuss at the bottom of > http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > is wrong/broken.thanks, Brendan -- Brendan [CA, USA]
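As a strawman for the send side only: if the transport send probe happened
to fire in the context of the sending thread (a big if - the receive side
and any deferred processing certainly won't have a useful execname, which
is exactly why tcpsnoop/tcptop are hard), per-process packet rates would
reduce to a tiny script. The probe name follows the proposal; everything
else here is an assumption.

    net:tcp::send
    {
            @pps[pid, execname] = count();
    }

    profile:::tick-1sec
    {
            printa("%6d %-16s %@8d\n", @pps);
            trunc(@pps);
    }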
Brendan Gregg - Sun Microsystems wrote:> The rules would be, > > data-send means I''m sending the data. ack-send means I''m sending the ack. > data-sent means the data was sent ok. ack-sent means the ack was sent ok.The problem is that we can never know if an ACK is sent OK.> Which leaves the door ajar to add equivalent "sent" and "received" probes > if they prove more appropriate down the road.I think the first rule above is OK, if sending means giving it to IP. But I guess Roch''s point is that it is good to have a probe fired when data is freed as it is ack''ed by the peer. This can save some code in writing a D script which only has tcp:ack-received. -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems wrote:> Maybe I''d adjust them to be, > > tcp::data-sent-startSo it fires when new data is sent down to IP?> tcp::data-sent-retrans > tcp::data-sent-doneDoes it mean that data is ack''ed and freed? If it does, I''d suggest we just use data-acked. "Done" sounds like the sending action is finished, so it fires when the thread calling the IP send function returns. We probably don''t need that.> Since they are about moving data; and "start" is already an accepted > DTrace term (from the io provider :). > > Brendan >-- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems wrote:> Oh - and I''ve just added ack-receive and ack-send to the website.I just took a look at the updated page. Since TCP is probably the most developed part, I guess we should have a more detailed explanations on it first. We can work on other parts later. For example, we can spell out the exact arguments of all the proposed probes. The args[0] should probably be tcp_t instead of conn_s (one can get conn_s from tcp_t) for all the probes. Do we still need those TCP "accept-*" and "connect-*" probes? They are almost the same as the state probes. I''d suggest we remove those "socket API related" probes. OTOH, if we have a sockfs module, then I think we should have those socket related probes in that module. We have talked about IPv4 and IPv6 and people have proposed some ways to work with them. I''d suggest to update the page with the new proposal. -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems
2006-Sep-28 05:15 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Kacheong, Darren, On Thu, Sep 28, 2006 at 10:02:09AM +0800, Kacheong Poon wrote:> Brendan Gregg - Sun Microsystems wrote: > > >Oh - and I''ve just added ack-receive and ack-send to the website. > > > I just took a look at the updated page. Since TCP is > probably the most developed part, I guess we should > have a more detailed explanations on it first. We > can work on other parts later. For example, we can > spell out the exact arguments of all the proposed > probes. The args[0] should probably be tcp_t instead > of conn_s (one can get conn_s from tcp_t) for all the > probes.As both conn_s and tcp_t are private implementation info, we don''t want either casted as args[0] (thanks to eschrock for correcting me on that); Instead I have it there as a uintptr_t (just changed it to uint64_t), so that the pointer value can be used as a unique session ID. Is the tcp_t valid all the way from the first SYN to the last packet? I''d have to check; if there was a scenario where conn_s was valid while tcp_t was NULL, say during a close function, then conn_s would be better. conn_s can also be used for UDP, so customers will be learning one unique ID value for both. However if the conn_s gets reused for the next TCP session (or even UDP session) - then that would make tcp_t a better choice. I''m going to have to check through the code to answer these, unless someone can already shed light on them. I believe one of them is suitable - either conn_s or tcp_t, it''s just a matter of checking their lifecycles.> Do we still need those TCP "accept-*" and "connect-*" > probes? They are almost the same as the state probes. > I''d suggest we remove those "socket API related" probes. > OTOH, if we have a sockfs module, then I think we should > have those socket related probes in that module.Yes, the state probes have just state-established, while accept-established and connect-established make it easy to differentiate between inbound and outbound connections. They are stable concepts and are also quite practical - who is connecting to my webserver? This can be answered easily with accept-established; however if we had just state-established then we''d need to check the direction as well - which isn''t available as a flag, and so we''d need to cache the previous syn-sent/syn-received state in a global variable keyed on the conn_s. That''s going to make many of the one-liners go from being elegant to clumsy.> We have talked about IPv4 and IPv6 and people have > proposed some ways to work with them. I''d suggest to > update the page with the new proposal.Yes - I''d like to do something about it, but the issues go like this: Lets have, net:ipv4:tcp:send no, we can''t fiddle the 3rd field in an SDT provider, and leaving it as the function name is something of a DTrace convention. So we can have, net:ipv4::send net:ipv6::send so far so good. args[2] is either an IPv4 header or IPv6. This makes writing the provider easier. :) Now what about, net:tcp::send Oh-oh - will args[2] now be ipv4 or ipv6? We still want to match IP addresses at the TCP layer, as they are available. Do we use, net:tcp4::send net:tcp6::send ... which would lead us to, net:tcp4::send net:tcp4::receive net:tcp6::send net:tcp6::receive net:udp4::send net:udp4::receive net:udp6::send net:udp6::receive net:sctp4::send net:sctp4::receive net:sctp6::send net:sctp6::receive ... which is getting ugly. Now consider multiplying this pain with other network layer protocols we may encounter (802.11g?). 
I'd propose the following two suggestions:

1)
    net:ip::send        args[2] is ipinfo header
    net:tcp::send       args[2] is ipinfo header

where an ipinfo header is an aggregation of fields from all possible
network protocols, with individuals differentiated using a predicate, eg,

    /args[2]->ip_ver == 4/
    /args[2]->ip_ver == 6/

or no predicate if you don't care; which will usually be the case (we'll
often be just interested in the address as a string - whether that is an
IPv4 dotted quad decimal or an IPv6 compact string).

or,

2)
    net:ipv4::send      args[2] is ipv4 header
    net:ipv6::send      args[2] is ipv6 header
    net:tcp::send       args[2] is ipinfo header

For other layers (tcp) where we are likely to be just interested in common
fields, such as source & dest address, ipinfo is sufficient. However, for
network layer protocols we have the original ipv4 or ipv6 header
available.

Both can be matched using,

    net:ipv*::send

... so long as no other protocol begins with "ipv"! This gives us the
advantage of protocol access at the network layer, without a mess at the
transport layer.

I guess there is a 3rd option as well,

3)
    net:ipv4::send      args[2] is ipinfo header
    net:ipv6::send      args[2] is ipinfo header
    net:tcp::send       args[2] is ipinfo header

since ipinfo aggregates everything from both ipv4 and ipv6, would it
matter that there were unused fields? ...

PS - the "ip_" and "tcp_" prefixes are a DTrace convention as well, and
would be appropriate for a DTrace-invented ipinfo_t (just like the io
provider invents a devinfo_t and a fileinfo_t, with prefixes). Trying to
make these look identical to the Solaris implementation sounds like a step
towards a private provider again, which for net::: we aren't trying to do.

cheers,

Brendan

--
Brendan
[CA, USA]
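As a quick illustration of how option (1) reads in a script: one clause
that works for either IP version, and one narrowed by the version
predicate. This is a sketch against the proposed (not yet existing)
ipinfo_t; ip_daddr, ip_ver and inet_ntos() are taken from the proposal,
while ip_length is an assumed member name for the datagram length.

    net:ip::send
    {
            /* common ipinfo fields work regardless of IP version */
            @bytes[inet_ntos(args[2]->ip_daddr)] = sum(args[2]->ip_length);
    }

    net:ip::send
    /args[2]->ip_ver == 6/
    {
            /* only narrow by version when it actually matters */
            @v6["IPv6 datagrams"] = count();
    }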
> I''d propose the following two suggestions: > > 1) > net:ip::send args[2] is ipinfo header > net:tcp::send args[2] is ipinfo header > > where an ipinfo header is an aggregation of fields from all possible > network protocols, with individuals differentiated using a predicate, > eg, > > /args[2]->ip_ver == 4/ > /args[2]->ip_ver == 6/ > > or no predicate if you don''t care; which will usually be the case (we''ll > often be just interested in the address as a string - whether that is a > IPv4 dotted quad decimal or an IPv6 compact string). > > or, > > 2) > net:ipv4::send args[2] is ipv4 header > net:ipv6::send args[2] is ipv6 header > net:tcp::send args[2] is ipinfo header > > For other layers (tcp) where we are likely to be just interested in common > fields, such as source & dest address, then ipinfo is sufficient. > However for network layer protocols we have the original ipv4 or ipv6 > header available. > > Both can be matched using, > > net:ipv*::send > > ... so long as no other protocol begins with "ipv"! > > This gives us the advantage of protocol access at the network layer, > without a mess at the transport layer.Option (1) is much more in keeping with the DTrace conventions -- and is a much more scalable option, in my opinion. A Stable net provider that used Option (2) would have to essentially lie about the module name and then make that a Stable component in the probe tuple -- an idea that I''m quite uncomfortable with. So I would weigh in quite strongly towards Option (1) (or its ilk) unless that is somehow technically impossible... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Kacheong Poon writes: > Brendan Gregg - Sun Microsystems wrote: > > > Maybe I''d adjust them to be, > > > > tcp::data-sent-start > > > So it fires when new data is sent down to IP? > Ideally it would be better to fire this up before we call into IP (and call the probes data-send*) but do we actually know or control before hand how much data IP/NIC is going to accept ? If we actually do know then lets fire before but if we don''t I think we can only fire after calling into IP. > > > tcp::data-sent-retrans > > tcp::data-sent-done > > > Does it mean that data is ack''ed and freed? If it does, I''d > suggest we just use data-acked. "Done" sounds like the sending > action is finished, so it fires when the thread calling the > IP send function returns. We probably don''t need that. How about data-sent-acked ? -r
Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK to CLOSED isn''t. Even if it''s not triggered by an external event, it''s real enough to hold on to the port number for that 2 MSL (and cause port starvation at least in microbenchmarks). So for completeness I think we should fire the transition into closed after that delay. -r
On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote:> > 1) > > net:ip::send args[2] is ipinfo header > > net:tcp::send args[2] is ipinfo header > ><SNIP!>> > or, > > > > 2) > > net:ipv4::send args[2] is ipv4 header > > net:ipv6::send args[2] is ipv6 header > > net:tcp::send args[2] is ipinfo header<SNIP!>> Option (1) is much more in keeping with the DTrace conventions -- and is > a much more scalable option, in my opinion. A Stable net provider that > used Option (2) would have to essentially lie about the module name and > then make that a Stable component in the probe tuple -- an idea that I''m > quite uncomfortable with. So I would weigh in quite strongly towards > Option (1) (or its ilk) unless that is somehow technically impossible...You do realize that both options lie about the module name, right? There is a "tcp" driver module, but in a post-FireEngine world it merely acts as a compatibility shim for TLI/XTI applications. All of the hotpath TCP/IP protocol processing now lives in the ip module. So given BOTH options don''t state the precise module where the code lives, I would lean (a little) toward #2, where args[2] gets its type straightened out immediately. Dan
Bryan Cantrill writes:> > net:ip::send args[2] is ipinfo header > > net:tcp::send args[2] is ipinfo header[...]> > net:ipv4::send args[2] is ipv4 header > > net:ipv6::send args[2] is ipv6 header > > net:tcp::send args[2] is ipinfo header[...]> Option (1) is much more in keeping with the DTrace conventions -- and is > a much more scalable option, in my opinion. A Stable net provider that > used Option (2) would have to essentially lie about the module name and > then make that a Stable component in the probe tuple -- an idea that I''m > quite uncomfortable with. So I would weigh in quite strongly towards > Option (1) (or its ilk) unless that is somehow technically impossible...Unfortunately, the IPv4 and IPv6 headers are not actually the same. It would have been nice if the ipng group had just extended IP, but they did some rototilling as well. This means that while "version," "source address," and "destination address" are in common (except perhaps for representation of addresses, which we can probably ignore) the rest is not. There are three fields that are renamed. The "TOS" field in IPv4 becomes the "TC" field in IPv6. The contents are actually mostly the same in practice (using DSCP), so perhaps these could be aliases. Similarly, "TTL" becomes "hop limit" without an actual change in semantics. And "protocol" becomes "next header." The options handling is quite different between the two. In IPv4, the options are part of the IP header, which is why there''s an "IHL" field. In IPv6, the options are in *separate* headers (called "extension headers") that look to the user somewhat like encapsulation. All of the IPv4 fragmentation bits (ID, DF, MF, offset) are different in IPv6. Instead of being fixed header fields, these turn into an option extension header. IPv6 doesn''t have an IP header checksum or flags. IPv4 doesn''t have a "flow label" (but, then, perhaps nobody uses the IPv6 flow label anyway). In other words, I''m puzzled about how ''ipinfo'' will work in practice. Suppose we do go with option (1). What does this do? net:ip::send /args[2]->ip_tos == 0x80/ Does it select all IPv4 packets with TOS 0x80, and ignore IPv6? Does it select both IPv4 and IPv6 with TC set to 0x80? Does it cause one of those hard-to-understand "at DIF offset" errors? Repeat the same for the not-in-common fields. I think merging IPv4 and IPv6 fields into a single structure is, at least, to me, a confusing idea. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems writes:

> Oh-oh - will args[2] now be ipv4 or ipv6? We still want to match IP
> addresses at the TCP layer, as they are available.

Do we? I think that's one of the important questions here. It seems to me
that doing so, while convenient, is a bit of a layering violation and may
lead to fragility.

While TCP and UDP currently run only on IPv[46], that hasn't always been
so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we wave away
this issue ("we really have all the protocols we'll ever need"), it begs
questions about how tunnels are represented and how users might interact
with both general encapsulation and stack instances.

Perhaps more importantly, at the point where TCP does a "send" operation,
there is no IP header, so referring to one is a bit of a falsehood. (There
happens to be one at this point in time prefabricated in our
implementation, but that's an implementation artifact. The RFCs admit no
such thing.)

Here's a thought experiment: how would you represent TCP/IP over PPP on
L2TP? The stack (for user data) looks something like:

    TCP/IP/PPP/L2TP/UDP/IP

I suspect it might be just as easy to do something like this:

    net:ipv4::recv
    /args[1]->ip_src == $v4src && args[1]->ip_proto == 6/
    {
            self->conn = args[0];
    }

    net:tcp::recv
    /self->conn == args[0] && args[1]->tcp_sport == 80/
    {
    }

as it is to do something like this:

    net:tcp::recv
    /args[1]->ip_src == $v4src && args[2]->tcp_sport == 80/
    {
    }

... and the former is semantically clearer and richer (as you can
discriminate IP fragments if you wish, as well as handling tunnels). The
downside is not having something in the framework that allows you to
"reach back" into prior headers, but since that's also not really part of
the layering model, that might not be as big a deal as it seems.

On the transmit side, I would likely want to have access to the well-known
connection state items (bound addresses and ports as well as various
socket options) in some implementation-neutral way. In fact, if you
wanted, providing access to the connection state would provide you with
essentially the same information you're proposing, but without the
layering violation implied by accessing it out of the packet header.

In other words, instead of:

    net:tcp::send       args[2] is ipinfo header

you could have:

    net:tcp::send       args[2] is connection state

... where we have some abstract (non-implementation-exposing; not a copy
of the conn_t or tcp_t) structure containing the basic connection
variables at that layer. For TCP, that might be:

    cs_bound_local_ip
    cs_bound_remote_ip
    cs_bound_local_port
    cs_bound_remote_port
    cs_nagle_enabled
    cs_linger_enabled
    cs_linger_time
    cs_state
    cs_snd_una
    cs_snd_nxt
    cs_snd_wnd
    ...

(Of course, someone should come up with pithier names.)

--
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
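For comparison, a sketch of what a script against the abstract
connection-state idea could look like - tracking outstanding
(unacknowledged) data per remote peer for a local web server, without
touching a packet header. Every name here is hypothetical, lifted from the
cs_* list above; the probe and args layout are equally assumed.

    net:tcp::send
    /args[2]->cs_bound_local_port == 80/
    {
            /* data sent but not yet acknowledged, per peer */
            @outstanding[args[2]->cs_bound_remote_ip] =
                quantize(args[2]->cs_snd_nxt - args[2]->cs_snd_una);
    }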
Brendan Gregg - Sun Microsystems
2006-Sep-28 16:51 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Dan, On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:> On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote: > > > 1) > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > > > > <SNIP!> > > > or, > > > > > > 2) > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > <SNIP!> > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > You do realize that both options lie about the module name, right?Yes. But we shouldn''t have to plunge the customer into the world of Solaris TCP/IP stack implementation for this public and stable provider. We don''t want, DTrace Networking FAQ Q1) Where is TCP? A1) Once upon a time there was a tcp module, but then Sun commenced work on a project codenamed "FireEngine" ... (long history lesson) ... So when you see "ip", that really means whatever is in the kernel ip module, which includes the tcp functions. Lets say customers write a ton of scripts using, net:icmp::receive And then a (fictional) project FireStation comes along and moves icmp into ip. We wouldn''t want to then tell customers that we broke all their icmp scripts, and they they had to do something like, net:ip::receive /args[4]->type == ICMP/ No, we want tcp to be tcp and icmp to be icmp.> So given BOTH options don''t state the precise module where the code lives, > I would lean (a little) toward #2, where args[2] gets its type straightened > out immediately.Sure, but this provider is focused on networking not network implementation. I think there is a seperate need for the following, net-code::: Network Code Provider io-code::: I/O Code Provider cheers, Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-28 17:08 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote:> On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:[...]> > You do realize that both options lie about the module name, right?Ok, I have to do some back-pedling now as it''s been explained that changing the module name in DTrace is just as difficult as changing the function name -- and, that the module name equaling the correct kernel module is a deliberate DTrace convention. (I thought that strictness was just for functions - so I''ve learnt something new. :) What we can do is change the probes to be the following, ip:::send ip:::receive tcp:::send tcp:::receive etc. This means a seperate provider for each protocol -- collectively they will be the Network Providers, but there won''t be a specific "net:::". Brendan -- Brendan [CA, USA]
On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote:> G''Day Dan,<SNIP!>> > > > You do realize that both options lie about the module name, right? > > Yes. But we shouldn''t have to plunge the customer into the world of > Solaris TCP/IP stack implementation for this public and stable provider.Agreed.> And then a (fictional) project FireStation comes along and moves icmp into > ip. We wouldn''t want to then tell customers that we broke all their icmp > scripts, and they they had to do something like, > > net:ip::receive > /args[4]->type == ICMP/ > > No, we want tcp to be tcp and icmp to be icmp.Guess what, icmp is already IN IP except for the bits that handle SOCK_RAW sockets for apps like ping(1m).> > So given BOTH options don''t state the precise module where the code > > lives, I would lean (a little) toward #2, where args[2] gets its type > > straightened out immediately. > > Sure, but this provider is focused on networking not network implementation. > I think there is a seperate need for the following, > > net-code::: Network Code Provider > io-code::: I/O Code ProviderHmmmm... people who do networking (and I mean sysadmins when I say this) are already comfy enough with snoop(1m) or tcpdump(8) or ethereal(8) so that they''ll know their way around a TCP or IP header. THAT''s why I want an IPv6/IPv4 split. It''s not about the implementation, it''s about the protocol, which includes the bits on the wire AND how endpoints send/receive and process said bits. Wait until tref''s off my back - there are lots of concepts in IPsec that look like implementation details but really aren''t. (See RFCs 2401 or 4301 for a start, assuming you''ve the time.) Dan
Brendan Gregg - Sun Microsystems
2006-Sep-28 17:27 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Dan, On Thu, Sep 28, 2006 at 01:13:49PM -0400, Dan McDonald wrote:> On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote: > > G''Day Dan, > <SNIP!> > > > > > > You do realize that both options lie about the module name, right? > > > > Yes. But we shouldn''t have to plunge the customer into the world of > > Solaris TCP/IP stack implementation for this public and stable provider. > > Agreed. > > > And then a (fictional) project FireStation comes along and moves icmp into > > ip. We wouldn''t want to then tell customers that we broke all their icmp > > scripts, and they they had to do something like, > > > > net:ip::receive > > /args[4]->type == ICMP/ > > > > No, we want tcp to be tcp and icmp to be icmp. > > Guess what, icmp is already IN IP except for the bits that handle SOCK_RAW > sockets for apps like ping(1m).fair enough :)> > > So given BOTH options don''t state the precise module where the code > > > lives, I would lean (a little) toward #2, where args[2] gets its type > > > straightened out immediately. > > > > Sure, but this provider is focused on networking not network implementation. > > I think there is a seperate need for the following, > > > > net-code::: Network Code Provider > > io-code::: I/O Code Provider > > Hmmmm... people who do networking (and I mean sysadmins when I say this) are > already comfy enough with snoop(1m) or tcpdump(8) or ethereal(8) so that > they''ll know their way around a TCP or IP header. THAT''s why I want an > IPv6/IPv4 split. It''s not about the implementation, it''s about the protocol, > which includes the bits on the wire AND how endpoints send/receive and > process said bits.so that would be, ipv4:::send ipv4:::receive ipv6:::send ipv6:::receive ipv*:::send ipv*:::receive instead of, ip:::send /args[2]->ip_ver == 4/ ip:::receive /args[2]->ip_ver == 4/ ip:::send /args[2]->ip_ver == 6/ ip:::receive /args[2]->ip_ver == 6/ ip:::send ip:::receive which may be fine; but then accessing ip info from other layers gets nasty (I guess which is why you leant a little towards option 2 :) Brendan -- Brendan [CA, USA]
On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:> On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote: > > > 1) > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > > > > <SNIP!> > > > or, > > > > > > 2) > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > <SNIP!> > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > You do realize that both options lie about the module name, right?Ah yes -- thanks for the reminder. (I confess that I''m still getting used to the everything-in-ip world.) While there is latitude for discussion here, I feel pretty strongly that we should not lie about either the module name or the function name. I like the current split: the provider name and the probe name are arbitrary names that can convey arbitrary meaning; the module name and the function name represent the Truth of the implementation.(*) If we allow all of the elements of the probe tuple to be arbitrary, we have created a world where we can no longer correlate a probe to its implementation -- and this, I think, is undesirable. (Especially in a USDT world, where one might use Stable probes simply to better understand the implementation.) In terms of this case: ip/tcp should be in either the provider name (e.g. ip:::send or net-ip:::send, which I think makes sense) or the probe name (e.g. net:::ip-send, which I think still makes sense but is slightly muddier). - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc (*) Yes, the syscall provider is an unfortunate exception to this rule: the syscall provider''s functions don''t necessarily correspond to actual functions in the implemention, because the sysent name does not need to be the name of the function that handles the system call. That said, these nearly always match -- with notable exceptions like rexit(), gtime(), profil(), etc. But the syscall provider is an oddball in this regard, and it should stay the exception. Or more concisely: SDT-based and USDT-based providers should honor the DTrace convention of telling the truth in probemod and probefunc.
On Thu, Sep 28, 2006 at 09:48:50AM -0400, James Carlson wrote:> While TCP and UDP currently run only on IPv[46], that hasn''t always > been so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we > wave away this issue ("we really have all the protocols we''ll ever > need"), it begs questions about how tunnels are represented and how > users might interact with both general encapsulation and stack > instances.If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that make TCP and IP information available in the probes at once) results in better tracing performance than only having "ipv4," "ipv6" and "tcp" providers that don''t violate abstractions, then it may be worth doing. IOW, by having TCP and IP in the same module we may be able have a provider such that script writers need not resort to threading probes for TCP and IP through thread local variables, hash tables, etc, and that seems worthwhile. But I agree that this isn''t the Right Way (tm) to do it. That there should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. Nico --
> Ah yes -- thanks for the reminder. (I confess that I''m still getting used> to the everything-in-ip world.) While there is latitude for discussion > here, I feel pretty strongly that we should not lie about either the > module name or the function name. I like the current split: the provider > name and the probe name are arbitrary names that can convey arbitrary > meaning; the module name and the function name represent the Truth of > the implementation.(*) But what place does the Truth of Implementation have in a high-level provider with stable probes? Or, put another way, it seems like the Truth of Implementation is inherently at-odds with stability, and thus that the second and third elements of the probe tuple should not be used by high-level providers. -- meem
On Fri, Sep 29, 2006 at 02:07:46AM +0800, Peter Memishian wrote:> > > Ah yes -- thanks for the reminder. (I confess that I''m still getting used > > to the everything-in-ip world.) While there is latitude for discussion > > here, I feel pretty strongly that we should not lie about either the > > module name or the function name. I like the current split: the provider > > name and the probe name are arbitrary names that can convey arbitrary > > meaning; the module name and the function name represent the Truth of > > the implementation.(*) > > But what place does the Truth of Implementation have in a high-level > provider with stable probes? Or, put another way, it seems like the Truth > of Implementation is inherently at-odds with stability, and thus that the > second and third elements of the probe tuple should not be used by > high-level providers.That''s exactly what I''m saying -- that the probemod and probefunc should _always_ reflect the Truth, regardless of the stablity. Which is not to say that they would always be Private or Unstable; if the stability of the function itself is Stable (e.g., with parts of the DDI or the libc API), it might well make sense to have an SDT provider that had a Stable (or Evolving) name stability for the function component of the probe tuple. As for what place the Truth has: in a debugger, the Truth always has a place -- for it is patronizing and dangerous for the debugger to assume that it is smarter than the person doing the debugging by hiding the Truth. As a more concrete example in this case: if one were looking at a USDT probe from a database (say), and the transactions (say) simply didn''t add up, and one wanted to understand where that mydb:::transaction-start probe was _really_ coming from, it would be unfortunate if the module and function elements of the probe tuple couldn''t be relied upon to take you there. Yes, one could aggregate on ustack() and so on -- but that assumes that the probe is being hit. In my opinion, the presence of a semantically stable probe can and should assist in understanding the implementation. The provider name and the probe name convey semantics; the module name and the function name convey implementation -- because one often wants the former and one occasionally needs the latter. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
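For readers less familiar with the probe tuple: the module and function
components being discussed are what dtrace -l reports alongside the
provider and probe name. For example, one could list the hypothetical
database probe from the message above (the probe itself is invented for
illustration) and read the implementation location off the MODULE and
FUNCTION columns, while PROVIDER and NAME carry the stable semantics:

    # dtrace -l -n 'mydb*:::transaction-start'

That split is exactly the convention being argued for here.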
Brendan Gregg - Sun Microsystems
2006-Sep-28 19:36 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Nico,

On Thu, Sep 28, 2006 at 01:00:17PM -0500, Nicolas Williams wrote:
> On Thu, Sep 28, 2006 at 09:48:50AM -0400, James Carlson wrote:
> > While TCP and UDP currently run only on IPv[46], that hasn''t always
> > been so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we
> > wave away this issue ("we really have all the protocols we''ll ever
> > need"), it begs questions about how tunnels are represented and how
> > users might interact with both general encapsulation and stack
> > instances.
>
> If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that
> make TCP and IP information available in the probes at once) results in
> better tracing performance than only having "ipv4," "ipv6" and "tcp"
> providers that don''t violate abstractions, then it may be worth doing.
>
> IOW, by having TCP and IP in the same module we may be able to have a
> provider such that script writers need not resort to threading probes
> for TCP and IP through thread local variables, hash tables, etc, and
> that seems worthwhile.

The tcp provider was really tcp/ip - as can be seen in the examples,

   # dtrace -n ''tcp:::receive /args[3]->tcp_dport == 80/
       { @pkts[inet_ntos(args[2]->ip_daddr)] = count(); }''
   dtrace: description ''tcp:::receive'' matched 1 probe
   ^C
     192.168.1.8                        9
     fe80::214:4fff:fe3b:76c8/10       12
     192.168.1.51                      32
     10.1.70.16                        83
     192.168.7.3                      121
     192.168.101.101                  192

And I don''t think that an ipinfo_t violates an abstraction of IP, as
if you use,

   ip:::receive
   /args[2]->ip_ver == 4/
   {
           args[2] -- all original IPv4 header info
   }

   ip:::receive
   /args[2]->ip_ver == 6/
   {
           args[2] -- all original IPv6 header info
   }

   ip:::receive
   {
           args[2] -- all common header info, such as ip_saddr/ip_daddr,
           plus additional members depending on IPv4/IPv6
           (and if that matters - use the previous clauses!!!)
   }

The difference with an ipinfo_t is this,

 - Matching members that I shouldn''t (an IPv4 field while in an IPv6 clause)
   gives me NULL rather than an error. If I match an IPv4 field while in
   a generic clause - then for IPv4 traffic it is defined, and for IPv6 it
   is null.

I don''t see that as a big problem.

> But I agree that this isn''t the Right Way (tm) to do it. That there
> should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers.

One problem is pollution of the provider space. If we can stick to
"ip", "tcp", etc (or prefixed with "net-"), then that is minimised.

Do we really, really need an SCTP provider right now? We are about to solve
a ton of observability problems (who is connecting to my web server, show
me bytes by address, bytes by port, etc); we don''t want to get hung up
on something that won''t be used. Should we have a CHAOS provider as well? ;)

Brendan

--
Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-28 19:40 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Folks, On Thu, Sep 28, 2006 at 10:50:54AM -0700, Bryan Cantrill wrote: [...]> In terms of this case: ip/tcp should be in either the provider name > (e.g. ip:::send or net-ip:::send, which I think makes sense) or the > probe name (e.g. net:::ip-send, which I think still makes sense but > is slightly muddier).I''ve updated the website to have the ip:::send style. It''s not so bad. Should we actually do the net-ip:::send style? (would customers keep asking "what''s with the dash - why isn''t this net:ip::send?"?). This is something we can think about. It would be a minor tweak to change the provider name later, once we are certain which way to go. Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> The difference with an ipinfo_t is this, > > - Matching members that I shouldn''t (an IPv4 field while in an IPv6 clause) > gives me NULL rather than an error. If I match an IPv4 field while in > a generic clause - then for IPv4 traffic it is defined, and for IPv6 it > is null. > > I don''t see that as a big problem.I do, for the reasons I outlined before. I think it''s a really unclear and confusing model because IPv4 and IPv6 are not the same thing. If this ''info'' portion is limited to just accessing the TCP state (rather than the previous-header-no-longer-in-evidence), then I think you get 99 44/100ths of what you''re asking -- including this example -- but without the layering trouble.> > But I agree that this isn''t the Right Way (tm) to do it. That there > > should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. > > One problem is pollution of the provider space. If we can stick to > "ip", "tcp", etc (or prefixed with "net-"), then that is minimised.How exactly is having "ipv4" and "ipv6" more polluting than having "ip?" I''m missing the argument that says that this is an area that needs strict conservation -- to the point of merging the immiscible IPv4 and IPv6 semantics.> Do we really, really need an SCTP provider right now? We are about to solve > a ton of observability problems (who is connecting to my web server, show > me bytes by address, bytes by port, etc); we don''t want to get hung up > on something that won''t be used.I think the project is incomplete if it doesn''t support the primary protocols supported by Solaris, and SCTP is clearly one of them, no less than UDP or TCP. Now, the project you''re proposing may well be useful even if it is incomplete, but there ought to be a plan to make it complete some day. (And I think whether you putback something that''s deliberately incomplete or not is between you, your RTI Advocate, and your conscience.)> Should we have a CHOAS provider as well? ;)Of course not. Nothing supports CHAOS (including Solaris), and it''s well beyond obsolescent. I hope you''re not trying to argue that a somewhat newer protocol such as SCTP is already "obsolete." That doesn''t make sense to me. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems writes:> G''Day Folks, > > On Thu, Sep 28, 2006 at 10:50:54AM -0700, Bryan Cantrill wrote: > [...] > > In terms of this case: ip/tcp should be in either the provider name > > (e.g. ip:::send or net-ip:::send, which I think makes sense) or the > > probe name (e.g. net:::ip-send, which I think still makes sense but > > is slightly muddier). > > I''ve updated the website to have the ip:::send style. It''s not so bad. > Should we actually do the net-ip:::send style? (would customers keep > asking "what''s with the dash - why isn''t this net:ip::send?"?).I think the issue rests on the likelihood that there''ll be a confusing use of "ip" somewhere else that needs to be dtrace-providered, and whether you think networking is a first class citizen of the operating system. For what it''s worth, having a separate "ipv4" and "ipv6" is attractive for that first problem, because it''s far more specific and less likely to collide with anything else in the future than the simple "ip." As for the second part, I don''t think treating IP as a "net-" bag on the side is the right way to go, but I also don''t care too much. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Thu, Sep 28, 2006 at 12:36:05PM -0700, Brendan Gregg - Sun Microsystems wrote:> On Thu, Sep 28, 2006 at 01:00:17PM -0500, Nicolas Williams wrote: > > If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that > > make TCP and IP information available in the probes at once) results in > > better tracing performance than only having "ipv4," "ipv6" and "tcp" > > providers that don''t violate abstractions, then it may be worth doing. > > > > IOW, by having TCP and IP in the same module we may be able have a > > provider such that script writers need not resort to threading probes > > for TCP and IP through thread local variables, hash tables, etc, and > > that seems worthwhile. > > The tcp provider was really tcp/ip - as can be seen in the examples,Yes, but I happen to agree with James that we should have a "tcp" provider that doesn''t expose lower-layer information other than identifying what protocol TCP is layered over. And I do think we need separate ipv4 and ipv6 providers, for the reasons that James gives. So, what I see as needed are: - ipv4, ipv6 - udp, udp/ipv4, udp/ipv6 - tcp, tcp/ipv4, tcp/ipv6 - sctp, ... (Is ''/'' allowed in provider names?) Oh, and add ARP, ICMP, IGMP, etc... It might even be nice to have a bare-bones ''ip'' provider that allows access to v4 and v6 _packets_ sent/received, but allowing access only to the packet data.> And I don''t think that an ipinfo_t violates an abstraction of IP, as > if you use, > > ip:::receive > /args[2]->ip_ver == 4/ > { > args[2] -- all original IPv4 header info > } > > ip:::receive > /args[2]->ip_ver == 6/ > { > args[2] -- all original IPv6 header info > }This is akin to dynamic typing, which is surprising to me here. I think I understand where you''re coming from: fewer probes makes for nicer scripts. Why should I have v4- and v6-specific probes? Right? But to me the lack of conditionals in DTrace probe bodies makes it hard to use dynamic typing: I''d still end up with multiple probes, only now using different predicates instead of provider names. I suspect that scripts will be more efficient if we make such predicates unnecessary and make the providers force the distinction between v4 and v6.> > But I agree that this isn''t the Right Way (tm) to do it. That there > > should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. > > One problem is pollution of the provider space. If we can stick to > "ip", "tcp", etc (or prefixed with "net-"), then that is minimised.It''s not an open-ended namespace though -- we see new protocols at these layers at glacial rates.> Do we really, really need an SCTP provider right now? We are about to solve > a ton of observability problems (who is connecting to my web server, show > me bytes by address, bytes by port, etc); we don''t want to get hung up > on something that won''t be used. Should we have a CHOAS provider as well? ;)I think we need to know how an SCTP provider would fit in. It shouldn''t have to be putback at the same times as the others, no. And we do want SCTP to be used! Particularly by applications that could benefit from its semantics (e.g., NFSv4). Nico --
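To make the contrast concrete, the same selection of IPv6 receive traffic
could be written either way -- a sketch, assuming the hypothetical probe
and member names being debated in this thread:

   /* Style 1: one generic ip provider, version selected by predicate */
   ip:::receive
   /args[2]->ip_ver == 6/
   {
           @v6[inet_ntos(args[2]->ip_saddr)] = count();
   }

   /* Style 2: a dedicated ipv6 provider, no predicate needed */
   ipv6:::receive
   {
           @v6[inet_ntos(args[2]->ip_saddr)] = count();
   }

The predicate in the first style is evaluated on every firing, which is
the overhead the separate-provider approach avoids.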
> > Do we really, really need an SCTP provider right now? We are about to solve > > a ton of observability problems (who is connecting to my web server, show > > me bytes by address, bytes by port, etc); we don''t want to get hung up > > on something that won''t be used. > > I think the project is incomplete if it doesn''t support the primary > protocols supported by Solaris, and SCTP is clearly one of them, no > less than UDP or TCP.This general idea -- that if you don''t solve everything you must solve absolutely nothing -- is toxic and corrosive. One of the most subtle software engineering tasks is differentiating that which must be done now versus that which can be done later; if one demands that everything be done now, one will end up doing nothing at all -- a pathology that has sunk many a project. The key is to be careful when making decisions that might unnecessarily constrain future decisions; while one can not afford to do all future work in the present, one must be careful not make such future work impossible. In this case, the presence of a tcp or ip provider has no impact on a hypothetical future sctp provider, and the lack or presence of an sctp provider has no bearing on the completeness of a tcp or ip provider. If you would like to implement an sctp provider, you should by all means do so -- but it is not and nor should it be Brendan''s responsibility as part of this work. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Darren.Reed at Sun.COM
2006-Sep-29 01:17 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Brendan,

Brendan Gregg - Sun Microsystems wrote:
> G''Day Darren,

You do know that Australians reserve that greeting for use only with
foreigners, don''t you ? ;)

> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote:
> > Brendan Gregg - Sun Microsystems wrote:
> [...]
> > What I''d like to see dtrace be able to do is follow a packet through
> > the kernel, not a thread of execution. fbt is for following execution.
> > sdt is up to how it is used, but most often, it is just random hooks
> > in the code.
> >
> > The difference between using fbt and being able to follow a packet
> > is when a packet is put on a queue, the fbt path is terminated,
> > even though the packet still has places to go and I/O ports to see.
> >
> > If the networking dtrace provider doesn''t deliver the capability
> > to trace a packet through the kernel then it is not delivering a
> > key feature that is needed by many people.
>
> Noone is saying that the provider will never do this. If you are saying
> that this feature is important to you, then sure - I understand.
> And if many people means many engineers (which in turn helps them
> achieve cool things), then I understand that too.

Can you comment on what the provider is being designed to do, if it
won''t allow for following a packet through the kernel?

If I look at the design presented on the web page, it tells us what
probes will be implemented and some ways in which they can be used.
What appears to be missing is a statement of "this is what we''re trying
to achieve", from a more lofty level. Are you just trying to provide
stable names for interesting places to observe information inside IP,
or something else? Whatever the goal, it might be worth mentioning it.

> ...
>
> To have a single DTrace script check for all types of scans and give
> a report will still have some value. A customer finds unexplained
> network utilization, and can run a quick script to check what it is -
> rather than install a NIDS if one isn''t available.

If they''re trying to diagnose unexplained network utilization then they
should not be using dtrace - it isn''t the right tool for this job. Put
a network sniffer on there or use tcpdump or ethereal or similar.

How will dtrace tell me that someone has installed a backdoor''d version
of some PHP module that runs an IRC bot connecting to some chat server
on the ''net? If dtrace were the right tool to use, how would you use it
to find the answer to that question, or to even discover this issue was
present?

> > Some other probes that might be useful:
> > - RTT calculations from TCP timestamp measurements
> > - TCP window size changes
>
> These could be scripted from tcp::send and tcp::receive alone; but
> separate probes may make sense.

That''s not what I''m getting at. TCP makes internal calculations about
what the average RTT to its peer is. Using ::send and ::receive will
lead to independent calculations being made that do not necessarily
reflect what TCP thinks.

While the window size can be observed from ::receive, it isn''t the size
I want to know about, it''s when it changes, so I can hopefully see what
the old value is and what the new one is.

> > - when a packet is dropped because it is "bad" (checksum, etc)
>
> Yep - we need a tcp::drop-checksum or some such.

Or maybe: ip:::drop, tcp:::drop, etc, and make an arg the type of drop?

> > - look through "netstat -s" output as there is quite a large
> >   number of stats there that are worthy of being identified
> >   with a probe.
>
> Sure. So long as we really do want to probe it, and that the kstat
> value isn''t sufficient. Thinking of sample scripts that use them should
> help here.

It''s not the kstat value that''s important, it''s being able to see what
the packet looked like when the kernel decided to drop it.

Darren
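As a sketch of what the script-it-yourself approach looks like for the
window-size case: this assumes hypothetical tcp:::receive argument
members (tcp_sport and tcp_window are illustrative names, not part of
the proposal), and, as noted above, it only observes the window
advertised on the wire, not the values TCP computes internally:

   /* Report changes in the peer-advertised window, keyed (for
    * brevity) on the remote port only. */
   tcp:::receive
   /win[args[3]->tcp_sport] != args[3]->tcp_window/
   {
           printf("port %d window %d -> %d\n", args[3]->tcp_sport,
               win[args[3]->tcp_sport], args[3]->tcp_window);
           win[args[3]->tcp_sport] = args[3]->tcp_window;
   }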
Nicolas Williams
2006-Sep-29 04:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
On Wed, Sep 27, 2006 at 12:11:38PM -0700, Brendan Gregg - Sun Microsystems wrote:> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote: > > The difference between using fbt and being able to follow a packet > > is when a packet is put on a queue, the fbt path is terminated, > > even though the packet still has places to go and I/O ports to see. > > > > If the networking dtrace provider doesn''t deliver the capability > > to trace a packet through the kernel then it is not delivering a > > key feature that is needed by many people. > > Noone is saying that the provider will never do this. If you are saying > that this feature is important to you, then sure - I understand. > And if many people means many engineers (which in turn helps them > achieve cool things), then I understand that too.The fbt provider allows you to probe function calls/returns. It still requires that the scripter thread together the actual function calls into a trace of call flow -- as far as the kernel side of DTrace is concerned the fbt probe firings are discrete events. The same should apply to a network provider, but I''m afraid it won''t. The thing that makes it easy to thread function calls/returns is that they all happen in the same thread of execution and in sequence. The same is not necessarily true of "packets." There''s multicast, and SOCK_RAW, fragmentation, etc... Also, do you want to trace flows all the way from the link layer through applications (e.g., NFS, Apache)? That seems even more daunting. How do you thread receipt of fragmented packets into a nice call flow- style of report? If it takes associative arrays to do it then it will be expensive. In order to rely on thread locals one might have to be able to rely on packets being processed all in one thread -- an assumption that we shouldn''t want to make, even if it were true. But I somehow doubt that would be the primary use of a network provider. Nico --
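A sketch of the associative-array style described here, assuming
hypothetical ip:::receive and tcp:::receive probes and (as later
proposed for tcp:::state-*) an arg0 that acts as a unique identifier --
the point being that the script, not the kernel, carries the
correlation state:

   ip:::receive
   {
           /* remember when the IP layer saw this packet/connection */
           rcvd[arg0] = timestamp;
   }

   tcp:::receive
   /rcvd[arg0]/
   {
           @lat["IP-to-TCP latency (ns)"] = quantize(timestamp - rcvd[arg0]);
           rcvd[arg0] = 0;
   }

Every firing pays for a global associative-array lookup here, and the
scheme only works if arg0 really does identify the same packet at both
layers -- which is exactly the guarantee that fragmentation, multicast
and queueing make hard to provide.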
Brendan Gregg - Sun Microsystems
2006-Sep-29 04:52 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day James, On Thu, Sep 28, 2006 at 08:49:00AM -0400, James Carlson wrote:> Bryan Cantrill writes: > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > [...] > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > [...] > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > Unfortunately, the IPv4 and IPv6 headers are not actually the same. > It would have been nice if the ipng group had just extended IP, but > they did some rototilling as well. > > This means that while "version," "source address," and "destination > address" are in common (except perhaps for representation of > addresses, which we can probably ignore) the rest is not. > > There are three fields that are renamed. The "TOS" field in IPv4 > becomes the "TC" field in IPv6. The contents are actually mostly the > same in practice (using DSCP), so perhaps these could be aliases. > Similarly, "TTL" becomes "hop limit" without an actual change in > semantics. And "protocol" becomes "next header." > > The options handling is quite different between the two. In IPv4, the > options are part of the IP header, which is why there''s an "IHL" > field. In IPv6, the options are in *separate* headers (called > "extension headers") that look to the user somewhat like > encapsulation. > > All of the IPv4 fragmentation bits (ID, DF, MF, offset) are different > in IPv6. Instead of being fixed header fields, these turn into an > option extension header. > > IPv6 doesn''t have an IP header checksum or flags. IPv4 doesn''t have a > "flow label" (but, then, perhaps nobody uses the IPv6 flow label > anyway). > > In other words, I''m puzzled about how ''ipinfo'' will work in practice. > Suppose we do go with option (1). What does this do? > > net:ip::send > /args[2]->ip_tos == 0x80/ > > Does it select all IPv4 packets with TOS 0x80, and ignore IPv6? Does > it select both IPv4 and IPv6 with TC set to 0x80? Does it cause one > of those hard-to-understand "at DIF offset" errors? > > Repeat the same for the not-in-common fields. > > I think merging IPv4 and IPv6 fields into a single structure is, at > least, to me, a confusing idea.I bet you are confused - you are thinking structures, not DTrace translators. With translators I can take an IPv4 header, and export the IPv4 members into ipinfo_t -- which looks and feels like an IPv4 header. The documentation will just list the IPv4 members for when IPv4 traffic is probed, as though that really was the structure. Translators can also take an IPv6 header, and export the IPv6 members into an ipinfo_t. Any unused ipinfo_t members can be set to NULL. If you want to think that you really just have an IPv4 header, or really just an IPv6 header - then you still can - that''s how ipinfo_t will behave. Of course, if you start accessing members that shouldn''t be defined, you''ll get NULL. (This is assuming that the one probe, ip:::send, can access args[2] from two different translators depending on the probe location. This may not be possible - I''m still checking. 
If it does turn out to be impossible,
then that will put a bullet in the head of ipinfo_t).

How ipinfo_t (args[2]) will look in scripts,

   ip:::send
   /args[2]->ip_ver == 4/
   {
           args[2] contains valid IPv4 members, straight from the IPv4 header
   }

   ip:::send
   /args[2]->ip_ver == 6/
   {
           args[2] contains valid IPv6 members, straight from the IPv6 header
   }

Some fields are in common -- this is an ADVANTAGE,

   ip:::send
   {
           @out_packets[inet_ntos(args[2]->ip_daddr)] = count();
   }

Other fields that are not common will be NULL.

What other network level protocols may we ever see apart from IPv4
and IPv6? An advantage of "ip" is that customers are never in the
dark about network traffic - they can see everything. The documentation
would say "The ip provider matches all network level protocols. If you
want to match a particular protocol, then filter using predicates."

If you made a mistake and used an IPv6 field in an IPv4 clause, then you
get NULL. The solution? Don''t use IPv6 fields in IPv4, and vice versa!
You''d have to dream up some incredibly unlikely scenarios to explain why
this is still confusing. Such as,

*** Unlikely Scenario #1 ***

Joe is a sysadmin. Joe knows IPv4 really well, and would like DTrace to
match IPv4-specific fields. Joe''s entire site runs on IPv6, yet Joe has
never heard of IPv6. Joe has heard of DTrace. (Joe had spent the last 15
years in a coma after being struck on the head by a low flying plane).

Joe logs into a server and uses DTrace to match specific IPv4 fields,
then becomes confused to find that they are NULL. Joe is confused. Joe
doesn''t look at the ip provider documentation, which will clearly state
that the provider matches both IPv4 and IPv6, with examples of matching
them in predicates.

*** Unlikely Scenario #2 ***

James is a network expert, and can remember IPv4 TOS flags off by heart.
James discovers a new DTrace provider - ip. IP matches all ip level
traffic, both IPv4 and IPv6. Great. James immediately suffers
uncharacteristic short term memory loss, and then types in a script to
match IPv4 TOS without using a predicate to filter just IPv4.

   net:ip::send
   /args[2]->ip_tos == 0x80/

This matches IPv4 correctly, and nothing from IPv6 (ip_tos will be NULL).
Suddenly, James''s memory returns, and he becomes confused as to why the
script worked - when in fact he should have matched ip_ver in the
predicate.

...

If someone is matching the "not in common" fields - then they already
know the difference between IPv4 and IPv6.

The ipinfo_t could always provide a pointer to the raw header, whatever
it is, if the raw un-dtrace-translated header is desired. (However, the
two should contain identical populated members (skipping the NULLs) - so
it''s six of one and half a dozen of the other).

Brendan

--
Brendan [CA, USA]
On Thu, Sep 28, 2006 at 09:52:34PM -0700, Brendan Gregg - Sun Microsystems wrote:> What other network level protocols may we ever see apart from IPv4 > and IPv6? An advantage of "ip" is that customers are never in the > dark about network traffic - they can see everything.But they are also aware of both, v4 and v6. Asking them to use two probes instead of one (or two anyways, with predicates selecting on v4 vs. v6) hardly seems like a big deal, while it would simplify the proposal. Also, it''s not clear to me that you really can have probes where the types of their arguments are dynamically determined (you can always use unions, I suppose), or that that would be a good thing (it seems very unnatural to me). And they won''t always see everything -- DTrace can and will drop events :) Nico --
Brendan Gregg - Sun Microsystems writes:> > I think merging IPv4 and IPv6 fields into a single structure is, at > > least, to me, a confusing idea. > > I bet you are confused - you are thinking structures, not DTrace > translators.Fine. If you don''t like the word "structure," then use the words "name space." It''s equivalent in this context. Collapsing IPv4 and IPv6 headers into a single name space (in terms of what''s visible to the dtrace user, regardless of how that''s actually implemented under the covers) is confusing. As far as the user is concerned, there''s a structure named "ipinfo" that contains members that wink in and out of existence on different packets. That _might_ make sense if you were trying to represent a single protocol that has options (e.g., the decoded TLVs in DHCP or PPP), but makes no sense when mashing two distinct protocols together. While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After all, both TCP and UDP have port numbers and checksum fields. The extra parts that TCP has could just be NULL when a UDP packet comes in. Why would that not also be a good solution? It seems to me that you''re merging based on the fact that these two separate protocols happen to both be referred to as "IP" by users, even if they''re really quite different. I don''t think this is the right criterion.> If you want to think that you really just have an IPv4 header, or > really just an IPv6 header - then you still can - that''s how ipinfo_t > will behave. Of course, if you start accessing members that shouldn''t > be defined, you''ll get NULL.This still seems quite wrong to me, as it requires the user to add a test on ip_ver to any clause that tests something other than source or destination address. Otherwise, the results are ambiguous. Testing ip_froffset == 0 will catch all IPv6 frames, including fragments. Testing ip_ttl == 0 will catch IPv6 frames as well. I''d rather see these "mistakes" throw DIF errors than just silently do something _different_ from what the user asked.> How ipinfo_t (args[2]) will look in scripts,I understood the proposal, thanks. I just don''t agree.> What other network level protocols may we ever see apart from IPv4 > and IPv6? An advantage of "ip" is that customers are never in the > dark about network traffic - they can see everything.The main (and unfortunate) one would be InfiniBand. But other than that, I agree that it could be swept under the rug.> Joe is a sysadmin. Joe knows IPv4 really well, and would like DTrace to > match IPv4 specific fields. Joe''s entire site runs on IPv6, yet Joe has > never heard of IPv6.That last sentence represents an incorrect assumption. Change it to: "Joe''s site runs mostly IPv4, but has a small amount of IPv6, and Joe doesn''t realize that IPv6 is confusing in dtrace and will accidentally match some IPv4 clauses because we merged the name space." Or even: "Joe''s site runs both IPv4 and IPv6, but he''s an IPv4 user and doesn''t care a whit to learn anything about IPv6; he thus proceeds as though IPv6 doesn''t exist."> Joe logs into a server and uses DTrace to match specific IPv4 fields, > then becomes confused to find that they are NULL.Worse, it''ll silently match packets he wasn''t expecting to match. If he doesn''t also look at version, these will be some very strange looking beasts. Worse still, many people aren''t just doing trace(args[2]->ip_src). They use the probe point to bump a counter or as a condition for some other action. 
In that case, the failure is silent, and corrupts the results, as there''ll be no record that the wrong thing was matched.> James is a network expert, and can remember IPv4 TOS flags off-by-heart. > James discovers a new DTrace provider - ip. IP matches all ip level traffic, > both IPv4 and IPv6. Great. James immediately suffers uncharacteristic short > term memory loss, and then types in a script to match IPv4 TOS without > using a predicate to filter just IPv4. > > net:ip::send > /args[2]->ip_tos == 0x80/ > > This matches IPv4 correctly, and nothing from IPv6 (ip_tos will be NULL).More to the point: from an network administrative point of view, there''s no difference between ip_tos and ip_tc.> If someone is matching the "not in common" fields - then they already > know the difference between IPv4 and IPv6.The point is that they may not realize the effect of this merge. You seem to think that it''s "obvious." Other than for the IP source and destination address, I strongly disagree. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
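The silent-corruption case reads like this -- a sketch using the merged
ipinfo_t model under discussion, where ip_froffset is an IPv4-only
member:

   /* Intended: count only unfragmented IPv4 datagrams. Under the
    * merged model, ip_froffset is NULL (0) for every IPv6 packet,
    * so this clause also counts all IPv6 traffic, and nothing in
    * the output records the mismatch. */
   ip:::receive
   /args[2]->ip_froffset == 0/
   {
           @nonfrag = count();
   }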
Bryan Cantrill writes:> > I think the project is incomplete if it doesn''t support the primary > > protocols supported by Solaris, and SCTP is clearly one of them, no > > less than UDP or TCP. > > This general idea -- that if you don''t solve everything you must solve > absolutely nothing -- is toxic and corrosive.Great. So then why did you snip away the rest of my response and make it look as if that''s what I was saying? I didn''t say that he _must_ solve everything. I said that it''s incomplete without it, we''ll need to provide it _later_, and that thinking through what the other providers would look like is helpful.> In this case, the presence of a tcp or ip provider has no impact on a > hypothetical future sctp provider, and the lack or presence of an sctp > provider has no bearing on the completeness of a tcp or ip provider. IfI thought we were talking about a "dtrace network provider" in this thread. I must have misunderstood.> you would like to implement an sctp provider, you should by all means > do so -- but it is not and nor should it be Brendan''s responsibility as > part of this work.By creating these stable providers, Brendan is setting the high level architecture for *ALL* networking providers, including those yet to come. At least considering the protocols we have is part of that job. I don''t think the structure he''s proposing really does extend well to other protocols, because it blurs some lines that are fundamental parts of networking. In particular, with SCTP, a single connection can be multihomed with both IPv4 and IPv6 addresses. Exposing the actual IP headers through a single name space that forces these two to be aliased together will cause confusion. Leaving SCTP aside for a moment, there are other problems here. In the proposed TCP provider, I can access bits out of the IP header below. But why stop there? Why not allow access to all layers below? It would not be possible to do so in any coherent manner, but the fact that the layering distinction has been blurred raises questions about where this goes in the future. What about the transmit side? Brendan''s proposal allows access to the fields in an IP header that, per the standards, doesn''t even exist. It''s an implementation artifact. Why is that a stable thing to do? And if he doesn''t do that, then his transmit and receive bits become strangely asymmetric and much harder to use. Or consider what happens if someone does an RPC provider in the future. That''d likely be helpful. RPC runs on top of transports such as TCP, UDP, and (perhaps) SCTP. What sort of "info" representation could RPC possibly provide to the dtrace user? Given the architectural direction that Brendan is setting here, it would have to merge all of those tranport layers together into one big glop. I suspect that you''re looking for simplicity here, and that''s great, but I don''t think this desire should mean discarding fundamental networking design principles (such as layering) as well. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Fri, Sep 29, 2006 at 08:03:51AM -0400, James Carlson wrote:> Or consider what happens if someone does an RPC provider in the > future. That''d likely be helpful. RPC runs on top of transports such > as TCP, UDP, and (perhaps) SCTP.And doors! And what not.> What sort of "info" representation > could RPC possibly provide to the dtrace user? Given the > architectural direction that Brendan is setting here, it would have to > merge all of those tranport layers together into one big glop.Well, the provider *could* format a string representation of the "info" and make that available as a probe argument. But this would be done always, no? SDT providers have no idea that a given probe is enabled, right? So it seems like a bit of a waste in the off chance that some probe is enabled. Also, who''s to say that the full IP headers will be available to transport layer implementations always? Isn''t that an implementation detail that we don''t want to expose through the provider to avoid constraining the implementation? A thought: why couldn''t such a string representation of some lower-layer information be computed once and kept in and accessed through the TCB? E.g., end-point addresses, that which Brendan seems to care about the most. Also, the reverse, having information made available about higher levels will surely be thought of too :) (At least packet classifier actions, no?) IF we could trace packet, or, rather, message/data flows between these protocols then the need for having all the info in one place can go away. But without being able to rely on one thread processing all of some fairly extended path (which we can''t guarantee without seriously constraining the implementation) one cannot rely on thread-local variables to do the heavy lifting and must instead rely on associative arrays (and then one would need a way to address individual packets). I think at a minimum we need a set of providers and probes that don''t violate abstractions much at all. Then add packet/message IDs if need be. Then add abstraction violations when they prove necessary and unavoidable :)> I suspect that you''re looking for simplicity here, and that''s great, > but I don''t think this desire should mean discarding fundamental > networking design principles (such as layering) as well.Yes. Nico --
On Fri, Nicolas Williams wrote:> On Fri, Sep 29, 2006 at 08:03:51AM -0400, James Carlson wrote: > > Or consider what happens if someone does an RPC provider in the > > future. That''d likely be helpful. RPC runs on top of transports such > > as TCP, UDP, and (perhaps) SCTP. > > And doors! And what not.and Infiniband and RDDP... :-) Spencer
Brendan Gregg - Sun Microsystems
2006-Sep-29 16:41 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Fri, Sep 29, 2006 at 07:41:00AM -0400, James Carlson wrote:
> Brendan Gregg - Sun Microsystems writes:
> > > I think merging IPv4 and IPv6 fields into a single structure is, at
> > > least, to me, a confusing idea.
> >
> > I bet you are confused - you are thinking structures, not DTrace
> > translators.
>
> Fine. If you don''t like the word "structure," then use the words
> "name space."
>
> It''s equivalent in this context. Collapsing IPv4 and IPv6 headers
> into a single name space (in terms of what''s visible to the dtrace
> user, regardless of how that''s actually implemented under the covers)
> is confusing.

Nope, you still don''t get it.

There is no point where customers will be told to use an IPv4 + IPv6
combo struct. If they predicated IPv4, use IPv4; if they predicated
IPv6, use IPv6. If neither, then they can only use a documented list
of common fields.

> As far as the user is concerned, there''s a structure named "ipinfo"
> that contains members that wink in and out of existence on different
> packets. That _might_ make sense if you were trying to represent a
> single protocol that has options (e.g., the decoded TLVs in DHCP or
> PPP), but makes no sense when mashing two distinct protocols together.

Rubbish - you aren''t thinking of how the final scripts will be used,
and are making unfounded comments on how you think they will be used.
How about you actually write some scripts, as I have done, and you will
see that there is no problem.

> While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After
> all, both TCP and UDP have port numbers and checksum fields. The
> extra parts that TCP has could just be NULL when a UDP packet comes
> in.
>
> Why would that not also be a good solution?

It would be if the provider was called "transport:::". That "ip" could
mean both ipv4 and ipv6 isn''t a strange concept.

> It seems to me that you''re merging based on the fact that these two
> separate protocols happen to both be referred to as "IP" by users,
> even if they''re really quite different. I don''t think this is the
> right criterion.
>
> > If you want to think that you really just have an IPv4 header, or
> > really just an IPv6 header - then you still can - that''s how ipinfo_t
> > will behave. Of course, if you start accessing members that shouldn''t
> > be defined, you''ll get NULL.
>
> This still seems quite wrong to me, as it requires the user to add a
> test on ip_ver to any clause that tests something other than source or
> destination address. Otherwise, the results are ambiguous.
>
> Testing ip_froffset == 0 will catch all IPv6 frames, including
> fragments. Testing ip_ttl == 0 will catch IPv6 frames as well.
>
> I''d rather see these "mistakes" throw DIF errors than just silently do
> something _different_ from what the user asked.

So here we go - you finally put your finger on THE difference!

If a user is matching IPv4 or IPv6 specific fields, AND they do
something they shouldn''t - like examine IPv6 fields in IPv4, YOU want
an error, rather than NULL. This is ridiculous. If a customer knows
enough about IPv4 and IPv6 to be matching specific fields - then they
know the difference between IPv4 and IPv6.

> > If someone is matching the "not in common" fields - then they already
> > know the difference between IPv4 and IPv6.
>
> The point is that they may not realize the effect of this merge.

There is no effect.

Brendan

--
Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> Rubbish - you aren''t thinking of how the final scripts will be used, > and are making unfounded comments on how you think they will be used.I''m somewhat surprised and taken aback that you can read my thoughts, but as I''ve already explained my objections more than once, I won''t belabor the point. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Fri, Sep 29, 2006 at 09:41:07AM -0700, Brendan Gregg - Sun Microsystems wrote:> Nope, you still don''t get it. > > There is no point where customers will be told to used a IPv4 + IPv6 > combo struct. If they predicated IPv4, use IPv4; if they predicated > IPv6, use IPv6. If neither, then they can only use a documented list > of common fields.Er, no, DTrace doesn''t work that way -- you can''t have probe arguments typed according to arbitrary predicate expressions! So, either you do "tell customers to use an IPv4 + IPv6 combo struct" or you don''t do this at all. Also, if you''d have predicates to make this probe useful then you might as well have separate ipv4 and v6 probes.> > While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After > > all, both TCP and UDP have port numbers and checksum fields. The > > extra parts that TCP has could just be NULL when a UDP packet comes > > in. > > > > Why would that not also be a good solution? > > It would be if the provider was called "transport:::".Why propose that for IP but not for transport protocols then? Be consistent :) But I think it''s just not a workable idea. You could prototype it and see... Nico --
On Fri, Sep 29, 2006 at 12:28:56PM -0500, Nicolas Williams wrote:> On Fri, Sep 29, 2006 at 09:41:07AM -0700, Brendan Gregg - Sun Microsystems wrote: > > Nope, you still don''t get it. > > > > There is no point where customers will be told to used a IPv4 + IPv6 > > combo struct. If they predicated IPv4, use IPv4; if they predicated > > IPv6, use IPv6. If neither, then they can only use a documented list > > of common fields. > > Er, no, DTrace doesn''t work that way -- you can''t have probe arguments > typed according to arbitrary predicate expressions!I think Brendan was saying that if you want the IPv6 fields to be valid (that is, to be non-NULL), use a predicate. For whatever it''s worth. there is precedent for this general idea in the fileinfo argument to io:::start: having the file available when doing an I/O represents a total layer violation of the implementation (and as a result, we don''t have it all the time), but I think users of DTrace would agree that being able to correspond I/O operations to file names is tremendously useful. And indeed, utility -- above all else -- is the guiding principle of DTrace.> Also, if you''d have predicates to make this probe useful then you might > as well have separate ipv4 and v6 probes.Brendan made the point repeatedly, but just to make it again: it''s useful because the two protocols have fields in common -- and in fact, these are the fields that most people care about most of the time.> But I think it''s just not a workable idea. You could prototype it and > see...That''s exactly what''s going to happen. And I think it would be wise to defer any future discussion until such a prototype is in hand... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
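For reference, the io provider precedent reads like this (this one is a
shipping provider, not part of the proposal):

   /* I/O operations by file pathname; fileinfo_t is a best-effort
    * translation that degrades gracefully when no file is associated
    * with the I/O, rather than failing. */
   io:::start
   {
           @iops[args[2]->fi_pathname] = count();
   }

The analogy is that exposing IP information in transport probes is a
similar layer violation: not always available, but tremendously useful
when it is.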
Bryan Cantrill
2006-Sep-29 21:56 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
> So what resource is less limited than cpu breakpoint registers but can > induce entry to specified code (a dtrace probe)? Bingo: main memory > and ECC codes [2]. We can deliberately mark objects, to a precision > of an ECC code word, as bad - but list them in a tracking table which > the ECC handler checks for dtrace entry in preference to a real ECC > fault.We will never be doing this: it violates the safety constraint, in several dimensions. (One of which you outline below.)> The non-enabled probe cost is zero. The enabled probe cost is that > of a) rewriting the memory area in question with deliberate bad ECC; > a bit expensive to do (e.g.) to every packet on a 10GB link, > (especially if a bad-ECC write is significantly slower than a normal > write) plus b) another rewrite with good ECC, plus c) the actual > probe code. > > There is a reliability downside: the tracked memory may no > longer be checked for real faults. > It helps if an ECC codepoint is available which is not used > by any hardware fault (or combination of faults).Ah, yes -- the reliability "downside". This is but one tangible way in which this notion violates the safety constraint, and as such it will never be used by DTrace. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Bryan Cantrill
2006-Sep-29 22:19 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote:> Bryan Cantrill wrote: > >We will never be doing this: it violates the safety constraint, in > >several dimensions. (One of which you outline below.) > > I''m interested - which other dimensions do you have in mind?What if someone puts bad ECC on the text or data required to execute the bad ECC handler? How does DTrace know what this text and data are to prohibit it? More generally, what if someone puts bad ECC on text or data that needs to be accessed at TL>1? How is this handled? And what about memory that DTrace itself uses to (for example) store the lookup table to know how to correct the bogus bad ECC? Most generally, how can one prevent bad ECC from being placed on any word of memory that is accessed in a context in which a correctable ECC trap cannot be handled? Trust me that this is tricky -- and made much more so by this idea, as it creates many more memory accesses from the correctable handler. And all of these issues are in _addition_ to the issue that you mentioned: that by introducing bad ECC, you have lowered the fault tolerence of those words that you are marking. This alone is unacceptable and a violation of the safety constraint; the many other issues with this idea are just the nails in the coffin. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Adam Leventhal
2006-Sep-29 22:42 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Hi Jeremy, Recall that through DTrace, a privileged user can trace the contents of arbitrary memory addresses, and it performs all these actions with interrupts disabled. Your idea is clever, it just would be extremely complex to implement safely if, indeed, it could be done at all. Adam On Sat, Sep 30, 2006 at 11:40:12PM +0100, Jeremy Harris wrote:> Bryan Cantrill wrote: > >On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote: > >>Bryan Cantrill wrote: > >>>We will never be doing this: it violates the safety constraint, in > >>>several dimensions. (One of which you outline below.) > >>I''m interested - which other dimensions do you have in mind? > > > >What if someone puts bad ECC on the text or data required to execute > >the bad ECC handler? How does DTrace know what this text and data are > >to prohibit it? > > Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting > that Dtrace should both place the bad-ECC and log the placed regions. > > > More generally, what if someone puts bad ECC on text > >or data that needs to be accessed at TL>1? How is this handled? > > Yes, that is an issue. Can we constrain it in any way? What memory > is accessed at TL>1, and can we restrict it from this type of > tracing? Are there similar issues on non-SPARC architectures? > > >And what about memory that DTrace itself uses to (for example) store the > >lookup table to know how to correct the bogus bad ECC? > > What about it? You''re not very clear here; please expand. > > It''s just memory; given the assumption above > that it is Dtrace placing the bad ECC, it would seem possible to > not permit it as a traceable region. > > > Most generally, > >how can one prevent bad ECC from being placed on any word of memory that > >is accessed in a context in which a correctable ECC trap cannot be > >handled? > > By having Dtrace responsible for placing the bad ECC. > > - Jeremy Harris > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org-- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
Bryan Cantrill
2006-Sep-29 22:53 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
> >>>We will never be doing this: it violates the safety constraint, in > >>>several dimensions. (One of which you outline below.) > >>I''m interested - which other dimensions do you have in mind? > > > >What if someone puts bad ECC on the text or data required to execute > >the bad ECC handler? How does DTrace know what this text and data are > >to prohibit it? > > Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting > that Dtrace should both place the bad-ECC and log the placed regions.No, you were being perfectly clear -- you''re just not understanding what I''m saying. DTrace is but a conduit; assuming you''re going to allow programmability around this, you will be allowing users to put bad ECC on arbitrary memory. Offering this programmability introduces all of the issues that I mentioned.> > More generally, what if someone puts bad ECC on text > >or data that needs to be accessed at TL>1? How is this handled? > > Yes, that is an issue. Can we constrain it in any way? What memory > is accessed at TL>1, and can we restrict it from this type of > tracing?I''m not going to give you a tutorial on SPARC; suffice it to say that this is an extraordinarily difficult problem to bound or solve -- and the failure mode is fatal, which means the solution must not be brittle. It''s very hard to see how one could constrain or know the memory accessed in illegal contexts in a way that was sufficiently robust. But I''m not going to have an extended discussion about this idea; it''s fatally flawed (at least from the DTrace perspective), for the reasons that I have already expanded on. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Richard L. Hamilton
2006-Sep-30 06:58 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
carlsonj wrote:> If we''re going to get into snoop, I''d rather see us > cast a wider net > than just adding a few minor features.I''ve thought more than once that it would be great if snoop implemented its protocol interpretation with shared objects with a defined interface, so that support for interpreting additional protocols could be added without modifying snoop proper. Presumably this would also include an index file that provided magic number, offset and size thereof, parent protocol(s), etc; i.e. whatever would be needed to determine when that protocol interpretation module was applicable. At least when reading directly from a network and writing interpreted output (rather than to a capture file), it would probably need to load all modules at startup rather than as needed (and not lazy-loaded, either), so as to not suffer from delays that might cause it to lose packets while running. This message posted from opensolaris.org
Jeremy Harris
2006-Sep-30 21:45 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Nicolas Williams wrote:> On Wed, Sep 27, 2006 at 12:11:38PM -0700, Brendan Gregg - Sun Microsystems wrote: >> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote: >>> The difference between using fbt and being able to follow a packet >>> is when a packet is put on a queue, the fbt path is terminated, >>> even though the packet still has places to go and I/O ports to see. >>> >>> If the networking dtrace provider doesn''t deliver the capability >>> to trace a packet through the kernel then it is not delivering a >>> key feature that is needed by many people. >> Noone is saying that the provider will never do this. If you are saying >> that this feature is important to you, then sure - I understand. >> And if many people means many engineers (which in turn helps them >> achieve cool things), then I understand that too. > > The fbt provider allows you to probe function calls/returns. It still > requires that the scripter thread together the actual function calls > into a trace of call flow -- as far as the kernel side of DTrace is > concerned the fbt probe firings are discrete events.[...]> If it takes associative arrays to do it then it will be expensive.Generalizing somewhat away from networking, we would like tracing of data objects, independent of the execution environment. I started wondering if dtrace could make use of the "data breakpoint" facility that many processors offer these days, labeling a data object for triggering probes just by setting such a trap-on-access. Two problems: First, the precision may be too tight, trapping exactly one address, when you want a whole mblk''s worth (the reverse, trapping a cacheline or page is ok, just filter in the fired-probe code). Second, processor breakpoint registers are a scarce resource. We would not be able to handle an indefinitely complex trace operation, over several mblks at once. The situation is obviously worse when Fred across the hall is tracking disk I/O blocks on the system at the same time. On the other hand, it''s usable in user-space and kernel-space including interrupt level, which is a good start. To attack the first problem above, perhaps we should ask cpu designers for data-breakpoint *range* [1] facilities? So what resource is less limited than cpu breakpoint registers but can induce entry to specified code (a dtrace probe)? Bingo: main memory and ECC codes [2]. We can deliberately mark objects, to a precision of an ECC code word, as bad - but list them in a tracking table which the ECC handler checks for dtrace entry in preference to a real ECC fault. The non-enabled probe cost is zero. The enabled probe cost is that of a) rewriting the memory area in question with deliberate bad ECC; a bit expensive to do (e.g.) to every packet on a 10GB link, (especially if a bad-ECC write is significantly slower than a normal write) plus b) another rewrite with good ECC, plus c) the actual probe code. There is a reliability downside: the tracked memory may no longer be checked for real faults. It helps if an ECC codepoint is available which is not used by any hardware fault (or combination of faults). Very large objects should be marked with bad-page-translations in MMU tables rather than bad-ECC to reduce the costs [3]. Probes would have to be dynamically placed on memory areas; in the dtrace world I suspect this would be best done as a result of the firing of some traditional, code-based, probe - thus allowing restriction of the cost of the enabled data probes by appropriate prefiltering conditions. 
- Jeremy Harris Footnotes: [1], [2], [3] These may be patentable ideas. Since I no longer work for Sun, they are hereby placed in the public domain, with no restrictions.
Jeremy Harris
2006-Sep-30 22:09 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Bryan Cantrill wrote:> We will never be doing this: it violates the safety constraint, in > several dimensions. (One of which you outline below.)I''m interested - which other dimensions do you have in mind? - Jeremy Harris
Jeremy Harris
2006-Sep-30 22:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Bryan Cantrill wrote:> On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote: >> Bryan Cantrill wrote: >>> We will never be doing this: it violates the safety constraint, in >>> several dimensions. (One of which you outline below.) >> I''m interested - which other dimensions do you have in mind? > > What if someone puts bad ECC on the text or data required to execute > the bad ECC handler? How does DTrace know what this text and data are > to prohibit it?Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting that Dtrace should both place the bad-ECC and log the placed regions.> More generally, what if someone puts bad ECC on text > or data that needs to be accessed at TL>1? How is this handled?Yes, that is an issue. Can we constrain it in any way? What memory is accessed at TL>1, and can we restrict it from this type of tracing? Are there similar issues on non-SPARC architectures?> And what about memory that DTrace itself uses to (for example) store the > lookup table to know how to correct the bogus bad ECC?What about it? You''re not very clear here; please expand. It''s just memory; given the assumption above that it is Dtrace placing the bad ECC, it would seem possible to not permit it as a traceable region.> Most generally, > how can one prevent bad ECC from being placed on any word of memory that > is accessed in a context in which a correctable ECC trap cannot be > handled?By having Dtrace responsible for placing the bad ECC. - Jeremy Harris
James Carlson
2006-Oct-02 12:19 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
Richard L. Hamilton writes:> carlsonj wrote: > > If we''re going to get into snoop, I''d rather see us > > cast a wider net > > than just adding a few minor features. > > I''ve thought more than once that it would be great if > snoop implemented its protocol interpretation with > shared objects with a defined interface, so that > support for interpreting additional protocols could > be added without modifying snoop proper.That''s the same as meem''s "tap" project, some time ago. He can comment on what happened with it, but I think the main problem in that area isn''t in fact the current design of the code (which I agree is unpleasant). The main problem is that there are so many protocols that one might want to decode, that creating the decoder modules is the uncrossable hurdle. You need armies writing those modules. Having a Solaris-specific design target just limits the number of people available even further. I think if we''re going to overhaul snoop _itself_, we''d be better off tossing the code we''ve got, and going with ethereal. There are undoubtedly problems with ethereal as well, but at least it has a rich feature set. That wasn''t the only thing I was referring to in the comment you quoted above. I was referring to the way in which snoop gathers data. Snoop gathers data by opening a DLPI node and issuing some commands to place the device into promiscuous mode. There are a lot of problems with this approach: - You can''t watch loopback traffic, because that traffic doesn''t go anywhere near DLPI. (Though Clearview is fixing this problem by adding observability nodes.) - The interface doesn''t distinguish between inbound and outbound packets. - There''s no way for DLPI to retain errored packets or pass up metadata along with the packet ("this partial packet terminated by collision"). - There''s no way to pass up other events (carrier indication, flow control, packet loss). - Time stamps are generated by bufmod(7D), well after packet reception and handling, rather than referring to actual receive time. Some of those things could be addressed by yet-more DLPI extensions, but it''s unclear to me whether that''s really the right path. I suspect that something better than raw DLPI access is needed for monitoring traffic. I''m not sure whether a dtrace provider is the right answer to _that_ problem, as turning the output of the provider into something that portable applications (using libpcap) could access sounds both much harder and less useful than designing a new interface from scratch. And if we can''t do that, then we''re really sunk, because we can''t recreate ethereal from scratch. So, if the target of the dtrace network providers is to do what snoop does (but "better"), then I think that''s probably not a good target. We have a lot of things to do in that area, but it''s unclear whether (or how) dtrace is a part of them. The target should be making what Solaris does with packets clearer. That means exposing the relevant state transitions and activity, so that users can find out not just that a given packet is on the wire, but why it''s there. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Peter Memishian
2006-Oct-02 12:21 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
> That''s the same as meem''s "tap" project, some time ago.

(An unfortunate name choice now, given *cough* SystemTap -- but moot now.)

> He can comment on what happened with it, but I think the main problem
> in that area isn''t in fact the current design of the code (which I
> agree is unpleasant). The main problem is that there are so many
> protocols that one might want to decode, that creating the decoder
> modules is the uncrossable hurdle. You need armies writing those
> modules. Having a Solaris-specific design target just limits the
> number of people available even further.

Yes, that is a sizeable problem. Some of it can be mitigated through
high-quality APIs, but there''s also a seemingly endless number of
decoders to write -- not to mention a surprising number of "clever" ways
that are used to hop from one protocol to the next (especially once one
gets into application protocols).

-- 
meem
James Carlson
2006-Oct-02 13:28 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Jeremy Harris writes:
> To attack the first problem above, perhaps we should ask cpu designers
> for data-breakpoint *range* [1] facilities?
[...]
> [1], [2], [3] These may be patentable ideas. Since I no longer
>     work for Sun, they are hereby placed in the public
>     domain, with no restrictions.

Motorola would likely beg to differ on your disclosure for [1].
Hardware range breakpoints and even breakpoints with conditional
(and/or) logic have been a standard part of the PowerQUICC family for a
long time -- at least since the MPC860, more than ten years ago. The
idea is almost certainly older than that. (Didn''t VM/CP also have
range facilities?)

The 860 had four I-address, two L-address, and two L-data comparators,
each with ==, !=, >, and <. The L-data comparators also had
signed/unsigned modes and byte masks. These comparators fed into a
programmable and/or logic structure. If you''re interested, see chapter
44 in the MPC860UM [user''s manual].

It''d be nice if the CPUs we ordinarily used had those features, but
they''re certainly not new inventions.

-- 
James Carlson, KISS Network                   <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive        71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757  42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Oct-02 23:36 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Folks,

I''ve been checking the kernel code to see what is doable for the Network
Provider, and I''ve made some updates to the website. This is where I''m
at:

1) ipinfo_t is mostly dead.

On Thu, Sep 28, 2006 at 09:52:34PM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> (This is assuming that the one probe, ip:::send, can access args[2] from
> two different translators depending on the probe location. This may not
> be possible - I''m still checking. If it does turn out to be impossible,
> then that will put a bullet in the head of ipinfo_t).

My assumption was wrong. It turns out that I can''t have two translators
that output the same ipinfo_t. DTrace translators take one input and
make one output only. Nor can I pass in a void and have the translator
figure it out.

I can pass in an mblk_t * (or really any pointer to the IP header,
should mblk_t''s disappear in the future), and get the translator to
check the IP version -- but the translator gets very complex fast, and
needs to filter out implementation (such as IP_FORWARD_PROG). So now
that there is a technical reason not to have a merged translator, that
idea is fairly dead.

There is still the possibility of an observability-only ipinfo_t, with
the following contents:

    args[2]->ip_ver        IP version
    args[2]->ip_plength    payload length
    args[2]->ip_saddr      stringified source address
    args[2]->ip_daddr      stringified dest address

in a similar style to the io provider''s fileinfo_t (which can return
"<null>" if such information was unavailable or not appropriate), and
which has been indispensable for io observability.

As for IPv4 and IPv6 headers, from the ip provider (only) they can be
presented as args[3] (ipv4) and args[4] (ipv6), which can either be:

  A) stable (host-endian correct) DTrace translations of the header:
     ipv4info_t and ipv6info_t, with pointers to the raw headers.

  B) just the raw headers: ipha_t * and ip6_t *.

Whether this ends up as,

    ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
    ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
    ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)

or,

    ipv4:::                         args[3] is IPv4 header
    ipv6:::                         args[3] is IPv6 header
    ipv4:::,
    ipv6:::                         args[2] is ipinfo_t observability

can still be discussed later - it''s not a big change to go from one to
the other.

2) Correct IP and TCP info can be hard to come by.

One tcp:::send probe can''t pick up IP info from some structs while
another tcp:::send probe picks it up from elsewhere; this is difficult
for the same reason a merged translator is difficult. Not only that, but
correct IP info isn''t easily available for much of the tcp code, and
correct TCP header details aren''t easily available much of the time
either. The most extreme example would be the tcp:::state-* probes;
getting correct IP and TCP header details for all of these is going to
be extremely difficult (and worse to maintain).

I think it would be more practical to only provide minimum stable
details from tcp_t *, something like,

    tcp:::state-*    arg0                   unique ID (either conn_s * or tcp_t *)
                     args[3]->tcp_lport     local port
                     args[3]->tcp_rport     remote port
                     args[3]->tcp_state     TCP state
                     args[3]->tcp_pstate    previous TCP state

And that''s it - no IP header info, no TCP header info.

tcp:::send can be placed after much of the IP header has been created
(arguably in ip itself), and tcp:::receive soon after the packet is
received - such that our minimal ipinfo_t is valid.

Full IP headers, however, are probably going to be available in the ip
provider only - so as not to bind the tcp provider to the current kernel
implementation. The following providers and arguments may be more
realistic:

    ip:::                  IPv4 and IPv6 headers
    tcp:::state-*          stable tcp_t details
    tcp:::send/receive     tcp_t, TCP header details, and ipinfo_t

I''m still checking what tcp:::accept-* and tcp:::connect-* can get; it
may also make sense to drop tcp:::data-* and tcp:::ack-* and have these
matched using tcp:::send/receive and predicates -- yes, they would be
nice to have, but too many probes could make ongoing maintenance of the
tcp code painful.

I hope these discoveries and changes have settled some of the concerns
of the readers. :)

cheers,

Brendan
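For a sense of how that minimal interface might be used, here is a
sketch against the proposal above. The tcp provider, the args[3]
members and the specific probe name "state-change" are all taken from
or invented for this proposal - none of this exists in shipping bits:

    /* sketch only: proposed tcp provider, not an existing one */
    tcp:::state-change
    {
            /* count transitions by local port and new TCP state */
            @trans[args[3]->tcp_lport, args[3]->tcp_state] = count();
    }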
Brendan Gregg - Sun Microsystems
2006-Oct-03 01:25 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Rich,

On Mon, Oct 02, 2006 at 07:49:26PM -0400, Richard Lowe wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >>(This is assuming that the one probe, ip:::send, can access args[2] from
> >>two different translators depending on the probe location. This may not
> >>be possible - I''m still checking. If it does turn out to be impossible,
> >>then that will put a bullet in the head of ipinfo_t).
> >
> >My assumption was wrong. It turns out that I can''t have two translators
> >that output the same ipinfo_t. DTrace translators take one input and make
> >one output only.
>
> I maybe misunderstanding you here, but as I read it, this seems not
> correct. For instance, look at how proc.d translates both proc_t* and
> kthread_t* to psinfo_t.

Sorry - I should have stressed that this was for that one probe to have
two translators (for different kernel code locations) that feed the same
struct at the same args[] location.

Yes - I can do it if I have a different probe name or use different
args[] locations, which the proc and sched providers do. I''ve already
gone down this path for probing fused loopback traffic, where I can get
details from tcp_t * but not the usual tcph_t *; and so the suggested
probe names are,

    tcp:::fuse-send
    tcp:::fuse-receive

[...]
> >and whether this ends up as,
> >
> >    ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
> >    ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
> >    ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)
> >Or,
> >    ipv4:::                         args[3] is IPv4 header
> >    ipv6:::                         args[3] is IPv6 header
> >    ipv4:::,
> >    ipv6:::                         args[2] is ipinfo_t observability
> >
> >Can still be discussed later - it''s not a big change to go from
> >one to the other.
>
> I far prefer the latter, the former seems particularly ugly.

Well, there is a precedent for this - the io provider merges different
I/O types, and this helps to write simple scripts rather than always
getting in the way. This was why I wrote the example scripts on the
website -- not once did I need to match the difference between IPv4 and
IPv6, however every script did need to match both. Going the other way
would make all of those examples uglier.

I''m glad I can type,

    dtrace -n ''io:::start { @[execname] = quantize(args[0]->b_bcount); }''

and not be forced to type,

    dtrace -n ''blockio:::start, rawio:::start, asyncio:::start,
        nfs*:::start { @[execname] = quantize(args[0]->b_bcount); }''

I''d encourage anyone to post some realistic and everyday examples of
when your regular Joe sysadmin would want to match the difference
between IPv4 and IPv6, for IP-layer tracing. What springs to mind
includes:

  * packets by IP address (both)
  * bytes by IP address
  * bytes per second by IP address
  * distribution plot of packet size
  * distribution plot of packet size by IP address
  * distribution plot of throughput latency
  * average throughput latency by IP address

Those are some everyday tasks, and I haven''t hit any specific IPv4/6
fields yet...

cheers,

Brendan

--
Brendan
[CA, USA]
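As a sketch of what everyday tasks like "packets by IP address" and
"bytes by IP address" could look like under the merged ip::: style - the
ip::: probes and ipinfo_t members are the proposal above, not an
existing provider:

    dtrace -n ''ip:::send, ip:::receive { @pkts[args[2]->ip_daddr] = count(); }''
    dtrace -n ''ip:::send, ip:::receive { @bytes[args[2]->ip_daddr] = sum(args[2]->ip_plength); }''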
Brendan Gregg - Sun Microsystems wrote:
> G''Day Rich,
>
> On Mon, Oct 02, 2006 at 07:49:26PM -0400, Richard Lowe wrote:
>> Brendan Gregg - Sun Microsystems wrote:
> [...]
>>>> (This is assuming that the one probe, ip:::send, can access args[2] from
>>>> two different translators depending on the probe location. This may not
>>>> be possible - I''m still checking. If it does turn out to be impossible,
>>>> then that will put a bullet in the head of ipinfo_t).
>>> My assumption was wrong. It turns out that I can''t have two translators
>>> that output the same ipinfo_t. DTrace translators take one input and make
>>> one output only.
>> I maybe misunderstanding you here, but as I read it, this seems not
>> correct. For instance, look at how proc.d translates both proc_t* and
>> kthread_t* to psinfo_t.
>
> Sorry - I should have stressed that this was for that one probe to have two
> translators (for different kernel code locations) that feed the same struct
> at the same args[] location.
>
> Yes - I can do it if I have a different probe name or use different args[]
> locations, which the proc and sched providers do. I''ve already gone down
> this path for probing fused loopback traffic, where I can get details from
> tcp_t * but not the usual tcph_t *; and so the suggested probe names are,
>
>     tcp:::fuse-send
>     tcp:::fuse-receive
>
> [...]
>>> and whether this ends up as,
>>>
>>>     ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
>>>     ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
>>>     ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)
>>> Or,
>>>     ipv4:::                         args[3] is IPv4 header
>>>     ipv6:::                         args[3] is IPv6 header
>>>     ipv4:::,
>>>     ipv6:::                         args[2] is ipinfo_t observability
>>>
>>> Can still be discussed later - it''s not a big change to go from
>>> one to the other.
>> I far prefer the latter, the former seems particularly ugly.
>
> Well, there is a precedent for this - the io provider merges different I/O
> types, and this helps to write simple scripts rather than always getting
> in the way. This was why I wrote the example scripts on the website --
> not once did I need to match the difference between IPv4 and IPv6, however
> every script did needed to match both. Going the other way would make all
> of those examples uglier.

Sorry, I should have perhaps been more specific.

It''s the validity of arguments depending on other arguments that I''m
disliking, which I don''t *think* other providers currently do. I''ll
admit, however, that I can''t currently see a way to go about this other
than those presented.

> I''m glad I can type,
>
>     dtrace -n ''io:::start { @[execname] = quantize(args[0]->b_bcount); }''
>
> and not be forced to type,
>
>     dtrace -n ''blockio:::start, rawio:::start, asyncio:::start,
>         nfs*:::start { @[execname] = quantize(args[0]->b_bcount); }''

Certainly.

-- Rich
Brendan Gregg - Sun Microsystems
2006-Oct-03 02:50 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Rich,

On Mon, Oct 02, 2006 at 09:50:37PM -0400, Richard Lowe wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >Well, there is a precedent for this - the io provider merges different I/O
> >types, and this helps to write simple scripts rather than always getting
> >in the way. This was why I wrote the example scripts on the website --
> >not once did I need to match the difference between IPv4 and IPv6, however
> >every script did needed to match both. Going the other way would make all
> >of those examples uglier.
>
> Sorry, I should have perhaps been more specific.
>
> It''s the validity of arguments depending on other arguments that I''m
> disliking, which I don''t *think* other providers currently do.

In io, args[1]->dev_pathname only works for /args[1]->dev_name != "nfs"/;
maybe I could find others. If you mean that it is uncommon and couldn''t
be described as a strong DTrace convention - then yes, fair enough.

The precedent I had in mind was the profile provider, with arg0 for
userland address or arg1 for kernel address, depending on the underlying
activity.

That I now can''t have ip::: and args[2] (which was simple), and instead
must have ip::: with either args[3] or args[4] (which people may find
odd), I''ll concede has certainly taken some of the shine off the ip:::
idea.

Brendan

--
Brendan
[CA, USA]
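To make the dependency concrete, the guarded style being discussed would
look something like this in a script - a sketch against the proposed
ip::: probes; ipv4info_t has no published members here, so ipv4_ttl is
an invented name for illustration only:

    ip:::send
    /args[2]->ip_ver == 4/
    {
            /* args[3] (the IPv4 header translation) is only valid under this predicate */
            @ttl[args[3]->ipv4_ttl] = count();
    }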
Brendan Gregg - Sun Microsystems
2006-Oct-03 02:53 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Mon, Oct 02, 2006 at 07:50:34PM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> The precedent in mind was the profile provider, with arg0 for userland
> address or arg1 for kernel address, depending on the underlying activity.

Oops - I meant the other way around. arg0 is kernel...

I knew I should have suggested args[4] be IPv4 and args[6] be IPv6, then
no one could complain that it''s difficult to remember which arg is
which. ;)

Brendan

--
Brendan
[CA, USA]
Brendan,

I went and looked at:

http://www.opensolaris.org/os/community/dtrace/NetworkProvider/CEC2006/

and overall I liked what I saw.

But I think it would make sense to formalize the notion of the TCP
pseudo-header for the tcp packet events. See page 17 in RFC 793.

For instance, you end up with code like this:

    this->length = args[1]->ip_plength - args[2]->tcp_offset;

instead of capturing this in a pseudo-header field called something like
tcp_length. (UDP and ICMPv6 have the same notion of a pseudo-header.)

To accurately capture the difference between the TCP pseudo-header and
the IP header on the packet on transmit, you''d need to take into account
IP source routes (LSRR and SSRR in RFC 791, and the routing header in
RFC 2460): you need to compensate for the fact that the TCP pseudo-header
carries the final destination, but in the implementation TCP sends down a
packet where ipha_dst/ip6_dst is the first hop of the source route.

I think expressing TCP''s, UDP''s, etc. behavior in terms of the
pseudo-header takes care of the layering concerns; this pseudo-header is
what defines the interface between the transports and IP.

I recently talked to a large customer and they brought up the need for
performance observability for networking. They wanted both to be able to
observe throughput (which nicstat already provides, and your new
providers give even more insight into) and also latency up and down the
stack.

After pushing back on the latency part, they explained the need as
coming from handling operational performance anomalies (e.g., overload
type situations) when they can''t tell where the issue is: whether TCP or
something else is queueing the packet, or the switches/routers are doing
it, etc.

Basically being able to see what the latency is (aggregated globally or
per socket/connection) from the write/send system call until the packet
hits the wire, and likewise on the receive side. (Tracking the time
spent queued in the NIC''s hardware transmit/receive descriptors isn''t
something we can do, and they understood that.)

Is anybody looking at providing a stable latency provider?

Based on the 100+ emails on the list on this subject, I suspect that
this requires us to invent some new concepts in Solaris so that we can
track causality within the stack without relying on mblk/dblk addresses
or resorting to destructive dtrace providers. But it might not be hard
to have the key producers of messages (sockfs and GLD) generate a unique
handle/identity in the dblk which a dtrace provider can then follow even
as other mblks are prepended and different threads handle the
message/packet.

Erik
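For comparison, here is roughly what the hand-computed payload length
from the CEC2006 material looks like as a whole clause - a sketch
against the proposed, not-yet-shipping tcp provider, where a
pseudo-header tcp_length field would replace the subtraction:

    tcp:::send
    {
            /* TCP payload length, computed by hand from the proposed members */
            this->length = args[1]->ip_plength - args[2]->tcp_offset;
            @payload["TCP payload bytes"] = quantize(this->length);
    }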
Brendan Gregg - Sun Microsystems
2006-Nov-25 02:28 UTC
[dtrace-discuss] Re: Network Provider comments
G''Day Erik,

On Wed, Nov 22, 2006 at 05:58:30PM -0800, Erik Nordmark wrote:
>
> Brendan,
>
> I went and looked at:
>
> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/CEC2006/
>
> and overall I liked what I saw.

Great,

> But I think it would make sense to formalize the notion of the tcp
> pseudo-header for the tcp packet events. See page 17 in RFC 793.
>
> For instance, you end up with code like this
>     this->length = args[1]->ip_plength - args[2]->tcp_offset;
> instead of capturing this in a pseudo-header field called something like
> tcp_length. (UDP and ICMPv6 has the same notion of pseudo-header.)

Yes, I originally had planned to include a tcp_length that was the TCP
payload length - however I ran into a small issue involving DTrace
translators. In the future we would like to fix this issue, at which
point a tcp_length could be added to the provider. For now, these
standard RFC fields (args[1]->ip_plength - args[2]->tcp_offset) both
work and are sufficient, and their inclusion doesn''t prohibit the
future addition of tcp_length.

Dropping tcp_length for now has been one of the benefits of prototyping
the provider, as I''m able to suggest something that is realistic,
functional and known to be doable.

Further details (if anyone is interested): the issue was the need for a
translator to have multiple inputs that vary depending on probe
location, and that combine to give a single output argument (args[]),
so that ip_plength can be read from one place and tcp_offset from
another, and then subtracted to give tcp_length. DTrace translators
currently take one type and translate it to another, and can only vary
their inputs if the probe name varies. Ways we may resolve this issue
include:

  - feeding the translator a custom struct of pointers to all input
    data. Building this struct will incur overhead in code that needs
    to remain fast (this would need to be benchmarked); one way around
    this may be to add is-enabled kernel probes (which don''t exist yet).

  - enhancing the DTrace translator interface to accept multiple inputs
    for one args output.

> To accurately capture the difference between the TCP pseudo header and
> the IP header on the packet on transmit, you''d need to take into account
> IP source routes (LSRR and SSRR in RFC 791, and the routing header in
> RFC 2460) you need to compensate for the fact that in the TCP pseudo
> header is the final destination, but in the implementation TCP sends
> down a packet where ipha_dst/ip6_dst is the first hop of the source route.

Good point. I haven''t yet tried to represent src or dst address in TCP
terms; instead I''m using args[1]->ip_saddr and args[1]->ip_daddr -
which are whatever addresses are in the IP packet. LSRR/SSRR doesn''t
break this.

If I had tried to provide args[2]->tcp_saddr and args[2]->tcp_daddr,
then they would definitely need to show the TCP destination from
LSRR/SSRR properly.

This will certainly need to be addressed - if not by the addition of
correct args[2]->tcp_saddr and args[2]->tcp_daddr variables, then by
clearly pointing out the dangers of assuming ip_saddr == tcp_saddr, and
providing another way to get at the correct tcp addresses. So far my
examples assume ip_saddr == tcp_saddr - which for LSRR/SSRR isn''t
correct.

> I think expression TCP''s, UDP''s etc behavior in terms of the pseudo
> header takes care of the layering concerns; this pseudo header is what
> defines the interface between the transports and IP.
>
> I recently talked to a large customer and they brought up the need
> performance observability for networking. They wanted both to be able to
> observe throughput (which nicstat already provides, and your new
> providers give even more insight into) and also latency up and down the
> stack.
>
> After pushing back on the latency part, they explained the need as
> coming from handling operational performance anomalies (e.g., overload
> type situations) when they can''t tell where the issue is; whether TCP or
> the something else queues the packet, or the switches/routers doing it, etc.

Sure - being able to show how much time was spent serving network
interrupts, and how much time was spent in kernel tcp/ip code, are both
crucial metrics that I can see many customers wanting to know, for the
exact reasons you have described.

The following checklist could be applied when attempting to understand
network performance:

    Metrics                                    Tool
    --------------------------------------------------------------------
    current throughput (bytes and packets)     nicstat
    NIC settings (duplex, auto-neg, speed)     checkcable
    route table / routing                      netstat, route, traceroute
    network latency                            dtrace (net provider)
    CPU saturation                             vmstat, dtrace
    NIC interrupt CPU consumption              intrstat, mdb & kstat
    kernel CPU time                            vmstat, dtrace
    tcp/ip CPU time                            dtrace (fbt)

> Basically being able to see what the latency is (aggregated globally or
> per socket/connection) from the write/send system call until the packet
> hits the wire, and likewise on the receive side. (Tracking the time
> spent queued in the NICs hardware transmit/receive descriptor isn''t
> something we can do, and they understood that.)
>
> Is anybody looking at providing a stable latency provider?

There are a number of needs that DTrace can fulfill:

    Need                                       Customer
    --------------------------------------------------------
    A. General network observability           Solaris users
    B. Network latency                         Solaris users
    C. Overall tcp/ip CPU latency              Solaris users
    D. Kernel code-path latency                Kernel engineers

A. General network observability

This is currently addressed in the proposed providers, and provides such
metrics as connection rates by port, by host, etc.

B. Network latency

This is also in the proposed providers, as we can measure the time
between when certain packets are sent and received.

C. Overall tcp/ip CPU latency

I''d like to have some form of overall tcp/ip code latency in the
provider, so that customers can easily see how much time is consumed in
tcp/ip code. Since the network providers are public and stable, we will
want to try not to make this implementation-based, and focus on more
general events. It should be possible to write fairly straightforward
scripts that show time spent processing packets in total, then break it
down by certain TCP/IP fields. It''s tricky to do this for a number of
reasons, including the performance overhead of DTrace events on a server
that could be pumping 100,000 packets per second.

D. Kernel code-path latency

Kernel code-path latency is not a priority for these network providers,
for reasons such as:

  1) the kernel code-path isn''t stable, and these are STABLE providers

  2) code-path latency data is already available using DTrace and the
     fbt provider, assuming:

     a) the engineer knows DTrace well
     b) the engineer understands the code they are tracing

It should be possible to write an unstable code-path latency provider
for the kernel engineers to use (and others interested in such details,
like myself), and such a provider hasn''t been ruled out.

In the shorter term it should be possible to add SDT probes to make
mblk-based tracing easier.

There have been some suggestions that many regular Solaris users will be
interested in kernel code-path latency as well, which is ridiculous. A
regular Solaris user should certainly be able to identify whether there
is an overall latency in the tcp/ip code, but it''s then up to the
kernel engineers to study and fix the code-path.

> Based on the 100+ emails on the list on this subject, I suspect that
> this requires us to invent some new concepts in Solaris so that we can
> track causality within the stack without relying on mblk/dblk addresses
> or resorting to destructive dtrace providers. But it might not be hard
> to have the key producers of messages (sockfs and GLD) generate a unique
> handle/identity in the dblk which a dtrace provider can then follow even
> as other mblks are prepended and different threads handle the
> message/packet.

Numerous code path latency problems can be solved with mblk addresses -
with the DTrace script tracing changes to the mblk. With the addition of
some private unstable SDT probes (to trace creation of mblks, splitting
of packets, etc), such scripts would even become easy.

When a server is pushing 100,000 packets per second, additional CPU
instructions become a concern. An additional data/message ID would be
nice, but would it be close to CPU free? Or is there a solid reason why
mblk addresses and the SDT provider can''t be used?

An appropriate path for solving kernel code-path latency may involve:

  1) someone who REALLY understands DTrace writing a suite of tools to
     demonstrate mblk-based tracing from the fbt provider. These tools
     will solve numerous observability issues, but also highlight
     certain difficulties with this approach, which leads to,

  2) the addition of private unstable SDT probes to enhance the suite of
     tools, and allow the tools to be written more easily. This may
     include the addition of data/message IDs that are returned via SDT.

  3) if need be, the design of a separate kernel code-path latency
     provider. This would include data/message IDs. It''s likely that
     this will still be termed ''unstable'', due to the code-path focus.

If I get a chance, I can make a start on (1) and (2). If a large
customer wants latency metrics now, then fbt-based scripts can be
written without waiting for the provider. But first, are we even in the
kernel that much? What do vmstat and intrstat show? ...

Oh, and who said anything about destructive DTrace providers?

cheers,

Brendan

--
Brendan
[CA, USA]
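A rough sketch of what step (1) could look like: timing an mblk between
two kernel functions with the existing fbt provider, keyed on the mblk
address. The function names and argument positions below are
placeholders that would have to be checked against the running kernel -
which is exactly why this stays unstable, engineer-only territory:

    /* unstable sketch: tcp_send_data and ip_output are placeholder names */
    fbt::tcp_send_data:entry
    {
            /* assume the outbound mblk_t * is arg1 */
            sent[arg1] = timestamp;
    }

    fbt::ip_output:entry
    /sent[arg2] != 0/
    {
            /* assume the same mblk_t * arrives here as arg2 */
            @lat["tcp to ip (ns)"] = quantize(timestamp - sent[arg2]);
            sent[arg2] = 0;
    }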
Brendan Gregg - Sun Microsystems wrote:
> G''Day Erik,
>
> On Wed, Nov 22, 2006 at 05:58:30PM -0800, Erik Nordmark wrote:
>
> ...
>
>> Based on the 100+ emails on the list on this subject, I suspect that
>> this requires us to invent some new concepts in Solaris so that we can
>> track causality within the stack without relying on mblk/dblk addresses
>> or resorting to destructive dtrace providers. But it might not be hard
>> to have the key producers of messages (sockfs and GLD) generate a unique
>> handle/identity in the dblk which a dtrace provider can then follow even
>> as other mblks are prepended and different threads handle the
>> message/packet.
>
> Numerous code path latency problems can be solved with mblk addresses -
> with the DTrace script tracing changes to the mblk. With the addition
> of some private unstable SDT probes (to trace creation of mblks, splitting
> of packets, etc), such scripts would even become easy.
>
> When a server is pushing 100,000 packets per second, additional
> CPU instructions become a concern. An additional data/message ID would
> be nice, but would it be close to CPU free? Or is there a solid reason
> why mblk addresses and the SDT provider can''t be used?

Using mblk addresses presumes that a packet will only ever be referred
to by the mblk at the start of the packet. This can be invalidated when
we:

  * prepend M_PROTO/M_PCPROTO when sending to a driver;

  * add/remove an M_DATA from the start of a packet when we are working
    with encapsulation (IPsec, tunnelling, etc);

  * make calls to msgpullup();

  * otherwise change the construction of a packet relative to the mblks
    that point to it.

Darren
Brendan Gregg - Sun Microsystems
2006-Nov-27 08:01 UTC
[dtrace-discuss] Re: Network Provider comments
G''Day Darren,

On Mon, Nov 27, 2006 at 03:03:59PM +0800, Darren Reed wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >When a server is pushing 100,000 packets per second, additional
> >CPU instructions become a concern. An additional data/message ID would
> >be nice, but would it be close to CPU free? Or is there a solid reason
> >why mblk addresses and the SDT provider can''t be used?
>
> Using mblk addresses presumes that a packet will only ever be
> referred to by the mblk at the start of the packet.

No, no, I wasn''t presuming that at all -- I suggested that mblk changes
be traced, and that sdt probes could make that easier.

> This can be invalidated when we:
>   * prepend M_PROTO/M_PCPROTO when sending to a driver;
>   * add/remove an M_DATA from the start of a packet when we
>     are working with encapsulation (IPsec, tunnelling, etc);
>   * make calls to msgpullup();
>   * otherwise change the construction of a packet relative to the
>     mblks that point to it.

All of these changes we can usually trace in a DTrace script, using the
existing fbt provider and ctf''d arguments. Yes, mblk''s change - and
you''ve written a valid list of challenges for an mblk-based DTrace
script to solve. Some of those mblk changes may be inlined, but there
are workarounds for that too.

I think one of the biggest challenges would be to verify that the
measured latency is accurate, and not just microsecond noise from the
DTrace overhead; this may involve setting up a testing framework that
can apply some known network loads.

cheers,

Brendan

--
Brendan
[CA, USA]
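One way that tracing of mblk changes could look with the existing fbt
provider: carry state across msgpullup(9F), which returns a new mblk
that replaces the old one. msgpullup is a real DDI routine; whether it
is visible to fbt on a given kernel (it may be inlined), and the idea of
re-keying timestamps on the old and new mblk addresses, are the
assumptions in this sketch:

    fbt::msgpullup:entry
    /sent[arg0] != 0/
    {
            /* remember the timestamp attached to the original mblk */
            self->carried = sent[arg0];
            sent[arg0] = 0;
    }

    fbt::msgpullup:return
    /self->carried/
    {
            /* re-key it on the replacement mblk returned by msgpullup() */
            sent[arg1] = self->carried;
            self->carried = 0;
    }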