Brendan Gregg - Sun Microsystems
2006-Sep-19 00:48 UTC
[dtrace-discuss] DTrace Network Provider
G'Day Folks,

I've just created a website to flesh out ideas for a DTrace Network Provider. Let us know what you think.

http://www.opensolaris.org/os/community/dtrace/NetworkProvider/

I'm not currently writing this provider; this began as a discussion with Adam about what a network provider could do, which we thought would be appropriate to discuss on dtrace-discuss. So, this may not actually happen; although it would help if a lot of people really do want it. :-)

We are also interested to hear from Sun's network kernel engineers, to hear how doable this really is.

thanks,

Brendan
[Cc'ing networking-discuss, I suspect followups should go to dtrace-discuss, however]

Brendan Gregg - Sun Microsystems wrote:

> G'Day Folks,
>
> I've just created a website to flesh out ideas for a DTrace Network Provider.
> Let us know what you think.
>
> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/
>
> I'm not currently writing this provider; this began as a discussion with
> Adam about what a network provider could do, which we thought would be
> appropriate to discuss on dtrace-discuss. So, this may not actually happen;
> although it would help if a lot of people really do want it. :-)

+1, I for one certainly want it :)

A couple of questions.

What are the intents regarding ICMP, SCTP etc?

What are the semantics of these where ipf is involved? I assume that a packet blocked by ipf will not fire the probes, for obvious reasons. Would ipf gain probes for packets being dropped/being nat'd, etc?

> We are also interested to hear from Sun's network kernel engineers, to
> hear how doable this really is.
>
> thanks,
>
> Brendan
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org
Brendan Gregg - Sun Microsystems
2006-Sep-19 02:24 UTC
[dtrace-discuss] DTrace Network Provider
On Mon, Sep 18, 2006 at 09:31:48PM -0400, Richard Lowe wrote:

> [Cc'ing networking-discuss, I suspect followups should go to
> dtrace-discuss, however]
>
> Brendan Gregg - Sun Microsystems wrote:
> > G'Day Folks,
> >
> > I've just created a website to flesh out ideas for a DTrace Network
> > Provider. Let us know what you think.
> >
> >   http://www.opensolaris.org/os/community/dtrace/NetworkProvider/
> >
> > I'm not currently writing this provider; this began as a discussion with
> > Adam about what a network provider could do, which we thought would be
> > appropriate to discuss on dtrace-discuss. So, this may not actually happen;
> > although it would help if a lot of people really do want it. :-)
>
> +1, I for one certainly want it :)
>
> A couple of questions.
>
> What are the intents regarding ICMP, SCTP etc?

They should be included!

> What are the semantics of these where ipf is involved? I assume that a
> packet blocked by ipf will not fire the probes, for obvious reasons.
> Would ipf gain probes for packets being dropped/being nat'd, etc?

Hmm, I hadn't thought of ipf. It would be great, for troubleshooting, to see some ipf probes... :)

Brendan
> G'Day Folks,
>
> I've just created a website to flesh out ideas for a DTrace Network Provider.
> Let us know what you think.

Way cool. This would take the place of much of what I use snoop for. Rarely do I care about data that would not be available through this.

What happens when...

- A packet comes in with an 802.1Q tag
-- And a VLAN interface for the corresponding VLAN is plumbed
-- And a VLAN interface for the corresponding VLAN is not plumbed
- A packet comes in with a different ethertype
-- e.g. Jumbo packet and jumbo is not enabled
-- LLT (Veritas Cluster interconnect)
- A packet comes in on an interface that is not plumbed (LLT interfaces are commonly not plumbed)

I look forward to seeing this move forward.

Mike

This message posted from opensolaris.org
Mike Gerdts wrote:

>> G'Day Folks,
>>
>> I've just created a website to flesh out ideas for a DTrace Network Provider.
>> Let us know what you think.
>
> Way cool. This would take the place of much of what I use snoop for. Rarely
> do I care about data that would not be available through this.

Yes, and I think the power of the D programming language will make controlling the output a lot more flexible than snoop.

Chip
hi Brendan

looks great! a few thoughts:

- i was trying to figure out how ipv6 would fit into things, and i guess there's a few issues here. one that occurs to me is the way the ip-related probes pass back an ipv4-specific pointer. i wonder, would another way of getting the packet data back to the probe consumers be to pass back an array as arg1, and have a set of indices defined such that if i want arg1[ETHER] i get a pointer to the ether header that i can parse etc. this would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure about the feasibility of this from the provider side though. i guess another issue with v6 is getting v6 addresses to work as predicates.

- maybe we could augment the instrumentation of the MIB probes to pass back packet data too, this might be useful?

- to investigate performance bottlenecks in depth, combining this work with instrumentation of the Packet Event Framework would perhaps be useful. also, i imagine the Clearview and Nemo folks would have suggestions on instrumenting loopback and adding probes to gld.

- can you think of a way to "pick up" an individual packet and follow it through the stack? i'm thinking a DTrace script that set probes at all the interesting points, then sets variables that uniquely identify the packet at gld, and can then subsequently be used at the other probe points as predicates. i guess this might require "decoding" the packet at each level, in the sense of providing pointers into each protocol header, but it might be worthwhile.

alan

This message posted from opensolaris.org
Alan Maguire writes:

> - i was trying to figure out how ipv6 would fit into things, and i guess
> there's a few issues here. one that occurs to me is the way the ip-related
> probes pass back an ipv4-specific pointer. i wonder, would another way of
> getting the packet data back to the probe consumers be to pass back an
> array as arg1, and have a set of indices defined such that if i want
> arg1[ETHER] i get a pointer to the ether header that i can parse etc. this
> would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure
> about the feasibility of this from the provider side though. i guess
> another issue with v6 is getting v6 addresses to work as predicates.

Only one of those indices could be non-NULL at a time, so that doesn't seem like a good change to me. I think it makes tracing a bit too error-prone.

Having separate probe points for v4 and v6, each with its own parameters, might make more sense, even if there are some commonalities at the transport layer.

Plus, SCTP is notably absent.

I think a bigger issue here is getting lower-level details. The existing proposal seems to center around the sockets API model, but I'm not sure that's really the right direction to go. I think it'd be truer to the dtrace philosophy if it were to expose all the internal state for these protocols. That means having probes that represent IP's forwarding and fanout actions (including drop and ICMP error generation); probes that reflect TCP's window handling (including out-of-window, overlapping segment trimming, and ack generation), state machine, and timers; probes for IGMP activity; probes for sockfs interaction, and so on.

> - maybe we could augment the instrumentation of the MIB probes to pass back
> packet data too, this might be useful?

Indeed. The current design of the MIB probe code looks like it would make this hard to do, but it'd be valuable just the same.

> - can you think of a way to "pick up" an individual packet and follow it
> through the stack? i'm thinking a DTrace script that set probes at all the
> interesting points, then sets variables that uniquely identify the packet
> at gld, and can then subsequently be used at the other probe points as
> predicates. i guess this might require "decoding" the packet at each level,
> in the sense of providing pointers into each protocol header, but it might
> be worthwhile.

Having a "dblk event" provider would likely be very helpful (and might even allow us to retire the existing STREAMS flow tracing), but there are a lot of complications involved here:

- What does dupb() look like at the dtrace probe level? How about copyb()? What happens when someone does allocb(), copies the data, and then does freeb() (as with IPsec)?

- How do you express interest in a particular dblk? I could imagine having something like "follow_dblk(dblk_t *)" and "abandon_dblk(dblk_t *)" added to the dtrace programming model, and having a set of "dblk:<module>:<function>:<name>" probes that fire on the followed dblk(s). Is that expressive enough?

- Is there perhaps a more general principle at play here? Instead of a dblk-follower, should we have an allocated-and-possibly-ref-counted-object-follower?

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
James Carlson wrote:

> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

Agreed. Even if the proposal only deals with the socket API mode of operation, it still does not capture some interactions. For example, I think the tcp::connect-established will not fire based on the proposal if an app does a simultaneous connect (active connect from both sides). This may not be common, but we got a bug report some time back from a customer who has problems in exactly this operation.

> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

One obvious reason why a provider needs to expose the above internal data of protocols is in the proposal web page. The "Solved Problems" list has "Why are outbound connections slow?" but I don't think the proposal can actually give out the answer. The proposal can tell a user that the throughput is low. But why is it low? It may be that the other end is slow in receiving so that TCP is in a zero window probing mode. Or it may be that there is a lot of congestion and packets are being dropped. Or it may be that the driver has a bug such that some packets are stuck in the driver's send queue for a long time (need the ability to trace a packet to detect this). Or it may be that there are a lot of checksum errors (it is not clear exactly the position of the probe in the current proposal to determine if the tcp::receive probe is fired after the checksum is verified or not). Or it may be that there is a bug in the sockfs/TCP interaction such that the data is stuck at the sockfs layer and cannot reach TCP.

So to answer the question "Why are outbound connections slow?", we really need to check the whole stack, from sockfs down to the driver. Simple probes like tcp::send, tcp::receive, ip::send, ip::receive, ... cannot tell us the reason. If the net provider only has these simple probes, I am wondering if we really need a separate provider instead of just using sdt and putting well positioned static probes. Are there some guidelines of when we need a provider versus just using sdt?

-- 
K. Poon.
kacheong.poon at sun.com
Kacheong Poon wrote:

>> That means having probes that represent IP's forwarding and fanout
>> actions (including drop and ICMP error generation); probes that
>> reflect TCP's window handling (including out-of-window, overlapping
>> segment trimming, and ack generation), state machine, and timers;
>> probes for IGMP activity; probes for sockfs interaction, and so on.
>
> One obvious reason why a provider needs to expose the above
> internal data of protocols is in the proposal web page. The
> "Solved Problems" list has "Why are outbound connections slow?"
> but I don't think the proposal can actually give out the answer.
> The proposal can tell a user that the throughput is low. But why
> is it low?

The above requirement applies to IP forwarding throughput as well. I had sent email to Brendan about this, and am waiting for a response. Given a Solaris box that is configured to be a forwarder, I want to find out how long it takes a packet to be forwarded out onto the wire from the time it is received (this would mean time spent in IP and in network driver code). It also would help to do performance bottleneck analysis. Will the DTrace Network Provider be able to provide this info, or is another tool more appropriate for this form of analysis?

> It may be that the other end is slow in receiving
> so that TCP is in a zero window probing mode. Or it may be that
> there is a lot of congestion and packets are being dropped. Or
> it may be that the driver has a bug such that some packets are
> stuck in the driver's send queue for a long time (need the ability
> to trace a packet to detect this). Or it may be that there are a
> lot of checksum errors (it is not clear exactly the position of
> the probe in the current proposal to determine if the tcp::receive
> probe is fired after the checksum is verified or not). Or it may
> be that there is a bug in the sockfs/TCP interaction such that
> the data is stuck at the sockfs layer and cannot reach TCP.
>
> So to answer the question "Why are outbound connections slow?",
> we really need to check the whole stack, from sockfs down to the
> driver. Simple probes like tcp::send, tcp::receive, ip::send,
> ip::receive, ... cannot tell us the reason. If the net provider
> only has these simple probes, I am wondering if we really need a
> separate provider instead of just using sdt and putting well positioned
> static probes. Are there some guidelines of when we need a provider
> versus just using sdt?
G'Day Alan,

On Wed, 20 Sep 2006, Alan Maguire wrote:

> hi Brendan
>
> looks great! a few thoughts:
>
> - i was trying to figure out how ipv6 would fit into things, and i
> guess there's a few issues here. one that occurs to me is the way the
> ip-related probes pass back an ipv4-specific pointer. i wonder, would
> another way of getting the packet data back to the probe consumers be
> to pass back an array as arg1, and have a set of indices defined such
> that if i want arg1[ETHER] i get a pointer to the ether header that i
> can parse etc. this would give consumers a simple test for ipv6 - is
> arg1[IPv6] NULL? not sure about the feasibility of this from the
> provider side though. i guess another issue with v6 is getting v6
> addresses to work as predicates.

Yes, there are indeed a few issues. :)

Interesting - having arg1[name] would be useful to allow easy extension to other headers, such as ARP, ICMP, SCTP, etc. The way I had it in mind was: args[1] has members for all IP info (either v4 or v6), NULL if not used, and a programmer would write,

	net:ip::send /args[2]->ip_ver == 4/ { /* IPv4 processing */ }

	net:ip::send /args[2]->ip_ver == 6/ { /* IPv6 processing */ }

	net:ip::send { /* Process both */ }

... but my idea may not scale as well to other headers.

> - maybe we could augment the instrumentation of the MIB probes to
> pass back packet data too, this might be useful?

Yes! - the mib probes are already fairly well placed, so it would be nice to use them further.

> - to investigate performance bottlenecks in depth, combining this work
> with instrumentation of the Packet Event Framework would perhaps be
> useful. also, i imagine the Clearview and Nemo folks would have
> suggestions on instrumenting loopback and adding probes to gld.

Instrumenting the PEF will be interesting; but an internal Sun networking component wouldn't be the first place I imagine customers would want to trace. First up is standard RFC* style probes, tcp::send, tcp::receive, tcp::connection-established, etc; then comes Sun internals. I haven't looked at instrumenting loopback or gld yet - so I'd be interested to hear from the Clearview/Nemo folks!

> - can you think of a way to "pick up" an individual packet and follow
> it through the stack? i'm thinking a DTrace script that set probes at
> all the interesting points, then sets variables that uniquely identify
> the packet at gld, and can then subsequently be used at the other probe
> points as predicates. i guess this might require "decoding" the packet
> at each level, in the sense of providing pointers into each protocol
> header, but it might be worthwhile.

There should be ways to trace it through parts of the stack, and there are some unique identifiers available; for example, the address of the struct conn_s can serve as a connection identifier after it has been established (eg, ipcl_classify_v4() in ip_tcp_input()). To go entirely through the stack from ethernet received to ethernet sent is going to be some work; but again, I'd like to see this done, but as with the DTrace io provider - it's not the first thing that is needed.

In fact - that actually sounds like a provider that would help internal Sun engineers the most; so what we have on our hands is two providers:

	net:::		Network Provider for customers, RFC* generic details,
			similar to the io provider. Bytes by IP, bytes by port,
			connect time...

	net-priv:::	Network Provider for Sun engineers, private Sun internals.
			layer by layer tracing, IP classifier internals...

cheers,

Brendan
> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

No -- wrong. You are confusing the philosophies of DTrace (zero disabled probe effect, absolute safety, systemic instrumentation etc.) with the motivation for providing Stable providers. Stable providers should export semantics _not_ implementation; if you want to export the implementation, go right ahead -- with an SDT provider of Private stability. (Indeed, we took great care to define our own stability taxonomy to allow exactly this; such a provider needn't be documented or ARC'd.) But understand that Stable providers are intended for those _using_ the system (e.g., Solaris users) not those _implementing_ the system (e.g., you). The vast majority of problems that users have with networking are not related to protocol intricacies like window handling -- they are due to vast amounts of unnecessary work. (As in, "Why is our monitoring script causing three quarters of the traffic!?" or "We shouldn't even be talking to that machine -- that project was supposed to have been decommissioned six weeks ago!")

More than anything, what we (or at least I) learned from DTrace is that if you want to get big wins, you don't make it incrementally faster to do existing work -- you eliminate work entirely by addressing its source higher in the stack of abstraction. It is easiest to find unnecessary work and eliminate it when one has Stable providers that are crisp, simple, and export the understood semantics of the system -- like our existing sched, proc and io providers.

I believe that Brendan's proposal is very much in this mold, and therefore I believe that it is the right direction for a Stable networking provider -- but that should not deter you from adding Private SDT providers to export whatever implementation details you need when instrumenting the system.

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
Brendan Gregg wrote:

> Instrumenting the PEF will be interesting; but an internal Sun networking
> component wouldn't be the first place I imagine customers would want to
> trace. First up is standard RFC* style probes, tcp::send, tcp::receive,
> tcp::connection-established, etc; then comes Sun internals.

It is unclear what is meant by "standard RFC* style probes." Does it mean an API style? But there is no RFC for a standard API as IETF does not standardize APIs. Could you please elaborate on that?

In the proposal, it says that those tcp::send, ..., are placed "close to IP interface." It is unclear what that means, especially as args[2] is about IP header info. For example, a TCP segment can be sent through a tunnel. So what is the IP header info in args[2]? What is meant by "close to?" And does tcp::send include sending ACKs?

The above are examples of why a probe like tcp::send can be confusing. It is too general a concept of tcp sending something.

> net:::	Network Provider for customers, RFC* generic details, similar to
> 	the io provider. Bytes by IP, bytes by port, connect time...

What is the scope of this provider? For example, the current proposal does not answer "Why are outbound connections slow?" So if there are some problems which we assume that customers are interested to solve, could you please list them here?

-- 
K. Poon.
kacheong.poon at sun.com
G'Day James,

On Wed, 20 Sep 2006, James Carlson wrote:
[...]

> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.

So long as internal state means RFC standard and well understood state, and not Sun's private implementation state.

> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

I'd question how many of these were simply useful to know had occurred, in which case, mib or kstat should be tracking them (if not already).

> > - can you think of a way to "pick up" an individual packet and follow
> > it through the stack? i'm thinking a DTrace script that set probes at
> > all the interesting points, then sets variables that uniquely
> > identify the packet at gld, and can then subsequently be used at the
> > other probe points as predicates. i guess this might require
> > "decoding" the packet at each level, in the sense of providing
> > pointers into each protocol header, but it might be worthwhile.
>
> Having a "dblk event" provider would likely be very helpful (and might
> even allow us to retire the existing STREAMS flow tracing), but there
> are a lot of complications involved here:

Sure, which would be part of a private provider for Sun engineers.

> - What does dupb() look like at the dtrace probe level? How about
>   copyb()? What happens when someone does allocb(), copies the
>   data, and then does freeb() (as with IPsec)?
>
> - How do you express interest in a particular dblk? I could imagine
>   having something like "follow_dblk(dblk_t *)" and
>   "abandon_dblk(dblk_t *)" added to the dtrace programming model,
>   and having a set of "dblk:<module>:<function>:<name>" probes that
>   fire on the followed dblk(s). Is that expressive enough?
>
> - Is there perhaps a more general principle at play here? Instead
>   of a dblk-follower, should we have an allocated-and-possibly-ref-
>   counted-object-follower?

How many customers want this?

Sure, various Sun engineers want to track the code path that a packet takes through the stack, with timing details. And that's a different provider, a private provider that meets Sun engineers' needs. (I would have thought that fbt already solves most of this, with some clever coding).

The network provider for customers should focus on stable RFC defined and well understood semantics. tcp::send, tcp::receive, tcp::connection-established, etc. That's not to say that other functionality like you describe doesn't exist; it's to say that the focus is a simple and stable interface, like that which the io provider has.

I think that more customers will know RFC793 than they will know your dblks.

Brendan
G'Day Kacheong,

On Thu, 21 Sep 2006, Kacheong Poon wrote:

> Brendan Gregg wrote:
>
> > Instrumenting the PEF will be interesting; but an internal Sun networking
> > component wouldn't be the first place I imagine customers would want to
> > trace. First up is standard RFC* style probes, tcp::send, tcp::receive,
> > tcp::connection-established, etc; then comes Sun internals.
>
> It is unclear what is meant by "standard RFC* style probes."
> Does it mean an API style? But there is no RFC for a standard
> API as IETF does not standardize APIs. Could you please
> elaborate on that?

Show me in the RFCs where dblks are discussed. The provider proposal uses the stable terminology from the RFCs, and events from the RFCs, which customers should find easy to follow.

> In the proposal, it says that those tcp::send, ..., are placed
> "close to IP interface." It is unclear what that means,
> especially as args[2] is about IP header info.

I was trying not to describe the provider by actually writing the code.

> For example, a
> TCP segment can be sent through a tunnel. So what is the IP
> header info in args[2]?

The tunnel IP. People can tunnel traffic over SSH anyway, would you have us add USDT probes into sshd for details there?

> What is meant by "close to?"

For TCP that would be code that is closer to, say, gld than the application.

> And does tcp::send include sending ACKs?

tcp::send does include sending ACKs.

> The above are examples of why a probe like tcp::send can be
> confusing. It is too general a concept of tcp sending
> something.

Sure, there is no one place to put a tcp::send probe in tcp.c code; but nor are there hundreds.

> > net:::	Network Provider for customers, RFC* generic details, similar to
> > 	the io provider. Bytes by IP, bytes by port, connect time...
>
> What is the scope of this provider? For example, the current
> proposal does not answer "Why are outbound connections slow?"

Quoting from the website (maybe you didn't get this far?)

"Why are outbound connections slow? (Connect latency, 1st byte latency, throughput latency)"

These metrics are solvable from the proposal.

> So if there are some problems which we assume that customers
> are interested to solve, could you please list them here?

How about this to start with,

 1. New inbound TCP connections by port
 2. New inbound TCP connections by source IP address
 3. Inbound TCP packets by local port
 4. Inbound TCP packets by remote address
 5. Inbound TCP packets by remote address, for a specific port, eg: port 80 (HTTP)
 6. Sent TCP bytes by local port
 7. TCP bytes by address and port
 8. TCP bytes by address and port, per second
 9. Detect TCP connect() scan by IP address
10. Detect TCP SYN stealth scan by IP address
11. Detect TCP null scan by IP address
12. Detect TCP FIN stealth scan by IP address
13. Detect TCP Xmas scan by IP address
14. Network Utilization by TCP Port
15. Connect Latency
16. 1st Byte Latency
17. Throughput Latency
18. TCP Saturation Drops by IP address

direct from the website.

Brendan
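For illustration only, two of the items above (4 and 15) might look roughly like this in D, assuming the proposed probe names, an args[2] carrying IP addresses, and a stable connection id in arg0; all of these names and argument layouts are strawman assumptions, since the provider does not exist yet:

	/* 4. Inbound TCP packets by remote address (ip_saddr is a guessed member name) */
	net:tcp::receive
	{
		@inpkts[args[2]->ip_saddr] = count();
	}

	/* 15. Connect latency: SYN sent to connection established, keyed by arg0 */
	net:tcp::connect-request
	{
		constart[arg0] = timestamp;
	}

	net:tcp::connect-established
	/constart[arg0]/
	{
		@connlat["connect latency (ns)"] = quantize(timestamp - constart[arg0]);
		constart[arg0] = 0;
	}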
James Carlson wrote:

> Alan Maguire writes:
>> - i was trying to figure out how ipv6 would fit into things, and i guess
>> there's a few issues here. one that occurs to me is the way the ip-related
>> probes pass back an ipv4-specific pointer. i wonder, would another way of
>> getting the packet data back to the probe consumers be to pass back an
>> array as arg1, and have a set of indices defined such that if i want
>> arg1[ETHER] i get a pointer to the ether header that i can parse etc. this
>> would give consumers a simple test for ipv6 - is arg1[IPv6] NULL? not sure
>> about the feasibility of this from the provider side though. i guess
>> another issue with v6 is getting v6 addresses to work as predicates.
>
> Only one of those indices could be non-NULL at a time, so that
> doesn't seem like a good change to me. I think it makes tracing a bit
> too error-prone.
>
> Having separate probe points for v4 and v6, each with its own
> parameters, might make more sense, even if there are some
> commonalities at the transport layer.
>
> Plus, SCTP is notably absent.
>
> I think a bigger issue here is getting lower-level details. The
> existing proposal seems to center around the sockets API model, but
> I'm not sure that's really the right direction to go. I think it'd be
> truer to the dtrace philosophy if it were to expose all the internal
> state for these protocols.
>
> That means having probes that represent IP's forwarding and fanout
> actions (including drop and ICMP error generation); probes that
> reflect TCP's window handling (including out-of-window, overlapping
> segment trimming, and ack generation), state machine, and timers;
> probes for IGMP activity; probes for sockfs interaction, and so on.

I disagree. As a user I have absolutely no interest in having to spend quality time with the RFCs to answer simple questions, much less spending time reading through and understanding the implementation. If I were interested in doing the latter, especially, I would be doing it now, using fbt.

-- Rich
G'Day Folks,

Lets summarise where we are at. So far I've got,

	net:tcp::accept-listen
	net:tcp::accept-established
	net:tcp::accept-closed
	net:tcp::accept-refused
	net:tcp::connect-request
	net:tcp::connect-established
	net:tcp::connect-closed
	net:tcp::drop-saturation
	net:tcp::receive
	net:tcp::send
	net:udp::receive
	net:udp::send
	net:ip::receive
	net:ip::send
	net:gld::receive
	net:gld::send

Plus other protocols as needed (ICMP, ARP, SCTP, etc). Their arguments are based on the RFCs.

Does anyone think that these shouldn't be done, or has a technical reason why they can't be done?

There have been dblk events and probes mentioned. Please list the 4-tuple probe names in mind. What would their arguments be? Please provide short examples of their use in scripts.

cheers,

Brendan
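As a rough sketch of their use in a script, and assuming (hypothetically) that args[2] carries the IP header info and args[3] the TCP header info with a local port member, new inbound connections could be summarised with something like:

	net:tcp::accept-established
	{
		/* tcp_lport and ip_saddr are guessed member names, not final */
		@byport[args[3]->tcp_lport] = count();
		@bysrc[args[2]->ip_saddr]   = count();
	}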
peter.memishian at sun.com
2006-Sep-21 09:12 UTC
[dtrace-discuss] Re: DTrace Network Provider
> Lets summarise where we are at. So far I've got,
>
> net:tcp::accept-listen
> net:tcp::accept-established
> net:tcp::accept-closed
> net:tcp::accept-refused
> net:tcp::connect-request
> net:tcp::connect-established
> net:tcp::connect-closed
> net:tcp::drop-saturation
> net:tcp::receive
> net:tcp::send
> net:udp::receive
> net:udp::send
> net:ip::receive
> net:ip::send
> net:gld::receive
> net:gld::send

[ Apologies if these questions were answered earlier. ]

Is there a reason the last two are `gld' rather than `link'? (Or are we explicitly excluding probing monolithic network devices like ce?) Also, when would `send' fire -- when the layer has been asked to send the packet, or when it's actually sent the packet to the next layer?

-- 
meem
>> The above are examples of why a probe like tcp::send can be
>> confusing. It is too general a concept of tcp sending
>> something.
>
> Sure, there is no one place to put a tcp::send probe in tcp.c code; but
> nor are there hundreds.

Maybe this should be 'tcp::sent'.

The TCP implementation decides when it should _try_ to send some of the usable data, and sometimes it manages to do some, sometimes not, sometimes it resends. Not much conceptual here. But at times, TCP manages to get some data one layer below and considers it sent ... but not acknowledged, then later acked.

So maybe

	tcp::sent     : suna was advanced, args: from seq, nbytes
	tcp::resent   : suna not advanced, args: from seq, nbytes
	tcp::acked    : acked pointer was advanced. args: seq, nbytes
	tcp::dupacked : acked pointer was not advanced. args: seq

I'd also like

	tcp::flowctrl : args on|off, waiting on a peer event to transmit.

Timings from sent to acked would give easy access to round trip time. flowctrl would help start to answer whether or not performance is limited by the peer (or at least something below TCP).

-r
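A rough D sketch of that round-trip-time idea, assuming the tcp::sent and tcp::acked probes above plus a connection identifier imagined here as arg0 (none of this is settled); acks that cover older segments will inflate the estimate:

	net:tcp::sent
	{
		lastsend[arg0] = timestamp;
	}

	net:tcp::acked
	/lastsend[arg0]/
	{
		@rtt["approx round trip time (ns)"] = quantize(timestamp - lastsend[arg0]);
		lastsend[arg0] = 0;
	}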
Bryan Cantrill writes:

> > I think a bigger issue here is getting lower-level details. The
> > existing proposal seems to center around the sockets API model, but
> > I'm not sure that's really the right direction to go. I think it'd be
> > truer to the dtrace philosophy if it were to expose all the internal
> > state for these protocols.
>
> No -- wrong. You are confusing the philosophies of DTrace (zero disabled
> probe effect, absolute safety, systemic instrumentation etc.) with the
> motivation for providing Stable providers. Stable providers should export
> semantics _not_ implementation; if you want to export the implementation,
> go right ahead -- with an SDT provider of Private stability.

Note that I said "internal state for these protocols." I did not say "export the implementation." I already expect sdt and fbt to do this part just fine. The two are very different concepts.

> (Indeed, we
> took great care to define our own stability taxonomy to allow exactly
> this; such a provider needn't be documented or ARC'd.) But understand
> that Stable providers are intended for those _using_ the system (e.g.,
> Solaris users) not those _implementing_ the system (e.g., you). The
> vast majority of problems that users have with networking are not related
> to protocol intricacies like window handling -- they are due to vast
> amounts of unnecessary work. (As in, "Why is our monitoring script
> causing three quarters of the traffic!?" or "We shouldn't even be talking
> to that machine -- that project was supposed to have been decommissioned
> six weeks ago!")

And I can certainly see a case for having both "simplified" (gross) trace points as well as tools to make it even simpler for users who don't care about the lower-level details. However, not exposing the protocol-defined state and activity in some stable form seems like a mistake to me. This activity is related to what these protocols do, and not related to how they're designed.

> More than anything, what we (or at least I) learned
> from DTrace is that if you want to get big wins, you don't make it
> incrementally faster to do existing work -- you eliminate work entirely by
> addressing its source higher in the stack of abstraction. It is easiest
> to find unnecessary work and eliminate it when one has Stable providers
> that are crisp, simple, and export the understood semantics of the
> system -- like our existing sched, proc and io providers.

And that's exactly the problem. We apparently disagree on what constitutes a crisp, simple provider in the context of networking. To me, failing to include TCP state transitions and special activity such as packet drops due to reasons _other_ than buffer overflow doesn't make the provider "crisp." It burns it to a crisp. It'd be like omitting sched:unix:setkpdq:enqueue or proc:unix:trap:fault from those providers because they're hard for casual users to understand. They are hard to understand, and almost certainly unimportant for the common sorts of questions that users may want to answer, but that doesn't seem to be a good argument to "dumb down" the interface to the point where it can't be used for non-trivial problem solving.

> I believe that Brendan's proposal is very much in this mold, and therefore
> I believe that it is the right direction for a Stable networking provider --
> but that should not deter you from adding Private SDT providers to export
> whatever implementation details you need when instrumenting the system.

Of course, but I think that still misses the point.
-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
Brendan Gregg writes:

> G'Day James,
>
> On Wed, 20 Sep 2006, James Carlson wrote:
> [...]
> > I think a bigger issue here is getting lower-level details. The
> > existing proposal seems to center around the sockets API model, but
> > I'm not sure that's really the right direction to go. I think it'd be
> > truer to the dtrace philosophy if it were to expose all the internal
> > state for these protocols.
>
> So long as internal state means RFC standard and well understood state,
> and not Sun's private implementation state.

OK, I think we're in sync on that.

> > That means having probes that represent IP's forwarding and fanout
> > actions (including drop and ICMP error generation); probes that
> > reflect TCP's window handling (including out-of-window, overlapping
> > segment trimming, and ack generation), state machine, and timers;
> > probes for IGMP activity; probes for sockfs interaction, and so on.
>
> I'd question how many of these were simply useful to know had occurred, in
> which case, mib or kstat should be tracking them (if not already).

There's indeed some overlap here. The sorts of questions I'd like to be able to answer are:

- this connection seems to have slow retransmits; why is the estimated RTT so large?

- there are a lot of misordered packets; is it possible that we're dropping packets with some sort of pattern?

- this one connection gets stuck in CLOSE_WAIT state; what event drove it there?

> > Having a "dblk event" provider would likely be very helpful (and might
> > even allow us to retire the existing STREAMS flow tracing), but there
> > are a lot of complications involved here:
>
> Sure, which would be part of a private provider for Sun engineers.

Yep; I agree it's a side conversation, and not directly relevant for a stable network provider.

> > - Is there perhaps a more general principle at play here? Instead
> >   of a dblk-follower, should we have an allocated-and-possibly-ref-
> >   counted-object-follower?
>
> How many customers want this?

Besides Sun engineers (I consider us to be customers of our own products ;-}), I suspect it's most useful for third party software designers -- those who create network drivers and applications.

> Sure, various Sun engineers want to track the code path that a packet
> takes through the stack, with timing details. And that's a different
> provider, a private provider that meets Sun engineers' needs. (I would
> have thought that fbt already solves most of this, with some clever
> coding).

I think that can be said of many of the other providers. It's just a matter of how ugly (and unmaintainable) you'd like your D code to become.

> The network provider for customers should focus on stable RFC defined
> and well understood semantics. tcp::send, tcp::receive,
> tcp::connection-established, etc. That's not to say that other
> functionality like you describe doesn't exist; it's to say that the focus
> is a simple and stable interface, like that which the io provider has.
>
> I think that more customers will know RFC793 than they will know
> your dblks.

Sure; as I said, it's a side-conversation. It was sparked by this previous comment:

> > - can you think of a way to "pick up" an individual packet and follow
> > it through the stack? i'm thinking a DTrace script that set probes at

I'm not sure how to do something like that without having some kind of low-level instrumentation.
Doing it at a high level (with a stable "packet" provider, perhaps) seems much less likely to me to be useful, as it's less likely to be related to application problems.

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
Richard Lowe writes:

> > That means having probes that represent IP's forwarding and fanout
> > actions (including drop and ICMP error generation); probes that
> > reflect TCP's window handling (including out-of-window, overlapping
> > segment trimming, and ack generation), state machine, and timers;
> > probes for IGMP activity; probes for sockfs interaction, and so on.
>
> I disagree. As a user I have absolutely no interest in having to spend
> quality time with the RFCs to answer simple questions, much less
> spending time reading through and understanding the implementation. If
> I were interested in doing the latter, especially, I would be doing it
> now, using fbt.

Adding this detail in does not mean that you're required to use it. It just makes it available.

Leaving this detail out means that others who do need it are stuck -- they have to depend on implementation artifacts (and follow those artifacts as they change in patches), or just give up.

-- 
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W  Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N  Fax +1 781 442 1677
hi folks

here's a possible scheme for TCP state machine probes (these states are based on the RFC 793-defined TCP state machine).

	net:tcp:state:closed
	net:tcp:state:listen
	net:tcp:state:syn_received
	net:tcp:state:fin_wait_1
	net:tcp:state:fin_wait_2
	net:tcp:state:closing
	net:tcp:state:established
	net:tcp:state:syn_sent
	net:tcp:state:close_wait
	net:tcp:state:time_wait
	net:tcp:state:last_ack

each will pass back the relevant conn_s and tcp_s structures. might also be good to pass back the previous tcp state, i.e. args are:

	prev_state, curr_state, conn_s, tcp_s

having these probes would allow scripts to do things such as watch connection state transitions (obviously). e.g:

	dtrace -n 'net:tcp:state:* { printf("tcp state machine transition from %d to %d\n", arg0, arg1); }'

and also examine connections that stay in particular states for a long time etc.

i guess there's some overlap between these probes and Brendan's suggested probes (e.g. "established" in this scheme == "connect-established"), but maybe it would make sense to split up the provider namespace into "protocol state" and "packet data" probes, i.e.

	net:tcp:state:
	net:tcp:packet:

with the former looking at state events, the latter at packet events (e.g. receive, drop)? would this be useful? (i think the common states users will be interested in are pretty intuitive from the RFC-derived names above, e.g. connection is listen'ing, established, closed). thoughts?

alan

This message posted from opensolaris.org
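Taking the suggested prev_state/curr_state/conn_s/tcp_s argument ordering at face value (it is only a sketch), time spent in each state could be summarised like this; the first clause fires before the second for the same probe, so the entry timestamp is read before it is overwritten:

	net:tcp:state:*
	/entered[arg2]/
	{
		/* arg0 is the previous state, arg2 the conn_s pointer used as a key */
		@statetime[arg0] = quantize(timestamp - entered[arg2]);
	}

	net:tcp:state:*
	{
		entered[arg2] = timestamp;
	}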
G'Day Roch,

On Thu, 21 Sep 2006, Roch wrote:

> > > The above are examples of why a probe like tcp::send can be
> > > confusing. It is too general a concept of tcp sending
> > > something.
> >
> > Sure, there is no one place to put a tcp::send probe in tcp.c code; but
> > nor are there hundreds.
>
> Maybe this should be 'tcp::sent'.
>
> The TCP implementation decides when it should _try_ to send some
> of the usable data, and sometimes it manages to do some, sometimes not,
> sometimes it resends. Not much conceptual here. But at
> times, TCP manages to get some data one layer below and
> considers it sent ... but not acknowledged, then later acked.

Sure; I hit this issue and thought it best that tcp::send be "we sent something with a TCP header" and tcp::receive be "we received something with a TCP header". Should matching on data packets be needed, the options were,

- Match on TCP length or other TCP flags

	net:tcp::send /args[3]->tcp_length > 0/

- Add other probes just for this,

	net:tcp::data-send
	net:tcp::data-receive

- Use the net:socket layer, where data is real data

On one hand, you could say that this idea will over match activity - tcp::send will match resends and tcp::receive duplicate packets, as we are looking at what TCP is doing to make the data I/O happen. However the io provider also over matches activity -- io::start isn't exactly equivalent to the application requesting data I/O, it is what the block I/O driver is doing to make the application I/O happen (which includes rounded up blocks, read-aheads and metadata from the previous file system layers). The io provider keeps it simple - we asked the disk to do something; it doesn't try to interpret whether that was required data or not (as we can do that from other layers).

> So maybe
>
> tcp::sent     : suna was advanced, args: from seq, nbytes
> tcp::resent   : suna not advanced, args: from seq, nbytes
> tcp::acked    : acked pointer was advanced. args: seq, nbytes
> tcp::dupacked : acked pointer was not advanced. args: seq

Cool - those algorithms look good. Would this still make sense as tcp::data-sent, tcp::data-resent, etc?

> I'd also like
>
> tcp::flowctrl : args on|off, waiting on a peer event to transmit.

fair enough.

> Timings from sent to acked would give easy access to round
> trip time.

Exactly - I ran out of time to write that example by matching on TCP flags from args[3].

> flowctrl would help start to answer whether or not
> performance is limited by the peer (or at least something below
> TCP).

Yes!

Brendan
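A tiny sketch of the first option, using the strawman tcp_length member of args[3] to separate data segments from segments carrying no payload (such as pure ACKs):

	net:tcp::send
	{
		@segs[args[3]->tcp_length > 0 ? "data" : "no payload (e.g. pure ACK)"] = count();
	}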
Brendan Gregg wrote:

>> It is unclear what is meant by "standard RFC* style probes."
>> Does it mean an API style? But there is no RFC for a standard
>> API as IETF does not standardize APIs. Could you please
>> elaborate on that?
>
> Show me in the RFCs where dblks are discussed. The provider proposal uses
> the stable terminology from the RFCs, and events from the RFCs, which
> customers should find easy to follow.

I guess you misunderstood my question. I was not asking about dblk tracing. I was asking about what is really meant by "standard RFC* style probes." For example, the proposal does not track congestion/send/receive windows. They are well defined in the TCP standard and are very useful in telling a user what is really going on. OTOH probes like tcp::drop-saturation are not in any TCP standard.

>> In the proposal, it says that those tcp::send, ..., are placed
>> "close to IP interface." It is unclear what that means,
>> especially as args[2] is about IP header info.
>
> I was trying not to describe the provider by actually writing the code.

I guess it is better to understand the issues first before actually trying to write the code...

>> For example, a
>> TCP segment can be sent through a tunnel. So what is the IP
>> header info in args[2]?
>
> The tunnel IP. People can tunnel traffic over SSH anyway, would you have
> us add USDT probes into sshd for details there?

I don't understand this comment. We are discussing a provider for the kernel network stack. IP tunnel is a well defined concept. We are not discussing application protocol tracing. My question is about what is being provided in the args[] list. The list should be well defined. Otherwise, how can a user write scripts using them? This is exactly my suggestion above: we need to understand the issues first.

> Quoting from the website (maybe you didn't get this far?)
>
> "Why are outbound connections slow? (Connect latency, 1st byte latency,
> throughput latency)"
>
> These metrics are solvable from the proposal.

This was exactly my question (I quoted the above question in my email). I don't believe the proposal can answer the why part. It can tell a user that something is slow, but not why. Finding the latency does not tell us why the latency is there...

> How about this to start with,
>
>  1. New inbound TCP connections by port
>  2. New inbound TCP connections by source IP address
>  3. Inbound TCP packets by local port
>  4. Inbound TCP packets by remote address
>  5. Inbound TCP packets by remote address, for a specific port, eg: port 80 (HTTP)
>  6. Sent TCP bytes by local port
>  7. TCP bytes by address and port
>  8. TCP bytes by address and port, per second
>  9. Detect TCP connect() scan by IP address
> 10. Detect TCP SYN stealth scan by IP address
> 11. Detect TCP null scan by IP address
> 12. Detect TCP FIN stealth scan by IP address
> 13. Detect TCP Xmas scan by IP address
> 14. Network Utilization by TCP Port
> 15. Connect Latency
> 16. 1st Byte Latency
> 17. Throughput Latency
> 18. TCP Saturation Drops by IP address
>
> direct from the website.

It is a good idea to have the above list. The next step in defining a network provider is to understand whether it is really needed. For example, the above questions can simply be answered by existing tools like snoop. If we really have to use dtrace, I believe a set of SDT probes is sufficient. So do we need a network provider?
OTOH, if we can look beyond the above list of questions, it may be possible to abstract things out and implement a provider which is more fundamental than what snoop can provide.

-- 
K. Poon.
kacheong.poon at sun.com
On Thu, Sep 21, 2006 at 12:10:05AM +0800, Kacheong Poon wrote:

> ... If the net provider
> only has these simple probes, I am wondering if we really need a
> separate provider instead of just using sdt and put well positioned
> static probes. Are there some guidelines of when we need a provider
> versus just using sdt?

The provider Brendan is proposing will be based on SDT -- just like the many other providers such as io, proc, and sysinfo. Like those other providers it will have a different name and will have higher stability.

In general, SDT probes should be private probes that may be useful for developers, but that end users probably shouldn't rely on. We should make a new provider (based on SDT) when developing coherent, stable, documented instrumentation for a subsystem for which the target is the Solaris user -- as is the case with Brendan's proposal.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Brendan,

> If the network engineers want to trace code path for their own needs,
> then do so with fbt and add some sdt as needed.

That answers my question regarding using dtrace to analyse performance bottlenecks in networking.

> But don't assume that customers care most about seeing code path,
> or have read the 50,000+ lines that make up tcp.c and ip.c. At this point,

Sun engineers who work in the Solaris kernel are customers who would want to use the DTrace Network Provider to track the above code paths (to analyse performance bottlenecks and latencies). There will be Solaris kernel developers in the OpenSolaris community in future who would need the same.

Anyway, I am done trying to justify whether the feature I requested is important or not, and how many customers want it. You seem to have some quite strong opinions about what Solaris networking engineers need.

Thanks,
Sangeeta
peter.memishian at sun.com wrote:

> > Lets summarise where we are at. So far I've got,
> >
> > net:tcp::accept-listen
> > net:tcp::accept-established
> > net:tcp::accept-closed
> > net:tcp::accept-refused
> > net:tcp::connect-request
> > net:tcp::connect-established
> > net:tcp::connect-closed
> > net:tcp::drop-saturation
> > net:tcp::receive
> > net:tcp::send
> > net:udp::receive
> > net:udp::send
> > net:ip::receive
> > net:ip::send
> > net:gld::receive
> > net:gld::send
>
> [ Apologies if these questions were answered earlier. ]
>
> Is there a reason the last two are `gld' rather than `link'? (Or are we
> explicitly excluding probing monolithic network devices like ce?) Also,
> when would `send' fire -- when the layer has been asked to send the
> packet, or when it's actually sent the packet to the next layer?

The 2nd element in the probe tuple is the module the probe resides within. To the best of my knowledge, this cannot be arbitrarily changed. (which doesn't answer how it's gld rather than dld, but...)

-- Rich
G'Day Alan,

On Thu, 21 Sep 2006, Alan Maguire wrote:

> hi folks
>
> here's a possible scheme for TCP state machine probes (these states are based
> on the RFC 793-defined TCP state machine).
>
> net:tcp:state:closed
> net:tcp:state:listen
> net:tcp:state:syn_received
> net:tcp:state:fin_wait_1
> net:tcp:state:fin_wait_2
> net:tcp:state:closing
> net:tcp:state:established
> net:tcp:state:syn_sent
> net:tcp:state:close_wait
> net:tcp:state:time_wait
> net:tcp:state:last_ack

Great! When I see this list, I immediately know what all of the probes mean - and I could start writing scripts straight away. This is what I had in mind when I said it should be RFC based (that and the arguments).

I'm not sure we should be changing the 3rd field - DTrace generally keeps that as the kernel function name (which I believe is the case from SDT). When I get a chance I'll add these to the list (probably change underscore to dash - also following DTrace conventions).

> each will pass back the relevant conn_s and tcp_s structures. might also
> be good to pass back the previous tcp state, i.e. args are:
>
> prev_state, curr_state, conn_s, tcp_s
>
> having these probes would allow scripts to do things such as watch connection
> state transitions (obviously). e.g:
>
> dtrace -n 'net:tcp:state:* { printf("tcp state machine transition from %d to %d\n", arg0, arg1); }'
>
> and also examine connections that stay in particular states for a long time etc.
>
> i guess there's some overlap between these probes and Brendan's
> suggested probes (e.g. "established" in this scheme == "connect-established"),
> but maybe it would make sense to split up the provider namespace into
> "protocol state" and "packet data" probes, i.e.
>
> net:tcp:state:
> net:tcp:packet:

ok, a 3rd field would give us that flexibility. without it we could use a prefix, state-closed, but that may get ugly. Someone may have a better idea to avoid the 3rd field?

I do think splitting up some of the states into client and server pairs will be extremely useful (connection-established/accept-established); as it immediately lets us focus on either client or server traffic.

> with the former looking at state events, the latter at packet events (e.g.
> receive, drop)? would this be useful? (i think the common states users will
> be interested in are pretty intuitive from the RFC-derived names above, e.g.
> connection is listen'ing, established, closed). thoughts?

Yep - I think they are indeed intuitive and useful, and with several terse examples of their usage we may hit on some must-have scripts.

cheers,

Brendan
On Thu, Sep 21, 2006 at 03:31:23PM -0400, Richard Lowe wrote:

> > Is there a reason the last two are `gld' rather than `link'? (Or are we
> > explicitly excluding probing monolithic network devices like ce?) Also,
> > when would `send' fire -- when the layer has been asked to send the
> > packet, or when it's actually sent the packet to the next layer?
>
> The 2nd element in the probe tuple is the module the probe resides
> within. To the best of my knowledge, this cannot be arbitrarily
> changed. (which doesn't answer how it's gld rather than dld, but...)

That's not true. If it were, all of:

	tcp
	udp
	ip

would have to be "ip", because that's where the protocol processing code all lives now in a post-FireEngine world.

Dan
hi Brendan

> I'm not sure we should be changing the 3rd field - DTrace generally keeps
> that as the kernel function name (which I believe is the case from SDT).

ah, right. if it was possible to reuse the third field, it'd be handy for wildcarding all the state probes/all the packet-related probes, though i guess as long as the probes are all prefixed with the same thing, we could wildcard net:tcp::state_*, right?

> When I get a chance I'll add these to the list (probably change underscore
> to dash - also following DTrace conventions).

excellent!

alan

This message posted from opensolaris.org
On Fri, Sep 22, 2006 at 02:29:18AM +0800, Kacheong Poon wrote:

> ... If
> we really have to use dtrace, I believe a set of SDT probes
> is sufficient. So do we need a network provider?

The network provider is a bunch of SDT probes. SDT _is_ the underlying instrumentation provider -- it just has a different name.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Roch wrote:

> So maybe
>
> tcp::sent     : suna was advanced, args: from seq, nbytes
> tcp::resent   : suna not advanced, args: from seq, nbytes

Do you mean tcp::sent is fired when TCP receives an ACK (otherwise suna won't matter)? I think it is a little bit difficult to explain the above. In the tcp::sent case, it means that TCP receives an ACK such that it can now free up those data. But how can the probe distinguish it from a retransmission?

Or do you mean they are fired when TCP calls into IP to send some data? So it is when TCP sends some data to IP. Then suna does not matter, and tcp::resent cannot be conditioned on "suna not advanced", as doing NewReno is a resend and suna is advanced in this case. I guess tcp::resent should fire when TCP does a retransmission.

We also need something like

	tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #, window size
	tcp::rst-sent: fires when a RST is sent, args: src and dst addrs, seq number

We don't need syn or fin sent as they consume 1 seq num and can be put into the tcp::send probe.

> tcp::acked    : acked pointer was advanced. args: seq, nbytes

This is when TCP receives an ACK segment, right? It is not about sending? But why is the ACK number not in the args? I think we should also have the window size and SACK options as args.

> tcp::dupacked : acked pointer was not advanced. args: seq

I think SACK options and dupack count should also be in the args.

> I'd also like
>
> tcp::flowctrl : args on|off, waiting on a peer event to transmit.

Flow control is always on :-) I guess what you mean is when TCP's send window is 0. Maybe it should be tcp::flow-stopped or tcp::zero-win (the latter is well defined). Or do you mean something beyond the TCP protocol flow control?

> Timings from sent to acked would give easy access to round
> trip time.
>
> flowctrl would help start to answer whether or not
> performance is limited by the peer (or at least something below
> TCP).

Zero window does not tell us anything below TCP. So maybe you are referring to other things in tcp::flowctrl. What are they?

-- 
K. Poon.
kacheong.poon at sun.com
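If a flow-control pair of events did exist (the probe name, the on/off flag in arg1 and the connection id in arg0 are all hypothetical here), time spent unable to transmit could be summed per connection along these lines:

	net:tcp::flowctrl
	/arg1 != 0/
	{
		stalled[arg0] = timestamp;
	}

	net:tcp::flowctrl
	/arg1 == 0 && stalled[arg0] != 0/
	{
		@stalltime[arg0] = sum(timestamp - stalled[arg0]);
		stalled[arg0] = 0;
	}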
Alan Maguire wrote:> net:tcp:state:closed > net:tcp:state:listen > net:tcp:state:syn_received > net:tcp:state:fin_wait_1 > net:tcp:state:fin_wait_2 > net:tcp:state:closing > net:tcp:state:established > net:tcp:state:syn_sent > net:tcp:state:close_wait > net:tcp:state:time_wait > net:tcp:state:last_ackState transition is certainly something very important. But I guess you can omit closed as it just means that a tcp_t is allocated. A fbt probe on the tcp_t creation function serves the same purpose.> each will pass back have the relevant conn_s and tcp_s structures. might also > be good to pass back the previous tcp state, i.e args are: > > prev_state, curr_state, conn_s, tcp_sPlus the TCP segment (in some states) which triggers the transition.> i guess there''s some overlap between these probes and Brendan''s > suggested probes (e.g. "established" in this scheme == "connect-established"), > but maybe it would make sense to split up the provider namespace into "protocol state" and "packet data" probes, i.e.> net:tcp:state: > net:tcp:packet:If the TCP segment which triggers the transition is passed back, there is no need to have a "packet" probe. It is more confusing to the user.> with the former looking at state events, the latter at packet events (e.g. > receive, drop)? would this be useful? (i think the common states users will > be interested in are pretty intuitive from the RFC-derived names above, e.g. > connection is listen''ing, established, closed). thoughts?I think we may not need a "drop" probe in TCP, unless it means checksum error (which is actually done in the IP level in our implementation). TCP only drops a segment if it does not pass certain checks. This means that a script can easily be written using the tcp::receive probe and check on the seq number and timestamps option. -- K. Poon. kacheong.poon at sun.com
On Fri, Sep 22, 2006 at 09:30:39AM +0800, Kacheong Poon wrote:> I think we may not need a "drop" probe in TCP, unless it means > checksum error (which is actually done in the IP level in our > implementation). TCP only drops a segment if it does not > pass certain checks. This means that a script can easily be > written using the tcp::receive probe and check on the seq number > and timestamps option. >What we need there is for the whole TCP/IP stack to start using ip_drop_packet() and expand its IPsec-only usage to the wider system. ip_drop_packet FBT probes have helped us a lot in debugging not only IPsec development but also IPsec *deployment*. (The original RFE that drove ipdrop was from a disgruntled punchin user who wanted to know how he FUBARed his config. This was before Paul''s extensive punchctl mods.) Dan
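ip_drop_packet() already exists in the kernel, so its fbt probes can be used
today without waiting for a new provider. A minimal sketch that counts drops
by the code path that reached ip_drop_packet() (fbt probe points and arguments
are private and unstable, so nothing beyond stack()/count() is assumed here):

    fbt::ip_drop_packet:entry
    {
        @drops[stack()] = count();
    }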
Brendan Gregg wrote:> - Match on TCP length or other TCP flags > net:tcp::send /args[3]->tcp_length > 0/Note that TCP header does not have a length field. The length of a segment is determined using the IP header. So it is simpler to write the probe if we exclude the length of a segment.> - Add other probes just for this, > net:tcp::data-send > net:tcp::data-receive > > - Use the net:socket layer, where data is real dataOne important thing a user wants to know is how TCP does the segmentation of data. So it is not that useful to only catch data sending in the socket layer. I think tcp::sent and tcp::resent are fine. We can then add the ack-sent and rst-sent as mentioned in my previous email.> On one hand, you could say that this idea will over match activity - > tcp::send will match resends and tcp::receive duplicate packets, as we > are looking at what TCP is doing to make the data I/O happen.I think it is simpler for tcp::sent to not match retransmission and have a dedicated tcp::resent probe. -- K. Poon. kacheong.poon at sun.com
Kacheong Poon writes:> Brendan Gregg wrote: > > > - Match on TCP length or other TCP flags > > net:tcp::send /args[3]->tcp_length > 0/ > > > Note that TCP header does not have a length field. The > length of a segment is determined using the IP header. > So it is simpler to write the probe if we exclude the > length of a segment.True ... but there should be a simple way to get at the payload (skipping past the options) and compute the payload length. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Dan McDonald wrote:

> What we need there is for the whole TCP/IP stack to start using
> ip_drop_packet() and expand its IPsec-only usage to the wider system.
> ip_drop_packet FBT probes have helped us a lot in debugging not only IPsec
> development but also IPsec *deployment*. (The original RFE that drove ipdrop
> was from a disgruntled punchin user who wanted to know how he FUBARed his
> config. This was before Paul''s extensive punchctl mods.)

I thought there''s an RFE for the above, and aren''t you the RE of it ;-)

-- 

K. Poon.
kacheong.poon at sun.com
It would be extremely useful to me if some of the TCP statistics presented by
netstat -s had probes associated with them, to allow event-driven tracing
analogous to that supported by (for example) the sched provider. It''s great to
look at those stats and see them incrementing, which might give a clue to
where your problem lies, but it would be better to be able to probe on them so
you could pinpoint which connection has the issue.

For example, a problem I look at on a weekly basis is which of several hundred
TCP connections on a machine is suffering retransmission of unacknowledged
octets. Currently this involves cloning a port on the switch, canning all the
traffic with snoop, and post-processing. A net:tcp::retransBytes probe would
make this a 10 second job, particularly when given access to the relevant
tcpinfo_t.

[If my interpretation of Brendan''s stuff is correct, I could see
retransmissions anyway (by probing sends and looking for old sequence numbers
being sent again on a particular connection), but it''d be hard work.]

This message posted from opensolaris.org
> It would be extremely useful to me if some of the TCP statistics presented
> by netstat -s had probes associated with them, to allow event-driven tracing
> analogous to that supported by (for example) the sched provider. It''s great
> to look at those stats and see them incrementing, which might give a clue to
> where your problem lies, but it would be better to be able to probe on them
> so you could pinpoint which connection has the issue.

All the data presented in "MIB" form should have mib probes associated with
them already.

Casper
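For the retransmission example above, that means something along these lines
should already work, assuming the tcpRetransSegs/tcpRetransBytes counters on
the build in question (arg0 of a mib probe is the amount by which the counter
is being incremented):

    mib:::tcpRetransSegs
    {
        @segs = count();
    }

    mib:::tcpRetransBytes
    {
        @bytes = sum(arg0);
    }

What the mib probes cannot easily tell you, as noted later in the thread, is
which connection the retransmission belonged to.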
Alan Maguire wrote:

> net:tcp:state:closed
> net:tcp:state:listen
> net:tcp:state:syn_received
> net:tcp:state:fin_wait_1
> net:tcp:state:fin_wait_2
> net:tcp:state:closing
> net:tcp:state:established
> net:tcp:state:syn_sent
> net:tcp:state:close_wait
> net:tcp:state:time_wait
> net:tcp:state:last_ack
>
> each will pass back have the relevant conn_s and tcp_s structures. might also
> be good to pass back the previous tcp state, i.e args are:
>
> prev_state, curr_state, conn_s, tcp_s

Is the curr_state identical to the last element of the probe name? If so, why
is it needed as an arg? If not, how do they differ?

- Jeremy
Kacheong Poon writes:
 > Brendan Gregg wrote:
 >
 > > - Match on TCP length or other TCP flags
 > >        net:tcp::send /args[3]->tcp_length > 0/
 >
 > Note that TCP header does not have a length field. The
 > length of a segment is determined using the IP header.
 > So it is simpler to write the probe if we exclude the
 > length of a segment.
 >
 > > - Add other probes just for this,
 > >        net:tcp::data-send
 > >        net:tcp::data-receive
 > >
 > > - Use the net:socket layer, where data is real data
 >
 > One important thing a user wants to know is how TCP does
 > the segmentation of data. So it is not that useful to
 > only catch data sending in the socket layer. I think
 > tcp::sent and tcp::resent are fine. We can then add
 > the ack-sent and rst-sent as mentioned in my previous email.
 >
 > > On one hand, you could say that this idea will over match activity -
 > > tcp::send will match resends and tcp::receive duplicate packets, as we
 > > are looking at what TCP is doing to make the data I/O happen.
 >
 > I think it is simpler for tcp::sent to not match retransmission
 > and have a dedicated tcp::resent probe.

I do think that the tcp provider should steer away from a streams/packets
provider and focus on sequence numbers and length. How about this for TCP
send:

	net:tcp::sending : new data posted in the tcp xmit buffer (from above)
	net:tcp::sent    : data was sent down for a first time (suna was advanced).
	net:tcp::acked   : data from xmit buffer can be freed (send completed)

I note we already have mib probes for retransmits and dup acks.

-r
> Is the curr_state identical to the lest element of the > probe name? If so, why is it needed as an arg?good point. i guess passing it back might be useful in the case where users utilize wildcard matching on the state probe, but then again the current state is also accessible via the tcp_s.tcp_state member.>> each will pass back have the relevant conn_s and tcp_s structures. might also >> be good to pass back the previous tcp state, i.e args are: >> >> prev_state, curr_state, conn_s, tcp_s > > > Plus the TCP segment (in some states) which triggers the > transition.that would be very useful i think. so how about probe arguments: prev_state, conn_s, tcp_s, (possibly null) tcp_info_t * ? alan This message posted from opensolaris.org
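A small sketch of how those arguments might be consumed, assuming the
net:tcp:state:* naming and the argument list above (all of this is still
hypothetical, so plain arg0 is used rather than translated args[]):

    net:tcp:state:*
    {
        /* arg0 is assumed to carry the previous TCP state */
        @transitions[arg0, probename] = count();
    }

i.e. a count of transitions keyed on (previous state, new state).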
Kacheong Poon writes: > > tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #, > window size > tcp::rst-sent: fires when a RST is sent, args: src and dst > addrs, seq number > > We don''t need syn or fin sent as they consume 1 seq num and > can be put into the tcp::send probe. I use this to cover the data path: net:tcp::sending : new data posted in the tcp xmit buffer (from above) net:tcp::sent : data was sent down for a first time (suna was advanced). net:tcp::acked : data from xmit buffer can be freed (send competed) I do consider that an incoming ACK completes the TCP sending action. > > > > tcp::acked : acked pointer was advanced. args: seq, nbytes > > > This is when TCP receives an ACK segment, right? It is not > about sending? But why is ACK number not in the args? Your right. I propose this fires when new data is acked, we can use the simple ACK number or the full segment of newly ACKed data. > I think > we should also have the window size and SACK options as args. > I figure all these probes would have a connection argument from which _I hope_ we can trace window information. How about a separate probe that fires when actual SACK info is exchanged ? > > > tcp::dupacked : acked pointer was not advanced. args: seq > So I later realized we had those MIBs for dupacks and restrans. Do we need anything more ? > > > > I''d also like > > > > tcp::flowctrl : args on|off wainting on a peer event to transmit. > > > Flow control is always on :-) I guess what you mean is when > TCP''s send window is 0. Maybe it should be tcp::flow-stopped > or tcp::zero-win (the latter is well defined). Or do you > mean something beyond the TCP protocol flow control? No you''re right. I''m thinking of a probe that fires when there is data to transmits but we need to wait for something from below; as in the peer to send either newly ACKed data or a window update packet before we can proceed. As for naming: tcp::flow : args on|off; waiting on a peer event to transmit. > > > > Timings from sent to acked would give easy access to round > > trip time. > > > > flowctrl would help start to answer whether or not > > performance is limited by peer (or at least something below > > TCP). > > > Zero window does not tell us anything below TCP. So maybe > you are referring to other things in tcp::flowctrl. What > are they? I''m not sure what Zero window refers to. Does it include the congestion window ? If we sent a full windows worth of data and receive no ACK, is that a zero window ? Or does it refer to receiving a packet with 0 window. That may be the point we need to cross to FBT/SDT. But if there is data in the xmit list and we''re waiting for the peer to respond to our previous transmits, I do think of this as being flow-stopped. Is this traceable ? -r > > > -- > > K. Poon. > kacheong.poon at sun.com > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
sowmini.varadhan at Sun.COM
2006-Sep-22 13:38 UTC
[dtrace-discuss] Re: DTrace Network Provider
> > I do think that the tcp provider should stear away from a > streams/packets provider and focus on sequence numbers and > length. How about this For TCP send: > > > net:tcp::sending : new data posted in the tcp xmit buffer (from above) > net:tcp::sent : data was sent down for a first time (suna was advanced). > net:tcp::acked : data from xmit buffer can be freed (send competed) > > I note we already have mib probes for retransmits and dup acks.that would require mib provider to provide info from the tcp/ip header associated with the retransmit/dup-ack? Currently afaict, the mib provider provides very little ancillary info. e.g., for dup-ack, we''d need the seq number that is being acked plus info about current tcp window for that connection? --Sowmini
With the putback of "packet filter hooks" (which will happen in a few days),
you will get many more probes regarding IPFilter. Unfortunately, those probes
are placed around packet filter hook callout entries, not inside IPFilter, so
you will know what is there at the IP->IPFilter interface, but you can''t know
how IPFilter deals with that packet. Perhaps this could be a good RFE.

BTW, I know this since I added those probes; I will post a short howto after
the putback. Darren Reed, who is the tech lead, can give more up-to-date
information.

Also IPsec, IPMP, etc... these add-on features should be traced. The MIB
provider offers the ability to access that network statistical data; it could
be a plus to this new network provider.

Thanks,

-Alex

This message posted from opensolaris.org
It would help if D could have:

    iphdr(ipha_t *)
    tcphdr(tcph_t *)

to pretty-print headers, just like iphdr/tcphdr in mdb, which would save many
engineers'' time.

Thanks,

-Alex

This message posted from opensolaris.org
This blog gives an example of how to spy on TCP state changes:

    http://blogs.sun.com/xltang/entry/spy_the_tcp_state

Hopefully the people in this thread will like it. The author is an ERI QE
engineer, but he has since left.

Thanks,

-Alex

This message posted from opensolaris.org
Yes, the current MIB probe implementation doesn''t bring in any tcp_t, conn_t,
etc. information. You can get them, but it needs smart coding and knowledge of
the details.

-Alex

This message posted from opensolaris.org
Casper.Dik at Sun.COM wrote:
>> It would be extremely useful to me if some of the TCP statistics presented
>> by netstat -s had probes associated with them, to allow event-driven tracing
>> analogous to that supported by (for example) the sched provider. It''s great
>> to look at those stats and see them incrementing, which might give a clue to
>> where your problem lies, but it would be better to be able to probe on them
>> so you could pinpoint which connection has the issue.
>
> All the data presented in "MIB" form should have mib probes associated with
> them already.
>
> Casper

Yes, but how do you determine what packet or connection or whatever triggered
the probe? For instance, you would like to get the tcp pointer for each of the
tcp* mib probes.

-- 
blu

"When I was a boy I was told that anybody could become President; I''m
beginning to believe it." - Clarence Darrow
----------------------------------------------------------------------
Brian Utterback - OP/N1 RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
Roch wrote:> I do think that the tcp provider should stear away from a > streams/packets provider and focus on sequence numbers and > length. How about this For TCP send:There are the send (snxt) and receive (rnxt) seq numbers. So I guess you meant to have a probe fired when they are changed? But when TCP sends a new segment, snxt is advanced. And when TCP receives an in order segment, rnxt is advanced. So it seems that the above is just tcp::sent and tcp::received with the exception that if a probe is only fired when rnxt is advanced, out of order segments will not immediately trigger a probe. So I think tcp::received is better.> net:tcp::sending : new data posted in the tcp xmit buffer (from above)Or should this be the sent probe of the layer above TCP? For example, we can have sockfs::sent.> net:tcp::sent : data was sent down for a first time (suna was advanced).I suppose you meant snxt instead of suna.> net:tcp::acked : data from xmit buffer can be freed (send competed)Or maybe tcp::ack-received, similar to tcp::ack-sent?> I note we already have mib probes for retransmits and dup acks.The current mib probes are very limited, no TCP info available. I think it is better to have dedicated probes for them. -- K. Poon. kacheong.poon at sun.com
Roch wrote:> > I think > > we should also have the window size and SACK options as args. > > > > I figure all these probes would have a connection argument > from which _I hope_ we can trace window information.If the probe is fired when TCP receives an ACK, the conn_s won''t tell us anything about the advertised window in the TCP header nor about SACK. If one of the args is the complete TCP header, we can certainly get info from it. But it is easier to write the D script if the info is handy.> How about a separate probe that fires when actual SACK info > is exchanged ?Since SACK info is exchanged in ACKs, so if we have tcp::ack-sent and tcp::ack-received, I guess the SACK info can be one of the args. Otherwise, two probes are fired for all ACKs sent and received with SACK info.> No you''re right. I''m thinking of a probe that fires when > there is data to transmits but we need to wait for something > from below; as in the peer to send either newly ACKed data > or a window update packet before we can proceed. As for > naming:If it is about the peer, then I guess it is the zero window. If it is about something below TCP, I am not sure what it is. IP does not flow control TCP currently. There is probably something in Crossbow which will do this. So I guess this lower layer flow control probe can be added at that time. But if we are restricting the network provider to standard TCP behavior, this lower layer flow control probe should not be in the network provider.> I''m not sure what Zero window refers to.The peer advertises a zero window so that TCP cannot send any new data.> Does it include the congestion window ?No, it is not about congestion window.> If we sent a full windows worth of data and receive no ACK, > is that a zero window ? Or does it refer to receiving a > packet with 0 window. That may be the point we need to > cross to FBT/SDT.I meant the latter. Now thinking about it, the two proposed probes tcp::sent and tcp::ack-received should be enough to detect both situation above. I guess we don''t really need a separate probe.> But if there is data in the xmit list and we''re waiting for > the peer to respond to our previous transmits, I do think of > this as being flow-stopped. Is this traceable ?This is traceable as tcp::sent is fired but there is no subsequent corresponding tcp::ack-received fires. -- K. Poon. kacheong.poon at sun.com
hi folks

another area we should perhaps cover in the net provider is routing table
updates, e.g.

    net:ip::route-add
    net:ip::route-delete

with relevant stable information culled from the associated ipif_s (interface
structure) and ire_s (internet routing entry) as args.

this might help diagnose routing problems such as route flapping.

alan

This message posted from opensolaris.org
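A sketch of the route-flapping case, assuming the proposed probe names (the
arguments are left out since they are still being discussed):

    net:ip::route-add,
    net:ip::route-delete
    {
        @routes[probename] = count();
    }

    tick-10sec
    {
        printa(@routes);
        trunc(@routes);
    }

i.e. print and reset a count of routing table changes every ten seconds.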
>another area we should perhaps cover into the net provider >is routing table updates, e.g. > >net:ip::route-add >net:ip::route-delete > >with relevant stable information culled from the associated ipif_s >(interface structure) and ire_s (internet routing entry) as args. > >this might help diagnose routing problems such as route flapping.While this is nice to have in dtrace, the "route monitor" command already provides this information. (And fairly easily allowed me to track down the removal of IPv6 routes from tunnels in builds 47/48) Casper
Casper.Dik at Sun.COM wrote:>> another area we should perhaps cover into the net provider >> is routing table updates, e.g. >> >> net:ip::route-add >> net:ip::route-delete >> > > While this is nice to have in dtrace, the "route monitor" command already > provides this information. >Without passing judgment specifically on this one capability, in general there is a good reason for adding functionality to DTrace that may be duplicated in other commands: to bring information into a D program that can be used in combination with other information, i.e. the firing of route-add or route-delete probe gets saved and used in the predicate of another probe clause. Chip
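As a sketch of that kind of combination, again assuming the proposed probes
exist (names are placeholders):

    net:ip::route-delete
    {
        /* remember when the routing table last lost an entry */
        last_change = timestamp;
    }

    net:tcp:state:established
    /last_change && timestamp - last_change < 1000000000/
    {
        /* connections established within 1s of a route removal */
        @established_after_change = count();
    }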
Kacheong Poon writes: > Roch wrote: > > > I do think that the tcp provider should stear away from a > > streams/packets provider and focus on sequence numbers and > > length. How about this For TCP send: > > > There are the send (snxt) and receive (rnxt) seq numbers. > So I guess you meant to have a probe fired when they are > changed? But when TCP sends a new segment, snxt is advanced. > And when TCP receives an in order segment, rnxt is advanced. > So it seems that the above is just tcp::sent and tcp::received > with the exception that if a probe is only fired when rnxt > is advanced, out of order segments will not immediately trigger > a probe. So I think tcp::received is better. > > I do think that receiving new data out of order need to fire a receive probe. Since sending has 3 stages, so do receiving: tcp::sent-new : new data was sent tcp::sent-retrans : already sent data was resent tcp::sent-done : sent data is freed, acked received and we''re done. tcp::received-new : new data was received from below (includes out of order) tcp::received-dup : already received data received again tcp::received-done : sent up toward application (we''re done) > > net:tcp::sending : new data posted in the tcp xmit buffer (from above) > > > Or should this be the sent probe of the layer above TCP? For > example, we can have sockfs::sent. I think that if new data is seen by tcp then it belong in a tcp probe. If we manage a sockfs buffer outside the view of TCP then we could have similar probes to handle that case. But since tcp does handle the pending buffer I still like tcp::sending : new data posted in tcp. And I still like the flow stop probes that fires when we have data in the xmit list and TCP refrains from sending waiting for an ACK or window update. For incoming, we could have similar probes for when we advertise a zero window or open it up. tcp::out-flow : started or stopped tcp::in-flow : started or stopped And we could want a probe that fires when tcp generate a dataless packet tcp::ctrl : pure ack, sack, window update, keepalive... -r > > -- > > K. Poon. > kacheong.poon at sun.com > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
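With the sending side staged like that, a system-wide new-versus-retransmitted
segment count would fall out of a couple of lines of D (a sketch only, since
the probe names above are still proposals):

    net:tcp::sent-new,
    net:tcp::sent-retrans
    {
        @segs[probename] = count();
    }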
i''ve not followed the entire thread but the following triggers would be useful
for myself, and i''m sure many others who want to observe network server
processes:

* socket bind (listen)
* incoming tcp handshake and data
* or udp data
* outgoing tcp data, or udp data
* tcp shutdown

it would be very useful if this were observable per thread, with a master
process handing off reply tasklets to worker threads.

an example task would be to accumulate histograms of the time between udp in
and udp out, per server thread.

i have DNS and RADIUS servers in mind, initially. secondarily, observing SMTP
and IMAP servers would be useful.

tariq
Brendan Gregg - Sun Microsystems
2006-Sep-25 21:45 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Sangeeta, On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: [...]> > But don''t assume that customers care most about seeing code path, > > or have read the 50,000+ lines that make up tcp.c and ip.c. > > At this point,Sun engineers who work in Solaris kernel are customers who > would want to use DTrace Network Provider to track the above code paths > ( to do analyse performance bottlenecks and latencies) There will be > Solaris kernel developers OpenSolaris community in future who would need > the same. > > Anyway, I am done trying to justify whether the feature I requested is > important or not, and how may customers want it. You seem to have some > quite strong opinions about what Soalris networking engineers need.Ok, so you still don''t understand what is happening. When I''m talking about customers, I''m talking about Solaris users - the system administrators, developers and staff who use Solaris servers and would like easy observability tools. The world isn''t made up of kernel network engineers or customers who care about kernel code path latencies. Instead there are Solaris users who run prstat or top from time to time, and would use a similar network summary tool if we provided one. I can''t see the average customer running top in one window and a kernel code path latency tool in another. In fact, I could count on one finger how many customers I''ve met who ran lockstat to profile the kernel, similar information to what you think is so important. I''ve already said that I want to measure code path latencies too, and that I think it should be part of a provider. However, since kernel network engineers understand the kernel network code, then they can get their own statistics from fbt and some sdt, and write their own private provider for their own pet needs. Are they also "customers" of DTrace? Sure, but customers who have the skills to satisfy their own private needs. This thread is about a network provider for all Solaris users. If it also serves the needs of the kernel network engineers, then great; but not at the expense of what the greater Solaris community will find useful. Brendan -- Brendan [CA, USA]
Dan McDonald wrote:
> On Thu, Sep 21, 2006 at 03:31:23PM -0400, Richard Lowe wrote:
>>> Is there a reason the last two are `gld'' rather than `link''? (Or are we
>>> explicitly excluding probing monolithic network devices like ce?) Also,
>>> when would `send'' fire -- when the layer has been asked to send the
>>> packet, or when it''s actually sent the packet to the next layer?
>>>
>> The 2nd element in the probe tuple is the module the probe resides
>> within. To the best of my knowledge, this cannot be arbitrarily
>> changed. (which doesn''t answer how it''s gld rather dld, but...)
>
> That''s not true. If it were, all of:
>
>	tcp
>	udp
>	ip
>
> would have to be "ip", because that''s where the protocol processing code all
> lives now in a post-FireEngine world.
>

While I''m entirely willing to accept that I may be incorrect, I''d very much
like to hear how it''s intended this be done.

-- Rich
> This thread is about a network provider for all Solaris users. If it also > serves the needs of the kernel network engineers, then great; but not at > the expense of what the greater Solaris community will find useful.it would be a very good design if it could serve both roles.
Darren.Reed at Sun.COM
2006-Sep-25 23:00 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:>G''Day Sangeeta, > >On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: >[...] > >>>But don''t assume that customers care most about seeing code path, >>>or have read the 50,000+ lines that make up tcp.c and ip.c. >>> >>At this point,Sun engineers who work in Solaris kernel are customers who >>would want to use DTrace Network Provider to track the above code paths >>( to do analyse performance bottlenecks and latencies) There will be >>Solaris kernel developers OpenSolaris community in future who would need >>the same. >> >>Anyway, I am done trying to justify whether the feature I requested is >>important or not, and how may customers want it. You seem to have some >>quite strong opinions about what Soalris networking engineers need. >> > >Ok, so you still don''t understand what is happening. > >When I''m talking about customers, I''m talking about Solaris users - the >system administrators, developers and staff who use Solaris servers and >would like easy observability tools. > >The world isn''t made up of kernel network engineers or customers who care >about kernel code path latencies. Instead there are Solaris users who >run prstat or top from time to time, and would use a similar network summary >tool if we provided one. I can''t see the average customer running top in >one window and a kernel code path latency tool in another. > > >I''ve already said that I want to measure code path latencies too, and that >I think it should be part of a provider. However, since kernel network >engineers understand the kernel network code, then they can get their own >statistics from fbt and some sdt, and write their own private provider >for their own pet needs. Are they also "customers" of DTrace? Sure, but >customers who have the skills to satisfy their own private needs. >I think you''re underestimating the complexities involved (or maybe I''m underestimating what dtrace can do?) What I''d like to see dtrace be able to do is follow a packet through the kernel, not a thread of execution. fbt is for following execution. sdt is up to how it is used, but most often, it is just random hooks in the code. The difference between using fbt and being able to follow a packet is when a packet is put on a queue, the fbt path is terminated, even though the packet still has places to go and I/O ports to see. If the networking dtrace provider doesn''t deliver the capability to trace a packet through the kernel then it is not delivering a key feature that is needed by many people. Why don''t we just hack up our own dtrace provider? Because it doesn''t help in solving problems with installed systems in the field. In addition, it would take a lot of messing about to do this with fbt or sdt and that doesn''t help anyone. To say that kernel engineers understand the networking code and therefore do not need help in generating stats is not a very ingeneous statement to make and ignores the fact that there are varying levels of capability throughout the organisation, not all of whom are experts. In addition to the examples you''ve presented on the web page, I think it would be of benefit if the following questions could also be answered: - how many PPS is an application responsible for being sent out? - how many PPS is an application responsible for on the receive side? The above two questions respeated for bytes per second, not PPS...or take all of your "questions answered" and repeat for PID. 
...and I think this is the biggest problem with the dtrace networking provider as scoped so far - nothing in it relates to a process. I''m aware that this can be challenging for networking (especially on the receive side) but there are worthwhile questions here to answer from the generic customer perspective. I''d dispute that "Are hackers/crackers port scanning my server? (TCP flag matching by IP address)" is actually a question worth asking. If it is directly connected to the Internet, the answer is "yes" (but maybe not right now.) If there are any other boxes (routers, firewalls) along the way in from the Internet, then it''s not a question you should be using dtrace to find an answer for. If you''re trying to come up with a way to justify that particular dtrace probe, I''d recommend looking for a better example question. Some other probes that might be useful: - RTT calculations from TCP timestamp measurements - TCP window size changes - when a packet is dropped because it is "bad" (checksum, etc) - look through "netstat -s" output as there is quite a large number of stats there that are worthy of being identified with a probe. Darren p.s the HREF for dtrace-discuss at the bottom of http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ is wrong/broken.
Brendan Gregg - Sun Microsystems
2006-Sep-26 00:25 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
G''Day Darren, On Thu, Sep 21, 2006 at 11:12:55AM -0700, Darren.Reed at Sun.COM wrote:> Richard Lowe wrote: > > >[Cc''ing networking-discuss, I suspect followups should go to > >dtrace-discuss, however] > > > >Brendan Gregg - Sun Microsystems wrote: > > > >>G''Day Folks, > >> > >>I''ve just created a website to flesh out ideas for a DTrace Network > >>Provider. > >>Let us know what you think. > >> > >> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > >> > >>I''m not currently writing this provider; this began as a discussion with > >>Adam about what a network provider could do, which we thought would be > >>appropriate to discuss on dtrace-discuss. So, this may not actually > >>happen; > >>although it would help if a lot of people really do want it. :-) > > > > > >+1, I for one certainly want it :) > > > Agreed! :) > > Some work needs to happen on args[2] - it is completely IPv4 specific. > There is no mention of IPv6 header fields such as flow.There have been other ideas, but I was thinking that args[2] be a combination of all IPv* fields, then specific ones can be plucked from statements as follows, net:ip::send /args[2]->ip_ver == 4/ { @foo[args[2]->ip_flags] = count(); } net:ip::send /args[2]->ip_ver == 6/ { @bar[args[2]->ip_flow] = count(); } Which of course allows all IP traffic be matched, and common fields used, net:ip::send { @bytes = sum(args[2]->ip_length); }> For IPv4, it would be good if it were possible to look at IP header > options, or for IPv6, extension headers. Same again for TCP and its > options (timestamp, MSS, SACK, etc.)Yes, I agree.> e.g. What can be done with dtrace to get information about the AH > header as well as the TCP header in an IPv4 or IPv6 packet?Sure, I think AH + ESP header info should be available.> Or in TX, how would you access CIPSO information?I haven''t used this; could someone who has suggest an appropriate way?> Couldn''t we also do with some ICMP probes?Just added to the site now.> And can we mark ICMP packets in such a way with dtrace that we > can measure the time it takes from gld::receive to gld::send for > an ICMP ECHO to be received, processed and then send out again ?I''d have to check what unique ID''s were available for ICMP, but yes - that would be useful; and the same for any such reply (TCP data received to TCP ACK, ...). cheers, Brendan -- Brendan [CA, USA]
Darren.Reed at Sun.COM
2006-Sep-26 00:43 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:>G''Day Darren, > >On Thu, Sep 21, 2006 at 11:12:55AM -0700, Darren.Reed at Sun.COM wrote: > > >>Richard Lowe wrote: >> >> >> >>>[Cc''ing networking-discuss, I suspect followups should go to >>>dtrace-discuss, however] >>> >>>Brendan Gregg - Sun Microsystems wrote: >>> >>> >>> >>>>G''Day Folks, >>>> >>>>I''ve just created a website to flesh out ideas for a DTrace Network >>>>Provider. >>>>Let us know what you think. >>>> >>>> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ >>>> >>>>I''m not currently writing this provider; this began as a discussion with >>>>Adam about what a network provider could do, which we thought would be >>>>appropriate to discuss on dtrace-discuss. So, this may not actually >>>>happen; >>>>although it would help if a lot of people really do want it. :-) >>>> >>>> >>>+1, I for one certainly want it :) >>> >>> >>Agreed! :) >> >>Some work needs to happen on args[2] - it is completely IPv4 specific. >>There is no mention of IPv6 header fields such as flow. >> >> > >There have been other ideas, but I was thinking that args[2] be a >combination of all IPv* fields, then specific ones can be plucked >from statements as follows, > >net:ip::send >/args[2]->ip_ver == 4/ >{ > @foo[args[2]->ip_flags] = count(); >} > >Have you considered using names without the "ip_" ? Such as "flags", "version", "length", etc ? To those of us who work with the data structures, seeing the three letter "ip_" tag implies IPv4. There are also those who believe that "IP is IP", which is both IPv4 and IPv6... Is it possible to ship dtrace with macros? Ie. to do: struct foo { union { struct ip blah; struct ipv6 foo; } fooun; } #define ip_v fooun.blah.ip_v #define ip6_vfc fooun.foo.ip6_vfc or some such?>>And can we mark ICMP packets in such a way with dtrace that we >>can measure the time it takes from gld::receive to gld::send for >>an ICMP ECHO to be received, processed and then send out again ? >> >> > >I''d have to check what unique ID''s were available for ICMP, but yes - >that would be useful; and the same for any such reply (TCP data >received to TCP ACK, ...). > >The combination of source address, destination address, ICMP id and ICMP sequence number should be enough to uniquely identify a particular packet and its reply. Of course the source and destination address swap around on the way out too...is that likely to hamper the lookup by dtrace? Darren
Roch wrote:
> I do think that receiving new data out of order need to
> fire a receive probe. Since sending has 3 stages, so do receiving:
>
> tcp::sent-new : new data was sent
> tcp::sent-retrans : already sent data was resent
> tcp::sent-done : sent data is freed, acked received and we''re done.

I agree that it is easier to have tcp::sent-done instead of using a
tcp::ack-received to detect that it is an ACK which moves suna. But with a
little bit of D script, a user can filter that out easily from the
tcp::ack-received probe.

> tcp::received-new : new data was received from below (includes out of order)
> tcp::received-dup : already received data received again
> tcp::received-done : sent up toward application (we''re done)

The above six probes are centered around data. But how about dup ACKs, window
update, RST, ...? They are useful info too. An interesting case is the zero
window probe. Does it count as new? How about segments completely outside the
receive window?

And similar to my previous comment on tcp::sending, is tcp::received-done
equivalent to, say, sockfs::received or potentially other modules'' received
probes above TCP?

> I think that if new data is seen by tcp then it belong in a
> tcp probe. If we manage a sockfs buffer outside the view of
> TCP then we could have similar probes to handle that case.
> But since tcp does handle the pending buffer I still like
>
> tcp::sending : new data posted in tcp.

It seems that you are suggesting something like a tcp::send-queued probe. But
isn''t it the same as the fbt::tcp_wput:entry probe? And how do you see it
being used?

> And I still like the flow stop probes that fires when
> we have data in the xmit list and TCP refrains from sending
> waiting for an ACK or window update. For incoming, we could
> have similar probes for when we advertise a zero window or
> open it up.
>
> tcp::out-flow : started or stopped
> tcp::in-flow : started or stopped

Let''s focus on the sending side. I expect that this probe will usually be
fired at the same time as a normal tcp::sent probe. The reason is that there
is usually data to be sent but TCP cannot send it because of congestion
control. But we probably don''t want this to happen. So it can be quite
confusing as to exactly when this probe is fired.

> And we could want a probe that fires when tcp generate a
> dataless packet
>
> tcp::ctrl : pure ack, sack, window update, keepalive...

The first three are basically ACKs. I''d say a dedicated tcp::ack-sent is
easier to deal with.

-- 

K. Poon.
kacheong.poon at sun.com
Roch
2006-Sep-26 07:13 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
I think that streams/packet tracing makes the scope of the Network provider overwhelming (it would to me at least). So, I would like to see us make progress on the individual modules tracing and defer packet tracing as a followup. -r Darren.Reed at Sun.COM writes: > Brendan Gregg - Sun Microsystems wrote: > > >G''Day Sangeeta, > > > >On Thu, Sep 21, 2006 at 11:43:55AM -0700, Sangeeta Misra wrote: > >[...] > > > >>>But don''t assume that customers care most about seeing code path, > >>>or have read the 50,000+ lines that make up tcp.c and ip.c. > >>> > >>At this point,Sun engineers who work in Solaris kernel are customers who > >>would want to use DTrace Network Provider to track the above code paths > >>( to do analyse performance bottlenecks and latencies) There will be > >>Solaris kernel developers OpenSolaris community in future who would need > >>the same. > >> > >>Anyway, I am done trying to justify whether the feature I requested is > >>important or not, and how may customers want it. You seem to have some > >>quite strong opinions about what Soalris networking engineers need. > >> > > > >Ok, so you still don''t understand what is happening. > > > >When I''m talking about customers, I''m talking about Solaris users - the > >system administrators, developers and staff who use Solaris servers and > >would like easy observability tools. > > > >The world isn''t made up of kernel network engineers or customers who care > >about kernel code path latencies. Instead there are Solaris users who > >run prstat or top from time to time, and would use a similar network summary > >tool if we provided one. I can''t see the average customer running top in > >one window and a kernel code path latency tool in another. > > > > > >I''ve already said that I want to measure code path latencies too, and that > >I think it should be part of a provider. However, since kernel network > >engineers understand the kernel network code, then they can get their own > >statistics from fbt and some sdt, and write their own private provider > >for their own pet needs. Are they also "customers" of DTrace? Sure, but > >customers who have the skills to satisfy their own private needs. > > > > I think you''re underestimating the complexities involved (or maybe > I''m underestimating what dtrace can do?) > > What I''d like to see dtrace be able to do is follow a packet through > the kernel, not a thread of execution. fbt is for following execution. > sdt is up to how it is used, but most often, it is just random hooks > in the code. > > The difference between using fbt and being able to follow a packet > is when a packet is put on a queue, the fbt path is terminated, > even though the packet still has places to go and I/O ports to see. > > If the networking dtrace provider doesn''t deliver the capability > to trace a packet through the kernel then it is not delivering a > key feature that is needed by many people. > > Why don''t we just hack up our own dtrace provider? Because it > doesn''t help in solving problems with installed systems in the > field. In addition, it would take a lot of messing about to do > this with fbt or sdt and that doesn''t help anyone. > > To say that kernel engineers understand the networking code and > therefore do not need help in generating stats is not a very > ingeneous statement to make and ignores the fact that there are > varying levels of capability throughout the organisation, not > all of whom are experts. 
> > In addition to the examples you''ve presented on the web page, > I think it would be of benefit if the following questions could > also be answered: > > - how many PPS is an application responsible for being sent out? > - how many PPS is an application responsible for on the receive side? > The above two questions respeated for bytes per second, not PPS...or > take all of your "questions answered" and repeat for PID. > > ...and I think this is the biggest problem with the dtrace networking > provider as scoped so far - nothing in it relates to a process. I''m > aware that this can be challenging for networking (especially on the > receive side) but there are worthwhile questions here to answer from > the generic customer perspective. > > I''d dispute that "Are hackers/crackers port scanning my server? (TCP > flag matching by IP address)" is actually a question worth asking. > If it is directly connected to the Internet, the answer is "yes" > (but maybe not right now.) If there are any other boxes (routers, > firewalls) along the way in from the Internet, then it''s not a > question you should be using dtrace to find an answer for. If > you''re trying to come up with a way to justify that particular > dtrace probe, I''d recommend looking for a better example question. > > Some other probes that might be useful: > - RTT calculations from TCP timestamp measurements > - TCP window size changes > - when a packet is dropped because it is "bad" (checksum, etc) > - look through "netstat -s" output as there is quite a large > number of stats there that are worthy of being identified > with a probe. > > Darren > > p.s the HREF for dtrace-discuss at the bottom of > http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > is wrong/broken. > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
James Carlson
2006-Sep-26 14:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Andrew Gallatin writes:> > Darren.Reed at Sun.COM writes: > > > Some other probes that might be useful: > <....> > > - when a packet is dropped because it is "bad" (checksum, etc) > > Speaking as driver developer for 10Gb/s NICs, it would also be nice to > somehow make snoop capture only these bad packets. I realize this is > outside the scope of dtrace, but while we are wishing for things... :)Really, there''s a wealth of things that snoop and its DLPI cohort doesn''t do right or particularly well. This is just one of many. Among the others are identifying input versus output traffic, noting non-packet-related conditions (e.g., loss of carrier), retaining errored packets (runts, framing, FCS errors), and high-resolution timestamps. (Others have also mentioned a lower-overhead interface for monitoring fast networks, but I''m a little hesitant to put a performance goal in with functional issues. It could be in the list, though.) If we''re going to get into snoop, I''d rather see us cast a wider net than just adding a few minor features. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Tariq Rashid wrote:> i''ve not followed the entire thread but the following triggers would be useful for myself, and i''m sure many others who want to observe network server processes: > > * socket bind (listen)Which layer do you want info from? For example, the above can be done using the tcp::listen probe. But it will give the TCP layer info.> * incoming tcp and handshake handshakeThis can also be done using the TCP state changes probes.> * or udp data > * outgoing tcp data, or udp dataIt will work with the proposed sending side probes.> * tcp shutdownTCP state changes probes will help.> very useful would be if this was observable per thread, with a master process handing off reply tasklets to worker threads. > > an example task would be to accumulate histograms of the time between udp in and udp out, per server thread.It may not be possible. For example, incoming UDP packets are handled by the interrupt threads. At best we can map an incoming packets to the socket via conn_t. But this still does not map to a specific thread of a process. And what do you mean by "time between UDP in and out?" -- K. Poon. kacheong.poon at sun.com
Kacheong Poon
2006-Sep-26 19:28 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Provider
Brendan Gregg - Sun Microsystems wrote:> There have been other ideas, but I was thinking that args[2] be a > combination of all IPv* fields, then specific ones can be plucked > from statements as follows, > > net:ip::send > /args[2]->ip_ver == 4/ > { > @foo[args[2]->ip_flags] = count(); > }As mentioned in a previous email, I think ip::send can be confusing. Firstly, the exact location when this probe is fired needs to be defined. From the reading so far, I suspect that it is intended to be placed right before IP sends it down to the driver. If this is the case, then we probably need other "send" probes. For example, if a packet is to be encrypted, we probably want a probe before/after encryption. And how about a tunnel packet? After so much discussion, could we have an updated proposal on what probes are there now? Thanks. -- K. Poon. kacheong.poon at sun.com
Kacheong Poon writes:> > an example task would be to accumulate histograms of the time between udp in and udp out, per server thread. > > > It may not be possible. For example, incoming UDP packets > are handled by the interrupt threads. At best we can > map an incoming packets to the socket via conn_t. But this > still does not map to a specific thread of a process. And > what do you mean by "time between UDP in and out?"More to the point: there''s just no way to know at this layer which transmitted UDP packet(s), if any, might be related to which received UDP packet(s), if any. If you need to know that, then you need to be tracing _inside_ the application itself, so that you can distinguish between packet-sent-due-to-application-timer from packet-generated-in- response-to-some-input. There''s no guarantee that an application even transmits its replies in the same order in which its queries were received, let alone how many replies might go with a given query. For some really simple protocols, it may be possible to guess "most of the time" based on the packet contents, and that may be good enough for an ad-hoc test script or some application-specific case, but I doubt that''s a good basis for a stable provider. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Sep-26 22:12 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Kacheong,

On Fri, Sep 22, 2006 at 02:29:18AM +0800, Kacheong Poon wrote:
> Brendan Gregg wrote:
>
> >>It is unclear what is meant by "standard RFC* style probes."
> >>Does it mean an API style? But there is no RFC for standard
> >>API as IETF does not standardize API. Could you please
> >>elaborate that?
> >
> >Show me in the RFCs where dblks are discussed. The provider proposal uses
> >the stable terminology from the RFCs, and events from the RFCs, which
> >customers should find easy to follow.
>
> I guess you misunderstood my question. I was not
> asking about dblk tracing. I was asking about what
> is really meant by "standard RFC* style probes."
> For example, the proposal does not track congestion/
> send/receive windows. They are well defined in the
> TCP standard and are very useful in telling a user
> what is really going on. OTOH probes like
> tcp::drop-saturation is not in any TCP standard.

The RFCs provide useful terminology and values, and are a great starting
point. So when a customer sees the probe list, many of the probes use the same
RFC terminology, and should be easier to understand. The same can be said for
the provided arguments.

TCP window details are in the proposal - please check before posting.

> >>For example, a
> >>TCP segment can be sent through a tunnel. So what is the IP
> >>header info in args[2]?
> >
> >The tunnel IP. People can tunnel traffic over SSH anyway, would you have
> >us add USDT probes into sshd for details there?
>
> I don''t understand this comment. We are discussing a
> provider of the kernel network stack. IP tunnel is a well
> defined concept. We are not discussing about application
> protocol tracing. My question is about what is being provided
> in the args[] list. The list should be well defined. Otherwise,
> how can a user write scripts using them? This is exactly
> my suggestion above, we need to understand the issues first.

The provider uses the IP address from the IP header - this follows the
principle of least surprise.

Should something else be encapsulated, be it via IP tunnels or sshd tunnels or
anything else, then that data should be fetched using another means. Since IP
tunnels are well defined, as you say, then perhaps they should be
args[2]->ip_tun_daddr, etc.

As soon as we replace the IP address with the tunnel''d IP address, someone may
make a similar argument for sshd tunnels.

> >Quoting from the website (maybe you didn''t get this far?)
> >
> > "Why are outbound connections slow? (Connect latency, 1st byte latency,
> >  throughput latency)"
> >
> >These metrics are solvable from the proposal.
>
> This was exactly my question (I quoted the above question
> in my email). I don''t believe the proposal can answer the
> why part. It can tell a user that something is slow, but not
> why. Finding the latency does not tell us why the latency
> is there...

Connect latency, 1st byte latency and throughput latency are all measurable
from the proposed provider. They certainly give insights as to why a
connection is slow.

And these are quite similar to the high level latencies provided by the stable
and public io provider.

Brendan

-- 
Brendan
[CA, USA]
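For example, connect latency could be sketched along these lines, assuming
state-change probes that carry some stable per-connection handle as their
first argument (the probe names and arguments here are placeholders, not the
proposal itself):

    net:tcp:state:syn_sent
    {
        start[arg0] = timestamp;
    }

    net:tcp:state:established
    /start[arg0]/
    {
        @connect_ns = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
    }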
Brendan Gregg - Sun Microsystems
2006-Sep-26 22:48 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day,

On Fri, Sep 22, 2006 at 09:19:40AM +0800, Kacheong Poon wrote:
> Roch wrote:
>
> >So maybe
> >
> >tcp::sent : suna was advanced, args: from seq, nbytes
> >tcp::resent : suna not advanced, args: from seq, nbytes

I think these are useful as data-send, data-resend, etc. This gives us
send/receive for all TCP traffic regardless; and then higher level probes for
convenience, such as data-send, connect-established, etc.

> Do you mean tcp::sent is fired when TCP receives an ACK
> (otherwise suna won''t matter)? I think it is a little bit
> difficult to explain the above. In the tcp::sent case, it
> means that TCP receives an ACK such that it can now free
> up those data. But how can the probe distinguish it from
> a retransmission?

Yes, I agree - it may be a little bit difficult. I''ve focused on using "send"
and not "sent" to help with this. ie, "data-send" means that tcp is sending
data, and is a simple SDT probe. "data-sent" would mean that tcp has sent the
data and received the ACK. This can probably still be done, (and if so,
perhaps this should be done) but to begin with - "send" will be both easier
for us to write and users to understand.

> Or do you mean they are fired when TCP calls into IP to send
> some data? So it is when TCP sends some data to IP.

Without writing the code - my thoughts are that this will be both the easiest
and the most appropriate place to probe. It''s when TCP is handing it to the
next layer down (yes, I know it''s part of ip). Which is like the DTrace io
provider - which shows us when the block I/O driver hands the requests to the
next layer down (disk driver).

> Then
> suna does not matter and tcp::resent cannot be conditioned
> on "suna not advanced" as doing NewReno is a resent and suna
> is advanced in this case. I guess tcp::resent should fire
> when TCP does a retransmission. We also need something like
>
> tcp::ack-sent: fires when an ACK is sent, args: tcp_t, ack #,
>         window size
> tcp::rst-sent: fires when a RST is sent, args: src and dst
>         addrs, seq number
>
> We don''t need syn or fin sent as they consume 1 seq num and
> can be put into the tcp::send probe.
>
> >tcp::acked : acked pointer was advanced. args: seq, nbytes
>
> This is when TCP receives an ACK segment, right? It is not
> about sending? But why is ACK number not in the args? I think
> we should also have the window size and SACK options as args.
>
> >tcp::dupacked : acked pointer was not advanced. args: seq
>
> I think SACK options and dupack count should also be in the
> args.

Do we need separate ack probes? Maybe - I don''t have a strong opinion on this,
with tcp::send/receive matching everything - we have at least one way to
measure them.

Maybe the best way to suggest what the ack probes should be, is to write a
script that uses them. Writing the DTrace examples on the website helped me
fine tune how things worked best.

> >Timings from sent to acked would give easy access to round
> >trip time.

Yes - this script would be the best example of using acks, and is the only
script I didn''t write on the website (it should be under throughput latency -
although the term "round trip time" is probably more appropriate). I began
wondering how best to match acks, then left writing that script for another
time.

cheers,

Brendan

-- 
Brendan
[CA, USA]
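A rough sketch of that ack-matching idea, assuming send and ack probes whose
first argument is the starting sequence number (an entirely hypothetical
argument layout, ignoring retransmissions and the fact that an ACK covers
sequence plus length, which a real script would have to handle):

    net:tcp::data-send
    {
        sent[arg0] = timestamp;
    }

    net:tcp::ack-received
    /sent[arg0]/
    {
        @rtt_ns = quantize(timestamp - sent[arg0]);
        sent[arg0] = 0;
    }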
Brendan Gregg - Sun Microsystems wrote:> TCP window details are in the proposal - please check before posting.There are multiple "windows" in TCP protocol. I don''t think the current proposal actually tracks them.> The provider uses the IP address from the IP header - this follows > principal of least suprise.The above is OK as long as it is very clear that the probe is placed right before IP sends it down to the driver. So the user knows that what the probe args will be. I mentioned this in a previous email.> Should something else be encapsulated, be it via IP tunnels or sshd > tunnels or anything else, then that data should be fetched using > another means. Since IP tunnels well defined as you say, then perhaps > they should be args[2]->ip_tun_daddr, etc.Or we can have multiple probes for different purposes, as suggested in my previous email.> As soon as we replace the IP address with the tunnel''d IP address, > someone may make a similar argument for sshd tunnels.Well, as long as we are talking about a provider for the kernel network stack, I don''t believe the above request will come. It is simply the wrong request. People may ask to put probes at sshd, but it has nothing to do with the kernel network stack. I hope this part is very clear.> Connect latency, 1st byte latency and throughput latency are all > measurable from the proposed provider. They certainly give insights > as to why a connection is slow.The above just tells us that there is a latency. I am not sure how they can tell us why there is a latency. We need more information. As suggested by many people in this discussion thread, we can have more probes which tie to the TCP protocol processing.> And these are quite similar to the high level latencies provided by > the stable and public io provider.Well, we are talking about a network provider. I don''t see why we need to be limited by designs of other providers. Is there a reason? -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems
2006-Sep-26 23:13 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G'Day Kacheong,

On Fri, Sep 22, 2006 at 09:30:39AM +0800, Kacheong Poon wrote:

> Alan Maguire wrote:
>
> > net:tcp:state:closed
> > net:tcp:state:listen
> > net:tcp:state:syn_received
> > net:tcp:state:fin_wait_1
> > net:tcp:state:fin_wait_2
> > net:tcp:state:closing
> > net:tcp:state:established
> > net:tcp:state:syn_sent
> > net:tcp:state:close_wait
> > net:tcp:state:time_wait
> > net:tcp:state:last_ack
>
> State transition is certainly something very important. But
> I guess you can omit closed as it just means that a tcp_t
> is allocated. An fbt probe on the tcp_t creation function
> serves the same purpose.

closed is part of the diagram on p23 of RFC 793, so it is something people
may go looking for.

> > i guess there's some overlap between these probes and Brendan's
> > suggested probes (e.g. "established" in this scheme ==
> > "connect-established"), but maybe it would make sense to split up the
> > provider namespace into "protocol state" and "packet data" probes, i.e.
> >
> > net:tcp:state:
> > net:tcp:packet:

Yes, I've made that differentiation by using a "state-" prefix. I'm also a
little cautious about using the word "packet" at the transport level, as
IP may fragment this further (although I've never really noticed Solaris
doing this for TCP - I'm sure there is a reason in the code somewhere).
Although it's been hard to avoid saying "packet" in some of my examples.

> If the TCP segment which triggers the transition is passed back,
> there is no need to have a "packet" probe. It is more confusing
> to the user.

Yep. tcp::state-* events are about state changes, tcp::send/receive are
about packets (as far as TCP knows, anyway).

> > with the former looking at state events, the latter at packet events
> > (e.g. receive, drop)? would this be useful? (i think the common states
> > users will be interested in are pretty intuitive from the RFC-derived
> > names above, e.g. connection is listen'ing, established, closed).
> > thoughts?
>
> I think we may not need a "drop" probe in TCP, unless it means
> checksum error (which is actually done at the IP level in our
> implementation). TCP only drops a segment if it does not
> pass certain checks. This means that a script can easily be
> written using the tcp::receive probe and check on the seq number
> and timestamps option.

While drops will be more interesting at the NIC driver level, it does look
like TCP can drop SYNs when saturated.

cheers,

Brendan

--
Brendan
[CA, USA]
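To make the state/packet split concrete, a state-transition breakdown
would be a one-clause script against the proposed naming. This is a sketch
only - the net provider and the "state-" probe names are assumptions from
this thread, not an existing interface:

    net:tcp::state-*
    {
            @transitions[probename] = count();
    }

Run until ^C; the aggregation then prints a count per state entered.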
Brendan Gregg - Sun Microsystems wrote:

> Do we need separate ack probes? Maybe - I don't have a strong opinion
> on this, with tcp::send/receive matching everything - we have at least
> one way to measure them.

Yes, it is more for the convenience of writing D scripts.

> Maybe the best way to suggest what the ack probes should be, is to
> write a script that uses them. Writing the DTrace examples on the
> website helped me fine tune how things worked best.

Using a convention suggested by Darren, we can do something like the
following to check for, say, an ACK splitting attack by peers to gain an
unfair advantage:

net:ipv4:tcp:established
/args[1]->tcp_lport == 8080/
{
	tp = args[0];	/* assume args[0] is tcp_t */
}

net:ipv4:tcp:ack-received
/args[0] == tp/
{
	@bytes_acked[inet_ntop(IPV4, args[1]->tcp_faddr)] =
	    quantize(args[1]->tcp_ack - args[0]->tcp_suna);
}

> Yes - this script would be the best example of using acks, and is
> the only script I didn't write on the website (it should be under
> throughput latency - although the term "round trip time" is probably
> more appropriate). I began wondering how best to match acks, then
> left writing that script for another time.

Since TCP also keeps track of the RTT, I guess an easier way of finding
this is to just take a look at that value.

--
K. Poon.
kacheong.poon at sun.com
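If the provider exposed TCP's own RTT estimate as a probe argument, that
last point would reduce to a one-liner. This is purely a sketch: the
ack-received probe follows the naming used in the script above, and the
tcp_rtt_smoothed member is a hypothetical name invented here for
illustration.

    net:ipv4:tcp:ack-received
    {
            /* assumed member: TCP's smoothed RTT estimate */
            @rtt[inet_ntop(IPV4, args[1]->tcp_faddr)] =
                quantize(args[1]->tcp_rtt_smoothed);
    }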
Brendan Gregg - Sun Microsystems wrote:> I''m also a little cautious to use the word packet at the transport > level, as IP may fragment this further (although I''ve never really > noticed Solaris doing this for TCP - I''m sure there is a reason > in the code somewhere). Although It''s been hard to avoid saying > "packet" in some of my examples.Yes, we should not use "packet" in the transport level.> While drops will be more interesting at the NIC driver level, it > does look like TCP can drop SYNs when saturated.I''d suggest we use net:ipv*:tcp:listen-drop, or more specific tcp:listen-drop-q0 and tcp:listen-drop-q for the above. This is implementation specific as there is no such drop in the standard. So I think we can use implementation specific names. -- K. Poon. kacheong.poon at sun.com
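If the implementation-specific listen drop probes suggested above were
added, counting them by listening port would be a natural first script. A
sketch only: the probe names come from the suggestion above, and the args
layout (args[1] as the TCP info with tcp_lport, following the earlier
example in this thread) is an assumption.

    net:ipv*:tcp:listen-drop-q0,
    net:ipv*:tcp:listen-drop-q
    {
            @drops[probename, args[1]->tcp_lport] = count();
    }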
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:38 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G''Day Alex, On Fri, Sep 22, 2006 at 07:24:36AM -0700, Alex Peng wrote:> If D could have: > iphdr(ipha_t *) > tcphdr(tcph_t *) > > to just iphdr/tcphdr as in mdb, which will save many engineer''s time.Yes - the proposal is to have ipha_t and tcphdr available as probe args, which is appropriate as they are RFC based. Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:46 UTC
[dtrace-discuss] Re: DTrace Network Provider
G'Day Kacheong,

On Sat, Sep 23, 2006 at 07:57:13AM +0800, Kacheong Poon wrote:

> Roch wrote:
>
> > net:tcp::sending : new data posted in the tcp xmit buffer (from above)
>
> Or should this be the sent probe of the layer above TCP? For
> example, we can have sockfs::sent.

Sure; an ideal outcome would be to cache the thread info from the sockfs
layer, to assist in tracing data as it's sent down the layers from
application to ethernet. When we begin coding we'll find out how doable
this is.

> > net:tcp::acked : data from xmit buffer can be freed (send completed)
>
> Or maybe tcp::ack-received, similar to tcp::ack-sent?
>
> > I note we already have mib probes for retransmits and dup acks.
>
> The current mib probes are very limited, no TCP info available.
> I think it is better to have dedicated probes for them.

Yes. The mib probes have been handy as signposts in the kernel code for
interesting points. I'm glad they are there. I have used them a little in
the DTraceToolkit, but keep wanting further info (conn_s or tcp_s).

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-27 03:59 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Sat, Sep 23, 2006 at 07:57:13AM +0800, Kacheong Poon wrote:

> Roch wrote:
>
> > net:tcp::acked : data from xmit buffer can be freed (send completed)
>
> Or maybe tcp::ack-received, similar to tcp::ack-sent?

Oh - and I've just added ack-receive and ack-send to the website.

Yes, I'm fiddling the letters a little - if data-sent fires after the ACK
has been returned, so that we know it was sent ok, then ack-sent would
mean something stranger. The rules would be:

    data-send  means I'm sending the data.
    ack-send   means I'm sending the ack.
    data-sent  means the data was sent ok.
    ack-sent   means the ack was sent ok.

Which leaves the door ajar to add equivalent "sent" and "received" probes
if they prove more appropriate down the road.

Brendan
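To show how those naming rules would read in a script, here is a sketch of
measuring the gap between handing data to IP (data-send) and knowing it
arrived (data-sent). Everything here is assumed: the probes are only
proposed, args[0] is treated as a unique connection ID per the earlier
discussion, and keying on the connection alone means overlapping sends on
one connection are conflated (last send wins).

    net:tcp::data-send
    {
            send_ts[args[0]] = timestamp;
    }

    net:tcp::data-sent
    /send_ts[args[0]] != 0/
    {
            @completion["data-send to data-sent (ns)"] =
                quantize(timestamp - send_ts[args[0]]);
            send_ts[args[0]] = 0;
    }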
Brendan Gregg - Sun Microsystems
2006-Sep-27 04:29 UTC
[dtrace-discuss] Re: DTrace Network Provider
G'Day Roch,

On Mon, Sep 25, 2006 at 03:49:13PM +0200, Roch wrote:

> Kacheong Poon writes:
>
> I do think that receiving new data out of order needs to
> fire a receive probe. Since sending has 3 stages, so does receiving:
>
> tcp::sent-new     : new data was sent
> tcp::sent-retrans : already sent data was resent
> tcp::sent-done    : sent data is freed, ack received and we're done.
>
> tcp::received-new  : new data was received from below (includes out of order)
> tcp::received-dup  : already received data received again
> tcp::received-done : sent up toward the application (we're done)

To watch these events for a "packet" sounds quite interesting - but SACKs
could make it confusing - we wouldn't get a sent-done for every sent-new.

Maybe I'd adjust them to be:

    tcp::data-sent-start
    tcp::data-sent-retrans
    tcp::data-sent-done

Since they are about moving data; and "start" is already an accepted
DTrace term (from the io provider :).

Brendan

--
Brendan
[CA, USA]
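A quick sketch of why the three-stage split is useful: the ratio of
retransmitted starts to new starts, per destination, is a one-aggregation
script. The probe names follow the adjusted list above and do not exist
yet; the args[2]->ip_daddr member and the inet_ntos() conversion are
likewise assumptions borrowed from the website examples.

    net:tcp::data-sent-start,
    net:tcp::data-sent-retrans
    {
            @sends[inet_ntos(args[2]->ip_daddr), probename] = count();
    }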
Brendan Gregg - Sun Microsystems writes:> > State transition is certainly something very important. But > > I guess you can omit closed as it just means that a tcp_t > > is allocated. A fbt probe on the tcp_t creation function > > serves the same purpose. > > closed is part of the diagram on p23 of RFC 793; so it is something > people may go looking for.I suspect those might be confused people. RFC 793 page 21: TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional because it represents the state when there is no TCB, and therefore, no connection. Briefly the meanings of the states are: page 22: CLOSED - represents no connection state at all. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
James Carlson writes: > Brendan Gregg - Sun Microsystems writes: > > > State transition is certainly something very important. But > > > I guess you can omit closed as it just means that a tcp_t > > > is allocated. A fbt probe on the tcp_t creation function > > > serves the same purpose. > > > > closed is part of the diagram on p23 of RFC 793; so it is something > > people may go looking for. > > I suspect those might be confused people. > > RFC 793 page 21: > > TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional > because it represents the state when there is no TCB, and therefore, > no connection. Briefly the meanings of the states are: > > page 22: > > CLOSED - represents no connection state at all. > Ok, but there is an event which makes a connection transition it into this fictional state. I think this event can match a Dtrace probe, we just need to put that at the end of the list. -r > -- > James Carlson, KISS Network <james.d.carlson at sun.com> > Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 > MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677 > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Terri Molini
2006-Sep-27 14:45 UTC
[dtrace-discuss] WSJ article: Some ''Breakthroughs'' Deserve That Title -- But Definitely Not All
Thought you would enjoy the following WSJ artcle. Best, terri _________________________________________________________ *Some ''Breakthroughs'' Deserve That Title -- But Definitely Not All * The Wall Street Journal, Lee Gomes; September 27, 2006 All companies, but especially those in technology, like few things better than to talk about their "breakthroughs," those great leaps forward that make products out of the formerly impossible. A search by Factiva Consulting Services found that more than 8,600 press releases have been issued over the years with "breakthrough" in the headline, a majority of them by computer and electronics companies. IBM, incidentally, is the year''s breakthrough champion, at least in terms of how often it has headlined the word in its press releases. The company has had eight headline-worthy breakthroughs since January in customer privacy, energy efficiency, speedy billing software, storage, mainframes (twice), computer networking and voice recognition. But to what extent are we experiencing "breakthrough inflation," in which the work that an engineer would consider simply a good day in the lab becomes, in the hands of the PR department, an advance worthy of being shouted about from the rooftops? A breakthrough-themed announcement from one of the biggest Silicon Valley companies shows some of the issues involved in evaluating whether a particular technical development merits "breakthrough" status. It was issued this month by Intel, and it involves optical computing -- that is, using light rather than electrons to transmit digital data. Optical communications are already used throughout the fiber strands of our phone networks, but the electronic devices that make them possible are much too expensive for the insides of a desktop PC, or even for the high-end servers used by businesses. Intel has been working to change that, and it has announced a number of optically oriented breakthroughs in recent years. The theme of all of them has been finding ways to use ordinary low-cost silicon chips to manipulate the laser light used in optical computing. Intel''s latest breakthrough involves, of all things, glue -- specifically, finding a way to attach a layer of indium phosphide, a material long used to generate photons in laser equipment, to the top of a piece of silicon. That step had eluded researchers for years. With it accomplished, you get the breakthrough of a "laser on a chip." Intel says the arrival of such a laser chip will hasten the arrival of the day when, for instance, the different parts of your computer will communicate via laser fibers instead of copper wires. (Light travels farther and generates less "noise" than electrons running over copper.) It will also allow, says Intel, different parts of a computer -- the microprocessor and memory, for instance -- to be much farther apart than they are today, including on the other side of the room from each other. That would help diffuse the heat generated by the rows and rows of servers now found in the computer rooms of even medium-sized businesses. In turn, that could result in lower energy costs. Nearly every big company these days says that the biggest problem its IT shop has is getting enough power to run all the computers the company is buying. Having been inside Intel''s laser labs, I need no persuading that the company is doing important work here, and an Intel spokesman says the development is indeed a "breakthrough" because it shows how real-world optical products can be made with silicon. 
I wonder, though, how many more breakthroughs we will be reading about before optical computing becomes ubiquitous. An Intel spokesman says the laser chip is indeed a "breakthrough" because it shows how real-world optical products can be made with silicon. But being a "breakthrough" implies that you don''t need a lot more like it to get to where you need to go. Intel is the first to admit that many issues remain to be solved before that day is at hand in this case -- including, for instance, getting the hybrid indium-phosphide and silicon chips to produce light at the high temperatures that are normal in computer parts. If every few months produces a breakthrough, we might want to change what we are calling it to something less arresting -- perhaps "research advance." Like many other companies, Intel may also have a penchant for overclaiming even with well-grounded technical accomplishments. For example, the company said its recent work will hasten the arrival of ultrafast fiber Web connections to the home because silicon-based laser components will reduce the cost of optical networks. True enough, but the real expense of fiber to the home is digging up sidewalks and front yards. Unless Intel is also working on silicon-based shoveling technology, that cost is not likely to come down any time soon. There is another category of "breakthrough" in which the word becomes a synonym for a "supremely useful product." One example is Dtrace, a piece of software that is like a computer debugger on steroids. Dtrace was developed by a team of engineers headed by Bryan Cantrill at Sun Microsystems, and earlier this month, it won first place in The Wall Street Journal''s Technology Innovation Awards. It enables engineers to diagnose a computer and say how well it''s running and what might be causing a program to hiccup -- all without interfering with the machine''s operations. If Dtrace won''t change computing, it''s at least available today. Not only is it currently shipping from Sun, it also has been licensed by Apple Computer for new versions of the Macintosh. Apple and Sun together: Now there''s a breakthrough. ### -- The language of truth is unadorned and always simple. Marcellinus Ammianus Terri Molini Sun Microsystems, Inc. Global Communications 408/404-4976 office; x6-9968 408/404-4976 fax 408/406-9021 mobile IM: tmolini -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20060927/c15021d4/attachment.html>
Brendan Gregg - Sun Microsystems
2006-Sep-27 16:39 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G'Day James,

On Wed, Sep 27, 2006 at 09:18:27AM -0400, James Carlson wrote:

> Brendan Gregg - Sun Microsystems writes:
> > > State transition is certainly something very important. But
> > > I guess you can omit closed as it just means that a tcp_t
> > > is allocated. An fbt probe on the tcp_t creation function
> > > serves the same purpose.
> >
> > closed is part of the diagram on p23 of RFC 793; so it is something
> > people may go looking for.
>
> I suspect those might be confused people.
>
> RFC 793 page 21:
>
>    TIME-WAIT, and the fictional state CLOSED.  CLOSED is fictional
>    because it represents the state when there is no TCB, and therefore,
>    no connection.  Briefly the meanings of the states are:
>
> page 22:
>
>    CLOSED - represents no connection state at all.

and page 39,

      TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

  3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

  4.                                                       (Close)
      TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

  5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

  6.  (2 MSL)
      CLOSED

Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK to
CLOSED isn't. And in the source code, uts/common/inet/tcp/tcp.c:

/*
 * The tcp_t is going away. Remove it from all lists and set it
 * to TCPS_CLOSED. The freeing up of memory is deferred until
 * tcp_inactive. This is needed since a thread in tcp_rput might have
 * done a CONN_INC_REF on this structure before it was removed from the
 * hashes.
 */
static void
tcp_closei_local(tcp_t *tcp)
{
[...]
	tcp->tcp_state = TCPS_CLOSED;

CLOSED is an event that happens. Sure, it may not be a state as such that
a TCP connection is in; however, the probe net:tcp::state-closed is
showing a state transition (but not that the connection then remains
CLOSED).

Brendan

--
Brendan
[CA, USA]
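One reason for firing a probe on the transition into CLOSED, even though
the RFC calls the state fictional: it gives scripts a natural
end-of-connection event. A sketch under the proposal's naming - none of
these probes exist yet, and args[0] as a unique connection ID is an
assumption from this thread:

    net:tcp::state-established
    {
            born[args[0]] = timestamp;
    }

    net:tcp::state-closed
    /born[args[0]] != 0/
    {
            @lifetime["connection lifetime (ns)"] =
                quantize(timestamp - born[args[0]]);
            born[args[0]] = 0;
    }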
Brendan Gregg - Sun Microsystems
2006-Sep-27 16:41 UTC
[dtrace-discuss] Re: Re: DTrace Network Provider
G''Day Roch, On Wed, Sep 27, 2006 at 04:13:54PM +0200, Roch wrote:> James Carlson writes: > > Brendan Gregg - Sun Microsystems writes: > > > > State transition is certainly something very important. But > > > > I guess you can omit closed as it just means that a tcp_t > > > > is allocated. A fbt probe on the tcp_t creation function > > > > serves the same purpose. > > > > > > closed is part of the diagram on p23 of RFC 793; so it is something > > > people may go looking for. > > > > I suspect those might be confused people. > > > > RFC 793 page 21: > > > > TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional > > because it represents the state when there is no TCB, and therefore, > > no connection. Briefly the meanings of the states are: > > > > page 22: > > > > CLOSED - represents no connection state at all. > > > > Ok, but there is an event which makes a connection > transition it into this fictional state. I think this event > can match a Dtrace probe, we just need to put that at the > end of the list.Yep! I should have checked your message before posting. :) Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK > to CLOSED isn''t.Agreed. It depends on what the user is trying to track. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Sep-27 19:11 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
G''Day Darren, On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote:> Brendan Gregg - Sun Microsystems wrote:[...]> >I''ve already said that I want to measure code path latencies too, and that > >I think it should be part of a provider. However, since kernel network > >engineers understand the kernel network code, then they can get their own > >statistics from fbt and some sdt, and write their own private provider > >for their own pet needs. Are they also "customers" of DTrace? Sure, but > >customers who have the skills to satisfy their own private needs. > > > > I think you''re underestimating the complexities involved (or maybe > I''m underestimating what dtrace can do?)It''s complex, sure, but it will make a big difference where that complexity is. For the kernel to do all the work and dump associated events for code path latency, that is probably going to be a complex enhancement. The enabled probe effect may become a unwanted factor in the timing (we''d have to test this), and as it would be coded into the kernel, it would be set in stone - whether that is appropriate for one script or another. However, if the kernel were to dump some rather raw details, we could try to move much of the complexity to a DTrace translater (such as /usr/lib/dtrace/io.d), which can be tweaked as needed; and/or do postprocessing of the details in a userland program - even Perl - to reduce the inline enabled probe effect. Each script could approach the raw data in a different way as appropriate.> What I''d like to see dtrace be able to do is follow a packet through > the kernel, not a thread of execution. fbt is for following execution. > sdt is up to how it is used, but most often, it is just random hooks > in the code. > > The difference between using fbt and being able to follow a packet > is when a packet is put on a queue, the fbt path is terminated, > even though the packet still has places to go and I/O ports to see. > > If the networking dtrace provider doesn''t deliver the capability > to trace a packet through the kernel then it is not delivering a > key feature that is needed by many people.Noone is saying that the provider will never do this. If you are saying that this feature is important to you, then sure - I understand. And if many people means many engineers (which in turn helps them achieve cool things), then I understand that too.> In addition to the examples you''ve presented on the web page, > I think it would be of benefit if the following questions could > also be answered: > > - how many PPS is an application responsible for being sent out? > - how many PPS is an application responsible for on the receive side? > The above two questions respeated for bytes per second, not PPS...or > take all of your "questions answered" and repeat for PID. > > ...and I think this is the biggest problem with the dtrace networking > provider as scoped so far - nothing in it relates to a process. I''m > aware that this can be challenging for networking (especially on the > receive side) but there are worthwhile questions here to answer from > the generic customer perspective. >I completely agree, and this was the first networking issue I wanted DTrace to solve - which led me to write tcpsnoop, tcptop, etc, which appear in the DTraceToolkit. It''s hard - I haven''t thought how best to add it to the provider - but I do want it to be there. My tcpsnoop/tcptop scripts are helpful in the meantime, but they break between Solaris versions (as they are entirerly fbt based). 
One day I hope to write stable versions of tcpsnoop/tcptop.> I''d dispute that "Are hackers/crackers port scanning my server? (TCP > flag matching by IP address)" is actually a question worth asking. > If it is directly connected to the Internet, the answer is "yes" > (but maybe not right now.) If there are any other boxes (routers, > firewalls) along the way in from the Internet, then it''s not a > question you should be using dtrace to find an answer for. If > you''re trying to come up with a way to justify that particular > dtrace probe, I''d recommend looking for a better example question.Ok, - I do think a sensible final solution is that customers run a NIDS (such as snort) somewhere on the network, or use tools to examine their firewall logs. To have a single DTrace script check for all types of scans and give a report will still have some value. A customer finds unexplained network utilization, and can run a quick script to check what it is - rather than install a NIDS if one isn''t available. I''m sure there are better example questions with example DTrace solutions that I can add to this site - I hope some people will start posting suggestions. :)> Some other probes that might be useful: > - RTT calculations from TCP timestamp measurements > - TCP window size changesThese could be scripted from tcp::send and tcp::receive alone; but seperate probes may make sense.> - when a packet is dropped because it is "bad" (checksum, etc)Yep - we need a tcp::drop-checksum or some such.> - look through "netstat -s" output as there is quite a large > number of stats there that are worthy of being identified > with a probe.Sure. So long as we really do want to probe it, and that the kstat value isn''t sufficient. Thinking of sample scripts that use them should help here.> p.s the HREF for dtrace-discuss at the bottom of > http://www.opensolaris.org/os/community/dtrace/NetworkProvider/ > is wrong/broken.thanks, Brendan -- Brendan [CA, USA]
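As a strawman for the send side only: if the transport send probe happened
to fire in the context of the sending thread (a big if - the receive side
and any deferred processing certainly won't have a useful execname, which
is exactly why tcpsnoop/tcptop are hard), per-process packet rates would
reduce to a tiny script. The probe name follows the proposal; everything
else here is an assumption.

    net:tcp::send
    {
            @pps[pid, execname] = count();
    }

    profile:::tick-1sec
    {
            printa("%6d %-16s %@8d\n", @pps);
            trunc(@pps);
    }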
Brendan Gregg - Sun Microsystems wrote:> The rules would be, > > data-send means I''m sending the data. ack-send means I''m sending the ack. > data-sent means the data was sent ok. ack-sent means the ack was sent ok.The problem is that we can never know if an ACK is sent OK.> Which leaves the door ajar to add equivalent "sent" and "received" probes > if they prove more appropriate down the road.I think the first rule above is OK, if sending means giving it to IP. But I guess Roch''s point is that it is good to have a probe fired when data is freed as it is ack''ed by the peer. This can save some code in writing a D script which only has tcp:ack-received. -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems wrote:> Maybe I''d adjust them to be, > > tcp::data-sent-startSo it fires when new data is sent down to IP?> tcp::data-sent-retrans > tcp::data-sent-doneDoes it mean that data is ack''ed and freed? If it does, I''d suggest we just use data-acked. "Done" sounds like the sending action is finished, so it fires when the thread calling the IP send function returns. We probably don''t need that.> Since they are about moving data; and "start" is already an accepted > DTrace term (from the io provider :). > > Brendan >-- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems wrote:> Oh - and I''ve just added ack-receive and ack-send to the website.I just took a look at the updated page. Since TCP is probably the most developed part, I guess we should have a more detailed explanations on it first. We can work on other parts later. For example, we can spell out the exact arguments of all the proposed probes. The args[0] should probably be tcp_t instead of conn_s (one can get conn_s from tcp_t) for all the probes. Do we still need those TCP "accept-*" and "connect-*" probes? They are almost the same as the state probes. I''d suggest we remove those "socket API related" probes. OTOH, if we have a sockfs module, then I think we should have those socket related probes in that module. We have talked about IPv4 and IPv6 and people have proposed some ways to work with them. I''d suggest to update the page with the new proposal. -- K. Poon. kacheong.poon at sun.com
Brendan Gregg - Sun Microsystems
2006-Sep-28 05:15 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Kacheong, Darren, On Thu, Sep 28, 2006 at 10:02:09AM +0800, Kacheong Poon wrote:> Brendan Gregg - Sun Microsystems wrote: > > >Oh - and I''ve just added ack-receive and ack-send to the website. > > > I just took a look at the updated page. Since TCP is > probably the most developed part, I guess we should > have a more detailed explanations on it first. We > can work on other parts later. For example, we can > spell out the exact arguments of all the proposed > probes. The args[0] should probably be tcp_t instead > of conn_s (one can get conn_s from tcp_t) for all the > probes.As both conn_s and tcp_t are private implementation info, we don''t want either casted as args[0] (thanks to eschrock for correcting me on that); Instead I have it there as a uintptr_t (just changed it to uint64_t), so that the pointer value can be used as a unique session ID. Is the tcp_t valid all the way from the first SYN to the last packet? I''d have to check; if there was a scenario where conn_s was valid while tcp_t was NULL, say during a close function, then conn_s would be better. conn_s can also be used for UDP, so customers will be learning one unique ID value for both. However if the conn_s gets reused for the next TCP session (or even UDP session) - then that would make tcp_t a better choice. I''m going to have to check through the code to answer these, unless someone can already shed light on them. I believe one of them is suitable - either conn_s or tcp_t, it''s just a matter of checking their lifecycles.> Do we still need those TCP "accept-*" and "connect-*" > probes? They are almost the same as the state probes. > I''d suggest we remove those "socket API related" probes. > OTOH, if we have a sockfs module, then I think we should > have those socket related probes in that module.Yes, the state probes have just state-established, while accept-established and connect-established make it easy to differentiate between inbound and outbound connections. They are stable concepts and are also quite practical - who is connecting to my webserver? This can be answered easily with accept-established; however if we had just state-established then we''d need to check the direction as well - which isn''t available as a flag, and so we''d need to cache the previous syn-sent/syn-received state in a global variable keyed on the conn_s. That''s going to make many of the one-liners go from being elegant to clumsy.> We have talked about IPv4 and IPv6 and people have > proposed some ways to work with them. I''d suggest to > update the page with the new proposal.Yes - I''d like to do something about it, but the issues go like this: Lets have, net:ipv4:tcp:send no, we can''t fiddle the 3rd field in an SDT provider, and leaving it as the function name is something of a DTrace convention. So we can have, net:ipv4::send net:ipv6::send so far so good. args[2] is either an IPv4 header or IPv6. This makes writing the provider easier. :) Now what about, net:tcp::send Oh-oh - will args[2] now be ipv4 or ipv6? We still want to match IP addresses at the TCP layer, as they are available. Do we use, net:tcp4::send net:tcp6::send ... which would lead us to, net:tcp4::send net:tcp4::receive net:tcp6::send net:tcp6::receive net:udp4::send net:udp4::receive net:udp6::send net:udp6::receive net:sctp4::send net:sctp4::receive net:sctp6::send net:sctp6::receive ... which is getting ugly. Now consider multiplying this pain with other network layer protocols we may encounter (802.11g?). 
I'd propose the following two suggestions:

1)
    net:ip::send        args[2] is ipinfo header
    net:tcp::send       args[2] is ipinfo header

where an ipinfo header is an aggregation of fields from all possible
network protocols, with individuals differentiated using a predicate, eg,

    /args[2]->ip_ver == 4/
    /args[2]->ip_ver == 6/

or no predicate if you don't care; which will usually be the case (we'll
often be just interested in the address as a string - whether that is an
IPv4 dotted quad decimal or an IPv6 compact string).

or,

2)
    net:ipv4::send      args[2] is ipv4 header
    net:ipv6::send      args[2] is ipv6 header
    net:tcp::send       args[2] is ipinfo header

For other layers (tcp) where we are likely to be just interested in common
fields, such as source & dest address, ipinfo is sufficient. However, for
network layer protocols we have the original ipv4 or ipv6 header
available.

Both can be matched using,

    net:ipv*::send

... so long as no other protocol begins with "ipv"! This gives us the
advantage of protocol access at the network layer, without a mess at the
transport layer.

I guess there is a 3rd option as well,

3)
    net:ipv4::send      args[2] is ipinfo header
    net:ipv6::send      args[2] is ipinfo header
    net:tcp::send       args[2] is ipinfo header

since ipinfo aggregates everything from both ipv4 and ipv6, would it
matter that there were unused fields? ...

PS - the "ip_" and "tcp_" prefixes are a DTrace convention as well, and
would be appropriate for a DTrace-invented ipinfo_t (just like the io
provider invents a devinfo_t and a fileinfo_t, with prefixes). Trying to
make these look identical to the Solaris implementation sounds like a step
towards a private provider again, which for net::: we aren't trying to do.

cheers,

Brendan

--
Brendan
[CA, USA]
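As a quick illustration of how option (1) reads in a script: one clause
that works for either IP version, and one narrowed by the version
predicate. This is a sketch against the proposed (not yet existing)
ipinfo_t; ip_daddr, ip_ver and inet_ntos() are taken from the proposal,
while ip_length is an assumed member name for the datagram length.

    net:ip::send
    {
            /* common ipinfo fields work regardless of IP version */
            @bytes[inet_ntos(args[2]->ip_daddr)] = sum(args[2]->ip_length);
    }

    net:ip::send
    /args[2]->ip_ver == 6/
    {
            /* only narrow by version when it actually matters */
            @v6["IPv6 datagrams"] = count();
    }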
> I''d propose the following two suggestions: > > 1) > net:ip::send args[2] is ipinfo header > net:tcp::send args[2] is ipinfo header > > where an ipinfo header is an aggregation of fields from all possible > network protocols, with individuals differentiated using a predicate, > eg, > > /args[2]->ip_ver == 4/ > /args[2]->ip_ver == 6/ > > or no predicate if you don''t care; which will usually be the case (we''ll > often be just interested in the address as a string - whether that is a > IPv4 dotted quad decimal or an IPv6 compact string). > > or, > > 2) > net:ipv4::send args[2] is ipv4 header > net:ipv6::send args[2] is ipv6 header > net:tcp::send args[2] is ipinfo header > > For other layers (tcp) where we are likely to be just interested in common > fields, such as source & dest address, then ipinfo is sufficient. > However for network layer protocols we have the original ipv4 or ipv6 > header available. > > Both can be matched using, > > net:ipv*::send > > ... so long as no other protocol begins with "ipv"! > > This gives us the advantage of protocol access at the network layer, > without a mess at the transport layer.Option (1) is much more in keeping with the DTrace conventions -- and is a much more scalable option, in my opinion. A Stable net provider that used Option (2) would have to essentially lie about the module name and then make that a Stable component in the probe tuple -- an idea that I''m quite uncomfortable with. So I would weigh in quite strongly towards Option (1) (or its ilk) unless that is somehow technically impossible... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Kacheong Poon writes: > Brendan Gregg - Sun Microsystems wrote: > > > Maybe I''d adjust them to be, > > > > tcp::data-sent-start > > > So it fires when new data is sent down to IP? > Ideally it would be better to fire this up before we call into IP (and call the probes data-send*) but do we actually know or control before hand how much data IP/NIC is going to accept ? If we actually do know then lets fire before but if we don''t I think we can only fire after calling into IP. > > > tcp::data-sent-retrans > > tcp::data-sent-done > > > Does it mean that data is ack''ed and freed? If it does, I''d > suggest we just use data-acked. "Done" sounds like the sending > action is finished, so it fires when the thread calling the > IP send function returns. We probably don''t need that. How about data-sent-acked ? -r
Sure, the RFC shows that TIME-WAIT to CLOSED is fictional, but LAST-ACK to CLOSED isn''t. Even if it''s not triggered by an external event, it''s real enough to hold on to the port number for that 2 MSL (and cause port starvation at least in microbenchmarks). So for completeness I think we should fire the transition into closed after that delay. -r
On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote:> > 1) > > net:ip::send args[2] is ipinfo header > > net:tcp::send args[2] is ipinfo header > ><SNIP!>> > or, > > > > 2) > > net:ipv4::send args[2] is ipv4 header > > net:ipv6::send args[2] is ipv6 header > > net:tcp::send args[2] is ipinfo header<SNIP!>> Option (1) is much more in keeping with the DTrace conventions -- and is > a much more scalable option, in my opinion. A Stable net provider that > used Option (2) would have to essentially lie about the module name and > then make that a Stable component in the probe tuple -- an idea that I''m > quite uncomfortable with. So I would weigh in quite strongly towards > Option (1) (or its ilk) unless that is somehow technically impossible...You do realize that both options lie about the module name, right? There is a "tcp" driver module, but in a post-FireEngine world it merely acts as a compatibility shim for TLI/XTI applications. All of the hotpath TCP/IP protocol processing now lives in the ip module. So given BOTH options don''t state the precise module where the code lives, I would lean (a little) toward #2, where args[2] gets its type straightened out immediately. Dan
Bryan Cantrill writes:> > net:ip::send args[2] is ipinfo header > > net:tcp::send args[2] is ipinfo header[...]> > net:ipv4::send args[2] is ipv4 header > > net:ipv6::send args[2] is ipv6 header > > net:tcp::send args[2] is ipinfo header[...]> Option (1) is much more in keeping with the DTrace conventions -- and is > a much more scalable option, in my opinion. A Stable net provider that > used Option (2) would have to essentially lie about the module name and > then make that a Stable component in the probe tuple -- an idea that I''m > quite uncomfortable with. So I would weigh in quite strongly towards > Option (1) (or its ilk) unless that is somehow technically impossible...Unfortunately, the IPv4 and IPv6 headers are not actually the same. It would have been nice if the ipng group had just extended IP, but they did some rototilling as well. This means that while "version," "source address," and "destination address" are in common (except perhaps for representation of addresses, which we can probably ignore) the rest is not. There are three fields that are renamed. The "TOS" field in IPv4 becomes the "TC" field in IPv6. The contents are actually mostly the same in practice (using DSCP), so perhaps these could be aliases. Similarly, "TTL" becomes "hop limit" without an actual change in semantics. And "protocol" becomes "next header." The options handling is quite different between the two. In IPv4, the options are part of the IP header, which is why there''s an "IHL" field. In IPv6, the options are in *separate* headers (called "extension headers") that look to the user somewhat like encapsulation. All of the IPv4 fragmentation bits (ID, DF, MF, offset) are different in IPv6. Instead of being fixed header fields, these turn into an option extension header. IPv6 doesn''t have an IP header checksum or flags. IPv4 doesn''t have a "flow label" (but, then, perhaps nobody uses the IPv6 flow label anyway). In other words, I''m puzzled about how ''ipinfo'' will work in practice. Suppose we do go with option (1). What does this do? net:ip::send /args[2]->ip_tos == 0x80/ Does it select all IPv4 packets with TOS 0x80, and ignore IPv6? Does it select both IPv4 and IPv6 with TC set to 0x80? Does it cause one of those hard-to-understand "at DIF offset" errors? Repeat the same for the not-in-common fields. I think merging IPv4 and IPv6 fields into a single structure is, at least, to me, a confusing idea. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems writes:

> Oh-oh - will args[2] now be ipv4 or ipv6? We still want to match IP
> addresses at the TCP layer, as they are available.

Do we? I think that's one of the important questions here. It seems to me
that doing so, while convenient, is a bit of a layering violation and may
lead to fragility.

While TCP and UDP currently run only on IPv[46], that hasn't always been
so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we wave away
this issue ("we really have all the protocols we'll ever need"), it begs
questions about how tunnels are represented and how users might interact
with both general encapsulation and stack instances.

Perhaps more importantly, at the point where TCP does a "send" operation,
there is no IP header, so referring to one is a bit of a falsehood. (There
happens to be one at this point in time prefabricated in our
implementation, but that's an implementation artifact. The RFCs admit no
such thing.)

Here's a thought experiment: how would you represent TCP/IP over PPP on
L2TP? The stack (for user data) looks something like:

    TCP/IP/PPP/L2TP/UDP/IP

I suspect it might be just as easy to do something like this:

    net:ipv4::recv
    /args[1]->ip_src == $v4src && args[1]->ip_proto == 6/
    {
            self->conn = args[0];
    }

    net:tcp::recv
    /self->conn == args[0] && args[1]->tcp_sport == 80/
    {
    }

as it is to do something like this:

    net:tcp::recv
    /args[1]->ip_src == $v4src && args[2]->tcp_sport == 80/
    {
    }

... and the former is semantically clearer and richer (as you can
discriminate IP fragments if you wish, as well as handling tunnels). The
downside is not having something in the framework that allows you to
"reach back" into prior headers, but since that's also not really part of
the layering model, that might not be as big a deal as it seems.

On the transmit side, I would likely want to have access to the well-known
connection state items (bound addresses and ports as well as various
socket options) in some implementation-neutral way. In fact, if you
wanted, providing access to the connection state would provide you with
essentially the same information you're proposing, but without the
layering violation implied by accessing it out of the packet header.

In other words, instead of:

    net:tcp::send       args[2] is ipinfo header

you could have:

    net:tcp::send       args[2] is connection state

... where we have some abstract (non-implementation-exposing; not a copy
of the conn_t or tcp_t) structure containing the basic connection
variables at that layer. For TCP, that might be:

    cs_bound_local_ip
    cs_bound_remote_ip
    cs_bound_local_port
    cs_bound_remote_port
    cs_nagle_enabled
    cs_linger_enabled
    cs_linger_time
    cs_state
    cs_snd_una
    cs_snd_nxt
    cs_snd_wnd
    ...

(Of course, someone should come up with pithier names.)

--
James Carlson, KISS Network                    <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
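For comparison, a sketch of what a script against the abstract
connection-state idea could look like - tracking outstanding
(unacknowledged) data per remote peer for a local web server, without
touching a packet header. Every name here is hypothetical, lifted from the
cs_* list above; the probe and args layout are equally assumed.

    net:tcp::send
    /args[2]->cs_bound_local_port == 80/
    {
            /* data sent but not yet acknowledged, per peer */
            @outstanding[args[2]->cs_bound_remote_ip] =
                quantize(args[2]->cs_snd_nxt - args[2]->cs_snd_una);
    }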
Brendan Gregg - Sun Microsystems
2006-Sep-28 16:51 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Dan, On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:> On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote: > > > 1) > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > > > > <SNIP!> > > > or, > > > > > > 2) > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > <SNIP!> > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > You do realize that both options lie about the module name, right?Yes. But we shouldn''t have to plunge the customer into the world of Solaris TCP/IP stack implementation for this public and stable provider. We don''t want, DTrace Networking FAQ Q1) Where is TCP? A1) Once upon a time there was a tcp module, but then Sun commenced work on a project codenamed "FireEngine" ... (long history lesson) ... So when you see "ip", that really means whatever is in the kernel ip module, which includes the tcp functions. Lets say customers write a ton of scripts using, net:icmp::receive And then a (fictional) project FireStation comes along and moves icmp into ip. We wouldn''t want to then tell customers that we broke all their icmp scripts, and they they had to do something like, net:ip::receive /args[4]->type == ICMP/ No, we want tcp to be tcp and icmp to be icmp.> So given BOTH options don''t state the precise module where the code lives, > I would lean (a little) toward #2, where args[2] gets its type straightened > out immediately.Sure, but this provider is focused on networking not network implementation. I think there is a seperate need for the following, net-code::: Network Code Provider io-code::: I/O Code Provider cheers, Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-28 17:08 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote:> On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:[...]> > You do realize that both options lie about the module name, right?Ok, I have to do some back-pedling now as it''s been explained that changing the module name in DTrace is just as difficult as changing the function name -- and, that the module name equaling the correct kernel module is a deliberate DTrace convention. (I thought that strictness was just for functions - so I''ve learnt something new. :) What we can do is change the probes to be the following, ip:::send ip:::receive tcp:::send tcp:::receive etc. This means a seperate provider for each protocol -- collectively they will be the Network Providers, but there won''t be a specific "net:::". Brendan -- Brendan [CA, USA]
On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote:> G''Day Dan,<SNIP!>> > > > You do realize that both options lie about the module name, right? > > Yes. But we shouldn''t have to plunge the customer into the world of > Solaris TCP/IP stack implementation for this public and stable provider.Agreed.> And then a (fictional) project FireStation comes along and moves icmp into > ip. We wouldn''t want to then tell customers that we broke all their icmp > scripts, and they they had to do something like, > > net:ip::receive > /args[4]->type == ICMP/ > > No, we want tcp to be tcp and icmp to be icmp.Guess what, icmp is already IN IP except for the bits that handle SOCK_RAW sockets for apps like ping(1m).> > So given BOTH options don''t state the precise module where the code > > lives, I would lean (a little) toward #2, where args[2] gets its type > > straightened out immediately. > > Sure, but this provider is focused on networking not network implementation. > I think there is a seperate need for the following, > > net-code::: Network Code Provider > io-code::: I/O Code ProviderHmmmm... people who do networking (and I mean sysadmins when I say this) are already comfy enough with snoop(1m) or tcpdump(8) or ethereal(8) so that they''ll know their way around a TCP or IP header. THAT''s why I want an IPv6/IPv4 split. It''s not about the implementation, it''s about the protocol, which includes the bits on the wire AND how endpoints send/receive and process said bits. Wait until tref''s off my back - there are lots of concepts in IPsec that look like implementation details but really aren''t. (See RFCs 2401 or 4301 for a start, assuming you''ve the time.) Dan
Brendan Gregg - Sun Microsystems
2006-Sep-28 17:27 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Dan, On Thu, Sep 28, 2006 at 01:13:49PM -0400, Dan McDonald wrote:> On Thu, Sep 28, 2006 at 09:51:49AM -0700, Brendan Gregg - Sun Microsystems wrote: > > G''Day Dan, > <SNIP!> > > > > > > You do realize that both options lie about the module name, right? > > > > Yes. But we shouldn''t have to plunge the customer into the world of > > Solaris TCP/IP stack implementation for this public and stable provider. > > Agreed. > > > And then a (fictional) project FireStation comes along and moves icmp into > > ip. We wouldn''t want to then tell customers that we broke all their icmp > > scripts, and they they had to do something like, > > > > net:ip::receive > > /args[4]->type == ICMP/ > > > > No, we want tcp to be tcp and icmp to be icmp. > > Guess what, icmp is already IN IP except for the bits that handle SOCK_RAW > sockets for apps like ping(1m).fair enough :)> > > So given BOTH options don''t state the precise module where the code > > > lives, I would lean (a little) toward #2, where args[2] gets its type > > > straightened out immediately. > > > > Sure, but this provider is focused on networking not network implementation. > > I think there is a seperate need for the following, > > > > net-code::: Network Code Provider > > io-code::: I/O Code Provider > > Hmmmm... people who do networking (and I mean sysadmins when I say this) are > already comfy enough with snoop(1m) or tcpdump(8) or ethereal(8) so that > they''ll know their way around a TCP or IP header. THAT''s why I want an > IPv6/IPv4 split. It''s not about the implementation, it''s about the protocol, > which includes the bits on the wire AND how endpoints send/receive and > process said bits.so that would be, ipv4:::send ipv4:::receive ipv6:::send ipv6:::receive ipv*:::send ipv*:::receive instead of, ip:::send /args[2]->ip_ver == 4/ ip:::receive /args[2]->ip_ver == 4/ ip:::send /args[2]->ip_ver == 6/ ip:::receive /args[2]->ip_ver == 6/ ip:::send ip:::receive which may be fine; but then accessing ip info from other layers gets nasty (I guess which is why you leant a little towards option 2 :) Brendan -- Brendan [CA, USA]
On Thu, Sep 28, 2006 at 06:35:10AM -0400, Dan McDonald wrote:> On Wed, Sep 27, 2006 at 10:24:25PM -0700, Bryan Cantrill wrote: > > > 1) > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > > > > <SNIP!> > > > or, > > > > > > 2) > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > <SNIP!> > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > You do realize that both options lie about the module name, right?Ah yes -- thanks for the reminder. (I confess that I''m still getting used to the everything-in-ip world.) While there is latitude for discussion here, I feel pretty strongly that we should not lie about either the module name or the function name. I like the current split: the provider name and the probe name are arbitrary names that can convey arbitrary meaning; the module name and the function name represent the Truth of the implementation.(*) If we allow all of the elements of the probe tuple to be arbitrary, we have created a world where we can no longer correlate a probe to its implementation -- and this, I think, is undesirable. (Especially in a USDT world, where one might use Stable probes simply to better understand the implementation.) In terms of this case: ip/tcp should be in either the provider name (e.g. ip:::send or net-ip:::send, which I think makes sense) or the probe name (e.g. net:::ip-send, which I think still makes sense but is slightly muddier). - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc (*) Yes, the syscall provider is an unfortunate exception to this rule: the syscall provider''s functions don''t necessarily correspond to actual functions in the implemention, because the sysent name does not need to be the name of the function that handles the system call. That said, these nearly always match -- with notable exceptions like rexit(), gtime(), profil(), etc. But the syscall provider is an oddball in this regard, and it should stay the exception. Or more concisely: SDT-based and USDT-based providers should honor the DTrace convention of telling the truth in probemod and probefunc.
On Thu, Sep 28, 2006 at 09:48:50AM -0400, James Carlson wrote:> While TCP and UDP currently run only on IPv[46], that hasn''t always > been so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we > wave away this issue ("we really have all the protocols we''ll ever > need"), it begs questions about how tunnels are represented and how > users might interact with both general encapsulation and stack > instances.If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that make TCP and IP information available in the probes at once) results in better tracing performance than only having "ipv4," "ipv6" and "tcp" providers that don''t violate abstractions, then it may be worth doing. IOW, by having TCP and IP in the same module we may be able have a provider such that script writers need not resort to threading probes for TCP and IP through thread local variables, hash tables, etc, and that seems worthwhile. But I agree that this isn''t the Right Way (tm) to do it. That there should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. Nico --
> Ah yes -- thanks for the reminder. (I confess that I''m still getting used> to the everything-in-ip world.) While there is latitude for discussion > here, I feel pretty strongly that we should not lie about either the > module name or the function name. I like the current split: the provider > name and the probe name are arbitrary names that can convey arbitrary > meaning; the module name and the function name represent the Truth of > the implementation.(*) But what place does the Truth of Implementation have in a high-level provider with stable probes? Or, put another way, it seems like the Truth of Implementation is inherently at-odds with stability, and thus that the second and third elements of the probe tuple should not be used by high-level providers. -- meem
On Fri, Sep 29, 2006 at 02:07:46AM +0800, Peter Memishian wrote:> > > Ah yes -- thanks for the reminder. (I confess that I''m still getting used > > to the everything-in-ip world.) While there is latitude for discussion > > here, I feel pretty strongly that we should not lie about either the > > module name or the function name. I like the current split: the provider > > name and the probe name are arbitrary names that can convey arbitrary > > meaning; the module name and the function name represent the Truth of > > the implementation.(*) > > But what place does the Truth of Implementation have in a high-level > provider with stable probes? Or, put another way, it seems like the Truth > of Implementation is inherently at-odds with stability, and thus that the > second and third elements of the probe tuple should not be used by > high-level providers.That''s exactly what I''m saying -- that the probemod and probefunc should _always_ reflect the Truth, regardless of the stablity. Which is not to say that they would always be Private or Unstable; if the stability of the function itself is Stable (e.g., with parts of the DDI or the libc API), it might well make sense to have an SDT provider that had a Stable (or Evolving) name stability for the function component of the probe tuple. As for what place the Truth has: in a debugger, the Truth always has a place -- for it is patronizing and dangerous for the debugger to assume that it is smarter than the person doing the debugging by hiding the Truth. As a more concrete example in this case: if one were looking at a USDT probe from a database (say), and the transactions (say) simply didn''t add up, and one wanted to understand where that mydb:::transaction-start probe was _really_ coming from, it would be unfortunate if the module and function elements of the probe tuple couldn''t be relied upon to take you there. Yes, one could aggregate on ustack() and so on -- but that assumes that the probe is being hit. In my opinion, the presence of a semantically stable probe can and should assist in understanding the implementation. The provider name and the probe name convey semantics; the module name and the function name convey implementation -- because one often wants the former and one occasionally needs the latter. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
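For readers less familiar with the probe tuple: the module and function
components being discussed are what dtrace -l reports alongside the
provider and probe name. For example, one could list the hypothetical
database probe from the message above (the probe itself is invented for
illustration) and read the implementation location off the MODULE and
FUNCTION columns, while PROVIDER and NAME carry the stable semantics:

    # dtrace -l -n 'mydb*:::transaction-start'

That split is exactly the convention being argued for here.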
Brendan Gregg - Sun Microsystems
2006-Sep-28 19:36 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Nico,

On Thu, Sep 28, 2006 at 01:00:17PM -0500, Nicolas Williams wrote:
> On Thu, Sep 28, 2006 at 09:48:50AM -0400, James Carlson wrote:
> > While TCP and UDP currently run only on IPv[46], that hasn''t always
> > been so (TUBA ran over CLNP, and RFC 1791 ran over IPX). Even if we
> > wave away this issue ("we really have all the protocols we''ll ever
> > need"), it begs questions about how tunnels are represented and how
> > users might interact with both general encapsulation and stack
> > instances.
>
> If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that
> make TCP and IP information available in the probes at once) results in
> better tracing performance than only having "ipv4," "ipv6" and "tcp"
> providers that don''t violate abstractions, then it may be worth doing.
>
> IOW, by having TCP and IP in the same module we may be able to have a
> provider such that script writers need not resort to threading probes
> for TCP and IP through thread local variables, hash tables, etc, and
> that seems worthwhile.

The tcp provider was really tcp/ip - as can be seen in the examples,

   # dtrace -n ''tcp:::receive /args[3]->tcp_dport == 80/
       { @pkts[inet_ntos(args[2]->ip_daddr)] = count(); }''
   dtrace: description ''tcp:::receive'' matched 1 probe
   ^C
     192.168.1.8                        9
     fe80::214:4fff:fe3b:76c8/10       12
     192.168.1.51                      32
     10.1.70.16                        83
     192.168.7.3                      121
     192.168.101.101                  192

And I don''t think that an ipinfo_t violates an abstraction of IP, as
if you use,

   ip:::receive
   /args[2]->ip_ver == 4/
   {
           args[2] -- all original IPv4 header info
   }

   ip:::receive
   /args[2]->ip_ver == 6/
   {
           args[2] -- all original IPv6 header info
   }

   ip:::receive
   {
           args[2] -- all common header info, such as ip_saddr/ip_daddr,
           plus additional members depending on IPv4/IPv6
           (and if that matters - use the previous clauses!!!)
   }

The difference with an ipinfo_t is this,

 - Matching members that I shouldn''t (an IPv4 field while in an IPv6 clause)
   gives me NULL rather than an error. If I match an IPv4 field while in
   a generic clause - then for IPv4 traffic it is defined, and for IPv6 it
   is null.

I don''t see that as a big problem.

> But I agree that this isn''t the Right Way (tm) to do it. That there
> should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers.

One problem is pollution of the provider space. If we can stick to
"ip", "tcp", etc (or prefixed with "net-"), then that is minimised.

Do we really, really need an SCTP provider right now? We are about to solve
a ton of observability problems (who is connecting to my web server, show
me bytes by address, bytes by port, etc); we don''t want to get hung up
on something that won''t be used. Should we have a CHAOS provider as well? ;)

Brendan

--
Brendan [CA, USA]
Brendan Gregg - Sun Microsystems
2006-Sep-28 19:40 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Folks, On Thu, Sep 28, 2006 at 10:50:54AM -0700, Bryan Cantrill wrote: [...]> In terms of this case: ip/tcp should be in either the provider name > (e.g. ip:::send or net-ip:::send, which I think makes sense) or the > probe name (e.g. net:::ip-send, which I think still makes sense but > is slightly muddier).I''ve updated the website to have the ip:::send style. It''s not so bad. Should we actually do the net-ip:::send style? (would customers keep asking "what''s with the dash - why isn''t this net:ip::send?"?). This is something we can think about. It would be a minor tweak to change the provider name later, once we are certain which way to go. Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> The difference with an ipinfo_t is this, > > - Matching members that I shouldn''t (an IPv4 field while in an IPv6 clause) > gives me NULL rather than an error. If I match an IPv4 field while in > a generic clause - then for IPv4 traffic it is defined, and for IPv6 it > is null. > > I don''t see that as a big problem.I do, for the reasons I outlined before. I think it''s a really unclear and confusing model because IPv4 and IPv6 are not the same thing. If this ''info'' portion is limited to just accessing the TCP state (rather than the previous-header-no-longer-in-evidence), then I think you get 99 44/100ths of what you''re asking -- including this example -- but without the layering trouble.> > But I agree that this isn''t the Right Way (tm) to do it. That there > > should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. > > One problem is pollution of the provider space. If we can stick to > "ip", "tcp", etc (or prefixed with "net-"), then that is minimised.How exactly is having "ipv4" and "ipv6" more polluting than having "ip?" I''m missing the argument that says that this is an area that needs strict conservation -- to the point of merging the immiscible IPv4 and IPv6 semantics.> Do we really, really need an SCTP provider right now? We are about to solve > a ton of observability problems (who is connecting to my web server, show > me bytes by address, bytes by port, etc); we don''t want to get hung up > on something that won''t be used.I think the project is incomplete if it doesn''t support the primary protocols supported by Solaris, and SCTP is clearly one of them, no less than UDP or TCP. Now, the project you''re proposing may well be useful even if it is incomplete, but there ought to be a plan to make it complete some day. (And I think whether you putback something that''s deliberately incomplete or not is between you, your RTI Advocate, and your conscience.)> Should we have a CHOAS provider as well? ;)Of course not. Nothing supports CHAOS (including Solaris), and it''s well beyond obsolescent. I hope you''re not trying to argue that a somewhat newer protocol such as SCTP is already "obsolete." That doesn''t make sense to me. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems writes:> G''Day Folks, > > On Thu, Sep 28, 2006 at 10:50:54AM -0700, Bryan Cantrill wrote: > [...] > > In terms of this case: ip/tcp should be in either the provider name > > (e.g. ip:::send or net-ip:::send, which I think makes sense) or the > > probe name (e.g. net:::ip-send, which I think still makes sense but > > is slightly muddier). > > I''ve updated the website to have the ip:::send style. It''s not so bad. > Should we actually do the net-ip:::send style? (would customers keep > asking "what''s with the dash - why isn''t this net:ip::send?"?).I think the issue rests on the likelihood that there''ll be a confusing use of "ip" somewhere else that needs to be dtrace-providered, and whether you think networking is a first class citizen of the operating system. For what it''s worth, having a separate "ipv4" and "ipv6" is attractive for that first problem, because it''s far more specific and less likely to collide with anything else in the future than the simple "ip." As for the second part, I don''t think treating IP as a "net-" bag on the side is the right way to go, but I also don''t care too much. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Thu, Sep 28, 2006 at 12:36:05PM -0700, Brendan Gregg - Sun Microsystems wrote:> On Thu, Sep 28, 2006 at 01:00:17PM -0500, Nicolas Williams wrote: > > If it''s possible that having "tcp/ipv4" and "tcp/ipv6" providers (that > > make TCP and IP information available in the probes at once) results in > > better tracing performance than only having "ipv4," "ipv6" and "tcp" > > providers that don''t violate abstractions, then it may be worth doing. > > > > IOW, by having TCP and IP in the same module we may be able have a > > provider such that script writers need not resort to threading probes > > for TCP and IP through thread local variables, hash tables, etc, and > > that seems worthwhile. > > The tcp provider was really tcp/ip - as can be seen in the examples,Yes, but I happen to agree with James that we should have a "tcp" provider that doesn''t expose lower-layer information other than identifying what protocol TCP is layered over. And I do think we need separate ipv4 and ipv6 providers, for the reasons that James gives. So, what I see as needed are: - ipv4, ipv6 - udp, udp/ipv4, udp/ipv6 - tcp, tcp/ipv4, tcp/ipv6 - sctp, ... (Is ''/'' allowed in provider names?) Oh, and add ARP, ICMP, IGMP, etc... It might even be nice to have a bare-bones ''ip'' provider that allows access to v4 and v6 _packets_ sent/received, but allowing access only to the packet data.> And I don''t think that an ipinfo_t violates an abstraction of IP, as > if you use, > > ip:::receive > /args[2]->ip_ver == 4/ > { > args[2] -- all original IPv4 header info > } > > ip:::receive > /args[2]->ip_ver == 6/ > { > args[2] -- all original IPv6 header info > }This is akin to dynamic typing, which is surprising to me here. I think I understand where you''re coming from: fewer probes makes for nicer scripts. Why should I have v4- and v6-specific probes? Right? But to me the lack of conditionals in DTrace probe bodies makes it hard to use dynamic typing: I''d still end up with multiple probes, only now using different predicates instead of provider names. I suspect that scripts will be more efficient if we make such predicates unnecessary and make the providers force the distinction between v4 and v6.> > But I agree that this isn''t the Right Way (tm) to do it. That there > > should be "ipv4," "ipv6," "udp," "tcp," and "sctp" providers. > > One problem is pollution of the provider space. If we can stick to > "ip", "tcp", etc (or prefixed with "net-"), then that is minimised.It''s not an open-ended namespace though -- we see new protocols at these layers at glacial rates.> Do we really, really need an SCTP provider right now? We are about to solve > a ton of observability problems (who is connecting to my web server, show > me bytes by address, bytes by port, etc); we don''t want to get hung up > on something that won''t be used. Should we have a CHOAS provider as well? ;)I think we need to know how an SCTP provider would fit in. It shouldn''t have to be putback at the same times as the others, no. And we do want SCTP to be used! Particularly by applications that could benefit from its semantics (e.g., NFSv4). Nico --
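To make the contrast concrete, the same selection of IPv6 receive traffic
could be written either way -- a sketch, assuming the hypothetical probe
and member names being debated in this thread:

   /* Style 1: one generic ip provider, version selected by predicate */
   ip:::receive
   /args[2]->ip_ver == 6/
   {
           @v6[inet_ntos(args[2]->ip_saddr)] = count();
   }

   /* Style 2: a dedicated ipv6 provider, no predicate needed */
   ipv6:::receive
   {
           @v6[inet_ntos(args[2]->ip_saddr)] = count();
   }

The predicate in the first style is evaluated on every firing, which is
the overhead the separate-provider approach avoids.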
> > Do we really, really need an SCTP provider right now? We are about to solve > > a ton of observability problems (who is connecting to my web server, show > > me bytes by address, bytes by port, etc); we don''t want to get hung up > > on something that won''t be used. > > I think the project is incomplete if it doesn''t support the primary > protocols supported by Solaris, and SCTP is clearly one of them, no > less than UDP or TCP.This general idea -- that if you don''t solve everything you must solve absolutely nothing -- is toxic and corrosive. One of the most subtle software engineering tasks is differentiating that which must be done now versus that which can be done later; if one demands that everything be done now, one will end up doing nothing at all -- a pathology that has sunk many a project. The key is to be careful when making decisions that might unnecessarily constrain future decisions; while one can not afford to do all future work in the present, one must be careful not make such future work impossible. In this case, the presence of a tcp or ip provider has no impact on a hypothetical future sctp provider, and the lack or presence of an sctp provider has no bearing on the completeness of a tcp or ip provider. If you would like to implement an sctp provider, you should by all means do so -- but it is not and nor should it be Brendan''s responsibility as part of this work. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Darren.Reed at Sun.COM
2006-Sep-29 01:17 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Brendan,

Brendan Gregg - Sun Microsystems wrote:
> G''Day Darren,

You do know that Australians reserve that greeting for use only with
foreigners, don''t you ? ;)

> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote:
> > Brendan Gregg - Sun Microsystems wrote:
> [...]
> > What I''d like to see dtrace be able to do is follow a packet through
> > the kernel, not a thread of execution. fbt is for following execution.
> > sdt is up to how it is used, but most often, it is just random hooks
> > in the code.
> >
> > The difference between using fbt and being able to follow a packet
> > is when a packet is put on a queue, the fbt path is terminated,
> > even though the packet still has places to go and I/O ports to see.
> >
> > If the networking dtrace provider doesn''t deliver the capability
> > to trace a packet through the kernel then it is not delivering a
> > key feature that is needed by many people.
>
> Noone is saying that the provider will never do this. If you are saying
> that this feature is important to you, then sure - I understand.
> And if many people means many engineers (which in turn helps them
> achieve cool things), then I understand that too.

Can you comment on what the provider is being designed to do, if it
won''t allow for following a packet through the kernel?

If I look at the design presented on the web page, it tells us what
probes will be implemented and some ways in which they can be used.
What appears to be missing is a statement of "this is what we''re trying
to achieve", from a more lofty level. Are you just trying to provide
stable names for interesting places to observe information inside IP,
or something else? Whatever the goal, it might be worth mentioning it.

> ...
>
> To have a single DTrace script check for all types of scans and give
> a report will still have some value. A customer finds unexplained
> network utilization, and can run a quick script to check what it is -
> rather than install a NIDS if one isn''t available.

If they''re trying to diagnose unexplained network utilization then they
should not be using dtrace - it isn''t the right tool for this job. Put
a network sniffer on there or use tcpdump or ethereal or similar.

How will dtrace tell me that someone has installed a backdoor''d version
of some PHP module that runs an IRC bot connecting to some chat server
on the ''net? If dtrace were the right tool to use, how would you use it
to find the answer to that question, or to even discover this issue was
present?

> > Some other probes that might be useful:
> > - RTT calculations from TCP timestamp measurements
> > - TCP window size changes
>
> These could be scripted from tcp::send and tcp::receive alone; but
> separate probes may make sense.

That''s not what I''m getting at. TCP makes internal calculations about
what the average RTT to its peer is. Using ::send and ::receive will
lead to independent calculations being made that do not necessarily
reflect what TCP thinks.

While the window size can be observed from ::receive, it isn''t the size
I want to know about, it''s when it changes, so I can hopefully see what
the old value is and what the new one is.

> > - when a packet is dropped because it is "bad" (checksum, etc)
>
> Yep - we need a tcp::drop-checksum or some such.

Or maybe: ip:::drop, tcp:::drop, etc, and make an arg the type of drop?

> > - look through "netstat -s" output as there is quite a large
> >   number of stats there that are worthy of being identified
> >   with a probe.
>
> Sure. So long as we really do want to probe it, and that the kstat
> value isn''t sufficient. Thinking of sample scripts that use them should
> help here.

It''s not the kstat value that''s important, it''s being able to see what
the packet looked like when the kernel decided to drop it.

Darren
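As a sketch of what the script-it-yourself approach looks like for the
window-size case: this assumes hypothetical tcp:::receive argument
members (tcp_sport and tcp_window are illustrative names, not part of
the proposal), and, as noted above, it only observes the window
advertised on the wire, not the values TCP computes internally:

   /* Report changes in the peer-advertised window, keyed (for
    * brevity) on the remote port only. */
   tcp:::receive
   /win[args[3]->tcp_sport] != args[3]->tcp_window/
   {
           printf("port %d window %d -> %d\n", args[3]->tcp_sport,
               win[args[3]->tcp_sport], args[3]->tcp_window);
           win[args[3]->tcp_sport] = args[3]->tcp_window;
   }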
Nicolas Williams
2006-Sep-29 04:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
On Wed, Sep 27, 2006 at 12:11:38PM -0700, Brendan Gregg - Sun Microsystems wrote:> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote: > > The difference between using fbt and being able to follow a packet > > is when a packet is put on a queue, the fbt path is terminated, > > even though the packet still has places to go and I/O ports to see. > > > > If the networking dtrace provider doesn''t deliver the capability > > to trace a packet through the kernel then it is not delivering a > > key feature that is needed by many people. > > Noone is saying that the provider will never do this. If you are saying > that this feature is important to you, then sure - I understand. > And if many people means many engineers (which in turn helps them > achieve cool things), then I understand that too.The fbt provider allows you to probe function calls/returns. It still requires that the scripter thread together the actual function calls into a trace of call flow -- as far as the kernel side of DTrace is concerned the fbt probe firings are discrete events. The same should apply to a network provider, but I''m afraid it won''t. The thing that makes it easy to thread function calls/returns is that they all happen in the same thread of execution and in sequence. The same is not necessarily true of "packets." There''s multicast, and SOCK_RAW, fragmentation, etc... Also, do you want to trace flows all the way from the link layer through applications (e.g., NFS, Apache)? That seems even more daunting. How do you thread receipt of fragmented packets into a nice call flow- style of report? If it takes associative arrays to do it then it will be expensive. In order to rely on thread locals one might have to be able to rely on packets being processed all in one thread -- an assumption that we shouldn''t want to make, even if it were true. But I somehow doubt that would be the primary use of a network provider. Nico --
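A sketch of the associative-array style described here, assuming
hypothetical ip:::receive and tcp:::receive probes and (as later
proposed for tcp:::state-*) an arg0 that acts as a unique identifier --
the point being that the script, not the kernel, carries the
correlation state:

   ip:::receive
   {
           /* remember when the IP layer saw this packet/connection */
           rcvd[arg0] = timestamp;
   }

   tcp:::receive
   /rcvd[arg0]/
   {
           @lat["IP-to-TCP latency (ns)"] = quantize(timestamp - rcvd[arg0]);
           rcvd[arg0] = 0;
   }

Every firing pays for a global associative-array lookup here, and the
scheme only works if arg0 really does identify the same packet at both
layers -- which is exactly the guarantee that fragmentation, multicast
and queueing make hard to provide.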
Brendan Gregg - Sun Microsystems
2006-Sep-29 04:52 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day James, On Thu, Sep 28, 2006 at 08:49:00AM -0400, James Carlson wrote:> Bryan Cantrill writes: > > > net:ip::send args[2] is ipinfo header > > > net:tcp::send args[2] is ipinfo header > [...] > > > net:ipv4::send args[2] is ipv4 header > > > net:ipv6::send args[2] is ipv6 header > > > net:tcp::send args[2] is ipinfo header > [...] > > Option (1) is much more in keeping with the DTrace conventions -- and is > > a much more scalable option, in my opinion. A Stable net provider that > > used Option (2) would have to essentially lie about the module name and > > then make that a Stable component in the probe tuple -- an idea that I''m > > quite uncomfortable with. So I would weigh in quite strongly towards > > Option (1) (or its ilk) unless that is somehow technically impossible... > > Unfortunately, the IPv4 and IPv6 headers are not actually the same. > It would have been nice if the ipng group had just extended IP, but > they did some rototilling as well. > > This means that while "version," "source address," and "destination > address" are in common (except perhaps for representation of > addresses, which we can probably ignore) the rest is not. > > There are three fields that are renamed. The "TOS" field in IPv4 > becomes the "TC" field in IPv6. The contents are actually mostly the > same in practice (using DSCP), so perhaps these could be aliases. > Similarly, "TTL" becomes "hop limit" without an actual change in > semantics. And "protocol" becomes "next header." > > The options handling is quite different between the two. In IPv4, the > options are part of the IP header, which is why there''s an "IHL" > field. In IPv6, the options are in *separate* headers (called > "extension headers") that look to the user somewhat like > encapsulation. > > All of the IPv4 fragmentation bits (ID, DF, MF, offset) are different > in IPv6. Instead of being fixed header fields, these turn into an > option extension header. > > IPv6 doesn''t have an IP header checksum or flags. IPv4 doesn''t have a > "flow label" (but, then, perhaps nobody uses the IPv6 flow label > anyway). > > In other words, I''m puzzled about how ''ipinfo'' will work in practice. > Suppose we do go with option (1). What does this do? > > net:ip::send > /args[2]->ip_tos == 0x80/ > > Does it select all IPv4 packets with TOS 0x80, and ignore IPv6? Does > it select both IPv4 and IPv6 with TC set to 0x80? Does it cause one > of those hard-to-understand "at DIF offset" errors? > > Repeat the same for the not-in-common fields. > > I think merging IPv4 and IPv6 fields into a single structure is, at > least, to me, a confusing idea.I bet you are confused - you are thinking structures, not DTrace translators. With translators I can take an IPv4 header, and export the IPv4 members into ipinfo_t -- which looks and feels like an IPv4 header. The documentation will just list the IPv4 members for when IPv4 traffic is probed, as though that really was the structure. Translators can also take an IPv6 header, and export the IPv6 members into an ipinfo_t. Any unused ipinfo_t members can be set to NULL. If you want to think that you really just have an IPv4 header, or really just an IPv6 header - then you still can - that''s how ipinfo_t will behave. Of course, if you start accessing members that shouldn''t be defined, you''ll get NULL. (This is assuming that the one probe, ip:::send, can access args[2] from two different translators depending on the probe location. This may not be possible - I''m still checking. 
If it does turn out to be impossible,
then that will put a bullet in the head of ipinfo_t).

How ipinfo_t (args[2]) will look in scripts,

   ip:::send
   /args[2]->ip_ver == 4/
   {
           args[2] contains valid IPv4 members, straight from the IPv4 header
   }

   ip:::send
   /args[2]->ip_ver == 6/
   {
           args[2] contains valid IPv6 members, straight from the IPv6 header
   }

Some fields are in common -- this is an ADVANTAGE,

   ip:::send
   {
           @out_packets[inet_ntos(args[2]->ip_daddr)] = count();
   }

Other fields that are not common will be NULL.

What other network level protocols may we ever see apart from IPv4
and IPv6? An advantage of "ip" is that customers are never in the
dark about network traffic - they can see everything. The documentation
would say "The ip provider matches all network level protocols. If you
want to match a particular protocol, then filter using predicates."

If you made a mistake and used an IPv6 field in an IPv4 clause, then you
get NULL. The solution? Don''t use IPv6 fields in IPv4, and vice versa!
You''d have to dream up some incredibly unlikely scenarios to explain why
this is still confusing. Such as,

*** Unlikely Scenario #1 ***

Joe is a sysadmin. Joe knows IPv4 really well, and would like DTrace to
match IPv4-specific fields. Joe''s entire site runs on IPv6, yet Joe has
never heard of IPv6. Joe has heard of DTrace. (Joe had spent the last 15
years in a coma after being struck on the head by a low flying plane).

Joe logs into a server and uses DTrace to match specific IPv4 fields,
then becomes confused to find that they are NULL. Joe is confused. Joe
doesn''t look at the ip provider documentation, which will clearly state
that the provider matches both IPv4 and IPv6, with examples of matching
them in predicates.

*** Unlikely Scenario #2 ***

James is a network expert, and can remember IPv4 TOS flags off by heart.
James discovers a new DTrace provider - ip. IP matches all ip level
traffic, both IPv4 and IPv6. Great. James immediately suffers
uncharacteristic short term memory loss, and then types in a script to
match IPv4 TOS without using a predicate to filter just IPv4.

   net:ip::send
   /args[2]->ip_tos == 0x80/

This matches IPv4 correctly, and nothing from IPv6 (ip_tos will be NULL).
Suddenly, James''s memory returns, and he becomes confused as to why the
script worked - when in fact he should have matched ip_ver in the
predicate.

...

If someone is matching the "not in common" fields - then they already
know the difference between IPv4 and IPv6.

The ipinfo_t could always provide a pointer to the raw header, whatever
it is, if the raw un-dtrace-translated header is desired. (However, the
two should contain identical populated members (skipping the NULLs) - so
it''s six of one and half a dozen of the other).

Brendan

--
Brendan [CA, USA]
On Thu, Sep 28, 2006 at 09:52:34PM -0700, Brendan Gregg - Sun Microsystems wrote:> What other network level protocols may we ever see apart from IPv4 > and IPv6? An advantage of "ip" is that customers are never in the > dark about network traffic - they can see everything.But they are also aware of both, v4 and v6. Asking them to use two probes instead of one (or two anyways, with predicates selecting on v4 vs. v6) hardly seems like a big deal, while it would simplify the proposal. Also, it''s not clear to me that you really can have probes where the types of their arguments are dynamically determined (you can always use unions, I suppose), or that that would be a good thing (it seems very unnatural to me). And they won''t always see everything -- DTrace can and will drop events :) Nico --
Brendan Gregg - Sun Microsystems writes:> > I think merging IPv4 and IPv6 fields into a single structure is, at > > least, to me, a confusing idea. > > I bet you are confused - you are thinking structures, not DTrace > translators.Fine. If you don''t like the word "structure," then use the words "name space." It''s equivalent in this context. Collapsing IPv4 and IPv6 headers into a single name space (in terms of what''s visible to the dtrace user, regardless of how that''s actually implemented under the covers) is confusing. As far as the user is concerned, there''s a structure named "ipinfo" that contains members that wink in and out of existence on different packets. That _might_ make sense if you were trying to represent a single protocol that has options (e.g., the decoded TLVs in DHCP or PPP), but makes no sense when mashing two distinct protocols together. While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After all, both TCP and UDP have port numbers and checksum fields. The extra parts that TCP has could just be NULL when a UDP packet comes in. Why would that not also be a good solution? It seems to me that you''re merging based on the fact that these two separate protocols happen to both be referred to as "IP" by users, even if they''re really quite different. I don''t think this is the right criterion.> If you want to think that you really just have an IPv4 header, or > really just an IPv6 header - then you still can - that''s how ipinfo_t > will behave. Of course, if you start accessing members that shouldn''t > be defined, you''ll get NULL.This still seems quite wrong to me, as it requires the user to add a test on ip_ver to any clause that tests something other than source or destination address. Otherwise, the results are ambiguous. Testing ip_froffset == 0 will catch all IPv6 frames, including fragments. Testing ip_ttl == 0 will catch IPv6 frames as well. I''d rather see these "mistakes" throw DIF errors than just silently do something _different_ from what the user asked.> How ipinfo_t (args[2]) will look in scripts,I understood the proposal, thanks. I just don''t agree.> What other network level protocols may we ever see apart from IPv4 > and IPv6? An advantage of "ip" is that customers are never in the > dark about network traffic - they can see everything.The main (and unfortunate) one would be InfiniBand. But other than that, I agree that it could be swept under the rug.> Joe is a sysadmin. Joe knows IPv4 really well, and would like DTrace to > match IPv4 specific fields. Joe''s entire site runs on IPv6, yet Joe has > never heard of IPv6.That last sentence represents an incorrect assumption. Change it to: "Joe''s site runs mostly IPv4, but has a small amount of IPv6, and Joe doesn''t realize that IPv6 is confusing in dtrace and will accidentally match some IPv4 clauses because we merged the name space." Or even: "Joe''s site runs both IPv4 and IPv6, but he''s an IPv4 user and doesn''t care a whit to learn anything about IPv6; he thus proceeds as though IPv6 doesn''t exist."> Joe logs into a server and uses DTrace to match specific IPv4 fields, > then becomes confused to find that they are NULL.Worse, it''ll silently match packets he wasn''t expecting to match. If he doesn''t also look at version, these will be some very strange looking beasts. Worse still, many people aren''t just doing trace(args[2]->ip_src). They use the probe point to bump a counter or as a condition for some other action. 
In that case, the failure is silent, and corrupts the results, as there''ll be no record that the wrong thing was matched.> James is a network expert, and can remember IPv4 TOS flags off-by-heart. > James discovers a new DTrace provider - ip. IP matches all ip level traffic, > both IPv4 and IPv6. Great. James immediately suffers uncharacteristic short > term memory loss, and then types in a script to match IPv4 TOS without > using a predicate to filter just IPv4. > > net:ip::send > /args[2]->ip_tos == 0x80/ > > This matches IPv4 correctly, and nothing from IPv6 (ip_tos will be NULL).More to the point: from an network administrative point of view, there''s no difference between ip_tos and ip_tc.> If someone is matching the "not in common" fields - then they already > know the difference between IPv4 and IPv6.The point is that they may not realize the effect of this merge. You seem to think that it''s "obvious." Other than for the IP source and destination address, I strongly disagree. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
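The silent-corruption case reads like this -- a sketch using the merged
ipinfo_t model under discussion, where ip_froffset is an IPv4-only
member:

   /* Intended: count only unfragmented IPv4 datagrams. Under the
    * merged model, ip_froffset is NULL (0) for every IPv6 packet,
    * so this clause also counts all IPv6 traffic, and nothing in
    * the output records the mismatch. */
   ip:::receive
   /args[2]->ip_froffset == 0/
   {
           @nonfrag = count();
   }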
Bryan Cantrill writes:> > I think the project is incomplete if it doesn''t support the primary > > protocols supported by Solaris, and SCTP is clearly one of them, no > > less than UDP or TCP. > > This general idea -- that if you don''t solve everything you must solve > absolutely nothing -- is toxic and corrosive.Great. So then why did you snip away the rest of my response and make it look as if that''s what I was saying? I didn''t say that he _must_ solve everything. I said that it''s incomplete without it, we''ll need to provide it _later_, and that thinking through what the other providers would look like is helpful.> In this case, the presence of a tcp or ip provider has no impact on a > hypothetical future sctp provider, and the lack or presence of an sctp > provider has no bearing on the completeness of a tcp or ip provider. IfI thought we were talking about a "dtrace network provider" in this thread. I must have misunderstood.> you would like to implement an sctp provider, you should by all means > do so -- but it is not and nor should it be Brendan''s responsibility as > part of this work.By creating these stable providers, Brendan is setting the high level architecture for *ALL* networking providers, including those yet to come. At least considering the protocols we have is part of that job. I don''t think the structure he''s proposing really does extend well to other protocols, because it blurs some lines that are fundamental parts of networking. In particular, with SCTP, a single connection can be multihomed with both IPv4 and IPv6 addresses. Exposing the actual IP headers through a single name space that forces these two to be aliased together will cause confusion. Leaving SCTP aside for a moment, there are other problems here. In the proposed TCP provider, I can access bits out of the IP header below. But why stop there? Why not allow access to all layers below? It would not be possible to do so in any coherent manner, but the fact that the layering distinction has been blurred raises questions about where this goes in the future. What about the transmit side? Brendan''s proposal allows access to the fields in an IP header that, per the standards, doesn''t even exist. It''s an implementation artifact. Why is that a stable thing to do? And if he doesn''t do that, then his transmit and receive bits become strangely asymmetric and much harder to use. Or consider what happens if someone does an RPC provider in the future. That''d likely be helpful. RPC runs on top of transports such as TCP, UDP, and (perhaps) SCTP. What sort of "info" representation could RPC possibly provide to the dtrace user? Given the architectural direction that Brendan is setting here, it would have to merge all of those tranport layers together into one big glop. I suspect that you''re looking for simplicity here, and that''s great, but I don''t think this desire should mean discarding fundamental networking design principles (such as layering) as well. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Fri, Sep 29, 2006 at 08:03:51AM -0400, James Carlson wrote:> Or consider what happens if someone does an RPC provider in the > future. That''d likely be helpful. RPC runs on top of transports such > as TCP, UDP, and (perhaps) SCTP.And doors! And what not.> What sort of "info" representation > could RPC possibly provide to the dtrace user? Given the > architectural direction that Brendan is setting here, it would have to > merge all of those tranport layers together into one big glop.Well, the provider *could* format a string representation of the "info" and make that available as a probe argument. But this would be done always, no? SDT providers have no idea that a given probe is enabled, right? So it seems like a bit of a waste in the off chance that some probe is enabled. Also, who''s to say that the full IP headers will be available to transport layer implementations always? Isn''t that an implementation detail that we don''t want to expose through the provider to avoid constraining the implementation? A thought: why couldn''t such a string representation of some lower-layer information be computed once and kept in and accessed through the TCB? E.g., end-point addresses, that which Brendan seems to care about the most. Also, the reverse, having information made available about higher levels will surely be thought of too :) (At least packet classifier actions, no?) IF we could trace packet, or, rather, message/data flows between these protocols then the need for having all the info in one place can go away. But without being able to rely on one thread processing all of some fairly extended path (which we can''t guarantee without seriously constraining the implementation) one cannot rely on thread-local variables to do the heavy lifting and must instead rely on associative arrays (and then one would need a way to address individual packets). I think at a minimum we need a set of providers and probes that don''t violate abstractions much at all. Then add packet/message IDs if need be. Then add abstraction violations when they prove necessary and unavoidable :)> I suspect that you''re looking for simplicity here, and that''s great, > but I don''t think this desire should mean discarding fundamental > networking design principles (such as layering) as well.Yes. Nico --
On Fri, Nicolas Williams wrote:> On Fri, Sep 29, 2006 at 08:03:51AM -0400, James Carlson wrote: > > Or consider what happens if someone does an RPC provider in the > > future. That''d likely be helpful. RPC runs on top of transports such > > as TCP, UDP, and (perhaps) SCTP. > > And doors! And what not.and Infiniband and RDDP... :-) Spencer
Brendan Gregg - Sun Microsystems
2006-Sep-29 16:41 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Fri, Sep 29, 2006 at 07:41:00AM -0400, James Carlson wrote:
> Brendan Gregg - Sun Microsystems writes:
> > > I think merging IPv4 and IPv6 fields into a single structure is, at
> > > least, to me, a confusing idea.
> >
> > I bet you are confused - you are thinking structures, not DTrace
> > translators.
>
> Fine. If you don''t like the word "structure," then use the words
> "name space."
>
> It''s equivalent in this context. Collapsing IPv4 and IPv6 headers
> into a single name space (in terms of what''s visible to the dtrace
> user, regardless of how that''s actually implemented under the covers)
> is confusing.

Nope, you still don''t get it.

There is no point where customers will be told to use an IPv4 + IPv6
combo struct. If they predicated IPv4, use IPv4; if they predicated
IPv6, use IPv6. If neither, then they can only use a documented list
of common fields.

> As far as the user is concerned, there''s a structure named "ipinfo"
> that contains members that wink in and out of existence on different
> packets. That _might_ make sense if you were trying to represent a
> single protocol that has options (e.g., the decoded TLVs in DHCP or
> PPP), but makes no sense when mashing two distinct protocols together.

Rubbish - you aren''t thinking of how the final scripts will be used,
and are making unfounded comments on how you think they will be used.
How about you actually write some scripts, as I have done, and you will
see that there is no problem.

> While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After
> all, both TCP and UDP have port numbers and checksum fields. The
> extra parts that TCP has could just be NULL when a UDP packet comes
> in.
>
> Why would that not also be a good solution?

It would be if the provider was called "transport:::". That "ip" could
mean both ipv4 and ipv6 isn''t a strange concept.

> It seems to me that you''re merging based on the fact that these two
> separate protocols happen to both be referred to as "IP" by users,
> even if they''re really quite different. I don''t think this is the
> right criterion.
>
> > If you want to think that you really just have an IPv4 header, or
> > really just an IPv6 header - then you still can - that''s how ipinfo_t
> > will behave. Of course, if you start accessing members that shouldn''t
> > be defined, you''ll get NULL.
>
> This still seems quite wrong to me, as it requires the user to add a
> test on ip_ver to any clause that tests something other than source or
> destination address. Otherwise, the results are ambiguous.
>
> Testing ip_froffset == 0 will catch all IPv6 frames, including
> fragments. Testing ip_ttl == 0 will catch IPv6 frames as well.
>
> I''d rather see these "mistakes" throw DIF errors than just silently do
> something _different_ from what the user asked.

So here we go - you finally put your finger on THE difference!

If a user is matching IPv4 or IPv6 specific fields, AND they do
something they shouldn''t - like examine IPv6 fields in IPv4, YOU want
an error, rather than NULL. This is ridiculous. If a customer knows
enough about IPv4 and IPv6 to be matching specific fields - then they
know the difference between IPv4 and IPv6.

> > If someone is matching the "not in common" fields - then they already
> > know the difference between IPv4 and IPv6.
>
> The point is that they may not realize the effect of this merge.

There is no effect.

Brendan

--
Brendan [CA, USA]
Brendan Gregg - Sun Microsystems writes:> Rubbish - you aren''t thinking of how the final scripts will be used, > and are making unfounded comments on how you think they will be used.I''m somewhat surprised and taken aback that you can read my thoughts, but as I''ve already explained my objections more than once, I won''t belabor the point. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
On Fri, Sep 29, 2006 at 09:41:07AM -0700, Brendan Gregg - Sun Microsystems wrote:> Nope, you still don''t get it. > > There is no point where customers will be told to used a IPv4 + IPv6 > combo struct. If they predicated IPv4, use IPv4; if they predicated > IPv6, use IPv6. If neither, then they can only use a documented list > of common fields.Er, no, DTrace doesn''t work that way -- you can''t have probe arguments typed according to arbitrary predicate expressions! So, either you do "tell customers to use an IPv4 + IPv6 combo struct" or you don''t do this at all. Also, if you''d have predicates to make this probe useful then you might as well have separate ipv4 and v6 probes.> > While you''re blending IPv4 and IPv6, why not blend TCP and UDP? After > > all, both TCP and UDP have port numbers and checksum fields. The > > extra parts that TCP has could just be NULL when a UDP packet comes > > in. > > > > Why would that not also be a good solution? > > It would be if the provider was called "transport:::".Why propose that for IP but not for transport protocols then? Be consistent :) But I think it''s just not a workable idea. You could prototype it and see... Nico --
On Fri, Sep 29, 2006 at 12:28:56PM -0500, Nicolas Williams wrote:> On Fri, Sep 29, 2006 at 09:41:07AM -0700, Brendan Gregg - Sun Microsystems wrote: > > Nope, you still don''t get it. > > > > There is no point where customers will be told to used a IPv4 + IPv6 > > combo struct. If they predicated IPv4, use IPv4; if they predicated > > IPv6, use IPv6. If neither, then they can only use a documented list > > of common fields. > > Er, no, DTrace doesn''t work that way -- you can''t have probe arguments > typed according to arbitrary predicate expressions!I think Brendan was saying that if you want the IPv6 fields to be valid (that is, to be non-NULL), use a predicate. For whatever it''s worth. there is precedent for this general idea in the fileinfo argument to io:::start: having the file available when doing an I/O represents a total layer violation of the implementation (and as a result, we don''t have it all the time), but I think users of DTrace would agree that being able to correspond I/O operations to file names is tremendously useful. And indeed, utility -- above all else -- is the guiding principle of DTrace.> Also, if you''d have predicates to make this probe useful then you might > as well have separate ipv4 and v6 probes.Brendan made the point repeatedly, but just to make it again: it''s useful because the two protocols have fields in common -- and in fact, these are the fields that most people care about most of the time.> But I think it''s just not a workable idea. You could prototype it and > see...That''s exactly what''s going to happen. And I think it would be wise to defer any future discussion until such a prototype is in hand... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
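For reference, the io provider precedent reads like this (this one is a
shipping provider, not part of the proposal):

   /* I/O operations by file pathname; fileinfo_t is a best-effort
    * translation that degrades gracefully when no file is associated
    * with the I/O, rather than failing. */
   io:::start
   {
           @iops[args[2]->fi_pathname] = count();
   }

The analogy is that exposing IP information in transport probes is a
similar layer violation: not always available, but tremendously useful
when it is.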
Bryan Cantrill
2006-Sep-29 21:56 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
> So what resource is less limited than cpu breakpoint registers but can > induce entry to specified code (a dtrace probe)? Bingo: main memory > and ECC codes [2]. We can deliberately mark objects, to a precision > of an ECC code word, as bad - but list them in a tracking table which > the ECC handler checks for dtrace entry in preference to a real ECC > fault.We will never be doing this: it violates the safety constraint, in several dimensions. (One of which you outline below.)> The non-enabled probe cost is zero. The enabled probe cost is that > of a) rewriting the memory area in question with deliberate bad ECC; > a bit expensive to do (e.g.) to every packet on a 10GB link, > (especially if a bad-ECC write is significantly slower than a normal > write) plus b) another rewrite with good ECC, plus c) the actual > probe code. > > There is a reliability downside: the tracked memory may no > longer be checked for real faults. > It helps if an ECC codepoint is available which is not used > by any hardware fault (or combination of faults).Ah, yes -- the reliability "downside". This is but one tangible way in which this notion violates the safety constraint, and as such it will never be used by DTrace. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Bryan Cantrill
2006-Sep-29 22:19 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote:> Bryan Cantrill wrote: > >We will never be doing this: it violates the safety constraint, in > >several dimensions. (One of which you outline below.) > > I''m interested - which other dimensions do you have in mind?What if someone puts bad ECC on the text or data required to execute the bad ECC handler? How does DTrace know what this text and data are to prohibit it? More generally, what if someone puts bad ECC on text or data that needs to be accessed at TL>1? How is this handled? And what about memory that DTrace itself uses to (for example) store the lookup table to know how to correct the bogus bad ECC? Most generally, how can one prevent bad ECC from being placed on any word of memory that is accessed in a context in which a correctable ECC trap cannot be handled? Trust me that this is tricky -- and made much more so by this idea, as it creates many more memory accesses from the correctable handler. And all of these issues are in _addition_ to the issue that you mentioned: that by introducing bad ECC, you have lowered the fault tolerence of those words that you are marking. This alone is unacceptable and a violation of the safety constraint; the many other issues with this idea are just the nails in the coffin. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Adam Leventhal
2006-Sep-29 22:42 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Hi Jeremy, Recall that through DTrace, a privileged user can trace the contents of arbitrary memory addresses, and it performs all these actions with interrupts disabled. Your idea is clever, it just would be extremely complex to implement safely if, indeed, it could be done at all. Adam On Sat, Sep 30, 2006 at 11:40:12PM +0100, Jeremy Harris wrote:> Bryan Cantrill wrote: > >On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote: > >>Bryan Cantrill wrote: > >>>We will never be doing this: it violates the safety constraint, in > >>>several dimensions. (One of which you outline below.) > >>I''m interested - which other dimensions do you have in mind? > > > >What if someone puts bad ECC on the text or data required to execute > >the bad ECC handler? How does DTrace know what this text and data are > >to prohibit it? > > Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting > that Dtrace should both place the bad-ECC and log the placed regions. > > > More generally, what if someone puts bad ECC on text > >or data that needs to be accessed at TL>1? How is this handled? > > Yes, that is an issue. Can we constrain it in any way? What memory > is accessed at TL>1, and can we restrict it from this type of > tracing? Are there similar issues on non-SPARC architectures? > > >And what about memory that DTrace itself uses to (for example) store the > >lookup table to know how to correct the bogus bad ECC? > > What about it? You''re not very clear here; please expand. > > It''s just memory; given the assumption above > that it is Dtrace placing the bad ECC, it would seem possible to > not permit it as a traceable region. > > > Most generally, > >how can one prevent bad ECC from being placed on any word of memory that > >is accessed in a context in which a correctable ECC trap cannot be > >handled? > > By having Dtrace responsible for placing the bad ECC. > > - Jeremy Harris > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org-- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
Bryan Cantrill
2006-Sep-29 22:53 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
> >>>We will never be doing this: it violates the safety constraint, in > >>>several dimensions. (One of which you outline below.) > >>I''m interested - which other dimensions do you have in mind? > > > >What if someone puts bad ECC on the text or data required to execute > >the bad ECC handler? How does DTrace know what this text and data are > >to prohibit it? > > Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting > that Dtrace should both place the bad-ECC and log the placed regions.No, you were being perfectly clear -- you''re just not understanding what I''m saying. DTrace is but a conduit; assuming you''re going to allow programmability around this, you will be allowing users to put bad ECC on arbitrary memory. Offering this programmability introduces all of the issues that I mentioned.> > More generally, what if someone puts bad ECC on text > >or data that needs to be accessed at TL>1? How is this handled? > > Yes, that is an issue. Can we constrain it in any way? What memory > is accessed at TL>1, and can we restrict it from this type of > tracing?I''m not going to give you a tutorial on SPARC; suffice it to say that this is an extraordinarily difficult problem to bound or solve -- and the failure mode is fatal, which means the solution must not be brittle. It''s very hard to see how one could constrain or know the memory accessed in illegal contexts in a way that was sufficiently robust. But I''m not going to have an extended discussion about this idea; it''s fatally flawed (at least from the DTrace perspective), for the reasons that I have already expanded on. - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Richard L. Hamilton
2006-Sep-30 06:58 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
carlsonj wrote:> If we''re going to get into snoop, I''d rather see us > cast a wider net > than just adding a few minor features.I''ve thought more than once that it would be great if snoop implemented its protocol interpretation with shared objects with a defined interface, so that support for interpreting additional protocols could be added without modifying snoop proper. Presumably this would also include an index file that provided magic number, offset and size thereof, parent protocol(s), etc; i.e. whatever would be needed to determine when that protocol interpretation module was applicable. At least when reading directly from a network and writing interpreted output (rather than to a capture file), it would probably need to load all modules at startup rather than as needed (and not lazy-loaded, either), so as to not suffer from delays that might cause it to lose packets while running. This message posted from opensolaris.org
Jeremy Harris
2006-Sep-30 21:45 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Nicolas Williams wrote:> On Wed, Sep 27, 2006 at 12:11:38PM -0700, Brendan Gregg - Sun Microsystems wrote: >> On Mon, Sep 25, 2006 at 04:00:02PM -0700, Darren.Reed at Sun.COM wrote: >>> The difference between using fbt and being able to follow a packet >>> is when a packet is put on a queue, the fbt path is terminated, >>> even though the packet still has places to go and I/O ports to see. >>> >>> If the networking dtrace provider doesn''t deliver the capability >>> to trace a packet through the kernel then it is not delivering a >>> key feature that is needed by many people. >> Noone is saying that the provider will never do this. If you are saying >> that this feature is important to you, then sure - I understand. >> And if many people means many engineers (which in turn helps them >> achieve cool things), then I understand that too. > > The fbt provider allows you to probe function calls/returns. It still > requires that the scripter thread together the actual function calls > into a trace of call flow -- as far as the kernel side of DTrace is > concerned the fbt probe firings are discrete events.[...]> If it takes associative arrays to do it then it will be expensive.Generalizing somewhat away from networking, we would like tracing of data objects, independent of the execution environment. I started wondering if dtrace could make use of the "data breakpoint" facility that many processors offer these days, labeling a data object for triggering probes just by setting such a trap-on-access. Two problems: First, the precision may be too tight, trapping exactly one address, when you want a whole mblk''s worth (the reverse, trapping a cacheline or page is ok, just filter in the fired-probe code). Second, processor breakpoint registers are a scarce resource. We would not be able to handle an indefinitely complex trace operation, over several mblks at once. The situation is obviously worse when Fred across the hall is tracking disk I/O blocks on the system at the same time. On the other hand, it''s usable in user-space and kernel-space including interrupt level, which is a good start. To attack the first problem above, perhaps we should ask cpu designers for data-breakpoint *range* [1] facilities? So what resource is less limited than cpu breakpoint registers but can induce entry to specified code (a dtrace probe)? Bingo: main memory and ECC codes [2]. We can deliberately mark objects, to a precision of an ECC code word, as bad - but list them in a tracking table which the ECC handler checks for dtrace entry in preference to a real ECC fault. The non-enabled probe cost is zero. The enabled probe cost is that of a) rewriting the memory area in question with deliberate bad ECC; a bit expensive to do (e.g.) to every packet on a 10GB link, (especially if a bad-ECC write is significantly slower than a normal write) plus b) another rewrite with good ECC, plus c) the actual probe code. There is a reliability downside: the tracked memory may no longer be checked for real faults. It helps if an ECC codepoint is available which is not used by any hardware fault (or combination of faults). Very large objects should be marked with bad-page-translations in MMU tables rather than bad-ECC to reduce the costs [3]. Probes would have to be dynamically placed on memory areas; in the dtrace world I suspect this would be best done as a result of the firing of some traditional, code-based, probe - thus allowing restriction of the cost of the enabled data probes by appropriate prefiltering conditions. 
- Jeremy Harris Footnotes: [1], [2], [3] These may be patentable ideas. Since I no longer work for Sun, they are hereby placed in the public domain, with no restrictions.
Jeremy Harris
2006-Sep-30 22:09 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Bryan Cantrill wrote:> We will never be doing this: it violates the safety constraint, in > several dimensions. (One of which you outline below.)I''m interested - which other dimensions do you have in mind? - Jeremy Harris
Jeremy Harris
2006-Sep-30 22:40 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Bryan Cantrill wrote:> On Sat, Sep 30, 2006 at 11:09:54PM +0100, Jeremy Harris wrote: >> Bryan Cantrill wrote: >>> We will never be doing this: it violates the safety constraint, in >>> several dimensions. (One of which you outline below.) >> I''m interested - which other dimensions do you have in mind? > > What if someone puts bad ECC on the text or data required to execute > the bad ECC handler? How does DTrace know what this text and data are > to prohibit it?Sorry - I wasn''t clear. "Someone" here is Dtrace; I''m suggesting that Dtrace should both place the bad-ECC and log the placed regions.> More generally, what if someone puts bad ECC on text > or data that needs to be accessed at TL>1? How is this handled?Yes, that is an issue. Can we constrain it in any way? What memory is accessed at TL>1, and can we restrict it from this type of tracing? Are there similar issues on non-SPARC architectures?> And what about memory that DTrace itself uses to (for example) store the > lookup table to know how to correct the bogus bad ECC?What about it? You''re not very clear here; please expand. It''s just memory; given the assumption above that it is Dtrace placing the bad ECC, it would seem possible to not permit it as a traceable region.> Most generally, > how can one prevent bad ECC from being placed on any word of memory that > is accessed in a context in which a correctable ECC trap cannot be > handled?By having Dtrace responsible for placing the bad ECC. - Jeremy Harris
James Carlson
2006-Oct-02 12:19 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
Richard L. Hamilton writes:> carlsonj wrote: > > If we''re going to get into snoop, I''d rather see us > > cast a wider net > > than just adding a few minor features. > > I''ve thought more than once that it would be great if > snoop implemented its protocol interpretation with > shared objects with a defined interface, so that > support for interpreting additional protocols could > be added without modifying snoop proper.That''s the same as meem''s "tap" project, some time ago. He can comment on what happened with it, but I think the main problem in that area isn''t in fact the current design of the code (which I agree is unpleasant). The main problem is that there are so many protocols that one might want to decode, that creating the decoder modules is the uncrossable hurdle. You need armies writing those modules. Having a Solaris-specific design target just limits the number of people available even further. I think if we''re going to overhaul snoop _itself_, we''d be better off tossing the code we''ve got, and going with ethereal. There are undoubtedly problems with ethereal as well, but at least it has a rich feature set. That wasn''t the only thing I was referring to in the comment you quoted above. I was referring to the way in which snoop gathers data. Snoop gathers data by opening a DLPI node and issuing some commands to place the device into promiscuous mode. There are a lot of problems with this approach: - You can''t watch loopback traffic, because that traffic doesn''t go anywhere near DLPI. (Though Clearview is fixing this problem by adding observability nodes.) - The interface doesn''t distinguish between inbound and outbound packets. - There''s no way for DLPI to retain errored packets or pass up metadata along with the packet ("this partial packet terminated by collision"). - There''s no way to pass up other events (carrier indication, flow control, packet loss). - Time stamps are generated by bufmod(7D), well after packet reception and handling, rather than referring to actual receive time. Some of those things could be addressed by yet-more DLPI extensions, but it''s unclear to me whether that''s really the right path. I suspect that something better than raw DLPI access is needed for monitoring traffic. I''m not sure whether a dtrace provider is the right answer to _that_ problem, as turning the output of the provider into something that portable applications (using libpcap) could access sounds both much harder and less useful than designing a new interface from scratch. And if we can''t do that, then we''re really sunk, because we can''t recreate ethereal from scratch. So, if the target of the dtrace network providers is to do what snoop does (but "better"), then I think that''s probably not a good target. We have a lot of things to do in that area, but it''s unclear whether (or how) dtrace is a part of them. The target should be making what Solaris does with packets clearer. That means exposing the relevant state transitions and activity, so that users can find out not just that a given packet is on the wire, but why it''s there. -- James Carlson, KISS Network <james.d.carlson at sun.com> Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Peter Memishian
2006-Oct-02 12:21 UTC
[dtrace-discuss] Re: [networking-discuss] Re: Re: DTrace Network
> That''s the same as meem''s "tap" project, some time ago.

(An unfortunate name choice now, given *cough* SystemTap -- but moot now.)

> He can comment on what happened with it, but I think the main problem
> in that area isn''t in fact the current design of the code (which I
> agree is unpleasant). The main problem is that there are so many
> protocols that one might want to decode, that creating the decoder
> modules is the uncrossable hurdle. You need armies writing those
> modules. Having a Solaris-specific design target just limits the
> number of people available even further.

Yes, that is a sizeable problem. Some of it can be mitigated through
high-quality APIs, but there''s also a seemingly endless number of
decoders to write -- not to mention a surprising number of "clever" ways
that are used to hop from one protocol to the next (especially once one
gets into application protocols).

-- 
meem
James Carlson
2006-Oct-02 13:28 UTC
[networking-discuss] Re: [dtrace-discuss] Re: DTrace Network Provider
Jeremy Harris writes:
> To attack the first problem above, perhaps we should ask cpu designers
> for data-breakpoint *range* [1] facilities?
[...]
> [1], [2], [3] These may be patentable ideas. Since I no longer
>     work for Sun, they are hereby placed in the public
>     domain, with no restrictions.

Motorola would likely beg to differ on your disclosure for [1].
Hardware range breakpoints and even breakpoints with conditional
(and/or) logic have been a standard part of the PowerQUICC family for a
long time -- at least since the MPC860, more than ten years ago. The
idea is almost certainly older than that. (Didn''t VM/CP also have
range facilities?)

The 860 had four I-address, two L-address, and two L-data comparators,
each with ==, !=, >, and <. The L-data comparators also had
signed/unsigned modes and byte masks. These comparators fed into a
programmable and/or logic structure. If you''re interested, see chapter
44 in the MPC860UM [user''s manual].

It''d be nice if the CPUs we ordinarily used had those features, but
they''re certainly not new inventions.

-- 
James Carlson, KISS Network                   <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive        71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757  42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2006-Oct-02 23:36 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Folks,

I''ve been checking the kernel code to see what is doable for the Network
Provider, and I''ve made some updates to the website. This is where I''m
at:

1) ipinfo_t is mostly dead.

On Thu, Sep 28, 2006 at 09:52:34PM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> (This is assuming that the one probe, ip:::send, can access args[2] from
> two different translators depending on the probe location. This may not
> be possible - I''m still checking. If it does turn out to be impossible,
> then that will put a bullet in the head of ipinfo_t).

My assumption was wrong. It turns out that I can''t have two translators
that output the same ipinfo_t. DTrace translators take one input and
make one output only. Nor can I pass in a void and have the translator
figure it out.

I can pass in an mblk_t * (or really any pointer to the IP header,
should mblk_t''s disappear in the future), and get the translator to
check the IP version -- but the translator gets very complex fast, and
needs to filter out implementation (such as IP_FORWARD_PROG). So now
that there is a technical reason not to have a merged translator, that
idea is fairly dead.

There is still the possibility of an observability-only ipinfo_t, with
the following contents:

    args[2]->ip_ver        IP version
    args[2]->ip_plength    payload length
    args[2]->ip_saddr      stringified source address
    args[2]->ip_daddr      stringified dest address

in a similar style to the io provider''s fileinfo_t (which can return
"<null>" if such information was unavailable or not appropriate), and
which has been indispensable for io observability.

As for IPv4 and IPv6 headers, from the ip provider (only) they can be
presented as args[3] (ipv4) and args[4] (ipv6), which can either be:

  A) stable (host-endian correct) DTrace translations of the header:
     ipv4info_t and ipv6info_t, with pointers to the raw headers.

  B) just the raw headers: ipha_t * and ip6_t *.

Whether this ends up as,

    ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
    ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
    ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)

or,

    ipv4:::                         args[3] is IPv4 header
    ipv6:::                         args[3] is IPv6 header
    ipv4:::,
    ipv6:::                         args[2] is ipinfo_t observability

can still be discussed later - it''s not a big change to go from one to
the other.

2) Correct IP and TCP info can be hard to come by.

One tcp:::send probe can''t pick up IP info from some structs while
another tcp:::send probe picks it up from elsewhere; this is difficult
for the same reason a merged translator is difficult. Not only that, but
correct IP info isn''t easily available for much of the tcp code, and
correct TCP header details aren''t easily available much of the time
either. The most extreme example would be the tcp:::state-* probes;
getting correct IP and TCP header details for all of these is going to
be extremely difficult (and worse to maintain).

I think it would be more practical to only provide minimum stable
details from tcp_t *, something like,

    tcp:::state-*    arg0                   unique ID (either conn_s * or tcp_t *)
                     args[3]->tcp_lport     local port
                     args[3]->tcp_rport     remote port
                     args[3]->tcp_state     TCP state
                     args[3]->tcp_pstate    previous TCP state

And that''s it - no IP header info, no TCP header info.

tcp:::send can be placed after much of the IP header has been created
(arguably in ip itself), and tcp:::receive soon after the packet is
received - such that our minimal ipinfo_t is valid.

Full IP headers, however, are probably going to be available in the ip
provider only - so as not to bind the tcp provider to the current kernel
implementation. The following providers and arguments may be more
realistic:

    ip:::                  IPv4 and IPv6 headers
    tcp:::state-*          stable tcp_t details
    tcp:::send/receive     tcp_t, TCP header details, and ipinfo_t

I''m still checking what tcp:::accept-* and tcp:::connect-* can get; it
may also make sense to drop tcp:::data-* and tcp:::ack-* and have these
matched using tcp:::send/receive and predicates -- yes, they would be
nice to have, but too many probes could make ongoing maintenance of the
tcp code painful.

I hope these discoveries and changes have settled some of the concerns
of the readers. :)

cheers,

Brendan
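For a sense of how that minimal interface might be used, here is a
sketch against the proposal above. The tcp provider, the args[3]
members and the specific probe name "state-change" are all taken from
or invented for this proposal - none of this exists in shipping bits:

    /* sketch only: proposed tcp provider, not an existing one */
    tcp:::state-change
    {
            /* count transitions by local port and new TCP state */
            @trans[args[3]->tcp_lport, args[3]->tcp_state] = count();
    }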
Brendan Gregg - Sun Microsystems
2006-Oct-03 01:25 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Rich,

On Mon, Oct 02, 2006 at 07:49:26PM -0400, Richard Lowe wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >>(This is assuming that the one probe, ip:::send, can access args[2] from
> >>two different translators depending on the probe location. This may not
> >>be possible - I''m still checking. If it does turn out to be impossible,
> >>then that will put a bullet in the head of ipinfo_t).
> >
> >My assumption was wrong. It turns out that I can''t have two translators
> >that output the same ipinfo_t. DTrace translators take one input and make
> >one output only.
>
> I maybe misunderstanding you here, but as I read it, this seems not
> correct. For instance, look at how proc.d translates both proc_t* and
> kthread_t* to psinfo_t.

Sorry - I should have stressed that this was for that one probe to have
two translators (for different kernel code locations) that feed the same
struct at the same args[] location.

Yes - I can do it if I have a different probe name or use different
args[] locations, which the proc and sched providers do. I''ve already
gone down this path for probing fused loopback traffic, where I can get
details from tcp_t * but not the usual tcph_t *; and so the suggested
probe names are,

    tcp:::fuse-send
    tcp:::fuse-receive

[...]
> >and whether this ends up as,
> >
> >    ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
> >    ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
> >    ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)
> >Or,
> >    ipv4:::                         args[3] is IPv4 header
> >    ipv6:::                         args[3] is IPv6 header
> >    ipv4:::,
> >    ipv6:::                         args[2] is ipinfo_t observability
> >
> >Can still be discussed later - it''s not a big change to go from
> >one to the other.
>
> I far prefer the latter, the former seems particularly ugly.

Well, there is a precedent for this - the io provider merges different
I/O types, and this helps to write simple scripts rather than always
getting in the way. This was why I wrote the example scripts on the
website -- not once did I need to match the difference between IPv4 and
IPv6, however every script did need to match both. Going the other way
would make all of those examples uglier.

I''m glad I can type,

    dtrace -n ''io:::start { @[execname] = quantize(args[0]->b_bcount); }''

and not be forced to type,

    dtrace -n ''blockio:::start, rawio:::start, asyncio:::start,
        nfs*:::start { @[execname] = quantize(args[0]->b_bcount); }''

I''d encourage anyone to post some realistic and everyday examples of
when your regular Joe sysadmin would want to match the difference
between IPv4 and IPv6, for IP-layer tracing. What springs to mind
includes:

  * packets by IP address (both)
  * bytes by IP address
  * bytes per second by IP address
  * distribution plot of packet size
  * distribution plot of packet size by IP address
  * distribution plot of throughput latency
  * average throughput latency by IP address

Those are some everyday tasks, and I haven''t hit any specific IPv4/6
fields yet...

cheers,

Brendan

--
Brendan
[CA, USA]
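As a sketch of what everyday tasks like "packets by IP address" and
"bytes by IP address" could look like under the merged ip::: style - the
ip::: probes and ipinfo_t members are the proposal above, not an
existing provider:

    dtrace -n ''ip:::send, ip:::receive { @pkts[args[2]->ip_daddr] = count(); }''
    dtrace -n ''ip:::send, ip:::receive { @bytes[args[2]->ip_daddr] = sum(args[2]->ip_plength); }''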
Brendan Gregg - Sun Microsystems wrote:
> G''Day Rich,
>
> On Mon, Oct 02, 2006 at 07:49:26PM -0400, Richard Lowe wrote:
>> Brendan Gregg - Sun Microsystems wrote:
> [...]
>>>> (This is assuming that the one probe, ip:::send, can access args[2] from
>>>> two different translators depending on the probe location. This may not
>>>> be possible - I''m still checking. If it does turn out to be impossible,
>>>> then that will put a bullet in the head of ipinfo_t).
>>> My assumption was wrong. It turns out that I can''t have two translators
>>> that output the same ipinfo_t. DTrace translators take one input and make
>>> one output only.
>> I maybe misunderstanding you here, but as I read it, this seems not
>> correct. For instance, look at how proc.d translates both proc_t* and
>> kthread_t* to psinfo_t.
>
> Sorry - I should have stressed that this was for that one probe to have two
> translators (for different kernel code locations) that feed the same struct
> at the same args[] location.
>
> Yes - I can do it if I have a different probe name or use different args[]
> locations, which the proc and sched providers do. I''ve already gone down
> this path for probing fused loopback traffic, where I can get details from
> tcp_t * but not the usual tcph_t *; and so the suggested probe names are,
>
>     tcp:::fuse-send
>     tcp:::fuse-receive
>
> [...]
>>> and whether this ends up as,
>>>
>>>     ip::: /args[2]->ip_ver == 4/    args[3] is IPv4 header
>>>     ip::: /args[2]->ip_ver == 6/    args[4] is IPv6 header
>>>     ip:::                           args[2] is ipinfo_t observability (IPv4 & IPv6)
>>> Or,
>>>     ipv4:::                         args[3] is IPv4 header
>>>     ipv6:::                         args[3] is IPv6 header
>>>     ipv4:::,
>>>     ipv6:::                         args[2] is ipinfo_t observability
>>>
>>> Can still be discussed later - it''s not a big change to go from
>>> one to the other.
>> I far prefer the latter, the former seems particularly ugly.
>
> Well, there is a precedent for this - the io provider merges different I/O
> types, and this helps to write simple scripts rather than always getting
> in the way. This was why I wrote the example scripts on the website --
> not once did I need to match the difference between IPv4 and IPv6, however
> every script did needed to match both. Going the other way would make all
> of those examples uglier.

Sorry, I should have perhaps been more specific.

It''s the validity of arguments depending on other arguments that I''m
disliking, which I don''t *think* other providers currently do. I''ll
admit, however, that I can''t currently see a way to go about this other
than those presented.

> I''m glad I can type,
>
>     dtrace -n ''io:::start { @[execname] = quantize(args[0]->b_bcount); }''
>
> and not be forced to type,
>
>     dtrace -n ''blockio:::start, rawio:::start, asyncio:::start,
>         nfs*:::start { @[execname] = quantize(args[0]->b_bcount); }''

Certainly.

-- Rich
Brendan Gregg - Sun Microsystems
2006-Oct-03 02:50 UTC
[dtrace-discuss] Re: DTrace Network Provider
G''Day Rich,

On Mon, Oct 02, 2006 at 09:50:37PM -0400, Richard Lowe wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >Well, there is a precedent for this - the io provider merges different I/O
> >types, and this helps to write simple scripts rather than always getting
> >in the way. This was why I wrote the example scripts on the website --
> >not once did I need to match the difference between IPv4 and IPv6, however
> >every script did needed to match both. Going the other way would make all
> >of those examples uglier.
>
> Sorry, I should have perhaps been more specific.
>
> It''s the validity of arguments depending on other arguments that I''m
> disliking, which I don''t *think* other providers currently do.

In io, args[1]->dev_pathname only works for /args[1]->dev_name != "nfs"/;
maybe I could find others. If you mean that it is uncommon and couldn''t
be described as a strong DTrace convention - then yes, fair enough.

The precedent I had in mind was the profile provider, with arg0 for
userland address or arg1 for kernel address, depending on the underlying
activity.

That I now can''t have ip::: and args[2] (which was simple), and instead
must have ip::: with either args[3] or args[4] (which people may find
odd), I''ll concede has certainly taken some of the shine off the ip:::
idea.

Brendan

--
Brendan
[CA, USA]
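To make the dependency concrete, the guarded style being discussed would
look something like this in a script - a sketch against the proposed
ip::: probes; ipv4info_t has no published members here, so ipv4_ttl is
an invented name for illustration only:

    ip:::send
    /args[2]->ip_ver == 4/
    {
            /* args[3] (the IPv4 header translation) is only valid under this predicate */
            @ttl[args[3]->ipv4_ttl] = count();
    }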
Brendan Gregg - Sun Microsystems
2006-Oct-03 02:53 UTC
[dtrace-discuss] Re: DTrace Network Provider
On Mon, Oct 02, 2006 at 07:50:34PM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> The precedent in mind was the profile provider, with arg0 for userland
> address or arg1 for kernel address, depending on the underlying activity.

Oops - I meant the other way around. arg0 is kernel...

I knew I should have suggested args[4] be IPv4 and args[6] be IPv6, then
no one could complain that it''s difficult to remember which arg is
which. ;)

Brendan

--
Brendan
[CA, USA]
Brendan,

I went and looked at:

http://www.opensolaris.org/os/community/dtrace/NetworkProvider/CEC2006/

and overall I liked what I saw.

But I think it would make sense to formalize the notion of the TCP
pseudo-header for the tcp packet events. See page 17 in RFC 793.

For instance, you end up with code like this:

    this->length = args[1]->ip_plength - args[2]->tcp_offset;

instead of capturing this in a pseudo-header field called something like
tcp_length. (UDP and ICMPv6 have the same notion of a pseudo-header.)

To accurately capture the difference between the TCP pseudo-header and
the IP header on the packet on transmit, you''d need to take into account
IP source routes (LSRR and SSRR in RFC 791, and the routing header in
RFC 2460): you need to compensate for the fact that the TCP pseudo-header
carries the final destination, but in the implementation TCP sends down a
packet where ipha_dst/ip6_dst is the first hop of the source route.

I think expressing TCP''s, UDP''s, etc. behavior in terms of the
pseudo-header takes care of the layering concerns; this pseudo-header is
what defines the interface between the transports and IP.

I recently talked to a large customer and they brought up the need for
performance observability for networking. They wanted both to be able to
observe throughput (which nicstat already provides, and your new
providers give even more insight into) and also latency up and down the
stack.

After pushing back on the latency part, they explained the need as
coming from handling operational performance anomalies (e.g., overload
type situations) when they can''t tell where the issue is: whether TCP or
something else is queueing the packet, or the switches/routers are doing
it, etc.

Basically being able to see what the latency is (aggregated globally or
per socket/connection) from the write/send system call until the packet
hits the wire, and likewise on the receive side. (Tracking the time
spent queued in the NIC''s hardware transmit/receive descriptors isn''t
something we can do, and they understood that.)

Is anybody looking at providing a stable latency provider?

Based on the 100+ emails on the list on this subject, I suspect that
this requires us to invent some new concepts in Solaris so that we can
track causality within the stack without relying on mblk/dblk addresses
or resorting to destructive dtrace providers. But it might not be hard
to have the key producers of messages (sockfs and GLD) generate a unique
handle/identity in the dblk which a dtrace provider can then follow even
as other mblks are prepended and different threads handle the
message/packet.

Erik
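For comparison, here is roughly what the hand-computed payload length
from the CEC2006 material looks like as a whole clause - a sketch
against the proposed, not-yet-shipping tcp provider, where a
pseudo-header tcp_length field would replace the subtraction:

    tcp:::send
    {
            /* TCP payload length, computed by hand from the proposed members */
            this->length = args[1]->ip_plength - args[2]->tcp_offset;
            @payload["TCP payload bytes"] = quantize(this->length);
    }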
Brendan Gregg - Sun Microsystems
2006-Nov-25 02:28 UTC
[dtrace-discuss] Re: Network Provider comments
G''Day Erik,

On Wed, Nov 22, 2006 at 05:58:30PM -0800, Erik Nordmark wrote:
>
> Brendan,
>
> I went and looked at:
>
> http://www.opensolaris.org/os/community/dtrace/NetworkProvider/CEC2006/
>
> and overall I liked what I saw.

Great,

> But I think it would make sense to formalize the notion of the tcp
> pseudo-header for the tcp packet events. See page 17 in RFC 793.
>
> For instance, you end up with code like this
>     this->length = args[1]->ip_plength - args[2]->tcp_offset;
> instead of capturing this in a pseudo-header field called something like
> tcp_length. (UDP and ICMPv6 has the same notion of pseudo-header.)

Yes, I originally had planned to include a tcp_length that was the TCP
payload length - however I ran into a small issue involving DTrace
translators. In the future we would like to fix this issue, at which
point a tcp_length could be added to the provider. For now, these
standard RFC fields (args[1]->ip_plength - args[2]->tcp_offset) both
work and are sufficient, and their inclusion doesn''t prohibit the
future addition of tcp_length.

Dropping tcp_length for now has been one of the benefits of prototyping
the provider, as I''m able to suggest something that is realistic,
functional and known to be doable.

Further details (if anyone is interested): the issue was the need for a
translator to have multiple inputs that vary depending on probe
location, and that combine to give a single output argument (args[]),
so that ip_plength can be read from one place and tcp_offset from
another, and then subtracted to give tcp_length. DTrace translators
currently take one type and translate it to another, and can only vary
their inputs if the probe name varies. Ways we may resolve this issue
include:

  - feeding the translator a custom struct of pointers to all input
    data. Building this struct will incur overhead in code that needs
    to remain fast (this would need to be benchmarked); one way around
    this may be to add is-enabled kernel probes (which don''t exist yet).

  - enhancing the DTrace translator interface to accept multiple inputs
    for one args output.

> To accurately capture the difference between the TCP pseudo header and
> the IP header on the packet on transmit, you''d need to take into account
> IP source routes (LSRR and SSRR in RFC 791, and the routing header in
> RFC 2460) you need to compensate for the fact that in the TCP pseudo
> header is the final destination, but in the implementation TCP sends
> down a packet where ipha_dst/ip6_dst is the first hop of the source route.

Good point. I haven''t yet tried to represent src or dst address in TCP
terms; instead I''m using args[1]->ip_saddr and args[1]->ip_daddr -
which are whatever addresses are in the IP packet. LSRR/SSRR doesn''t
break this.

If I had tried to provide args[2]->tcp_saddr and args[2]->tcp_daddr,
then they would definitely need to show the TCP destination from
LSRR/SSRR properly.

This will certainly need to be addressed - if not by the addition of
correct args[2]->tcp_saddr and args[2]->tcp_daddr variables, then by
clearly pointing out the dangers of assuming ip_saddr == tcp_saddr, and
providing another way to get at the correct tcp addresses. So far my
examples assume ip_saddr == tcp_saddr - which for LSRR/SSRR isn''t
correct.

> I think expression TCP''s, UDP''s etc behavior in terms of the pseudo
> header takes care of the layering concerns; this pseudo header is what
> defines the interface between the transports and IP.
>
> I recently talked to a large customer and they brought up the need
> performance observability for networking. They wanted both to be able to
> observe throughput (which nicstat already provides, and your new
> providers give even more insight into) and also latency up and down the
> stack.
>
> After pushing back on the latency part, they explained the need as
> coming from handling operational performance anomalies (e.g., overload
> type situations) when they can''t tell where the issue is; whether TCP or
> the something else queues the packet, or the switches/routers doing it, etc.

Sure - being able to show how much time was spent serving network
interrupts, and how much time was spent in kernel tcp/ip code, are both
crucial metrics that I can see many customers wanting to know, for the
exact reasons you have described.

The following checklist could be applied when attempting to understand
network performance:

    Metrics                                    Tool
    --------------------------------------------------------------------
    current throughput (bytes and packets)     nicstat
    NIC settings (duplex, auto-neg, speed)     checkcable
    route table / routing                      netstat, route, traceroute
    network latency                            dtrace (net provider)
    CPU saturation                             vmstat, dtrace
    NIC interrupt CPU consumption              intrstat, mdb & kstat
    kernel CPU time                            vmstat, dtrace
    tcp/ip CPU time                            dtrace (fbt)

> Basically being able to see what the latency is (aggregated globally or
> per socket/connection) from the write/send system call until the packet
> hits the wire, and likewise on the receive side. (Tracking the time
> spent queued in the NICs hardware transmit/receive descriptor isn''t
> something we can do, and they understood that.)
>
> Is anybody looking at providing a stable latency provider?

There are a number of needs that DTrace can fulfill:

    Need                                       Customer
    --------------------------------------------------------
    A. General network observability           Solaris users
    B. Network latency                         Solaris users
    C. Overall tcp/ip CPU latency              Solaris users
    D. Kernel code-path latency                Kernel engineers

A. General network observability

This is currently addressed in the proposed providers, and provides such
metrics as connection rates by port, by host, etc.

B. Network latency

This is also in the proposed providers, as we can measure the time
between when certain packets are sent and received.

C. Overall tcp/ip CPU latency

I''d like to have some form of overall tcp/ip code latency in the
provider, so that customers can easily see how much time is consumed in
tcp/ip code. Since the network providers are public and stable, we will
want to try not to make this implementation-based, and focus on more
general events. It should be possible to write fairly straightforward
scripts that show time spent processing packets in total, then break it
down by certain TCP/IP fields. It''s tricky to do this for a number of
reasons, including the performance overhead of DTrace events on a server
that could be pumping 100,000 packets per second.

D. Kernel code-path latency

Kernel code-path latency is not a priority for these network providers,
for reasons such as:

  1) the kernel code-path isn''t stable, and these are STABLE providers

  2) code-path latency data is already available using DTrace and the
     fbt provider, assuming:

     a) the engineer knows DTrace well
     b) the engineer understands the code they are tracing

It should be possible to write an unstable code-path latency provider
for the kernel engineers to use (and others interested in such details,
like myself), and such a provider hasn''t been ruled out.

In the shorter term it should be possible to add SDT probes to make
mblk-based tracing easier.

There have been some suggestions that many regular Solaris users will be
interested in kernel code-path latency as well, which is ridiculous. A
regular Solaris user should certainly be able to identify whether there
is an overall latency in the tcp/ip code, but it''s then up to the
kernel engineers to study and fix the code-path.

> Based on the 100+ emails on the list on this subject, I suspect that
> this requires us to invent some new concepts in Solaris so that we can
> track causality within the stack without relying on mblk/dblk addresses
> or resorting to destructive dtrace providers. But it might not be hard
> to have the key producers of messages (sockfs and GLD) generate a unique
> handle/identity in the dblk which a dtrace provider can then follow even
> as other mblks are prepended and different threads handle the
> message/packet.

Numerous code path latency problems can be solved with mblk addresses -
with the DTrace script tracing changes to the mblk. With the addition of
some private unstable SDT probes (to trace creation of mblks, splitting
of packets, etc), such scripts would even become easy.

When a server is pushing 100,000 packets per second, additional CPU
instructions become a concern. An additional data/message ID would be
nice, but would it be close to CPU free? Or is there a solid reason why
mblk addresses and the SDT provider can''t be used?

An appropriate path for solving kernel code-path latency may involve:

  1) someone who REALLY understands DTrace writing a suite of tools to
     demonstrate mblk-based tracing from the fbt provider. These tools
     will solve numerous observability issues, but also highlight
     certain difficulties with this approach, which leads to,

  2) the addition of private unstable SDT probes to enhance the suite of
     tools, and allow the tools to be written more easily. This may
     include the addition of data/message IDs that are returned via SDT.

  3) if need be, the design of a separate kernel code-path latency
     provider. This would include data/message IDs. It''s likely that
     this will still be termed ''unstable'', due to the code-path focus.

If I get a chance, I can make a start on (1) and (2). If a large
customer wants latency metrics now, then fbt-based scripts can be
written without waiting for the provider. But first, are we even in the
kernel that much? What do vmstat and intrstat show? ...

Oh, and who said anything about destructive DTrace providers?

cheers,

Brendan

--
Brendan
[CA, USA]
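A rough sketch of what step (1) could look like: timing an mblk between
two kernel functions with the existing fbt provider, keyed on the mblk
address. The function names and argument positions below are
placeholders that would have to be checked against the running kernel -
which is exactly why this stays unstable, engineer-only territory:

    /* unstable sketch: tcp_send_data and ip_output are placeholder names */
    fbt::tcp_send_data:entry
    {
            /* assume the outbound mblk_t * is arg1 */
            sent[arg1] = timestamp;
    }

    fbt::ip_output:entry
    /sent[arg2] != 0/
    {
            /* assume the same mblk_t * arrives here as arg2 */
            @lat["tcp to ip (ns)"] = quantize(timestamp - sent[arg2]);
            sent[arg2] = 0;
    }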
Brendan Gregg - Sun Microsystems wrote:
> G''Day Erik,
>
> On Wed, Nov 22, 2006 at 05:58:30PM -0800, Erik Nordmark wrote:
>
> ...
>
>> Based on the 100+ emails on the list on this subject, I suspect that
>> this requires us to invent some new concepts in Solaris so that we can
>> track causality within the stack without relying on mblk/dblk addresses
>> or resorting to destructive dtrace providers. But it might not be hard
>> to have the key producers of messages (sockfs and GLD) generate a unique
>> handle/identity in the dblk which a dtrace provider can then follow even
>> as other mblks are prepended and different threads handle the
>> message/packet.
>
> Numerous code path latency problems can be solved with mblk addresses -
> with the DTrace script tracing changes to the mblk. With the addition
> of some private unstable SDT probes (to trace creation of mblks, splitting
> of packets, etc), such scripts would even become easy.
>
> When a server is pushing 100,000 packets per second, additional
> CPU instructions become a concern. An additional data/message ID would
> be nice, but would it be close to CPU free? Or is there a solid reason
> why mblk addresses and the SDT provider can''t be used?

Using mblk addresses presumes that a packet will only ever be referred
to by the mblk at the start of the packet. This can be invalidated when
we:

  * prepend M_PROTO/M_PCPROTO when sending to a driver;

  * add/remove an M_DATA from the start of a packet when we are working
    with encapsulation (IPsec, tunnelling, etc);

  * make calls to msgpullup();

  * otherwise change the construction of a packet relative to the mblks
    that point to it.

Darren
Brendan Gregg - Sun Microsystems
2006-Nov-27 08:01 UTC
[dtrace-discuss] Re: Network Provider comments
G''Day Darren,

On Mon, Nov 27, 2006 at 03:03:59PM +0800, Darren Reed wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> >When a server is pushing 100,000 packets per second, additional
> >CPU instructions become a concern. An additional data/message ID would
> >be nice, but would it be close to CPU free? Or is there a solid reason
> >why mblk addresses and the SDT provider can''t be used?
>
> Using mblk addresses presumes that a packet will only ever be
> referred to by the mblk at the start of the packet.

No, no, I wasn''t presuming that at all -- I suggested that mblk changes
be traced, and that sdt probes could make that easier.

> This can be invalidated when we:
>   * prepend M_PROTO/M_PCPROTO when sending to a driver;
>   * add/remove an M_DATA from the start of a packet when we
>     are working with encapsulation (IPsec, tunnelling, etc);
>   * make calls to msgpullup();
>   * otherwise change the construction of a packet relative to the
>     mblks that point to it.

All of these changes we can usually trace in a DTrace script, using the
existing fbt provider and ctf''d arguments. Yes, mblk''s change - and
you''ve written a valid list of challenges for an mblk-based DTrace
script to solve. Some of those mblk changes may be inlined, but there
are workarounds for that too.

I think one of the biggest challenges would be to verify that the
measured latency is accurate, and not just microsecond noise from the
DTrace overhead; this may involve setting up a testing framework that
can apply some known network loads.

cheers,

Brendan

--
Brendan
[CA, USA]
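One way that tracing of mblk changes could look with the existing fbt
provider: carry state across msgpullup(9F), which returns a new mblk
that replaces the old one. msgpullup is a real DDI routine; whether it
is visible to fbt on a given kernel (it may be inlined), and the idea of
re-keying timestamps on the old and new mblk addresses, are the
assumptions in this sketch:

    fbt::msgpullup:entry
    /sent[arg0] != 0/
    {
            /* remember the timestamp attached to the original mblk */
            self->carried = sent[arg0];
            sent[arg0] = 0;
    }

    fbt::msgpullup:return
    /self->carried/
    {
            /* re-key it on the replacement mblk returned by msgpullup() */
            sent[arg1] = self->carried;
            self->carried = 0;
    }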