Brendan Gregg - Sun Microsystems
2007-Jun-12 03:11 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

I'm restarting work on the proposed DTrace network providers. These are
the stable providers intended for use by Solaris end users for general
observability, similar to what the "io" provider achieves for disk I/O.

I have been busy on other projects during the last 6 months, so little has
progressed since the last (long) dtrace-discuss thread with the subject
"DTrace Network Provider".

I've just reread that thread - and I'm grateful to have had so much useful
feedback so far. I've reintegrated some feedback with what I've learnt
through using the network provider during the past 6 months, and I'm
considering some new changes - which would be take 2 of the prototype
network providers.

For now I'm sticking to just discussing the internet layer protocols.

Summary

Take 1 (current):

   Probes          Args
   ip:::send       ipstateinfo_t, ipinfo_t, ipv4info_t, ipv6info_t
   ip:::receive    ipstateinfo_t, ipinfo_t, ipv4info_t, ipv6info_t

more details: http://www.opensolaris.org/os/community/dtrace/NetworkProvider

This provider has been great to use in testing. A key issue is the
ugliness of having ipv4info_t and ipv6info_t as different args; although
this hasn't been much of a problem as ipinfo_t is usually used instead.

Take 2 (proposed):

   Probes          Args
   ip:::send       cstateinfo_t, ipinfo_t
   ip:::receive    cstateinfo_t, ipinfo_t
   ipv4:::send     cstateinfo_t, ipinfo_t, ipv4info_t
   ipv4:::receive  cstateinfo_t, ipinfo_t, ipv4info_t
   ipv6:::send     cstateinfo_t, ipinfo_t, ipv6info_t
   ipv6:::receive  cstateinfo_t, ipinfo_t, ipv6info_t
   ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
   ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
   icmp:::send     cstateinfo_t, ipinfo_t, icmpinfo_t
   icmp:::receive  cstateinfo_t, ipinfo_t, icmpinfo_t
   arp:::send      cstateinfo_t, arpinfo_t
   arp:::receive   cstateinfo_t, arpinfo_t
   rarp:::send     cstateinfo_t, arpinfo_t
   rarp:::receive  cstateinfo_t, arpinfo_t
   ...

Providers such as "ipv4" and "ipv6" are for specific protocol analysis,
and their 3rd argument is an endian-correct structure of the protocol
members (DTrace translator). A pointer to the raw protocol struct will
be provided. (This assumes that "ipv4" is an allowed provider name, as
it could potentially clash with a USDT provider called "ipv" that is
asked to trace PID 4.)

Some discussion points about the new suggestions follow.

1) ipinfo_t

This provides common details from the IP protocols that we expect to
have available,

   ipinfo_t,
        ip_protocol     int       /* protocol */
        ip_plength      uint      /* payload length */
        ip_saddr        string    /* source address */
        ip_daddr        string    /* destination address */

ip_protocol could be just the IP protocol number (4 or 6), or borrow
existing definitions such as AF_INET/AF_INET6, ETHERTYPE_IP/ETHERTYPE_IPV6
or use /etc/protocols - as these may better accommodate future protocol
additions. Using the IP protocol number (4/6) would be the least
surprising choice for the end user.

2) cstateinfo_t

This provides connection state information if available,

   cstateinfo_t,
        cs_cid          uint64_t  /* connection ID (conn_t *) */
        cs_loopback     int       /* loopback state */

and may be extended to include details such as,

        cs_zoneid       int       /* zone ID */
        cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */

cs_cid isn't provided as a conn_t *, as that would expose an unstable
interface; instead it is provided as a uint64_t, as it can be useful as a
connection ID (and people can cast it as a conn_t * for raw debugging).

3) ip provider

"ip" is the internet protocol provider, which is a convenience provider
for tracing all internet protocol traffic (anything with an IP header).
It may be expanded in the future to cover future internet layer data
protocols (protocols that serve the same purpose as IPv4/IPv6).

4) ip provider & ipsec

Since "ip" is a convenience provider, we have the possibility of casting
data that is useful (for convenience) rather than what is strictly in the
protocol. For example, IPSec, where packets have the tunnel source and dest
and the actual source and dest, could be traced as follows:

   Probes    Arguments contain
   ipv4:::   tunnel source and dest
   ipv6:::   tunnel source and dest
   ipsec:::  actual source and dest, and tunnel source and dest
   ip:::     actual source and dest

(assuming this is doable - I need to check if ip::: probes can be placed
so as to not see duplicates of IPSec traffic).

This would allow the ipv4 provider to probe the IPv4 protocol, and the ip
provider to probe the actual end points of our communication.

5) ICMP

To stress the intent of these providers, an IPv4 ICMP send would fire,

   ip:::send     Generic info
   ipv4:::send   IPv4 packet info
   icmp:::send   ICMP details (type, code, ...)

perhaps snoop -V helps explain this,

   192.168.1.109 -> 192.168.1.3  ETHER Type=0800 (IP), size=98 bytes
   192.168.1.109 -> 192.168.1.3  IP  D=192.168.1.3 S=192.168.1.109 LEN=84, ID=3232, TOS=0x0, TTL=255
   192.168.1.109 -> 192.168.1.3  ICMP Echo request (ID: 20076 Sequence number: 0)

The ip and ipv4 providers will trace the "IP" line, and the icmp provider
will trace the "ICMP" line.

6) Nifty-IP (hypothetical)

In the distant future, Aliens from the Pleiades cluster make contact with
Earth. They have been searching the galaxy for a quality operating system
that they can install on their interstellar battleships, and are delighted
to have found this thing that Earth people call "Solaris". However, they
need to communicate with their own internet layer protocol, which roughly
translated into English is called "nifty-IP". Work begins to support
nifty-IP in Solaris 16, partly as a good gesture for our alien neighbours,
and partly because they agreed to pay in tons of solid gold.

The following providers would trace nifty-IP,

   ip:::        Generic info
   nifty-ip:::  nifty-ip packet info (niftyipinfo_t)

An integration difficulty with ip::: is that of matching fields in
ipinfo_t. That should work as follows,

   ip_protocol  no problem - nifty-ip gets its own protocol number
   ip_plength   no problem - nifty-ip must move some length of data
   ip_saddr     no problem - although may be unintelligible
   ip_daddr     no problem - although may be unintelligible

The assumptions are that any future internet layer protocol will move
data from address A to address B, and that data will have a measurable
length.

The cstateinfo_t members shouldn't be a problem, as they are specific to
Solaris and not the protocol.

cheers,

Brendan

--
Brendan
[CA, USA]
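To sketch the intended usage - assuming the proposed take-2 providers and
the ipinfo_t members above, which are proposals and may still change - a
bytes-by-source-address one-liner might look like:

   # dtrace -n 'ip:::send,ip:::receive
       { @bytes[args[1]->ip_saddr] = sum(args[1]->ip_plength); }'

Since ipinfo_t is shared by both address families, the same one-liner
would cover IPv4 and IPv6 traffic.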
Darren.Reed at Sun.COM
2007-Jun-12 03:52 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan,

I had a quick look through the URL you provided and one thing struck me
as very odd - none of the protocol headers (IP, TCP, UDP) expose the
checksum value. Can you explain your rationale for this? While this is
perhaps uninteresting on the inbound side, it is quite interesting on
the outbound side.

Also, can you explain where the probes for each protocol fit in relation
to the greater processing for the protocol? For example, with TCP you
mention that it is possible to see if invalid flag combinations have
been supplied - does this mean the TCP header has also not been
checksum'd?

Will tracemem() be told how to walk the packet, so I can do the below?

   tracemem(args[1]->ip_hdr, args[1]->ip_plength)

...well, that'd also require ip_hdr being added to ipinfo_t.

Adding ipv4_hl (or whatever you want to call it) would also be
beneficial for people who want to use dtrace to look for packets with IP
options in them.

For IPv6, how do you see this design evolving to include looking at
extension header data?

Darren
Brendan Gregg - Sun Microsystems
2007-Jun-12 18:28 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Mon, Jun 11, 2007 at 08:52:30PM -0700, Darren.Reed at sun.com wrote:

> Brendan,
>
> I had a quick look through the URL you provided and one thing

That URL currently describes "take 1" of the network provider proposal,
and so is out of date. The newer details (IP only) were in the email I
sent, and I'll be updating the URL based on that and the feedback I get.

> struck me as very odd - none of the protocol headers (IP, TCP,
> UDP) expose the checksum value. Can you explain your rationale
> for this? While this is perhaps uninteresting on the inbound side,
> it is quite interesting on the outbound side.

Sorry - that URL is out of date - the providers I've been using in
testing do provide checksum values.

> Also, can you explain where the probes for each protocol fit in
> relation to the greater processing for the protocol?

Sure, probes fire when packets are sent and received. The actual
location of the probe macros in the kernel TCP/IP code is based on
available information and maintainability. This means that ip probes
may be placed in tcp_send_data() and udp_send_data(), just as MIB
macros for IP statistics appear in tcp.c and udp.c.

> For example, with TCP you mention that it is possible to see if
> invalid flag combinations have been supplied - does this mean
> the TCP header has also not been checksum'd?

I just want to solve IP providers for now - I know there are
complications with the location of the TCP checksum calculation, which
I'll eventually try to solve. :)

> Will tracemem() be told how to walk the packet, so I can do the below?
>    tracemem(args[1]->ip_hdr, args[1]->ip_plength)
>
> ...well, that'd also require ip_hdr being added to ipinfo_t.

I don't think ipinfo_t should have ip_hdr; if you are interested in
protocol details, then instead of using the "ip" provider or the generic
ipinfo_t structure, you use the "ipv4" (or "ipv6") protocol specific
provider, which currently provides args[2]->ipv4_hdr as a pointer to the
raw ipha_t.

I think there are two things you may be interested in:

A) dumping the raw header + extended options.

Here is an example from the old prototype "ip" provider,

   # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr, 20); }'
   dtrace: description 'ip:::send,ip:::receive ' matched 5 probes
   CPU     ID                    FUNCTION:NAME
     0  54627                 ip_input:receive
            0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
        0: 45 20 00 98 3b 01 40 00 28 06 e0 c2 45 b5 2e b2  E ..;.@.(...E...
       10: c0 a8 01 6d                                      ...m

     0  54627                 ip_input:receive
            0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
        0: 45 00 00 58 b6 fe 40 00 3c 06 03 e1 c0 a8 01 03  E..X..@.<.......
       10: c0 a8 01 6d                                      ...m
   [...]

So far so good, now we need to replace the "20" value with
args[2]->ipv4_ihl,

   # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
       args[2]->ipv4_ihl); }'
   dtrace: invalid probe specifier ip:::send,ip:::receive {
   tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
   argument #2 must be a non-zero positive integral constant expression

hmm. DTrace is trying to help out by sanity checking args,

        if (dt_node_is_posconst(size) == 0) {
                dnerror(size, D_TRACEMEM_SIZE, "tracemem( ) argument #2 must "
                    "be a non-zero positive integral constant expression\n");
        }

I need to check if there is a sensible way around this... I can think of
a work-around using existing DTrace functionality, but it isn't pretty.
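One such work-around (a sketch only - the exact approach isn't spelled
out here, and it assumes ipv4_ihl is translated to bytes) is to
enumerate the possible header lengths, so that each clause passes
tracemem() a constant:

   /* one clause per possible IHL; ugly, but every size is a constant */
   ip:::receive /args[2]->ipv4_ihl == 20/ { tracemem(args[2]->ipv4_hdr, 20); }
   ip:::receive /args[2]->ipv4_ihl == 24/ { tracemem(args[2]->ipv4_hdr, 24); }
   /* ... and so on, up to the IPv4 maximum of 60 bytes */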
B) dumping the payload

It wasn't the initial intent to access the payload easily from DTrace,
but enough people have asked that I will check how doable this is (it
could certainly be added as a later feature).

Of course, if you can walk to the data offsets in C, you can tracemem()
them using DTrace (I assume this would need the first mblk pointer,
which would be provided in the protocol providers for such raw
debugging). So there will be at least one way to do this.

It would be nice to do this easily from DTrace, eg,

   tracemem(args[2]->ipv4_payload, args[2]->ipv4_plength)

which might turn into something like,

   tracemblk(args[2]->ipv4_baddr, args[2]->ipv4_plength)

baddr meaning buffer address, and tracemblk() a new function to walk
and dump mblks easily. I'm not yet sure which approach would be best
without trying to code it first.

> Adding ipv4_hl (or whatever you want to call it) would also be
> beneficial for people who want to use dtrace to look for packets
> with IP options in them.

Yes, the "ipv4" provider will have args[2]->ipv4_ihl.

> For IPv6, how do you see this design evolving to include looking
> at extension header data?

Extension header data would only be accessed through the ipv6 provider,
and can be added to the ipv6info_t struct.

cheers,

Brendan

--
Brendan
[CA, USA]
Peter Lawrence
2007-Jun-12 18:42 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Brendan,

I've got a couple very open ended questions. I'm not so much concerned
with the actual answers as to know that you are thinking about such
issues...

how does the info available through this provider compare with
/usr/sbin/snoop ?

will it be possible to emulate the filtering capabilities of snoop ?

how does the placement of the probes imply timing info ?

is there any way to track time between application send and ip send, and
then between ip send and NIC send ? is there any way to track time
between ip receive and application read, and also between NIC receive
and ip receive ?

(and, for my own edification, how does IPsec relate to SSL, and if they
are unrelated and I need to dtrace SSL how will I do that ?)

thanks,
-Pete Lawrence.

Brendan Gregg - Sun Microsystems wrote On 06/11/07 08:11 PM,:

> G'Day Folks,
>
> I'm restarting work on the proposed DTrace network providers. These are
> the stable providers intended for use by Solaris end users for general
> observability, similar to what the "io" provider achieves for disk I/O.
[...]
Dan McDonald
2007-Jun-12 19:02 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:

<SNIP!>

> (and, for my own edification, how does IPsec relate to SSL, and if they
> are unrelated and I need to dtrace SSL how will I do that ?)

There *is* kernel SSL, but most SSL is in user-space. I *think* what's
being proposed is an all-kernel set of probes.

Dan
Dan McDonald
2007-Jun-12 19:14 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Mon, Jun 11, 2007 at 08:11:25PM -0700, Brendan Gregg - Sun Microsystems wrote:

> This provider has been great to use in testing. A key issue is the
> ugliness of having ipv4info_t and ipv6info_t as different args; although
> this hasn't been much of a problem as ipinfo_t is usually used instead.
>
> Take 2 (proposed):
>
>    Probes          Args
>    ip:::send       cstateinfo_t, ipinfo_t
>    ip:::receive    cstateinfo_t, ipinfo_t
>    ipv4:::send     cstateinfo_t, ipinfo_t, ipv4info_t
>    ipv4:::receive  cstateinfo_t, ipinfo_t, ipv4info_t
>    ipv6:::send     cstateinfo_t, ipinfo_t, ipv6info_t
>    ipv6:::receive  cstateinfo_t, ipinfo_t, ipv6info_t
>    ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
>    ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
>    icmp:::send     cstateinfo_t, ipinfo_t, icmpinfo_t
>    icmp:::receive  cstateinfo_t, ipinfo_t, icmpinfo_t
>    arp:::send      cstateinfo_t, arpinfo_t
>    arp:::receive   cstateinfo_t, arpinfo_t
>    rarp:::send     cstateinfo_t, arpinfo_t
>    rarp:::receive  cstateinfo_t, arpinfo_t
>    ...
>
> Providers such as "ipv4" and "ipv6" are for specific protocol analysis,
> and their 3rd argument is an endian-correct structure of the protocol
> members (DTrace translator). A pointer to the raw protocol struct will
> be provided. (This assumes that "ipv4" is an allowed provider name, as
> it could potentially clash with a USDT provider called "ipv" that is
> asked to trace PID 4.)

You have "ipsec" as one set of probes. I take it your ipsecinfo_t will
distinguish between AH, ESP, or both on a packet.

> 2) cstateinfo_t
>
> This provides connection state information if available,
>
>    cstateinfo_t,
>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>         cs_loopback     int       /* loopback state */
>
> and may be extended to include details such as,
>
>         cs_zoneid       int       /* zone ID */
>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */
>
> cs_cid isn't provided as a conn_t *, as that would expose an unstable
> interface; instead it is provided as a uint64_t, as it can be useful as a
> connection ID (and people can cast it as a conn_t * for raw debugging).

That seems sensible, but for lots of receive-side processing, there will
be NOTHING resembling a conn_t immediately available.

For example, IPsec packets (depending on where you put the probe) might
have the SA (ipsa_t) available, but no way will it have the conn_t...
conn lookup occurs AFTER inbound IPsec processing (at least right now).

> 4) ip provider & ipsec
>
> Since "ip" is a convenience provider, we have the possibility of casting
> data that is useful (for convenience) rather than what is strictly in the
> protocol. For example, IPSec, where packets have the tunnel source and dest

Spelling nit: IPsec.

> and the actual source and dest, could be traced as follows:
>
>    Probes    Arguments contain
>    ipv4:::   tunnel source and dest
>    ipv6:::   tunnel source and dest
>    ipsec:::  actual source and dest, and tunnel source and dest

IPsec is not a tunnelling protocol, and not all IPsec packets have tunnel
source/dest to contend with. Also, in Solaris, IP tunnels are distinct
entities; we implement IPsec tunnel-mode inside the context of our IP
tunnelling - that's the implementation tack of Tunnel Reform.

For IPsec transmit-side, depending on where the probes are placed, you'll
want to know things in the IPSEC_OUT M_CTL mblk. Highlights include:

   - Policy entry and/or actions (actions --> what precise algorithm(s)
     need(s) to be applied to the packet).

   - Outbound SA (if cached in the transmitting conn_t). Maybe some
     info from the SA in a readable form (e.g. algorithms).
   - conn_t of transmitter (may be NULL for ICMP replies and TCP RST)

For the receive side, look at the IPSEC_IN mblk, but mostly, you'll want:

   - Inbound SA and all relevant fields.

And don't forget --> IPsec uses ip_drop_packet() whenever a packet is
dropped for IPsec reasons. FBT + ip_drop_packet() is a gold mine of
usable information already; if you could extend the rest of TCP/IP to
use ip_drop_packet(), it may help a LOT.

> (assuming this is doable - I need to check if ip::: probes can be placed
> so as to not see duplicates of IPSec traffic).

You're right - placement is everything.

Dan
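As a sketch of that gold mine - using only the existing fbt provider and
the ip_drop_packet() function named above - counting drops by kernel
stack might look like:

   # dtrace -n 'fbt::ip_drop_packet:entry { @drops[stack()] = count(); }'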
all,

what happens when I CTRL-Z suspend a dtrace script? does this only
suspend the user-mode portion (formatting and output) of dtrace, and not
the kernel-mode portion (probe actions)? if so, where does the generated
data go - does it keep on accumulating in the kernel?

I am guessing that if my actions are not calling trace or print, and
only accumulating aggregates, other than in my END action, then there
isn't going to be any data loss, or data overflow?

I am guessing that if there is some instrumentation overhead of the
kernel portion of my dtrace script, then CTRL-Z suspending it won't
affect the overhead, that the kernel portion keeps on running?

thanks,
Pete Lawrence.
Peter Lawrence
2007-Jun-12 19:41 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Dan,

I've been studying the Solaris kernel implementation of SSL, and the
encompassing kernel-crypto-framework, and have no interest in user-land
implementations.

I'm still curious about how IPsec and SSL compare.

SSL is for TCP, so it's for reliable-ordered-byte-stream connections.

IPsec is, I'm guessing from the IP in its name, just encrypted IP, which
could be containing UDP or TCP or ...

so, I suppose that SSL could be emulated with IPsec as the underlying
protocol for TCP, but then I think nothing would prohibit the IPsec
connection from being used for additional connections, or even other
next level up (UDP, ..., NFS?, ...) protocols and connections.

I'm guessing that there isn't much in common in their implementations
other than crypto functions, since they're at such different locations
in the protocol stack.

thanks,
Pete.

Dan McDonald wrote On 06/12/07 12:02 PM,:

> On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:
>
> <SNIP!>
>
>> (and, for my own edification, how does IPsec relate to SSL, and if they
>> are unrelated and I need to dtrace SSL how will I do that ?)
>
> There *is* kernel SSL, but most SSL is in user-space. I *think* what's
> being proposed is an all-kernel set of probes.
>
> Dan
Dan McDonald
2007-Jun-12 19:49 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 12:41:15PM -0700, Peter Lawrence wrote:

> Dan,
> I've been studying the Solaris kernel implementation of SSL,
> and the encompassing kernel-crypto-framework, and have no interest
> in user-land implementations.

Ack.

> I'm still curious about how IPsec and SSL compare
>
> SSL is for TCP, so it's for reliable-ordered-byte-stream connections
>
> IPsec is, I'm guessing from the IP in its name, just encrypted IP,
> which could be containing UDP or TCP or ...

You are correct in that IPsec and SSL protect at different layers in the
stack.

> so, I suppose that SSL could be emulated with IPsec as the underlying
> protocol for TCP, but then I think nothing would prohibit the IPsec
> connection from being used for additional connections, or even other
> next level up (UDP, ..., NFS?, ...) protocols and connections.

SSL and IPsec have different packet formats. I don't even *know* (he
says sheepishly) what the SSL/TLS data formats look like. You can't
really build one in terms of the other and maintain interoperability.

Also, IPsec isn't tied to connections per se -- it's a per-datagram
protocol. You can narrow IPsec to one TCP connection by either using
per-socket policy or by specifying the full 5-tuple (TCP, laddr, raddr,
lport, rport) in IPsec policy.

> I'm guessing that there isn't much in common in their implementations
> other than crypto functions, since they're at such different locations
> in the protocol stack.

That's a fair statement!

Dan
Peter,

There is a set of switch buffers allocated for each DTrace consumer, and
then for each processor core or hardware thread. For example, if you
start a DTrace script on a T2000 with 32 hardware threads available, the
system allocates 32 sets of switch buffers just for your script.

If you've put your DTrace script process into a STOP state, the probes
continue to be activated and data recorded into the buffers; however,
when the buffers fill, you'll start losing data. When this happens while
the script is running (i.e. you're generating so much data that the
consumer can't keep up), then you'd see "dtrace: nnnnn drops on CPU n",
but since the process is stopped, I doubt you'd ever see the message.

Aggregation keys are kept in a different buffer, but the principle is
the same. If stopping the process causes the buffer to overflow, you'll
get aggregation drops.

Chip

Peter Lawrence wrote:

> all,
> what happens when I CTRL-Z suspend a dtrace script, does this only
> suspend the user-mode portion (formatting and output) of dtrace, and
> not the kernel-mode portion (probe actions). if so, where does the
> generated data go, does it keep on accumulating in the kernel.
[...]
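If drops while stopped are a concern, the principal and aggregation
buffers can be enlarged when starting the consumer; for example (sizes
here are arbitrary):

   # dtrace -b 8m -x aggsize=4m -s ./script.d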
Nicolas Williams
2007-Jun-12 20:00 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 12:41:15PM -0700, Peter Lawrence wrote:

> I've been studying the Solaris kernel implementation of SSL,
> and the encompassing kernel-crypto-framework, and have no interest
> in user-land implementations.
>
> I'm still curious about how IPsec and SSL compare

They are very, very different.

IPsec consists of ESP and AH for protecting IP payloads (and, in the
case of AH, headers) below the transport protocols using keys that are
either manually configured or exchanged with a key exchange protocol.
TLS runs atop a transport protocol, has its own authentication and key
exchange protocols and its own "security layer" (to borrow a term from
SASL).

There is nothing in common, on the wire, between IPsec and TLS (except,
of course, for things incidental to authentication, like PKIX
certificates). IPsec protects packets. TLS protects octet streams or
datagram streams (DTLS; see below). The security models are different
and so is the applicability of each.

In my view the most important distinction between IPsec and TLS is that
the former is usually implemented with no useful APIs for application
developers while for the latter there is a plethora of APIs.

> SSL is for TCP, so it's for reliable-ordered-byte-stream connections

And then there's DTLS (Datagram TLS) which, as you can probably imagine,
can run over UDP.

> IPsec is, I'm guessing from the IP in its name, just encrypted IP,
> which could be containing UDP or TCP or ...

ESP and AH. Plus key exchange protocols (IKEv1, IKEv2, KINK).

> so, I suppose that SSL could be emulated with IPsec as the underlying
> protocol for TCP, but then I think nothing would prohibit the IPsec
> connection from being used for additional connections, or even other
> next level up (UDP, ..., NFS?, ...) protocols and connections.

FYI, the IETF BTNS WG is working on a notion of "IPsec channel" and
"IPsec APIs."

> I'm guessing that there isn't much in common in their implementations
> other than crypto functions, since they're at such different locations
> in the protocol stack.

Correct, but then, kssl runs in-kernel, so there's a bit more in common
than you expected, but not much.

Nico
--
Just one extra thing to add to Chip's comments. The DTrace subsystem
expects a consumer to check in periodically (i.e. when it snapshots
buffer state). If it doesn't check in for 30 seconds, by default, then
the deadman timer will fire and the consumer is as good as dead. If you
suspend a consumer for longer than 30 seconds it will bail when you next
put it in the foreground. You can just enable destructive actions to get
around this.

Jon.

> Peter,
>
> There is a set of switch buffers allocated for each DTrace consumer,
> and then for each processor core or hardware thread. For example, if
> you start a DTrace script on a T2000 with 32 hardware threads
> available, the system allocates 32 sets of switch buffers just for
> your script.
[...]
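For example, destructive actions can be enabled on the command line with
dtrace -w, or from within the script itself:

   #pragma D option destructive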
Brendan Gregg - Sun Microsystems
2007-Jun-12 20:45 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Peter,

On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:

> G'Day Brendan,
> I've got a couple very open ended questions, I'm not so
> much concerned with the actual answers as to know that you are thinking
> about such issues...
>
> how does the info available through this provider compare with
> /usr/sbin/snoop ?
>
> will it be possible to emulate the filtering capabilities of snoop ?

There are many instances where these providers will be used instead of
snoop, such as general traffic observability (eg, bytes by IP address).
The out of date URL in my first post demonstrates many of these,
including similar filtering capabilities.

However I imagine that snoop has features that these providers won't do
easily, such as,

   - output to standard capture file format (RFC 1761)
   - output formatting and translations (-V, -v)

> how does the placement of the probes imply timing info ?

In terms of kernel code-path, it doesn't.

> is there any way to track time between application send
> and ip send, and then between ip send and NIC send ?
> is there any way to track time between ip receive and
> application read, and also between NIC receive and ip
> receive ?

Kernel code-path latency was solved in 2004 with the fbt provider, which
provides 4344 probes (on my build) for examining the ip module, which
includes function entry and return values, and return offsets.

There are some cases where it is appropriate to add unstable sdt probes
to make this analysis easier. For example, when packets are dropped due
to error, it would be easier to instrument this with an sdt probe than
to process the return value and offset.

I would be happy to prototype a high level code-path latency provider
(which would be sdt based, and would probably be classified as unstable)
as a future project.

In the meantime, use fbt. If you don't know how, learn how. And when you
run into fbt instrumentation issues that are genuinely unsolvable, then
add (or RFE) sdt probes.

These providers are for users of Solaris, and they are similar to what
the "io" provider achieves. They are not for developers of Solaris who
are interested in code-path latency and can already use fbt and sdt.

Brendan

--
Brendan
[CA, USA]
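As a sketch of the fbt approach - using tcp_send_data(), one of the
functions mentioned elsewhere in this thread (probe availability varies
by build) - timing one leg of the send path might look like:

   fbt:ip:tcp_send_data:entry
   {
           self->ts = timestamp;
   }

   fbt:ip:tcp_send_data:return
   /self->ts/
   {
           /* nanoseconds spent in tcp_send_data() */
           @lat["tcp_send_data (ns)"] = quantize(timestamp - self->ts);
           self->ts = 0;
   }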
Darren.Reed at Sun.COM
2007-Jun-12 23:22 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

> G'Day Darren,
>
> On Mon, Jun 11, 2007 at 08:52:30PM -0700, Darren.Reed at sun.com wrote:
>
>> Brendan,
>>
>> I had a quick look through the URL you provided and one thing
>
> That URL currently describes "take 1" of the network provider proposal,
> and so is out of date. The newer details (IP only) were in the email I
> sent, and I'll be updating the URL based on that and the feedback I get.

Ah, I used the URL because it seemed to have better content than the
email ;)

>> Also, can you explain where the probes for each protocol fit in
>> relation to the greater processing for the protocol?
>
> Sure, probes fire when packets are sent and received. The actual
> location of the probe macros in the kernel TCP/IP code is based on
> available information and maintainability. This means that ip
> probes may be placed in tcp_send_data() and udp_send_data(), just
> as MIB macros for IP statistics appear in tcp.c and udp.c.

Saying "when packets are sent and received" is a little bit too vague.
Does this mean when IP* packets are sent/received via the mac layer, dls
layer, the driver layer or IP layer? And how far into each layer?

To compare this with snoop, what we get is more or less understood to be
a copy of the data sent to/from a device driver with anything above it,
so we know it is possible to see corrupt frames, incorrect checksums,
etc.

Having a good understanding of where the probes fit in the overall
architecture will help a lot towards us networking folks understanding
what we can (or cannot) see when using them.

>> Will tracemem() be told how to walk the packet, so I can do the below?
>> tracemem(args[1]->ip_hdr, args[1]->ip_plength)
>>
>> ...well, that'd also require ip_hdr being added to ipinfo_t.
>
> I don't think ipinfo_t should have ip_hdr; if you are interested in
> protocol details, then instead of using the "ip" provider or the generic
> ipinfo_t structure, you use the "ipv4" (or "ipv6") protocol specific
> provider, which currently provides args[2]->ipv4_hdr as a pointer
> to the raw ipha_t.
>
> I think there are two things you may be interested in:
>
> A) dumping the raw header + extended options.
>
> [...]
>
> So far so good, now we need to replace the "20" value with
> args[2]->ipv4_ihl,
>
> # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
>     args[2]->ipv4_ihl); }'
> dtrace: invalid probe specifier ip:::send,ip:::receive {
> tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
> argument #2 must be a non-zero positive integral constant expression

Will ipv4_ihl be the number of bytes or the value as found in the
header? (a count of 4 byte words)

> hmm. DTrace is trying to help out by sanity checking args,
>
>         if (dt_node_is_posconst(size) == 0) {
>                 dnerror(size, D_TRACEMEM_SIZE, "tracemem( ) argument #2 must "
>                     "be a non-zero positive integral constant expression\n");
>         }
>
> I need to check if there is a sensible way around this...
> I can think of a work-around using existing DTrace functionality, but
> it isn't pretty.

I'm not sure I understand what the problem is here but I'm glad you do :)

> B) dumping the payload
>
> It wasn't the initial intent to access the payload easily from DTrace, but
> enough people have asked that I will check how doable this is (it could
> certainly be added as a later feature).
>
> [...]
>
> baddr meaning buffer address, and tracemblk() a new function to walk
> and dump mblks easily. I'm not yet sure which approach would be best
> without trying to code it first.

Ok, dumping the payload is what I was asking about and it seems like you
understand the problem, etc.

>> For IPv6, how do you see this design evolving to include looking
>> at extension header data?
>
> Extension header data would only be accessed through the ipv6 provider,
> and can be added to the ipv6info_t struct.

For most IPv6 extension headers, there is a limit on them appearing only
once, with the exception being destination options. How do you see that
being handled?

Darren
Brendan Gregg - Sun Microsystems
2007-Jun-13 02:36 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Dan,

On Tue, Jun 12, 2007 at 03:14:23PM -0400, Dan McDonald wrote:

> On Mon, Jun 11, 2007 at 08:11:25PM -0700, Brendan Gregg - Sun Microsystems wrote:
[...]
>>    ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
>>    ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
[...]
>
> You have "ipsec" as one set of probes. I take it your ipsecinfo_t will
> distinguish between AH, ESP, or both on a packet.

Yes. Although I haven't created a suggested ipsecinfo_t struct yet. The
members of ipsecinfo_t will depend on balancing,

   - data that is stable
   - data that is useful
   - data that is available in the right places (or can be made
     available through code changes)
   - data that is maintainable in those places

>> 2) cstateinfo_t
>>
>> This provides connection state information if available,
>>
>>    cstateinfo_t,
>>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>>         cs_loopback     int       /* loopback state */
>>
>> and may be extended to include details such as,
>>
>>         cs_zoneid       int       /* zone ID */
>>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */
>>
>> cs_cid isn't provided as a conn_t *, as that would expose an unstable
>> interface; instead it is provided as a uint64_t, as it can be useful as a
>> connection ID (and people can cast it as a conn_t * for raw debugging).
>
> That seems sensible, but for lots of receive-side processing, there will be
> NOTHING resembling a conn_t immediately available.
>
> For example, IPsec packets (depending on where you put the probe) might have
> the SA (ipsa_t) available, but no way will it have the conn_t... conn lookup
> occurs AFTER inbound IPsec processing (at least right now).

The probes don't need to appear in the IPsec code - so if appropriate
IPsec data (such as ipsa_t) can be fetched after conn lookup, then
probes could be placed there. I need to continue studying the code to
see what may be best to do.

I'm glad you brought it up anyway - these are the sort of issues I want
to address before committing to a network provider plan.

>> and the actual source and dest, could be traced as follows:
>>
>>    Probes    Arguments contain
>>    ipv4:::   tunnel source and dest
>>    ipv6:::   tunnel source and dest
>>    ipsec:::  actual source and dest, and tunnel source and dest
>
> IPsec is not a tunnelling protocol, and not all IPsec packets have tunnel
> source/dest to contend with. Also, in Solaris, IP tunnels are distinct
> entities; we implement IPsec tunnel-mode inside the context of our IP
> tunnelling - that's the implementation tack of Tunnel Reform.
>
> For IPsec transmit-side, depending on where the probes are placed, you'll
> want to know things in the IPSEC_OUT M_CTL mblk. Highlights include:
>
>    - Policy entry and/or actions (actions --> what precise algorithm(s)
>      need(s) to be applied to the packet).
>
>    - Outbound SA (if cached in the transmitting conn_t). Maybe some
>      info from the SA in a readable form (e.g. algorithms).
Right, and if I'm probing when I have access to conn_t, then it looks
like I can usually get these details from conn_ipsec_opt_mp (so I have
both conn_t and ipsa_t at the same time).

>    - conn_t of transmitter (may be NULL for ICMP replies and TCP RST)

Yes - I don't expect conn_t information for ICMP and TCP RSTs, and so
their connection ID (cs_cid) would be NULL, and I'd need to figure out
if other details such as cs_loopback can be provided.

> For the receive side, look at the IPSEC_IN mblk, but mostly, you'll want:
>
>    - Inbound SA and all relevant fields.

Yes, and it looks like I have IPSEC_IN mblks in ip_proto_input() (at
least), so I'm hopeful I'll find a place on input that has both conn_t
and ipsa_t pointers.

> And don't forget --> IPsec uses ip_drop_packet() whenever a packet is dropped
> for IPsec reasons. FBT + ip_drop_packet() is a gold mine of usable
> information already; if you could extend the rest of TCP/IP to use
> ip_drop_packet(), it may help a LOT.

Ahh, thank goodness, that does make things easier. There seems to be the
occasional place in the tcp/ip code where a freemsg(mp) happens without
bumping debug metrics (such as MIB).

Thanks for your feedback,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-13 18:26 UTC
[dtrace-discuss] Re: [networking-discuss] Re: DTrace Network Providers, take 2
G'Day Simon,

On Tue, Jun 12, 2007 at 04:31:36AM -0700, Simon Leinen wrote:

>> 1) ipinfo_t
>>
>> This provides common details from the IP protocols that we expect to
>> have available,
>>
>>    ipinfo_t,
>>         ip_protocol     int       /* protocol */
> [...]
>> ip_protocol could be just the IP protocol number (4 or 6), or borrow
>> existing definitions such as AF_INET/AF_INET6,
>> ETHERTYPE_IP/ETHERTYPE_IPV6 or use /etc/protocols - as these may
>> better accommodate future protocol additions. Using the IP protocol
>> number (4/6) would be the least surprising choice for the end user.
>
> Your use of "IP protocol number" confuses me - I think you should call
> this "IP version number" (and the struct member "ip_version"), as in
> http://www.iana.org/assignments/version-numbers. Protocol numbers (at
> least to me) refer to upper-layer protocols such as TCP, UDP, see
> http://www.iana.org/assignments/protocol-numbers

It is actually ip_ver in the original prototype, as it only matched the
IP version number; I changed it to ip_protocol in case we wanted it to
match non-IP protocols, although I can't quite put my finger on an
actual example.

I think you are right, and thanks for the URLs: if it is version, it
should be ip_version; if it is protocol, it would be ip_protocol. I
wonder if it would be wise to include both ip_protocol and ip_version?

cheers,

Brendan

--
Brendan
[CA, USA]
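Whatever the member is finally named, scripts would use it as a
predicate; a sketch against the proposed (still unsettled) interface:

   /* count packets by IP version; assumes an ip_version member */
   ip:::receive /args[1]->ip_version == 4/ { @["IPv4"] = count(); }
   ip:::receive /args[1]->ip_version == 6/ { @["IPv6"] = count(); }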
Brendan Gregg - Sun Microsystems
2007-Jun-14 17:20 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Tue, Jun 12, 2007 at 04:22:00PM -0700, Darren.Reed at sun.com wrote:
[...]

>>> Also, can you explain where the probes for each protocol fit in
>>> relation to the greater processing for the protocol?
>>
>> Sure, probes fire when packets are sent and received. The actual
>> location of the probe macros in the kernel TCP/IP code is based on
>> available information and maintainability. This means that ip
>> probes may be placed in tcp_send_data() and udp_send_data(), just
>> as MIB macros for IP statistics appear in tcp.c and udp.c.
>
> Saying "when packets are sent and received" is a little bit too
> vague. Does this mean when IP* packets are sent/received
> via the mac layer, dls layer, the driver layer or IP layer? And
> how far into each layer?

This is easier to explain if I post a webrev (which I want to do ASAP).
Anyway, for sent traffic - it is probed as late as possible while the
conn_t is still available; and for received traffic, it is probed early
in common places.

Or described another way - IP is probed when ip meets dld or dls.

> To compare this with snoop, what we get is more or less understood
> to be a copy of the data sent to/from a device driver with anything
> above it, so we know it is possible to see corrupt frames, incorrect
> checksums, etc.

Great - so I don't need to reinvent that functionality since snoop has
me covered. ;-)

Seriously, yes, I would prefer these DTrace providers to provide
identical protocol headers to what snoop can see, but if there is some
IP header mangling down in dld or dls, then snoop would have a better
view of packets than the current placement of IP probes. But the
question is - is there IP header mangling in dld or dls (and why would
there be)?

I do see that dls_soft_ring_fanout() can drop ip packets that don't
have a complete ipv4 header,

        if ((MBLKL(mp) < sizeof (ipha_t)) || !OK_32PTR(mp->b_rptr)) {
                if ((mp = msgpullup(mp, sizeof (ipha_t))) == NULL) {
                        /* Let's toss this away */
                        dls_bad_ip_pkt++;
                        freemsg(mp);
                        continue;
                }

which seems like a precaution anyway, rather than something that will
frequently happen,

   # echo dls_bad_ip_pkt/D | mdb -k
   dls_bad_ip_pkt:
   dls_bad_ip_pkt: 0

> Having a good understanding of where the probes fit in the
> overall architecture will help a lot towards us networking folks
> understanding what we can (or cannot) see when using them.

Sure. Maybe this helps also - this is the DTrace probe listing from the
old prototype, which shows the function locations for the ip probes,

   # dtrace -ln 'ip:::'
      ID   PROVIDER            MODULE                FUNCTION NAME
   47950         ip                ip           udp_send_data send
   47951         ip                ip           tcp_send_data send
   47952         ip                ip          ip_wput_ire_v6 send
   47953         ip                ip             ip_wput_ire send
   47963         ip                ip              ip_rput_v6 receive
   47964         ip                ip                ip_input receive

Of course, this list is likely to get a lot longer as IP Filter and
IPsec are traced properly, and ipv4 and ipv6 providers are added.

[...]

>> So far so good, now we need to replace the "20" value with
>> args[2]->ipv4_ihl,
>>
>> # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
>>     args[2]->ipv4_ihl); }'
>> dtrace: invalid probe specifier ip:::send,ip:::receive {
>> tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
>> argument #2 must be a non-zero positive integral constant expression
>
> Will ipv4_ihl be the number of bytes or the value as found in
> the header? (a count of 4 byte words)

In the first prototype it was 4 byte words, which I found a little
annoying to use.
I'm thinking the translated version could be bytes, and clearly
documented as such, while bearing in mind that a pointer to the raw
header is available also. The translated version is there for usability
(eg, IP addresses are represented as strings).

[...]

>>> For IPv6, how do you see this design evolving to include looking
>>> at extension header data?
>>
>> Extension header data would only be accessed through the ipv6 provider,
>> and can be added to the ipv6info_t struct.
>
> For most IPv6 extension headers, there is a limit on them
> appearing only once, with the exception being destination
> options. How do you see that being handled?

Anything requiring looping through like-members presents a little
headache in DTrace - which doesn't provide (or so far, really need)
looping in the D language. I see them as important but not crucial for a
first release of the ipv6 provider - destination options can be added as
members to ipv6info_t when the right way to present them has been found.
But thanks for bringing it up - it is another problem that needs careful
consideration.

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems wrote:

>    rarp:::send     cstateinfo_t, arpinfo_t
>    rarp:::receive  cstateinfo_t, arpinfo_t

There isn't any reverse ARP support in the kernel, so where would these
probes be placed? In in.rarpd(1m)?

> 2) cstateinfo_t
>
> This provides connection state information if available,
>
>    cstateinfo_t,
>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>         cs_loopback     int       /* loopback state */

I assume that cs_cid will be zero when there is no conn_t (which is the
case for most of the IP transmit paths.)

What is the intended semantics of cs_loopback?

> and may be extended to include details such as,
>
>         cs_zoneid       int       /* zone ID */
>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */

For the stack ID it is probably better to use the small integer which is
in ip_stack_t->ips_netstack->netstack_stackid. It is much easier to see
how these relate to exclusive-IP zones than having a pointer cast as a
uint64_t. (The current implementation happens to pick the stackid to be
the same as the zoneid for the exclusive-IP zones.)

Just as Dan pointed out for the conn_t, the zoneid isn't known/defined
for much of the L3 receive side processing. For instance the IP and ICMP
input doesn't know it, and ARP is essentially all run in the global
zone. For UDP and TCP input we know the zoneid. And I hope that by now
Nevada has a well-defined zoneid for all the transmit paths.

> (assuming this is doable - I need to check if ip::: probes can be placed
> so as to not see duplicates of IPSec traffic).

The placement might be tricky now; should get a lot easier once we've
refactored the IP datapaths.

Erik
Erik Nordmark
2007-Jun-15 13:40 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

> This is easier to explain if I post a webrev (which I want to do ASAP).
> Anyway, for sent traffic - it is probed as late as possible while the
> conn_t is still available; and for received traffic, it is probed
> early in common places.
>
> Or described another way - IP is probed when ip meets dld or dls.

The above two statements are in conflict. When IP meets dld or dls there
is no notion of a conn_t. While some code paths in the current
implementation might have a conn_t, that is a bug which will be fixed
with the IP datapath refactoring project. So please don't depend on
this.

If you want a conn_t you need to look at transmitted packets at the top
of ip_output, and be aware that in some cases none is available (ICMP
errors, TCP RST packets, etc). Alternatively you can get a conn_t by
looking at transmitted packets at the bottom of tcp/udp/rawip.

The latter might be more consistent with the receive side where we have
a conn_t when the packet reaches the tcp/udp/rawip input code.

Erik
Brendan Gregg - Sun Microsystems
2007-Jun-15 17:16 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik, Folks,

On Fri, Jun 15, 2007 at 06:35:20AM -0700, Erik Nordmark wrote:

> Brendan Gregg - Sun Microsystems wrote:
>>    rarp:::send     cstateinfo_t, arpinfo_t
>>    rarp:::receive  cstateinfo_t, arpinfo_t
>
> There isn't any reverse ARP support in the kernel, so where would these
> probes be placed? In in.rarpd(1m)?

Ahh - I was fleshing out protocol names before embarking on the code -
you are right, these would belong in in.rarpd as USDT probes.

>> 2) cstateinfo_t
>>
>> This provides connection state information if available,
>>
>>    cstateinfo_t,
>>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>>         cs_loopback     int       /* loopback state */
>
> I assume that cs_cid will be zero when there is no conn_t (which is the
> case for most of the IP transmit paths.)

Yes, which is part of the problem. Probe placement turned out to be a
bigger pain.

Yesterday I spent many hours locating all the different code path points
that would need probes for measuring conn_t in IP (I reached more than
20 places). For most of the probes, conn_t was and should be NULL (eg,
RSTs, forwarding, bogus packets, ...). conn_t wasn't NULL for TCP and
UDP, something we learn later in the IP code path, however by this point
many inbound packets have been dropped or processed elsewhere - which
creates a headache for probe placement; I sent all dropped packets
through a trace'n'drop path, and for those being processed elsewhere I
added more probes.

Having coded most of it, it was becoming obvious that something had gone
wrong. Yes, I can get conn_t in IP, but are the code changes worth it?
If it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t
in the tcp and udp providers? And with ire_t and ill_t readily
available, can the cstateinfo_t struct get its answers from them - and
not conn_t? What DTrace scripts would need cs_cid in IP, that couldn't
be written using the udp and tcp providers?

Last night I dropped conn_t from the ip provider, and started recoding
it using ire_t and ill_t. The probes have also moved location, and many
are now near FW_HOOKS for physical events (before for inbound, after for
outbound). I'll post some suggested info structs when I see if ire_t and
ill_t are sensible to use (at the moment I'm not sure if I should be
trying to export ire_t for inbound packets - I'm doubting that it makes
sense).

> What is the intended semantics of cs_loopback?

I'd like it to be boolean, but right now it may be more practical as:

   -1   unknown
    0   not loopback
    1   loopback

[...]

> The placement might be tricky now; should get a lot easier once we've
> refactored the IP datapaths.

Placement is very tricky. Refactoring would be great. :)

thanks,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-15 17:30 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Fri, Jun 15, 2007 at 06:40:27AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > This is easier to explain if I post a webrev (which I want to do ASAP).
> > Anyway, for sent traffic - it is probed as late as possible while the
> > conn_t is still available; and for received traffic, it is probed
> > early in common places.
> >
> > Or described another way - IP is probed when ip meets dld or dls.
>
> The above two statements are in conflict. When IP meets dld or dls there
> is no notion of a conn_t. While some code paths in the current
> implementation might have a conn_t, that is a bug which will be fixed
> with the IP datapath refactoring project. So please don't depend on this.

I think I just learnt this the hard way ;). If there was a strong need for
conn_t visibility in IP, then I'd keep pushing for it. There isn't. There is
such a need for conn_t visibility (especially as a connection ID) somewhere
in the stack, such as the tcp and udp providers.

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems wrote:
> Having coded most of it, it was becoming obvious that something had gone
> wrong. Yes, I can get conn_t in IP, but are the code changes worth it? If
> it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t in
> the tcp and udp providers? And with ire_t and ill_t readily available, can
> the cstateinfo_t struct get its answers from them - and not conn_t? What
> DTrace scripts would need cs_cid in IP, that couldn't be written using the
> udp and tcp providers?

I think TCP and UDP providers make more sense. And hopefully soon the rawip
code (icmp.c - the transport provider for RAW sockets) will use conn_t as
well.

But there is a place in IP where there is (at least conceptually) none of
conn_t, ire_t, ill_t. That place is when an ICMP error (or TCP reset etc) is
generated by IP. Basically that just sends an IP packet, and ip_output finds
the ire_t which points at the ill_t.

Thus conceptually the bottom of the transmit path of {tcp,udp}_output has a
conn_t, and the top of ip_output has no context except the ip_stack_t.
ip_output then looks up an ire, and after that there is an ire_t and related
context (ill_t, nce_t) for the transmit path.

> Last night I dropped conn_t from the ip provider, and started recoding
> it using ire_t and ill_t. The probes have also moved location, and many
> are now near FW_HOOKS for physical events (before for inbound, after for
> outbound). I'll post some suggested info structs when I see if ire_t and
> ill_t are sensible to use (at the moment I'm not sure if I should be
> trying to export ire_t for inbound packets - I'm doubting that it makes
> sense).

For inbound packets (delivered to the local machine) the ire_t isn't very
useful - it is either an IRE_LOCAL or IRE_BROADCAST. For forwarded packets
the ire_t is more interesting.

Placing probes at the FW_HOOKS provides observability at the bottom of IP,
i.e. corresponding to ipInReceives on the receive side (oddly, there isn't a
corresponding xmit stat). As such it is layered below IPsec and
fragmentation/reassembly. But I think it would also be useful to have
observability at the top of IP, i.e. corresponding to ipInDelivers and
ipOutRequests, since that is above IPsec and frag/reass.

> > What is the intended semantics of cs_loopback?
>
> I'd like it to be boolean, but right now it may be more practical as:
>
>         -1      unknown
>          0      not loopback
>          1      loopback

What is it intended to mean? Will it be set for e.g. inter-zone packets on
the same system? Multicast packets that are looped back because there are
members on the transmitting interface? Multicast packets that are looped
back because the system is running as a multicast router? Transmitted
broadcast packets where a copy is handed back to the input side?

Erik
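The MIB counters Erik names can already be watched with the existing mib
provider, whose probes fire on each counter bump; a rough one-liner that
locates approximately the same top/bottom-of-IP points being discussed:

        # dtrace -n 'mib:::ipIfStatsHCInReceives,
            mib:::ipIfStatsHCInDelivers,
            mib:::ipIfStatsHCOutRequests { @[probename] = count(); }'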
Darren.Reed at Sun.COM
2007-Jun-18 17:51 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> ...
>
> Last night I dropped conn_t from the ip provider, and started recoding
> it using ire_t and ill_t. The probes have also moved location, and many
> are now near FW_HOOKS for physical events (before for inbound, after for
> outbound). I'll post some suggested info structs when I see if ire_t and
> ill_t are sensible to use (at the moment I'm not sure if I should be
> trying to export ire_t for inbound packets - I'm doubting that it makes
> sense).

The FW_HOOKS macros generally have sdt dtrace probes on either side of them,
e.g.:

        DTRACE_PROBE4(ip4__forwarding__start,
            ill_t *, in_ill, ill_t *, out_ill, ipha_t *, ipha,
            mblk_t *, mp);

        FW_HOOKS(ipst->ips_ip4_forwarding_event,
            ipst->ips_ipv4firewall_forwarding,
            in_ill, out_ill, ipha, mp, mp, ipst);

        DTRACE_PROBE1(ip4__forwarding__end, mblk_t *, mp);

If your probes are getting close to where FW_HOOKS appears, is there some
merit in replacing some of these sdt probes with probes from the provider
you're working on? Although this might only make sense if there is complete
conversion of all the sdt's into the new thing.

...I'm sure that there is such a thing as too many dtrace probe points :)

Darren
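Those sdt probes can already be enabled today, as an unstable interface; a
minimal sketch counting forwarded IPv4 packets via the probe shown above:

        # dtrace -n 'sdt:::ip4-forwarding-start { @forwarded = count(); }'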
Brendan Gregg - Sun Microsystems
2007-Jun-18 20:43 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Mon, Jun 18, 2007 at 06:56:37AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > Having coded most of it, it was becoming obvious that something had gone
> > wrong. Yes, I can get conn_t in IP, but are the code changes worth it? If
> > it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t in
> > the tcp and udp providers? And with ire_t and ill_t readily available, can
> > the cstateinfo_t struct get its answers from them - and not conn_t? What
> > DTrace scripts would need cs_cid in IP, that couldn't be written using the
> > udp and tcp providers?
>
> I think TCP and UDP providers make more sense. And hopefully soon the
> rawip code (icmp.c - the transport provider for RAW sockets) will use
> conn_t as well.

Yes; and if for some reason in the distant future conn_t does become
prolific in IP code, then we can always add an extra conn_t based argument
to the IP providers. For now it looks best not to have conn_t in IP.

> But there is a place in IP where there is (at least conceptually) none
> of conn_t, ire_t, ill_t.
> That place is when an ICMP error (or TCP reset etc) is generated by IP.
> Basically that just sends an IP packet, and ip_output finds the ire_t
> which points at the ill_t.

It looked like routing/forwarding fell into this category also.

> Thus conceptually the bottom of the transmit path of {tcp,udp}_output
> has a conn_t, and the top of ip_output has no context except the
> ip_stack_t. ip_output then looks up an ire, and after that there is an
> ire_t and related context (ill_t, nce_t) for the transmit path.
>
> > Last night I dropped conn_t from the ip provider, and started recoding
> > it using ire_t and ill_t. The probes have also moved location, and many
> > are now near FW_HOOKS for physical events (before for inbound, after for
> > outbound). I'll post some suggested info structs when I see if ire_t and
> > ill_t are sensible to use (at the moment I'm not sure if I should be
> > trying to export ire_t for inbound packets - I'm doubting that it makes
> > sense).
>
> For inbound packets (delivered to the local machine) the ire_t isn't
> very useful - it is either an IRE_LOCAL or IRE_BROADCAST.
> For forwarded packets the ire_t is more interesting.

I did prototype it for send probes only, but now I've dropped it; I was
really just using it to identify loopback traffic easily, which I can do
with a separate code-location-based argument.

> Placing probes at the FW_HOOKS provides observability at the bottom of
> IP, i.e. corresponding to ipInReceives on the receive side (oddly,
> there isn't a corresponding xmit stat). As such it is layered below
> IPsec and fragmentation/reassembly.

Yes, which should allow a snoop-like raw inspection of received packets. So
far the only received packets I'm not tracing as IP are those where
pkt_len < IP_SIMPLE_HDR_LENGTH, since they may not truly be IP at all. The
xmit stat (ipIfStatsHCOutTransmits) looks like an RFC 4293 MIB extension.

> But I think it would also be useful to have observability at the top of
> IP, i.e. corresponding to ipInDelivers and ipOutRequests, since that is
> above IPsec and frag/reass.

Yes, this would probably be where an IPsec provider would live; as for in
the IP providers as well -- that would be interesting, but I can't think of
a stable abstraction yet. We have ip:::send/receive for bottom of IP
tracing; what would we call top of IP tracing? ip:::read/write? Should this
be sdt instead? Could we rely on TCP/UDP probes, which will be at the bottom
of their stacks and close to the top of IP anyway? ...

> > > What is the intended semantics of cs_loopback?
> >
> > I'd like it to be boolean, but right now it may be more practical as:
> >
> >         -1      unknown
> >          0      not loopback
> >          1      loopback
>
> What is it intended to mean?

Cool, I was testing these cases last night,

> Will it be set for e.g. inter-zone packets on the same system?

Yes. Both send and receive should be traceable (unless you are in TCP fusion
- which I think I'd leave for the TCP provider to trace, rather than faking
up some IP events).

> Multicast packets that are looped back because there are members on the
> transmitting interface?

Yes. Currently looks like this,

   # ./ipio1.d
              FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
     ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
    ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
        ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8

A loopback send and receive are visible, but only the non-loopback send (as
we'd expect).

> Multicast packets that are looped back because the system is running as
> a multicast router?

Hmmm, I'd say yes - both the loopback send and receive events should be
visible. Although I haven't checked this one yet. ;)

> Transmitted broadcast packets where a copy is handed back to the input side?

Yes. Currently looks like,

   # ./ipio1.d
              FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.255      1     64
    ip_wput_local:receive   192.168.1.108 -> 192.168.1.255      1     64
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.108      1     64
    ip_wput_local:receive   192.168.1.108 -> 192.168.1.108      1     64
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.255      0     64
           ip_input:receive 192.168.1.148 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.217 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.160 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.146 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.213 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.188 -> 192.168.1.108      0     64
   [...]

Both the loopback broadcast to 192.168.1.255 and the physical broadcast can
be seen.

thanks for the help,

Brendan

--
Brendan
[CA, USA]
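ipio1.d above is Brendan's private debug script; a rough sketch of how such
a script might have looked, assuming the prototype arguments at the time
(args[0] cstateinfo_t, args[1] ipinfo_t) and noting that FUNC is the
unstable probefunc:

        #!/usr/sbin/dtrace -s
        /* hypothetical reconstruction, not the actual ipio1.d */
        #pragma D option quiet

        BEGIN
        {
                printf("%22s %16s    %-16s %4s %6s\n",
                    "FUNC:PROBE", "SOURCE", "DEST", "LOOP", "BYTES");
        }

        ip:::send,
        ip:::receive
        {
                printf("%14s:%-7s %16s -> %-16s %4d %6d\n",
                    probefunc, probename,
                    args[1]->ip_saddr, args[1]->ip_daddr,
                    args[0]->cs_loopback, args[1]->ip_plength);
        }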
Brendan Gregg - Sun Microsystems
2007-Jun-19 17:43 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Mon, Jun 18, 2007 at 10:51:33AM -0700, Darren.Reed at sun.com wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > ...
> >
> > Last night I dropped conn_t from the ip provider, and started recoding
> > it using ire_t and ill_t. The probes have also moved location, and many
> > are now near FW_HOOKS for physical events (before for inbound, after for
> > outbound). I'll post some suggested info structs when I see if ire_t and
> > ill_t are sensible to use (at the moment I'm not sure if I should be
> > trying to export ire_t for inbound packets - I'm doubting that it makes
> > sense).
>
> The FW_HOOKS macros generally have sdt dtrace probes on
> either side of them, e.g.:
>
>         DTRACE_PROBE4(ip4__forwarding__start,
>             ill_t *, in_ill, ill_t *, out_ill, ipha_t *, ipha,
>             mblk_t *, mp);
>
>         FW_HOOKS(ipst->ips_ip4_forwarding_event,
>             ipst->ips_ipv4firewall_forwarding,
>             in_ill, out_ill, ipha, mp, mp, ipst);
>
>         DTRACE_PROBE1(ip4__forwarding__end, mblk_t *, mp);
>
> If your probes are getting close to where FW_HOOKS appears,
> is there some merit in replacing some of these sdt probes
> with probes from the provider you're working on?

They were indeed getting close - at one point I had this in ip_wput_ire(),

        DTRACE_PROBE4(ip4__physical__out__start,
            ill_t *, NULL, ill_t *, ire->ire_ipif->ipif_ill,
            ipha_t *, ipha, mblk_t *, mp);

        FW_HOOKS(ipst->ips_ip4_physical_out_event,
            ipst->ips_ipv4firewall_physical_out,
            NULL, ire->ire_ipif->ipif_ill, ipha, mp, mp, ipst);

        DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);

        if (mp == NULL)
                goto release_ire_and_ill;

        /*
         * DTrace this as ip:::send and ipv4:::send.
         */
        DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
        DTRACE_IPV4_5(send, mblk_t *, mp, void_ip_t *, ipha,
            ipha_t *, ipha, int, 0, ill_t *, ire->ire_ipif->ipif_ill);

        mp->b_prev = SET_BPREV_FLAG(IPP_LOCAL_OUT);
        DTRACE_PROBE2(ip__xmit__1, mblk_t *, mp, ire_t *, ire);
        pktxmit_state = ip_xmit_v4(mp, ire, NULL, B_TRUE);

which has three generations of DTrace probes in one place! ... but not
anymore - those ip:::send probes are now in ip_xmit_v4()...

> Although this might only make sense if there is complete
> conversion of all the sdt's into the new thing.

I did spend some time thinking about what could be done, as the FW_HOOKS
have already instrumented packet code paths at a fairly low level (I wish
they existed in Solaris 10 3/05 - my fbt based scripts would have been much
easier to write). The IP providers must be a completely stable interface -
and can't export mblk_t, ill_t, etc directly; however to do so with SDT
probes is great for debugging, and the FW_HOOKS probes could even have more
arguments added (ire_t, ...), and more probes of a similar style created. I
think the FW_HOOKS SDT probes would be part of (and currently lead the way
for) a debugging or code-path-latency provider.

There are also a few places where the FW_HOOKS SDT and the IP provider
probes diverge - eg, ip_wput_frag_mdt().

> ...I'm sure that there is such a thing as too many dtrace
> probe points :)

These are near zero overhead when not enabled (some nops plus some movs to
set registers); although their existence may affect how the compiler
optimizes code (especially if there were loads of probes). In any case, the
CPU overhead is going to need to be measured.

It might be safer to say that there shouldn't be too many stable provider
probes in the code. Unstable probes such as those SDT ones can be dropped if
their overhead proves to be a problem; however, probes can't easily be
dropped from a stable and committed provider.
cheers, Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems wrote:
> > But there is a place in IP where there is (at least conceptually) none
> > of conn_t, ire_t, ill_t.
> > That place is when an ICMP error (or TCP reset etc) is generated by IP.
> > Basically that just sends an IP packet, and ip_output finds the ire_t
> > which points at the ill_t.
>
> It looked like routing/forwarding fell into this category also.

In that case you at least have the ill_t on which the packet was received.

> Yes, which should allow a snoop-like raw inspection of received packets.
> So far the only received packets I'm not tracing as IP are those where
> pkt_len < IP_SIMPLE_HDR_LENGTH, since they may not truly be IP at all.

OK

> The xmit stat (ipIfStatsHCOutTransmits) looks like an RFC 4293 MIB
> extension.

Ah - I forgot about the HC counters.

> Yes, this would probably be where an IPsec provider would live; as for
> in the IP providers as well -- that would be interesting, but I can't think
> of a stable abstraction yet. We have ip:::send/receive for bottom of IP
> tracing; what would we call top of IP tracing? ip:::read/write? Should
> this be sdt instead? Could we rely on TCP/UDP probes, which will be at
> the bottom of their stacks and close to the top of IP anyway? ...

One could name them ip:::request/deliver, akin to the names of the stats.

For TCP/UDP/RAW you can depend on the probes at the bottom of those
transports, but that doesn't capture:

 - packets delivered to functions inside IP (icmp errors, IGMP, MLD packets)
 - packets sent by functions inside IP (icmp errors generated, IGMP, MLD)

If such packets are 1) encrypted or 2) the transmit ones are dropped by the
TX path, then you can't see them at the ip:::send/recv probes.

> Yes. Both send and receive should be traceable (unless you are in TCP
> fusion - which I think I'd leave for the TCP provider to trace, rather
> than faking up some IP events).

I think that makes sense.

> > Multicast packets that are looped back because there are members on the
> > transmitting interface?
>
> Yes. Currently looks like this,
>
>    # ./ipio1.d
>               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
>      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
>     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
>         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
>
> A loopback send and receive are visible, but only the non-loopback send
> (as we'd expect).

OK. But will people care about the FUNC names? They are likely to change in
the future.

Erik
Erik Nordmark writes:
> >    # ./ipio1.d
> >               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
> >      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
> >     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
> >         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
> >
> > A loopback send and receive are visible, but only the non-loopback send
> > (as we'd expect).
>
> OK.
> But will people care about the FUNC names? They are likely to change in
> the future.

I think that's a dtrace feature. Module and function name are under the
control of the framework itself. The provider has a name and has freedom in
the probe and argument definitions.

--
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:27 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

On Mon, Jun 18, 2007 at 06:56:37AM -0700, Erik Nordmark wrote:
[...]
> > > What is the intended semantics of cs_loopback?
> >
> > I'd like it to be boolean, but right now it may be more practical as:
> >
> >         -1      unknown
> >          0      not loopback
> >          1      loopback
>
> What is it intended to mean?
[...]

Something has become apparent as I've dug through this code: the
ip*:::send/receive probes are placed so that they always know if the packet
is loopback or not. This means that I can make these part of the probe name,

        ip*:::send              (physical)
        ip*:::receive           (physical)
        ip*:::loopback-send
        ip*:::loopback-receive

Arguments would be,

        args[0]         packet ID
        args[1]         ipinfo_t
        args[2]         ipv4info_t/ipv6info_t
        args[3]         illinfo_t

Many of the IP provider scripts I'd write would be intended to trace
physical traffic, and would use send/receive. Those scripts that want to
observe all traffic would trace both send/receive and
loopback-send/loopback-receive (or *send/*receive).

Elsewhere in the TCP/IP stack, loopback status is not easily known - and
would be provided as an argument rather than part of the probename (and
would include a value for "unknown").

These ip:::loopback* probes are a bonus due to their placement - at the
bottom of IP, when we know where that packet is being delivered to (although
I hope this isn't too confusing when people look for tcp:::loopback* and
don't find them). Does this sound like a stable choice? Can we always expect
the bottom of IP to know whether it is delivering to loopback or not (I'd
think so)?

...

I'm also considering adding the following probes for the ip* providers,

        ip*:::drop-in           (inbound packet dropped)
        ip*:::drop-out          (outbound packet dropped)

The drop-in/out probes could have the following arguments,

        args[0]         packet ID
        args[1]         ipinfo_t
        args[2]         ipv4info_t/ipv6info_t
        args[3]         debug string

The debug string would shed light on why the packet was dropped (in addition
to the function name and the stack trace, as provided by probename and
stack()). The contents of the string could be anything; it is intended for
observability but not for matching. MIB names could be used where available,
perhaps with extensions as is done for ip2dbg(), eg,

        "ipIfStatsInDiscards: discard broadcast"

but some places that drop packets don't have MIB names, eg dropping
multicast packets in dls_accept() when we don't have that address enabled,

        "multicast address not enabled"

I know internationalization could be an issue - I don't see how to address
this easily.

...

Lastly, I'm also considering adding these,

        ip*:::read              (read to IP (top of IP))
        ip*:::write             (write to IP (top of IP))

The read/write probes could have the following arguments,

        args[0]         packet ID
        args[1]         ipinfo_t

I wouldn't assume that ill_t was available, or that all the IPv* headers
were correctly set.

Brendan

--
Brendan
[CA, USA]
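A minimal sketch of how the proposed probe names might be used - assuming
the argument layout above (args[1] as ipinfo_t), which is still prototype:

        # trace all IP traffic, physical and local, summing payload bytes
        dtrace -n 'ip:::send, ip:::receive,
            ip:::loopback-send, ip:::loopback-receive
            { @[probename] = sum(args[1]->ip_plength); }'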
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:31 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Tue, Jun 19, 2007 at 02:02:09PM -0400, James Carlson wrote:
> Erik Nordmark writes:
> > >    # ./ipio1.d
> > >               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
> > >      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
> > >     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
> > >         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
> > >
> > > A loopback send and receive are visible, but only the non-loopback send
> > > (as we'd expect).
> >
> > OK.
> > But will people care about the FUNC names? They are likely to change in
> > the future.
>
> I think that's a dtrace feature. Module and function name are under
> the control of the framework itself. The provider has a name and has
> freedom in the probe and argument definitions.

Sorry - the ipio1.d script is part useful, part Brendan's debug script - the
FUNC name is provided as James said, and is definitely unstable. I had added
it so that I knew which of my send/receive probes were firing, but that
field would be dropped for any published scripts...

Brendan

--
Brendan
[CA, USA]
On Tue, Jun 19, 2007 at 11:27:27AM -0700, Brendan Gregg - Sun Microsystems wrote:

<mucho snippage deleted!>

> I'm also considering adding the following probes for the ip* providers,
>
>         ip*:::drop-in           (inbound packet dropped)
>         ip*:::drop-out          (outbound packet dropped)
>
> The drop-in/out probes could have the following arguments,
>
>         args[0]         packet ID
>         args[1]         ipinfo_t
>         args[2]         ipv4info_t/ipv6info_t
>         args[3]         debug string
>
> The debug string would shed light on why the packet was dropped (in
> addition to the function name and the stack trace, as provided by
> probename and stack()). The contents of the string could be anything; it
> is intended for observability but not for matching. MIB names could be
> used where available, perhaps with extensions as is done for ip2dbg(), eg,
>
>         "ipIfStatsInDiscards: discard broadcast"
>
> but some places that drop packets don't have MIB names, eg dropping
> multicast packets in dls_accept() when we don't have that address enabled,
>
>         "multicast address not enabled"
>
> I know internationalization could be an issue - I don't see how to
> address this easily.

First off, PLEASE don't re-invent a wheel that's already been invented.

See ipdrop.[ch] in the kernel. We have an internal mechanism in place (the
ipdropper_t) that you can expand/exploit to make anything you want up there
happen. As for the string issue, you could use a level of numeric
indirection to help out. E.g. you may bump ipIfStatsInDiscards for any
number of reasons, but an enhanced ip_drop_packet() would bump that MIB, and
perhaps take a "reason code" or "diagnostic code".

We use the concept of a diagnostic code in our PF_KEY enhancements to
disambiguate the UNIX EINVAL to great effect. You could do what we do, and
map the diagnostic code to easily-translatable strings in a user-space
library!

Dan
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:47 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Dan,

On Tue, Jun 19, 2007 at 02:33:37PM -0400, Dan McDonald wrote:
> On Tue, Jun 19, 2007 at 11:27:27AM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> <mucho snippage deleted!>
>
> > I'm also considering adding the following probes for the ip* providers,
> >
> >         ip*:::drop-in           (inbound packet dropped)
> >         ip*:::drop-out          (outbound packet dropped)
> >
> > [...]
> >
> > I know internationalization could be an issue - I don't see how to
> > address this easily.
>
> First off, PLEASE don't re-invent a wheel that's already been invented.
>
> See ipdrop.[ch] in the kernel. We have an internal mechanism in place (the
> ipdropper_t) that you can expand/exploit to make anything you want up there
> happen. As for the string issue, you could use a level of numeric
> indirection to help out. E.g. you may bump ipIfStatsInDiscards for any
> number of reasons, but an enhanced ip_drop_packet() would bump that MIB,
> and perhaps take a "reason code" or "diagnostic code".
>
> We use the concept of a diagnostic code in our PF_KEY enhancements to
> disambiguate the UNIX EINVAL to great effect. You could do what we do, and
> map the diagnostic code to easily-translatable strings in a user-space
> library!

Awesome - I can just DTrace ip_drop_packet(). Why didn't I see that before?
Oh - there is only ONE call to ip_drop_packet() in the whole of ip.c! I see
267 freemsg()'s in there! Uhh - have I missed something?

I'm very happy to use ip_drop_packet() (from what I've just seen it looks
like the right way). Will the rest of ip.c start using it?

cheers

Brendan

--
Brendan
[CA, USA]
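In the meantime, ip_drop_packet() can already be traced with fbt (unstable,
but handy until stable drop probes exist); a rough one-liner:

        # dtrace -n 'fbt::ip_drop_packet:entry { @drops[stack()] = count(); }'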
On Tue, Jun 19, 2007 at 11:47:35AM -0700, Brendan Gregg - Sun Microsystems wrote:
> G'Day Dan,

Hello!

<mucho snippage deleted!>

> > We use the concept of a diagnostic code in our PF_KEY enhancements to
> > disambiguate the UNIX EINVAL to great effect. You could do what we do,
> > and map the diagnostic code to easily-translatable strings in a
> > user-space library!
>
> Awesome - I can just DTrace ip_drop_packet(). Why didn't I see that before?
> Oh - there is only ONE call to ip_drop_packet() in the whole of ip.c!
> I see 267 freemsg()'s in there! Uhh - have I missed something?

Yes -- this RFE:

        6321434  Created  P3  kernel/tcp-ip
        Every dropped packet in IP should use ip_drop_packet()

> I'm very happy to use ip_drop_packet() (from what I've just seen it looks
> like the right way). Will the rest of ip.c start using it?

Quite frankly, I'm of the opinion that the above RFE fits in with your
Networking Provider work. The interface itself (ip_drop_packet() and
friends) may need a bit of enhancement. I'm more than happy to help with
such architectural changes.

I *do* understand, however, that there's a LOT of scutwork that needs to be
done to make the above RFE happen. Also, if all of TCP/IP starts using
ip_drop_packet(), initialization, etc. will have to happen earlier in IP.
Right now, ip_drop_init() is done as part of ipsec_stack_init(), and if we
generalize it, we'll have to yank it up a level to IP itself.

Dan
Brendan Gregg - Sun Microsystems
2007-Jun-20 13:50 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Tue, Jun 19, 2007 at 10:56:46AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> > Yes, this would probably be where an IPsec provider would live; as for
> > in the IP providers as well -- that would be interesting, but I can't
> > think of a stable abstraction yet. We have ip:::send/receive for bottom
> > of IP tracing; what would we call top of IP tracing? ip:::read/write?
> > Should this be sdt instead? Could we rely on TCP/UDP probes, which will
> > be at the bottom of their stacks and close to the top of IP anyway? ...
>
> One could name them ip:::request/deliver, akin to the names of the stats.
>
> For TCP/UDP/RAW you can depend on the probes at the bottom of those
> transports, but that doesn't capture:
>
>  - packets delivered to functions inside IP (icmp errors, IGMP, MLD
>    packets)

I'm trying out probes within the IP functions to cover those. So far I have
27 probes just for ip*:::deliver.

>  - packets sent by functions inside IP (icmp errors generated, IGMP, MLD)

I've put probes in ip_wput_ire*() to capture those, and I'm testing to see
how that works.

> If such packets are 1) encrypted or 2) the transmit ones are dropped by
> the TX path, then you can't see them at the ip:::send/recv probes.

Yes - ip*:::request/deliver won't match ip*:::send/receive events; I just
need to check that for each instance where they don't match, it makes sense
to do so (eg, an IP request was dropped before the send).

Anyhow, the probe list for IP is getting long. In terms of what probe points
the IP providers should trace - this list should be close to completion.

   # dtrace -ln 'ip*:::'
      ID   PROVIDER            MODULE            FUNCTION NAME
   11406       ipv6                ip          ip_xmit_v6 send
   11410       ipv6                ip      ip_wput_ire_v6 request
   11411       ipv6                ip    ip_wput_local_v6 local-receive
   11416       ipv6                ip    ip_wput_local_v6 local-send
   11421       ipv6                ip          ip_rput_v6 receive
   11424       ipv6                ip     ip_rput_data_v6 deliver
   11425       ipv6                ip    ip_fanout_udp_v6 deliver
   11426       ipv6                ip    ip_fanout_tcp_v6 deliver
   11427       ipv6                ip  ip_fanout_proto_v6 deliver
   11428       ipv6                ip  ip_fanout_sctp_raw deliver
   11438       ipv4                ip       ip_wput_local local-receive
   11439         ip                ip    ip_wput_local_v6 local-receive
   11440         ip                ip       ip_wput_local local-receive
   11445       ipv4                ip       ip_wput_local local-send
   11446         ip                ip    ip_wput_local_v6 local-send
   11447         ip                ip       ip_wput_local local-send
   11466       ipv4                ip       udp_send_data request
   11467       ipv4                ip    tcp_lsosend_data request
   11468       ipv4                ip  tcp_multisend_data request
   11469       ipv4                ip       tcp_send_data request
   11470       ipv4                ip         ip_wput_ire request
   11618       ipv4                ip            ip_input receive
   11619         ip                ip          ip_rput_v6 receive
   11620         ip                ip            ip_input receive
   11667       ipv4                ip       udp_send_data send
   11668       ipv4                ip    tcp_lsosend_data send
   11669       ipv4                ip  tcp_multisend_data send
   11670       ipv4                ip       tcp_send_data send
   11671       ipv4                ip          ip_xmit_v4 send
   11672       ipv4                ip        ip_wput_frag send
   11673       ipv4                ip    ip_wput_frag_mdt send
   11674       ipv4                ip     ip_fast_forward send
   11675         ip                ip       udp_send_data send
   11676         ip                ip    tcp_lsosend_data send
   11677         ip                ip  tcp_multisend_data send
   11678         ip                ip       tcp_send_data send
   11679         ip                ip          ip_xmit_v6 send
   11680         ip                ip          ip_xmit_v4 send
   11681         ip                ip        ip_wput_frag send
   11682         ip                ip    ip_wput_frag_mdt send
   11683         ip                ip     ip_fast_forward send
   11787       ipv4                ip      ip_fanout_sctp deliver
   11788       ipv4                ip  ip_fanout_sctp_raw deliver
   11789       ipv4                ip       ip_sctp_input deliver
   11790       ipv4                ip        ip_tcp_input deliver
   11791       ipv4                ip        ip_udp_input deliver
   11792       ipv4                ip  ip_fanout_udp_conn deliver
   11793       ipv4                ip       ip_fanout_tcp deliver
   11794       ipv4                ip     ip_fanout_proto deliver
   12339       ipv4                ip      ip_drop_packet drop-out
   12340       ipv4                ip      ip_drop_packet drop-in
   12341       ipv6                ip      ip_drop_packet drop-out
   12342       ipv6                ip      ip_drop_packet drop-in

I renamed the loopback* probes to local*, which seems to make more sense as
they are tracing local traffic ("loopback" may suggest lo0 only, which isn't
correct).

I've also been struggling with the probe names ... request and deliver. I
keep remembering one but not the other, or forgetting which way around they
go. I know being MIB name based would be a good thing, but I'd guess that I
have a better familiarity with MIB names than most end users, and if they
aren't helping me then they may be less likely to help others. I'll change
them back to read/write and see how they feel...

...

I dropped probes that I had down in dls and mac, which were to allow
extended IP tracing if the interface was in promiscuous mode. It seemed too
strange - either getting DTrace to put interfaces in promiscuous mode,
creating a new promiscuous-adm command, or getting users to run snoop. Of
course, without these probes there is no way that the IP providers will see
every IP packet on the wire - as that requires changing NIC state. The IP
providers will see every IP packet delivered to IP.

cheers,

Brendan

--
Brendan
[CA, USA]
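With the prototype loaded, these can be enabled in aggregate; e.g. a rough
one-liner counting how often each provider's deliver probes fire:

        # dtrace -n 'ip*:::deliver { @[probeprov] = count(); }'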
Brendan Gregg - Sun Microsystems
2007-Jun-20 21:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Jeremy,

On Wed, Jun 20, 2007 at 09:53:44PM +0100, Jeremy Harris wrote:
> Brendan Gregg - Sun Microsystems wrote:
> >    ID   PROVIDER            MODULE            FUNCTION NAME
> [...]
> > 11668       ipv4                ip    tcp_lsosend_data send
> > 11669       ipv4                ip  tcp_multisend_data send
>
> Aren't these more implementation artifacts than things suitable
> for exposure in a stable provider?

Do you mean the MODULE and FUNCTION columns? Those appear because I typed
"dtrace -l"; the actual stable provider consists of the PROVIDER and NAME
columns, and the probe arguments.

If you meant that LSO and MDT are implementation details, then sure, their
activity is tied to implementation, and the IP packets that they generate
will be observable by the stable IP provider in a non-implementation
specific way (send/receive probes, not lso-send/mdt-send/etc). If you want
to know actual application sends, then you use other network providers up
the stack (TCP, socket, ...).

This is similar to observing I/O from either the io provider or the syscall
layer. What the io provider traces is often implementation activity rather
than application requests (eg, read-ahead).

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-23 02:54 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

The IP providers have evolved into three parts based on the probe name. Here
is an update for each, and a link to a webrev.

1) send/receive and local-send/local-receive probes

These probes are working well.

   # ./ipio01.d
   NAME       SOURCE                DEST            BYTES INT      PROTO
   send       192.168.1.108      -> 192.168.1.109      68 nge0     TCP
   receive    192.168.1.109      -> 192.168.1.108      68 nge0     TCP
   send       192.168.1.108      -> 192.168.1.109      20 nge0     TCP
   ^C

*send/*receive instrument the bottom of IP - when IP speaks with the network
interface. Also, since IPsec tunnels are considered an interface, we can
also trace as IP speaks to tunnel devices, and as those tunnels speak to the
destination interface,

   # ./ipio01.d
   NAME       SOURCE                DEST            BYTES INT      PROTO
   send       10.7.250.24        -> 192.146.17.75      68 ip.tun0  TCP
   send       192.168.1.108      -> 172.16.10.1       140 nge0     UDP
   receive    172.16.10.1        -> 192.168.1.108     140 nge0     UDP
   receive    192.146.17.75      -> 10.7.250.24        68 ip.tun0  TCP
   send       10.7.250.24        -> 192.146.17.75      20 ip.tun0  TCP
   send       192.168.1.108      -> 172.16.10.1        92 nge0     UDP
   ^C

Great - this is visibility of the actual endpoint IP addresses, as well as
those of the tunnel.

Much can be done from these *send/*receive probes alone; I feel it makes
sense to take these to completion (testing, code tweaking) and putback into
Solaris as the first stage of integration. Having already prototyped the
other probes (and other providers), I'm confident that these *send/*receive
probes can be integrated without impeding future probe integration. I still
need to perform more testing to confirm that these probes are instrumenting
different kinds of traffic correctly.

2) drop-in/drop-out probes

Tracing ip_drop_packet() properly is very little work (and mostly done);
what will be time consuming is completing RFE 6321434 - so that every
dropped packet in IP will use ip_drop_packet(). I don't know if I am or
should be taking on RFE 6321434 yet; I'm certainly trying to help from a
DTrace perspective, but I have commitments to other projects in the coming
months (in fact, I should really be working on something else right now)...

3) read/write probes

This is for tracing the top of the IP layer - when upper level protocols
such as TCP and UDP speak to IP.

   # ./ipio03.d
   NAME       SOURCE                DEST            BYTES INT
   write      192.168.1.108      -> 192.168.1.109      68 nge0
   send       192.168.1.108      -> 192.168.1.109      68 nge0
   receive    192.168.1.109      -> 192.168.1.108      68 nge0
   read       192.168.1.109      -> 192.168.1.108      68 nge0
   write      192.168.1.108      -> 192.168.1.109      20 nge0
   send       192.168.1.108      -> 192.168.1.109      20 nge0
   ^C

The above output shows the read/write probes traced along with the
send/receive probes -- the flow of data can be seen as it is written to IP
and then sent, received by IP and then read, etc.

Note that I'm exporting the interface at the read/write level -- I'm very
cautious about what information is stable at this point; however, I found
that when using the prototype provider it was a real pain not to know the
interface. Looking through the code to see what was possible, I found that
almost all of the MIB stats at this level (Requests/Delivers) are written
from an ill_t - so interface information was usually there. Note that this
is the expected interface, as these read/write probes are tracing what was
asked of IP -- only at send/receive do you know what really happened...

And for an IPsec tunnel,

   # ./ipio03.d
   NAME       SOURCE                DEST            BYTES INT
   write      10.7.250.24        -> 192.146.17.75      68 ip.tun0
   send       10.7.250.24        -> 192.146.17.75      68 ip.tun0
   write      192.168.1.108      -> 172.16.10.1        88 nge0
   send       192.168.1.108      -> 172.16.10.1       140 nge0
   receive    172.16.10.1        -> 192.168.1.108     140 nge0
   read       172.16.10.1        -> 192.168.1.108     140 nge0
   receive    192.146.17.75      -> 10.7.250.24        68 ip.tun0
   read       192.146.17.75      -> 10.7.250.24        68 ip.tun0
   write      10.7.250.24        -> 192.146.17.75      20 ip.tun0
   send       10.7.250.24        -> 192.146.17.75      20 ip.tun0
   write      192.168.1.108      -> 172.16.10.1        40 nge0
   send       192.168.1.108      -> 172.16.10.1        92 nge0
   ^C

Note the payload size (BYTES) increase as encryption headers are added, and
decrease as they are removed.

There are many places I've been adding these DTrace read/write macros to,
and while I'm catching all the traffic I've thrown at it correctly, I
wouldn't be surprised if I missed a code path or several (due to
optimizations, the IP code is a massive game of snakes and ladders). This
will need more testing (a test suite?), and probably more probe tweaking.

In summary, prototyping the IP providers is coming along really well, and
I've reached a point where it may be wise to think more seriously about
putting back the IP *send/*receive probes as a first step. I've created a
new website for these prototypes,

   http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2

at the bottom of which is a link to the webrev, if anyone is interested in
getting a better understanding of this work so far (it's not a code review,
at least not yet anyway :). I'll listen to feedback for a while, and then
see if it is appropriate to generate another webrev - but this time as a
suggested putback for the *send/*receive probes.

cheers,

Brendan

--
Brendan
[CA, USA]
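ipio01.d and ipio03.d are Brendan's private test scripts; a rough sketch of
the shape of such a script - assuming args[1] is the proposed ipinfo_t (the
INT and PROTO columns come from other, still-prototype arguments, so they
are omitted here):

        #!/usr/sbin/dtrace -s
        /* hypothetical reconstruction, not the actual ipio03.d */
        #pragma D option quiet

        ip:::write, ip:::send, ip:::receive, ip:::read
        {
                printf("%-8s %15s -> %-15s %6d\n", probename,
                    args[1]->ip_saddr, args[1]->ip_daddr,
                    args[1]->ip_plength);
        }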
Darren.Reed at Sun.COM
2007-Jun-23 04:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Mon, Jun 18, 2007 at 10:51:33AM -0700, Darren.Reed at sun.com wrote:
>
> ...
>
> > ...I'm sure that there is such a thing as too many dtrace
> > probe points :)
>
> These are near zero overhead when not enabled (some nops plus some movs to
> set registers); although their existence may affect how the compiler
> optimizes code (especially if there were loads of probes). In any case,
> the CPU overhead is going to need to be measured.

Has there been any measurement or thought about the impact of too many
dtrace probes resulting in excessive "pollution" of the i-cache by
instructions that "do nothing"? In most cases I imagine that dtrace probes
are quite sparsely distributed through the kernel...

In addition, with hot paths where we're sensitive to every branch or
instruction executed, how should we quantify a dtrace probe's impact on
execution?

I mention this because other networking projects are looking very closely at
various parts of tcp/ip to try and work out how we can slim it down. Without
doubt there are bigger problems than removing dtrace probes, but at the same
time, if there is a quantifiable cost then we need to think about how we go
about applying dtrace to networking in order to get the best return for the
smallest CPU cost.

I'm most concerned that there are dtrace probes almost next to each other,
e.g. from ip.c:

        DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
        if (mp == NULL)
                goto drop;

        DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
        DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, ipha,
            ill_t *, stq_ill);

To me it seems that ip4-physical-out-end is almost an alias for send here,
if only the DTRACE_IP2() was before the if(). How can we tell dtrace that
sdt:::ip4-physical-out-end is an alias for ip:::send or similar?

or:

        /*
         * DTrace this as ip:::receive and ipv4:::receive. This
         * special case test avoids a particular code path where
         * IPsec packets are passed through here twice, both before
         * and after nattymod, and we only want to trace them once.
         * If a better way is found, this test will be dropped.
         */
        if (!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP ||
            ipha->ipha_protocol == IPPROTO_AH))) {
                DTRACE_IP2(receive, mblk_t *, mp, void_ip_t *, ipha);
                DTRACE_IPV4_3(receive, mblk_t *, mp, ipha_t *, ipha,
                    ill_t *, ill);
        }

        /*
         * The event for packets being received from a 'physical'
         * interface is placed after validation of the source and/or
         * destination address as being local so that packets can be
         * redirected to loopback addresses using ipnat.
         */
        DTRACE_PROBE4(ip4__physical__in__start,
            ill_t *, ill, ill_t *, NULL,
            ipha_t *, ipha, mblk_t *, first_mp);

Sure, there are lots of NOPs and other cheap instructions being executed
here, but we are lengthening the default code path for packets - or at the
very least the spread of instructions, so that where a cache line fill would
get us lots of useful things to do, now it gets fewer... but I acknowledge
that this is probably nit picking, and that energy needs to be first spent
elsewhere looking for things to do to make networking faster.

...and without wanting to do code review, I'll add that putting another if()
into ip_input() just for dtrace probes is not an ideal solution... every
if() in ip_input() has a measurable cost to network performance.

It would be nice if that if() could be part of the dtrace probe and only
active when the probe was active, e.g.

        DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \
            ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp,
            void_ip_t *, ipha);

Darren

p.s. your version of "wx webrev" generates "frames" that don't appear to
work well with mozilla from Solaris 10.
Garrett D'Amore
2007-Jun-24 22:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> These are near zero overhead when not enabled (some nops plus some movs to
> set registers); although their existence may affect how the compiler
> optimizes code (especially if there were loads of probes). In any case,
> the CPU overhead is going to need to be measured.

I've been spending a lot of time working on shortening the code paths in IP,
so I guess it's time for me to interject in the discussion.

At 10GbE speeds, or with small packets at 1GbE speeds (64 byte frames, for
example), we are completely CPU bound. I have done a lot of work (not yet
committed) to improve the cost in the IP stack, by simplifying certain code
and removing some redundant checks, etc. I've found, in general, the cost of
each additional branch to be ~0.1 to 0.2% performance difference. I've not
tried to measure the incremental impact of a nop.

If this project is going to lengthen the code by adding more probes, or by
adding any additional instructions (even nops!), then I think it is critical
to measure the performance impact. The best way to do this is to try testing
two code paths:

 * netperf with 64-byte UDP packets
 * IP forwarding performance with 64-byte packets (not TCP or UDP)

Make sure to use an e1000g or bge card (not nxge or ce). Frankly, I'd prefer
the testing was done with e1000g at this point, because I know what the
performance considerations for it are. :-)

If someone makes bfu archives available, I can actually do the second kind
of testing with hardware I have at hand. Once we know what the actual impact
is quantitatively, we can decide what the appropriate next steps are.

-- Garrett
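For the first test, a typical invocation would be something along these
lines (remotehost is a placeholder, and exact flags may vary by netperf
version):

        # netperf -H remotehost -t UDP_STREAM -- -m 64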
Brendan Gregg - Sun Microsystems
2007-Jun-29 04:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> > These are near zero overhead when not enabled (some nops plus some movs
> > to set registers); although their existence may affect how the compiler
> > optimizes code (especially if there were loads of probes). In any case,
> > the CPU overhead is going to need to be measured.
>
> Has there been any measurement or thought about the impact of
> too many dtrace probes resulting in excessive "pollution" of
> the i-cache by instructions that "do nothing"? In most cases
> I imagine that dtrace probes are quite sparsely distributed
> through the kernel...

Has anyone noticed a problem already? There are *already* more DTrace probes
in TCP/IP than what I'm suggesting to add. For example, the following three
commands were run at the same time,

   # snoop -r
   Using device nge0 (promiscuous mode)
   192.168.1.109 -> 192.168.1.108  TELNET C port=40965 d
   192.168.1.108 -> 192.168.1.109  TELNET R port=40965 d
   192.168.1.109 -> 192.168.1.108  TELNET C port=40965
   ^C

   # dtrace -qn 'ip*::: { @[probeprov, probename] = count();
       @["", "TOTAL:"] = count(); } END { printa("%12s %24s %@8d\n", @); }'
   ^C
             ip                     send        1
           ipv4                     send        1
           ipv4                    write        1
             ip                  receive        2
           ipv4                     read        2
           ipv4                  receive        2
                                  TOTAL:        9

   # dtrace -qn 'mib:::,sdt:::ip*,sdt:::squeue*,sdt:::conn*,sdt:::hook*
       { @[probeprov, probename] = count(); @["", "TOTAL:"] = count(); }
       END { printa("%12s %24s %@8d\n", @); }'
   ^C
            mib     ipIfStatsHCOutOctets        1
            mib   ipIfStatsHCOutRequests        1
            mib  ipIfStatsHCOutTransmits        1
            mib            tcpInAckBytes        1
            mib             tcpInAckSegs        1
            mib    tcpInDataInorderBytes        1
            mib     tcpInDataInorderSegs        1
            mib          tcpOutDataBytes        1
            mib           tcpOutDataSegs        1
            mib             tcpRttUpdate        1
            sdt     ip4-physical-out-end        1
            sdt   ip4-physical-out-start        1
            sdt           squeue-enqueue        1
            sdt      squeue-enqueuechain        1
            mib    ipIfStatsHCInDelivers        2
            mib      ipIfStatsHCInOctets        2
            mib    ipIfStatsHCInReceives        2
            sdt      ip4-physical-in-end        2
            sdt    ip4-physical-in-start        2
            sdt          squeue-proc-end       12
            sdt        squeue-proc-start       12
            sdt             conn-dec-ref       15
            sdt             conn-inc-ref       15
                                  TOTAL:       78

So while the ip providers fired 9 probes for those 3 packets, the existing
sdt and mib providers fired 78 - 26 probes per packet!

Ok, I did pick the worst example I could find :-) it can get to around 10
probes per packet, much better than 26, but still much more than the 2 per
packet I'm suggesting we add.

Anyway, while that may be interesting to note, there have been measurements
and thoughts about this,

Measurements:

For packet based tests using netperf and ttcp, the overhead was not visible
through the noise. PIC based measurements for i-cache suggested a drop in
hit-rate of about 0.01%, however this was also close to the noise.

DTrace instruction sampling suggests that three non-enabled probes take
around 7 ns to execute on a 2.4 GHz AMD64 CPU, and the two ip_input probes
with their if statement take around 11 ns. The execution times of the probes
and if statement appear negligible, so long as they access recently used
(cached) variables.

As for the overall effect on ip due to i-cache pollution: DTrace sampling of
the entire ip module versus packet counts showed a per packet increase in ip
time of around 50 ns. Adding and removing probes seemed to change this
measured overhead in random ways, suggesting that it is either noise, or
that the effect of adding instructions doesn't necessarily mean things get
slower (subsequent instructions are shifted to different addresses, which
map to the cache differently, possibly relieving or creating hot spots).
That DTrace is sampling to take these measurements is also affecting the
behaviour of the caches.

Thoughts:

Each probe addition adds 1 to 5 nops, and a few movs of recently accessed
data which we would expect to still be cached. Given a maximum packet rate
per CPU per second of 200,000, the execution overhead of some nops and
cached movs per packet should be negligible. The effect on i-cache is harder
to estimate, and is probably only best checked through measurements.

In summary, the execution time of the probes is negligible when considering
the per CPU packet rates; the i-cache effect looks negligible, and from the
tests was more affected by code layout than the existence or non-existence
of probes.

> In addition, with hot paths where we're sensitive to every branch
> or instruction executed, how should we quantify a dtrace probe's
> impact on execution?
>
> I mention this because other networking projects are looking
> very closely at various parts of tcp/ip to try and work out
> how we can slim it down.
>
> Without doubt there are bigger problems
> than removing dtrace probes, but at the same time, if there is
> a quantifiable cost then we need to think about how we go about
> applying dtrace to networking in order to get the best return
> for the smallest CPU cost.

I'd suggest this strategy.

1) For development/troubleshooting probes, use fbt.

fbt is free in terms of non-enabled cost. In fact, many of those sdt probes
could be served by fbt. eg,

        ipxmit_state_t
        ip_xmit_v4(mblk_t *mp, ire_t *ire, ipsec_out_t *io,
            boolean_t flow_ctl_enabled)
        {
                nce_t           *arpce;
                ipha_t          *ipha;
                queue_t         *q;
                int             ill_index;
                mblk_t          *nxt_mp, *first_mp;
                boolean_t       xmit_drop = B_FALSE;
                ip_proc_t       proc;
                ill_t           *out_ill;
                int             pkt_len;

                arpce = ire->ire_nce;
                ASSERT(arpce != NULL);

                DTRACE_PROBE2(ip__xmit__v4, ire_t *, ire, nce_t *, arpce);
        [...]

fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4. There are
other examples in ip that aren't as obvious and would involve much digging
and probe association with DTrace, but are still possible from fbt instead.

2) Use sdt if fbt fails in some way.

One advantage of private sdt probes is that they can be removed later if
they prove to be a performance problem.

3) Add public and stable provider probes if you really must, such as for an
end-user interface.

> I'm most concerned that there are
> dtrace probes almost next to each other, e.g. from ip.c:
>
>         DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
>         if (mp == NULL)
>                 goto drop;
>
>         DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
>         DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, ipha,
>             ill_t *, stq_ill);
>
> To me it seems that ip4-physical-out-end is almost an alias for
> send here, if only the DTRACE_IP2() was before the if().
>
> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias
> for ip:::send or similar?

We can't - DTrace doesn't support aliasing like that.

> or:
>
> [...]
>
> Sure, there are lots of NOPs and other cheap instructions being
> executed here, but we are lengthening the default code path for
> packets - or at the very least the spread of instructions, so that
> where a cache line fill would get us lots of useful things to
> do, now it gets fewer... but I acknowledge that this is probably
> nit picking, and that energy needs to be first spent elsewhere
> looking for things to do to make networking faster.
>
> ...and without wanting to do code review, I'll add that putting
> another if() into ip_input() just for dtrace probes is not an
> ideal solution... every if() in ip_input() has a measurable cost
> to network performance.

Understood - I'm trying to avoid them.

> It would be nice if that if() could be part of the dtrace probe
> and only active when the probe was active, e.g.
>
>         DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \
>             ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp,
>             void_ip_t *, ipha);

Cool idea - I don't think it's necessary yet, but if the probe additions
start to use many if statements, then it's good to have such options to
try. :)

cheers,

Brendan

--
Brendan
[CA, USA]
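To make strategy 1 concrete: since fbt arguments are typed via CTF, a rough
one-liner can read ip_xmit_v4()'s arguments directly - e.g. counting
transmits by ire type (a sketch only; fbt probes are inherently unstable):

        # dtrace -n 'fbt::ip_xmit_v4:entry { @[args[1]->ire_type] = count(); }'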
Garrett D'Amore
2007-Jun-29 05:30 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
>
> > Has there been any measurement or thought about the impact of
> > too many dtrace probes resulting in excessive "pollution" of
> > the i-cache by instructions that "do nothing"? In most cases
> > I imagine that dtrace probes are quite sparsely distributed
> > through the kernel...
>
> Has anyone noticed a problem already? There are *already* more DTrace
> probes in TCP/IP than what I'm suggesting to add. For example, the
> following three commands were run at the same time,

In short, *yes*. We see problems with the length of the code paths, and
negative impacts on the number of packets that the system can process. Every
cycle is precious. I need to do the analysis on your proposed probes, but
I'll do it soon.

Note that with TCP, you won't see the problem. But try running a protocol
that uses 64-byte (or at high speed, even 200 or 1500 byte) UDP packets.
Most people assume that "I didn't see a performance penalty" means something
when they're testing TCP stream performance, but that's a fallacy. When
doing large packets with TCP, of *course* you don't notice. But run anything
other than TCP, and you'll find out where the *real* problems in Solaris'
networking performance lie.

-- Garrett
PIC based measurements for i-cache suggested a drop in > hit-rate of about 0.01%, however this was also close to the noise. > > DTrace instruction sampling suggests that three non-enabled probes take > around 7 ns to execute on a 2.4 GHz AMD64 CPU, and the two ip_input probes > with their if statement takes around 11 ns. The execution times of the > probes and if statement appears negligible, so long as they access recently > used (cached) variables. > > As for the overall effect on ip due to i-cache pollution; DTrace sampling > of the entire ip module versus packet counts showed a per packet increase > in ip time of around 50 ns. Adding and removing probes seemed to change > this measured overhead in random ways, suggesting that it is either noise, > or that the effect of adding instructions doesn''t necessarily mean things > get slower (subsequent instructions are shifted to different addresses, > which map to the cache differently, possibly relieving or creating hot > spots). That DTrace is sampling to take these measurements is also affecting > the behaviour of the caches. > > Thoughts: > > Each probe addition adds 1 to 5 nops, and a few movs of recently > accessed data which we would expect to still be cached. Given a maximum > packet rate per CPU per second of 200,000, the execution overhead of some > nops and cache movs per packet should be negligible. The effect on i-cache > is harder to estimate, and is probably only best checked through > measurements. > > In summary, the execution time of the probes is negligible when considering > the per CPU packet rates; the i-cache effect looks negligble, and from the > tests was more affected by code layout than the existance or non-existance > of probes. > > >> In addition, with hot paths where we''re sensitive to every branch >> or instruction executed, how should we quantify a dtrace probe''s >> impact on execution? >> >> I mention this because other networking projects are looking >> very closely at various parts of tcp/ip to try and work out >> how we can slim it down. >> >> Without doubt there are bigger problems >> than removing dtrace probes but at the same time, if there is >> a quantifiable cost then we need to think about how we go about >> applying dtrace to networking in order to get the best return >> for the smallest CPU cost. >> > > I''d suggest this strategy. > > 1) For development/troubleshooting probes, use fbt. > > fbt is free in terms on non-enabled cost. In fact, many of those sdt probes > could be served by fbt. eg, > > ipxmit_state_t > ip_xmit_v4(mblk_t *mp, ire_t *ire, ipsec_out_t *io, boolean_t flow_ctl_enabled) > { > nce_t *arpce; > ipha_t *ipha; > queue_t *q; > int ill_index; > mblk_t *nxt_mp, *first_mp; > boolean_t xmit_drop = B_FALSE; > ip_proc_t proc; > ill_t *out_ill; > int pkt_len; > > arpce = ire->ire_nce; > ASSERT(arpce != NULL); > > DTRACE_PROBE2(ip__xmit__v4, ire_t *, ire, nce_t *, arpce); > [...] > > fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4. There are > other examples in ip that aren''t as obvious and would involve much digging > and probe association with DTrace, but are still possible from fbt instead. > > 2) use sdt if fbt fails in some way. > > One advantage of private sdt probes is that they can be removed later > if proved to be a performance problem. > > 3) add public and stable provider probes if you really must, such as for > an end-user interface. 
> > >> I''m most concerned that there are >> dtrace probes almost next to each other, e.g from ip.c: >> >> 14250 DTRACE_PROBE1(ip4__physical__out__end, mblk_t >> *, >> 14251 mp); >> 14252 if (mp == NULL) >> 14253 goto drop; >> 14254 >> 14255 DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, >> ipha); >> 14256 DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, >> ipha, >> 14257 ill_t *, stq_ill); >> >> To me it seems that ip4-physical-out-end is almost an alias for >> send here, if only the DTRACE_IP2() was before the if(). >> >> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias >> for ip:::send or similar? >> > > We can''t - DTrace doesn''t support aliasing like that. > > >> or: >> >> 15205 /* >> 15206 * DTrace this as ip:::receive and ipv4:::receive. >> This >> 15207 * special case test avoids a particular code path >> where >> 15208 * IPsec packets are passed through here twice, both >> before >> 15209 * and after nattymod, and we only want to trace them >> once. >> 15210 * If a better way is found, this test will be >> dropped. >> 15211 */ >> 15212 if (!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || >> 15213 ipha->ipha_protocol == IPPROTO_AH))) { >> 15214 DTRACE_IP2(receive, mblk_t *, mp, void_ip_t >> *, ipha); >> 15215 DTRACE_IPV4_3(receive, mblk_t *, mp, ipha_t >> *, ipha, >> 15216 ill_t *, ill); >> 15217 } >> 15218 >> 15219 /* >> 15220 * The event for packets being received from a >> ''physical'' >> 15221 * interface is placed after validation of the source >> and/or >> 15222 * destination address as being local so that packets >> can be >> 15223 * redirected to loopback addresses using ipnat. >> 15224 */ >> 15225 DTRACE_PROBE4(ip4__physical__in__start, >> 15226 ill_t *, ill, ill_t *, NULL, >> 15227 ipha_t *, ipha, mblk_t *, first_mp); >> >> Sure, there are lots of NOPs and other cheap instructions being >> executed here, but we are lengthening the default code path for >> packets - or at very least the spread of instructions so that >> where a cache line fill would get us lots of useful things to >> do, now it gets fewer...but I acknowledge that this is probably >> nit picking and that energy needs to be first spent elsewhere >> looking for things to do to make networking faster. >> >> ...and without wanting to do code review, I''ll add that putting >> another if() into ip_input() just for dtrace probes is not an >> ideal solution...every if() in ip_input() has a measurable cost >> to network performance. >> > > Understood - I''m trying to avoid them. > > >> It would be nice if that if() could be part of the dtrace probe >> and only active when the probe was active, e.g. >> >> DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \ >> ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp, >> void_ip_t *, ipha); >> > > Cool idea - I don''t think it''s necessary yet, but if the probe additions > start to use many if statements, then it''s good to have such options to > try. :) > > cheers, > > Brendan > >
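A minimal, untested D sketch of the fbt substitution described in point 1
above, using only the names visible in the quoted ip_xmit_v4() source (the
aggregation name is invented):

    /* count transmits by nce - roughly what sdt:::ip-xmit-v4 exposed */
    fbt::ip_xmit_v4:entry
    {
    	/* args[1] is the ire_t *; ire_nce is what the sdt probe passed */
    	@xmits[(uint64_t)args[1]->ire_nce] = count();
    }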
Jeremy Harris
2007-Jun-29 15:20 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> In fact, many of those sdt probes
> could be served by fbt. eg,
>
> [... ip_xmit_v4() source as quoted above ...]
>
> fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4.

Does this say that there needs to be a way to declare a stable probe
which happens to be implemented by an fbt, but which does not get lost
in some future code refactoring?

Cheers,
   Jeremy
Brendan Gregg - Sun Microsystems
2007-Jun-29 18:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Garrett,

On Thu, Jun 28, 2007 at 10:30:19PM -0700, Garrett D'Amore wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
>> Has anyone noticed a problem already? There are *already* more DTrace
>> probes in TCP/IP than what I'm suggesting to add. For example, the
>> following three commands were run at the same time,
>
> In short, *yes*. We see problems with the length of the code paths, and
> negative impacts on the number of packets that the system can process.
> Every cycle is precious. I need to do the analysis on your proposal,
> but I'll do it soon.
>
> Note that with TCP, you won't see the problem. But try running a
> protocol that uses 64-byte (or at high speed, even 200 or 1500 byte)
> UDP packets.
>
> Most people fall for the fallacy that "I didn't see a performance
> penalty" means something when they're testing TCP stream performance.

It does mean something - if you are a webserver (TCP), web proxy server
(TCP), mail server (TCP), database server (TCP), application server
(TCP), file system server (TCP), and so on; and also if you care about
SPECweb, SPECsfs, ... Verifying TCP performance is the most critical
task when considering the effect on customer servers. It isn't a fallacy
at all; it would be ignorant *not* to test TCP as well as UDP and
others.

> When doing large packets with TCP, of *course* you don't notice. But
> run anything other than TCP, and you'll find out where the *real*
> problems in Solaris' networking performance lie.

So, apart from Solaris routers, DNS servers and IPsec gateways, what
other servers would be under heavy load and be using something other
than TCP?

I'm certainly interested in testing UDP and forwarding performance,
along with TCP performance in particular. I'm measuring the effect of
these probe additions - whether they coincide with existing "real"
problems in the Solaris networking stack or not.

cheers,

Brendan

-- 
Brendan [CA, USA]
James Carlson
2007-Jun-29 18:31 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems writes:
>> When doing large packets with TCP, of *course* you don't notice. But
>> run anything other than TCP, and you'll find out where the *real*
>> problems in Solaris' networking performance lie.
>
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
systems doing VoIP. Rao Shoaib probably has contacts for that latter
category.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
Michael Hunter
2007-Jun-29 19:07 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 12:53:45 -0600
Neil Putnam <Neil.Putnam at Sun.COM> wrote:
> James Carlson wrote:
[...]
>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>> category.
>
> How about Oracle RAC? Or various financial "UDP streaming" applications?
[...]

It's not just financial. Multicast and unicast audio and video
distribution systems (essentially the made-stale-by-time protocols,
usually affinitized with VoIP). Some years ago some (most?) of the
online multi-user games were gross hacks on top of UDP, mostly
attempting to circumvent congestion control.

It sometimes amazes me that a protocol modeled on serial communications
could be dominant for so long. It's really a poor fit for a lot of the
things it's (ab)used for.

mph
Garrett D''Amore
2007-Jun-29 19:50 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Michael Hunter wrote:
> On Fri, 29 Jun 2007 12:53:45 -0600
> Neil Putnam <Neil.Putnam at Sun.COM> wrote:
>> James Carlson wrote:
> [...]
>>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>>> category.
>>
>> How about Oracle RAC? Or various financial "UDP streaming" applications?
> [...]
>
> It's not just financial. Multicast and unicast audio and video
> distribution systems (essentially the made-stale-by-time protocols,
> usually affinitized with VoIP). Some years ago some (most?) of the
> online multi-user games were gross hacks on top of UDP, mostly
> attempting to circumvent congestion control.
>
> It sometimes amazes me that a protocol modeled on serial communications
> could be dominant for so long. It's really a poor fit for a lot of the
> things it's (ab)used for.

When networks are reliable (and you don't need multicast), TCP works
pretty well. The problem is when you have networks in the middle that
start dropping frames... then TCP goes to hell in a handbasket.

The reality is that unless you start playing with wireless networks, the
mostly-lossless-ordered-packets requirements generally hold true. And
even the wireless networks generally try to detect and correct for
loss/corruption at the link layer, so TCP corrections never come into
play.

Anyway, I never meant to suggest that TCP performance tuning wasn't
appropriate, only that it was *insufficient*. We
(Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
the other protocols (and their performance) than we have in the past.

    -- Garrett
Jeremy Harris
2007-Jun-29 20:15 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Finance houses, running Tibco Rendezvous.

- Jeremy Harris
Michael Hunter
2007-Jun-29 20:28 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 12:50:46 -0700
Garrett D'Amore <garrett at damore.org> wrote:
> Michael Hunter wrote:
[...]
> When networks are reliable (and you don't need multicast), TCP works
> pretty well. The problem is when you have networks in the middle that
> start dropping frames... then TCP goes to hell in a handbasket.

Maybe it's just the socket API. But I remember constantly repeating to
customers of mine some years back that they couldn't depend on
write() == read(). Or that the protocol-level acks didn't imply that
the data had gotten to the application. Cranking down TCP keepalive
doesn't make for a good application heartbeat. On and on. Often they
just wanted a reliable datagram protocol, but that's a big jump from
UDP. The ones that tried to go that route often botched basic protocol
considerations.

> The reality is that unless you start playing with wireless networks, the
> mostly-lossless-ordered-packets requirements generally hold true. And
> even the wireless networks generally try to detect and correct for
> loss/corruption at the link layer, so TCP corrections never come into
> play.

Don't know what the lower-latency wireless stuff uses for this[1], but
interaction between link-level error correction and error correction
(retransmission) at higher levels isn't always all that clean or
obvious.

> Anyway, I never meant to suggest that TCP performance tuning wasn't
> appropriate, only that it was *insufficient*. We
> (Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
> the other protocols (and their performance) than we have in the past.

I didn't either. I was just bemoaning a deficiency in the understanding
and availability of other transport-level protocol offerings, and trying
to extend the feel that Brendan was getting for protocol usage in the
real world.

mph

[1] I love the mathematics and ideas behind FEC. There was a group doing
research on using FEC (redundancy) in TCP to trade off bandwidth for
some probabilistic bound on transmission time in the late 90s. I thought
that could be used to solve some of the pseudo-realtime problems I saw
people trying to solve by beating their head against TCP. Oh well, that
was a different life.
Michael Hunter
2007-Jun-29 20:51 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 13:28:26 -0700
Michael Hunter <Michael.Hunter at Sun.COM> wrote:
> Don't know what the lower-latency wireless stuff uses for this[1], but
> interaction between link-level error correction and error correction
> (retransmission) at higher levels isn't always all that clean or
> obvious.

s/levels/latencies/
Garrett D''Amore
2007-Jun-29 21:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Michael Hunter wrote:
> On Fri, 29 Jun 2007 13:28:26 -0700
> Michael Hunter <Michael.Hunter at Sun.COM> wrote:
>> Don't know what the lower-latency wireless stuff uses for this[1], but
>> interaction between link-level error correction and error correction
>> (retransmission) at higher levels isn't always all that clean or
>> obvious.
>
> s/levels/latencies/

Certainly with higher-latency networks it could wreak havoc. But for
802.11, it happens so quickly that normally TCP would never notice.

    -- Garrett
Mike Gerdts
2007-Jun-29 21:42 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On 6/29/07, Brendan Gregg - Sun Microsystems <brendan at sun.com> wrote:
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Oracle RAC's cache fusion (cluster interconnect) sends database blocks
between nodes using UDP (or LLT if Veritas is in the mix). These
transfers are mostly around 8 KB, and can be quite heavy if the various
nodes (Oracle instances) are using the same blocks. This has been a
particular area of contention when each node is a significant portion of
a 15k/25k.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Darren.Reed at Sun.COM
2007-Jun-30 00:27 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
>> Brendan Gregg - Sun Microsystems wrote:
> [...]
>>> These are near zero overhead when not enabled (some nops plus some movs
>>> to set registers); although their existence may affect how the compiler
>>> optimizes code (especially if there were loads of probes). In any case,
>>> the CPU overhead is going to need to be measured.
>>>
>> Has there been any measurement or thoughts about the impact of
>> too many dtrace probes resulting in excessive "pollution" of
>> the i-cache by instructions that "do nothing"? In most cases
>> I imagine that dtrace probes are quite sparsely distributed
>> through the kernel...
>>
> Has anyone noticed a problem already? There are *already* more DTrace
> probes in TCP/IP than what I'm suggesting to add. For example, the
> following three commands were run at the same time,
>
>  # snoop -r
>  Using device nge0 (promiscuous mode)
>  192.168.1.109 -> 192.168.1.108 TELNET C port=40965 d
>  192.168.1.108 -> 192.168.1.109 TELNET R port=40965 d
>  192.168.1.109 -> 192.168.1.108 TELNET C port=40965
>  ^C

As Garrett alluded to, the situation that we're most concerned about
with performance is not just local delivery but packet forwarding, where
there are various efforts underway to try and reduce the size of the
code path. I'm sure that the length of the code path to get data to/from
telnet leans heavily towards delivery into TCP.

...

> [... mib and sdt probe counts as quoted earlier ...]
>                                 TOTAL:       78
>
> So while the ip providers fired 9 probes for those 3 packets, the existing
> sdt and mib providers fired 78 (26 probes per packet!)
>
> Ok, I did pick the worst example I could find :-) it can get to around
> 10 probes per packet, much better than 26, but still much more than the
> 2 per packet I'm suggesting we add.

Understood, and your point taken.

....

> As for the overall effect on ip due to i-cache pollution: DTrace sampling
> of the entire ip module versus packet counts showed a per-packet increase
> in ip time of around 50 ns. Adding and removing probes seemed to change
> this measured overhead in random ways, suggesting that it is either noise,
> or that the effect of adding instructions doesn't necessarily mean things
> get slower (subsequent instructions are shifted to different addresses,
> which map to the cache differently, possibly relieving or creating hot
> spots). That DTrace is sampling to take these measurements also affects
> the behaviour of the caches.

Yes, indeed.

> Thoughts:
>
> Each probe addition adds 1 to 5 nops, and a few movs of recently
> accessed data which we would expect to still be cached. Given a maximum
> packet rate per CPU per second of 200,000, the execution overhead of some
> nops and cached movs per packet should be negligible. The effect on
> i-cache is harder to estimate, and is probably best checked through
> measurements.
>
> In summary, the execution time of the probes is negligible when
> considering the per-CPU packet rates; the i-cache effect looks
> negligible, and from the tests was more affected by code layout than the
> existence or non-existence of probes.

I'm not really worried about the data cache, just the i-cache and the
CPU pipeline. Throwing in some NOPs and otherwise redundant instructions
every now and then doesn't seem like too big of a sin, but filling an
entire i-cache row or two with NOPs seems... wasteful.

When in the pipeline the CPU decides to throw away a NOP would by and
large determine what it means to have 2-3 vs 6-9 of them (from 1 vs 3
dtrace probes next to each other). This may sound silly, but I'd have
less of an issue about this if the same probes were further away from
each other :) However, this really is micro-optimisation.

>> In addition, with hot paths where we're sensitive to every branch
>> or instruction executed, how should we quantify a dtrace probe's
>> impact on execution?
>>
>> I mention this because other networking projects are looking
>> very closely at various parts of tcp/ip to try and work out
>> how we can slim it down.
>>
>> Without doubt there are bigger problems
>> than removing dtrace probes, but at the same time, if there is
>> a quantifiable cost then we need to think about how we go about
>> applying dtrace to networking in order to get the best return
>> for the smallest CPU cost.
>>
> I'd suggest this strategy.
>
> 1) For development/troubleshooting probes, use fbt.
> ...

I agree with this.

But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be nice if you
could present that as fbt::ip-xmit-v4:entry but pass in ire->ire_nce as
arg2 instead of ire?

>> I'm most concerned that there are
>> dtrace probes almost next to each other, e.g. from ip.c:
>> [... lines 14250-14257 as quoted earlier ...]
>>
>> To me it seems that ip4-physical-out-end is almost an alias for
>> send here, if only the DTRACE_IP2() was before the if().
>>
>> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias
>> for ip:::send or similar?
>>
> We can't - DTrace doesn't support aliasing like that.

Can I have this added to the list of new things I'd like to see dtrace
be able to support, along with the DTRACE_IF? :)

I don't know how this would look, maybe:

    DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
    DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
    DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
    DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);

or maybe something more complex, so that you could create a hook alias
that included just the mblk_t and the ill_t but still only have one hook
in the code path.

The general story is that we have a collection of values that we want to
export through dtrace, and we'd like to have a way to see them using
different filters (or views).

If dtrace can handle multiple people using the same probe, then
technically it should be able to cope with multiple people using the
same probe but with different aliases to present the values caught by
the dtrace hook in different ways.

Another thought: can these dtrace probe views be defined with dtrace(1M)
rather than in the kernel?

This would mean that "dtrace -l" might not provide an exhaustive list of
all the probes available (if it only consults the kernel).

So, if you wanted to get the ip-send dtrace probe, maybe in your dtrace
script you would include some special library that knows to use
ip4-physical-out-end (with extra args) and then to apply some sort of
condition and change the types of the args.

So my dtrace script might be:

    #!/usr/sbin/dtrace -Fs

    include <provider/ip>

    ip:::ip-send
    {
        ...
    }

If dtrace could be evolved to do that, then we could define base probes,
such as ip4-physical-out-start, and layer new probes on top of them
without having to deliver any new kernel modules.

Darren
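Part of the "different filters (or views)" idea already falls out of D
predicates today: any number of clauses (or separate consumers) can attach
to the same probe, each with its own filter. A trivial, untested sketch
against the existing probe, relying only on arg0 being the mblk_t pointer
(NULL once the hooks have blocked a packet):

    /* one probe, two views: packets dropped vs passed by the firewall hooks */
    sdt:::ip4-physical-out-end
    /arg0 == 0/
    {
    	@dropped = count();
    }

    sdt:::ip4-physical-out-end
    /arg0 != 0/
    {
    	@passed = count();
    }

What predicates cannot do is rename the probe or retype its arguments,
which is the part DTRACE_VIEW (or a library) would add.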
Brendan Gregg - Sun Microsystems
2007-Jun-30 02:36 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, Jun 29, 2007 at 12:50:46PM -0700, Garrett D'Amore wrote:
> Michael Hunter wrote:
>> On Fri, 29 Jun 2007 12:53:45 -0600
>> Neil Putnam <Neil.Putnam at Sun.COM> wrote:
>>> James Carlson wrote:
>> [...]
>>>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>>>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>>>> category.
>>>
>>> How about Oracle RAC? Or various financial "UDP streaming" applications?
>> [...]
>>
>> It's not just financial. Multicast and unicast audio and video
>> distribution systems (essentially the made-stale-by-time protocols,
>> usually affinitized with VoIP). Some years ago some (most?) of the
>> online multi-user games were gross hacks on top of UDP, mostly
>> attempting to circumvent congestion control.

Cool, so we can add UDP streaming application servers, VoIP servers, and
audio and video servers. This is useful to bear in mind when considering
who will be affected by any overheads.

[...]

> When networks are reliable (and you don't need multicast), TCP works
> pretty well. The problem is when you have networks in the middle that
> start dropping frames... then TCP goes to hell in a handbasket.
>
> The reality is that unless you start playing with wireless networks, the
> mostly-lossless-ordered-packets requirements generally hold true. And
> even the wireless networks generally try to detect and correct for
> loss/corruption at the link layer, so TCP corrections never come into
> play.
>
> Anyway, I never meant to suggest that TCP performance tuning wasn't
> appropriate, only that it was *insufficient*. We
> (Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
> the other protocols (and their performance) than we have in the past.

Sure, testing TCP only would be insufficient; I agree.

I think we really need a network performance test suite that tests all
protocol types and code paths, and generates a report with standard
deviations for each measurement. This will make it easier for people to
pay attention to those other protocols, and it can be made available on
an OpenSolaris page for all to use (similar to the filebench effort for
testing file systems).

Does such a tool exist? The self-service performance tests look like
they will not only provide the tests, but do them for you (even better),

    http://www.opensolaris.org/os/community/testing/selftest/

although neither iperf, netperf nor nttcp prints standard deviation -
and from the noise I've seen in my own packet tests, this is something
we'd really like to know. Although netperf can pay attention to noise by
doing its confidence tests...

Brendan

-- 
Brendan [CA, USA]
Darren.Reed at Sun.COM
2007-Jun-30 03:08 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> ...
>
> I think we really need a network performance test suite that tests all
> protocol types and code paths, and generates a report with standard
> deviations for each measurement. This will make it easier for people to
> pay attention to those other protocols, and it can be made available on
> an OpenSolaris page for all to use (similar to the filebench effort for
> testing file systems).
>
> Does such a tool exist? The self-service performance tests look like
> they will not only provide the tests, but do them for you (even better),
>
>     http://www.opensolaris.org/os/community/testing/selftest/
>
> although neither iperf, netperf nor nttcp prints standard deviation -
> and from the noise I've seen in my own packet tests, this is something
> we'd really like to know. Although netperf can pay attention to noise by
> doing its confidence tests...

My observation is that we need to be using specialised network test
equipment, such as what we have in some of the development labs, for our
testing. Which is to say that none of the currently established tests -
iperf, netperf or nttcp - is really capable of telling us what we really
need to know.

After all, if iperf/netperf/nttcp are being run from Solaris hosts that
are already subject to the limitations we're trying to fix, how can we
see if our changes make a difference?

Darren
Brendan Gregg - Sun Microsystems
2007-Jul-02 18:58 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Thu, Jun 21, 2007 at 11:49:24AM +0100, Jeremy Harris wrote:
> Hi Brendan,
>
> Brendan Gregg - Sun Microsystems wrote:
>> On Wed, Jun 20, 2007 at 09:53:44PM +0100, Jeremy Harris wrote:
>>> Brendan Gregg - Sun Microsystems wrote:
>>>>    ID   PROVIDER   MODULE             FUNCTION   NAME
>>> [...]
>>>> 11668       ipv4       ip     tcp_lsosend_data   send
>>>> 11669       ipv4       ip   tcp_multisend_data   send
>>> Aren't these more implementation artifacts than things suitable
>>> for exposure in a stable provider?
> [...]
>> If you meant that LSO and MDT are implementation details, then sure,
>> their activity is tied to implementation, and the IP packets that they
>> generate will be observable by the stable IP provider in a
>> non-implementation-specific way (send/receive probes, not
>> lso-send/mdt-send/etc).
>
> This is what I meant, yes.
> So, what stability level will these probes have?

As a new interface, I'd write the DTrace stability table as,

             | Name       Data       Class
   ----------+------------------------------
   Provider  | Evolving   Evolving   Common
   Module    | Private    Private    Unknown
   Function  | Private    Private    Unknown
   Name      | Evolving   Evolving   Common
   Arguments | Unstable   Unstable   Common

with Arguments becoming "Evolving" in the near future.

The aim of the provider is to be a stable end-user interface, with the
send/receive probes tracing activity between IP and the device driver
framework, or itself (loopback). So for send probes, whatever IP sends
down is traced; for receive probes, whatever IP receives is traced.

For MDT, it should be able to trace the individual IP packets as
ip:::send, similar to what sdt::tcp_multisend:ip4-physical-out-start
does when hooks are on.

For LSO, what IP sends down can be 64 Kbyte packets - which is what the
IP provider will trace as a 64 Kbyte ip:::send. This is in line with the
role of the IP provider: to trace what was sent and received to the
device driver layers.

Are 64 Kbyte sends confusing enough that they should really be called
ip:::lso-send? ... I don't think so. What about a future hypothetical
NIC performance feature that accepts packets in some manner much
stranger than LSO? That may indeed be a case for a new probe,
ip:::mumblefoo-send, or whatever.

Brendan

-- 
Brendan [CA, USA]
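A quick way to see those 64 Kbyte sends, as an untested one-liner sketch
assuming the take-2 argument order (where args[1] is the proposed
ipinfo_t):

    # dtrace -n 'ip:::send { @bytes = quantize(args[1]->ip_plength); }'

An LSO-heavy workload would then show up as counts in the upper buckets of
the payload-size distribution, without needing an lso-send probe name.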
Brendan Gregg - Sun Microsystems
2007-Jul-03 01:02 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
G'Day Darren,

On Fri, Jun 29, 2007 at 05:27:01PM -0700, Darren.Reed at sun.com wrote:
> Brendan Gregg - Sun Microsystems wrote:
>> [...]
>> Has anyone noticed a problem already? There are *already* more DTrace
>> probes in TCP/IP than what I'm suggesting to add. For example, the
>> following three commands were run at the same time,
>> [...]
>
> As Garrett alluded to, the situation that we're most concerned about
> with performance is not just local delivery but packet forwarding,
> where there are various efforts underway to try and reduce the size
> of the code path. I'm sure that the length of the code path to get
> data to/from telnet leans heavily towards delivery into TCP.

Rightio - sounds like testing the forwarding code path performance
should also be a standard test (such as in an automated perf PIT).

[...]

>> Thoughts:
>>
>> Each probe addition adds 1 to 5 nops, and a few movs of recently
>> accessed data which we would expect to still be cached. [...]
>>
>> In summary, the execution time of the probes is negligible when
>> considering the per-CPU packet rates; the i-cache effect looks
>> negligible, and from the tests was more affected by code layout than
>> the existence or non-existence of probes.
>
> I'm not really worried about the data cache, just the i-cache and
> the CPU pipeline. Throwing in some NOPs and otherwise redundant
> instructions every now and then doesn't seem like too big of a sin,
> but filling an entire i-cache row or two with NOPs seems... wasteful.
> [...] This may sound silly, but I'd have less of an
> issue about this if the same probes were further away from each
> other :) However, this really is micro-optimisation.

3 DTrace probes next to each other is certainly questionable. I've
recently dropped one set of the IP probes so that it is now 2, and put a
webrev of this with just the send/receive probes at the end of,

    http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2/

I wasn't certain that having both ip::: and ipv4/ipv6::: providers was
the right way to go, and put the webrev and website out to see how it
looked. I've also been writing scripts using either, to see how they
felt. The ip::: provider is more in the DTrace mould (compare with the
io::: provider), and so I've dropped the ipv4/ipv6 providers from the
recent webrev (ip::: now has identical functionality).

[...]

>> I'd suggest this strategy.
>>
>> 1) For development/troubleshooting probes, use fbt.
>>
>> fbt is free in terms of non-enabled cost. In fact, many of those sdt
>> probes could be served by fbt. eg,
>> [... ip_xmit_v4() source as quoted earlier ...]
>>
>> 2) Use sdt if fbt fails in some way.
>>
>> 3) Add public and stable provider probes if you really must, such as
>> for an end-user interface.
>
> I agree with this.
> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
> nice if you could present that as fbt::ip-xmit-v4:entry but
> pass in ire->ire_nce as arg2 instead of ire?

I would have thought that the existing (ire_t *)ire as args[2] was
already ideal, as users could refer to args[2]->ire_nce as needed. It
might be a different story if the kernel wasn't CTF'd and these weren't
already cast.

[...]

>> We can't - DTrace doesn't support aliasing like that.
>
> Can I have this added to the list of new things I'd like to
> see dtrace be able to support, along with the DTRACE_IF? :)
>
> I don't know how this would look, maybe:
>
>     DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
>     DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
>     DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>     DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
>
> or maybe something more complex, so that you could create a hook
> alias that included just the mblk_t and the ill_t but still only
> have one hook in the code path.
>
> The general story is that we have a collection of values that we want
> to export through dtrace, and we'd like to have a way to see them
> using different filters (or views).
>
> If dtrace can handle multiple people using the same probe, then
> technically it should be able to cope with multiple people
> using the same probe but with different aliases to present the
> values caught by the dtrace hook in different ways.

Cool idea. I don't think there is a need to do this right now based on
performance testing so far, but it is better to have options like this
to try than not to. And both DTRACE_IF and DTRACE_VIEW are options that
could be added later without changing the exported stable provider
interface.

It would take some moderate work to get this done; and when complete,
there would still be more non-enabled probe overhead in other places -
like the mib probes - that wouldn't benefit from this approach.

> Another thought: can these dtrace probe views be defined with
> dtrace(1M) rather than in the kernel?
>
> This would mean that "dtrace -l" might not provide an exhaustive
> list of all the probes available (if it only consults the kernel).
>
> So, if you wanted to get the ip-send dtrace probe, maybe in your
> dtrace script you would include some special library that knows
> to use ip4-physical-out-end (with extra args) and then to
> apply some sort of condition and change the types of the args.
>
> So my dtrace script might be:
>
>     #!/usr/sbin/dtrace -Fs
>
>     include <provider/ip>
>
>     ip:::ip-send
>     {
>         ...
>     }
>
> If dtrace could be evolved to do that, then we could define base
> probes, such as ip4-physical-out-start, and layer new probes on top
> of them without having to deliver any new kernel modules.

Yes, that would be another line of attack.

...

There might be an easier way to drop the duplicate probe overhead, which
currently looks like,

 14223                 DTRACE_PROBE4(ip4__physical__out__start,
 14224                     ill_t *, NULL, ill_t *, stq_ill,
 14225                     ipha_t *, ipha, mblk_t *, mp);
 14226                 FW_HOOKS(ipst->ips_ip4_physical_out_event,
 14227                     ipst->ips_ipv4firewall_physical_out,
 14228                     NULL, stq_ill, ipha, mp, mpip, ipst);
 14229                 DTRACE_PROBE1(ip4__physical__out__end, mblk_t *,
 14230                     mp);
 14231                 if (mp == NULL)
 14232                         goto drop;
 14233
 14234                 DTRACE_IP5(send, mblk_t *, mp, void_ip_t *, ipha,
 14235                     ill_t *, stq_ill, ipha_t *, ipha, ip6_t *, NULL);

Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
anyway?

FW_HOOKS wraps hook_run() if hooks are enabled, and hook_run() is
traceable at zero cost from fbt. Those sdt::: probes provide interface
info, the IP header and the raw mblk pointer. So you could write scripts
like,

 # ./fwhooks01.d
   EVENT                     *MP  IF-IN IF-OUT             FROM               TO
 > PHYSICAL_OUT ffffff6137dde4e0      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffff6137dde4e0      0      2    192.168.1.109    192.168.1.198
 > PHYSICAL_OUT fffffffedad37220      0      2    192.168.1.109    192.168.1.108
 < PHYSICAL_OUT                0      0      2    192.168.1.109    192.168.1.108
 > PHYSICAL_OUT ffffff0e2ec32620      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffff0e2ec32620      0      2    192.168.1.109    192.168.1.198
 > PHYSICAL_IN  ffffff0e2ec32620      2      0    192.168.1.198    192.168.1.109
 < PHYSICAL_IN  ffffff0e2ec32620      2      0    192.168.1.198    192.168.1.109
 > PHYSICAL_OUT ffffffff9ba996c0      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffffff9ba996c0      0      2    192.168.1.109    192.168.1.198
 [...]

(That zero *mp was for a blocked packet.) The above script was written
using fbt at zero cost, not sdt!

 root at deimos:/root> cat fwhooks01.d
 #!/usr/sbin/dtrace -s

 #pragma D option quiet
 #pragma D option switchrate=10

 dtrace:::BEGIN
 {
 	printf(" %-12s %16s %6s %6s %16s %16s\n",
 	    "EVENT", "*MP", "IF-IN", "IF-OUT", "FROM", "TO");
 }

 fbt::hook_run:entry
 {
 	self->info = (hook_pkt_event_t *)arg1;
 	self->name = stringof(args[0]->hei_event->he_name);
 	this->ipha = (ipha_t *)self->info->hpe_hdr;
 	self->saddr = inet_ntoa(&this->ipha->ipha_src);
 	self->daddr = inet_ntoa(&this->ipha->ipha_dst);
 	printf("> %-12s %16x %6d %6d %16s %16s\n", self->name,
 	    (uint64_t)*self->info->hpe_mp, self->info->hpe_ifp,
 	    self->info->hpe_ofp, self->saddr, self->daddr);
 }

 fbt::hook_run:return
 /self->info/
 {
 	printf("< %-12s %16x %6d %6d %16s %16s\n", self->name,
 	    (uint64_t)*self->info->hpe_mp, self->info->hpe_ifp,
 	    self->info->hpe_ofp, self->saddr, self->daddr);
 	self->info = 0;
 	self->name = 0;
 	self->saddr = 0;
 	self->daddr = 0;
 }

Woah, cool huh! :-) (BTW, that script needs a recent build so that
inet_ntoa() exists.)

Any specific reason why those sdt probes were added? ... such as a
script that needed to be written? I can check if fbt can be used
instead.

I'm not saying that sdt:::ip4-physical-out-start/end *need* to be
dropped, just that it looks like they could be served by zero-cost fbt
probes, and so dropped if a performance problem is shown to exist. And
that might be an easier option than building a new DTrace framework. ;)

cheers,

Brendan

-- 
Brendan [CA, USA]
Darren.Reed at Sun.COM
2007-Jul-03 02:33 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 29, 2007 at 05:27:01PM -0700, Darren.Reed at sun.com wrote:
>> [...]
>> I'm not really worried about the data cache, just the i-cache and
>> the CPU pipeline. [...] This may sound silly, but I'd have less of an
>> issue about this if the same probes were further away from each
>> other :) However, this really is micro-optimisation.
>
> 3 DTrace probes next to each other is certainly questionable. I've
> recently dropped one set of the IP probes so that it is now 2, and put a
> webrev of this with just the send/receive probes at the end of,
>
>     http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2/
>
> I wasn't certain that having both ip::: and ipv4/ipv6::: providers was
> the right way to go, and put the webrev and website out to see how it
> looked. I've also been writing scripts using either, to see how they
> felt. The ip::: provider is more in the DTrace mould (compare with the
> io::: provider), and so I've dropped the ipv4/ipv6 providers from the
> recent webrev (ip::: now has identical functionality).
>
> [...]
>>> 1) For development/troubleshooting probes, use fbt.
>>> ...
>>
>> I agree with this.
>> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
>> nice if you could present that as fbt::ip-xmit-v4:entry but
>> pass in ire->ire_nce as arg2 instead of ire?
>
> I would have thought that the existing (ire_t *)ire as args[2] was
> already ideal, as users could refer to args[2]->ire_nce as needed. It
> might be a different story if the kernel wasn't CTF'd and these weren't
> already cast.

Indeed, and I suspect that this hook is simply "ease of use" in
debugging, and could be a candidate for removing..

>>> We can't - DTrace doesn't support aliasing like that.
>>
>> Can I have this added to the list of new things I'd like to
>> see dtrace be able to support, along with the DTRACE_IF? :)
>>
>> I don't know how this would look, maybe:
>>
>>     DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
>>     DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
>>     DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>>     DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
>>
>> [...]
>
> Cool idea. I don't think there is a need to do this right now based on
> performance testing so far, but it is better to have options like this
> to try than not to. And both DTRACE_IF and DTRACE_VIEW are options that
> could be added later without changing the exported stable provider
> interface.

Yes :)

> It would take some moderate work to get this done; and when complete,
> there would still be more non-enabled probe overhead in other places -
> like the mib probes - that wouldn't benefit from this approach.

Understood. Should I RFE this, or is there some other virtual whiteboard
on opensolaris.org for new things to add to dtrace?

>> Another thought: can these dtrace probe views be defined with
>> dtrace(1M) rather than in the kernel?
>>
>> [...]
>>
>> If dtrace could be evolved to do that, then we could define base
>> probes, such as ip4-physical-out-start, and layer new probes on top
>> of them without having to deliver any new kernel modules.
>
> Yes, that would be another line of attack.
>
> ...
>
> There might be an easier way to drop the duplicate probe overhead, which
> currently looks like,
>
> [... ip.c lines 14223-14235 as quoted earlier ...]
>
> Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
> anyway?

For -start, the point is to have a dtrace probe that is particular to
the location of the hook in the stack relative to packet processing.

For -end, it is useful to see the mp after the FW_HOOKS() is complete.
Is it NULL? Has it been changed? Both of these are possible here. Or in
other words, the complexity of using dtrace to achieve something
meaningful is why there are two hooks.

Plus, none of the standard fbt things will show us what happens to mp on
the successful return of the call to hook_run.

In the scripts that you added, yes, you've achieved the same thing, but
it's not a trivial exercise. You need to (a) know that you can work this
way with dtrace and (b) know what's inside the args. Recommending that
people take that approach isn't something we should be doing a lot of if
they end up using interfaces that aren't stable.

Or to put it another way, if it was possible to use probes that
presented just as simple an interface but implemented it using other
means, then sure, the probes could be eliminated.

Or to use the syntax from above, I would be just as happy if I could
write a probe like this:

    dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'

where "-I" is an "include" function, and dtrace somehow built the
ip4-physical-out-end using fbt probes like you did; why do I care how
the probe is implemented? A consideration that needs to be taken into
account is the performance cost of this approach.

A danger in relying on fbt is that it only works when the function call
works. So, for example, if ip4-physical-out-start/end were made to
depend on fbt::hook_run, and that function doesn't get called, then
there's no physical-out-start/end. While the -end probe is only really
useful in the presence of hook_run being called, the other is not.

Darren
Brendan Gregg - Sun Microsystems
2007-Jul-03 03:42 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Mon, Jul 02, 2007 at 07:33:45PM -0700, Darren.Reed at Sun.COM wrote:
[...]
>>> I agree with this.
>>> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
>>> nice if you could present that as fbt::ip-xmit-v4:entry but
>>> pass in ire->ire_nce as arg2 instead of ire?
>>
>> I would have thought that the existing (ire_t *)ire as args[2] was
>> already ideal, as users could refer to args[2]->ire_nce as needed. It
>> might be a different story if the kernel wasn't CTF'd and these
>> weren't already cast.
>
> Indeed, and I suspect that this hook is simply "ease of use"
> in debugging, and could be a candidate for removing..

Yes; and an advantage of private sdt probes is that they could be added
to support a particular performance project and then removed later when
no longer needed...

>>> DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>>> DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
[...]
>> It would take some moderate work to get this done; and when complete,
>> there would still be more non-enabled probe overhead in other places -
>> like the mib probes - that wouldn't benefit from this approach.
>
> Understood. Should I RFE this, or is there some other virtual
> whiteboard on opensolaris.org for new things to add to dtrace?

AFAIK, the DTrace todo list is the RFE list from bugster.

[...]
>> There might be an easier way to drop the duplicate probe overhead,
>> which currently looks like,
>>
>> [... ip.c lines 14223-14235 as quoted earlier ...]
>>
>> Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
>> anyway?
>
> For -start, the point is to have a dtrace probe that is particular to
> the location of the hook in the stack relative to packet processing.

That sounds doable from fbt,

 # dtrace -n 'fbt::hook_run:entry { trace(stringof(args[0]->hei_event->he_name)); stack(); }'
 dtrace: description 'fbt::hook_run:entry ' matched 1 probe
 CPU     ID                    FUNCTION:NAME
   0  73492                   hook_run:entry   PHYSICAL_OUT
               ip`tcp_send_data+0x78e
               ip`tcp_output+0x78f
               ip`squeue_enter+0x41a
               ip`tcp_wput+0xf8
               unix`putnext+0x2f1
               rpcmod`mir_wput+0x1aa
               rpcmod`rmm_wput+0x1e
               unix`put+0x270
               rpcmod`clnt_dispatch_send+0x11a
               rpcmod`clnt_cots_kcallit+0x596
               nfs`nfs4_rfscall+0x4e2
               nfs`rfs4call+0x102
               nfs`nfs4lookupvalidate_otw+0x2d5
               nfs`nfs4lookup+0x212
               nfs`nfs4_lookup+0xe5
               genunix`fop_lookup+0x53
               genunix`lookuppnvp+0x2e5
               genunix`lookuppnat+0x125
               genunix`lookupnameat+0x82
               genunix`cstatat_getvp+0x160
 [...]

> For -end, it is useful to see the mp after the FW_HOOKS() is complete.
> Is it NULL? Has it been changed? Both of these are possible here.
> Or in other words, the complexity of using dtrace to achieve something
> meaningful is why there are two hooks.

A minute reading FW_HOOKS() shows that it will set mp to NULL if
hook_run() returns non-zero, which fbt can trace.

Yes, there are some really complex associations and mental leaps that
must be made with fbt to solve some tracing issues, where it would be
useful to use sdt instead. It doesn't look like this is one of them.

> Plus, none of the standard fbt things will show us what happens to mp
> on the successful return of the call to hook_run.

This is also doable from fbt, although it takes a few lines of code,

 fbt::hook_run:entry
 {
 	self->info = (hook_pkt_event_t *)arg1;
 }

 fbt::hook_run:return
 /self->info/
 {
 	this->mp = arg1 != 0 ? NULL : (uint64_t)*self->info->hpe_mp;
 	printf("mp: %x", this->mp);
 	self->info = 0;
 }

The above shows the value of mp after FW_HOOKS(), and accounts for both
hook_run() and FW_HOOKS() itself changing mp.

> In the scripts that you added, yes, you've achieved the same thing,
> but it's not a trivial exercise. You need to (a) know that you can work
> this way with dtrace and (b) know what's inside the args. Recommending
> that people take that approach isn't something we should be doing a lot
> of if they end up using interfaces that aren't stable.

Those probes export mblk_t's and ill_t's around hook calls, so whoever
is interested in using them must have some familiarity with kernel code.

> Or to put it another way, if it was possible to use probes that
> presented just as simple an interface but implemented it using other
> means, then sure, the probes could be eliminated.

Who is this for? Who would be working on this code and not have read the
contents of FW_HOOKS()?

> Or to use the syntax from above, I would be just as happy if I could
> write a probe like this:
>
>     dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'
>
> where "-I" is an "include" function, and dtrace somehow built the
> ip4-physical-out-end using fbt probes like you did; why do I care how
> the probe is implemented? A consideration that needs to be taken into
> account is the performance cost of this approach.

DTrace already has some functionality that could be used; people can
build custom /usr/lib/dtrace include files containing translators,
inlines and #defines to make life easier for tracing certain targets.

It might make sense to add a little functionality, such as the provider
aliasing you suggested, to tie this together as a user-customised view
of fbt - especially for fbt probes that are complex to use.

Of course, anyone writing these include files will need to maintain them
as the kernel changes.

> A danger in relying on fbt is that it only works when the function call
> works. So, for example, if ip4-physical-out-start/end were made to
> depend on fbt::hook_run, and that function doesn't get called, then
> there's no physical-out-start/end. While the -end probe is only really
> useful in the presence of hook_run being called, the other is not.

Yes, that is a difference between fbt and those sdt probes. You can at
least detect this situation using dtrace (the lack of hook_run() calls
from certain functions).

cheers,

Brendan

-- 
Brendan [CA, USA]
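As a rough, untested sketch of what such an include file could contain -
the file, type and member names here are all invented, and only the
hooks-framework types used in fwhooks01.d above are assumed:

    /*
     * pfh.d: hypothetical D library sketch; place in a directory and
     * load with dtrace -L <dir>. May also need a depends_on pragma.
     */
    typedef struct pfhinfo {
    	uint64_t pfh_mp;	/* current mblk pointer (0 if dropped) */
    	int pfh_ifp;		/* inbound interface */
    	int pfh_ofp;		/* outbound interface */
    } pfhinfo_t;

    translator pfhinfo_t < hook_pkt_event_t *H > {
    	pfh_mp = (uint64_t)*H->hpe_mp;
    	pfh_ifp = (int)H->hpe_ifp;
    	pfh_ofp = (int)H->hpe_ofp;
    };

A script would then use it as, e.g.,

    fbt::hook_run:entry
    {
    	printf("in %d out %d\n",
    	    xlate <pfhinfo_t *> ((hook_pkt_event_t *)arg1)->pfh_ifp,
    	    xlate <pfhinfo_t *> ((hook_pkt_event_t *)arg1)->pfh_ofp);
    }

which hides the casts, but not the probe name - that is the piece that
would need the aliasing support discussed above.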
Darren.Reed at Sun.COM
2007-Jul-04 03:41 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

>[...]
>
>>>There might be an easier way to drop the duplicate probe overhead, which
>>>currently looks like,
>>>
>>>	DTRACE_PROBE4(ip4__physical__out__start,
>>>	    ill_t *, NULL, ill_t *, stq_ill,
>>>	    ipha_t *, ipha, mblk_t *, mp);
>>>	FW_HOOKS(ipst->ips_ip4_physical_out_event,
>>>	    ipst->ips_ipv4firewall_physical_out,
>>>	    NULL, stq_ill, ipha, mp, mpip, ipst);
>>>	DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
>>>	if (mp == NULL)
>>>		goto drop;
>>>
>>>	DTRACE_IP5(send, mblk_t *, mp, void_ip_t *, ipha,
>>>	    ill_t *, stq_ill, ipha_t *, ipha, ip6_t *, NULL);
>>>
>>>Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
>>>anyway?
>>
>>For -start, the point is to have a dtrace probe that is particular to the
>>location of the hook in the stack relative to packet processing.
>
>That sounds doable from fbt,
>
> # dtrace -n 'fbt::hook_run:entry { trace(stringof(args[0]->hei_event->he_name)); stack(); }'
> dtrace: description 'fbt::hook_run:entry ' matched 1 probe
> CPU     ID                    FUNCTION:NAME
>   0  73492                   hook_run:entry   PHYSICAL_OUT
>              ip`tcp_send_data+0x78e
>              ip`tcp_output+0x78f
>              ip`squeue_enter+0x41a
>              ip`tcp_wput+0xf8
>              unix`putnext+0x2f1
>              rpcmod`mir_wput+0x1aa
>              rpcmod`rmm_wput+0x1e
>              unix`put+0x270
>              rpcmod`clnt_dispatch_send+0x11a
>              rpcmod`clnt_cots_kcallit+0x596
>              nfs`nfs4_rfscall+0x4e2
>              nfs`rfs4call+0x102
>              nfs`nfs4lookupvalidate_otw+0x2d5
>              nfs`nfs4lookup+0x212
>              nfs`nfs4_lookup+0xe5
>              genunix`fop_lookup+0x53
>              genunix`lookuppnvp+0x2e5
>              genunix`lookuppnat+0x125
>              genunix`lookupnameat+0x82
>              genunix`cstatat_getvp+0x160
>              [...]

When it is possible to write a dtrace script that uses a probe named
"ip4-physical-out-end" and there is no requirement for me to know about
fbt, all will be good.

>>For -end, it is useful to see the mp after the FW_HOOKS() is complete.
>>Is it NULL? Has it been changed? Both of these are possible here.
>>Or in other words, the complexity of using dtrace to achieve something
>>meaningful is why there are two hooks.
>
>A minute spent reading FW_HOOKS() shows that it will set mp to NULL and
>complete if hook_run() returns zero, which fbt can trace.

So long as hook_run() is being used in this context. If hook_run() is ever
used outside of networking (and there's no reason why it couldn't) then
this becomes harder - more matching is required in the fbt entry probes,
and saving of parameters too.

The use of "?" in D to achieve "if-else" assignments is not obvious at
first, and the lack of "if-else" is the biggest obstacle I run into with D.

>>Plus, none of the standard fbt things will show us what happens to mp on
>>the successful return of the call to hook_run.
>
>This is also doable from fbt, although a few lines of code,
>
> fbt::hook_run:entry
> {
>         self->info = (hook_pkt_event_t *)arg1;
> }
>
> fbt::hook_run:return
> /self->info/
> {
>         this->mp = arg1 != 0 ? 0 : (uint64_t)*self->info->hpe_mp;
>         printf("mp: %x", this->mp);
>         self->info = 0;
> }
>
>The above shows the value of mp after FW_HOOKS(), and accounts for both
>hook_run() and FW_HOOKS() itself changing mp.

But it requires me to use 2 probes, not 1, and in a more complex method
than simply doing:

 dtrace -n 'sdt:::ip4-physical-out-end{printf("mp: %x", this->mp);}'

Being able to put it all on one command line has its advantages ;)

>...
>
>>Or to put it another way, if it was possible to use a probe that
>>presented just as simple an interface but implemented it using
>>other means then sure, the probes could be eliminated.
>
>Who is this for? Who would be working on this code and not have read the
>contents of FW_HOOKS()?

Using the existing probes doesn't require someone to read the FW_HOOKS
macro, today. If I write up a blog or something else about these SDT
probes and mention what the parameters are, that should be enough for
someone to go and use them for something.

I'd also reject the assertion that using SDT probes implies that someone
is "working on code."

>>Or to use the syntax from above, I would be just as happy if I could
>>write a probe like this:
>>
>>dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'
>>
>>where "-I" is an "include" function, and dtrace somehow built the
>>ip4-physical-out-end using fbt probes like you did; why do I care
>>how the probe is implemented? A consideration that needs to be
>>taken into account is the performance cost of this approach.
>
>DTrace already has some functionality that could be used; people can
>build custom /usr/lib/dtrace include files containing translators,
>inlines and #defines to make life easier for tracing certain targets.

Hmm, and I see you're using translators to build up the data structures
for your provider... I've never looked at a translator before - the
syntax seems to borrow from C++ - I'm sure that will make instant enemies
of some ;) They look useful, especially for networking...

>It might make sense to add a little functionality, such as provider
>aliasing as you suggested, to tie this together as a user customised
>view of fbt. Especially for fbt probes that are complex to use.
>
>Of course, anyone writing these include files will need to maintain them
>as the kernel changes.

Of course.

Darren
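(For reference, translator syntax in a D library file looks roughly like
the sketch below. Illustrative only: this is not the actual provider
code; the typedef simply mirrors the proposed ipinfo_t members, and the
byte-order and address-to-string handling are elided.)

    typedef struct ipinfo {
            int ip_protocol;            /* protocol */
            uint32_t ip_plength;        /* payload length */
            string ip_saddr;            /* source address */
            string ip_daddr;            /* destination address */
    } ipinfo_t;

    /* map the raw IPv4 header onto the stable structure */
    translator ipinfo_t < ipha_t *I > {
            ip_protocol = 4;
            ip_plength = I->ipha_length;    /* byte-order conversion elided */
            ip_saddr = "";                  /* address formatting elided */
            ip_daddr = "";
    };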
Adam Leventhal
2007-Jul-04 06:59 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Tue, Jul 03, 2007 at 08:41:26PM -0700, Darren.Reed at Sun.COM wrote:
> When it is possible to write a dtrace script that uses a probe named
> "ip4-physical-out-end" and there is no requirement for me to know
> about fbt, all will be good.

That's only a reasonable goal if we expect customers to gather actionable
data from probe points of that nature. If such a probe's only consumer were
a developer of the IP stack, well, fbt is really the right tool.

> The use of "?" in D to achieve "if-else" assignments is not obvious at
> first, and the lack of "if-else" is the biggest obstacle I run into with D.

Was the ?: operator unfamiliar to you? We chose to implement that -- obviously
-- because it was used in C (and all C-derivatives I know of). Perhaps you
should add an entry to the wiki for people unfamiliar with the construct.

> > fbt::hook_run:entry
> > {
> >         self->info = (hook_pkt_event_t *)arg1;
> > }
> >
> > fbt::hook_run:return
> > /self->info/
> > {
> >         this->mp = arg1 != 0 ? 0 : (uint64_t)*self->info->hpe_mp;
> >         printf("mp: %x", this->mp);
> >         self->info = 0;
> > }
> >
> > The above shows the value of mp after FW_HOOKS(), and accounts for both
> > hook_run() and FW_HOOKS() itself changing mp.
>
> But it requires me to use 2 probes, not 1, and in a more complex
> method than simply doing:
>
>  dtrace -n 'sdt:::ip4-physical-out-end{printf("mp: %x", this->mp);}'
>
> Being able to put it all on one command line has its advantages ;)

The point of SDT is not to make it easier to fit your D program in 80
columns, and the script that Brendan quoted is the kind of thing that
users of DTrace do on the command line all the time.

The fbt provider has an important place in the pantheon of DTrace
providers and is not meant to be obviated by SDT. Rather, SDT is meant
to expose points of semantic relevance and stability.

> I'd also reject the assertion that using SDT probes implies that
> someone is "working on code."

Well, as implemented the stability is private, so I'm not sure how a
customer could reasonably be expected to use or rely on the probes.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
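(As a sketch of that end-user style: assuming the proposed take-2
argument layout, where args[1] is the ipinfo_t, a hypothetical one-liner
could tally IP payload bytes sent by destination address.)

    # dtrace -n 'ip:::send { @out[args[1]->ip_daddr] = sum(args[1]->ip_plength); }'

(Such a script depends only on the documented provider arguments, not on
kernel internals, which is what lets it survive implementation changes.)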
Darren.Reed at Sun.COM
2007-Jul-05 17:52 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Adam Leventhal wrote:

>On Tue, Jul 03, 2007 at 08:41:26PM -0700, Darren.Reed at Sun.COM wrote:
>
>>When it is possible to write a dtrace script that uses a probe named
>>"ip4-physical-out-end" and there is no requirement for me to know
>>about fbt, all will be good.
>
>That's only a reasonable goal if we expect customers to gather actionable
>data from probe points of that nature. If such a probe's only consumer were
>a developer of the IP stack, well, fbt is really the right tool.

Are you saying that developers shouldn't add sdt probes and should only
use fbt probes?

I suppose where this thread of conversation may go is that with
ip4-physical-out-start/end, there's no real need to have a "start/end";
just having "ip4-physical-out" is enough, and the start/end can be
derived from using fbt::hook_run. I'm in two minds about that, primarily
because I'm not 100% comfortable with the idea of needing to store
something on :::entry in order to look at it with :::return and see what
it is after the return.

While we may well hide behind the veil of "sdt probes are private", in
the great big world of open source, this is just silly. Everyone can see
they're there and how they can or cannot use them. If someone who isn't
a developer wanted to use dtrace to collect inbound packet statistics
and found sdt:::ip4-physical-in-start to be useful, who are we to say
they can't, if it serves their needs?

Anyway, that's a whole other digression.

>>The use of "?" in D to achieve "if-else" assignments is not obvious at
>>first, and the lack of "if-else" is the biggest obstacle I run into with D.
>
>Was the ?: operator unfamiliar to you? We chose to implement that -- obviously
>-- because it was used in C (and all C-derivatives I know of). Perhaps you
>should add an entry to the wiki for people unfamiliar with the construct.

Most of the time I think in terms of "if-else", as I regard the ?:
operator as something people use to be obscure, to save lines of code,
etc., and I try not to use it myself. Every time dtrace gives me errors
saying it doesn't understand if statements, I have to remember to do the
matching with /.../ predicates instead. D generally requires a different
mode of thinking when solving a programming problem, and I'm not always
in tune with that.

Darren
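(For reference, the two idioms express the same logic. A minimal
illustrative sketch, using hook_run()'s return value as the condition:
first the ?: assignment form, then the equivalent clauses matched with
/.../ predicates.)

    /* ?: form: a conditional assignment within one clause */
    fbt::hook_run:return
    {
            this->failed = (arg1 != 0) ? 1 : 0;
    }

    /* predicate form: one clause per case, matched with /.../ */
    fbt::hook_run:return
    /arg1 != 0/
    {
            @returns["non-zero"] = count();
    }

    fbt::hook_run:return
    /arg1 == 0/
    {
            @returns["zero"] = count();
    }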
Peter Memishian
2007-Jul-05 18:07 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
> While we may well hide behind the veil of "sdt probes are private",
> in the great big world of open source, this is just silly.

Please, not this again. Just because it's open-source doesn't mean that
everyone is allowed to go mucking around with everyone else's private parts.

--
meem
Bryan Cantrill
2007-Jul-05 18:09 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
> >>When it is possible to write a dtrace script that uses a probe named
> >>"ip4-physical-out-end" and there is no requirement for me to know
> >>about fbt, all will be good.
> >
> >That's only a reasonable goal if we expect customers to gather actionable
> >data from probe points of that nature. If such a probe's only consumer were
> >a developer of the IP stack, well, fbt is really the right tool.
>
> Are you saying that developers shouldn't add sdt probes and should
> only use fbt probes?

No, he's not -- Adam's point is that FBT can do quite a bit for you
without having to add SDT probes.

[ ... ]

> While we may well hide behind the veil of "sdt probes are private",
> in the great big world of open source, this is just silly. Everyone
> can see they're there and how they can or cannot use them. If someone
> who isn't a developer wanted to use dtrace to collect inbound packet
> statistics and found sdt:::ip4-physical-in-start to be useful, who are
> we to say they can't, if it serves their needs?

Yes, that's _exactly_ why we added a programmatic notion of stability.
So yes, you should knock yourself out adding Private SDT providers -- they
don't even require an ARC case. And if they serve a customer's needs,
great -- because you will have declared them to be Private, a customer
that cares about the stability of their D scripts will know that their
script can be broken by any twitch of the operating system.

That said, there is great value in adding stable (or in this case, Evolving)
SDT providers -- they allow customers to build those stable scripts that
have been so useful in things like the DTraceToolkit.

So one can envision a spectrum of usage of DTrace, with Solaris developers
on one end and end-users on the other: FBT provides value primarily at the
developer end, Stable/Evolving SDT providers provide value primarily at
the end-user end, and Private SDT providers fall in between. So these
are not mutually exclusive -- it's a question of the audience, and the
audience for Brendan's work is very much the end-user.

        - Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
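(That programmatic notion of stability is visible to users: dtrace's -v
option prints a stability report for a probe. For example, against the
existing stable io provider:

    # dtrace -lvn io:::start

lists the probe along with its argument types and stability attributes,
so a script author can see whether they are depending on something
Private, Evolving or Stable before they write the script.)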
Darren.Reed at Sun.COM
2007-Jul-05 18:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Bryan Cantrill wrote:

> ...
>
>>While we may well hide behind the veil of "sdt probes are private",
>>in the great big world of open source, this is just silly. Everyone
>>can see they're there and how they can or cannot use them. If someone
>>who isn't a developer wanted to use dtrace to collect inbound packet
>>statistics and found sdt:::ip4-physical-in-start to be useful, who are
>>we to say they can't, if it serves their needs?
>
>Yes, that's _exactly_ why we added a programmatic notion of stability.
>So yes, you should knock yourself out adding Private SDT providers -- they
>don't even require an ARC case. And if they serve a customer's needs,
>great -- because you will have declared them to be Private, a customer
>that cares about the stability of their D scripts will know that their
>script can be broken by any twitch of the operating system.
>
>That said, there is great value in adding stable (or in this case, Evolving)
>SDT providers -- they allow customers to build those stable scripts that
>have been so useful in things like the DTraceToolkit.
>
>So one can envision a spectrum of usage of DTrace, with Solaris developers
>on one end and end-users on the other: FBT provides value primarily at the
>developer end, Stable/Evolving SDT providers provide value primarily at
>the end-user end, and Private SDT providers fall in between. So these
>are not mutually exclusive -- it's a question of the audience, and the
>audience for Brendan's work is very much the end-user.

Adam's email suggests that SDT probes are only ever going to be private,
but you're suggesting otherwise, and those that Brendan is working on
aren't SDT, if I recall correctly. Who's right here?

While I don't actually envisage wanting to promote an SDT probe, at this
point, to anything above private, I'd just like to know if there is some
subtlety to the difference between what you've said above and what Adam
was saying that I'm missing.

Darren
Adam Leventhal
2007-Jul-05 22:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Thu, Jul 05, 2007 at 11:24:47AM -0700, Darren.Reed at Sun.COM wrote:
> Adam's email suggests that SDT probes are only ever going to be
> private, but you're suggesting otherwise, and those that Brendan
> is working on aren't SDT, if I recall correctly. Who's right here?
>
> While I don't actually envisage wanting to promote an SDT probe,
> at this point, to anything above private, I'd just like to know if there
> is some subtlety to the difference between what you've said above
> and what Adam was saying that I'm missing.

I think this is an instance where it would be beneficial to RTFM. A better
understanding of DTrace and its providers would help you answer your
questions.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Darren.Reed at Sun.COM
2007-Jul-06 23:58 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Adam Leventhal wrote:

>On Thu, Jul 05, 2007 at 11:24:47AM -0700, Darren.Reed at Sun.COM wrote:
>
>>Adam's email suggests that SDT probes are only ever going to be
>>private, but you're suggesting otherwise, and those that Brendan
>>is working on aren't SDT, if I recall correctly. Who's right here?
>>
>>While I don't actually envisage wanting to promote an SDT probe,
>>at this point, to anything above private, I'd just like to know if there
>>is some subtlety to the difference between what you've said above
>>and what Adam was saying that I'm missing.
>
>I think this is an instance where it would be beneficial to RTFM. A better
>understanding of DTrace and its providers would help you answer your
>questions.

Reading sdt(7) does clear up any misunderstandings I had.

Is sdt likely to be upgraded to a "committed" interface?

Darren
Adam Leventhal
2007-Jul-07 00:02 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Fri, Jul 06, 2007 at 04:58:45PM -0700, Darren.Reed at Sun.COM wrote:
> Reading sdt(7) does clear up any misunderstandings I had.

http://docs.sun.com/app/docs/doc/817-6223

- ahl

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
James Carlson
2007-Jul-08 20:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Peter Memishian writes:

> > While we may well hide behind the veil of "sdt probes are private",
> > in the great big world of open source, this is just silly.
>
> Please, not this again. Just because it's open-source doesn't mean that
> everyone is allowed to go mucking around with everyone else's private parts.

Indeed.

The core confusion seems to be over the word "private." It doesn't mean
"secret." It doesn't even mean that we'll somehow chase down those wayward
users and somehow stop them from using what they ought not. Instead, it
means that the original author is not *expecting* anyone to use the
interface. As he doesn't *expect* this to happen, he's not going to try
very hard (if at all) to make sure that any future changes are compatible
with what anyone else is doing.

In other words, if you use something that's private, you can get hurt by
changes down the line. If you don't care whether your application falls
apart in the future, then go ahead and use whatever you want.

It's not "secrecy," but rather rate-of-change, and having open source
doesn't change the situation for private interfaces in the slightest.

--
James Carlson, Solaris Networking          <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677