Brendan Gregg - Sun Microsystems
2007-Jun-12 03:11 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

I'm restarting work on the proposed DTrace network providers. These are
the stable providers intended for use by Solaris end users for general
observability, similar to what the "io" provider achieves for disk I/O.

I have been busy on other projects during the last 6 months, so little has
progressed since the last (long) dtrace-discuss thread with the subject
"DTrace Network Provider".

I've just reread that thread - and I'm grateful to have had so much useful
feedback so far. I've reintegrated some feedback with what I've learnt
through using the network provider during the past 6 months, and I'm
considering some new changes - which would be take 2 of the prototype
network providers.

For now I'm sticking to just discussing the internet layer protocols.

Summary

Take 1 (current):

   Probes          Args
   ip:::send       ipstateinfo_t, ipinfo_t, ipv4info_t, ipv6info_t
   ip:::receive    ipstateinfo_t, ipinfo_t, ipv4info_t, ipv6info_t

more details: http://www.opensolaris.org/os/community/dtrace/NetworkProvider

This provider has been great to use in testing. A key issue is the
ugliness of having ipv4info_t and ipv6info_t as different args; although
this hasn't been much of a problem as ipinfo_t is usually used instead.

Take 2 (proposed):

   Probes          Args
   ip:::send       cstateinfo_t, ipinfo_t
   ip:::receive    cstateinfo_t, ipinfo_t
   ipv4:::send     cstateinfo_t, ipinfo_t, ipv4info_t
   ipv4:::receive  cstateinfo_t, ipinfo_t, ipv4info_t
   ipv6:::send     cstateinfo_t, ipinfo_t, ipv6info_t
   ipv6:::receive  cstateinfo_t, ipinfo_t, ipv6info_t
   ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
   ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
   icmp:::send     cstateinfo_t, ipinfo_t, icmpinfo_t
   icmp:::receive  cstateinfo_t, ipinfo_t, icmpinfo_t
   arp:::send      cstateinfo_t, arpinfo_t
   arp:::receive   cstateinfo_t, arpinfo_t
   rarp:::send     cstateinfo_t, arpinfo_t
   rarp:::receive  cstateinfo_t, arpinfo_t
   ...

Providers such as "ipv4" and "ipv6" are for specific protocol analysis,
and their 3rd argument is an endian-correct structure of the protocol
members (DTrace translator). A pointer to the raw protocol struct will
be provided. (This assumes that "ipv4" is an allowed provider name, as
it could potentially clash with a USDT provider called "ipv" that is
asked to trace PID 4.)

Some discussion points about the new suggestions follow.

1) ipinfo_t

This provides common details from the IP protocols that we expect to
have available,

   ipinfo_t,
        ip_protocol     int       /* protocol */
        ip_plength      uint      /* payload length */
        ip_saddr        string    /* source address */
        ip_daddr        string    /* destination address */

ip_protocol could be just the IP protocol number (4 or 6), or borrow
existing definitions such as AF_INET/AF_INET6, ETHERTYPE_IP/ETHERTYPE_IPV6
or use /etc/protocols - as these may better accommodate future protocol
additions. Using the IP protocol number (4/6) would be the least
surprising choice for the end user.

2) cstateinfo_t

This provides connection state information if available,

   cstateinfo_t,
        cs_cid          uint64_t  /* connection ID (conn_t *) */
        cs_loopback     int       /* loopback state */

and may be extended to include details such as,

        cs_zoneid       int       /* zone ID */
        cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */

cs_cid isn't provided as a conn_t *, as that would expose an unstable
interface; instead it is provided as a uint64_t, as it can be useful as a
connection ID (and people can cast it as a conn_t * for raw debugging).

3) ip provider

"ip" is the internet protocol provider, which is a convenience provider
for tracing all internet protocol traffic (anything with an IP header).
It may be expanded in the future to cover future internet layer data
protocols (protocols that serve the same purpose as IPv4/IPv6).

4) ip provider & ipsec

Since "ip" is a convenience provider, we have the possibility of casting
data that is useful (for convenience) rather than what is strictly in the
protocol. For example, IPSec, where packets have the tunnel source and dest
and the actual source and dest, could be traced as follows:

   Probes    Arguments contain
   ipv4:::   tunnel source and dest
   ipv6:::   tunnel source and dest
   ipsec:::  actual source and dest, and tunnel source and dest
   ip:::     actual source and dest

(assuming this is doable - I need to check if ip::: probes can be placed
so as to not see duplicates of IPSec traffic).

This would allow the ipv4 provider to probe the IPv4 protocol, and the ip
provider to probe the actual end points of our communication.

5) ICMP

To stress the intent of these providers, an IPv4 ICMP send would fire,

   ip:::send     Generic info
   ipv4:::send   IPv4 packet info
   icmp:::send   ICMP details (type, code, ...)

perhaps snoop -V helps explain this,

   192.168.1.109 -> 192.168.1.3  ETHER Type=0800 (IP), size=98 bytes
   192.168.1.109 -> 192.168.1.3  IP  D=192.168.1.3 S=192.168.1.109 LEN=84, ID=3232, TOS=0x0, TTL=255
   192.168.1.109 -> 192.168.1.3  ICMP Echo request (ID: 20076 Sequence number: 0)

The ip and ipv4 providers will trace the "IP" line, and the icmp provider
will trace the "ICMP" line.

6) Nifty-IP (hypothetical)

In the distant future, Aliens from the Pleiades cluster make contact with
Earth. They have been searching the galaxy for a quality operating system
that they can install on their interstellar battleships, and are delighted
to have found this thing that Earth people call "Solaris". However, they
need to communicate with their own internet layer protocol, which roughly
translated into English is called "nifty-IP". Work begins to support
nifty-IP in Solaris 16, partly as a good gesture for our alien neighbours,
and partly because they agreed to pay in tons of solid gold.

The following providers would trace nifty-IP,

   ip:::        Generic info
   nifty-ip:::  nifty-ip packet info (niftyipinfo_t)

An integration difficulty with ip::: is that of matching fields in
ipinfo_t. That should work as follows,

   ip_protocol  no problem - nifty-ip gets its own protocol number
   ip_plength   no problem - nifty-ip must move some length of data
   ip_saddr     no problem - although may be unintelligible
   ip_daddr     no problem - although may be unintelligible

The assumptions are that any future internet layer protocol will move
data from address A to address B, and that data will have a measurable
length.

The cstateinfo_t members shouldn't be a problem, as they are specific to
Solaris and not the protocol.

cheers,

Brendan

--
Brendan
[CA, USA]
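To sketch the intended usage - assuming the proposed take-2 providers and
the ipinfo_t members above, which are proposals and may still change - a
bytes-by-source-address one-liner might look like:

   # dtrace -n 'ip:::send,ip:::receive
       { @bytes[args[1]->ip_saddr] = sum(args[1]->ip_plength); }'

Since ipinfo_t is shared by both address families, the same one-liner
would cover IPv4 and IPv6 traffic.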
Darren.Reed at Sun.COM
2007-Jun-12 03:52 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan,

I had a quick look through the URL you provided and one thing struck me
as very odd - none of the protocol headers (IP, TCP, UDP) expose the
checksum value. Can you explain your rationale for this? While this is
perhaps uninteresting on the inbound side, it is quite interesting on
the outbound side.

Also, can you explain where the probes for each protocol fit in relation
to the greater processing for the protocol? For example, with TCP you
mention that it is possible to see if invalid flag combinations have
been supplied - does this mean the TCP header has also not been
checksum'd?

Will tracemem() be told how to walk the packet, so I can do the below?

   tracemem(args[1]->ip_hdr, args[1]->ip_plength)

...well, that'd also require ip_hdr being added to ipinfo_t.

Adding ipv4_hl (or whatever you want to call it) would also be
beneficial for people who want to use dtrace to look for packets with IP
options in them.

For IPv6, how do you see this design evolving to include looking at
extension header data?

Darren
Brendan Gregg - Sun Microsystems
2007-Jun-12 18:28 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Mon, Jun 11, 2007 at 08:52:30PM -0700, Darren.Reed at sun.com wrote:

> Brendan,
>
> I had a quick look through the URL you provided and one thing

That URL currently describes "take 1" of the network provider proposal,
and so is out of date. The newer details (IP only) were in the email I
sent, and I'll be updating the URL based on that and the feedback I get.

> struck me as very odd - none of the protocol headers (IP, TCP,
> UDP) expose the checksum value. Can you explain your rationale
> for this? While this is perhaps uninteresting on the inbound side,
> it is quite interesting on the outbound side.

Sorry - that URL is out of date - the providers I've been using in
testing do provide checksum values.

> Also, can you explain where the probes for each protocol fit in
> relation to the greater processing for the protocol?

Sure, probes fire when packets are sent and received. The actual
location of the probe macros in the kernel TCP/IP code is based on
available information and maintainability. This means that ip probes
may be placed in tcp_send_data() and udp_send_data(), just as MIB
macros for IP statistics appear in tcp.c and udp.c.

> For example, with TCP you mention that it is possible to see if
> invalid flag combinations have been supplied - does this mean
> the TCP header has also not been checksum'd?

I just want to solve IP providers for now - I know there are
complications with the location of the TCP checksum calculation, which
I'll eventually try to solve. :)

> Will tracemem() be told how to walk the packet, so I can do the below?
>    tracemem(args[1]->ip_hdr, args[1]->ip_plength)
>
> ...well, that'd also require ip_hdr being added to ipinfo_t.

I don't think ipinfo_t should have ip_hdr; if you are interested in
protocol details, then instead of using the "ip" provider or the generic
ipinfo_t structure, you use the "ipv4" (or "ipv6") protocol specific
provider, which currently provides args[2]->ipv4_hdr as a pointer to the
raw ipha_t.

I think there are two things you may be interested in:

A) dumping the raw header + extended options.

Here is an example from the old prototype "ip" provider,

   # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr, 20); }'
   dtrace: description 'ip:::send,ip:::receive ' matched 5 probes
   CPU     ID                    FUNCTION:NAME
     0  54627                 ip_input:receive
            0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
        0: 45 20 00 98 3b 01 40 00 28 06 e0 c2 45 b5 2e b2  E ..;.@.(...E...
       10: c0 a8 01 6d                                      ...m

     0  54627                 ip_input:receive
            0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
        0: 45 00 00 58 b6 fe 40 00 3c 06 03 e1 c0 a8 01 03  E..X..@.<.......
       10: c0 a8 01 6d                                      ...m
   [...]

So far so good, now we need to replace the "20" value with
args[2]->ipv4_ihl,

   # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
       args[2]->ipv4_ihl); }'
   dtrace: invalid probe specifier ip:::send,ip:::receive {
   tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
   argument #2 must be a non-zero positive integral constant expression

hmm. DTrace is trying to help out by sanity checking args,

        if (dt_node_is_posconst(size) == 0) {
                dnerror(size, D_TRACEMEM_SIZE, "tracemem( ) argument #2 must "
                    "be a non-zero positive integral constant expression\n");
        }

I need to check if there is a sensible way around this... I can think of
a work-around using existing DTrace functionality, but it isn't pretty.
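One such work-around (a sketch only - the exact approach isn't spelled
out here, and it assumes ipv4_ihl is translated to bytes) is to
enumerate the possible header lengths, so that each clause passes
tracemem() a constant:

   /* one clause per possible IHL; ugly, but every size is a constant */
   ip:::receive /args[2]->ipv4_ihl == 20/ { tracemem(args[2]->ipv4_hdr, 20); }
   ip:::receive /args[2]->ipv4_ihl == 24/ { tracemem(args[2]->ipv4_hdr, 24); }
   /* ... and so on, up to the IPv4 maximum of 60 bytes */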
B) dumping the payload

It wasn't the initial intent to access the payload easily from DTrace,
but enough people have asked that I will check how doable this is (it
could certainly be added as a later feature).

Of course, if you can walk to the data offsets in C, you can tracemem()
them using DTrace (I assume this would need the first mblk pointer,
which would be provided in the protocol providers for such raw
debugging). So there will be at least one way to do this.

It would be nice to do this easily from DTrace, eg,

   tracemem(args[2]->ipv4_payload, args[2]->ipv4_plength)

which might turn into something like,

   tracemblk(args[2]->ipv4_baddr, args[2]->ipv4_plength)

baddr meaning buffer address, and tracemblk() a new function to walk
and dump mblks easily. I'm not yet sure which approach would be best
without trying to code it first.

> Adding ipv4_hl (or whatever you want to call it) would also be
> beneficial for people who want to use dtrace to look for packets
> with IP options in them.

Yes, the "ipv4" provider will have args[2]->ipv4_ihl.

> For IPv6, how do you see this design evolving to include looking
> at extension header data?

Extension header data would only be accessed through the ipv6 provider,
and can be added to the ipv6info_t struct.

cheers,

Brendan

--
Brendan
[CA, USA]
Peter Lawrence
2007-Jun-12 18:42 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Brendan,

I've got a couple very open ended questions. I'm not so much concerned
with the actual answers as to know that you are thinking about such
issues...

how does the info available through this provider compare with
/usr/sbin/snoop ?

will it be possible to emulate the filtering capabilities of snoop ?

how does the placement of the probes imply timing info ?

is there any way to track time between application send and ip send, and
then between ip send and NIC send ? is there any way to track time
between ip receive and application read, and also between NIC receive
and ip receive ?

(and, for my own edification, how does IPsec relate to SSL, and if they
are unrelated and I need to dtrace SSL how will I do that ?)

thanks,
-Pete Lawrence.

Brendan Gregg - Sun Microsystems wrote On 06/11/07 08:11 PM,:

> G'Day Folks,
>
> I'm restarting work on the proposed DTrace network providers. These are
> the stable providers intended for use by Solaris end users for general
> observability, similar to what the "io" provider achieves for disk I/O.
[...]
Dan McDonald
2007-Jun-12 19:02 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:

<SNIP!>

> (and, for my own edification, how does IPsec relate to SSL, and if they
> are unrelated and I need to dtrace SSL how will I do that ?)

There *is* kernel SSL, but most SSL is in user-space. I *think* what's
being proposed is an all-kernel set of probes.

Dan
Dan McDonald
2007-Jun-12 19:14 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Mon, Jun 11, 2007 at 08:11:25PM -0700, Brendan Gregg - Sun Microsystems wrote:

> This provider has been great to use in testing. A key issue is the
> ugliness of having ipv4info_t and ipv6info_t as different args; although
> this hasn't been much of a problem as ipinfo_t is usually used instead.
>
> Take 2 (proposed):
>
>    Probes          Args
>    ip:::send       cstateinfo_t, ipinfo_t
>    ip:::receive    cstateinfo_t, ipinfo_t
>    ipv4:::send     cstateinfo_t, ipinfo_t, ipv4info_t
>    ipv4:::receive  cstateinfo_t, ipinfo_t, ipv4info_t
>    ipv6:::send     cstateinfo_t, ipinfo_t, ipv6info_t
>    ipv6:::receive  cstateinfo_t, ipinfo_t, ipv6info_t
>    ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
>    ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
>    icmp:::send     cstateinfo_t, ipinfo_t, icmpinfo_t
>    icmp:::receive  cstateinfo_t, ipinfo_t, icmpinfo_t
>    arp:::send      cstateinfo_t, arpinfo_t
>    arp:::receive   cstateinfo_t, arpinfo_t
>    rarp:::send     cstateinfo_t, arpinfo_t
>    rarp:::receive  cstateinfo_t, arpinfo_t
>    ...
>
> Providers such as "ipv4" and "ipv6" are for specific protocol analysis,
> and their 3rd argument is an endian-correct structure of the protocol
> members (DTrace translator). A pointer to the raw protocol struct will
> be provided. (This assumes that "ipv4" is an allowed provider name, as
> it could potentially clash with a USDT provider called "ipv" that is
> asked to trace PID 4.)

You have "ipsec" as one set of probes. I take it your ipsecinfo_t will
distinguish between AH, ESP, or both on a packet.

> 2) cstateinfo_t
>
> This provides connection state information if available,
>
>    cstateinfo_t,
>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>         cs_loopback     int       /* loopback state */
>
> and may be extended to include details such as,
>
>         cs_zoneid       int       /* zone ID */
>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */
>
> cs_cid isn't provided as a conn_t *, as that would expose an unstable
> interface; instead it is provided as a uint64_t, as it can be useful as a
> connection ID (and people can cast it as a conn_t * for raw debugging).

That seems sensible, but for lots of receive-side processing, there will
be NOTHING resembling a conn_t immediately available.

For example, IPsec packets (depending on where you put the probe) might
have the SA (ipsa_t) available, but no way will it have the conn_t...
conn lookup occurs AFTER inbound IPsec processing (at least right now).

> 4) ip provider & ipsec
>
> Since "ip" is a convenience provider, we have the possibility of casting
> data that is useful (for convenience) rather than what is strictly in the
> protocol. For example, IPSec, where packets have the tunnel source and dest

Spelling nit: IPsec.

> and the actual source and dest, could be traced as follows:
>
>    Probes    Arguments contain
>    ipv4:::   tunnel source and dest
>    ipv6:::   tunnel source and dest
>    ipsec:::  actual source and dest, and tunnel source and dest

IPsec is not a tunnelling protocol, and not all IPsec packets have tunnel
source/dest to contend with. Also, in Solaris, IP tunnels are distinct
entities; we implement IPsec tunnel-mode inside the context of our IP
tunnelling - that's the implementation tack of Tunnel Reform.

For IPsec transmit-side, depending on where the probes are placed, you'll
want to know things in the IPSEC_OUT M_CTL mblk. Highlights include:

   - Policy entry and/or actions (actions --> what precise algorithm(s)
     need(s) to be applied to the packet).

   - Outbound SA (if cached in the transmitting conn_t). Maybe some
     info from the SA in a readable form (e.g. algorithms).
   - conn_t of transmitter (may be NULL for ICMP replies and TCP RST)

For the receive side, look at the IPSEC_IN mblk, but mostly, you'll want:

   - Inbound SA and all relevant fields.

And don't forget --> IPsec uses ip_drop_packet() whenever a packet is
dropped for IPsec reasons. FBT + ip_drop_packet() is a gold mine of
usable information already; if you could extend the rest of TCP/IP to
use ip_drop_packet(), it may help a LOT.

> (assuming this is doable - I need to check if ip::: probes can be placed
> so as to not see duplicates of IPSec traffic).

You're right - placement is everything.

Dan
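As a sketch of that gold mine - using only the existing fbt provider and
the ip_drop_packet() function named above - counting drops by kernel
stack might look like:

   # dtrace -n 'fbt::ip_drop_packet:entry { @drops[stack()] = count(); }'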
all,

what happens when I CTRL-Z suspend a dtrace script? does this only
suspend the user-mode portion (formatting and output) of dtrace, and not
the kernel-mode portion (probe actions)? if so, where does the generated
data go - does it keep on accumulating in the kernel?

I am guessing that if my actions are not calling trace or print, and
only accumulating aggregates, other than in my END action, then there
isn't going to be any data loss, or data overflow?

I am guessing that if there is some instrumentation overhead of the
kernel portion of my dtrace script, then CTRL-Z suspending it won't
affect the overhead, that the kernel portion keeps on running?

thanks,
Pete Lawrence.
Peter Lawrence
2007-Jun-12 19:41 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Dan,

I've been studying the Solaris kernel implementation of SSL, and the
encompassing kernel-crypto-framework, and have no interest in user-land
implementations.

I'm still curious about how IPsec and SSL compare.

SSL is for TCP, so it's for reliable-ordered-byte-stream connections.

IPsec is, I'm guessing from the IP in its name, just encrypted IP, which
could be containing UDP or TCP or ...

so, I suppose that SSL could be emulated with IPsec as the underlying
protocol for TCP, but then I think nothing would prohibit the IPsec
connection from being used for additional connections, or even other
next level up (UDP, ..., NFS?, ...) protocols and connections.

I'm guessing that there isn't much in common in their implementations
other than crypto functions, since they're at such different locations
in the protocol stack.

thanks,
Pete.

Dan McDonald wrote On 06/12/07 12:02 PM,:

> On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:
>
> <SNIP!>
>
>> (and, for my own edification, how does IPsec relate to SSL, and if they
>> are unrelated and I need to dtrace SSL how will I do that ?)
>
> There *is* kernel SSL, but most SSL is in user-space. I *think* what's
> being proposed is an all-kernel set of probes.
>
> Dan
Dan McDonald
2007-Jun-12 19:49 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 12:41:15PM -0700, Peter Lawrence wrote:

> Dan,
> I've been studying the Solaris kernel implementation of SSL,
> and the encompassing kernel-crypto-framework, and have no interest
> in user-land implementations.

Ack.

> I'm still curious about how IPsec and SSL compare
>
> SSL is for TCP, so it's for reliable-ordered-byte-stream connections
>
> IPsec is, I'm guessing from the IP in its name, just encrypted IP,
> which could be containing UDP or TCP or ...

You are correct in that IPsec and SSL protect at different layers in the
stack.

> so, I suppose that SSL could be emulated with IPsec as the underlying
> protocol for TCP, but then I think nothing would prohibit the IPsec
> connection from being used for additional connections, or even other
> next level up (UDP, ..., NFS?, ...) protocols and connections.

SSL and IPsec have different packet formats. I don't even *know* (he
says sheepishly) what the SSL/TLS data formats look like. You can't
really build one in terms of the other and maintain interoperability.

Also, IPsec isn't tied to connections per se -- it's a per-datagram
protocol. You can narrow IPsec to one TCP connection by either using
per-socket policy or by specifying the full 5-tuple (TCP, laddr, raddr,
lport, rport) in IPsec policy.

> I'm guessing that there isn't much in common in their implementations
> other than crypto functions, since they're at such different locations
> in the protocol stack.

That's a fair statement!

Dan
Peter,

There is a set of switch buffers allocated for each DTrace consumer, and
then for each processor core or hardware thread. For example, if you
start a DTrace script on a T2000 with 32 hardware threads available, the
system allocates 32 sets of switch buffers just for your script.

If you've put your DTrace script process into a STOP state, the probes
continue to be activated and data recorded into the buffers; however,
when the buffers fill, you'll start losing data. When this happens while
the script is running (i.e. you're generating so much data that the
consumer can't keep up), then you'd see "dtrace: nnnnn drops on CPU n",
but since the process is stopped, I doubt you'd ever see the message.

Aggregation keys are kept in a different buffer, but the principle is
the same. If stopping the process causes the buffer to overflow, you'll
get aggregation drops.

Chip

Peter Lawrence wrote:

> all,
> what happens when I CTRL-Z suspend a dtrace script, does this only
> suspend the user-mode portion (formatting and output) of dtrace, and
> not the kernel-mode portion (probe actions). if so, where does the
> generated data go, does it keep on accumulating in the kernel.
[...]
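If drops while stopped are a concern, the principal and aggregation
buffers can be enlarged when starting the consumer; for example (sizes
here are arbitrary):

   # dtrace -b 8m -x aggsize=4m -s ./script.d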
Nicolas Williams
2007-Jun-12 20:00 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
On Tue, Jun 12, 2007 at 12:41:15PM -0700, Peter Lawrence wrote:

> I've been studying the Solaris kernel implementation of SSL,
> and the encompassing kernel-crypto-framework, and have no interest
> in user-land implementations.
>
> I'm still curious about how IPsec and SSL compare

They are very, very different.

IPsec consists of ESP and AH for protecting IP payloads (and, in the
case of AH, headers) below the transport protocols using keys that are
either manually configured or exchanged with a key exchange protocol.
TLS runs atop a transport protocol, has its own authentication and key
exchange protocols and its own "security layer" (to borrow a term from
SASL).

There is nothing in common, on the wire, between IPsec and TLS (except,
of course, for things incidental to authentication, like PKIX
certificates). IPsec protects packets. TLS protects octet streams or
datagram streams (DTLS; see below). The security models are different
and so is the applicability of each.

In my view the most important distinction between IPsec and TLS is that
the former is usually implemented with no useful APIs for application
developers while for the latter there is a plethora of APIs.

> SSL is for TCP, so it's for reliable-ordered-byte-stream connections

And then there's DTLS (Datagram TLS) which, as you can probably imagine,
can run over UDP.

> IPsec is, I'm guessing from the IP in its name, just encrypted IP,
> which could be containing UDP or TCP or ...

ESP and AH. Plus key exchange protocols (IKEv1, IKEv2, KINK).

> so, I suppose that SSL could be emulated with IPsec as the underlying
> protocol for TCP, but then I think nothing would prohibit the IPsec
> connection from being used for additional connections, or even other
> next level up (UDP, ..., NFS?, ...) protocols and connections.

FYI, the IETF BTNS WG is working on a notion of "IPsec channel" and
"IPsec APIs."

> I'm guessing that there isn't much in common in their implementations
> other than crypto functions, since they're at such different locations
> in the protocol stack.

Correct, but then, kssl runs in-kernel, so there's a bit more in common
than you expected, but not much.

Nico
--
Just one extra thing to add to Chip's comments. The DTrace subsystem
expects a consumer to check in periodically (i.e. when it snapshots
buffer state). If it doesn't check in for 30 seconds, by default, then
the deadman timer will fire and the consumer is as good as dead. If you
suspend a consumer for longer than 30 seconds it will bail when you next
put it in the foreground. You can just enable destructive actions to get
around this.

Jon.

> Peter,
>
> There is a set of switch buffers allocated for each DTrace consumer,
> and then for each processor core or hardware thread. For example, if
> you start a DTrace script on a T2000 with 32 hardware threads
> available, the system allocates 32 sets of switch buffers just for
> your script.
[...]
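For example, destructive actions can be enabled on the command line with
dtrace -w, or from within the script itself:

   #pragma D option destructive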
Brendan Gregg - Sun Microsystems
2007-Jun-12 20:45 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Peter,

On Tue, Jun 12, 2007 at 11:42:42AM -0700, Peter Lawrence wrote:

> G'Day Brendan,
> I've got a couple very open ended questions, I'm not so
> much concerned with the actual answers as to know that you are thinking
> about such issues...
>
> how does the info available through this provider compare with
> /usr/sbin/snoop ?
>
> will it be possible to emulate the filtering capabilities of snoop ?

There are many instances where these providers will be used instead of
snoop, such as general traffic observability (eg, bytes by IP address).
The out of date URL in my first post demonstrates many of these,
including similar filtering capabilities.

However I imagine that snoop has features that these providers won't do
easily, such as,

   - output to standard capture file format (RFC 1761)
   - output formatting and translations (-V, -v)

> how does the placement of the probes imply timing info ?

In terms of kernel code-path, it doesn't.

> is there any way to track time between application send
> and ip send, and then between ip send and NIC send ?
> is there any way to track time between ip receive and
> application read, and also between NIC receive and ip
> receive ?

Kernel code-path latency was solved in 2004 with the fbt provider, which
provides 4344 probes (on my build) for examining the ip module, which
includes function entry and return values, and return offsets.

There are some cases where it is appropriate to add unstable sdt probes
to make this analysis easier. For example, when packets are dropped due
to error, it would be easier to instrument this with an sdt probe than
to process the return value and offset.

I would be happy to prototype a high level code-path latency provider
(which would be sdt based, and would probably be classified as unstable)
as a future project.

In the meantime, use fbt. If you don't know how, learn how. And when you
run into fbt instrumentation issues that are genuinely unsolvable, then
add (or RFE) sdt probes.

These providers are for users of Solaris, and they are similar to what
the "io" provider achieves. They are not for developers of Solaris who
are interested in code-path latency and can already use fbt and sdt.

Brendan

--
Brendan
[CA, USA]
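As a sketch of the fbt approach - using tcp_send_data(), one of the
functions mentioned elsewhere in this thread (probe availability varies
by build) - timing one leg of the send path might look like:

   fbt:ip:tcp_send_data:entry
   {
           self->ts = timestamp;
   }

   fbt:ip:tcp_send_data:return
   /self->ts/
   {
           /* nanoseconds spent in tcp_send_data() */
           @lat["tcp_send_data (ns)"] = quantize(timestamp - self->ts);
           self->ts = 0;
   }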
Darren.Reed at Sun.COM
2007-Jun-12 23:22 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

> G'Day Darren,
>
> On Mon, Jun 11, 2007 at 08:52:30PM -0700, Darren.Reed at sun.com wrote:
>
>> Brendan,
>>
>> I had a quick look through the URL you provided and one thing
>
> That URL currently describes "take 1" of the network provider proposal,
> and so is out of date. The newer details (IP only) were in the email I
> sent, and I'll be updating the URL based on that and the feedback I get.

Ah, I used the URL because it seemed to have better content than the
email ;)

>> Also, can you explain where the probes for each protocol fit in
>> relation to the greater processing for the protocol?
>
> Sure, probes fire when packets are sent and received. The actual
> location of the probe macros in the kernel TCP/IP code is based on
> available information and maintainability. This means that ip
> probes may be placed in tcp_send_data() and udp_send_data(), just
> as MIB macros for IP statistics appear in tcp.c and udp.c.

Saying "when packets are sent and received" is a little bit too vague.
Does this mean when IP* packets are sent/received via the mac layer, dls
layer, the driver layer or IP layer? And how far into each layer?

To compare this with snoop, what we get is more or less understood to be
a copy of the data sent to/from a device driver with anything above it,
so we know it is possible to see corrupt frames, incorrect checksums,
etc.

Having a good understanding of where the probes fit in the overall
architecture will help a lot towards us networking folks understanding
what we can (or cannot) see when using them.

>> Will tracemem() be told how to walk the packet, so I can do the below?
>> tracemem(args[1]->ip_hdr, args[1]->ip_plength)
>>
>> ...well, that'd also require ip_hdr being added to ipinfo_t.
>
> I don't think ipinfo_t should have ip_hdr; if you are interested in
> protocol details, then instead of using the "ip" provider or the generic
> ipinfo_t structure, you use the "ipv4" (or "ipv6") protocol specific
> provider, which currently provides args[2]->ipv4_hdr as a pointer
> to the raw ipha_t.
>
> I think there are two things you may be interested in:
>
> A) dumping the raw header + extended options.
>
> [...]
>
> So far so good, now we need to replace the "20" value with
> args[2]->ipv4_ihl,
>
> # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
>     args[2]->ipv4_ihl); }'
> dtrace: invalid probe specifier ip:::send,ip:::receive {
> tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
> argument #2 must be a non-zero positive integral constant expression

Will ipv4_ihl be the number of bytes or the value as found in the
header? (a count of 4 byte words)

> hmm. DTrace is trying to help out by sanity checking args,
>
>         if (dt_node_is_posconst(size) == 0) {
>                 dnerror(size, D_TRACEMEM_SIZE, "tracemem( ) argument #2 must "
>                     "be a non-zero positive integral constant expression\n");
>         }
>
> I need to check if there is a sensible way around this...
> I can think of a work-around using existing DTrace functionality, but
> it isn't pretty.

I'm not sure I understand what the problem is here but I'm glad you do :)

> B) dumping the payload
>
> It wasn't the initial intent to access the payload easily from DTrace, but
> enough people have asked that I will check how doable this is (it could
> certainly be added as a later feature).
>
> [...]
>
> baddr meaning buffer address, and tracemblk() a new function to walk
> and dump mblks easily. I'm not yet sure which approach would be best
> without trying to code it first.

Ok, dumping the payload is what I was asking about and it seems like you
understand the problem, etc.

>> For IPv6, how do you see this design evolving to include looking
>> at extension header data?
>
> Extension header data would only be accessed through the ipv6 provider,
> and can be added to the ipv6info_t struct.

For most IPv6 extension headers, there is a limit on them appearing only
once, with the exception being destination options. How do you see that
being handled?

Darren
Brendan Gregg - Sun Microsystems
2007-Jun-13 02:36 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Dan,

On Tue, Jun 12, 2007 at 03:14:23PM -0400, Dan McDonald wrote:

> On Mon, Jun 11, 2007 at 08:11:25PM -0700, Brendan Gregg - Sun Microsystems wrote:
[...]
>>    ipsec:::send    cstateinfo_t, ipinfo_t, ipsecinfo_t
>>    ipsec:::receive cstateinfo_t, ipinfo_t, ipsecinfo_t
[...]
>
> You have "ipsec" as one set of probes. I take it your ipsecinfo_t will
> distinguish between AH, ESP, or both on a packet.

Yes. Although I haven't created a suggested ipsecinfo_t struct yet. The
members of ipsecinfo_t will depend on balancing,

   - data that is stable
   - data that is useful
   - data that is available in the right places (or can be made
     available through code changes)
   - data that is maintainable in those places

>> 2) cstateinfo_t
>>
>> This provides connection state information if available,
>>
>>    cstateinfo_t,
>>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>>         cs_loopback     int       /* loopback state */
>>
>> and may be extended to include details such as,
>>
>>         cs_zoneid       int       /* zone ID */
>>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */
>>
>> cs_cid isn't provided as a conn_t *, as that would expose an unstable
>> interface; instead it is provided as a uint64_t, as it can be useful as a
>> connection ID (and people can cast it as a conn_t * for raw debugging).
>
> That seems sensible, but for lots of receive-side processing, there will be
> NOTHING resembling a conn_t immediately available.
>
> For example, IPsec packets (depending on where you put the probe) might have
> the SA (ipsa_t) available, but no way will it have the conn_t... conn lookup
> occurs AFTER inbound IPsec processing (at least right now).

The probes don't need to appear in the IPsec code - so if appropriate
IPsec data (such as ipsa_t) can be fetched after conn lookup, then
probes could be placed there. I need to continue studying the code to
see what may be best to do.

I'm glad you brought it up anyway - these are the sort of issues I want
to address before committing to a network provider plan.

>> and the actual source and dest, could be traced as follows:
>>
>>    Probes    Arguments contain
>>    ipv4:::   tunnel source and dest
>>    ipv6:::   tunnel source and dest
>>    ipsec:::  actual source and dest, and tunnel source and dest
>
> IPsec is not a tunnelling protocol, and not all IPsec packets have tunnel
> source/dest to contend with. Also, in Solaris, IP tunnels are distinct
> entities; we implement IPsec tunnel-mode inside the context of our IP
> tunnelling - that's the implementation tack of Tunnel Reform.
>
> For IPsec transmit-side, depending on where the probes are placed, you'll
> want to know things in the IPSEC_OUT M_CTL mblk. Highlights include:
>
>    - Policy entry and/or actions (actions --> what precise algorithm(s)
>      need(s) to be applied to the packet).
>
>    - Outbound SA (if cached in the transmitting conn_t). Maybe some
>      info from the SA in a readable form (e.g. algorithms).
Right, and if I'm probing when I have access to conn_t, then it looks
like I can usually get these details from conn_ipsec_opt_mp (so I have
both conn_t and ipsa_t at the same time).

>    - conn_t of transmitter (may be NULL for ICMP replies and TCP RST)

Yes - I don't expect conn_t information for ICMP and TCP RSTs, and so
their connection ID (cs_cid) would be NULL, and I'd need to figure out
if other details such as cs_loopback can be provided.

> For the receive side, look at the IPSEC_IN mblk, but mostly, you'll want:
>
>    - Inbound SA and all relevant fields.

Yes, and it looks like I have IPSEC_IN mblks in ip_proto_input() (at
least), so I'm hopeful I'll find a place on input that has both conn_t
and ipsa_t pointers.

> And don't forget --> IPsec uses ip_drop_packet() whenever a packet is dropped
> for IPsec reasons. FBT + ip_drop_packet() is a gold mine of usable
> information already; if you could extend the rest of TCP/IP to use
> ip_drop_packet(), it may help a LOT.

Ahh, thank goodness, that does make things easier. There seems to be the
occasional place in the tcp/ip code where a freemsg(mp) happens without
bumping debug metrics (such as MIB).

Thanks for your feedback,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-13 18:26 UTC
[dtrace-discuss] Re: [networking-discuss] Re: DTrace Network Providers, take 2
G'Day Simon,

On Tue, Jun 12, 2007 at 04:31:36AM -0700, Simon Leinen wrote:

>> 1) ipinfo_t
>>
>> This provides common details from the IP protocols that we expect to
>> have available,
>>
>>    ipinfo_t,
>>         ip_protocol     int       /* protocol */
> [...]
>> ip_protocol could be just the IP protocol number (4 or 6), or borrow
>> existing definitions such as AF_INET/AF_INET6,
>> ETHERTYPE_IP/ETHERTYPE_IPV6 or use /etc/protocols - as these may
>> better accommodate future protocol additions. Using the IP protocol
>> number (4/6) would be the least surprising choice for the end user.
>
> Your use of "IP protocol number" confuses me - I think you should call
> this "IP version number" (and the struct member "ip_version"), as in
> http://www.iana.org/assignments/version-numbers. Protocol numbers (at
> least to me) refer to upper-layer protocols such as TCP, UDP, see
> http://www.iana.org/assignments/protocol-numbers

It is actually ip_ver in the original prototype, as it only matched the
IP version number; I changed it to ip_protocol in case we wanted it to
match non-IP protocols, although I can't quite put my finger on an
actual example.

I think you are right, and thanks for the URLs: if it is version, it
should be ip_version; if it is protocol, it would be ip_protocol. I
wonder if it would be wise to include both ip_protocol and ip_version?

cheers,

Brendan

--
Brendan
[CA, USA]
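Whatever the member is finally named, scripts would use it as a
predicate; a sketch against the proposed (still unsettled) interface:

   /* count packets by IP version; assumes an ip_version member */
   ip:::receive /args[1]->ip_version == 4/ { @["IPv4"] = count(); }
   ip:::receive /args[1]->ip_version == 6/ { @["IPv6"] = count(); }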
Brendan Gregg - Sun Microsystems
2007-Jun-14 17:20 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Tue, Jun 12, 2007 at 04:22:00PM -0700, Darren.Reed at sun.com wrote:
[...]

>>> Also, can you explain where the probes for each protocol fit in
>>> relation to the greater processing for the protocol?
>>
>> Sure, probes fire when packets are sent and received. The actual
>> location of the probe macros in the kernel TCP/IP code is based on
>> available information and maintainability. This means that ip
>> probes may be placed in tcp_send_data() and udp_send_data(), just
>> as MIB macros for IP statistics appear in tcp.c and udp.c.
>
> Saying "when packets are sent and received" is a little bit too
> vague. Does this mean when IP* packets are sent/received
> via the mac layer, dls layer, the driver layer or IP layer? And
> how far into each layer?

This is easier to explain if I post a webrev (which I want to do ASAP).
Anyway, for sent traffic - it is probed as late as possible while the
conn_t is still available; and for received traffic, it is probed early
in common places.

Or described another way - IP is probed when ip meets dld or dls.

> To compare this with snoop, what we get is more or less understood
> to be a copy of the data sent to/from a device driver with anything
> above it, so we know it is possible to see corrupt frames, incorrect
> checksums, etc.

Great - so I don't need to reinvent that functionality since snoop has
me covered. ;-)

Seriously, yes, I would prefer these DTrace providers to provide
identical protocol headers to what snoop can see, but if there is some
IP header mangling down in dld or dls, then snoop would have a better
view of packets than the current placement of IP probes. But the
question is - is there IP header mangling in dld or dls (and why would
there be)?

I do see that dls_soft_ring_fanout() can drop ip packets that don't
have a complete ipv4 header,

        if ((MBLKL(mp) < sizeof (ipha_t)) || !OK_32PTR(mp->b_rptr)) {
                if ((mp = msgpullup(mp, sizeof (ipha_t))) == NULL) {
                        /* Let's toss this away */
                        dls_bad_ip_pkt++;
                        freemsg(mp);
                        continue;
                }

which seems like a precaution anyway, rather than something that will
frequently happen,

   # echo dls_bad_ip_pkt/D | mdb -k
   dls_bad_ip_pkt:
   dls_bad_ip_pkt: 0

> Having a good understanding of where the probes fit in the
> overall architecture will help a lot towards us networking folks
> understanding what we can (or cannot) see when using them.

Sure. Maybe this helps also - this is the DTrace probe listing from the
old prototype, which shows the function locations for the ip probes,

   # dtrace -ln 'ip:::'
      ID   PROVIDER            MODULE                FUNCTION NAME
   47950         ip                ip           udp_send_data send
   47951         ip                ip           tcp_send_data send
   47952         ip                ip          ip_wput_ire_v6 send
   47953         ip                ip             ip_wput_ire send
   47963         ip                ip              ip_rput_v6 receive
   47964         ip                ip                ip_input receive

Of course, this list is likely to get a lot longer as IP Filter and
IPsec are traced properly, and ipv4 and ipv6 providers are added.

[...]

>> So far so good, now we need to replace the "20" value with
>> args[2]->ipv4_ihl,
>>
>> # dtrace -n 'ip:::send,ip:::receive { tracemem(args[2]->ipv4_hdr,
>>     args[2]->ipv4_ihl); }'
>> dtrace: invalid probe specifier ip:::send,ip:::receive {
>> tracemem(args[2]->ipv4_hdr, args[2]->ipv4_ihl); }: tracemem( )
>> argument #2 must be a non-zero positive integral constant expression
>
> Will ipv4_ihl be the number of bytes or the value as found in
> the header? (a count of 4 byte words)

In the first prototype it was 4 byte words, which I found a little
annoying to use.
I'm thinking the translated version could be bytes, and clearly
documented as such, while bearing in mind that a pointer to the raw
header is available also. The translated version is there for usability
(eg, IP addresses are represented as strings).

[...]

>>> For IPv6, how do you see this design evolving to include looking
>>> at extension header data?
>>
>> Extension header data would only be accessed through the ipv6 provider,
>> and can be added to the ipv6info_t struct.
>
> For most IPv6 extension headers, there is a limit on them
> appearing only once, with the exception being destination
> options. How do you see that being handled?

Anything requiring looping through like-members presents a little
headache in DTrace - which doesn't provide (or so far, really need)
looping in the D language. I see them as important but not crucial for a
first release of the ipv6 provider - destination options can be added as
members to ipv6info_t when the right way to present them has been found.
But thanks for bringing it up - it is another problem that needs careful
consideration.

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems wrote:

>    rarp:::send     cstateinfo_t, arpinfo_t
>    rarp:::receive  cstateinfo_t, arpinfo_t

There isn't any reverse ARP support in the kernel, so where would these
probes be placed? In in.rarpd(1m)?

> 2) cstateinfo_t
>
> This provides connection state information if available,
>
>    cstateinfo_t,
>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>         cs_loopback     int       /* loopback state */

I assume that cs_cid will be zero when there is no conn_t (which is the
case for most of the IP transmit paths.)

What is the intended semantics of cs_loopback?

> and may be extended to include details such as,
>
>         cs_zoneid       int       /* zone ID */
>         cs_ip_stack     uint64_t  /* stack ID (ip_stack_t *) */

For the stack ID it is probably better to use the small integer which is
in ip_stack_t->ips_netstack->netstack_stackid. It is much easier to see
how these relate to exclusive-IP zones than having a pointer cast as a
uint64_t. (The current implementation happens to pick the stackid to be
the same as the zoneid for the exclusive-IP zones.)

Just as Dan pointed out for the conn_t, the zoneid isn't known/defined
for much of the L3 receive side processing. For instance the IP and ICMP
input doesn't know it, and ARP is essentially all run in the global
zone. For UDP and TCP input we know the zoneid. And I hope that by now
Nevada has a well-defined zoneid for all the transmit paths.

> (assuming this is doable - I need to check if ip::: probes can be placed
> so as to not see duplicates of IPSec traffic).

The placement might be tricky now; should get a lot easier once we've
refactored the IP datapaths.

Erik
Erik Nordmark
2007-Jun-15 13:40 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

> This is easier to explain if I post a webrev (which I want to do ASAP).
> Anyway, for sent traffic - it is probed as late as possible while the
> conn_t is still available; and for received traffic, it is probed
> early in common places.
>
> Or described another way - IP is probed when ip meets dld or dls.

The above two statements are in conflict. When IP meets dld or dls there
is no notion of a conn_t. While some code paths in the current
implementation might have a conn_t, that is a bug which will be fixed
with the IP datapath refactoring project. So please don't depend on
this.

If you want a conn_t you need to look at transmitted packets at the top
of ip_output, and be aware that in some cases none is available (ICMP
errors, TCP RST packets, etc). Alternatively you can get a conn_t by
looking at transmitted packets at the bottom of tcp/udp/rawip.

The latter might be more consistent with the receive side where we have
a conn_t when the packet reaches the tcp/udp/rawip input code.

Erik
Brendan Gregg - Sun Microsystems
2007-Jun-15 17:16 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik, Folks,

On Fri, Jun 15, 2007 at 06:35:20AM -0700, Erik Nordmark wrote:

> Brendan Gregg - Sun Microsystems wrote:
>>    rarp:::send     cstateinfo_t, arpinfo_t
>>    rarp:::receive  cstateinfo_t, arpinfo_t
>
> There isn't any reverse ARP support in the kernel, so where would these
> probes be placed? In in.rarpd(1m)?

Ahh - I was fleshing out protocol names before embarking on the code -
you are right, these would belong in in.rarpd as USDT probes.

>> 2) cstateinfo_t
>>
>> This provides connection state information if available,
>>
>>    cstateinfo_t,
>>         cs_cid          uint64_t  /* connection ID (conn_t *) */
>>         cs_loopback     int       /* loopback state */
>
> I assume that cs_cid will be zero when there is no conn_t (which is the
> case for most of the IP transmit paths.)

Yes, which is part of the problem. Probe placement turned out to be a
bigger pain.

Yesterday I spent many hours locating all the different code path points
that would need probes for measuring conn_t in IP (I reached more than
20 places). For most of the probes, conn_t was and should be NULL (eg,
RSTs, forwarding, bogus packets, ...). conn_t wasn't NULL for TCP and
UDP, something we learn later in the IP code path, however by this point
many inbound packets have been dropped or processed elsewhere - which
creates a headache for probe placement; I sent all dropped packets
through a trace'n'drop path, and for those being processed elsewhere I
added more probes.

Having coded most of it, it was becoming obvious that something had gone
wrong. Yes, I can get conn_t in IP, but are the code changes worth it?
If it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t
in the tcp and udp providers? And with ire_t and ill_t readily
available, can the cstateinfo_t struct get its answers from them - and
not conn_t? What DTrace scripts would need cs_cid in IP, that couldn't
be written using the udp and tcp providers?

Last night I dropped conn_t from the ip provider, and started recoding
it using ire_t and ill_t. The probes have also moved location, and many
are now near FW_HOOKS for physical events (before for inbound, after for
outbound). I'll post some suggested info structs when I see if ire_t and
ill_t are sensible to use (at the moment I'm not sure if I should be
trying to export ire_t for inbound packets - I'm doubting that it makes
sense).

> What is the intended semantics of cs_loopback?

I'd like it to be boolean, but right now it may be more practical as:

   -1   unknown
    0   not loopback
    1   loopback

[...]

> The placement might be tricky now; should get a lot easier once we've
> refactored the IP datapaths.

Placement is very tricky. Refactoring would be great. :)

thanks,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-15 17:30 UTC
[dtrace-discuss] Re: [networking-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Fri, Jun 15, 2007 at 06:40:27AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > This is easier to explain if I post a webrev (which I want to do ASAP).
> > Anyway, for sent traffic - it is probed as late as possible while the
> > conn_t is still available; and for received traffic, it is probed
> > early in common places.
> >
> > Or described another way - IP is probed when ip meets dld or dls.
>
> The above two statements are in conflict. When IP meets dld or dls there
> is no notion of a conn_t. While some code paths in the current
> implementation might have a conn_t, that is a bug which will be fixed
> with the IP datapath refactoring project. So please don't depend on this.

I think I just learnt this the hard way ;). If there was a strong need for
conn_t visibility in IP, then I'd keep pushing for it. There isn't. There is
such a need for conn_t visibility (especially as a connection ID) somewhere
in the stack, such as the tcp and udp providers.

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems wrote:
> Having coded most of it, it was becoming obvious that something had gone
> wrong. Yes, I can get conn_t in IP, but are the code changes worth it? If
> it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t in
> the tcp and udp providers? And with ire_t and ill_t readily available, can
> the cstateinfo_t struct get its answers from them - and not conn_t? What
> DTrace scripts would need cs_cid in IP, that couldn't be written using the
> udp and tcp providers?

I think TCP and UDP providers make more sense. And hopefully soon the rawip
code (icmp.c - the transport provider for RAW sockets) will use conn_t as
well.

But there is a place in IP where there is (at least conceptually) none of
conn_t, ire_t, ill_t. That place is when an ICMP error (or TCP reset etc) is
generated by IP. Basically that just sends an IP packet, and ip_output finds
the ire_t which points at the ill_t.

Thus conceptually the bottom of the transmit path of {tcp,udp}_output has a
conn_t, and the top of ip_output has no context except the ip_stack_t.
ip_output then looks up an ire, and after that there is an ire_t and related
context (ill_t, nce_t) for the transmit path.

> Last night I dropped conn_t from the ip provider, and started recoding
> it using ire_t and ill_t. The probes have also moved location, and many
> are now near FW_HOOKS for physical events (before for inbound, after for
> outbound). I'll post some suggested info structs when I see if ire_t and
> ill_t are sensible to use (at the moment I'm not sure if I should be
> trying to export ire_t for inbound packets - I'm doubting that it makes
> sense).

For inbound packets (delivered to the local machine) the ire_t isn't very
useful - it is either an IRE_LOCAL or IRE_BROADCAST. For forwarded packets
the ire_t is more interesting.

Placing probes at the FW_HOOKS provides observability at the bottom of IP,
i.e. corresponding to ipInReceives on the receive side (oddly, there isn't a
corresponding xmit stat). As such it is layered below IPsec and
fragmentation/reassembly. But I think it would also be useful to have
observability at the top of IP, i.e. corresponding to ipInDelivers and
ipOutRequests, since that is above IPsec and frag/reass.

> > What is the intended semantics of cs_loopback?
>
> I'd like it to be boolean, but right now it may be more practical as:
>
>         -1      unknown
>          0      not loopback
>          1      loopback

What is it intended to mean? Will it be set for e.g. inter-zone packets on
the same system? Multicast packets that are looped back because there are
members on the transmitting interface? Multicast packets that are looped
back because the system is running as a multicast router? Transmitted
broadcast packets where a copy is handed back to the input side?

Erik
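The MIB counters Erik names can already be watched with the existing mib
provider, whose probes fire on each counter bump; a rough one-liner that
locates approximately the same top/bottom-of-IP points being discussed:

        # dtrace -n 'mib:::ipIfStatsHCInReceives,
            mib:::ipIfStatsHCInDelivers,
            mib:::ipIfStatsHCOutRequests { @[probename] = count(); }'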
Darren.Reed at Sun.COM
2007-Jun-18 17:51 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> ...
>
> Last night I dropped conn_t from the ip provider, and started recoding
> it using ire_t and ill_t. The probes have also moved location, and many
> are now near FW_HOOKS for physical events (before for inbound, after for
> outbound). I'll post some suggested info structs when I see if ire_t and
> ill_t are sensible to use (at the moment I'm not sure if I should be
> trying to export ire_t for inbound packets - I'm doubting that it makes
> sense).

The FW_HOOKS macros generally have sdt dtrace probes on either side of them,
e.g.:

        DTRACE_PROBE4(ip4__forwarding__start,
            ill_t *, in_ill, ill_t *, out_ill, ipha_t *, ipha,
            mblk_t *, mp);

        FW_HOOKS(ipst->ips_ip4_forwarding_event,
            ipst->ips_ipv4firewall_forwarding,
            in_ill, out_ill, ipha, mp, mp, ipst);

        DTRACE_PROBE1(ip4__forwarding__end, mblk_t *, mp);

If your probes are getting close to where FW_HOOKS appears, is there some
merit in replacing some of these sdt probes with probes from the provider
you're working on? Although this might only make sense if there is complete
conversion of all the sdt's into the new thing.

...I'm sure that there is such a thing as too many dtrace probe points :)

Darren
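Those sdt probes can already be enabled today, as an unstable interface; a
minimal sketch counting forwarded IPv4 packets via the probe shown above:

        # dtrace -n 'sdt:::ip4-forwarding-start { @forwarded = count(); }'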
Brendan Gregg - Sun Microsystems
2007-Jun-18 20:43 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Mon, Jun 18, 2007 at 06:56:37AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > Having coded most of it, it was becoming obvious that something had gone
> > wrong. Yes, I can get conn_t in IP, but are the code changes worth it? If
> > it is NULL everywhere but TCP and UDP, shouldn't I only export conn_t in
> > the tcp and udp providers? And with ire_t and ill_t readily available, can
> > the cstateinfo_t struct get its answers from them - and not conn_t? What
> > DTrace scripts would need cs_cid in IP, that couldn't be written using the
> > udp and tcp providers?
>
> I think TCP and UDP providers make more sense. And hopefully soon the
> rawip code (icmp.c - the transport provider for RAW sockets) will use
> conn_t as well.

Yes; and if for some reason in the distant future conn_t does become
prolific in IP code, then we can always add an extra conn_t based argument
to the IP providers. For now it looks best not to have conn_t in IP.

> But there is a place in IP where there is (at least conceptually) none
> of conn_t, ire_t, ill_t.
> That place is when an ICMP error (or TCP reset etc) is generated by IP.
> Basically that just sends an IP packet, and ip_output finds the ire_t
> which points at the ill_t.

It looked like routing/forwarding fell into this category also.

> Thus conceptually the bottom of the transmit path of {tcp,udp}_output
> has a conn_t, and the top of ip_output has no context except the
> ip_stack_t. ip_output then looks up an ire, and after that there is an
> ire_t and related context (ill_t, nce_t) for the transmit path.
>
> > Last night I dropped conn_t from the ip provider, and started recoding
> > it using ire_t and ill_t. The probes have also moved location, and many
> > are now near FW_HOOKS for physical events (before for inbound, after for
> > outbound). I'll post some suggested info structs when I see if ire_t and
> > ill_t are sensible to use (at the moment I'm not sure if I should be
> > trying to export ire_t for inbound packets - I'm doubting that it makes
> > sense).
>
> For inbound packets (delivered to the local machine) the ire_t isn't
> very useful - it is either an IRE_LOCAL or IRE_BROADCAST.
> For forwarded packets the ire_t is more interesting.

I did prototype it for send probes only, but now I've dropped it; I was
really just using it to identify loopback traffic easily, which I can do
with a separate code-location-based argument.

> Placing probes at the FW_HOOKS provides observability at the bottom of
> IP, i.e. corresponding to ipInReceives on the receive side (oddly,
> there isn't a corresponding xmit stat). As such it is layered below
> IPsec and fragmentation/reassembly.

Yes, which should allow a snoop-like raw inspection of received packets. So
far the only received packets I'm not tracing as IP are those where
pkt_len < IP_SIMPLE_HDR_LENGTH, since they may not truly be IP at all. The
xmit stat (ipIfStatsHCOutTransmits) looks like an RFC 4293 MIB extension.

> But I think it would also be useful to have observability at the top of
> IP, i.e. corresponding to ipInDelivers and ipOutRequests, since that is
> above IPsec and frag/reass.

Yes, this would probably be where an IPsec provider would live; as for in
the IP providers as well -- that would be interesting, but I can't think of
a stable abstraction yet. We have ip:::send/receive for bottom of IP
tracing; what would we call top of IP tracing? ip:::read/write? Should this
be sdt instead? Could we rely on TCP/UDP probes, which will be at the bottom
of their stacks and close to the top of IP anyway? ...

> > > What is the intended semantics of cs_loopback?
> >
> > I'd like it to be boolean, but right now it may be more practical as:
> >
> >         -1      unknown
> >          0      not loopback
> >          1      loopback
>
> What is it intended to mean?

Cool, I was testing these cases last night,

> Will it be set for e.g. inter-zone packets on the same system?

Yes. Both send and receive should be traceable (unless you are in TCP fusion
- which I think I'd leave for the TCP provider to trace, rather than faking
up some IP events).

> Multicast packets that are looped back because there are members on the
> transmitting interface?

Yes. Currently looks like this,

   # ./ipio1.d
              FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
     ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
    ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
        ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8

A loopback send and receive are visible, but only the non-loopback send (as
we'd expect).

> Multicast packets that are looped back because the system is running as
> a multicast router?

Hmmm, I'd say yes - both the loopback send and receive events should be
visible. Although I haven't checked this one yet. ;)

> Transmitted broadcast packets where a copy is handed back to the input side?

Yes. Currently looks like,

   # ./ipio1.d
              FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.255      1     64
    ip_wput_local:receive   192.168.1.108 -> 192.168.1.255      1     64
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.108      1     64
    ip_wput_local:receive   192.168.1.108 -> 192.168.1.108      1     64
        ip_wput_ire:send    192.168.1.108 -> 192.168.1.255      0     64
           ip_input:receive 192.168.1.148 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.217 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.160 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.146 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.213 -> 192.168.1.108      0     64
           ip_input:receive 192.168.1.188 -> 192.168.1.108      0     64
   [...]

Both the loopback broadcast to 192.168.1.255 and the physical broadcast can
be seen.

thanks for the help,

Brendan

--
Brendan
[CA, USA]
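ipio1.d above is Brendan's private debug script; a rough sketch of how such
a script might have looked, assuming the prototype arguments at the time
(args[0] cstateinfo_t, args[1] ipinfo_t) and noting that FUNC is the
unstable probefunc:

        #!/usr/sbin/dtrace -s
        /* hypothetical reconstruction, not the actual ipio1.d */
        #pragma D option quiet

        BEGIN
        {
                printf("%22s %16s    %-16s %4s %6s\n",
                    "FUNC:PROBE", "SOURCE", "DEST", "LOOP", "BYTES");
        }

        ip:::send,
        ip:::receive
        {
                printf("%14s:%-7s %16s -> %-16s %4d %6d\n",
                    probefunc, probename,
                    args[1]->ip_saddr, args[1]->ip_daddr,
                    args[0]->cs_loopback, args[1]->ip_plength);
        }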
Brendan Gregg - Sun Microsystems
2007-Jun-19 17:43 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Mon, Jun 18, 2007 at 10:51:33AM -0700, Darren.Reed at sun.com wrote:
> Brendan Gregg - Sun Microsystems wrote:
>
> > ...
> >
> > Last night I dropped conn_t from the ip provider, and started recoding
> > it using ire_t and ill_t. The probes have also moved location, and many
> > are now near FW_HOOKS for physical events (before for inbound, after for
> > outbound). I'll post some suggested info structs when I see if ire_t and
> > ill_t are sensible to use (at the moment I'm not sure if I should be
> > trying to export ire_t for inbound packets - I'm doubting that it makes
> > sense).
>
> The FW_HOOKS macros generally have sdt dtrace probes on
> either side of them, e.g.:
>
>         DTRACE_PROBE4(ip4__forwarding__start,
>             ill_t *, in_ill, ill_t *, out_ill, ipha_t *, ipha,
>             mblk_t *, mp);
>
>         FW_HOOKS(ipst->ips_ip4_forwarding_event,
>             ipst->ips_ipv4firewall_forwarding,
>             in_ill, out_ill, ipha, mp, mp, ipst);
>
>         DTRACE_PROBE1(ip4__forwarding__end, mblk_t *, mp);
>
> If your probes are getting close to where FW_HOOKS appears,
> is there some merit in replacing some of these sdt probes
> with probes from the provider you're working on?

They were indeed getting close - at one point I had this in ip_wput_ire(),

        DTRACE_PROBE4(ip4__physical__out__start,
            ill_t *, NULL, ill_t *, ire->ire_ipif->ipif_ill,
            ipha_t *, ipha, mblk_t *, mp);

        FW_HOOKS(ipst->ips_ip4_physical_out_event,
            ipst->ips_ipv4firewall_physical_out,
            NULL, ire->ire_ipif->ipif_ill, ipha, mp, mp, ipst);

        DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);

        if (mp == NULL)
                goto release_ire_and_ill;

        /*
         * DTrace this as ip:::send and ipv4:::send.
         */
        DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
        DTRACE_IPV4_5(send, mblk_t *, mp, void_ip_t *, ipha,
            ipha_t *, ipha, int, 0, ill_t *, ire->ire_ipif->ipif_ill);

        mp->b_prev = SET_BPREV_FLAG(IPP_LOCAL_OUT);
        DTRACE_PROBE2(ip__xmit__1, mblk_t *, mp, ire_t *, ire);
        pktxmit_state = ip_xmit_v4(mp, ire, NULL, B_TRUE);

which has three generations of DTrace probes in one place! ... but not
anymore - those ip:::send probes are now in ip_xmit_v4()...

> Although this might only make sense if there is complete
> conversion of all the sdt's into the new thing.

I did spend some time thinking about what could be done, as the FW_HOOKS
have already instrumented packet code paths at a fairly low level (I wish
they existed in Solaris 10 3/05 - my fbt based scripts would have been much
easier to write). The IP providers must be a completely stable interface -
and can't export mblk_t, ill_t, etc directly; however to do so with SDT
probes is great for debugging, and the FW_HOOKS probes could even have more
arguments added (ire_t, ...), and more probes of a similar style created. I
think the FW_HOOKS SDT probes would be part of (and currently lead the way
for) a debugging or code-path-latency provider.

There are also a few places where the FW_HOOKS SDT and the IP provider
probes diverge - eg, ip_wput_frag_mdt().

> ...I'm sure that there is such a thing as too many dtrace
> probe points :)

These are near zero overhead when not enabled (some nops plus some movs to
set registers); although their existence may affect how the compiler
optimizes code (especially if there were loads of probes). In any case, the
CPU overhead is going to need to be measured.

It might be safer to say that there shouldn't be too many stable provider
probes in the code. Unstable probes such as those SDT ones can be dropped if
their overhead proves to be a problem; however, probes can't easily be
dropped from a stable and committed provider.
cheers, Brendan -- Brendan [CA, USA]
Brendan Gregg - Sun Microsystems wrote:
> > But there is a place in IP where there is (at least conceptually) none
> > of conn_t, ire_t, ill_t.
> > That place is when an ICMP error (or TCP reset etc) is generated by IP.
> > Basically that just sends an IP packet, and ip_output finds the ire_t
> > which points at the ill_t.
>
> It looked like routing/forwarding fell into this category also.

In that case you at least have the ill_t on which the packet was received.

> Yes, which should allow a snoop-like raw inspection of received packets.
> So far the only received packets I'm not tracing as IP are those where
> pkt_len < IP_SIMPLE_HDR_LENGTH, since they may not truly be IP at all.

OK

> The xmit stat (ipIfStatsHCOutTransmits) looks like an RFC 4293 MIB
> extension.

Ah - I forgot about the HC counters.

> Yes, this would probably be where an IPsec provider would live; as for
> in the IP providers as well -- that would be interesting, but I can't think
> of a stable abstraction yet. We have ip:::send/receive for bottom of IP
> tracing; what would we call top of IP tracing? ip:::read/write? Should
> this be sdt instead? Could we rely on TCP/UDP probes, which will be at
> the bottom of their stacks and close to the top of IP anyway? ...

One could name them ip:::request/deliver, akin to the names of the stats.

For TCP/UDP/RAW you can depend on the probes at the bottom of those
transports, but that doesn't capture:

 - packets delivered to functions inside IP (icmp errors, IGMP, MLD packets)
 - packets sent by functions inside IP (icmp errors generated, IGMP, MLD)

If such packets are 1) encrypted or 2) the transmit ones are dropped by the
TX path, then you can't see them at the ip:::send/recv probes.

> Yes. Both send and receive should be traceable (unless you are in TCP
> fusion - which I think I'd leave for the TCP provider to trace, rather
> than faking up some IP events).

I think that makes sense.

> > Multicast packets that are looped back because there are members on the
> > transmitting interface?
>
> Yes. Currently looks like this,
>
>    # ./ipio1.d
>               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
>      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
>     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
>         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
>
> A loopback send and receive are visible, but only the non-loopback send
> (as we'd expect).

OK. But will people care about the FUNC names? They are likely to change in
the future.

Erik
Erik Nordmark writes:
> >    # ./ipio1.d
> >               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
> >      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
> >     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
> >         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
> >
> > A loopback send and receive are visible, but only the non-loopback send
> > (as we'd expect).
>
> OK.
> But will people care about the FUNC names? They are likely to change in
> the future.

I think that's a dtrace feature. Module and function name are under the
control of the framework itself. The provider has a name and has freedom in
the probe and argument definitions.

--
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:27 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

On Mon, Jun 18, 2007 at 06:56:37AM -0700, Erik Nordmark wrote:
[...]
> > > What is the intended semantics of cs_loopback?
> >
> > I'd like it to be boolean, but right now it may be more practical as:
> >
> >         -1      unknown
> >          0      not loopback
> >          1      loopback
>
> What is it intended to mean?
[...]

Something has become apparent as I've dug through this code: the
ip*:::send/receive probes are placed so that they always know if the packet
is loopback or not. This means that I can make these part of the probe name,

        ip*:::send              (physical)
        ip*:::receive           (physical)
        ip*:::loopback-send
        ip*:::loopback-receive

Arguments would be,

        args[0]         packet ID
        args[1]         ipinfo_t
        args[2]         ipv4info_t/ipv6info_t
        args[3]         illinfo_t

Many of the IP provider scripts I'd write would be intended to trace
physical traffic, and would use send/receive. Those scripts that want to
observe all traffic would trace both send/receive and
loopback-send/loopback-receive (or *send/*receive).

Elsewhere in the TCP/IP stack, loopback status is not easily known - and
would be provided as an argument rather than part of the probename (and
would include a value for "unknown").

These ip:::loopback* probes are a bonus due to their placement - at the
bottom of IP, when we know where that packet is being delivered to (although
I hope this isn't too confusing when people look for tcp:::loopback* and
don't find them). Does this sound like a stable choice? Can we always expect
the bottom of IP to know whether it is delivering to loopback or not (I'd
think so)?

...

I'm also considering adding the following probes for the ip* providers,

        ip*:::drop-in           (inbound packet dropped)
        ip*:::drop-out          (outbound packet dropped)

The drop-in/out probes could have the following arguments,

        args[0]         packet ID
        args[1]         ipinfo_t
        args[2]         ipv4info_t/ipv6info_t
        args[3]         debug string

The debug string would shed light on why the packet was dropped (in addition
to the function name and the stack trace, as provided by probename and
stack()). The contents of the string could be anything; it is intended for
observability but not for matching. MIB names could be used where available,
perhaps with extensions as is done for ip2dbg(), eg,

        "ipIfStatsInDiscards: discard broadcast"

but some places that drop packets don't have MIB names, eg dropping
multicast packets in dls_accept() when we don't have that address enabled,

        "multicast address not enabled"

I know internationalization could be an issue - I don't see how to address
this easily.

...

Lastly, I'm also considering adding these,

        ip*:::read              (read to IP (top of IP))
        ip*:::write             (write to IP (top of IP))

The read/write probes could have the following arguments,

        args[0]         packet ID
        args[1]         ipinfo_t

I wouldn't assume that ill_t was available, or that all the IPv* headers
were correctly set.

Brendan

--
Brendan
[CA, USA]
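A minimal sketch of how the proposed probe names might be used - assuming
the argument layout above (args[1] as ipinfo_t), which is still prototype:

        # trace all IP traffic, physical and local, summing payload bytes
        dtrace -n 'ip:::send, ip:::receive,
            ip:::loopback-send, ip:::loopback-receive
            { @[probename] = sum(args[1]->ip_plength); }'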
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:31 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Tue, Jun 19, 2007 at 02:02:09PM -0400, James Carlson wrote:
> Erik Nordmark writes:
> > >    # ./ipio1.d
> > >               FUNC:PROBE          SOURCE              DEST   LOOP  BYTES
> > >      ip_multicast_l:send    192.168.1.108 ->     224.0.1.1      1      8
> > >     ip_wput_local:receive   192.168.1.108 ->     224.0.1.1      1      8
> > >         ip_wput_ire:send    192.168.1.108 ->     224.0.1.1      0      8
> > >
> > > A loopback send and receive are visible, but only the non-loopback send
> > > (as we'd expect).
> >
> > OK.
> > But will people care about the FUNC names? They are likely to change in
> > the future.
>
> I think that's a dtrace feature. Module and function name are under
> the control of the framework itself. The provider has a name and has
> freedom in the probe and argument definitions.

Sorry - the ipio1.d script is part useful, part Brendan's debug script - the
FUNC name is provided as James said, and is definitely unstable. I had added
it so that I knew which of my send/receive probes were firing, but that
field would be dropped for any published scripts...

Brendan

--
Brendan
[CA, USA]
On Tue, Jun 19, 2007 at 11:27:27AM -0700, Brendan Gregg - Sun Microsystems wrote:

<mucho snippage deleted!>

> I'm also considering adding the following probes for the ip* providers,
>
>         ip*:::drop-in           (inbound packet dropped)
>         ip*:::drop-out          (outbound packet dropped)
>
> The drop-in/out probes could have the following arguments,
>
>         args[0]         packet ID
>         args[1]         ipinfo_t
>         args[2]         ipv4info_t/ipv6info_t
>         args[3]         debug string
>
> The debug string would shed light on why the packet was dropped (in
> addition to the function name and the stack trace, as provided by
> probename and stack()). The contents of the string could be anything; it
> is intended for observability but not for matching. MIB names could be
> used where available, perhaps with extensions as is done for ip2dbg(), eg,
>
>         "ipIfStatsInDiscards: discard broadcast"
>
> but some places that drop packets don't have MIB names, eg dropping
> multicast packets in dls_accept() when we don't have that address enabled,
>
>         "multicast address not enabled"
>
> I know internationalization could be an issue - I don't see how to
> address this easily.

First off, PLEASE don't re-invent a wheel that's already been invented.

See ipdrop.[ch] in the kernel. We have an internal mechanism in place (the
ipdropper_t) that you can expand/exploit to make anything you want up there
happen. As for the string issue, you could use a level of numeric
indirection to help out. E.g. you may bump ipIfStatsInDiscards for any
number of reasons, but an enhanced ip_drop_packet() would bump that MIB, and
perhaps take a "reason code" or "diagnostic code".

We use the concept of a diagnostic code in our PF_KEY enhancements to
disambiguate the UNIX EINVAL to great effect. You could do what we do, and
map the diagnostic code to easily-translatable strings in a user-space
library!

Dan
Brendan Gregg - Sun Microsystems
2007-Jun-19 18:47 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Dan,

On Tue, Jun 19, 2007 at 02:33:37PM -0400, Dan McDonald wrote:
> On Tue, Jun 19, 2007 at 11:27:27AM -0700, Brendan Gregg - Sun Microsystems wrote:
>
> <mucho snippage deleted!>
>
> > I'm also considering adding the following probes for the ip* providers,
> >
> >         ip*:::drop-in           (inbound packet dropped)
> >         ip*:::drop-out          (outbound packet dropped)
> >
> > [...]
> >
> > I know internationalization could be an issue - I don't see how to
> > address this easily.
>
> First off, PLEASE don't re-invent a wheel that's already been invented.
>
> See ipdrop.[ch] in the kernel. We have an internal mechanism in place (the
> ipdropper_t) that you can expand/exploit to make anything you want up there
> happen. As for the string issue, you could use a level of numeric
> indirection to help out. E.g. you may bump ipIfStatsInDiscards for any
> number of reasons, but an enhanced ip_drop_packet() would bump that MIB,
> and perhaps take a "reason code" or "diagnostic code".
>
> We use the concept of a diagnostic code in our PF_KEY enhancements to
> disambiguate the UNIX EINVAL to great effect. You could do what we do, and
> map the diagnostic code to easily-translatable strings in a user-space
> library!

Awesome - I can just DTrace ip_drop_packet(). Why didn't I see that before?
Oh - there is only ONE call to ip_drop_packet() in the whole of ip.c! I see
267 freemsg()'s in there! Uhh - have I missed something?

I'm very happy to use ip_drop_packet() (from what I've just seen it looks
like the right way). Will the rest of ip.c start using it?

cheers

Brendan

--
Brendan
[CA, USA]
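In the meantime, ip_drop_packet() can already be traced with fbt (unstable,
but handy until stable drop probes exist); a rough one-liner:

        # dtrace -n 'fbt::ip_drop_packet:entry { @drops[stack()] = count(); }'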
On Tue, Jun 19, 2007 at 11:47:35AM -0700, Brendan Gregg - Sun Microsystems wrote:
> G'Day Dan,

Hello!

<mucho snippage deleted!>

> > We use the concept of a diagnostic code in our PF_KEY enhancements to
> > disambiguate the UNIX EINVAL to great effect. You could do what we do,
> > and map the diagnostic code to easily-translatable strings in a
> > user-space library!
>
> Awesome - I can just DTrace ip_drop_packet(). Why didn't I see that before?
> Oh - there is only ONE call to ip_drop_packet() in the whole of ip.c!
> I see 267 freemsg()'s in there! Uhh - have I missed something?

Yes -- this RFE:

        6321434  Created  P3  kernel/tcp-ip
        Every dropped packet in IP should use ip_drop_packet()

> I'm very happy to use ip_drop_packet() (from what I've just seen it looks
> like the right way). Will the rest of ip.c start using it?

Quite frankly, I'm of the opinion that the above RFE fits in with your
Networking Provider work. The interface itself (ip_drop_packet() and
friends) may need a bit of enhancement. I'm more than happy to help with
such architectural changes.

I *do* understand, however, that there's a LOT of scutwork that needs to be
done to make the above RFE happen. Also, if all of TCP/IP starts using
ip_drop_packet(), initialization, etc. will have to happen earlier in IP.
Right now, ip_drop_init() is done as part of ipsec_stack_init(), and if we
generalize it, we'll have to yank it up a level to IP itself.

Dan
Brendan Gregg - Sun Microsystems
2007-Jun-20 13:50 UTC
[dtrace-discuss] DTrace Network Providers, take 2
G'Day Erik,

On Tue, Jun 19, 2007 at 10:56:46AM -0700, Erik Nordmark wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> > Yes, this would probably be where an IPsec provider would live; as for
> > in the IP providers as well -- that would be interesting, but I can't
> > think of a stable abstraction yet. We have ip:::send/receive for bottom
> > of IP tracing; what would we call top of IP tracing? ip:::read/write?
> > Should this be sdt instead? Could we rely on TCP/UDP probes, which will
> > be at the bottom of their stacks and close to the top of IP anyway? ...
>
> One could name them ip:::request/deliver, akin to the names of the stats.
>
> For TCP/UDP/RAW you can depend on the probes at the bottom of those
> transports, but that doesn't capture:
>
>  - packets delivered to functions inside IP (icmp errors, IGMP, MLD
>    packets)

I'm trying out probes within the IP functions to cover those. So far I have
27 probes just for ip*:::deliver.

>  - packets sent by functions inside IP (icmp errors generated, IGMP, MLD)

I've put probes in ip_wput_ire*() to capture those, and I'm testing to see
how that works.

> If such packets are 1) encrypted or 2) the transmit ones are dropped by
> the TX path, then you can't see them at the ip:::send/recv probes.

Yes - ip*:::request/deliver won't match ip*:::send/receive events; I just
need to check that for each instance where they don't match, it makes sense
to do so (eg, an IP request was dropped before the send).

Anyhow, the probe list for IP is getting long. In terms of what probe points
the IP providers should trace - this list should be close to completion.

   # dtrace -ln 'ip*:::'
      ID   PROVIDER            MODULE            FUNCTION NAME
   11406       ipv6                ip          ip_xmit_v6 send
   11410       ipv6                ip      ip_wput_ire_v6 request
   11411       ipv6                ip    ip_wput_local_v6 local-receive
   11416       ipv6                ip    ip_wput_local_v6 local-send
   11421       ipv6                ip          ip_rput_v6 receive
   11424       ipv6                ip     ip_rput_data_v6 deliver
   11425       ipv6                ip    ip_fanout_udp_v6 deliver
   11426       ipv6                ip    ip_fanout_tcp_v6 deliver
   11427       ipv6                ip  ip_fanout_proto_v6 deliver
   11428       ipv6                ip  ip_fanout_sctp_raw deliver
   11438       ipv4                ip       ip_wput_local local-receive
   11439         ip                ip    ip_wput_local_v6 local-receive
   11440         ip                ip       ip_wput_local local-receive
   11445       ipv4                ip       ip_wput_local local-send
   11446         ip                ip    ip_wput_local_v6 local-send
   11447         ip                ip       ip_wput_local local-send
   11466       ipv4                ip       udp_send_data request
   11467       ipv4                ip    tcp_lsosend_data request
   11468       ipv4                ip  tcp_multisend_data request
   11469       ipv4                ip       tcp_send_data request
   11470       ipv4                ip         ip_wput_ire request
   11618       ipv4                ip            ip_input receive
   11619         ip                ip          ip_rput_v6 receive
   11620         ip                ip            ip_input receive
   11667       ipv4                ip       udp_send_data send
   11668       ipv4                ip    tcp_lsosend_data send
   11669       ipv4                ip  tcp_multisend_data send
   11670       ipv4                ip       tcp_send_data send
   11671       ipv4                ip          ip_xmit_v4 send
   11672       ipv4                ip        ip_wput_frag send
   11673       ipv4                ip    ip_wput_frag_mdt send
   11674       ipv4                ip     ip_fast_forward send
   11675         ip                ip       udp_send_data send
   11676         ip                ip    tcp_lsosend_data send
   11677         ip                ip  tcp_multisend_data send
   11678         ip                ip       tcp_send_data send
   11679         ip                ip          ip_xmit_v6 send
   11680         ip                ip          ip_xmit_v4 send
   11681         ip                ip        ip_wput_frag send
   11682         ip                ip    ip_wput_frag_mdt send
   11683         ip                ip     ip_fast_forward send
   11787       ipv4                ip      ip_fanout_sctp deliver
   11788       ipv4                ip  ip_fanout_sctp_raw deliver
   11789       ipv4                ip       ip_sctp_input deliver
   11790       ipv4                ip        ip_tcp_input deliver
   11791       ipv4                ip        ip_udp_input deliver
   11792       ipv4                ip  ip_fanout_udp_conn deliver
   11793       ipv4                ip       ip_fanout_tcp deliver
   11794       ipv4                ip     ip_fanout_proto deliver
   12339       ipv4                ip      ip_drop_packet drop-out
   12340       ipv4                ip      ip_drop_packet drop-in
   12341       ipv6                ip      ip_drop_packet drop-out
   12342       ipv6                ip      ip_drop_packet drop-in

I renamed the loopback* probes to local*, which seems to make more sense as
they are tracing local traffic ("loopback" may suggest lo0 only, which isn't
correct).

I've also been struggling with the probe names ... request and deliver. I
keep remembering one but not the other, or forgetting which way around they
go. I know being MIB name based would be a good thing, but I'd guess that I
have a better familiarity with MIB names than most end users, and if they
aren't helping me then they may be less likely to help others. I'll change
them back to read/write and see how they feel...

...

I dropped probes that I had down in dls and mac, which were to allow
extended IP tracing if the interface was in promiscuous mode. It seemed too
strange - either getting DTrace to put interfaces in promiscuous mode,
creating a new promiscuous-adm command, or getting users to run snoop. Of
course, without these probes there is no way that the IP providers will see
every IP packet on the wire - as that requires changing NIC state. The IP
providers will see every IP packet delivered to IP.

cheers,

Brendan

--
Brendan
[CA, USA]
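With the prototype loaded, these can be enabled in aggregate; e.g. a rough
one-liner counting how often each provider's deliver probes fire:

        # dtrace -n 'ip*:::deliver { @[probeprov] = count(); }'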
Brendan Gregg - Sun Microsystems
2007-Jun-20 21:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Jeremy,

On Wed, Jun 20, 2007 at 09:53:44PM +0100, Jeremy Harris wrote:
> Brendan Gregg - Sun Microsystems wrote:
> >    ID   PROVIDER            MODULE            FUNCTION NAME
> [...]
> > 11668       ipv4                ip    tcp_lsosend_data send
> > 11669       ipv4                ip  tcp_multisend_data send
>
> Aren't these more implementation artifacts than things suitable
> for exposure in a stable provider?

Do you mean the MODULE and FUNCTION columns? Those appear because I typed
"dtrace -l"; the actual stable provider consists of the PROVIDER and NAME
columns, and the probe arguments.

If you meant that LSO and MDT are implementation details, then sure, their
activity is tied to implementation, and the IP packets that they generate
will be observable by the stable IP provider in a non-implementation
specific way (send/receive probes, not lso-send/mdt-send/etc). If you want
to know actual application sends, then you use other network providers up
the stack (TCP, socket, ...).

This is similar to observing I/O from either the io provider or the syscall
layer. What the io provider traces is often implementation activity rather
than application requests (eg, read-ahead).

cheers,

Brendan

--
Brendan
[CA, USA]
Brendan Gregg - Sun Microsystems
2007-Jun-23 02:54 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Folks,

The IP providers have evolved into three parts based on the probe name. Here
is an update for each, and a link to a webrev.

1) send/receive and local-send/local-receive probes

These probes are working well.

   # ./ipio01.d
   NAME       SOURCE                DEST            BYTES INT      PROTO
   send       192.168.1.108      -> 192.168.1.109      68 nge0     TCP
   receive    192.168.1.109      -> 192.168.1.108      68 nge0     TCP
   send       192.168.1.108      -> 192.168.1.109      20 nge0     TCP
   ^C

*send/*receive instrument the bottom of IP - when IP speaks with the network
interface. Also, since IPsec tunnels are considered an interface, we can
also trace as IP speaks to tunnel devices, and as those tunnels speak to the
destination interface,

   # ./ipio01.d
   NAME       SOURCE                DEST            BYTES INT      PROTO
   send       10.7.250.24        -> 192.146.17.75      68 ip.tun0  TCP
   send       192.168.1.108      -> 172.16.10.1       140 nge0     UDP
   receive    172.16.10.1        -> 192.168.1.108     140 nge0     UDP
   receive    192.146.17.75      -> 10.7.250.24        68 ip.tun0  TCP
   send       10.7.250.24        -> 192.146.17.75      20 ip.tun0  TCP
   send       192.168.1.108      -> 172.16.10.1        92 nge0     UDP
   ^C

Great - this is visibility of the actual endpoint IP addresses, as well as
those of the tunnel.

Much can be done from these *send/*receive probes alone; I feel it makes
sense to take these to completion (testing, code tweaking) and putback into
Solaris as the first stage of integration. Having already prototyped the
other probes (and other providers), I'm confident that these *send/*receive
probes can be integrated without impeding future probe integration. I still
need to perform more testing to confirm that these probes are instrumenting
different kinds of traffic correctly.

2) drop-in/drop-out probes

Tracing ip_drop_packet() properly is very little work (and mostly done);
what will be time consuming is completing RFE 6321434 - so that every
dropped packet in IP will use ip_drop_packet(). I don't know if I am or
should be taking on RFE 6321434 yet; I'm certainly trying to help from a
DTrace perspective, but I have commitments to other projects in the coming
months (in fact, I should really be working on something else right now)...

3) read/write probes

This is for tracing the top of the IP layer - when upper level protocols
such as TCP and UDP speak to IP.

   # ./ipio03.d
   NAME       SOURCE                DEST            BYTES INT
   write      192.168.1.108      -> 192.168.1.109      68 nge0
   send       192.168.1.108      -> 192.168.1.109      68 nge0
   receive    192.168.1.109      -> 192.168.1.108      68 nge0
   read       192.168.1.109      -> 192.168.1.108      68 nge0
   write      192.168.1.108      -> 192.168.1.109      20 nge0
   send       192.168.1.108      -> 192.168.1.109      20 nge0
   ^C

The above output shows the read/write probes traced along with the
send/receive probes -- the flow of data can be seen as it is written to IP
and then sent, received by IP and then read, etc.

Note that I'm exporting the interface at the read/write level -- I'm very
cautious about what information is stable at this point; however, I found
that when using the prototype provider it was a real pain not to know the
interface. Looking through the code to see what was possible, I found that
almost all of the MIB stats at this level (Requests/Delivers) are written
from an ill_t - so interface information was usually there. Note that this
is the expected interface, as these read/write probes are tracing what was
asked of IP -- only at send/receive do you know what really happened...

And for an IPsec tunnel,

   # ./ipio03.d
   NAME       SOURCE                DEST            BYTES INT
   write      10.7.250.24        -> 192.146.17.75      68 ip.tun0
   send       10.7.250.24        -> 192.146.17.75      68 ip.tun0
   write      192.168.1.108      -> 172.16.10.1        88 nge0
   send       192.168.1.108      -> 172.16.10.1       140 nge0
   receive    172.16.10.1        -> 192.168.1.108     140 nge0
   read       172.16.10.1        -> 192.168.1.108     140 nge0
   receive    192.146.17.75      -> 10.7.250.24        68 ip.tun0
   read       192.146.17.75      -> 10.7.250.24        68 ip.tun0
   write      10.7.250.24        -> 192.146.17.75      20 ip.tun0
   send       10.7.250.24        -> 192.146.17.75      20 ip.tun0
   write      192.168.1.108      -> 172.16.10.1        40 nge0
   send       192.168.1.108      -> 172.16.10.1        92 nge0
   ^C

Note the payload size (BYTES) increase as encryption headers are added, and
decrease as they are removed.

There are many places I've been adding these DTrace read/write macros to,
and while I'm catching all the traffic I've thrown at it correctly, I
wouldn't be surprised if I missed a code path or several (due to
optimizations, the IP code is a massive game of snakes and ladders). This
will need more testing (a test suite?), and probably more probe tweaking.

In summary, prototyping the IP providers is coming along really well, and
I've reached a point where it may be wise to think more seriously about
putting back the IP *send/*receive probes as a first step. I've created a
new website for these prototypes,

   http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2

at the bottom of which is a link to the webrev, if anyone is interested in
getting a better understanding of this work so far (it's not a code review,
at least not yet anyway :). I'll listen to feedback for a while, and then
see if it is appropriate to generate another webrev - but this time as a
suggested putback for the *send/*receive probes.

cheers,

Brendan

--
Brendan
[CA, USA]
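ipio01.d and ipio03.d are Brendan's private test scripts; a rough sketch of
the shape of such a script - assuming args[1] is the proposed ipinfo_t (the
INT and PROTO columns come from other, still-prototype arguments, so they
are omitted here):

        #!/usr/sbin/dtrace -s
        /* hypothetical reconstruction, not the actual ipio03.d */
        #pragma D option quiet

        ip:::write, ip:::send, ip:::receive, ip:::read
        {
                printf("%-8s %15s -> %-15s %6d\n", probename,
                    args[1]->ip_saddr, args[1]->ip_daddr,
                    args[1]->ip_plength);
        }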
Darren.Reed at Sun.COM
2007-Jun-23 04:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Mon, Jun 18, 2007 at 10:51:33AM -0700, Darren.Reed at sun.com wrote:
>
> ...
>
> > ...I'm sure that there is such a thing as too many dtrace
> > probe points :)
>
> These are near zero overhead when not enabled (some nops plus some movs to
> set registers); although their existence may affect how the compiler
> optimizes code (especially if there were loads of probes). In any case,
> the CPU overhead is going to need to be measured.

Has there been any measurement or thought about the impact of too many
dtrace probes resulting in excessive "pollution" of the i-cache by
instructions that "do nothing"? In most cases I imagine that dtrace probes
are quite sparsely distributed through the kernel...

In addition, with hot paths where we're sensitive to every branch or
instruction executed, how should we quantify a dtrace probe's impact on
execution?

I mention this because other networking projects are looking very closely at
various parts of tcp/ip to try and work out how we can slim it down. Without
doubt there are bigger problems than removing dtrace probes, but at the same
time, if there is a quantifiable cost then we need to think about how we go
about applying dtrace to networking in order to get the best return for the
smallest CPU cost.

I'm most concerned that there are dtrace probes almost next to each other,
e.g. from ip.c:

        DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
        if (mp == NULL)
                goto drop;

        DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
        DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, ipha,
            ill_t *, stq_ill);

To me it seems that ip4-physical-out-end is almost an alias for send here,
if only the DTRACE_IP2() was before the if(). How can we tell dtrace that
sdt:::ip4-physical-out-end is an alias for ip:::send or similar?

or:

        /*
         * DTrace this as ip:::receive and ipv4:::receive. This
         * special case test avoids a particular code path where
         * IPsec packets are passed through here twice, both before
         * and after nattymod, and we only want to trace them once.
         * If a better way is found, this test will be dropped.
         */
        if (!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP ||
            ipha->ipha_protocol == IPPROTO_AH))) {
                DTRACE_IP2(receive, mblk_t *, mp, void_ip_t *, ipha);
                DTRACE_IPV4_3(receive, mblk_t *, mp, ipha_t *, ipha,
                    ill_t *, ill);
        }

        /*
         * The event for packets being received from a 'physical'
         * interface is placed after validation of the source and/or
         * destination address as being local so that packets can be
         * redirected to loopback addresses using ipnat.
         */
        DTRACE_PROBE4(ip4__physical__in__start,
            ill_t *, ill, ill_t *, NULL,
            ipha_t *, ipha, mblk_t *, first_mp);

Sure, there are lots of NOPs and other cheap instructions being executed
here, but we are lengthening the default code path for packets - or at the
very least the spread of instructions, so that where a cache line fill would
get us lots of useful things to do, now it gets fewer... but I acknowledge
that this is probably nit picking, and that energy needs to be first spent
elsewhere looking for things to do to make networking faster.

...and without wanting to do code review, I'll add that putting another if()
into ip_input() just for dtrace probes is not an ideal solution... every
if() in ip_input() has a measurable cost to network performance.

It would be nice if that if() could be part of the dtrace probe and only
active when the probe was active, e.g.

        DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \
            ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp,
            void_ip_t *, ipha);

Darren

p.s. your version of "wx webrev" generates "frames" that don't appear to
work well with mozilla from Solaris 10.
Garrett D'Amore
2007-Jun-24 22:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> These are near zero overhead when not enabled (some nops plus some movs to
> set registers); although their existence may affect how the compiler
> optimizes code (especially if there were loads of probes). In any case,
> the CPU overhead is going to need to be measured.

I've been spending a lot of time working on shortening the code paths in IP,
so I guess it's time for me to interject in the discussion.

At 10GbE speeds, or with small packets at 1GbE speeds (64 byte frames, for
example), we are completely CPU bound. I have done a lot of work (not yet
committed) to improve the cost in the IP stack, by simplifying certain code
and removing some redundant checks, etc. I've found, in general, the cost of
each additional branch to be ~0.1 to 0.2% performance difference. I've not
tried to measure the incremental impact of a nop.

If this project is going to lengthen the code by adding more probes, or by
adding any additional instructions (even nops!), then I think it is critical
to measure the performance impact. The best way to do this is to try testing
two code paths:

 * netperf with 64-byte UDP packets
 * IP forwarding performance with 64-byte packets (not TCP or UDP)

Make sure to use an e1000g or bge card (not nxge or ce). Frankly, I'd prefer
the testing was done with e1000g at this point, because I know what the
performance considerations for it are. :-)

If someone makes bfu archives available, I can actually do the second kind
of testing with hardware I have at hand. Once we know what the actual impact
is quantitatively, we can decide what the appropriate next steps are.

-- Garrett
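For the first test, a typical invocation would be something along these
lines (remotehost is a placeholder, and exact flags may vary by netperf
version):

        # netperf -H remotehost -t UDP_STREAM -- -m 64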
Brendan Gregg - Sun Microsystems
2007-Jun-29 04:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Darren,

On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
> > These are near zero overhead when not enabled (some nops plus some movs
> > to set registers); although their existence may affect how the compiler
> > optimizes code (especially if there were loads of probes). In any case,
> > the CPU overhead is going to need to be measured.
>
> Has there been any measurement or thought about the impact of
> too many dtrace probes resulting in excessive "pollution" of
> the i-cache by instructions that "do nothing"? In most cases
> I imagine that dtrace probes are quite sparsely distributed
> through the kernel...

Has anyone noticed a problem already? There are *already* more DTrace probes
in TCP/IP than what I'm suggesting to add. For example, the following three
commands were run at the same time,

   # snoop -r
   Using device nge0 (promiscuous mode)
   192.168.1.109 -> 192.168.1.108  TELNET C port=40965 d
   192.168.1.108 -> 192.168.1.109  TELNET R port=40965 d
   192.168.1.109 -> 192.168.1.108  TELNET C port=40965
   ^C

   # dtrace -qn 'ip*::: { @[probeprov, probename] = count();
       @["", "TOTAL:"] = count(); } END { printa("%12s %24s %@8d\n", @); }'
   ^C
             ip                     send        1
           ipv4                     send        1
           ipv4                    write        1
             ip                  receive        2
           ipv4                     read        2
           ipv4                  receive        2
                                  TOTAL:        9

   # dtrace -qn 'mib:::,sdt:::ip*,sdt:::squeue*,sdt:::conn*,sdt:::hook*
       { @[probeprov, probename] = count(); @["", "TOTAL:"] = count(); }
       END { printa("%12s %24s %@8d\n", @); }'
   ^C
            mib     ipIfStatsHCOutOctets        1
            mib   ipIfStatsHCOutRequests        1
            mib  ipIfStatsHCOutTransmits        1
            mib            tcpInAckBytes        1
            mib             tcpInAckSegs        1
            mib    tcpInDataInorderBytes        1
            mib     tcpInDataInorderSegs        1
            mib          tcpOutDataBytes        1
            mib           tcpOutDataSegs        1
            mib             tcpRttUpdate        1
            sdt     ip4-physical-out-end        1
            sdt   ip4-physical-out-start        1
            sdt           squeue-enqueue        1
            sdt      squeue-enqueuechain        1
            mib    ipIfStatsHCInDelivers        2
            mib      ipIfStatsHCInOctets        2
            mib    ipIfStatsHCInReceives        2
            sdt      ip4-physical-in-end        2
            sdt    ip4-physical-in-start        2
            sdt          squeue-proc-end       12
            sdt        squeue-proc-start       12
            sdt             conn-dec-ref       15
            sdt             conn-inc-ref       15
                                  TOTAL:       78

So while the ip providers fired 9 probes for those 3 packets, the existing
sdt and mib providers fired 78 - 26 probes per packet!

Ok, I did pick the worst example I could find :-) it can get to around 10
probes per packet, much better than 26, but still much more than the 2 per
packet I'm suggesting we add.

Anyway, while that may be interesting to note, there have been measurements
and thoughts about this,

Measurements:

For packet based tests using netperf and ttcp, the overhead was not visible
through the noise. PIC based measurements for i-cache suggested a drop in
hit-rate of about 0.01%, however this was also close to the noise.

DTrace instruction sampling suggests that three non-enabled probes take
around 7 ns to execute on a 2.4 GHz AMD64 CPU, and the two ip_input probes
with their if statement take around 11 ns. The execution times of the probes
and if statement appear negligible, so long as they access recently used
(cached) variables.

As for the overall effect on ip due to i-cache pollution: DTrace sampling of
the entire ip module versus packet counts showed a per packet increase in ip
time of around 50 ns. Adding and removing probes seemed to change this
measured overhead in random ways, suggesting that it is either noise, or
that the effect of adding instructions doesn't necessarily mean things get
slower (subsequent instructions are shifted to different addresses, which
map to the cache differently, possibly relieving or creating hot spots).
That DTrace is sampling to take these measurements is also affecting the
behaviour of the caches.

Thoughts:

Each probe addition adds 1 to 5 nops, and a few movs of recently accessed
data which we would expect to still be cached. Given a maximum packet rate
per CPU per second of 200,000, the execution overhead of some nops and
cached movs per packet should be negligible. The effect on i-cache is harder
to estimate, and is probably only best checked through measurements.

In summary, the execution time of the probes is negligible when considering
the per CPU packet rates; the i-cache effect looks negligible, and from the
tests was more affected by code layout than the existence or non-existence
of probes.

> In addition, with hot paths where we're sensitive to every branch
> or instruction executed, how should we quantify a dtrace probe's
> impact on execution?
>
> I mention this because other networking projects are looking
> very closely at various parts of tcp/ip to try and work out
> how we can slim it down.
>
> Without doubt there are bigger problems
> than removing dtrace probes, but at the same time, if there is
> a quantifiable cost then we need to think about how we go about
> applying dtrace to networking in order to get the best return
> for the smallest CPU cost.

I'd suggest this strategy.

1) For development/troubleshooting probes, use fbt.

fbt is free in terms of non-enabled cost. In fact, many of those sdt probes
could be served by fbt. eg,

        ipxmit_state_t
        ip_xmit_v4(mblk_t *mp, ire_t *ire, ipsec_out_t *io,
            boolean_t flow_ctl_enabled)
        {
                nce_t           *arpce;
                ipha_t          *ipha;
                queue_t         *q;
                int             ill_index;
                mblk_t          *nxt_mp, *first_mp;
                boolean_t       xmit_drop = B_FALSE;
                ip_proc_t       proc;
                ill_t           *out_ill;
                int             pkt_len;

                arpce = ire->ire_nce;
                ASSERT(arpce != NULL);

                DTRACE_PROBE2(ip__xmit__v4, ire_t *, ire, nce_t *, arpce);
        [...]

fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4. There are
other examples in ip that aren't as obvious and would involve much digging
and probe association with DTrace, but are still possible from fbt instead.

2) Use sdt if fbt fails in some way.

One advantage of private sdt probes is that they can be removed later if
they prove to be a performance problem.

3) Add public and stable provider probes if you really must, such as for an
end-user interface.

> I'm most concerned that there are
> dtrace probes almost next to each other, e.g. from ip.c:
>
>         DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
>         if (mp == NULL)
>                 goto drop;
>
>         DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, ipha);
>         DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, ipha,
>             ill_t *, stq_ill);
>
> To me it seems that ip4-physical-out-end is almost an alias for
> send here, if only the DTRACE_IP2() was before the if().
>
> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias
> for ip:::send or similar?

We can't - DTrace doesn't support aliasing like that.

> or:
>
> [...]
>
> Sure, there are lots of NOPs and other cheap instructions being
> executed here, but we are lengthening the default code path for
> packets - or at the very least the spread of instructions, so that
> where a cache line fill would get us lots of useful things to
> do, now it gets fewer... but I acknowledge that this is probably
> nit picking, and that energy needs to be first spent elsewhere
> looking for things to do to make networking faster.
>
> ...and without wanting to do code review, I'll add that putting
> another if() into ip_input() just for dtrace probes is not an
> ideal solution... every if() in ip_input() has a measurable cost
> to network performance.

Understood - I'm trying to avoid them.

> It would be nice if that if() could be part of the dtrace probe
> and only active when the probe was active, e.g.
>
>         DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \
>             ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp,
>             void_ip_t *, ipha);

Cool idea - I don't think it's necessary yet, but if the probe additions
start to use many if statements, then it's good to have such options to
try. :)

cheers,

Brendan

--
Brendan
[CA, USA]
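To make strategy 1 concrete: since fbt arguments are typed via CTF, a rough
one-liner can read ip_xmit_v4()'s arguments directly - e.g. counting
transmits by ire type (a sketch only; fbt probes are inherently unstable):

        # dtrace -n 'fbt::ip_xmit_v4:entry { @[args[1]->ire_type] = count(); }'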
Garrett D'Amore
2007-Jun-29 05:30 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
>
> > Has there been any measurement or thought about the impact of
> > too many dtrace probes resulting in excessive "pollution" of
> > the i-cache by instructions that "do nothing"? In most cases
> > I imagine that dtrace probes are quite sparsely distributed
> > through the kernel...
>
> Has anyone noticed a problem already? There are *already* more DTrace
> probes in TCP/IP than what I'm suggesting to add. For example, the
> following three commands were run at the same time,

In short, *yes*. We see problems with the length of the code paths, and
negative impacts on the number of packets that the system can process. Every
cycle is precious. I need to do the analysis on your proposed probes, but
I'll do it soon.

Note that with TCP, you won't see the problem. But try running a protocol
that uses 64-byte (or at high speed, even 200 or 1500 byte) UDP packets.
Most people assume that "I didn't see a performance penalty" means something
when they're testing TCP stream performance, but that's a fallacy. When
doing large packets with TCP, of *course* you don't notice. But run anything
other than TCP, and you'll find out where the *real* problems in Solaris'
networking performance lie.

-- Garrett
PIC based measurements for i-cache suggested a drop in > hit-rate of about 0.01%, however this was also close to the noise. > > DTrace instruction sampling suggests that three non-enabled probes take > around 7 ns to execute on a 2.4 GHz AMD64 CPU, and the two ip_input probes > with their if statement takes around 11 ns. The execution times of the > probes and if statement appears negligible, so long as they access recently > used (cached) variables. > > As for the overall effect on ip due to i-cache pollution; DTrace sampling > of the entire ip module versus packet counts showed a per packet increase > in ip time of around 50 ns. Adding and removing probes seemed to change > this measured overhead in random ways, suggesting that it is either noise, > or that the effect of adding instructions doesn''t necessarily mean things > get slower (subsequent instructions are shifted to different addresses, > which map to the cache differently, possibly relieving or creating hot > spots). That DTrace is sampling to take these measurements is also affecting > the behaviour of the caches. > > Thoughts: > > Each probe addition adds 1 to 5 nops, and a few movs of recently > accessed data which we would expect to still be cached. Given a maximum > packet rate per CPU per second of 200,000, the execution overhead of some > nops and cache movs per packet should be negligible. The effect on i-cache > is harder to estimate, and is probably only best checked through > measurements. > > In summary, the execution time of the probes is negligible when considering > the per CPU packet rates; the i-cache effect looks negligble, and from the > tests was more affected by code layout than the existance or non-existance > of probes. > > >> In addition, with hot paths where we''re sensitive to every branch >> or instruction executed, how should we quantify a dtrace probe''s >> impact on execution? >> >> I mention this because other networking projects are looking >> very closely at various parts of tcp/ip to try and work out >> how we can slim it down. >> >> Without doubt there are bigger problems >> than removing dtrace probes but at the same time, if there is >> a quantifiable cost then we need to think about how we go about >> applying dtrace to networking in order to get the best return >> for the smallest CPU cost. >> > > I''d suggest this strategy. > > 1) For development/troubleshooting probes, use fbt. > > fbt is free in terms on non-enabled cost. In fact, many of those sdt probes > could be served by fbt. eg, > > ipxmit_state_t > ip_xmit_v4(mblk_t *mp, ire_t *ire, ipsec_out_t *io, boolean_t flow_ctl_enabled) > { > nce_t *arpce; > ipha_t *ipha; > queue_t *q; > int ill_index; > mblk_t *nxt_mp, *first_mp; > boolean_t xmit_drop = B_FALSE; > ip_proc_t proc; > ill_t *out_ill; > int pkt_len; > > arpce = ire->ire_nce; > ASSERT(arpce != NULL); > > DTRACE_PROBE2(ip__xmit__v4, ire_t *, ire, nce_t *, arpce); > [...] > > fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4. There are > other examples in ip that aren''t as obvious and would involve much digging > and probe association with DTrace, but are still possible from fbt instead. > > 2) use sdt if fbt fails in some way. > > One advantage of private sdt probes is that they can be removed later > if proved to be a performance problem. > > 3) add public and stable provider probes if you really must, such as for > an end-user interface. 
> > >> I''m most concerned that there are >> dtrace probes almost next to each other, e.g from ip.c: >> >> 14250 DTRACE_PROBE1(ip4__physical__out__end, mblk_t >> *, >> 14251 mp); >> 14252 if (mp == NULL) >> 14253 goto drop; >> 14254 >> 14255 DTRACE_IP2(send, mblk_t *, mp, void_ip_t *, >> ipha); >> 14256 DTRACE_IPV4_3(send, mblk_t *, mp, ipha_t *, >> ipha, >> 14257 ill_t *, stq_ill); >> >> To me it seems that ip4-physical-out-end is almost an alias for >> send here, if only the DTRACE_IP2() was before the if(). >> >> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias >> for ip:::send or similar? >> > > We can''t - DTrace doesn''t support aliasing like that. > > >> or: >> >> 15205 /* >> 15206 * DTrace this as ip:::receive and ipv4:::receive. >> This >> 15207 * special case test avoids a particular code path >> where >> 15208 * IPsec packets are passed through here twice, both >> before >> 15209 * and after nattymod, and we only want to trace them >> once. >> 15210 * If a better way is found, this test will be >> dropped. >> 15211 */ >> 15212 if (!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || >> 15213 ipha->ipha_protocol == IPPROTO_AH))) { >> 15214 DTRACE_IP2(receive, mblk_t *, mp, void_ip_t >> *, ipha); >> 15215 DTRACE_IPV4_3(receive, mblk_t *, mp, ipha_t >> *, ipha, >> 15216 ill_t *, ill); >> 15217 } >> 15218 >> 15219 /* >> 15220 * The event for packets being received from a >> ''physical'' >> 15221 * interface is placed after validation of the source >> and/or >> 15222 * destination address as being local so that packets >> can be >> 15223 * redirected to loopback addresses using ipnat. >> 15224 */ >> 15225 DTRACE_PROBE4(ip4__physical__in__start, >> 15226 ill_t *, ill, ill_t *, NULL, >> 15227 ipha_t *, ipha, mblk_t *, first_mp); >> >> Sure, there are lots of NOPs and other cheap instructions being >> executed here, but we are lengthening the default code path for >> packets - or at very least the spread of instructions so that >> where a cache line fill would get us lots of useful things to >> do, now it gets fewer...but I acknowledge that this is probably >> nit picking and that energy needs to be first spent elsewhere >> looking for things to do to make networking faster. >> >> ...and without wanting to do code review, I''ll add that putting >> another if() into ip_input() just for dtrace probes is not an >> ideal solution...every if() in ip_input() has a measurable cost >> to network performance. >> > > Understood - I''m trying to avoid them. > > >> It would be nice if that if() could be part of the dtrace probe >> and only active when the probe was active, e.g. >> >> DTRACE_IF_IP2((!(!mhip && (ipha->ipha_protocol == IPPROTO_ESP || \ >> ipha->ipha_protocol == IPPROTO_AH))), receive, mblk_t *, mp, >> void_ip_t *, ipha); >> > > Cool idea - I don''t think it''s necessary yet, but if the probe additions > start to use many if statements, then it''s good to have such options to > try. :) > > cheers, > > Brendan > >
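A minimal, untested D sketch of the fbt substitution described in point 1
above, using only the names visible in the quoted ip_xmit_v4() source (the
aggregation name is invented):

    /* count transmits by nce - roughly what sdt:::ip-xmit-v4 exposed */
    fbt::ip_xmit_v4:entry
    {
    	/* args[1] is the ire_t *; ire_nce is what the sdt probe passed */
    	@xmits[(uint64_t)args[1]->ire_nce] = count();
    }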
Jeremy Harris
2007-Jun-29 15:20 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> In fact, many of those sdt probes
> could be served by fbt. eg,
>
> [... ip_xmit_v4() source as quoted above ...]
>
> fbt::ip_xmit_v4:entry can be used instead of sdt:::ip-xmit-v4.

Does this say that there needs to be a way to declare a stable probe
which happens to be implemented by an fbt, but which does not get lost
in some future code refactoring?

Cheers,
   Jeremy
Brendan Gregg - Sun Microsystems
2007-Jun-29 18:01 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
G'Day Garrett,

On Thu, Jun 28, 2007 at 10:30:19PM -0700, Garrett D'Amore wrote:
> Brendan Gregg - Sun Microsystems wrote:
[...]
>> Has anyone noticed a problem already? There are *already* more DTrace
>> probes in TCP/IP than what I'm suggesting to add. For example, the
>> following three commands were run at the same time,
>
> In short, *yes*. We see problems with the length of the code paths, and
> negative impacts on the number of packets that the system can process.
> Every cycle is precious. I need to do the analysis on your proposal,
> but I'll do it soon.
>
> Note that with TCP, you won't see the problem. But try running a
> protocol that uses 64-byte (or at high speed, even 200 or 1500 byte)
> UDP packets.
>
> Most people fall for the fallacy that "I didn't see a performance
> penalty" means something when they're testing TCP stream performance.

It does mean something - if you are a webserver (TCP), web proxy server
(TCP), mail server (TCP), database server (TCP), application server
(TCP), file system server (TCP), and so on; and also if you care about
SPECweb, SPECsfs, ... Verifying TCP performance is the most critical
task when considering the effect on customer servers. It isn't a fallacy
at all; it would be ignorant *not* to test TCP as well as UDP and
others.

> When doing large packets with TCP, of *course* you don't notice. But
> run anything other than TCP, and you'll find out where the *real*
> problems in Solaris' networking performance lie.

So, apart from Solaris routers, DNS servers and IPsec gateways, what
other servers would be under heavy load and be using something other
than TCP?

I'm certainly interested in testing UDP and forwarding performance,
along with TCP performance in particular. I'm measuring the effect of
these probe additions - whether they coincide with existing "real"
problems in the Solaris networking stack or not.

cheers,

Brendan

-- 
Brendan [CA, USA]
James Carlson
2007-Jun-29 18:31 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems writes:
>> When doing large packets with TCP, of *course* you don't notice. But
>> run anything other than TCP, and you'll find out where the *real*
>> problems in Solaris' networking performance lie.
>
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
systems doing VoIP. Rao Shoaib probably has contacts for that latter
category.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677
Michael Hunter
2007-Jun-29 19:07 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 12:53:45 -0600
Neil Putnam <Neil.Putnam at Sun.COM> wrote:
> James Carlson wrote:
[...]
>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>> category.
>
> How about Oracle RAC? Or various financial "UDP streaming" applications?
[...]

It's not just financial. Multicast and unicast audio and video
distribution systems (essentially the made-stale-by-time protocols,
usually affinitized with VoIP). Some years ago some (most?) of the
online multi-user games were gross hacks on top of UDP, mostly
attempting to circumvent congestion control.

It sometimes amazes me that a protocol modeled on serial communications
could be dominant for so long. It's really a poor fit for a lot of the
things it's (ab)used for.

mph
Garrett D''Amore
2007-Jun-29 19:50 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Michael Hunter wrote:
> On Fri, 29 Jun 2007 12:53:45 -0600
> Neil Putnam <Neil.Putnam at Sun.COM> wrote:
>> James Carlson wrote:
> [...]
>>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>>> category.
>>
>> How about Oracle RAC? Or various financial "UDP streaming" applications?
> [...]
>
> It's not just financial. Multicast and unicast audio and video
> distribution systems (essentially the made-stale-by-time protocols,
> usually affinitized with VoIP). Some years ago some (most?) of the
> online multi-user games were gross hacks on top of UDP, mostly
> attempting to circumvent congestion control.
>
> It sometimes amazes me that a protocol modeled on serial communications
> could be dominant for so long. It's really a poor fit for a lot of the
> things it's (ab)used for.

When networks are reliable (and you don't need multicast), TCP works
pretty well. The problem is when you have networks in the middle that
start dropping frames... then TCP goes to hell in a handbasket.

The reality is that unless you start playing with wireless networks, the
mostly-lossless-ordered-packets requirements generally hold true. And
even the wireless networks generally try to detect and correct for
loss/corruption at the link layer, so TCP corrections never come into
play.

Anyway, I never meant to suggest that TCP performance tuning wasn't
appropriate, only that it was *insufficient*. We
(Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
the other protocols (and their performance) than we have in the past.

    -- Garrett
Jeremy Harris
2007-Jun-29 20:15 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Finance houses, running Tibco Rendezvous.

- Jeremy Harris
Michael Hunter
2007-Jun-29 20:28 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 12:50:46 -0700
Garrett D'Amore <garrett at damore.org> wrote:
> Michael Hunter wrote:
[...]
> When networks are reliable (and you don't need multicast), TCP works
> pretty well. The problem is when you have networks in the middle that
> start dropping frames... then TCP goes to hell in a handbasket.

Maybe it's just the socket API. But I remember constantly repeating to
customers of mine some years back that they couldn't depend on
write() == read(). Or that the protocol-level acks didn't imply that
the data had gotten to the application. Cranking down TCP keepalive
doesn't make for a good application heartbeat. On and on. Often they
just wanted a reliable datagram protocol, but that's a big jump from
UDP. The ones that tried to go that route often botched basic protocol
considerations.

> The reality is that unless you start playing with wireless networks, the
> mostly-lossless-ordered-packets requirements generally hold true. And
> even the wireless networks generally try to detect and correct for
> loss/corruption at the link layer, so TCP corrections never come into
> play.

Don't know what the lower-latency wireless stuff uses for this[1], but
interaction between link-level error correction and error correction
(retransmission) at higher levels isn't always all that clean or
obvious.

> Anyway, I never meant to suggest that TCP performance tuning wasn't
> appropriate, only that it was *insufficient*. We
> (Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
> the other protocols (and their performance) than we have in the past.

I didn't either. I was just bemoaning a deficiency in the understanding
and availability of other transport-level protocol offerings, and trying
to extend the feel that Brendan was getting for protocol usage in the
real world.

mph

[1] I love the mathematics and ideas behind FEC. There was a group doing
research on using FEC (redundancy) in TCP to trade off bandwidth for
some probabilistic bound on transmission time in the late 90s. I thought
that could be used to solve some of the pseudo-realtime problems I saw
people trying to solve by beating their head against TCP. Oh well, that
was a different life.
Michael Hunter
2007-Jun-29 20:51 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, 29 Jun 2007 13:28:26 -0700
Michael Hunter <Michael.Hunter at Sun.COM> wrote:
> Don't know what the lower-latency wireless stuff uses for this[1], but
> interaction between link-level error correction and error correction
> (retransmission) at higher levels isn't always all that clean or
> obvious.

s/levels/latencies/
Garrett D''Amore
2007-Jun-29 21:16 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Michael Hunter wrote:
> On Fri, 29 Jun 2007 13:28:26 -0700
> Michael Hunter <Michael.Hunter at Sun.COM> wrote:
>> Don't know what the lower-latency wireless stuff uses for this[1], but
>> interaction between link-level error correction and error correction
>> (retransmission) at higher levels isn't always all that clean or
>> obvious.
>
> s/levels/latencies/

Certainly with higher-latency networks it could wreak havoc. But for
802.11, it happens so quickly that normally TCP would never notice.

    -- Garrett
Mike Gerdts
2007-Jun-29 21:42 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On 6/29/07, Brendan Gregg - Sun Microsystems <brendan at sun.com> wrote:
> So, apart from Solaris routers, DNS servers and IPsec gateways, what
> other servers would be under heavy load and be using something other
> than TCP?

Oracle RAC's cache fusion (cluster interconnect) sends database blocks
between nodes using UDP (or LLT if Veritas is in the mix). These
transfers are mostly around 8 KB, and can be quite heavy if the various
nodes (Oracle instances) are using the same blocks. This has been a
particular area of contention when each node is a significant portion of
a 15k/25k.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Darren.Reed at Sun.COM
2007-Jun-30 00:27 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 22, 2007 at 09:01:26PM -0700, Darren.Reed at Sun.COM wrote:
>> Brendan Gregg - Sun Microsystems wrote:
> [...]
>>> These are near zero overhead when not enabled (some nops plus some movs
>>> to set registers); although their existence may affect how the compiler
>>> optimizes code (especially if there were loads of probes). In any case,
>>> the CPU overhead is going to need to be measured.
>>>
>> Has there been any measurement or thoughts about the impact of
>> too many dtrace probes resulting in excessive "pollution" of
>> the i-cache by instructions that "do nothing"? In most cases
>> I imagine that dtrace probes are quite sparsely distributed
>> through the kernel...
>>
> Has anyone noticed a problem already? There are *already* more DTrace
> probes in TCP/IP than what I'm suggesting to add. For example, the
> following three commands were run at the same time,
>
>  # snoop -r
>  Using device nge0 (promiscuous mode)
>  192.168.1.109 -> 192.168.1.108 TELNET C port=40965 d
>  192.168.1.108 -> 192.168.1.109 TELNET R port=40965 d
>  192.168.1.109 -> 192.168.1.108 TELNET C port=40965
>  ^C

As Garrett alluded to, the situation that we're most concerned about
with performance is not just local delivery but packet forwarding, where
there are various efforts underway to try and reduce the size of the
code path. I'm sure that the length of the code path to get data to/from
telnet leans heavily towards delivery into TCP.

...

> [... mib and sdt probe counts as quoted earlier ...]
>                                 TOTAL:       78
>
> So while the ip providers fired 9 probes for those 3 packets, the existing
> sdt and mib providers fired 78 (26 probes per packet!)
>
> Ok, I did pick the worst example I could find :-) it can get to around
> 10 probes per packet, much better than 26, but still much more than the
> 2 per packet I'm suggesting we add.

Understood, and your point taken.

....

> As for the overall effect on ip due to i-cache pollution: DTrace sampling
> of the entire ip module versus packet counts showed a per-packet increase
> in ip time of around 50 ns. Adding and removing probes seemed to change
> this measured overhead in random ways, suggesting that it is either noise,
> or that the effect of adding instructions doesn't necessarily mean things
> get slower (subsequent instructions are shifted to different addresses,
> which map to the cache differently, possibly relieving or creating hot
> spots). That DTrace is sampling to take these measurements also affects
> the behaviour of the caches.

Yes, indeed.

> Thoughts:
>
> Each probe addition adds 1 to 5 nops, and a few movs of recently
> accessed data which we would expect to still be cached. Given a maximum
> packet rate per CPU per second of 200,000, the execution overhead of some
> nops and cached movs per packet should be negligible. The effect on
> i-cache is harder to estimate, and is probably best checked through
> measurements.
>
> In summary, the execution time of the probes is negligible when
> considering the per-CPU packet rates; the i-cache effect looks
> negligible, and from the tests was more affected by code layout than the
> existence or non-existence of probes.

I'm not really worried about the data cache, just the i-cache and the
CPU pipeline. Throwing in some NOPs and otherwise redundant instructions
every now and then doesn't seem like too big of a sin, but filling an
entire i-cache row or two with NOPs seems... wasteful.

When in the pipeline the CPU decides to throw away a NOP would by and
large determine what it means to have 2-3 vs 6-9 of them (from 1 vs 3
dtrace probes next to each other). This may sound silly, but I'd have
less of an issue about this if the same probes were further away from
each other :) However, this really is micro-optimisation.

>> In addition, with hot paths where we're sensitive to every branch
>> or instruction executed, how should we quantify a dtrace probe's
>> impact on execution?
>>
>> I mention this because other networking projects are looking
>> very closely at various parts of tcp/ip to try and work out
>> how we can slim it down.
>>
>> Without doubt there are bigger problems
>> than removing dtrace probes, but at the same time, if there is
>> a quantifiable cost then we need to think about how we go about
>> applying dtrace to networking in order to get the best return
>> for the smallest CPU cost.
>>
> I'd suggest this strategy.
>
> 1) For development/troubleshooting probes, use fbt.
> ...

I agree with this.

But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be nice if you
could present that as fbt::ip-xmit-v4:entry but pass in ire->ire_nce as
arg2 instead of ire?

>> I'm most concerned that there are
>> dtrace probes almost next to each other, e.g. from ip.c:
>> [... lines 14250-14257 as quoted earlier ...]
>>
>> To me it seems that ip4-physical-out-end is almost an alias for
>> send here, if only the DTRACE_IP2() was before the if().
>>
>> How can we tell dtrace that sdt:::ip4-physical-out-end is an alias
>> for ip:::send or similar?
>>
> We can't - DTrace doesn't support aliasing like that.

Can I have this added to the list of new things I'd like to see dtrace
be able to support, along with the DTRACE_IF? :)

I don't know how this would look, maybe:

    DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
    DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
    DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
    DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);

or maybe something more complex, so that you could create a hook alias
that included just the mblk_t and the ill_t but still only have one hook
in the code path.

The general story is that we have a collection of values that we want to
export through dtrace, and we'd like to have a way to see them using
different filters (or views).

If dtrace can handle multiple people using the same probe, then
technically it should be able to cope with multiple people using the
same probe but with different aliases to present the values caught by
the dtrace hook in different ways.

Another thought: can these dtrace probe views be defined with dtrace(1M)
rather than in the kernel?

This would mean that "dtrace -l" might not provide an exhaustive list of
all the probes available (if it only consults the kernel).

So, if you wanted to get the ip-send dtrace probe, maybe in your dtrace
script you would include some special library that knows to use
ip4-physical-out-end (with extra args) and then to apply some sort of
condition and change the types of the args.

So my dtrace script might be:

    #!/usr/sbin/dtrace -Fs

    include <provider/ip>

    ip:::ip-send
    {
        ...
    }

If dtrace could be evolved to do that, then we could define base probes,
such as ip4-physical-out-start, and layer new probes on top of them
without having to deliver any new kernel modules.

Darren
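Part of the "different filters (or views)" idea already falls out of D
predicates today: any number of clauses (or separate consumers) can attach
to the same probe, each with its own filter. A trivial, untested sketch
against the existing probe, relying only on arg0 being the mblk_t pointer
(NULL once the hooks have blocked a packet):

    /* one probe, two views: packets dropped vs passed by the firewall hooks */
    sdt:::ip4-physical-out-end
    /arg0 == 0/
    {
    	@dropped = count();
    }

    sdt:::ip4-physical-out-end
    /arg0 != 0/
    {
    	@passed = count();
    }

What predicates cannot do is rename the probe or retype its arguments,
which is the part DTRACE_VIEW (or a library) would add.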
Brendan Gregg - Sun Microsystems
2007-Jun-30 02:36 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
On Fri, Jun 29, 2007 at 12:50:46PM -0700, Garrett D'Amore wrote:
> Michael Hunter wrote:
>> On Fri, 29 Jun 2007 12:53:45 -0600
>> Neil Putnam <Neil.Putnam at Sun.COM> wrote:
>>> James Carlson wrote:
>> [...]
>>>> Old NFS servers and SAMBA servers have a lot of non-TCP traffic, as do
>>>> systems doing VoIP. Rao Shoaib probably has contacts for that latter
>>>> category.
>>>
>>> How about Oracle RAC? Or various financial "UDP streaming" applications?
>> [...]
>>
>> It's not just financial. Multicast and unicast audio and video
>> distribution systems (essentially the made-stale-by-time protocols,
>> usually affinitized with VoIP). Some years ago some (most?) of the
>> online multi-user games were gross hacks on top of UDP, mostly
>> attempting to circumvent congestion control.

Cool, so we can add UDP streaming application servers, VoIP servers, and
audio and video servers. This is useful to bear in mind when considering
who will be affected by any overheads.

[...]

> When networks are reliable (and you don't need multicast), TCP works
> pretty well. The problem is when you have networks in the middle that
> start dropping frames... then TCP goes to hell in a handbasket.
>
> The reality is that unless you start playing with wireless networks, the
> mostly-lossless-ordered-packets requirements generally hold true. And
> even the wireless networks generally try to detect and correct for
> loss/corruption at the link layer, so TCP corrections never come into
> play.
>
> Anyway, I never meant to suggest that TCP performance tuning wasn't
> appropriate, only that it was *insufficient*. We
> (Sun/Solaris/OpenSolaris) need to start paying a lot more attention to
> the other protocols (and their performance) than we have in the past.

Sure, testing TCP only would be insufficient; I agree.

I think we really need a network performance test suite that tests all
protocol types and code paths, and generates a report with standard
deviations for each measurement. This will make it easier for people to
pay attention to those other protocols, and it can be made available on
an OpenSolaris page for all to use (similar to the filebench effort for
testing file systems).

Does such a tool exist? The self-service performance tests look like
they will not only provide the tests, but do them for you (even better),

    http://www.opensolaris.org/os/community/testing/selftest/

although neither iperf, netperf nor nttcp prints standard deviation -
and from the noise I've seen in my own packet tests, this is something
we'd really like to know. Although netperf can pay attention to noise by
doing its confidence tests...

Brendan

-- 
Brendan [CA, USA]
Darren.Reed at Sun.COM
2007-Jun-30 03:08 UTC
[networking-discuss] Re: [dtrace-discuss] DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> ...
>
> I think we really need a network performance test suite that tests all
> protocol types and code paths, and generates a report with standard
> deviations for each measurement. This will make it easier for people to
> pay attention to those other protocols, and it can be made available on
> an OpenSolaris page for all to use (similar to the filebench effort for
> testing file systems).
>
> Does such a tool exist? The self-service performance tests look like
> they will not only provide the tests, but do them for you (even better),
>
>     http://www.opensolaris.org/os/community/testing/selftest/
>
> although neither iperf, netperf nor nttcp prints standard deviation -
> and from the noise I've seen in my own packet tests, this is something
> we'd really like to know. Although netperf can pay attention to noise by
> doing its confidence tests...

My observation is that we need to be using specialised network test
equipment, such as what we have in some of the development labs, for our
testing. Which is to say that none of the currently established tests -
iperf, netperf or nttcp - is really capable of telling us what we really
need to know.

After all, if iperf/netperf/nttcp are being run from Solaris hosts that
are already subject to the limitations we're trying to fix, how can we
see if our changes make a difference?

Darren
Brendan Gregg - Sun Microsystems
2007-Jul-02 18:58 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Thu, Jun 21, 2007 at 11:49:24AM +0100, Jeremy Harris wrote:
> Hi Brendan,
>
> Brendan Gregg - Sun Microsystems wrote:
>> On Wed, Jun 20, 2007 at 09:53:44PM +0100, Jeremy Harris wrote:
>>> Brendan Gregg - Sun Microsystems wrote:
>>>>    ID   PROVIDER   MODULE             FUNCTION   NAME
>>> [...]
>>>> 11668       ipv4       ip     tcp_lsosend_data   send
>>>> 11669       ipv4       ip   tcp_multisend_data   send
>>> Aren't these more implementation artifacts than things suitable
>>> for exposure in a stable provider?
> [...]
>> If you meant that LSO and MDT are implementation details, then sure,
>> their activity is tied to implementation, and the IP packets that they
>> generate will be observable by the stable IP provider in a
>> non-implementation-specific way (send/receive probes, not
>> lso-send/mdt-send/etc).
>
> This is what I meant, yes.
> So, what stability level will these probes have?

As a new interface, I'd write the DTrace stability table as,

             | Name       Data       Class
   ----------+------------------------------
   Provider  | Evolving   Evolving   Common
   Module    | Private    Private    Unknown
   Function  | Private    Private    Unknown
   Name      | Evolving   Evolving   Common
   Arguments | Unstable   Unstable   Common

with Arguments becoming "Evolving" in the near future.

The aim of the provider is to be a stable end-user interface, with the
send/receive probes tracing activity between IP and the device driver
framework, or itself (loopback). So for send probes, whatever IP sends
down is traced; for receive probes, whatever IP receives is traced.

For MDT, it should be able to trace the individual IP packets as
ip:::send, similar to what sdt::tcp_multisend:ip4-physical-out-start
does when hooks are on.

For LSO, what IP sends down can be 64 Kbyte packets - which is what the
IP provider will trace as a 64 Kbyte ip:::send. This is in line with the
role of the IP provider: to trace what was sent and received to the
device driver layers.

Are 64 Kbyte sends confusing enough that they should really be called
ip:::lso-send? ... I don't think so. What about a future hypothetical
NIC performance feature that accepts packets in some manner much
stranger than LSO? That may indeed be a case for a new probe,
ip:::mumblefoo-send, or whatever.

Brendan

-- 
Brendan [CA, USA]
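A quick way to see those 64 Kbyte sends, as an untested one-liner sketch
assuming the take-2 argument order (where args[1] is the proposed
ipinfo_t):

    # dtrace -n 'ip:::send { @bytes = quantize(args[1]->ip_plength); }'

An LSO-heavy workload would then show up as counts in the upper buckets of
the payload-size distribution, without needing an lso-send probe name.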
Brendan Gregg - Sun Microsystems
2007-Jul-03 01:02 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
G'Day Darren,

On Fri, Jun 29, 2007 at 05:27:01PM -0700, Darren.Reed at sun.com wrote:
> Brendan Gregg - Sun Microsystems wrote:
>> [...]
>> Has anyone noticed a problem already? There are *already* more DTrace
>> probes in TCP/IP than what I'm suggesting to add. For example, the
>> following three commands were run at the same time,
>> [...]
>
> As Garrett alluded to, the situation that we're most concerned about
> with performance is not just local delivery but packet forwarding,
> where there are various efforts underway to try and reduce the size
> of the code path. I'm sure that the length of the code path to get
> data to/from telnet leans heavily towards delivery into TCP.

Rightio - sounds like testing the forwarding code path performance
should also be a standard test (such as in an automated perf PIT).

[...]

>> Thoughts:
>>
>> Each probe addition adds 1 to 5 nops, and a few movs of recently
>> accessed data which we would expect to still be cached. [...]
>>
>> In summary, the execution time of the probes is negligible when
>> considering the per-CPU packet rates; the i-cache effect looks
>> negligible, and from the tests was more affected by code layout than
>> the existence or non-existence of probes.
>
> I'm not really worried about the data cache, just the i-cache and
> the CPU pipeline. Throwing in some NOPs and otherwise redundant
> instructions every now and then doesn't seem like too big of a sin,
> but filling an entire i-cache row or two with NOPs seems... wasteful.
> [...] This may sound silly, but I'd have less of an
> issue about this if the same probes were further away from each
> other :) However, this really is micro-optimisation.

3 DTrace probes next to each other is certainly questionable. I've
recently dropped one set of the IP probes so that it is now 2, and put a
webrev of this with just the send/receive probes at the end of,

    http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2/

I wasn't certain that having both ip::: and ipv4/ipv6::: providers was
the right way to go, and put the webrev and website out to see how it
looked. I've also been writing scripts using either, to see how they
felt. The ip::: provider is more in the DTrace mould (compare with the
io::: provider), and so I've dropped the ipv4/ipv6 providers from the
recent webrev (ip::: now has identical functionality).

[...]

>> I'd suggest this strategy.
>>
>> 1) For development/troubleshooting probes, use fbt.
>>
>> fbt is free in terms of non-enabled cost. In fact, many of those sdt
>> probes could be served by fbt. eg,
>> [... ip_xmit_v4() source as quoted earlier ...]
>>
>> 2) Use sdt if fbt fails in some way.
>>
>> 3) Add public and stable provider probes if you really must, such as
>> for an end-user interface.
>
> I agree with this.
> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
> nice if you could present that as fbt::ip-xmit-v4:entry but
> pass in ire->ire_nce as arg2 instead of ire?

I would have thought that the existing (ire_t *)ire as args[2] was
already ideal, as users could refer to args[2]->ire_nce as needed. It
might be a different story if the kernel wasn't CTF'd and these weren't
already cast.

[...]

>> We can't - DTrace doesn't support aliasing like that.
>
> Can I have this added to the list of new things I'd like to
> see dtrace be able to support, along with the DTRACE_IF? :)
>
> I don't know how this would look, maybe:
>
>     DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
>     DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
>     DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>     DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
>
> or maybe something more complex, so that you could create a hook
> alias that included just the mblk_t and the ill_t but still only
> have one hook in the code path.
>
> The general story is that we have a collection of values that we want
> to export through dtrace, and we'd like to have a way to see them
> using different filters (or views).
>
> If dtrace can handle multiple people using the same probe, then
> technically it should be able to cope with multiple people
> using the same probe but with different aliases to present the
> values caught by the dtrace hook in different ways.

Cool idea. I don't think there is a need to do this right now based on
performance testing so far, but it is better to have options like this
to try than not to. And both DTRACE_IF and DTRACE_VIEW are options that
could be added later without changing the exported stable provider
interface.

It would take some moderate work to get this done; and when complete,
there would still be more non-enabled probe overhead in other places -
like the mib probes - that wouldn't benefit from this approach.

> Another thought: can these dtrace probe views be defined with
> dtrace(1M) rather than in the kernel?
>
> This would mean that "dtrace -l" might not provide an exhaustive
> list of all the probes available (if it only consults the kernel).
>
> So, if you wanted to get the ip-send dtrace probe, maybe in your
> dtrace script you would include some special library that knows
> to use ip4-physical-out-end (with extra args) and then to
> apply some sort of condition and change the types of the args.
>
> So my dtrace script might be:
>
>     #!/usr/sbin/dtrace -Fs
>
>     include <provider/ip>
>
>     ip:::ip-send
>     {
>         ...
>     }
>
> If dtrace could be evolved to do that, then we could define base
> probes, such as ip4-physical-out-start, and layer new probes on top
> of them without having to deliver any new kernel modules.

Yes, that would be another line of attack.

...

There might be an easier way to drop the duplicate probe overhead, which
currently looks like,

 14223                 DTRACE_PROBE4(ip4__physical__out__start,
 14224                     ill_t *, NULL, ill_t *, stq_ill,
 14225                     ipha_t *, ipha, mblk_t *, mp);
 14226                 FW_HOOKS(ipst->ips_ip4_physical_out_event,
 14227                     ipst->ips_ipv4firewall_physical_out,
 14228                     NULL, stq_ill, ipha, mp, mpip, ipst);
 14229                 DTRACE_PROBE1(ip4__physical__out__end, mblk_t *,
 14230                     mp);
 14231                 if (mp == NULL)
 14232                         goto drop;
 14233
 14234                 DTRACE_IP5(send, mblk_t *, mp, void_ip_t *, ipha,
 14235                     ill_t *, stq_ill, ipha_t *, ipha, ip6_t *, NULL);

Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
anyway?

FW_HOOKS wraps hook_run() if hooks are enabled, and hook_run() is
traceable at zero cost from fbt. Those sdt::: probes provide interface
info, the IP header and the raw mblk pointer. So you could write scripts
like,

 # ./fwhooks01.d
   EVENT                     *MP  IF-IN IF-OUT             FROM               TO
 > PHYSICAL_OUT ffffff6137dde4e0      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffff6137dde4e0      0      2    192.168.1.109    192.168.1.198
 > PHYSICAL_OUT fffffffedad37220      0      2    192.168.1.109    192.168.1.108
 < PHYSICAL_OUT                0      0      2    192.168.1.109    192.168.1.108
 > PHYSICAL_OUT ffffff0e2ec32620      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffff0e2ec32620      0      2    192.168.1.109    192.168.1.198
 > PHYSICAL_IN  ffffff0e2ec32620      2      0    192.168.1.198    192.168.1.109
 < PHYSICAL_IN  ffffff0e2ec32620      2      0    192.168.1.198    192.168.1.109
 > PHYSICAL_OUT ffffffff9ba996c0      0      2    192.168.1.109    192.168.1.198
 < PHYSICAL_OUT ffffffff9ba996c0      0      2    192.168.1.109    192.168.1.198
 [...]

(That zero *mp was for a blocked packet.) The above script was written
using fbt at zero cost, not sdt!

 root at deimos:/root> cat fwhooks01.d
 #!/usr/sbin/dtrace -s

 #pragma D option quiet
 #pragma D option switchrate=10

 dtrace:::BEGIN
 {
 	printf(" %-12s %16s %6s %6s %16s %16s\n",
 	    "EVENT", "*MP", "IF-IN", "IF-OUT", "FROM", "TO");
 }

 fbt::hook_run:entry
 {
 	self->info = (hook_pkt_event_t *)arg1;
 	self->name = stringof(args[0]->hei_event->he_name);
 	this->ipha = (ipha_t *)self->info->hpe_hdr;
 	self->saddr = inet_ntoa(&this->ipha->ipha_src);
 	self->daddr = inet_ntoa(&this->ipha->ipha_dst);
 	printf("> %-12s %16x %6d %6d %16s %16s\n", self->name,
 	    (uint64_t)*self->info->hpe_mp, self->info->hpe_ifp,
 	    self->info->hpe_ofp, self->saddr, self->daddr);
 }

 fbt::hook_run:return
 /self->info/
 {
 	printf("< %-12s %16x %6d %6d %16s %16s\n", self->name,
 	    (uint64_t)*self->info->hpe_mp, self->info->hpe_ifp,
 	    self->info->hpe_ofp, self->saddr, self->daddr);
 	self->info = 0;
 	self->name = 0;
 	self->saddr = 0;
 	self->daddr = 0;
 }

Woah, cool huh! :-) (BTW, that script needs a recent build so that
inet_ntoa() exists.)

Any specific reason why those sdt probes were added? ... such as a
script that needed to be written? I can check if fbt can be used
instead.

I'm not saying that sdt:::ip4-physical-out-start/end *need* to be
dropped, just that it looks like they could be served by zero-cost fbt
probes, and so dropped if a performance problem is shown to exist. And
that might be an easier option than building a new DTrace framework. ;)

cheers,

Brendan

-- 
Brendan [CA, USA]
Darren.Reed at Sun.COM
2007-Jul-03 02:33 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:
> G'Day Darren,
>
> On Fri, Jun 29, 2007 at 05:27:01PM -0700, Darren.Reed at sun.com wrote:
>> [...]
>> I'm not really worried about the data cache, just the i-cache and
>> the CPU pipeline. [...] This may sound silly, but I'd have less of an
>> issue about this if the same probes were further away from each
>> other :) However, this really is micro-optimisation.
>
> 3 DTrace probes next to each other is certainly questionable. I've
> recently dropped one set of the IP probes so that it is now 2, and put a
> webrev of this with just the send/receive probes at the end of,
>
>     http://www.opensolaris.org/os/community/dtrace/NetworkProvider/Prototype2/
>
> I wasn't certain that having both ip::: and ipv4/ipv6::: providers was
> the right way to go, and put the webrev and website out to see how it
> looked. I've also been writing scripts using either, to see how they
> felt. The ip::: provider is more in the DTrace mould (compare with the
> io::: provider), and so I've dropped the ipv4/ipv6 providers from the
> recent webrev (ip::: now has identical functionality).
>
> [...]
>>> 1) For development/troubleshooting probes, use fbt.
>>> ...
>>
>> I agree with this.
>> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
>> nice if you could present that as fbt::ip-xmit-v4:entry but
>> pass in ire->ire_nce as arg2 instead of ire?
>
> I would have thought that the existing (ire_t *)ire as args[2] was
> already ideal, as users could refer to args[2]->ire_nce as needed. It
> might be a different story if the kernel wasn't CTF'd and these weren't
> already cast.

Indeed, and I suspect that this hook is simply "ease of use" in
debugging, and could be a candidate for removing..

>>> We can't - DTrace doesn't support aliasing like that.
>>
>> Can I have this added to the list of new things I'd like to
>> see dtrace be able to support, along with the DTRACE_IF? :)
>>
>> I don't know how this would look, maybe:
>>
>>     DTRACE_DEF(ip__hook__abc, m, ip, stq_ill);
>>     DTRACE_VIEW1(ip__hook__abc, ip__physical__out__end, mblk_t *);
>>     DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>>     DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
>>
>> [...]
>
> Cool idea. I don't think there is a need to do this right now based on
> performance testing so far, but it is better to have options like this
> to try than not to. And both DTRACE_IF and DTRACE_VIEW are options that
> could be added later without changing the exported stable provider
> interface.

Yes :)

> It would take some moderate work to get this done; and when complete,
> there would still be more non-enabled probe overhead in other places -
> like the mib probes - that wouldn't benefit from this approach.

Understood. Should I RFE this, or is there some other virtual whiteboard
on opensolaris.org for new things to add to dtrace?

>> Another thought: can these dtrace probe views be defined with
>> dtrace(1M) rather than in the kernel?
>>
>> [...]
>>
>> If dtrace could be evolved to do that, then we could define base
>> probes, such as ip4-physical-out-start, and layer new probes on top
>> of them without having to deliver any new kernel modules.
>
> Yes, that would be another line of attack.
>
> ...
>
> There might be an easier way to drop the duplicate probe overhead, which
> currently looks like,
>
> [... ip.c lines 14223-14235 as quoted earlier ...]
>
> Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
> anyway?

For -start, the point is to have a dtrace probe that is particular to
the location of the hook in the stack relative to packet processing.

For -end, it is useful to see the mp after the FW_HOOKS() is complete.
Is it NULL? Has it been changed? Both of these are possible here. Or in
other words, the complexity of using dtrace to achieve something
meaningful is why there are two hooks.

Plus, none of the standard fbt things will show us what happens to mp on
the successful return of the call to hook_run.

In the scripts that you added, yes, you've achieved the same thing, but
it's not a trivial exercise. You need to (a) know that you can work this
way with dtrace and (b) know what's inside the args. Recommending that
people take that approach isn't something we should be doing a lot of if
they end up using interfaces that aren't stable.

Or to put it another way, if it was possible to use probes that
presented just as simple an interface but implemented it using other
means, then sure, the probes could be eliminated.

Or to use the syntax from above, I would be just as happy if I could
write a probe like this:

    dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'

where "-I" is an "include" function, and dtrace somehow built the
ip4-physical-out-end using fbt probes like you did; why do I care how
the probe is implemented? A consideration that needs to be taken into
account is the performance cost of this approach.

A danger in relying on fbt is that it only works when the function call
works. So, for example, if ip4-physical-out-start/end were made to
depend on fbt::hook_run, and that function doesn't get called, then
there's no physical-out-start/end. While the -end probe is only really
useful in the presence of hook_run being called, the other is not.

Darren
Brendan Gregg - Sun Microsystems
2007-Jul-03 03:42 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Mon, Jul 02, 2007 at 07:33:45PM -0700, Darren.Reed at Sun.COM wrote:
[...]
>>> I agree with this.
>>> But having seen the sdt:::ip-xmit-v4 probe, wouldn't it be
>>> nice if you could present that as fbt::ip-xmit-v4:entry but
>>> pass in ire->ire_nce as arg2 instead of ire?
>>
>> I would have thought that the existing (ire_t *)ire as args[2] was
>> already ideal, as users could refer to args[2]->ire_nce as needed. It
>> might be a different story if the kernel wasn't CTF'd and these
>> weren't already cast.
>
> Indeed, and I suspect that this hook is simply "ease of use"
> in debugging, and could be a candidate for removing..

Yes; and an advantage of private sdt probes is that they could be added
to support a particular performance project and then removed later when
no longer needed...

>>> DTRACE_VIEW2(ip__hook__abc, ip_send, mblk_t *, void_ip_t *);
>>> DTRACE_VIEW3(ip__hook__abc, ip4__send, mblk_t *, ipha_t *, ill_t *);
[...]
>> It would take some moderate work to get this done; and when complete,
>> there would still be more non-enabled probe overhead in other places -
>> like the mib probes - that wouldn't benefit from this approach.
>
> Understood. Should I RFE this, or is there some other virtual
> whiteboard on opensolaris.org for new things to add to dtrace?

AFAIK, the DTrace todo list is the RFE list from bugster.

[...]
>> There might be an easier way to drop the duplicate probe overhead,
>> which currently looks like,
>>
>> [... ip.c lines 14223-14235 as quoted earlier ...]
>>
>> Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
>> anyway?
>
> For -start, the point is to have a dtrace probe that is particular to
> the location of the hook in the stack relative to packet processing.

That sounds doable from fbt,

 # dtrace -n 'fbt::hook_run:entry { trace(stringof(args[0]->hei_event->he_name)); stack(); }'
 dtrace: description 'fbt::hook_run:entry ' matched 1 probe
 CPU     ID                    FUNCTION:NAME
   0  73492                   hook_run:entry   PHYSICAL_OUT
               ip`tcp_send_data+0x78e
               ip`tcp_output+0x78f
               ip`squeue_enter+0x41a
               ip`tcp_wput+0xf8
               unix`putnext+0x2f1
               rpcmod`mir_wput+0x1aa
               rpcmod`rmm_wput+0x1e
               unix`put+0x270
               rpcmod`clnt_dispatch_send+0x11a
               rpcmod`clnt_cots_kcallit+0x596
               nfs`nfs4_rfscall+0x4e2
               nfs`rfs4call+0x102
               nfs`nfs4lookupvalidate_otw+0x2d5
               nfs`nfs4lookup+0x212
               nfs`nfs4_lookup+0xe5
               genunix`fop_lookup+0x53
               genunix`lookuppnvp+0x2e5
               genunix`lookuppnat+0x125
               genunix`lookupnameat+0x82
               genunix`cstatat_getvp+0x160
 [...]

> For -end, it is useful to see the mp after the FW_HOOKS() is complete.
> Is it NULL? Has it been changed? Both of these are possible here.
> Or in other words, the complexity of using dtrace to achieve something
> meaningful is why there are two hooks.

A minute reading FW_HOOKS() shows that it will set mp to NULL if
hook_run() returns non-zero, which fbt can trace.

Yes, there are some really complex associations and mental leaps that
must be made with fbt to solve some tracing issues, where it would be
useful to use sdt instead. It doesn't look like this is one of them.

> Plus, none of the standard fbt things will show us what happens to mp
> on the successful return of the call to hook_run.

This is also doable from fbt, although it takes a few lines of code,

 fbt::hook_run:entry
 {
 	self->info = (hook_pkt_event_t *)arg1;
 }

 fbt::hook_run:return
 /self->info/
 {
 	this->mp = arg1 != 0 ? NULL : (uint64_t)*self->info->hpe_mp;
 	printf("mp: %x", this->mp);
 	self->info = 0;
 }

The above shows the value of mp after FW_HOOKS(), and accounts for both
hook_run() and FW_HOOKS() itself changing mp.

> In the scripts that you added, yes, you've achieved the same thing,
> but it's not a trivial exercise. You need to (a) know that you can work
> this way with dtrace and (b) know what's inside the args. Recommending
> that people take that approach isn't something we should be doing a lot
> of if they end up using interfaces that aren't stable.

Those probes export mblk_t's and ill_t's around hook calls, so whoever
is interested in using them must have some familiarity with kernel code.

> Or to put it another way, if it was possible to use probes that
> presented just as simple an interface but implemented it using other
> means, then sure, the probes could be eliminated.

Who is this for? Who would be working on this code and not have read the
contents of FW_HOOKS()?

> Or to use the syntax from above, I would be just as happy if I could
> write a probe like this:
>
>     dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'
>
> where "-I" is an "include" function, and dtrace somehow built the
> ip4-physical-out-end using fbt probes like you did; why do I care how
> the probe is implemented? A consideration that needs to be taken into
> account is the performance cost of this approach.

DTrace already has some functionality that could be used; people can
build custom /usr/lib/dtrace include files containing translators,
inlines and #defines to make life easier for tracing certain targets.

It might make sense to add a little functionality, such as the provider
aliasing you suggested, to tie this together as a user-customised view
of fbt - especially for fbt probes that are complex to use.

Of course, anyone writing these include files will need to maintain them
as the kernel changes.

> A danger in relying on fbt is that it only works when the function call
> works. So, for example, if ip4-physical-out-start/end were made to
> depend on fbt::hook_run, and that function doesn't get called, then
> there's no physical-out-start/end. While the -end probe is only really
> useful in the presence of hook_run being called, the other is not.

Yes, that is a difference between fbt and those sdt probes. You can at
least detect this situation using dtrace (the lack of hook_run() calls
from certain functions).

cheers,

Brendan

-- 
Brendan [CA, USA]
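As a rough, untested sketch of what such an include file could contain -
the file, type and member names here are all invented, and only the
hooks-framework types used in fwhooks01.d above are assumed:

    /*
     * pfh.d: hypothetical D library sketch; place in a directory and
     * load with dtrace -L <dir>. May also need a depends_on pragma.
     */
    typedef struct pfhinfo {
    	uint64_t pfh_mp;	/* current mblk pointer (0 if dropped) */
    	int pfh_ifp;		/* inbound interface */
    	int pfh_ofp;		/* outbound interface */
    } pfhinfo_t;

    translator pfhinfo_t < hook_pkt_event_t *H > {
    	pfh_mp = (uint64_t)*H->hpe_mp;
    	pfh_ifp = (int)H->hpe_ifp;
    	pfh_ofp = (int)H->hpe_ofp;
    };

A script would then use it as, e.g.,

    fbt::hook_run:entry
    {
    	printf("in %d out %d\n",
    	    xlate <pfhinfo_t *> ((hook_pkt_event_t *)arg1)->pfh_ifp,
    	    xlate <pfhinfo_t *> ((hook_pkt_event_t *)arg1)->pfh_ofp);
    }

which hides the casts, but not the probe name - that is the piece that
would need the aliasing support discussed above.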
Darren.Reed at Sun.COM
2007-Jul-04 03:41 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Brendan Gregg - Sun Microsystems wrote:

>[...]
>
>>>There might be an easier way to drop the duplicate probe overhead, which
>>>currently looks like,
>>>
>>>	DTRACE_PROBE4(ip4__physical__out__start,
>>>	    ill_t *, NULL, ill_t *, stq_ill,
>>>	    ipha_t *, ipha, mblk_t *, mp);
>>>	FW_HOOKS(ipst->ips_ip4_physical_out_event,
>>>	    ipst->ips_ipv4firewall_physical_out,
>>>	    NULL, stq_ill, ipha, mp, mpip, ipst);
>>>	DTRACE_PROBE1(ip4__physical__out__end, mblk_t *, mp);
>>>	if (mp == NULL)
>>>		goto drop;
>>>
>>>	DTRACE_IP5(send, mblk_t *, mp, void_ip_t *, ipha,
>>>	    ill_t *, stq_ill, ipha_t *, ipha, ip6_t *, NULL);
>>>
>>>Why have sdt:::ip4-physical-out-start and sdt:::ip4-physical-out-end
>>>anyway?
>>
>>For -start, the point is to have a dtrace probe that is particular to the
>>location of the hook in the stack relative to packet processing.
>
>That sounds doable from fbt,
>
> # dtrace -n 'fbt::hook_run:entry { trace(stringof(args[0]->hei_event->he_name)); stack(); }'
> dtrace: description 'fbt::hook_run:entry ' matched 1 probe
> CPU     ID                    FUNCTION:NAME
>   0  73492                   hook_run:entry   PHYSICAL_OUT
>              ip`tcp_send_data+0x78e
>              ip`tcp_output+0x78f
>              ip`squeue_enter+0x41a
>              ip`tcp_wput+0xf8
>              unix`putnext+0x2f1
>              rpcmod`mir_wput+0x1aa
>              rpcmod`rmm_wput+0x1e
>              unix`put+0x270
>              rpcmod`clnt_dispatch_send+0x11a
>              rpcmod`clnt_cots_kcallit+0x596
>              nfs`nfs4_rfscall+0x4e2
>              nfs`rfs4call+0x102
>              nfs`nfs4lookupvalidate_otw+0x2d5
>              nfs`nfs4lookup+0x212
>              nfs`nfs4_lookup+0xe5
>              genunix`fop_lookup+0x53
>              genunix`lookuppnvp+0x2e5
>              genunix`lookuppnat+0x125
>              genunix`lookupnameat+0x82
>              genunix`cstatat_getvp+0x160
>              [...]

When it is possible to write a dtrace script that uses a probe named
"ip4-physical-out-end" and there is no requirement for me to know about
fbt, all will be good.

>>For -end, it is useful to see the mp after the FW_HOOKS() is complete.
>>Is it NULL? Has it been changed? Both of these are possible here.
>>Or in other words, the complexity of using dtrace to achieve something
>>meaningful is why there are two hooks.
>
>A minute spent reading FW_HOOKS() shows that it will set mp to NULL and
>complete if hook_run() returns zero, which fbt can trace.

So long as hook_run() is being used in this context. If hook_run() is ever
used outside of networking (and there's no reason why it couldn't) then
this becomes harder - more matching is required in the fbt entry probes,
and saving of parameters too.

The use of "?" in D to achieve "if-else" assignments is not obvious at
first, and the lack of "if-else" is the biggest obstacle I run into with D.

>>Plus, none of the standard fbt things will show us what happens to mp on
>>the successful return of the call to hook_run.
>
>This is also doable from fbt, although a few lines of code,
>
> fbt::hook_run:entry
> {
>         self->info = (hook_pkt_event_t *)arg1;
> }
>
> fbt::hook_run:return
> /self->info/
> {
>         this->mp = arg1 != 0 ? 0 : (uint64_t)*self->info->hpe_mp;
>         printf("mp: %x", this->mp);
>         self->info = 0;
> }
>
>The above shows the value of mp after FW_HOOKS(), and accounts for both
>hook_run() and FW_HOOKS() itself changing mp.

But it requires me to use 2 probes, not 1, and in a more complex method
than simply doing:

 dtrace -n 'sdt:::ip4-physical-out-end{printf("mp: %x", this->mp);}'

Being able to put it all on one command line has its advantages ;)

>...
>
>>Or to put it another way, if it was possible to use a probe that
>>presented just as simple an interface but implemented it using
>>other means then sure, the probes could be eliminated.
>
>Who is this for? Who would be working on this code and not have read the
>contents of FW_HOOKS()?

Using the existing probes doesn't require someone to read the FW_HOOKS
macro, today. If I write up a blog or something else about these SDT
probes and mention what the parameters are, that should be enough for
someone to go and use them for something.

I'd also reject the assertion that using SDT probes implies that someone
is "working on code."

>>Or to use the syntax from above, I would be just as happy if I could
>>write a probe like this:
>>
>>dtrace -I provider/pfh -n 'pfh:::ip4-physical-out-end{...}'
>>
>>where "-I" is an "include" function, and dtrace somehow built the
>>ip4-physical-out-end using fbt probes like you did; why do I care
>>how the probe is implemented? A consideration that needs to be
>>taken into account is the performance cost of this approach.
>
>DTrace already has some functionality that could be used; people can
>build custom /usr/lib/dtrace include files containing translators,
>inlines and #defines to make life easier for tracing certain targets.

Hmm, and I see you're using translators to build up the data structures
for your provider... I've never looked at a translator before - the
syntax seems to borrow from C++ - I'm sure that will make instant enemies
of some ;) They look useful, especially for networking...

>It might make sense to add a little functionality, such as provider
>aliasing as you suggested, to tie this together as a user customised
>view of fbt. Especially for fbt probes that are complex to use.
>
>Of course, anyone writing these include files will need to maintain them
>as the kernel changes.

Of course.

Darren
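(For reference, translator syntax in a D library file looks roughly like
the sketch below. Illustrative only: this is not the actual provider
code; the typedef simply mirrors the proposed ipinfo_t members, and the
byte-order and address-to-string handling are elided.)

    typedef struct ipinfo {
            int ip_protocol;            /* protocol */
            uint32_t ip_plength;        /* payload length */
            string ip_saddr;            /* source address */
            string ip_daddr;            /* destination address */
    } ipinfo_t;

    /* map the raw IPv4 header onto the stable structure */
    translator ipinfo_t < ipha_t *I > {
            ip_protocol = 4;
            ip_plength = I->ipha_length;    /* byte-order conversion elided */
            ip_saddr = "";                  /* address formatting elided */
            ip_daddr = "";
    };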
Adam Leventhal
2007-Jul-04 06:59 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Tue, Jul 03, 2007 at 08:41:26PM -0700, Darren.Reed at Sun.COM wrote:
> When it is possible to write a dtrace script that uses a probe named
> "ip4-physical-out-end" and there is no requirement for me to know
> about fbt, all will be good.

That's only a reasonable goal if we expect customers to gather actionable
data from probe points of that nature. If such a probe's only consumer were
a developer of the IP stack, well, fbt is really the right tool.

> The use of "?" in D to achieve "if-else" assignments is not obvious at
> first, and the lack of "if-else" is the biggest obstacle I run into with D.

Was the ?: operator unfamiliar to you? We chose to implement that -- obviously
-- because it was used in C (and all C-derivatives I know of). Perhaps you
should add an entry to the wiki for people unfamiliar with the construct.

> > fbt::hook_run:entry
> > {
> >         self->info = (hook_pkt_event_t *)arg1;
> > }
> >
> > fbt::hook_run:return
> > /self->info/
> > {
> >         this->mp = arg1 != 0 ? 0 : (uint64_t)*self->info->hpe_mp;
> >         printf("mp: %x", this->mp);
> >         self->info = 0;
> > }
> >
> > The above shows the value of mp after FW_HOOKS(), and accounts for both
> > hook_run() and FW_HOOKS() itself changing mp.
>
> But it requires me to use 2 probes, not 1, and in a more complex
> method than simply doing:
>
>  dtrace -n 'sdt:::ip4-physical-out-end{printf("mp: %x", this->mp);}'
>
> Being able to put it all on one command line has its advantages ;)

The point of SDT is not to make it easier to fit your D program in 80
columns, and the script that Brendan quoted is the kind of thing that
users of DTrace do on the command line all the time.

The fbt provider has an important place in the pantheon of DTrace
providers and is not meant to be obviated by SDT. Rather, SDT is meant
to expose points of semantic relevance and stability.

> I'd also reject the assertion that using SDT probes implies that
> someone is "working on code."

Well, as implemented the stability is private, so I'm not sure how a
customer could reasonably be expected to use or rely on the probes.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
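(As a sketch of that end-user style: assuming the proposed take-2
argument layout, where args[1] is the ipinfo_t, a hypothetical one-liner
could tally IP payload bytes sent by destination address.)

    # dtrace -n 'ip:::send { @out[args[1]->ip_daddr] = sum(args[1]->ip_plength); }'

(Such a script depends only on the documented provider arguments, not on
kernel internals, which is what lets it survive implementation changes.)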
Darren.Reed at Sun.COM
2007-Jul-05 17:52 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Adam Leventhal wrote:

>On Tue, Jul 03, 2007 at 08:41:26PM -0700, Darren.Reed at Sun.COM wrote:
>
>>When it is possible to write a dtrace script that uses a probe named
>>"ip4-physical-out-end" and there is no requirement for me to know
>>about fbt, all will be good.
>
>That's only a reasonable goal if we expect customers to gather actionable
>data from probe points of that nature. If such a probe's only consumer were
>a developer of the IP stack, well, fbt is really the right tool.

Are you saying that developers shouldn't add sdt probes and should only
use fbt probes?

I suppose where this thread of conversation may go is that with
ip4-physical-out-start/end, there's no real need to have a "start/end";
just having "ip4-physical-out" is enough, and the start/end can be
derived from using fbt::hook_run. I'm in two minds about that, primarily
because I'm not 100% comfortable with the idea of needing to store
something on :::entry in order to look at it with :::return and see what
it is after the return.

While we may well hide behind the veil of "sdt probes are private", in
the great big world of open source, this is just silly. Everyone can see
they're there and how they can or cannot use them. If someone who isn't
a developer wanted to use dtrace to collect inbound packet statistics
and found sdt:::ip4-physical-in-start to be useful, who are we to say
they can't, if it serves their needs?

Anyway, that's a whole other digression.

>>The use of "?" in D to achieve "if-else" assignments is not obvious at
>>first, and the lack of "if-else" is the biggest obstacle I run into with D.
>
>Was the ?: operator unfamiliar to you? We chose to implement that -- obviously
>-- because it was used in C (and all C-derivatives I know of). Perhaps you
>should add an entry to the wiki for people unfamiliar with the construct.

Most of the time I think in terms of "if-else", as I regard the ?:
operator as something people use to be obscure, to save lines of code,
etc., and I try not to use it myself. Every time dtrace gives me errors
saying it doesn't understand if statements, I have to remember to do the
matching with /.../ predicates instead. D generally requires a different
mode of thinking when solving a programming problem, and I'm not always
in tune with that.

Darren
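(For reference, the two idioms express the same logic. A minimal
illustrative sketch, using hook_run()'s return value as the condition:
first the ?: assignment form, then the equivalent clauses matched with
/.../ predicates.)

    /* ?: form: a conditional assignment within one clause */
    fbt::hook_run:return
    {
            this->failed = (arg1 != 0) ? 1 : 0;
    }

    /* predicate form: one clause per case, matched with /.../ */
    fbt::hook_run:return
    /arg1 != 0/
    {
            @returns["non-zero"] = count();
    }

    fbt::hook_run:return
    /arg1 == 0/
    {
            @returns["zero"] = count();
    }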
Peter Memishian
2007-Jul-05 18:07 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
> While we may well hide behind the veil of "sdt probes are private",
> in the great big world of open source, this is just silly.

Please, not this again. Just because it's open-source doesn't mean that
everyone is allowed to go mucking around with everyone else's private parts.

--
meem
Bryan Cantrill
2007-Jul-05 18:09 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
> >>When it is possible to write a dtrace script that uses a probe named
> >>"ip4-physical-out-end" and there is no requirement for me to know
> >>about fbt, all will be good.
> >
> >That's only a reasonable goal if we expect customers to gather actionable
> >data from probe points of that nature. If such a probe's only consumer were
> >a developer of the IP stack, well, fbt is really the right tool.
>
> Are you saying that developers shouldn't add sdt probes and should
> only use fbt probes?

No, he's not -- Adam's point is that FBT can do quite a bit for you
without having to add SDT probes.

[ ... ]

> While we may well hide behind the veil of "sdt probes are private",
> in the great big world of open source, this is just silly. Everyone
> can see they're there and how they can or cannot use them. If someone
> who isn't a developer wanted to use dtrace to collect inbound packet
> statistics and found sdt:::ip4-physical-in-start to be useful, who are
> we to say they can't, if it serves their needs?

Yes, that's _exactly_ why we added a programmatic notion of stability.
So yes, you should knock yourself out adding Private SDT providers -- they
don't even require an ARC case. And if they serve a customer's needs,
great -- because you will have declared them to be Private, a customer
that cares about the stability of their D scripts will know that their
script can be broken by any twitch of the operating system.

That said, there is great value in adding stable (or in this case, Evolving)
SDT providers -- they allow customers to build those stable scripts that
have been so useful in things like the DTraceToolkit.

So one can envision a spectrum of usage of DTrace, with Solaris developers
on one end and end-users on the other: FBT provides value primarily at the
developer end, Stable/Evolving SDT providers provide value primarily at
the end-user end, and Private SDT providers fall in between. So these
are not mutually exclusive -- it's a question of the audience, and the
audience for Brendan's work is very much the end-user.

        - Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
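(That programmatic notion of stability is visible to users: dtrace's -v
option prints a stability report for a probe. For example, against the
existing stable io provider:

    # dtrace -lvn io:::start

lists the probe along with its argument types and stability attributes,
so a script author can see whether they are depending on something
Private, Evolving or Stable before they write the script.)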
Darren.Reed at Sun.COM
2007-Jul-05 18:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Bryan Cantrill wrote:

> ...
>
>>While we may well hide behind the veil of "sdt probes are private",
>>in the great big world of open source, this is just silly. Everyone
>>can see they're there and how they can or cannot use them. If someone
>>who isn't a developer wanted to use dtrace to collect inbound packet
>>statistics and found sdt:::ip4-physical-in-start to be useful, who are
>>we to say they can't, if it serves their needs?
>
>Yes, that's _exactly_ why we added a programmatic notion of stability.
>So yes, you should knock yourself out adding Private SDT providers -- they
>don't even require an ARC case. And if they serve a customer's needs,
>great -- because you will have declared them to be Private, a customer
>that cares about the stability of their D scripts will know that their
>script can be broken by any twitch of the operating system.
>
>That said, there is great value in adding stable (or in this case, Evolving)
>SDT providers -- they allow customers to build those stable scripts that
>have been so useful in things like the DTraceToolkit.
>
>So one can envision a spectrum of usage of DTrace, with Solaris developers
>on one end and end-users on the other: FBT provides value primarily at the
>developer end, Stable/Evolving SDT providers provide value primarily at
>the end-user end, and Private SDT providers fall in between. So these
>are not mutually exclusive -- it's a question of the audience, and the
>audience for Brendan's work is very much the end-user.

Adam's email suggests that SDT probes are only ever going to be private,
but you're suggesting otherwise, and those that Brendan is working on
aren't SDT, if I recall correctly. Who's right here?

While I don't actually envisage wanting to promote an SDT probe, at this
point, to anything above private, I'd just like to know if there is some
subtlety to the difference between what you've said above and what Adam
was saying that I'm missing.

Darren
Adam Leventhal
2007-Jul-05 22:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Thu, Jul 05, 2007 at 11:24:47AM -0700, Darren.Reed at Sun.COM wrote:
> Adam's email suggests that SDT probes are only ever going to be
> private, but you're suggesting otherwise, and those that Brendan
> is working on aren't SDT, if I recall correctly. Who's right here?
>
> While I don't actually envisage wanting to promote an SDT probe,
> at this point, to anything above private, I'd just like to know if there
> is some subtlety to the difference between what you've said above
> and what Adam was saying that I'm missing.

I think this is an instance where it would be beneficial to RTFM. A better
understanding of DTrace and its providers would help you answer your
questions.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Darren.Reed at Sun.COM
2007-Jul-06 23:58 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Adam Leventhal wrote:

>On Thu, Jul 05, 2007 at 11:24:47AM -0700, Darren.Reed at Sun.COM wrote:
>
>>Adam's email suggests that SDT probes are only ever going to be
>>private, but you're suggesting otherwise, and those that Brendan
>>is working on aren't SDT, if I recall correctly. Who's right here?
>>
>>While I don't actually envisage wanting to promote an SDT probe,
>>at this point, to anything above private, I'd just like to know if there
>>is some subtlety to the difference between what you've said above
>>and what Adam was saying that I'm missing.
>
>I think this is an instance where it would be beneficial to RTFM. A better
>understanding of DTrace and its providers would help you answer your
>questions.

Reading sdt(7) does clear up any misunderstandings I had.

Is sdt likely to be upgraded to a "committed" interface?

Darren
Adam Leventhal
2007-Jul-07 00:02 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
On Fri, Jul 06, 2007 at 04:58:45PM -0700, Darren.Reed at Sun.COM wrote:
> Reading sdt(7) does clear up any misunderstandings I had.

http://docs.sun.com/app/docs/doc/817-6223

- ahl

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
James Carlson
2007-Jul-08 20:24 UTC
[dtrace-discuss] [networking-discuss] Re: DTrace Network Providers, take 2
Peter Memishian writes:

> > While we may well hide behind the veil of "sdt probes are private",
> > in the great big world of open source, this is just silly.
>
> Please, not this again. Just because it's open-source doesn't mean that
> everyone is allowed to go mucking around with everyone else's private parts.

Indeed.

The core confusion seems to be over the word "private." It doesn't mean
"secret." It doesn't even mean that we'll somehow chase down those wayward
users and somehow stop them from using what they ought not. Instead, it
means that the original author is not *expecting* anyone to use the
interface. As he doesn't *expect* this to happen, he's not going to try
very hard (if at all) to make sure that any future changes are compatible
with what anyone else is doing.

In other words, if you use something that's private, you can get hurt by
changes down the line. If you don't care whether your application falls
apart in the future, then go ahead and use whatever you want.

It's not "secrecy," but rather rate-of-change, and having open source
doesn't change the situation for private interfaces in the slightest.

--
James Carlson, Solaris Networking          <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N Fax +1 781 442 1677