Currently in Xen, interdomain communication needlessly wastes CPU cycles calculating and verifying TCP/UDP checksums. This is unnecessary, as the possibility of packet corruption between domains is minuscule (and corruption in memory can be detected via ECC). Also, domUs are unable to take advantage of any adapter hardware checksum offload capabilities when transmitting packets outside of the system.

This patch removes the interdomain TCP/UDP checksum work by using the existing Linux hardware checksum offload infrastructure. Reusing that infrastructure kept the patch small and made it easy to use hardware checksumming on the physical devices.

Here is how the traffic flow now works (generically):

Traffic generated by dom0: dom0 does not compute the TCP/UDP checksum and notifies domU of this via the csum bit in netif_rx_response_t. domU checks the csum bit on each incoming packet and, if the bit is not set, verifies the checksum itself.

Traffic generated externally: if rx hardware checksumming is available and enabled, dom0 notifies domU that it is unnecessary to validate the checksum (provided the checksum is valid) by setting the csum bit. If domU is not told that validation is unnecessary, it validates the checksum itself.

Traffic generated by domU: domU does not compute the TCP/UDP checksum and notifies dom0 of this via the csum bit in netif_tx_request_t. dom0 checks the csum bit on each incoming packet and, if the bit is set, fills in the fields needed for hardware checksum offload (skb->csum, which is the offset at which to insert the checksum). It also sets

	skb->ip_summed = CHECKSUM_UNNECESSARY;
	skb->flags |= SKB_FDW_NO_CSUM;

ip_summed is set for the case where the packet is destined for dom0 itself, which prevents dom0 from checking the TCP/UDP checksum. Unfortunately, this flag is stomped on by both the routing and the bridging code, so I added a new skb field and a new flag, SKB_FDW_NO_CSUM. The flag is checked on transmission, and the fields that were modified by the bridging/routing code are corrected. Once they have been corrected, the adapter (if it is tx-csum capable) or the stack (via skb_checksum_help()) calculates the TCP/UDP checksum.

Performance: I ran the following test cases with netperf3 TCP_STREAM and got the following boosts (using bridging):

	domU->dom0         500Mbps
	dom0->domU          10Mbps
	domU->remote host   none
	domU->domU          70Mbps

Note: I have a small bridging patch which increases dom0 throughput; I am in the process of having it accepted into the Linux kernel. I do not yet have CPU utilization numbers (where the real gain of this patch should be), and I do not have throughput numbers for routing/NAT.

Also, I added the ability to enable/disable checksum offload via the ethtool command.

Signed-off-by: Jon Mason <jdmason@us.ibm.com>

--- ../xen-unstable-pristine/xen/include/public/io/netif.h	2005-05-04 22:20:10.000000000 -0500
+++ xen/include/public/io/netif.h	2005-05-18 12:05:41.000000000 -0500
@@ -12,7 +12,8 @@
 typedef struct {
     memory_t addr;   /*  0: Machine address of packet.  */
     MEMORY_PADDING;
-    u16      id;     /*  8: Echoed in response message. */
+    u16      csum:1;
+    u16      id:15;  /*  8: Echoed in response message. */
     u16      size;   /* 10: Packet size in bytes.       */
 } PACKED netif_tx_request_t; /* 12 bytes */
 
@@ -29,7 +30,8 @@
 typedef struct {
     memory_t addr;   /*  0: Machine address of packet.  */
     MEMORY_PADDING;
-    u16      id;     /*  8:  */
+    u16      csum:1;
+    u16      id:15;  /*  8:  */
     s16      status; /* 10: -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size.
                      */
 } PACKED netif_rx_response_t; /* 12 bytes */

--- ../xen-unstable-pristine/linux-2.6.11-xen-sparse/drivers/xen/netback/netback.c	2005-05-04 22:20:01.000000000 -0500
+++ linux-2.6.11-xen-sparse/drivers/xen/netback/netback.c	2005-05-19 13:25:50.000000000 -0500
@@ -13,6 +13,9 @@
 #include "common.h"
 #include <asm-xen/balloon.h>
 #include <asm-xen/evtchn.h>
+#include <net/ip.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
 
 #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0)
 #include <linux/delay.h>
@@ -154,10 +157,14 @@ int netif_be_start_xmit(struct sk_buff *
         __skb_put(nskb, skb->len);
         (void)skb_copy_bits(skb, -hlen, nskb->data - hlen, skb->len + hlen);
         nskb->dev = skb->dev;
+        nskb->ip_summed = skb->ip_summed;
         dev_kfree_skb(skb);
         skb = nskb;
     }
 
+    if (skb->ip_summed > 0)
+        netif->rx->ring[MASK_NETIF_RX_IDX(netif->rx_resp_prod)].resp.csum = 1;
+
     netif->rx_req_cons++;
     netif_get(netif);
 
@@ -646,6 +653,18 @@ static void net_tx_action(unsigned long
         skb->dev      = netif->dev;
         skb->protocol = eth_type_trans(skb, skb->dev);
 
+        skb->csum = 0;
+        if (txreq.csum) {
+            skb->ip_summed = CHECKSUM_UNNECESSARY;
+            skb->flags |= SKB_FDW_NO_CSUM;
+            skb->nh.iph = (struct iphdr *) skb->data;
+            if (skb->nh.iph->protocol == IPPROTO_TCP)
+                skb->csum = offsetof(struct tcphdr, check);
+            if (skb->nh.iph->protocol == IPPROTO_UDP)
+                skb->csum = offsetof(struct udphdr, check);
+        } else
+            skb->ip_summed = CHECKSUM_NONE;
+
         netif->stats.rx_bytes += txreq.size;
         netif->stats.rx_packets++;

--- ../xen-unstable-pristine/linux-2.6.11-xen-sparse/drivers/xen/netback/interface.c	2005-05-04 22:20:09.000000000 -0500
+++ linux-2.6.11-xen-sparse/drivers/xen/netback/interface.c	2005-05-20 10:36:14.000000000 -0500
@@ -159,6 +159,7 @@ void netif_create(netif_be_create_t *cre
     dev->get_stats       = netif_be_get_stats;
     dev->open            = net_open;
     dev->stop            = net_close;
+    dev->features        = NETIF_F_NO_CSUM;
 
     /* Disable queuing. */
     dev->tx_queue_len = 0;

--- ../xen-unstable-pristine/linux-2.6.11-xen-sparse/drivers/xen/netfront/netfront.c	2005-05-04 22:20:11.000000000 -0500
+++ linux-2.6.11-xen-sparse/drivers/xen/netfront/netfront.c	2005-05-20 13:15:39.000000000 -0500
@@ -40,6 +40,7 @@
 #include <linux/init.h>
 #include <linux/bitops.h>
 #include <linux/proc_fs.h>
+#include <linux/ethtool.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
 #include <net/arp.h>
@@ -287,6 +288,11 @@ static int send_fake_arp(struct net_devi
     return dev_queue_xmit(skb);
 }
 
+static struct ethtool_ops network_ethtool_ops = {
+    .get_tx_csum = ethtool_op_get_tx_csum,
+    .set_tx_csum = ethtool_op_set_tx_csum,
+};
+
 static int network_open(struct net_device *dev)
 {
     struct net_private *np = netdev_priv(dev);
@@ -472,6 +478,7 @@ static int network_start_xmit(struct sk_
     tx->id   = id;
     tx->addr = virt_to_machine(skb->data);
     tx->size = skb->len;
+    tx->csum = (skb->ip_summed) ? 1 : 0;
 
     wmb(); /* Ensure that backend will see the request. */
     np->tx->req_prod = i + 1;
 
@@ -572,6 +579,9 @@ static int netif_poll(struct net_device
         skb->len  = rx->status;
         skb->tail = skb->data + skb->len;
 
+        if (rx->csum)
+            skb->ip_summed = CHECKSUM_UNNECESSARY;
+
         np->stats.rx_packets++;
         np->stats.rx_bytes += rx->status;
 
@@ -966,7 +976,9 @@ static int create_netdev(int handle, str
     dev->get_stats       = network_get_stats;
     dev->poll            = netif_poll;
     dev->weight          = 64;
-
+    dev->features        = NETIF_F_IP_CSUM;
+    SET_ETHTOOL_OPS(dev, &network_ethtool_ops);
+
     if ((err = register_netdev(dev)) != 0) {
         printk(KERN_WARNING "%s> register_netdev err=%d\n", __FUNCTION__, err);
         goto exit;

--- ../xen-unstable-pristine/linux-2.6.11-xen0/include/linux/skbuff.h	2005-03-02 01:38:38.000000000 -0600
+++ linux-2.6.11-xen0/include/linux/skbuff.h	2005-05-18 12:05:41.000000000 -0500
@@ -37,6 +37,10 @@
 #define CHECKSUM_HW 1
 #define CHECKSUM_UNNECESSARY 2
 
+#define SKB_CLONED 1
+#define SKB_NOHDR 2
+#define SKB_FDW_NO_CSUM 4
+
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
 #define SKB_MAX_ORDER(X, ORDER)	(((PAGE_SIZE << (ORDER)) - (X) - \
@@ -238,7 +242,7 @@ struct sk_buff {
 				mac_len,
 				csum;
 	unsigned char		local_df,
-				cloned,
+				flags,
 				pkt_type,
 				ip_summed;
 	__u32			priority;
@@ -370,7 +374,7 @@ static inline void kfree_skb(struct sk_b
  */
 static inline int skb_cloned(const struct sk_buff *skb)
 {
-	return skb->cloned && atomic_read(&skb_shinfo(skb)->dataref) != 1;
+	return (skb->flags & SKB_CLONED) && atomic_read(&skb_shinfo(skb)->dataref) != 1;
 }
 
 /**

--- ../xen-unstable-pristine/linux-2.6.11-xen0/net/core/skbuff.c	2005-03-02 01:38:17.000000000 -0600
+++ linux-2.6.11-xen0/net/core/skbuff.c	2005-05-18 12:05:41.000000000 -0500
@@ -240,7 +240,7 @@ static void skb_clone_fraglist(struct sk
 
 void skb_release_data(struct sk_buff *skb)
 {
-	if (!skb->cloned ||
+	if (!(skb->flags & SKB_CLONED) ||
 	    atomic_dec_and_test(&(skb_shinfo(skb)->dataref))) {
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
@@ -352,7 +352,7 @@ struct sk_buff *skb_clone(struct sk_buff
 	C(data_len);
 	C(csum);
 	C(local_df);
-	n->cloned = 1;
+	n->flags = skb->flags | SKB_CLONED;
 	C(pkt_type);
 	C(ip_summed);
 	C(priority);
@@ -395,7 +395,7 @@ struct sk_buff *skb_clone(struct sk_buff
 	C(end);
 
 	atomic_inc(&(skb_shinfo(skb)->dataref));
-	skb->cloned = 1;
+	skb->flags |= SKB_CLONED;
 
 	return n;
 }
@@ -603,7 +603,7 @@ int pskb_expand_head(struct sk_buff *skb
 	skb->mac.raw += off;
 	skb->h.raw   += off;
 	skb->nh.raw  += off;
-	skb->cloned   = 0;
+	skb->flags   &= ~SKB_CLONED;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;

--- ../xen-unstable-pristine/linux-2.6.11-xen0/net/core/dev.c	2005-03-02 01:38:09.000000000 -0600
+++ linux-2.6.11-xen0/net/core/dev.c	2005-05-20 10:20:36.000000000 -0500
@@ -98,6 +98,7 @@
 #include <linux/stat.h>
 #include <linux/if_bridge.h>
 #include <linux/divert.h>
+#include <net/ip.h>
 #include <net/dst.h>
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
@@ -1182,7 +1183,7 @@ int __skb_linearize(struct sk_buff *skb,
 	skb->data += offset;
 
 	/* We are no longer a clone, even if we were. */
-	skb->cloned = 0;
+	skb->flags &= ~SKB_CLONED;
 
 	skb->tail += skb->data_len;
 	skb->data_len = 0;
@@ -1236,6 +1237,15 @@ int dev_queue_xmit(struct sk_buff *skb)
 	    __skb_linearize(skb, GFP_ATOMIC))
 		goto out_kfree_skb;
 
+	/* If packet is forwarded to a device that needs a checksum and not
+	 * checksummed, correct the pointers and enable checksumming in the
+	 * next function.
+	 */
+	if (skb->flags & SKB_FDW_NO_CSUM) {
+		skb->ip_summed = CHECKSUM_HW;
+		skb->h.raw = (void *)skb->nh.iph + (skb->nh.iph->ihl * 4);
+	}
+
 	/* If packet is not checksummed and device does not support
 	 * checksumming for this protocol, complete checksumming here.
 	 */

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
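To make the transmit-side contract above concrete: once the SKB_FDW_NO_CSUM fix-up has run, a device without tx checksum support falls back to skb_checksum_help(), which treats skb->csum as the offset of the checksum field within the transport header. The following is a condensed, illustrative sketch of that software fallback using 2.6.11-era skb field names -- it is not the exact kernel code, the helper name complete_tx_csum is made up, and it assumes the sending stack already seeded the checksum field with the pseudo-header sum, as Linux does for CHECKSUM_HW transmits:

    #include <linux/skbuff.h>
    #include <net/checksum.h>

    /*
     * Roughly what skb_checksum_help() (or a tx-csum-capable NIC) does with
     * a CHECKSUM_HW skb: sum everything from the transport header to the end
     * of the packet and write the folded result at skb->h.raw + skb->csum.
     */
    static void complete_tx_csum(struct sk_buff *skb)
    {
            unsigned int off  = skb->h.raw - skb->data;   /* start of L4 header */
            unsigned int csum = skb_checksum(skb, off, skb->len - off, 0);

            /* skb->csum holds the offset of the checksum field within the
             * transport header, e.g. offsetof(struct tcphdr, check). */
            *(u16 *)(skb->h.raw + skb->csum) = csum_fold(csum);
            skb->ip_summed = CHECKSUM_NONE;
    }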
On 21 May 2005, at 00:30, Jon Mason wrote:

> Traffic generated externally, if rx hardware checksum is available and
> enabled, then dom0 will notify domU that it is unnecessary to validate
> this checksum (providing the checksum is valid) by enabling the csum
> bit. If domU is not notified that it is unnecessary to validate the
> checksum, then domU will do it.

Unfortunately you can't trust the ip_summed flag because, as you point out yourself, the bridge and IP forwarding paths both clobber it to CHECKSUM_NONE. This puts us in a pickle: without hacking in some more info we have no way to know whether the physical interface (eth0, say) summed the packet or not. And, if it did, whether it was a CHECKSUM_UNNECESSARY or a CHECKSUM_HW kind of summing (they differ in how you interpret the result).

Your patch as it stands is only correct if eth0 sets ip_summed==CHECKSUM_UNNECESSARY on received packets.

 -- Keir

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
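The distinction matters because the two values are consumed differently on the receive side. A simplified sketch of how a 2.6-era receive path interprets ip_summed, loosely modelled on tcp_v4_checksum_init(); the function below is illustrative only and not actual kernel code:

    #include <linux/skbuff.h>
    #include <linux/in.h>
    #include <net/checksum.h>

    /* Returns non-zero if the TCP checksum can be trusted. Illustrative only. */
    static int rx_l4_csum_ok(struct sk_buff *skb, u32 saddr, u32 daddr)
    {
            if (skb->ip_summed == CHECKSUM_UNNECESSARY)
                    return 1;       /* hardware already verified the checksum */

            if (skb->ip_summed == CHECKSUM_HW)
                    /* hardware left a raw sum of the segment in skb->csum;
                     * it must still be folded with the pseudo-header. */
                    return csum_tcpudp_magic(saddr, daddr, skb->len,
                                             IPPROTO_TCP, skb->csum) == 0;

            /* CHECKSUM_NONE: no help from hardware, checksum it in software. */
            skb->csum = csum_tcpudp_nofold(saddr, daddr, skb->len,
                                           IPPROTO_TCP, 0);
            return csum_fold(skb_checksum(skb, 0, skb->len, skb->csum)) == 0;
    }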
On 21 May 2005, at 15:53, Keir Fraser wrote:

>> Traffic generated externally, if rx hardware checksum is available and
>> enabled, then dom0 will notify domU that it is unnecessary to validate
>> this checksum (providing the checksum is valid) by enabling the csum
>> bit. If domU is not notified that it is unnecessary to validate the
>> checksum, then domU will do it.
>
> Unfortunately you can't trust the ip_summed flag because, as you point
> out yourself, the bridge and IP forwarding paths both clobber it to
> CHECKSUM_NONE. This puts us in a pickle: without hacking in some more
> info we have no way to know whether the physical interface (eth0, say)
> summed the packet or not. And, if it did, whether it was a
> CHECKSUM_UNNECESSARY or a CHECKSUM_HW kind of summing (they differ in
> how you interpret the result).
>
> Your patch as it stands is only correct if eth0 sets
> ip_summed==CHECKSUM_UNNECESSARY on received packets.

I've checked in a modified version of your patch that hopefully deals with propagating checksum information in both directions across a virtual bridge or router. I replaced your skb flags with two new ones -- proto_csum_blank and proto_csum_valid.

The former indicates that the protocol-level checksum needs filling in. This is not a problem for local processing, but the flag is picked up before sending to a physical interface and fixed up.

The latter indicates that the proto-level checksum has been validated since arrival at localhost (*or* that the packet originated from a domU on localhost). This flag survives crossing a bridge/router, so we can trust it when deciding if checksum validation is required.

I'll push the patch to the bkbits repository just as soon as bkbits rematerialises. :-)

If you have any performance or stress tests that you were using to test checksum offloading, it would be great to find out how they perform on the checked-in version!

 Thanks,
 Keir

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
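Based on that description (the checked-in code may differ in detail), the two bits would be consumed roughly as follows. The struct members and helper names below simply mirror the wording of the mail and are not taken from the repository; treat this as a sketch, not the implementation:

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    /* Transmit side: fix up a blank checksum just before a physical NIC. */
    static int fixup_before_phys_tx(struct sk_buff *skb, struct net_device *dev)
    {
            if (skb->proto_csum_blank) {
                    /* Checksum was never filled in: let a tx-csum-capable NIC
                     * do it, or complete it in software. */
                    skb->ip_summed = CHECKSUM_HW;
                    if (!(dev->features & (NETIF_F_IP_CSUM | NETIF_F_HW_CSUM)))
                            return skb_checksum_help(skb);
            }
            return 0;
    }

    /* Receive side in the consuming domain. */
    static void mark_rx_csum(struct sk_buff *skb)
    {
            if (skb->proto_csum_valid)
                    skb->ip_summed = CHECKSUM_UNNECESSARY;  /* skip validation */
            else
                    skb->ip_summed = CHECKSUM_NONE;         /* validate in sw  */
    }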
On Saturday 21 May 2005 02:16 pm, Keir Fraser wrote:> On 21 May 2005, at 15:53, Keir Fraser wrote: > >> Traffic generated externally, if rx hardware checksum is available and > >> enabled, then dom0 will notify domU that it is unnecessary to validate > >> this checksum (providing the checksum is valid) by enabling the csum > >> bit. If domU is not notified that it is unnecessary to vaildate the > >> checksum, then domU will do it. > > > > Unfortunately you can''t trust the ip_summed flag because, as you point > > out yourself, the bridge and IP forwarding paths both clobber it to > > CHECKSUM_NONE. This puts us in a pickle: without hacking in some more > > info we have no way to know whether the physical interface (eth0, say) > > summed the packet or not. And, if it did, whether it was a > > CHECKSUM_UNNECESSARY or a CHECKSUM_HW kind of summing (they differ in > > how you interpret the result). > > > > Your patch as its stands is only correct if eth0 sets > > ip_summed==CHECKSUM_UNNECESSARY on received packets.Silly mistake on my part. Good catch.> I''ve checked in a modified version of your patch that hopefully deals > with propagating checksum information in both directions across a > virtual bridge or router. I replaced your skb flags with two new ones > -- proto_csum_blank and proto_csum_valid. > > The former indicates that the protocol-level checksum needs filling in. > This is not a problem for local processing, but the flag is picked up > before sending to a physical interface and fixed up. > > The latter indicates that the proto-level checksum has been validated > since arrival at localhost (*or* that the packet originated from a domU > on localhost). This flag survives crossing a bridge/router so we can > trust it when deciding if checksum validation is required. > > I''ll push the patch to the bkbits repository just as soon as bkbits > rematerialises. :-)I''d be interested in seeing the bits you added.> If you have any performance or stress tests that you were using to test > checksum offloading, it would be great to find out how they perform on > the checked-in version!I am happy to give the latest patch some testing (thought I probably won''t be able Monday). Thanks, Jon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> I''ve checked in a modified version of your patch that hopefully deals > with propagating checksum information in both directions across a > virtual bridge or router. I replaced your skb flags with two new ones > -- proto_csum_blank and proto_csum_valid. > > The former indicates that the protocol-level checksum needs filling > in. This is not a problem for local processing, but the flag is > picked up before sending to a physical interface and fixed up. > > The latter indicates that the proto-level checksum has been validated > since arrival at localhost (*or* that the packet originated from a > domU on localhost). This flag survives crossing a bridge/router so we > can trust it when deciding if checksum validation is required. > > I''ll push the patch to the bkbits repository just as soon as bkbits > rematerialises. :-) > > If you have any performance or stress tests that you were using to > test checksum offloading, it would be great to find out how they > perform on the checked-in version!Now that BK is up, I''ll run some netperf tests before/after that changeset and see what we get. -Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
It seems to break the interdomain ssh and nfs on my machine. Digging for reasons. - Bin On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote:> > I''ve checked in a modified version of your patch that hopefully deals > > with propagating checksum information in both directions across a > > virtual bridge or router. I replaced your skb flags with two new ones > > -- proto_csum_blank and proto_csum_valid. > > > > The former indicates that the protocol-level checksum needs filling > > in. This is not a problem for local processing, but the flag is > > picked up before sending to a physical interface and fixed up. > > > > The latter indicates that the proto-level checksum has been validated > > since arrival at localhost (*or* that the packet originated from a > > domU on localhost). This flag survives crossing a bridge/router so we > > can trust it when deciding if checksum validation is required. > > > > I''ll push the patch to the bkbits repository just as soon as bkbits > > rematerialises. :-) > > > > If you have any performance or stress tests that you were using to > > test checksum offloading, it would be great to find out how they > > perform on the checked-in version! > > Now that BK is up, I''ll run some netperf tests before/after that > changeset and see what we get. > > -Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 10:31, Bin Ren wrote:> It seems to break the interdomain ssh and nfs on my machine. Digging > for reasons.Are you using bridge or network model?> > - Bin > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > I''ve checked in a modified version of your patch that hopefully > > > deals with propagating checksum information in both directions > > > across a virtual bridge or router. I replaced your skb flags with > > > two new ones -- proto_csum_blank and proto_csum_valid. > > > > > > The former indicates that the protocol-level checksum needs > > > filling in. This is not a problem for local processing, but the > > > flag is picked up before sending to a physical interface and > > > fixed up. > > > > > > The latter indicates that the proto-level checksum has been > > > validated since arrival at localhost (*or* that the packet > > > originated from a domU on localhost). This flag survives crossing > > > a bridge/router so we can trust it when deciding if checksum > > > validation is required. > > > > > > I''ll push the patch to the bkbits repository just as soon as > > > bkbits rematerialises. :-) > > > > > > If you have any performance or stress tests that you were using > > > to test checksum offloading, it would be great to find out how > > > they perform on the checked-in version! > > > > Now that BK is up, I''ll run some netperf tests before/after that > > changeset and see what we get. > > > > -Andrew > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I''m using bridge and stock scripts. I start to doubt it''s caused csum offloading, as I''m seeing some weird things. (1) it''s possible to do interdomain iperf, which binds to ports > 1024 (2) ssh and nfs don''t work. In both cases, dom0 is the server, dom1 is the client. tcpdump on dom0 doesn''t show any incoming packets from dom1. I''m recompiling everything again. Cheers, Bin On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote:> On Monday 23 May 2005 10:31, Bin Ren wrote: > > It seems to break the interdomain ssh and nfs on my machine. Digging > > for reasons. > > Are you using bridge or network model? > > > > - Bin > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > > I''ve checked in a modified version of your patch that hopefully > > > > deals with propagating checksum information in both directions > > > > across a virtual bridge or router. I replaced your skb flags with > > > > two new ones -- proto_csum_blank and proto_csum_valid. > > > > > > > > The former indicates that the protocol-level checksum needs > > > > filling in. This is not a problem for local processing, but the > > > > flag is picked up before sending to a physical interface and > > > > fixed up. > > > > > > > > The latter indicates that the proto-level checksum has been > > > > validated since arrival at localhost (*or* that the packet > > > > originated from a domU on localhost). This flag survives crossing > > > > a bridge/router so we can trust it when deciding if checksum > > > > validation is required. > > > > > > > > I''ll push the patch to the bkbits repository just as soon as > > > > bkbits rematerialises. :-) > > > > > > > > If you have any performance or stress tests that you were using > > > > to test checksum offloading, it would be great to find out how > > > > they perform on the checked-in version! > > > > > > Now that BK is up, I''ll run some netperf tests before/after that > > > changeset and see what we get. > > > > > > -Andrew > > > > > > _______________________________________________ > > > Xen-devel mailing list > > > Xen-devel@lists.xensource.com > > > http://lists.xensource.com/xen-devel > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Start from fresh again. The same weird symptoms. - Bin On 5/23/05, Bin Ren <bin.ren@gmail.com> wrote:> I''m using bridge and stock scripts. I start to doubt it''s caused csum > offloading, as I''m seeing some weird things. (1) it''s possible to do > interdomain iperf, which binds to ports > 1024 (2) ssh and nfs don''t > work. In both cases, dom0 is the server, dom1 is the client. tcpdump > on dom0 doesn''t show any incoming packets from dom1. > > I''m recompiling everything again. > > Cheers, > Bin > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > On Monday 23 May 2005 10:31, Bin Ren wrote: > > > It seems to break the interdomain ssh and nfs on my machine. Digging > > > for reasons. > > > > Are you using bridge or network model? > > > > > > - Bin > > > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > > > I''ve checked in a modified version of your patch that hopefully > > > > > deals with propagating checksum information in both directions > > > > > across a virtual bridge or router. I replaced your skb flags with > > > > > two new ones -- proto_csum_blank and proto_csum_valid. > > > > > > > > > > The former indicates that the protocol-level checksum needs > > > > > filling in. This is not a problem for local processing, but the > > > > > flag is picked up before sending to a physical interface and > > > > > fixed up. > > > > > > > > > > The latter indicates that the proto-level checksum has been > > > > > validated since arrival at localhost (*or* that the packet > > > > > originated from a domU on localhost). This flag survives crossing > > > > > a bridge/router so we can trust it when deciding if checksum > > > > > validation is required. > > > > > > > > > > I''ll push the patch to the bkbits repository just as soon as > > > > > bkbits rematerialises. :-) > > > > > > > > > > If you have any performance or stress tests that you were using > > > > > to test checksum offloading, it would be great to find out how > > > > > they perform on the checked-in version! > > > > > > > > Now that BK is up, I''ll run some netperf tests before/after that > > > > changeset and see what we get. > > > > > > > > -Andrew > > > > > > > > _______________________________________________ > > > > Xen-devel mailing list > > > > Xen-devel@lists.xensource.com > > > > http://lists.xensource.com/xen-devel > > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
You can disable the checksum "offload" with ethtool (in domU). "ethtool -k eth0" will show whether it is enabled or not. "ethtool -K eth0 tx off" will disable it. "ethtool -K eth0 tx on" will enable it. I tested it throughly with bridging before I submitted the patch, so it should be working. I''ll download the latest source and verify that it works on my test system. Thanks for your help, Jon On Monday 23 May 2005 11:06 am, Bin Ren wrote:> Start from fresh again. The same weird symptoms. > > - Bin > > On 5/23/05, Bin Ren <bin.ren@gmail.com> wrote: > > I''m using bridge and stock scripts. I start to doubt it''s caused csum > > offloading, as I''m seeing some weird things. (1) it''s possible to do > > interdomain iperf, which binds to ports > 1024 (2) ssh and nfs don''t > > work. In both cases, dom0 is the server, dom1 is the client. tcpdump > > on dom0 doesn''t show any incoming packets from dom1. > > > > I''m recompiling everything again. > > > > Cheers, > > Bin > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > On Monday 23 May 2005 10:31, Bin Ren wrote: > > > > It seems to break the interdomain ssh and nfs on my machine. Digging > > > > for reasons. > > > > > > Are you using bridge or network model? > > > > > > > - Bin > > > > > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > > > > I''ve checked in a modified version of your patch that hopefully > > > > > > deals with propagating checksum information in both directions > > > > > > across a virtual bridge or router. I replaced your skb flags with > > > > > > two new ones -- proto_csum_blank and proto_csum_valid. > > > > > > > > > > > > The former indicates that the protocol-level checksum needs > > > > > > filling in. This is not a problem for local processing, but the > > > > > > flag is picked up before sending to a physical interface and > > > > > > fixed up. > > > > > > > > > > > > The latter indicates that the proto-level checksum has been > > > > > > validated since arrival at localhost (*or* that the packet > > > > > > originated from a domU on localhost). This flag survives crossing > > > > > > a bridge/router so we can trust it when deciding if checksum > > > > > > validation is required. > > > > > > > > > > > > I''ll push the patch to the bkbits repository just as soon as > > > > > > bkbits rematerialises. :-) > > > > > > > > > > > > If you have any performance or stress tests that you were using > > > > > > to test checksum offloading, it would be great to find out how > > > > > > they perform on the checked-in version! > > > > > > > > > > Now that BK is up, I''ll run some netperf tests before/after that > > > > > changeset and see what we get. > > > > > > > > > > -Andrew > > > > > > > > > > _______________________________________________ > > > > > Xen-devel mailing list > > > > > Xen-devel@lists.xensource.com > > > > > http://lists.xensource.com/xen-devel > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir has removed ''SET_ETHTOOL_OPS(dev, &network_ethtool_ops);'' from your patch. The operations are not supported. - Bin On 5/23/05, Jon Mason <jdmason@us.ibm.com> wrote:> You can disable the checksum "offload" with ethtool (in domU). > "ethtool -k eth0" will show whether it is enabled or not. > "ethtool -K eth0 tx off" will disable it. > "ethtool -K eth0 tx on" will enable it. > > I tested it throughly with bridging before I submitted the patch, so it should > be working. I''ll download the latest source and verify that it works on my > test system. > > Thanks for your help, > Jon > > On Monday 23 May 2005 11:06 am, Bin Ren wrote: > > Start from fresh again. The same weird symptoms. > > > > - Bin > > > > On 5/23/05, Bin Ren <bin.ren@gmail.com> wrote: > > > I''m using bridge and stock scripts. I start to doubt it''s caused csum > > > offloading, as I''m seeing some weird things. (1) it''s possible to do > > > interdomain iperf, which binds to ports > 1024 (2) ssh and nfs don''t > > > work. In both cases, dom0 is the server, dom1 is the client. tcpdump > > > on dom0 doesn''t show any incoming packets from dom1. > > > > > > I''m recompiling everything again. > > > > > > Cheers, > > > Bin > > > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > > On Monday 23 May 2005 10:31, Bin Ren wrote: > > > > > It seems to break the interdomain ssh and nfs on my machine. Digging > > > > > for reasons. > > > > > > > > Are you using bridge or network model? > > > > > > > > > - Bin > > > > > > > > > > On 5/23/05, Andrew Theurer <habanero@us.ibm.com> wrote: > > > > > > > I''ve checked in a modified version of your patch that hopefully > > > > > > > deals with propagating checksum information in both directions > > > > > > > across a virtual bridge or router. I replaced your skb flags with > > > > > > > two new ones -- proto_csum_blank and proto_csum_valid. > > > > > > > > > > > > > > The former indicates that the protocol-level checksum needs > > > > > > > filling in. This is not a problem for local processing, but the > > > > > > > flag is picked up before sending to a physical interface and > > > > > > > fixed up. > > > > > > > > > > > > > > The latter indicates that the proto-level checksum has been > > > > > > > validated since arrival at localhost (*or* that the packet > > > > > > > originated from a domU on localhost). This flag survives crossing > > > > > > > a bridge/router so we can trust it when deciding if checksum > > > > > > > validation is required. > > > > > > > > > > > > > > I''ll push the patch to the bkbits repository just as soon as > > > > > > > bkbits rematerialises. :-) > > > > > > > > > > > > > > If you have any performance or stress tests that you were using > > > > > > > to test checksum offloading, it would be great to find out how > > > > > > > they perform on the checked-in version! > > > > > > > > > > > > Now that BK is up, I''ll run some netperf tests before/after that > > > > > > changeset and see what we get. 
> > > > > > > > > > > > -Andrew > > > > > > > > > > > > _______________________________________________ > > > > > > Xen-devel mailing list > > > > > > Xen-devel@lists.xensource.com > > > > > > http://lists.xensource.com/xen-devel > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 23 May 2005, at 17:36, Bin Ren wrote:> Keir has removed ''SET_ETHTOOL_OPS(dev, &network_ethtool_ops);'' from > your patch. The operations are not supported.Ah, I thought that was just testing infrastructure. I''ll take a patch to add the ethtool ops back in. Bin -- does your domain0 traffic get delivered via the bridge device or via the new vif0.0/veth0 that I added? If the former you might want to try updating your /etc/xen/scripts/network script. Although delivery via the bridge ought to work... -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
It''s via the new vif0.0/veth0. I did tcpdump on vif1.0 in dom0 and saw packets sent by dom0, but got dropped by the netfront on dom1. Cheers, Bin On 5/23/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> > On 23 May 2005, at 17:36, Bin Ren wrote: > > > Keir has removed ''SET_ETHTOOL_OPS(dev, &network_ethtool_ops);'' from > > your patch. The operations are not supported. > > Ah, I thought that was just testing infrastructure. I''ll take a patch > to add the ethtool ops back in. > > Bin -- does your domain0 traffic get delivered via the bridge device or > via the new vif0.0/veth0 that I added? If the former you might want to > try updating your /etc/xen/scripts/network script. Although delivery > via the bridge ought to work... > > -- Keir > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
thanks Bin, I''ll take a look at that. Jon On Monday 23 May 2005 01:08 pm, Bin Ren wrote:> It''s via the new vif0.0/veth0. I did tcpdump on vif1.0 in dom0 and saw > packets sent by dom0, but got dropped by the netfront on dom1. > > Cheers, > Bin > > On 5/23/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote: > > On 23 May 2005, at 17:36, Bin Ren wrote: > > > Keir has removed ''SET_ETHTOOL_OPS(dev, &network_ethtool_ops);'' from > > > your patch. The operations are not supported. > > > > Ah, I thought that was just testing infrastructure. I''ll take a patch > > to add the ethtool ops back in. > > > > Bin -- does your domain0 traffic get delivered via the bridge device or > > via the new vif0.0/veth0 that I added? If the former you might want to > > try updating your /etc/xen/scripts/network script. Although delivery > > via the bridge ought to work... > > > > -- Keir > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I think I found the problem, and I've checked in a fix.

Bin: can you try dom0->domU networking with the latest unstable tree? Hopefully your problem is fixed.

As further work on this, I think I chose a bad name for the 'proto_csum_valid' field because sometimes it is set for local packets that have had no csum poked into the packet at all. Something like 'proto_data_valid' might be better. And communicating this information between domains (i.e., that the csum field is blank, but the packet data is known good anyway) would be nice. Then domU can decide to add the checksum if it passes the packet off to a context that expects a valid checksum.

 -- Keir

On 23 May 2005, at 19:18, Jon Mason wrote:

> thanks Bin,
> I'll take a look at that.
>
> Jon
>
> On Monday 23 May 2005 01:08 pm, Bin Ren wrote:
>> It's via the new vif0.0/veth0. I did tcpdump on vif1.0 in dom0 and saw
>> packets sent by dom0, but got dropped by the netfront on dom1.
>>
>> Cheers,
>> Bin

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Fantastic! It''s working :-D Thanks a great deal, Bin On 5/23/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> > I think I found the problem, and I''ve checked in a fix. > > Bin: can you try dom0->domU networking with latest unstable tree? > Hopefully your problem is fixed. > > As further work on this, I think I chose a bad name for the > ''proto_csum_valid'' field because sometimes it is set for local packets > that have had no csum poked into the packet at all. Something like > ''proto_data_valid'' might be better. And communicating this information > between domains (i.e., that the csum field is blank, but the packet > data is known good anyway) would be nice. Then domU can decide to add > the checksum if it passes the packet off to a context that expects a > valid checksum. > > -- Keir > > On 23 May 2005, at 19:18, Jon Mason wrote: > > > thanks Bin, > > I''ll take a look at that. > > > > Jon > > > > On Monday 23 May 2005 01:08 pm, Bin Ren wrote: > >> It''s via the new vif0.0/veth0. I did tcpdump on vif1.0 in dom0 and saw > >> packets sent by dom0, but got dropped by the netfront on dom1. > >> > >> Cheers, > >> Bin > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I''ve added the support for ethtools. By turning on and off netfront checksum offloading, I''m getting the following throughput numbers, using iperf. Each test was run three times. CPU usages are quite similar in two cases (''top'' output). Looks like checksum computation is not a major overhead in domU networking. dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading turned on. With Tx checksum on: dom1->dom2: 300Mb/s (dom0 cpu maxed out by software interrupts) dom1->dom0: 459Mb/s (dom0 cpu 80% in SI, dom1 cpu 20% in SI) dom1->external: 439Mb/s (over 1Gb/s ethernet) (dom0 cpu 50% in SI, dom1 60% in SI) With Tx checksum off: dom1->dom2: 301Mb/s dom1->dom0: 454Mb/s dom1->externel: 437Mb/s (over 1Gb/s ethernet) On 5/23/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> > On 23 May 2005, at 17:36, Bin Ren wrote: > > > Keir has removed ''SET_ETHTOOL_OPS(dev, &network_ethtool_ops);'' from > > your patch. The operations are not supported. > > Ah, I thought that was just testing infrastructure. I''ll take a patch > to add the ethtool ops back in. > > Bin -- does your domain0 traffic get delivered via the bridge device or > via the new vif0.0/veth0 that I added? If the former you might want to > try updating your /etc/xen/scripts/network script. Although delivery > via the bridge ought to work... > > -- Keir > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 23 May 2005, at 20:55, Bin Ren wrote:> I''ve added the support for ethtools. By turning on and off netfront > checksum offloading, I''m getting the following throughput numbers, > using iperf. Each test was run three times. CPU usages are quite > similar in two cases (''top'' output). Looks like checksum computation > is not a major overhead in domU networking. > > dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading > turned on.What happens to CPU usage in dom1 when tx checksumming is disabled? Overall though these are the kind of results I would expect. Linux usually does csumming at the same time as it has to do a copy anyway, and it ends up being limited by memory/L2-cache bandwidth, not the extra computation. But the offload extensions haven''t cost much to implement and there are probably cases where it helps a little. Maybe I''m being pessimistic though: Can you reproduce the rather more impressive speedups that you previously saw, Jon? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 03:13 pm, Keir Fraser wrote:> On 23 May 2005, at 20:55, Bin Ren wrote: > > I''ve added the support for ethtools. By turning on and off netfront > > checksum offloading, I''m getting the following throughput numbers, > > using iperf. Each test was run three times. CPU usages are quite > > similar in two cases (''top'' output). Looks like checksum computation > > is not a major overhead in domU networking. > > > > dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading > > turned on. > > What happens to CPU usage in dom1 when tx checksumming is disabled? > > Overall though these are the kind of results I would expect. Linux > usually does csumming at the same time as it has to do a copy anyway, > and it ends up being limited by memory/L2-cache bandwidth, not the > extra computation. But the offload extensions haven''t cost much to > implement and there are probably cases where it helps a little. > > Maybe I''m being pessimistic though: Can you reproduce the rather more > impressive speedups that you previously saw, Jon?I would if I could. As I don''t use BK, I''ll have to wait for the nightly build to pull in your latest fix. Thanks, Jon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> Overall though these are the kind of results I would expect. > Linux usually does csumming at the same time as it has to do > a copy anyway, and it ends up being limited by > memory/L2-cache bandwidth, not the extra computation. But the > offload extensions haven''t cost much to implement and there > are probably cases where it helps a little. > > Maybe I''m being pessimistic though: Can you reproduce the > rather more impressive speedups that you previously saw, Jon?We should be getting some benefit on the receive path, where the checksum is normally forced to happen independent of a copy. Having this offloaded to hardware should produce some measureable gain. Bin: The numbers you''re seeing are terrible anyway. You should be seeing 890Mb/s for external traffic. What kind of machine is this on? Ian _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> I would if I could. As I don''t use BK, I''ll have to wait for > the nightly build to pull in your latest fix.Jon, Do you know about either of the following? http://www.bitkeeper.com/press/2005-03-17.html http://sourceforge.net/projects/sourcepuller/ I haven''t used either myself, but I''d be interested to know whether they work. Thanks, Ian _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 23 May 2005, at 21:22, Ian Pratt wrote:>> Maybe I''m being pessimistic though: Can you reproduce the >> rather more impressive speedups that you previously saw, Jon? > > We should be getting some benefit on the receive path, where the > checksum is normally forced to happen independent of a copy. Having > this > offloaded to hardware should produce some measureable gain.Ah, I forgot about that. But rx csum was not being toggled in the experiment. The external bandwidth was definitely very low, so I guess there must be some other bottleneck in Bin''s setup. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 03:38 pm, Keir Fraser wrote:> On 23 May 2005, at 21:22, Ian Pratt wrote: > >> Maybe I''m being pessimistic though: Can you reproduce the > >> rather more impressive speedups that you previously saw, Jon? > > > > We should be getting some benefit on the receive path, where the > > checksum is normally forced to happen independent of a copy. Having > > this > > offloaded to hardware should produce some measureable gain. > > Ah, I forgot about that. But rx csum was not being toggled in the > experiment. > > The external bandwidth was definitely very low, so I guess there must > be some other bottleneck in Bin''s setup.I was using 256MB/domain in my test system (and Bin is using 128MB). That might be the bottleneck. Thanks, Jon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Machines spec: External server: CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 1024K (64 bytes/line) CPU: AMD Athlon(tm) 64 Processor 3200+ stepping 08 Memory: 1024M DDR400 CAS 3 NIC: 1Gb/s Intel Pro/1000 MT Desktop Xen machine: CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 256K (64 bytes/line) CPU: AMD Sempron(tm) 2200+ stepping 01 Memory: 1024M DDR400 CAS 3 NIC: 1Gb/s Intel Pro/1000 MT Desktop The highest number I''m seeing here is 760Mbps running native linux on the Xen machine. dom0->external server gets 650Mbps. dom1->external server is definitely low using the default BVT. I''ve recently implemented a Xen scheduler based on Earliest Eligible Virtual Deadline First, which gives 610Mbps for dom1->external, ~50% improvement over BVT. I''m still figuring out why. On 5/23/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:> > Overall though these are the kind of results I would expect. > > Linux usually does csumming at the same time as it has to do > > a copy anyway, and it ends up being limited by > > memory/L2-cache bandwidth, not the extra computation. But the > > offload extensions haven''t cost much to implement and there > > are probably cases where it helps a little. > > > > Maybe I''m being pessimistic though: Can you reproduce the > > rather more impressive speedups that you previously saw, Jon? > > We should be getting some benefit on the receive path, where the > checksum is normally forced to happen independent of a copy. Having this > offloaded to hardware should produce some measureable gain. > > Bin: The numbers you''re seeing are terrible anyway. You should be seeing > 890Mb/s for external traffic. What kind of machine is this on? > > Ian >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 16:01, Bin Ren wrote:> Machines spec: > > External server: > CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) > CPU: L2 Cache: 1024K (64 bytes/line) > CPU: AMD Athlon(tm) 64 Processor 3200+ stepping 08 > Memory: 1024M DDR400 CAS 3 > NIC: 1Gb/s Intel Pro/1000 MT Desktop > > Xen machine: > CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) > CPU: L2 Cache: 256K (64 bytes/line) > CPU: AMD Sempron(tm) 2200+ stepping 01 > Memory: 1024M DDR400 CAS 3 > NIC: 1Gb/s Intel Pro/1000 MT Desktop > > The highest number I''m seeing here is 760Mbps running native linux on > the Xen machine.This still seems kind of low. With netperf tcp_stream test I see 940 Mbps, basically wire speed with somewhere around 30% cpu on a P4 Xeon. Do you know the cpu util for native linux test? -Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Bin Ren wrote:> I''ve added the support for ethtools. By turning on and off netfront > checksum offloading, I''m getting the following throughput numbers, > using iperf. Each test was run three times. CPU usages are quite > similar in two cases (''top'' output). Looks like checksum computation > is not a major overhead in domU networking. > > dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading turned on.Yeah, if you want to do anything network intensive, 128MB is just not enough - you really need more memory in your system.> With Tx checksum on: > > dom1->dom2: 300Mb/s (dom0 cpu maxed out by software interrupts) > dom1->dom0: 459Mb/s (dom0 cpu 80% in SI, dom1 cpu 20% in SI) > dom1->external: 439Mb/s (over 1Gb/s ethernet) (dom0 cpu 50% in SI, > dom1 60% in SI) > > With Tx checksum off: > > dom1->dom2: 301Mb/s > dom1->dom0: 454Mb/s > dom1->externel: 437Mb/s (over 1Gb/s ethernet)iperf is a directional send test, correct? i.e. is dom1 -> dom0 perf the same as dom0 -> dom1 for you? thanks, Nivedita _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/23/05, Nivedita Singhvi <niv@us.ibm.com> wrote:> Bin Ren wrote: > > I''ve added the support for ethtools. By turning on and off netfront > > checksum offloading, I''m getting the following throughput numbers, > > using iperf. Each test was run three times. CPU usages are quite > > similar in two cases (''top'' output). Looks like checksum computation > > is not a major overhead in domU networking. > > > > dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading turned on. > > Yeah, if you want to do anything network intensive, 128MB is just > not enough - you really need more memory in your system.I''ve given all the domains 256M memory and switched to netperf TCP_STREAM (netperf -H server). almost no change. Details: dom1->external: 420Mbps dom1->dom0: 437Mbps dom0->dom1: 200Mbps (!!!) dom1->dom2: 327Mbps> > > With Tx checksum on: > > > > dom1->dom2: 300Mb/s (dom0 cpu maxed out by software interrupts) > > dom1->dom0: 459Mb/s (dom0 cpu 80% in SI, dom1 cpu 20% in SI) > > dom1->external: 439Mb/s (over 1Gb/s ethernet) (dom0 cpu 50% in SI, > > dom1 60% in SI) > > > > With Tx checksum off: > > > > dom1->dom2: 301Mb/s > > dom1->dom0: 454Mb/s > > dom1->externel: 437Mb/s (over 1Gb/s ethernet) > > > iperf is a directional send test, correct? > i.e. is dom1 -> dom0 perf the same as dom0 -> dom1 for you?Please see above. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/23/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> What happens to CPU usage in dom1 when tx checksumming is disabled?dom1->dom0: 70.7% id, 0.0% wa, 1.7% hi, 15.0% si dom1->external: 20.0% id, 0.0% wa, 0.7% hi, 60.0% si dom1->dom2: 77.7% id, 0.0% wa, 1.0% hi, 9.3% si> > Overall though these are the kind of results I would expect. Linux > usually does csumming at the same time as it has to do a copy anyway, > and it ends up being limited by memory/L2-cache bandwidth, not the > extra computation. But the offload extensions haven''t cost much to > implement and there are probably cases where it helps a little. > > Maybe I''m being pessimistic though: Can you reproduce the rather more > impressive speedups that you previously saw, Jon? > > -- Keir > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 03:13 pm, Keir Fraser wrote:> On 23 May 2005, at 20:55, Bin Ren wrote: > > I''ve added the support for ethtools. By turning on and off netfront > > checksum offloading, I''m getting the following throughput numbers, > > using iperf. Each test was run three times. CPU usages are quite > > similar in two cases (''top'' output). Looks like checksum computation > > is not a major overhead in domU networking. > > > > dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading > > turned on. > > What happens to CPU usage in dom1 when tx checksumming is disabled? > > Overall though these are the kind of results I would expect. Linux > usually does csumming at the same time as it has to do a copy anyway, > and it ends up being limited by memory/L2-cache bandwidth, not the > extra computation. But the offload extensions haven''t cost much to > implement and there are probably cases where it helps a little. > > Maybe I''m being pessimistic though: Can you reproduce the rather more > impressive speedups that you previously saw, Jon?Alright, I broke down and got a BK puller. I get the following domU->dom0 throughput on my system (using netperf3 TCP_STREAM testcase): tx on ~1580Mbps tx off ~1230Mbps with my previous patch (on Friday''s build), I was seeing the following: with patch ~1610Mbps no patch ~1100Mbps The slight difference between the two might be caused by the changes that were incorporated in xen between those dates. If you think it is worth the time, I can back port the latest patch to Friday''s build to see if that makes a difference. Thanks, Jon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/23/05, Jon Mason <jdmason@us.ibm.com> wrote:> I get the following domU->dom0 throughput on my system (using netperf3 > TCP_STREAM testcase): > tx on ~1580Mbps > tx off ~1230Mbps > > with my previous patch (on Friday''s build), I was seeing the following: > with patch ~1610Mbps > no patch ~1100MbpsI suppose you are running dom0 and dom1 on different CPUs. Is it possible for you to pin them to the same CPU and get the numbers again? It''ll show how much overhead context switches and CPU share halved could incur. Thanks a lot, Bin> The slight difference between the two might be caused by the changes that were > incorporated in xen between those dates. If you think it is worth the time, > I can back port the latest patch to Friday''s build to see if that makes a > difference. > > Thanks, > Jon > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Monday 23 May 2005 05:05 pm, Bin Ren wrote:> On 5/23/05, Jon Mason <jdmason@us.ibm.com> wrote: > > I get the following domU->dom0 throughput on my system (using netperf3 > > TCP_STREAM testcase): > > tx on ~1580Mbps > > tx off ~1230Mbps > > > > with my previous patch (on Friday''s build), I was seeing the following: > > with patch ~1610Mbps > > no patch ~1100Mbps > > I suppose you are running dom0 and dom1 on different CPUs. Is itYes, I am.> possible for you to pin them to the same CPU and get the numbers > again? It''ll show how much overhead context switches and CPU share > halved could incur.I pinned them to the same CPU and got the following: tx on ~1480Mbps tx off ~1330Mbps> > The slight difference between the two might be caused by the changes that > > were incorporated in xen between those dates. If you think it is worth > > the time, I can back port the latest patch to Friday''s build to see if > > that makes a difference. > > > > Thanks, > > Jon > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
These results are pretty bad. What do you get for dom0->external? That definitely should be close or equal to native. Have you tweaked /proc/sys/net/core/rmem_max? Is the socket buffer set to some large value? Are you transmitting/receiving enough data? I don''t know netperf but for ttcp I would normally do: echo 1048576 > /proc/sys/net/core/rmem_max ttcp -b 65536 (or similar) ... And then transmit a few gigabytes What''s the interrupt rate etc. Rolf On 23/5/05 10:48 pm, "Bin Ren" <bin.ren@gmail.com> wrote:> On 5/23/05, Nivedita Singhvi <niv@us.ibm.com> wrote: >> Bin Ren wrote: >>> I''ve added the support for ethtools. By turning on and off netfront >>> checksum offloading, I''m getting the following throughput numbers, >>> using iperf. Each test was run three times. CPU usages are quite >>> similar in two cases (''top'' output). Looks like checksum computation >>> is not a major overhead in domU networking. >>> >>> dom0/1/2 all have 128M memory. dom0 has e1000 tx checksum offloading turned >>> on. >> >> Yeah, if you want to do anything network intensive, 128MB is just >> not enough - you really need more memory in your system. > > I''ve given all the domains 256M memory and switched to netperf > TCP_STREAM (netperf -H server). almost no change. Details: > > dom1->external: 420Mbps > dom1->dom0: 437Mbps > dom0->dom1: 200Mbps (!!!) > dom1->dom2: 327Mbps > >> >>> With Tx checksum on: >>> >>> dom1->dom2: 300Mb/s (dom0 cpu maxed out by software interrupts) >>> dom1->dom0: 459Mb/s (dom0 cpu 80% in SI, dom1 cpu 20% in SI) >>> dom1->external: 439Mb/s (over 1Gb/s ethernet) (dom0 cpu 50% in SI, >>> dom1 60% in SI) >>> >>> With Tx checksum off: >>> >>> dom1->dom2: 301Mb/s >>> dom1->dom0: 454Mb/s >>> dom1->externel: 437Mb/s (over 1Gb/s ethernet) >> >> >> iperf is a directional send test, correct? >> i.e. is dom1 -> dom0 perf the same as dom0 -> dom1 for you? > > Please see above. > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> I get the following domU->dom0 throughput on my system (using
> netperf3 TCP_STREAM testcase):
> tx on  ~1580Mbps
> tx off ~1230Mbps
>
> with my previous patch (on Friday's build), I was seeing the following:
> with patch ~1610Mbps
> no patch   ~1100Mbps
>
> The slight difference between the two might be caused by the changes
> that were incorporated in xen between those dates. If you think it is
> worth the time, I can back port the latest patch to Friday's build to
> see if that makes a difference.

Are you sure these aren't within 'experimental error'? I can't think of
anything that's changed since Friday that could be affecting this, but it
would be good to dig a bit further as the difference in the 'no patch'
results is quite significant. It might be revealing to try running some
results on the unpatched Fri/Sat/Sun tree.

BTW, dom0<->domU is not that interesting as I'd generally discourage
people from running services in dom0. I'd be really interested to see the
following tests:

domU <-> external [dom0 on cpu0; dom1 on cpu1]
domU <-> external [dom0 on cpu0; dom1 on cpu0]
domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu2 ** on a 4 way]
domU <-> domU [dom0 on cpu0; dom1 on cpu0; dom2 on cpu0]
domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu1]
domU <-> domU [dom0 on cpu0; dom1 on cpu0; dom2 on cpu1]
domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu2 ** cpu2 hyperthread w/ cpu0]
domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu3 ** cpu3 hyperthread w/ cpu1]

This might help us understand the performance of interdomain networking
rather better than we do at present. If you could fill a few of these in,
that would be great.

Best,
Ian
On 5/24/05, Rolf Neugebauer <rolf.neugebauer@intel.com> wrote:
> These results are pretty bad.
>
> What do you get for dom0->external? That definitely should be close or
> equal to native.

With default BVT, dom0->external gets 643Mbps. Native gets 744Mbps.

> Have you tweaked /proc/sys/net/core/rmem_max?

No. I once did Linux TCP tuning on native Linux and increased the
throughput to around 810Mbps. But it's not very stable and occasionally
produced weird behaviour, so I turned off tuning on both server and
client.

> Is the socket buffer set to some large value?

Both sender and receiver buffers are 32K.

> Are you transmitting/receiving enough data?

Each test lasts 50 seconds, transmitting around 3GB of data.

> I don't know netperf but for ttcp I would normally do:
>
> echo 1048576 > /proc/sys/net/core/rmem_max
> ttcp -b 65536 (or similar) ...
> And then transmit a few gigabytes
>
> What's the interrupt rate etc.

Haven't noticed yet. I'll get you the numbers tomorrow.

What I'm currently really obsessed with is (1) dom1->external with
default BVT gives only ~400Mbps (2) dom1->external with my EEVDF
scheduler (everything else is exactly the same) gives 610Mbps, very close
to dom0->external. With scheduler latency histograms, it seems to be
caused by *far too frequent* context switches in BVT. I'm still digging.

Thanks a lot,
Bin
> What I'm currently really obsessed with is (1) dom1->external with
> default BVT gives only ~400Mbps (2) dom1->external with my EEVDF
> scheduler (everything else is exactly the same) gives 610Mbps, very
> close to dom0->external. With scheduler latency histograms, it seems
> to be caused by *far too frequent* context switches in BVT. I'm still
> digging.

Have you tried SEDF? I'm itching to make it the default scheduler...

Ian
Not yet. I'll give it a shot tomorrow and post the numbers here.

Cheers,
Bin

On 5/24/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > What I'm currently really obsessed with is (1) dom1->external with
> > default BVT gives only ~400Mbps (2) dom1->external with my EEVDF
> > scheduler (everything else is exactly the same) gives 610Mbps, very
> > close to dom0->external. With scheduler latency histograms, it seems
> > to be caused by *far too frequent* context switches in BVT. I'm still
> > digging.
>
> Have you tried SEDF? I'm itching to make it the default scheduler...
>
> Ian
On Monday 23 May 2005 06:59 pm, Ian Pratt wrote:
> > I get the following domU->dom0 throughput on my system (using
> > netperf3 TCP_STREAM testcase):
> > tx on  ~1580Mbps
> > tx off ~1230Mbps
> >
> > with my previous patch (on Friday's build), I was seeing the following:
> > with patch ~1610Mbps
> > no patch   ~1100Mbps
> >
> > The slight difference between the two might be caused by the changes
> > that were incorporated in xen between those dates. If you think it is
> > worth the time, I can back port the latest patch to Friday's build to
> > see if that makes a difference.
>
> Are you sure these aren't within 'experimental error'? I can't think of
> anything that's changed since Friday that could be affecting this, but
> it would be good to dig a bit further as the difference in the 'no
> patch' results is quite significant.

The "tx off" number is probably higher because of the offloading on the rx
side (both netback not checksumming, and the physical ethernet checksum
verification being passed through to domU). I'm not sure why "tx on" is
lower than in my previous tests. It could be something outside the patch
that has since been incorporated, or it could be something in the patch as
it was committed. The changelog patch diff was truncated, so I will have
to create a diff to apply to my Friday tree to see if the problem lies in
the latter.

> It might be revealing to try running some results on the unpatched
> Fri/Sat/Sun tree.
>
> BTW, dom0<->domU is not that interesting as I'd generally discourage
> people from running services in dom0.

That is why I designed the checksum offload patch the way I did; there
were other ways which would give significantly better domU->dom0
communication, but would cause significantly more calculation in dom0.

> I'd be really interested to see the following tests:
>
> domU <-> external [dom0 on cpu0; dom1 on cpu1]
> domU <-> external [dom0 on cpu0; dom1 on cpu0]
> domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu2 ** on a 4 way]
> domU <-> domU [dom0 on cpu0; dom1 on cpu0; dom2 on cpu0]
> domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu1]
> domU <-> domU [dom0 on cpu0; dom1 on cpu0; dom2 on cpu1]
> domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu2 ** cpu2 hyperthread w/ cpu0]
> domU <-> domU [dom0 on cpu0; dom1 on cpu1; dom2 on cpu3 ** cpu3 hyperthread w/ cpu1]
>
> This might help us understand the performance of interdomain networking
> rather better than we do at present. If you could fill a few of these
> in, that would be great.

I wish I had all the hardware you describe ;-)  My tests are running on a
Pentium 4 (which has hyperthreading, so it shows up as 2 CPUs). dom0 was
on cpu0 and domU was on cpu1. I'll be happy to run netperf on the hardware
I have.

Thanks,
Jon
First round of test results for netperf2:

I will also run domU->domU, and all of these tests again with domains on
different CPUs (these are all on the same HW thread). The CPU util is from
xc_domain_get_cpu_usage(), not sar, vmstat, etc. (I am not confident those
are accurate for Xen right now). DomU CPU util is about 2% lower on
domU->host, which is about the % of time spent in csum_partial_copy based
on a timer-interrupt-based oprofile. Not sure why dom0 uses that extra 2%
CPU, and we see maybe a 1% throughput increase in our best cases.

I do think CPU util in dom0 is the biggest problem right now. On this same
box, we might use 30% of one CPU total to max out this Gbps adapter (tg3).
Adding ~60% CPU just to "proxy" this network seems like a lot. DomU to
dom0 is quite good, 13% better in the best case. Also note the horrible
throughput rates for 64-byte messages, most likely due to excessive
context switching. Also, BTW, this is the "old" bridge networking, no
veth0 used yet.

-Andrew

3.2 GHz Xeon with Hyperthreading, 1GB memory
Benchmark: netperf2 -T TCP_STREAM
dom0 and dom1 on cpu0 (first SMT thread on first core)

domU to host
hw tx csum
msg-size: 00064 Mbps: 0186 d0-cpu: 49.38 d1-cpu: 44.35
msg-size: 01500 Mbps: 0917 d0-cpu: 62.13 d1-cpu: 37.87
msg-size: 16384 Mbps: 0933 d0-cpu: 66.63 d1-cpu: 33.37
msg-size: 32768 Mbps: 0928 d0-cpu: 66.96 d1-cpu: 32.66
sw tx csum
msg-size: 00064 Mbps: 0187 d0-cpu: 49.50 d1-cpu: 44.52
msg-size: 01500 Mbps: 0904 d0-cpu: 60.63 d1-cpu: 39.36
msg-size: 16384 Mbps: 0924 d0-cpu: 63.98 d1-cpu: 35.98
msg-size: 32768 Mbps: 0926 d0-cpu: 64.18 d1-cpu: 35.68

domU to dom0
hw tx csum
msg-size: 00064 Mbps: 0014 d0-cpu: 64.02 d1-cpu: 31.71
msg-size: 01500 Mbps: 1087 d0-cpu: 63.34 d1-cpu: 36.67
msg-size: 16384 Mbps: 1204 d0-cpu: 67.30 d1-cpu: 32.71
msg-size: 32768 Mbps: 1148 d0-cpu: 68.08 d1-cpu: 31.93
sw tx csum
msg-size: 00064 Mbps: 0014 d0-cpu: 64.88 d1-cpu: 32.39
msg-size: 01500 Mbps: 0948 d0-cpu: 62.20 d1-cpu: 37.80
msg-size: 16384 Mbps: 1063 d0-cpu: 64.73 d1-cpu: 35.27
msg-size: 32768 Mbps: 1012 d0-cpu: 65.71 d1-cpu: 34.30
On 5/24/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > What I'm currently really obsessed with is (1) dom1->external with
> > default BVT gives only ~400Mbps (2) dom1->external with my EEVDF
> > scheduler (everything else is exactly the same) gives 610Mbps, very
> > close to dom0->external. With scheduler latency histograms, it seems
> > to be caused by *far too frequent* context switches in BVT. I'm still
> > digging.
>
> Have you tried SEDF? I'm itching to make it the default scheduler...
>
> Ian

The following numbers are all for dom1->external. Each test runs 50
seconds. dom0/1 share one CPU.

With default SEDF, throughput is even worse than with default BVT:
318Mbps (down from 410Mbps). I guess, without looking into the source
code, that with default SEDF dom0 and dom1 both get 50% of the CPU. I
tweaked their CPU shares and got the following:

dom1 60%: 493Mbps
dom1 70%: 371Mbps
dom1 80%: 243Mbps

After these tests, dom0 /proc/interrupts is:

           CPU0
 14:     11148   Phys-irq      ide0
 15:         2   Phys-irq      ide1
 16:   1722970   Phys-irq      eth0
 21:         0   Phys-irq      uhci_hcd, uhci_hcd, uhci_hcd, uhci_hcd
256:         5   Dynamic-irq   ctrl-if
257:     92682   Dynamic-irq   timer0
258:        35   Dynamic-irq   console
259:         0   Dynamic-irq   net-be-dbg
260:      4842   Dynamic-irq   blkif-backend
261:   2943112   Dynamic-irq   vif1.0
NMI:         0
LOC:         0
ERR:         0
MIS:         0

dom1 /proc/interrupts is:

           CPU0
256:       474   Dynamic-irq   ctrl-if
257:     45584   Dynamic-irq   timer0
258:      5158   Dynamic-irq   blkif
259:   1097273   Dynamic-irq   eth0
NMI:         0
ERR:         0

SEDF doesn't work out of the box, and parameter tuning is tricky, as it is
for driver domains.

- Bin
Tests for domU->dom0, domU->host, and domU->domU are completed:

3.2 GHz Xeon with Hyperthreading, 2GB (correction) memory
Benchmark: netperf2 -T TCP_STREAM

dom0, dom1, and dom2 on cpu0 (first SMT thread on first core)

domU to host
hw tx csum
msg-size: 00064 Mbps: 0186 d0-cpu: 49.38 d1-cpu: 44.35
msg-size: 01500 Mbps: 0917 d0-cpu: 62.13 d1-cpu: 37.87
msg-size: 16384 Mbps: 0933 d0-cpu: 66.63 d1-cpu: 33.37
msg-size: 32768 Mbps: 0928 d0-cpu: 66.96 d1-cpu: 32.66
sw tx csum
msg-size: 00064 Mbps: 0187 d0-cpu: 49.50 d1-cpu: 44.52
msg-size: 01500 Mbps: 0904 d0-cpu: 60.63 d1-cpu: 39.36
msg-size: 16384 Mbps: 0924 d0-cpu: 63.98 d1-cpu: 35.98
msg-size: 32768 Mbps: 0926 d0-cpu: 64.18 d1-cpu: 35.68
^^ about 2% reduction in cpu util on dom1 ^^

domU to dom0
hw tx csum
msg-size: 00064 Mbps: 0014 d0-cpu: 64.02 d1-cpu: 31.71
msg-size: 01500 Mbps: 1087 d0-cpu: 63.34 d1-cpu: 36.67
msg-size: 16384 Mbps: 1204 d0-cpu: 67.30 d1-cpu: 32.71
msg-size: 32768 Mbps: 1148 d0-cpu: 68.08 d1-cpu: 31.93
sw tx csum
msg-size: 00064 Mbps: 0014 d0-cpu: 64.88 d1-cpu: 32.39
msg-size: 01500 Mbps: 0948 d0-cpu: 62.20 d1-cpu: 37.80
msg-size: 16384 Mbps: 1063 d0-cpu: 64.73 d1-cpu: 35.27
msg-size: 32768 Mbps: 1012 d0-cpu: 65.71 d1-cpu: 34.30
^^ up to 13% throughput increase, with cpu util down ~2% on dom1 ^^
Note the dismal performance for very small msg sizes.

domU to domU
hw tx csum
msg-size: 00064 Mbps: 0359 d0-cpu: 27.85 d1-cpu: 53.68 d2-cpu: 18.48
msg-size: 01500 Mbps: 0594 d0-cpu: 47.42 d1-cpu: 21.77 d2-cpu: 30.78
msg-size: 16384 Mbps: 0619 d0-cpu: 49.66 d1-cpu: 18.81 d2-cpu: 31.53
msg-size: 32768 Mbps: 0616 d0-cpu: 49.58 d1-cpu: 18.68 d2-cpu: 31.74
sw tx csum
msg-size: 00064 Mbps: 0361 d0-cpu: 27.81 d1-cpu: 53.58 d2-cpu: 18.62
msg-size: 01500 Mbps: 0584 d0-cpu: 46.22 d1-cpu: 23.18 d2-cpu: 30.60
msg-size: 16384 Mbps: 0602 d0-cpu: 47.99 d1-cpu: 20.33 d2-cpu: 31.69
msg-size: 32768 Mbps: 0603 d0-cpu: 47.67 d1-cpu: 20.59 d2-cpu: 31.74
^^ about a 2% throughput increase, and cpu down on d1 ^^
The cpu wasted on dom1 should be enough justification for domU<->domU
communication with point-to-point front end driver communication.

dom0 on cpu0, dom1 on cpu2, and dom2 on cpu3 (dom1 and dom2 on same core)

domU to host
hw tx csum
msg-size: 00064 Mbps: 0540 d0-cpu: 92.98 d1-cpu: 100.00
msg-size: 01500 Mbps: 0941 d0-cpu: 99.74 d1-cpu: 48.62
msg-size: 16384 Mbps: 0941 d0-cpu: 99.71 d1-cpu: 43.32
msg-size: 32768 Mbps: 0941 d0-cpu: 99.72 d1-cpu: 43.21
sw tx csum
msg-size: 00064 Mbps: 0545 d0-cpu: 93.47 d1-cpu: 100.00
msg-size: 01500 Mbps: 0941 d0-cpu: 99.76 d1-cpu: 51.43
msg-size: 16384 Mbps: 0941 d0-cpu: 99.69 d1-cpu: 46.58
msg-size: 32768 Mbps: 0941 d0-cpu: 99.72 d1-cpu: 45.39
^^ finally at wire speed, but at a cost of 100% cpu on dom0 ^^
This cpu util seems excessive; maybe oprofile will show some problems.
Notice dom1 has ~2% lower cpu.

domU to dom0
hw tx csum
msg-size: 00064 Mbps: 0390 d0-cpu: 97.92 d1-cpu: 100.00
msg-size: 01500 Mbps: 1571 d0-cpu: 97.36 d1-cpu: 54.83
msg-size: 16384 Mbps: 1582 d0-cpu: 96.20 d1-cpu: 49.93
msg-size: 32768 Mbps: 1596 d0-cpu: 96.32 d1-cpu: 49.63
sw tx csum
msg-size: 00064 Mbps: 0375 d0-cpu: 97.65 d1-cpu: 100.00
msg-size: 01500 Mbps: 1546 d0-cpu: 96.36 d1-cpu: 52.99
msg-size: 16384 Mbps: 1598 d0-cpu: 95.88 d1-cpu: 47.48
msg-size: 32768 Mbps: 1641 d0-cpu: 95.89 d1-cpu: 46.37
^^ very slightly better avg throughput, and lower cpu on dom1 ^^

domU to domU
hw tx csum
msg-size: 00064 Mbps: 0287 d0-cpu: 84.97 d1-cpu: 100.0 d2-cpu: 75.46
msg-size: 01500 Mbps: 1004 d0-cpu: 90.98 d1-cpu: 68.29 d2-cpu: 76.94
msg-size: 16384 Mbps: 1018 d0-cpu: 89.78 d1-cpu: 60.82 d2-cpu: 78.12
msg-size: 32768 Mbps: 1010 d0-cpu: 89.30 d1-cpu: 59.83 d2-cpu: 77.99
sw tx csum
msg-size: 00064 Mbps: 0286 d0-cpu: 84.81 d1-cpu: 99.93 d2-cpu: 76.28
msg-size: 01500 Mbps: 1018 d0-cpu: 91.30 d1-cpu: 67.27 d2-cpu: 75.08
msg-size: 16384 Mbps: 1012 d0-cpu: 88.46 d1-cpu: 55.56 d2-cpu: 71.37
msg-size: 32768 Mbps: 1017 d0-cpu: 88.33 d1-cpu: 54.96 d2-cpu: 70.96
^^ about the same throughput, but ~4% lower cpu on d1 ^^
Again, point-to-point front end comms would be great here.

IMO, I think the patch is a good thing. There are other very major issues
with networking, like the massive cpu overhead in dom0. I wonder if we
could have a layer 2 networking model like:

- Xen has front end ethernet drivers only
- dom0 has a Xen bridge front end driver, just to put eth0 (or whatever
  physical device) on it
- no domain-hosted bridge device or backend ethernet drivers

With this, Xen acts as an ethernet "switch", switching ethernet traffic in
Xen itself, without the help of a domain-hosted bridge. Packets are
forwarded to either a domain's front end driver, or the front end bridge
interface in dom0 (or any other driver domain). With this we may have
better control of emulating offload functions, and we should avoid some
hops (in many cases involving dom0) for the network traffic. Comments?

-Andrew
What does the tx hw csum control actually turn on and off?

I'm surprised there's much benefit to csum offload on the tx side at all,
as it's almost always done as part of a copy. I'd have thought the main
benefit of csum offload was on the rx side, so that packets received by
the NIC are hardware csum'ed, passed through the bridge, and then into the
domU where the csum re-calculation is avoided [it would normally need to
be done before the TCP ack is sent, and can't be done as part of a copy as
the data won't be moved out of the skb until the user app does a read].
The same rx csum check will be avoided and hence provide benefit to
domU <-> domU transfers.

In the figures below, which direction is the data stream heading? (I
presume it's a one-way test, like ttcp?)

It's somewhat surprising that the dom0 bridge code is burning so much CPU.
xenoprofile results will be quite interesting to see what functions are
eating the CPU.

Ultimately, the best way of doing domU <-> domU networking will be to
allow point-to-point connections where netfronts are connected directly to
other netfronts if the hosts are on the same machine. However, the
priority for 3.0 is to optimise the normal front-back-bridge-back-front
path.

Thanks,
Ian

> -----Original Message-----
> From: Andrew Theurer [mailto:habanero@us.ibm.com]
> Sent: 25 May 2005 15:39
> To: Jon Mason; xen-devel@lists.xensource.com
> Cc: Ian Pratt; bin.ren@cl.cam.ac.uk
> Subject: Re: [Xen-devel] [PATCH] Network Checksum Removal
>
> [snip: Andrew's benchmark figures, quoted in full]
On Wednesday 25 May 2005 11:48 am, Ian Pratt wrote:
> What does the tx hw csum control actually turn on and off?

The tx hw csum control lets the TCP/IP stack know whether or not to
software checksum the outgoing packet. So if tx checksum offload is
enabled, the stack will not software checksum it.

> I'm surprised there's much benefit to csum offload on the tx side at
> all, as it's almost always done as part of a copy.

Why? The tx checksumming is just as expensive as the rx checksumming.

> I'd have thought the main benefit of csum offload was on the rx side,
> so that packets received by the NIC are hardware csum'ed, passed
> through the bridge, and then into the domU where the csum
> re-calculation is avoided [it would normally need to be done before the
> TCP ack is sent, and can't be done as part of a copy as the data won't
> be moved out of the skb until the user app does a read]. The same rx
> csum check will be avoided and hence provide benefit to domU <-> domU
> transfers.

I can add an ethtool feature to disable rx checksum offload (so that domU
will verify the checksum in software).

> In the figures below, which direction is the data stream heading? (I
> presume it's a one-way test, like ttcp?)
>
> It's somewhat surprising that the dom0 bridge code is burning so much
> CPU. xenoprofile results will be quite interesting to see what
> functions are eating the CPU.

There is a patch on netdev which can decrease the CPU load of bridging.
Specifically, it allows the bridge device to take advantage of the network
device features (like hardware checksum offload). Stephen Hemminger says
it should go in the 2.6.13 kernel.

> Ultimately, the best way of doing domU <-> domU networking will be to
> allow point-to-point connections where netfronts are connected directly
> to other netfronts if the hosts are on the same machine. However, the
> priority for 3.0 is to optimise the normal front-back-bridge-back-front
> path.
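[Editorial note: the ethtool controls discussed above are driven through
the SIOCETHTOOL ioctl, which is what ends up calling the driver's
.get_tx_csum/.set_tx_csum ops. The following is only an illustrative
user-space sketch of toggling tx checksum offload in a guest; the
interface name "eth0" and the on/off argument handling are assumptions,
not part of the patch.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    struct ethtool_value ev;
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface */
    ifr.ifr_data = (char *)&ev;

    /* Query the current tx checksum offload setting. */
    ev.cmd = ETHTOOL_GTXCSUM;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_GTXCSUM");
        return 1;
    }
    printf("tx csum offload currently %s\n", ev.data ? "on" : "off");

    /* Optionally toggle it: argv[1] is "on" or "off". */
    if (argc > 1) {
        ev.cmd = ETHTOOL_STXCSUM;
        ev.data = (strcmp(argv[1], "on") == 0);
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
            perror("ETHTOOL_STXCSUM");
            return 1;
        }
    }
    close(fd);
    return 0;
}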
Jon Mason wrote:
>> I'm surprised there's much benefit to csum offload on the tx side at
>> all, as it's almost always done as part of a copy.
>
> Why? The tx checksumming is just as expensive as the rx checksumming.

Normally (i.e. in the non-sendfile() case), on the transmit side you have
to copy the data from user space to kernel space, and usually, during this
step, you perform the checksum operation for a few extra instructions -
you have to take the hit of pulling in each byte of data in any case. So
checksum offload on the transmit path _normally_ buys you no throughput
gain, and only a very slight reduction in CPU utilization.

Of course, for every segment sent out (or bunches thereof), we get an ack
back. But checksumming a TCP header (pure ack case) is again fairly
trivial (20 bytes).

thanks,
Nivedita
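[Editorial note: to make the "checksum comes almost free with the copy"
point concrete, here is a small, self-contained user-space sketch. It is
not kernel code (the kernel's csum_partial_copy_* helpers are the real
thing); it just shows the Internet ones'-complement sum being folded into
the copy loop that has to touch every byte anyway.]

#include <stdint.h>
#include <stdio.h>

/* Copy 'len' bytes from src to dst and accumulate the 16-bit ones'-
 * complement (Internet) checksum while the data is being touched anyway.
 * The extra cost over a plain copy is a couple of ALU ops per 16 bits. */
static uint16_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += (uint32_t)((src[i] << 8) | src[i + 1]);
    }
    if (i < len) {                  /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)(src[i] << 8);
    }
    while (sum >> 16)               /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

int main(void)
{
    const char msg[] = "checksum folded into the copy";
    char buf[sizeof(msg)];
    printf("csum = 0x%04x\n",
           copy_and_csum((uint8_t *)buf, (const uint8_t *)msg, sizeof(msg)));
    return 0;
}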
> > I'm surprised there's much benefit to csum offload on the tx side at
> > all, as it's almost always done as part of a copy.
>
> Why? The tx checksumming is just as expensive as the rx checksumming.

[Nivedita has already posted a nice explanation.]

> There is a patch on netdev which can decrease the CPU load of bridging.
> Specifically, it allows the bridge device to take advantage of the
> network device features (like hardware checksum offload). Stephen
> Hemminger says it should go in the 2.6.13 kernel.

Please can you post it as a patch so that we can include it in our 2.6.11
patches directory. With the patch, csum offload will be much more
interesting in the rx case.

Thanks,
Ian
On 25 May 2005, at 21:06, Ian Pratt wrote:
>> There is a patch on netdev which can decrease the CPU load of
>> bridging. Specifically, it allows the bridge device to take advantage
>> of the network device features (like hardware checksum offload).
>> Stephen Hemminger says it should go in the 2.6.13 kernel.
>
> Please can you post it as a patch so that we can include it in our
> 2.6.11 patches directory.
>
> With the patch, csum offload will be much more interesting in the rx
> case.

The code we already have offloads rx csums both for dom0 and domUs (the
dom0 traffic has to be received through veth0 though, not the bridge
device itself).

 -- Keir
On Wednesday 25 May 2005 04:14 pm, Keir Fraser wrote:
> The code we already have offloads rx csums both for dom0 and domUs (the
> dom0 traffic has to be received through veth0 though, not the bridge
> device itself).

The problem with the bridge device is that all traffic generated in dom0
will be software checksummed, regardless of whether it needs to be or not.
In Xen's case, it will software checksum all traffic to domU, even though
the vif device is advertising NETIF_F_NO_CSUM.

This is because the stack doesn't see the features of the children devices
of the bridge, only the features of the bridge device itself. I created a
quick hack to work around this, and started the discussion on the Linux
netdev mailing list about how to fix the problem. From this discussion, a
patch was created which does most of what we want, but needs to be
slightly modified to be optimal for Xen. I will post the Xen-optimized
patch as soon as I have it done.

Thanks,
Jon
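[Editorial note: the decision Jon describes happens in the core transmit
path. The helper below is a hypothetical paraphrase of the 2.6-era rule in
dev_queue_xmit() (not a verbatim excerpt, and the function name is made
up): only the features of the device the packet is queued on, i.e. the
bridge, are consulted, so a vif behind it advertising NETIF_F_NO_CSUM
never gets a say.]

/* Hypothetical helper expressing the 2.6-era rule (paraphrased):
 * software-checksum a CHECKSUM_HW skb unless the *transmitting* device
 * advertises an offload that covers it. */
static int needs_sw_csum(const struct sk_buff *skb,
                         const struct net_device *dev)
{
        if (skb->ip_summed != CHECKSUM_HW)
                return 0;
        if (dev->features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM))
                return 0;
        if ((dev->features & NETIF_F_IP_CSUM) &&
            skb->protocol == htons(ETH_P_IP))
                return 0;
        return 1;       /* stack falls back to skb_checksum_help() */
}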
Hello,

It seems this patch breaks something in netfilter. My setup is a classical
bridge (no veth0/vif0.0) plus some stateful firewalling on Dom0.

With tx offload off and firewall on, pings from Dom0 to DomU work and ssh
from Dom0 to DomU works. With tx offload on and firewall off, same thing.
With tx offload on and firewall on, ping works but ssh does not.

Here are the iptables rules:

iptables -P INPUT DROP
iptables -A INPUT -p icmp -j ACCEPT
iptables -A INPUT -i xen-br0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -P OUTPUT ACCEPT

Here is a capture on vif1.0:

IP DOM0.2486 > DOM1.22: S
IP DOM1.22 > DOM0.2486: S
IP DOM0.2486 > DOM1.22: . ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
IP DOM1.22 > DOM0.2486: P 1:23(22) ack 1
...

The response to the original SYN goes through the third rule, but the ACKs
don't. I added a rule to log packets with invalid state, and the ACKs got
logged.
On 25 May 2005, at 22:35, Jon Mason wrote:
> The problem with the bridge device is that all traffic generated in
> dom0 will be software checksummed, regardless of whether it needs to be
> or not. In Xen's case, it will software checksum all traffic to domU,
> even though the vif device is advertising NETIF_F_NO_CSUM.
>
> This is because the stack doesn't see the features of the children
> devices of the bridge, only the features of the bridge device itself. I
> created a quick hack to work around this, and started the discussion on
> the Linux netdev mailing list about how to fix the problem. From this
> discussion, a patch was created which does most of what we want, but
> needs to be slightly modified to be optimal for Xen. I will post the
> Xen-optimized patch as soon as I have it done.

But we no longer bring up an IP interface on the bridge device -- we use
veth0 instead, which advertises NETIF_F_IP_CSUM.

 -- Keir
On 25 May 2005, at 22:38, Cédric Schieli wrote:
> The response to the original SYN goes through the third rule, but the
> ACKs don't.
>
> I added a rule to log packets with invalid state, and the ACKs got
> logged.

This may be a hard one to fix. The problem is probably that the packets
coming from domU haven't been checksummed, so a checksum check will fail.
We set ip_summed==CHECKSUM_UNNECESSARY, but perhaps the firewall code
checksums anyway, or the bridge is clobbering ip_summed when it locally
delivers. :-(

veth0 is careful to preserve CHECKSUM_UNNECESSARY -- it may be worth
trying it out rather than bringing up your IP interface on the bridge
device. See tools/examples/network for an example script that brings up
veth0.

If that doesn't work then I'm not sure there's a clean solution (i.e. one
that doesn't require hacking the network stack), other than disabling
checksum offload.

 -- Keir
On 25 May 2005, at 22:47, Keir Fraser wrote:
> This may be a hard one to fix. The problem is probably that the packets
> coming from domU haven't been checksummed, so a checksum check will
> fail. We set ip_summed==CHECKSUM_UNNECESSARY, but perhaps the firewall
> code checksums anyway, or the bridge is clobbering ip_summed when it
> locally delivers. :-(

Perhaps not so hard... Try modifying tcp_error() in
net/ipv4/netfilter/ip_conntrack_proto_tcp.c. Wrap the entire if statement
that checks for an invalid checksum in:

if ( skb->ip_summed != CHECKSUM_UNNECESSARY )
{
    <checksum checking code goes here>
}

I expect this should solve the problem. :-)

 -- Keir
On Wednesday 25 May 2005 04:40 pm, Keir Fraser wrote:
> On 25 May 2005, at 22:35, Jon Mason wrote:
> > The problem with the bridge device is that all traffic generated in
> > dom0 will be software checksummed, regardless of whether it needs to
> > be or not. In Xen's case, it will software checksum all traffic to
> > domU, even though the vif device is advertising NETIF_F_NO_CSUM.
>
> But we no longer bring up an IP interface on the bridge device -- we
> use veth0 instead, which advertises NETIF_F_IP_CSUM.

The bridge device is still the device that the stack sees, and it uses its
features to determine what to do during transmission. If you monitor the
skb->ip_summed flag going into netif_be_start_xmit(), you will see that it
is 0 (meaning that the stack did the checksum in software). Now if you add
the following patch to the bridge device, you will notice that ip_summed
is being used:

--- net/bridge/br_device.c.orig	2005-05-13 11:23:02.552751024 -0500
+++ net/bridge/br_device.c	2005-05-13 11:25:39.155943720 -0500
@@ -101,4 +101,5 @@ void br_dev_setup(struct net_device *dev
 	dev->tx_queue_len = 0;
 	dev->set_mac_address = NULL;
 	dev->priv_flags = IFF_EBRIDGE;
+	dev->features = NETIF_F_HW_CSUM | NETIF_F_SG;
 }

This patch oversimplifies what needs to be done, but it provides the
general idea and the speedup that we are looking for.

Thanks,
Jon
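[Editorial note: the fuller fix being discussed on netdev amounts to
recomputing the bridge's advertised features from its ports rather than
hard-coding them. The sketch below only illustrates that idea; the
function name, the starting feature mask and the struct field names are
assumptions, and this is not the actual netdev patch.]

/* Illustrative sketch only: whenever a port is added or removed,
 * recompute the bridge device's feature flags as the intersection of
 * what all its ports can do, so the stack's "do I need to
 * software-checksum?" decision reflects the real ports. */
static void bridge_recompute_features(struct net_bridge *br)
{
	struct net_bridge_port *p;
	unsigned long features;

	/* Start from everything we would like to advertise... */
	features = NETIF_F_SG | NETIF_F_HW_CSUM;

	/* ...and keep only what every enslaved port supports. */
	list_for_each_entry(p, &br->port_list, list)
		features &= p->dev->features;

	br->dev->features = features;
}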
On 26 May 2005, at 00:41, Jon Mason wrote:
> The bridge device is still the device that the stack sees, and it uses
> its features to determine what to do during transmission. If you
> monitor the skb->ip_summed flag going into netif_be_start_xmit(), you
> will see that it is 0 (meaning that the stack did the checksum in
> software). Now if you add the following patch to the bridge device, you
> will notice that ip_summed is being used.

For local traffic transmitted via veth0, ip_summed is zero (CHECKSUM_NONE)
at netif_be_start_xmit because the bridge forwarding code nobbles the
ip_summed field. It does *not* checksum the packet: etherbridge never
checksums packets it forwards, because that is the destination's job (it's
an end-to-end checksum at the protocol level).

If you transmit local traffic directly on the bridge device then yes, you
need a patch, because it does not advertise NETIF_F_*_CSUM.

 -- Keir
On Thursday 26 May 2005 03:07 am, Keir Fraser wrote:
> For local traffic transmitted via veth0, ip_summed is zero
> (CHECKSUM_NONE) at netif_be_start_xmit because the bridge forwarding
> code nobbles the ip_summed field. It does *not* checksum the packet:
> etherbridge never checksums packets it forwards, because that is the
> destination's job (it's an end-to-end checksum at the protocol level).
>
> If you transmit local traffic directly on the bridge device then yes,
> you need a patch, because it does not advertise NETIF_F_*_CSUM.

That is exactly the case I am referring to (sorry for the confusion). The
issue is not Xen-specific (which is why I addressed it on the Linux
networking mailing list), but Xen will see a boost when using the patch.

Thanks,
Jon