multiple queue virtio-net: flow steering through host/guest cooperation

Hello all:

This is a rough series that adds guest/host cooperation for flow
steering support, based on Krishna Kumar's multiple queue virtio-net
driver patch 3/3 (http://lwn.net/Articles/467283/).

The idea is simple: the backend passes the rxhash to the guest, and the
guest tells the backend the hash-to-queue mapping when necessary; the
backend can then choose the queue based on the hash value of the
packet. The table is just a page shared between userspace and the
backend.

Patch 1 enables the ability to pass the rxhash through vnet_hdr to the
guest.

Patches 2 and 3 implement a very simple flow director for tap and
macvtap. The tap part is based on the multiqueue tap patches posted by
me (http://lwn.net/Articles/459270/).

Patch 4 implements a method for a virtio device to find the irq of a
specific virtqueue, in order to do device-specific interrupt
optimization.

Patch 5 is the part of the guest driver that uses accelerated RFS to
program the flow director, with some optimizations on irq affinity and
tx queue selection.

This is just a prototype that demonstrates the idea; there are still
things that need to be discussed:

- An alternative to the shared page is the ctrl vq; the reason a shared
  table is preferable is the latency of the ctrl vq itself.
- Optimization of irq affinity and tx queue selection.

Comments are welcome, thanks!

---

Jason Wang (5):
      virtio_net: passing rxhash through vnet_hdr
      tuntap: simple flow director support
      macvtap: flow director support
      virtio: introduce a method to get the irq of a specific virtqueue
      virtio-net: flow director support


 drivers/lguest/lguest_device.c |    8 ++
 drivers/net/macvlan.c          |    4 +
 drivers/net/macvtap.c          |   42 ++++++++-
 drivers/net/tun.c              |  105 ++++++++++++++++------
 drivers/net/virtio_net.c       |  189 +++++++++++++++++++++++++++++++++++++++-
 drivers/s390/kvm/kvm_virtio.c  |    6 +
 drivers/vhost/net.c            |   10 +-
 drivers/vhost/vhost.h          |    5 +
 drivers/virtio/virtio_mmio.c   |    8 ++
 drivers/virtio/virtio_pci.c    |   12 +++
 include/linux/if_macvlan.h     |    1 +
 include/linux/if_tun.h         |   11 ++
 include/linux/virtio_config.h  |    4 +
 include/linux/virtio_net.h     |   16 +++
 14 files changed, 377 insertions(+), 44 deletions(-)

-- 
Signature
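
To make the steering rule above concrete, the lookup the backend performs
can be sketched in a few lines of C. This is illustrative only, not part of
the series; it mirrors the tun_get_queue() logic added in patch 2, with
simplified names:

    #include <stdint.h>

    #define TAP_HASH_MASK 0xFF

    /* Pick an rx queue for a packet, given the guest-programmed table. */
    static uint16_t pick_queue(const uint16_t *table, uint32_t rxhash,
                               uint16_t numqueues)
    {
            uint16_t q = table[rxhash & TAP_HASH_MASK];

            if (q < numqueues)      /* entry was programmed by the guest */
                    return q;
            /* fall back to plain hash-based spreading */
            return ((uint64_t)rxhash * numqueues) >> 32;
    }

An out-of-range entry acts as "not programmed", so the whole table can be
invalidated simply by filling it with the queue count.
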
Jason Wang
2011-Dec-05 08:58 UTC
[net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr
This patch enables the ability to pass the rxhash value to the guest
through vnet_hdr. This is useful when the guest wants to cooperate with
the virtual device to steer a flow to a dedicated guest cpu.

This feature is negotiated through VIRTIO_NET_F_GUEST_RXHASH.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/net/macvtap.c      |   10 ++++++----
 drivers/net/tun.c          |   44 +++++++++++++++++++++++++-------------------
 drivers/net/virtio_net.c   |   26 ++++++++++++++++++++++----
 drivers/vhost/net.c        |   10 +++++++---
 drivers/vhost/vhost.h      |    5 +++--
 include/linux/if_tun.h     |    1 +
 include/linux/virtio_net.h |   10 +++++++++-
 7 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7c88d13..504c745 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -760,16 +760,17 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 	int vnet_hdr_len = 0;
 
 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_rxhash vnet_hdr;
 		vnet_hdr_len = q->vnet_hdr_sz;
 		if ((len -= vnet_hdr_len) < 0)
 			return -EINVAL;
 
-		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
+		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr.hdr.hdr);
 		if (ret)
 			return ret;
 
-		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, sizeof(vnet_hdr)))
+		vnet_hdr.rxhash = skb->rxhash;
+		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, q->vnet_hdr_sz))
 			return -EFAULT;
 	}
 
@@ -890,7 +891,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR, up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
+			     up))
 			return -EFAULT;
 		return 0;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afb11d1..7d22b4b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -869,49 +869,55 @@ static ssize_t tun_put_user(struct tun_file *tfile,
 	}
 
 	if (tfile->flags & TUN_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		if ((len -= tfile->vnet_hdr_sz) < 0)
+		struct virtio_net_hdr_rxhash hdr;
+		struct virtio_net_hdr *gso = (struct virtio_net_hdr *)&hdr;
+
+		if ((len -= tfile->vnet_hdr_sz) < 0 ||
+		    tfile->vnet_hdr_sz > sizeof(struct virtio_net_hdr_rxhash))
 			return -EINVAL;
 
+		memset(&hdr, 0, sizeof(hdr));
 		if (skb_is_gso(skb)) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 			/* This is a hint as to how much should be linear. */
-			gso.hdr_len = skb_headlen(skb);
-			gso.gso_size = sinfo->gso_size;
+			gso->hdr_len = skb_headlen(skb);
+			gso->gso_size = sinfo->gso_size;
 			if (sinfo->gso_type & SKB_GSO_TCPV4)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 			else if (sinfo->gso_type & SKB_GSO_TCPV6)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
 			else if (sinfo->gso_type & SKB_GSO_UDP)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_UDP;
 			else {
 				pr_err("unexpected GSO type: "
 				       "0x%x, gso_size %d, hdr_len %d\n",
-				       sinfo->gso_type, gso.gso_size,
-				       gso.hdr_len);
+				       sinfo->gso_type, gso->gso_size,
+				       gso->hdr_len);
 				print_hex_dump(KERN_ERR, "tun: ",
 					       DUMP_PREFIX_NONE,
 					       16, 1, skb->head,
-					       min((int)gso.hdr_len, 64), true);
+					       min((int)gso->hdr_len, 64),
+					       true);
 				WARN_ON_ONCE(1);
 				return -EINVAL;
 			}
 			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
-				gso.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+				gso->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 		} else
-			gso.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+			gso->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
-			gso.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			gso.csum_start = skb_checksum_start_offset(skb);
-			gso.csum_offset = skb->csum_offset;
+			gso->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			gso->csum_start = skb_checksum_start_offset(skb);
+			gso->csum_offset = skb->csum_offset;
 		} else if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
-			gso.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+			gso->flags = VIRTIO_NET_HDR_F_DATA_VALID;
 		} /* else everything is zero */
 
-		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
-					       sizeof(gso))))
+		hdr.rxhash = skb_get_rxhash(skb);
+		if (unlikely(memcpy_toiovecend(iv, (void *)&hdr, total,
+					       tfile->vnet_hdr_sz)))
 			return -EFAULT;
 		total += tfile->vnet_hdr_sz;
 	}
@@ -1358,7 +1364,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
 				(unsigned int __user*)argp);
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 157ee63..0d871f8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,12 +107,16 @@ struct virtnet_info {
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
+
+	/* Host will pass rxhash to us. */
+	bool has_rxhash;
 };
 
 struct skb_vnet_hdr {
 	union {
 		struct virtio_net_hdr hdr;
 		struct virtio_net_hdr_mrg_rxbuf mhdr;
+		struct virtio_net_hdr_rxhash rhdr;
 	};
 	unsigned int num_sg;
 };
@@ -205,7 +209,10 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	hdr = skb_vnet_hdr(skb);
 
 	if (vi->mergeable_rx_bufs) {
-		hdr_len = sizeof hdr->mhdr;
+		if (vi->has_rxhash)
+			hdr_len = sizeof hdr->rhdr;
+		else
+			hdr_len = sizeof hdr->mhdr;
 		offset = hdr_len;
 	} else {
 		hdr_len = sizeof hdr->hdr;
@@ -376,6 +383,9 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
+	if (vi->has_rxhash)
+		skb->rxhash = hdr->rhdr.rxhash;
+
 	netif_receive_skb(skb);
 	return;
 
@@ -645,9 +655,12 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
 		hdr->mhdr.num_buffers = 0;
 
 	/* Encode metadata header at front. */
-	if (vi->mergeable_rx_bufs)
-		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
-	else
+	if (vi->mergeable_rx_bufs) {
+		if (vi->has_rxhash)
+			sg_set_buf(sg, &hdr->rhdr, sizeof hdr->rhdr);
+		else
+			sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
+	} else
 		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
@@ -1338,8 +1351,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_RXHASH))
+		vi->has_rxhash = true;
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = virtnet_setup_vqs(vi);
+
 	if (err)
 		goto free_netdev;
 
@@ -1436,6 +1453,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
+	VIRTIO_NET_F_GUEST_RXHASH,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 882a51f..b2d6548 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -768,9 +768,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
 
-	hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
-			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-			sizeof(struct virtio_net_hdr);
+	if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
+		hdr_len = (features & (1 << VIRTIO_NET_F_GUEST_RXHASH)) ?
+				sizeof(struct virtio_net_hdr_rxhash) :
+				sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_len = sizeof(struct virtio_net_hdr);
+
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a801e28..4ad2d5f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,7 +115,7 @@ struct vhost_virtqueue {
 	/* hdr is used to store the virtio header.
 	 * Since each iovec has >= 1 byte length, we never need more than
	 * header length entries to store the header. */
-	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
+	struct iovec hdr[sizeof(struct virtio_net_hdr_rxhash)];
 	struct iovec *indirect;
 	size_t vhost_hlen;
 	size_t sock_hlen;
@@ -203,7 +203,8 @@ enum {
 			 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VIRTIO_NET_F_GUEST_RXHASH),
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index d3f24d8..a1f6f3f 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -66,6 +66,7 @@
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
+#define IFF_RXHASH	0x0200
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index c92b83f..2291317 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -50,6 +50,7 @@
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
+#define VIRTIO_NET_F_GUEST_RXHASH 22	/* Guest can receive rxhash */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,7 +64,7 @@ struct virtio_net_config {
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
- * specify GSO or CSUM features, you can simply ignore the header. */
+ * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {
 #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	// Use csum_start, csum_offset
 #define VIRTIO_NET_HDR_F_DATA_VALID	2	// Csum is valid
@@ -87,6 +88,13 @@ struct virtio_net_hdr_mrg_rxbuf {
 	__u16 num_buffers;	/* Number of merged rx buffers */
 };
 
+/* This is the version of the header to use when the GUEST_RXHASH
+ * feature has been negotiated. */
+struct virtio_net_hdr_rxhash {
+	struct virtio_net_hdr_mrg_rxbuf hdr;
+	__u32 rxhash;
+};
+
 /*
  * Control virtqueue data structures
 *
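
For orientation, this is roughly what a userspace consumer of the new
header layout would see. A minimal sketch, assuming tap_fd was opened with
IFF_VNET_HDR | IFF_RXHASH and the header size was set to
sizeof(struct virtio_net_hdr_rxhash) via TUNSETVNETHDRSZ; the function and
buffer names are hypothetical:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/virtio_net.h>

    /* Mirrors the patched kernel layout above. */
    struct vnet_hdr_rxhash {
            struct virtio_net_hdr_mrg_rxbuf hdr;
            uint32_t rxhash;
    };

    static void read_one_packet(int tap_fd)
    {
            char buf[65536];
            struct vnet_hdr_rxhash h;
            ssize_t n = read(tap_fd, buf, sizeof(buf));

            if (n < (ssize_t)sizeof(h))
                    return;
            memcpy(&h, buf, sizeof(h));
            /* The hash can now index a per-flow table, e.g. to pick a vcpu. */
            printf("packet of %zd bytes, rxhash %#x\n",
                   n - (ssize_t)sizeof(h), h.rxhash);
    }
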
Jason Wang
2011-Dec-05 08:58 UTC
[net-next RFC PATCH 2/5] tuntap: simple flow director support
This patch adds a simple flow director to the tun/tap device. It is just
a page that contains the hash-to-queue mapping, which can be changed by
userspace. The backend (tap/macvtap) queries this table to get the
desired queue of a packet when it sends packets to userspace.

The page address is set through a new ioctl, TUNSETFD, and the page is
pinned until device exit or until another page is specified.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/net/tun.c      |   63 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/if_tun.h |   10 ++++++++
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7d22b4b..2efaf81 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -64,6 +64,7 @@
 #include <linux/nsproxy.h>
 #include <linux/virtio_net.h>
 #include <linux/rcupdate.h>
+#include <linux/highmem.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
@@ -109,6 +110,7 @@ struct tap_filter {
 };
 
 #define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
+#define TAP_HASH_MASK 0xFF
 
 struct tun_file {
 	struct sock sk;
@@ -128,6 +130,7 @@ struct tun_sock;
 
 struct tun_struct {
 	struct tun_file		*tfiles[MAX_TAP_QUEUES];
+	struct page		*fd_page[1];
 	unsigned int		numqueues;
 	unsigned int		flags;
 	uid_t			owner;
@@ -156,7 +159,7 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 	struct tun_struct *tun = netdev_priv(dev);
 	struct tun_file *tfile = NULL;
 	int numqueues = tun->numqueues;
-	__u32 rxq;
+	__u32 rxq, rxhash;
 
 	BUG_ON(!rcu_read_lock_held());
 
@@ -168,6 +171,22 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
+	rxhash = skb_get_rxhash(skb);
+	if (rxhash) {
+		if (tun->fd_page[0]) {
+			u16 *table = kmap_atomic(tun->fd_page[0]);
+			rxq = table[rxhash & TAP_HASH_MASK];
+			kunmap_atomic(table);
+			if (rxq < numqueues) {
+				tfile = rcu_dereference(tun->tfiles[rxq]);
+				goto out;
+			}
+		}
+		rxq = ((u64)rxhash * numqueues) >> 32;
+		tfile = rcu_dereference(tun->tfiles[rxq]);
+		goto out;
+	}
+
 	if (likely(skb_rx_queue_recorded(skb))) {
 		rxq = skb_get_rx_queue(skb);
 
@@ -178,14 +197,6 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
-	/* Check if we can use flow to select a queue */
-	rxq = skb_get_rxhash(skb);
-	if (rxq) {
-		u32 idx = ((u64)rxq * numqueues) >> 32;
-		tfile = rcu_dereference(tun->tfiles[idx]);
-		goto out;
-	}
-
 	tfile = rcu_dereference(tun->tfiles[0]);
 out:
 	return tfile;
@@ -1020,6 +1031,14 @@ out:
 	return ret;
 }
 
+static void tun_destructor(struct net_device *dev)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	if (tun->fd_page[0])
+		put_page(tun->fd_page[0]);
+	free_netdev(dev);
+}
+
 static void tun_setup(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1028,7 +1047,7 @@ static void tun_setup(struct net_device *dev)
 	tun->group = -1;
 
 	dev->ethtool_ops = &tun_ethtool_ops;
-	dev->destructor = free_netdev;
+	dev->destructor = tun_destructor;
 }
 
 /* Trivial set of netlink ops to allow deleting tun or tap
@@ -1230,6 +1249,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun = netdev_priv(dev);
 		tun->dev = dev;
 		tun->flags = flags;
+		tun->fd_page[0] = NULL;
 
 		security_tun_dev_post_create(&tfile->sk);
 
@@ -1353,6 +1373,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	struct net_device *dev = NULL;
 	void __user* argp = (void __user*)arg;
 	struct ifreq ifr;
+	struct tun_fd tfd;
 	int ret;
 
 	if (cmd == TUNSETIFF || cmd == TUNATTACHQUEUE || _IOC_TYPE(cmd) == 0x89)
@@ -1364,7 +1385,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH |
+				IFF_FD,
 				(unsigned int __user*)argp);
 	}
 
@@ -1476,6 +1498,25 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = set_offload(tun, arg);
 		break;
 
+	case TUNSETFD:
+		if (copy_from_user(&tfd, argp, sizeof(tfd)))
+			ret = -EFAULT;
+		else {
+			if (tun->fd_page[0]) {
+				put_page(tun->fd_page[0]);
+				tun->fd_page[0] = NULL;
+			}
+
+			/* put_page() in tun_destructor() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&tun->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+
+		break;
+
 	case SIOCGIFHWADDR:
 		/* Get hw address */
 		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index a1f6f3f..726731d 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -36,6 +36,8 @@
 #define TUN_VNET_HDR	0x0200
 #define TUN_TAP_MQ	0x0400
 
+struct tun_fd;
+
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int)
 #define TUNSETDEBUG   _IOW('T', 201, int)
@@ -56,6 +58,7 @@
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
 #define TUNATTACHQUEUE  _IOW('T', 217, int)
 #define TUNDETACHQUEUE  _IOW('T', 218, int)
+#define TUNSETFD        _IOW('T', 219, struct tun_fd)
 
 /* TUNSETIFF ifr flags */
@@ -67,6 +70,7 @@
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_RXHASH	0x0200
+#define IFF_FD		0x0400
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
@@ -97,6 +101,12 @@ struct tun_filter {
 	__u8   addr[0][ETH_ALEN];
 };
 
+/* Programmable flow director */
+struct tun_fd {
+	unsigned long addr;
+	size_t size;
+};
+
 #ifdef __KERNEL__
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
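
Usage from the host side would look roughly like the following. A minimal
sketch, assuming a patched <linux/if_tun.h> that defines TUNSETFD and
struct tun_fd, and an already-attached multiqueue tap fd; tap_fd and
program_flow_director are hypothetical names:

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/if_tun.h>

    #define TAP_HASH_ENTRIES 256    /* TAP_HASH_MASK 0xFF => 256 entries */

    static int program_flow_director(int tap_fd, uint16_t num_queues)
    {
            struct tun_fd tfd;
            uint16_t *table;

            /* One page-aligned buffer; the kernel pins it after TUNSETFD. */
            if (posix_memalign((void **)&table, getpagesize(), getpagesize()))
                    return -1;

            /* An out-of-range queue index means "not programmed", so the
             * backend falls back to plain rxhash-based selection. */
            for (int i = 0; i < TAP_HASH_ENTRIES; i++)
                    table[i] = num_queues;

            /* Example: steer flows whose hash masks to 5 onto queue 2. */
            table[5] = 2;

            tfd.addr = (unsigned long)table;
            tfd.size = getpagesize();
            return ioctl(tap_fd, TUNSETFD, &tfd);
    }

Since get_user_pages_fast() pins the page, any page-aligned buffer works;
the kernel holds the reference until the device is destroyed or the table
is replaced.
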
Jason Wang
2011-Dec-05 08:59 UTC
[net-next RFC PATCH 3/5] macvtap: flow director support

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/net/macvlan.c      |    4 ++++
 drivers/net/macvtap.c      |   36 ++++++++++++++++++++++++++++++++++--
 include/linux/if_macvlan.h |    1 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 7413497..b0cb7ce 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -706,6 +706,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	vlan->port     = port;
 	vlan->receive  = receive;
 	vlan->forward  = forward;
+	vlan->fd_page[0] = NULL;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -749,6 +750,9 @@ void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 
+	if (vlan->fd_page[0])
+		put_page(vlan->fd_page[0]);
+
 	list_del(&vlan->list);
 	unregister_netdevice_queue(dev, head);
 }
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 504c745..a34eb84 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -14,6 +14,7 @@
 #include <linux/wait.h>
 #include <linux/cdev.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
 
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
@@ -62,6 +63,8 @@ static DEFINE_IDR(minor_idr);
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
+#define TAP_HASH_MASK 0xFF
+
 static const struct proto_ops macvtap_socket_ops;
 
 /*
@@ -189,6 +192,11 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
 	/* Check if we can use flow to select a queue */
 	rxq = skb_get_rxhash(skb);
 	if (rxq) {
+		if (vlan->fd_page[0]) {
+			u16 *table = kmap_atomic(vlan->fd_page[0]);
+			rxq = table[rxq & TAP_HASH_MASK];
+			kunmap_atomic(table);
+		}
 		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
 		if (tap)
 			goto out;
@@ -851,6 +859,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 {
 	struct macvtap_queue *q = file->private_data;
 	struct macvlan_dev *vlan;
+	struct tun_fd tfd;
 	void __user *argp = (void __user *)arg;
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
@@ -891,8 +900,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
-			     up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH |
+			     IFF_FD, up))
 			return -EFAULT;
 		return 0;
 
@@ -918,6 +927,29 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		q->vnet_hdr_sz = s;
 		return 0;
 
+	case TUNSETFD:
+		rcu_read_lock_bh();
+		vlan = rcu_dereference(q->vlan);
+		if (!vlan)
+			ret = -ENOLINK;
+		else {
+			if (copy_from_user(&tfd, argp, sizeof(tfd)))
+				ret = -EFAULT;
+			if (vlan->fd_page[0]) {
+				put_page(vlan->fd_page[0]);
+				vlan->fd_page[0] = NULL;
+			}
+
+			/* put_page() in macvlan_dellink() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&vlan->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+		rcu_read_unlock_bh();
+		return ret;
+
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index d103dca..69a87a1 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -65,6 +65,7 @@ struct macvlan_dev {
 	struct macvtap_queue	*taps[MAX_MACVTAP_QUEUES];
 	int			numvtaps;
 	int			minor;
+	struct page		*fd_page[1];
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
Jason Wang
2011-Dec-05 08:59 UTC
[net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue
Device-specific irq configuration may be needed in order to do some
optimization, so a new configuration method is needed to get the irq of
a virtqueue.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++++++
 drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
 drivers/virtio/virtio_mmio.c   |    8 ++++++++
 drivers/virtio/virtio_pci.c    |   12 ++++++++++++
 include/linux/virtio_config.h  |    4 ++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 595d731..6483bff 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -386,6 +386,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct lguest_vq_info *lvq = vq->priv;
+
+	return lvq->config.irq;
+}
+
 /* The ops structure which hooks everything together. */
 static struct virtio_config_ops lguest_config_ops = {
 	.get_features = lg_get_features,
@@ -398,6 +405,7 @@ static struct virtio_config_ops lguest_config_ops = {
 	.find_vqs = lg_find_vqs,
 	.del_vqs = lg_del_vqs,
 	.bus_name = lg_bus_name,
+	.get_vq_irq = lg_get_vq_irq,
 };
 
 /*
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 8af868b..a8d5ca1 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	return 0x2603;
+}
+
 /*
  * The config ops structure as defined by virtio config
 */
@@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
 	.find_vqs = kvm_find_vqs,
 	.del_vqs = kvm_del_vqs,
 	.bus_name = kvm_bus_name,
+	.get_vq_irq = kvm_get_vq_irq,
 };
 
 /*
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 2f57380..309d471 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -368,6 +368,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+
+	return platform_get_irq(vm_dev->pdev, 0);
+}
+
 static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -379,6 +386,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_vq_irq	= vm_get_vq_irq,
 };
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 229ea56..4f99164 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -583,6 +583,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+
+	if (vp_dev->intx_enabled)
+		return vp_dev->pci_dev->irq;
+	else
+		return vp_dev->msix_entries[info->msix_vector].vector;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -594,6 +605,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.get_vq_irq	= vp_get_vq_irq,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 63f98d0..7b783a6 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -104,6 +104,9 @@
  *	vdev: the virtio_device
  *	This returns a pointer to the bus name a la pci_name from which
  *	the caller can then copy.
+ * @get_vq_irq: get the irq number of the specific virtqueue.
+ *	vdev: the virtio_device
+ *	vq: the virtqueue
 */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -122,6 +125,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
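
A driver would consume the new op roughly as follows (a sketch, not from
the series; the function name is hypothetical, but patch 5 does
effectively this in virtnet_init_rq_affinity()):

    #include <linux/interrupt.h>
    #include <linux/cpumask.h>
    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    /* Pin the interrupt of one virtqueue to a given cpu.  vdev and vq
     * come from the driver's probe path. */
    static void example_pin_vq_irq(struct virtio_device *vdev,
                                   struct virtqueue *vq, int cpu)
    {
            int irq = vdev->config->get_vq_irq(vdev, vq);

            irq_set_affinity_hint(irq, cpumask_of(cpu));
    }
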
Jason Wang
2011-Dec-05 08:59 UTC
[net-next RFC PATCH 5/5] virtio-net: flow director support
In order to let the packets of a flow be passed to the desired guest
cpu, we can cooperate with the device by programming the flow director,
which is just a hash-to-queue table. This kind of cooperation is done
through the accelerated RFS support: a device-specific flow steering
method, virtnet_fd(), is used to modify the flow director based on the
RFS mapping. The desired queue is calculated through a reverse mapping
of the irq affinity table.

In order to parallelize the ingress path, the irq affinity of each rx
queue is also provided by the driver.

In addition to accelerated RFS, we can also use the guest scheduler to
balance the TX load and reduce lock contention on the egress path, so
smp_processor_id() is used for tx queue selection.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/net/virtio_net.c   |  165 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/virtio_net.h |    6 ++
 2 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0d871f8..89bb5e7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,10 @@
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/cpu_rmap.h>
+#include <linux/interrupt.h>
+#include <linux/cpumask.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -40,6 +44,7 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
+#define TAP_HASH_MASK 0xFF
 
 struct virtnet_send_stats {
 	struct u64_stats_sync syncp;
@@ -89,6 +94,9 @@ struct receive_queue {
 
 	/* Active rx statistics */
 	struct virtnet_recv_stats __percpu *stats;
+
+	/* FIXME: per vector instead of per queue ?? */
+	cpumask_var_t affinity_mask;
 };
 
 struct virtnet_info {
@@ -110,6 +118,11 @@ struct virtnet_info {
 
 	/* Host will pass rxhash to us. */
 	bool has_rxhash;
+
+	/* A page of flow director */
+	struct page *fd_page;
+
+	cpumask_var_t affinity_mask;
 };
 
 struct skb_vnet_hdr {
@@ -386,6 +399,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (vi->has_rxhash)
 		skb->rxhash = hdr->rhdr.rxhash;
 
+	skb_record_rx_queue(skb, rq->vq->queue_index / 2);
 	netif_receive_skb(skb);
 	return;
 
@@ -722,6 +736,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int virtnet_set_fd(struct net_device *dev, u32 pfn)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtio_device *vdev = vi->vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		vdev->config->set(vdev,
+				  offsetof(struct virtio_net_config_fd, addr),
+				  &pfn, sizeof(u32));
+	}
+	return 0;
+}
+
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -1017,6 +1044,39 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
+	       u16 rxq_index, u32 flow_id)
+{
+	struct virtnet_info *vi = netdev_priv(net_dev);
+	u16 *table = NULL;
+
+	if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
+		return -EPROTONOSUPPORT;
+
+	table = kmap_atomic(vi->fd_page);
+	table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
+	kunmap_atomic(table);
+
+	return 0;
+}
+#endif
+
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+		  smp_processor_id();
+
+	/* As we make use of accelerated RFS, which lets the scheduler
+	 * balance the load, it makes sense to choose the tx queue also
+	 * based on the processor id?
+	 */
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop            = virtnet_close,
@@ -1028,9 +1088,13 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue     = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer    = virtnet_fd,
+#endif
 };
 
 static void virtnet_update_status(struct virtnet_info *vi)
@@ -1272,12 +1336,76 @@ static int virtnet_setup_vqs(struct virtnet_info *vi)
 	return ret;
 }
 
+static int virtnet_init_rx_cpu_rmap(struct virtnet_info *vi)
+{
+#ifdef CONFIG_RFS_ACCEL
+	struct virtio_device *vdev = vi->vdev;
+	int i, rc;
+
+	vi->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(vi->num_queue_pairs);
+	if (!vi->dev->rx_cpu_rmap)
+		return -ENOMEM;
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		rc = irq_cpu_rmap_add(vi->dev->rx_cpu_rmap,
+				      vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq));
+		if (rc) {
+			free_irq_cpu_rmap(vi->dev->rx_cpu_rmap);
+			vi->dev->rx_cpu_rmap = NULL;
+			return rc;
+		}
+	}
+#endif
+	return 0;
+}
+
+static int virtnet_init_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	/* FIXME: TX/RX share a vector */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		if (!alloc_cpumask_var(&vi->rq[i]->affinity_mask, GFP_KERNEL))
+			goto err_out;
+		cpumask_set_cpu(i, vi->rq[i]->affinity_mask);
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      vi->rq[i]->affinity_mask);
+	}
+
+	return 0;
+err_out:
+	while (i) {
+		i--;
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+	return -ENOMEM;
+}
+
+static void virtnet_free_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
 	u16 num_queues, num_queue_pairs;
+	struct page *page = NULL;
+	u16 *table = NULL;
 
 	/* Find if host supports multiqueue virtio_net device */
 	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
@@ -1298,7 +1426,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT;
 	dev->netdev_ops = &virtnet_netdev;
-	dev->features = NETIF_F_HIGHDMA;
+	dev->features = NETIF_F_HIGHDMA | NETIF_F_NTUPLE;
 
 	SET_ETHTOOL_OPS(dev, &virtnet_ethtool_ops);
 	SET_NETDEV_DEV(dev, &vdev->dev);
@@ -1342,6 +1470,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vdev->priv = vi;
 
 	vi->num_queue_pairs = num_queue_pairs;
+
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
 	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6) ||
@@ -1382,6 +1511,31 @@ static int virtnet_probe(struct virtio_device *vdev)
 		}
 	}
 
+	/* Config flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		table = (u16 *)kmap_atomic(page);
+		for (i = 0; i < (PAGE_SIZE / 16); i++) {
+			/* invalidate all entries */
+			table[i] = num_queue_pairs;
+		}
+
+		vi->fd_page = page;
+		kunmap_atomic(table);
+		virtnet_set_fd(dev, page_to_pfn(page));
+
+		err = virtnet_init_rx_cpu_rmap(vi);
+		if (err)
+			goto free_recv_bufs;
+
+		err = virtnet_init_rq_affinity(vi);
+		if (err)
+			goto free_recv_bufs;
+
+	}
+
 	/* Assume link up if device can't report link status,
 	   otherwise get link status from config. */
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
@@ -1437,6 +1591,13 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	/* Free memory for send and receive queues */
 	free_rq_sq(vi);
 
+	/* Free the page of flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		if (vi->fd_page)
+			put_page(vi->fd_page);
+
+		virtnet_free_rq_affinity(vi);
+	}
 	free_netdev(vi->dev);
 }
 
@@ -1453,7 +1614,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
-	VIRTIO_NET_F_GUEST_RXHASH,
+	VIRTIO_NET_F_GUEST_RXHASH, VIRTIO_NET_F_HOST_FD,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2291317..abcea52 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
 #define VIRTIO_NET_F_GUEST_RXHASH 22	/* Guest can receive rxhash */
+#define VIRTIO_NET_F_HOST_FD	23	/* Host has a flow director */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,6 +64,11 @@ struct virtio_net_config {
 	__u16 num_queues;
 } __attribute__((packed));
 
+struct virtio_net_config_fd {
+	struct virtio_net_config cfg;
+	u32 addr;
+} __packed;
+
 /* This is the first element of the scatter-gather list.  If you don't
 * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {
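
Note that accelerated RFS only engages once ordinary RFS is configured in
the guest. A minimal sketch of that configuration, using the standard RFS
knobs under /proc and /sys (the device name eth0 and the table sizes are
assumptions):

    #include <stdio.h>

    static int write_val(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fprintf(f, "%s\n", val);
            return fclose(f);
    }

    int main(void)
    {
            /* Global socket-flow table shared by all devices. */
            write_val("/proc/sys/net/core/rps_sock_flow_entries", "32768");
            /* Per-rx-queue flow count; repeat for each rx-<n> of eth0. */
            write_val("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt", "2048");
            return 0;
    }
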
Stefan Hajnoczi
2011-Dec-05 10:38 UTC
[net-next RFC PATCH 2/5] tuntap: simple flow director support
On Mon, Dec 5, 2011 at 8:58 AM, Jason Wang <jasowang at redhat.com> wrote:
> This patch adds a simple flow director to the tun/tap device. It is just
> a page that contains the hash-to-queue mapping, which can be changed by
> userspace. The backend (tap/macvtap) queries this table to get the
> desired queue of a packet when it sends packets to userspace.
>
> The page address is set through a new ioctl, TUNSETFD, and the page is
> pinned until device exit or until another page is specified.

Please use "flow" or "fdir" instead of "fd" in the ioctl and code.
"fd" reminds of file descriptor.  The ixgbe driver uses "fdir".

Stefan
Stefan Hajnoczi
2011-Dec-05 10:55 UTC
[net-next RFC PATCH 5/5] virtio-net: flow director support
On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang <jasowang at redhat.com> wrote:
> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct virtio_device *vdev = vi->vdev;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
> +		vdev->config->set(vdev,
> +				  offsetof(struct virtio_net_config_fd, addr),
> +				  &pfn, sizeof(u32));

Please use the virtio model (i.e. virtqueues) instead of shared memory.
Mapping a page breaks the virtio abstraction.

Stefan
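
For comparison, a ctrl-vq based interface (the alternative the cover
letter mentions) could follow the style of the existing VIRTIO_NET_CTRL_*
commands. Purely a sketch; the class value, command value, and struct are
invented here:

    #include <linux/types.h>

    #define VIRTIO_NET_CTRL_FLOW		5	/* hypothetical class */
    #define VIRTIO_NET_CTRL_FLOW_SET	0	/* hypothetical command */

    struct virtio_net_ctrl_flow {
            __u32 rxhash;	/* hash (already masked) to steer */
            __u16 queue;	/* destination rx queue */
    };

The trade-off noted in the cover letter is latency: a ctrl-vq command per
flow update is slower than a guest-side table write that the host reads on
demand.
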
On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang <jasowang at redhat.com> wrote:
> multiple queue virtio-net: flow steering through host/guest cooperation
>
> Hello all:
>
> This is a rough series that adds guest/host cooperation for flow
> steering support, based on Krishna Kumar's multiple queue virtio-net
> driver patch 3/3 (http://lwn.net/Articles/467283/).

Is there a real (physical) device which does this kind of thing?  How do
they do it?  Can we copy them?

Cheers,
Rusty.