This is a first prototype of a new interface into the network stack, to eventually replace tun/tap and the bridge driver in certain virtual machine setups.

Background
----------

The 'Edge Virtual Bridging' working group is discussing ways to overcome the limitation of virtual bridges in hypervisors. One important part of this is the Virtual Ethernet Port Aggregator (VEPA), as described in
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf

In short, the idea of VEPA is that virtual machines do not communicate with each other through direct bridging in the hypervisor but only via an external managed switch that is already well integrated into the data center, including network filtering, accounting and monitoring. While we can do most of that efficiently in the Linux bridge code, doing it externally simplifies the overall setup.

Related work
------------

Patches to implement VEPA in the Linux bridge driver were posted by Anna Fischer in June, see http://patchwork.ozlabs.org/patch/28702/. Those patches are good and will hopefully be merged in 2.6.32, but I think we can take some shortcuts with an alternative approach:

The macvlan driver already has the property of forwarding all traffic between guests and an external interface but not between the guests, just as VEPA needs it. Also, VEPA explicitly does not want or need advanced filtering in the way that netfilter-bridge provides, so we can use macvlan to replace the bridge code in this setup, reducing the code path through the kernel.

This works fine with containers and network namespaces, but not easily with kvm/qemu because we only have a network device. Or Gerlitz posted a "raw" packet socket backend for qemu to deal with this, at http://marc.info/?l=qemu-devel&m=124653801212767 and at least three other people have implemented similar functionality independently.

This driver
-----------

While the other approaches should work as well, doing it using a tap interface should give additional benefits:

* We can keep using the optimizations for jumbo frames that we have put
  into the tun/tap driver.

* No need for root permissions that packet sockets need, just use 'ip
  link add link type macvtap' to create a new device and give it the right
  permissions using udev (using one tap per macvlan netdev).

* support for multiqueue network adapters by opening the tap device
  multiple times, using one file descriptor per guest CPU/network
  queue/interrupt (if the adapter supports multiple queues on a single
  MAC address).

* support for zero-copy receive/transmit using async I/O on the tap device
  (if the adapter supports per MAC rx queues).

* The same framework in macvlan can be used to add a third backend
  into a future kernel based virtio-net implementation.

This version of the driver does not support any of those features, but they all appear possible to add ;).

The driver is currently called 'macvtap', but I'd be more than happy to change that if anyone could suggest a better name. The code is still in an early stage and I wish I had found more time to polish it, but at this time, I'd first like to know if people agree with the basic concept at all.

Cc: Patrick McHardy <kaber at trash.net>
Cc: Stephen Hemminger <shemminger at linux-foundation.org>
Cc: "David S. Miller" <davem at davemloft.net>
Cc: "Michael S. Tsirkin" <mst at redhat.com>
Cc: Herbert Xu <herbert at gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz at voltaire.com>
Cc: "Fischer, Anna" <anna.fischer at hp.com>
Cc: netdev at vger.kernel.org
Cc: bridge at lists.linux-foundation.org
Cc: linux-kernel at vger.kernel.org
Cc: Edge Virtual Bridging <evb at yahoogroups.com>
Signed-off-by: Arnd Bergmann <arnd at arndb.de>

---

The evb mailing list eats Cc headers, please make sure to keep everybody in your Cc list when replying there.
--- drivers/net/Kconfig | 12 ++ drivers/net/Makefile | 1 + drivers/net/macvlan.c | 39 +++----- drivers/net/macvlan.h | 37 +++++++ drivers/net/macvtap.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 341 insertions(+), 24 deletions(-) create mode 100644 drivers/net/macvlan.h create mode 100644 drivers/net/macvtap.c diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 5f6509a..0b9ac6a 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -90,6 +90,18 @@ config MACVLAN To compile this driver as a module, choose M here: the module will be called macvlan. +config MACVTAP + tristate "MAC-VLAN based tap driver (EXPERIMENTAL)" + depends on MACVLAN + help + This adds a specialized tap character device driver that is based + on the MAC-VLAN network interface, called macvtap. A macvtap device + can be added in the same way as a macvlan device, using 'type + macvlan', and then be accessed through the tap user space interface. + + To compile this driver as a module, choose M here: the module + will be called macvtap. 
+ config EQUALIZER tristate "EQL (serial line load balancing) support" ---help--- diff --git a/drivers/net/Makefile b/drivers/net/Makefile index ead8cab..8a2d2d7 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o obj-$(CONFIG_DUMMY) += dummy.o obj-$(CONFIG_IFB) += ifb.o obj-$(CONFIG_MACVLAN) += macvlan.o +obj-$(CONFIG_MACVTAP) += macvtap.o obj-$(CONFIG_DE600) += de600.o obj-$(CONFIG_DE620) += de620.o obj-$(CONFIG_LANCE) += lance.o diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 99eed9f..9f7dc6a 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -30,22 +30,7 @@ #include <linux/if_macvlan.h> #include <net/rtnetlink.h> -#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) - -struct macvlan_port { - struct net_device *dev; - struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; - struct list_head vlans; -}; - -struct macvlan_dev { - struct net_device *dev; - struct list_head list; - struct hlist_node hlist; - struct macvlan_port *port; - struct net_device *lowerdev; -}; - +#include "macvlan.h" static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port, const unsigned char *addr) @@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb, else nskb->pkt_type = PACKET_MULTICAST; - netif_rx(nskb); + vlan->receive(nskb); } } } @@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) skb->dev = dev; skb->pkt_type = PACKET_HOST; - netif_rx(skb); + vlan->receive(skb); return NULL; } -static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev) +int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev) { const struct macvlan_dev *vlan = netdev_priv(dev); unsigned int len = skb->len; @@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev) } return NETDEV_TX_OK; } +EXPORT_SYMBOL_GPL(macvlan_start_xmit); static int macvlan_hard_header(struct sk_buff *skb, 
struct net_device *dev, unsigned short type, const void *daddr, @@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = { .ndo_validate_addr = eth_validate_addr, }; -static void macvlan_setup(struct net_device *dev) +void macvlan_setup(struct net_device *dev) { ether_setup(dev); @@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev) dev->ethtool_ops = &macvlan_ethtool_ops; dev->tx_queue_len = 0; } +EXPORT_SYMBOL_GPL(macvlan_setup); static int macvlan_port_create(struct net_device *dev) { @@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev) } } -static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) +int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) { if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) @@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) } return 0; } +EXPORT_SYMBOL_GPL(macvlan_validate); -static int macvlan_newlink(struct net_device *dev, - struct nlattr *tb[], struct nlattr *data[]) +int macvlan_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) { struct macvlan_dev *vlan = netdev_priv(dev); struct macvlan_port *port; @@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev, vlan->lowerdev = lowerdev; vlan->dev = dev; vlan->port = port; + vlan->receive = netif_rx; err = register_netdevice(dev); if (err < 0) @@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev, macvlan_transfer_operstate(dev); return 0; } +EXPORT_SYMBOL_GPL(macvlan_newlink); -static void macvlan_dellink(struct net_device *dev) +void macvlan_dellink(struct net_device *dev) { struct macvlan_dev *vlan = netdev_priv(dev); struct macvlan_port *port = vlan->port; @@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev) if (list_empty(&port->vlans)) macvlan_port_destroy(port->dev); } +EXPORT_SYMBOL_GPL(macvlan_dellink); static struct rtnl_link_ops macvlan_link_ops 
__read_mostly = { .kind = "macvlan", diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h new file mode 100644 index 0000000..3f3c6c3 --- /dev/null +++ b/drivers/net/macvlan.h @@ -0,0 +1,37 @@ +#ifndef _MACVLAN_H +#define _MACVLAN_H + +#include <linux/netdevice.h> +#include <linux/netlink.h> +#include <linux/list.h> + +#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) + +struct macvlan_port { + struct net_device *dev; + struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; + struct list_head vlans; +}; + +struct macvlan_dev { + struct net_device *dev; + struct list_head list; + struct hlist_node hlist; + struct macvlan_port *port; + struct net_device *lowerdev; + + int (*receive)(struct sk_buff *skb); +}; + +extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev); + +extern void macvlan_setup(struct net_device *dev); + +extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]); + +extern int macvlan_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]); + +extern void macvlan_dellink(struct net_device *dev); + +#endif /* _MACVLAN_H */ diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c new file mode 100644 index 0000000..d99bfc0 --- /dev/null +++ b/drivers/net/macvtap.c @@ -0,0 +1,276 @@ +#include <linux/etherdevice.h> +#include <linux/nsproxy.h> +#include <linux/module.h> +#include <linux/skbuff.h> +#include <linux/cache.h> +#include <linux/sched.h> +#include <linux/types.h> +#include <linux/init.h> +#include <linux/wait.h> +#include <linux/cdev.h> +#include <linux/fs.h> + +#include <net/net_namespace.h> +#include <net/rtnetlink.h> + +#include "macvlan.h" + +struct macvtap_dev { + struct macvlan_dev m; + struct cdev cdev; + struct sk_buff_head readq; + wait_queue_head_t wait; +}; + +/* + * Minor number matches netdev->ifindex, so need a large value + */ +static int macvtap_major; +#define MACVTAP_NUM_DEVS 65536 + +static int macvtap_receive(struct sk_buff *skb) +{ + struct macvtap_dev *vtap = 
netdev_priv(skb->dev); + + skb_queue_tail(&vtap->readq, skb); + wake_up(&vtap->wait); + return 0; +} + +static int macvtap_open(struct inode *inode, struct file *file) +{ + struct net *net = current->nsproxy->net_ns; + int ifindex = iminor(inode); + struct net_device *dev = dev_get_by_index(net, ifindex); + int err; + + err = -ENODEV; + if (!dev) + goto out1; + + file->private_data = netdev_priv(dev); + err = 0; +out1: + return err; +} + +static int macvtap_release(struct inode *inode, struct file *file) +{ + struct macvtap_dev *vtap = file->private_data; + + if (!vtap) + return 0; + + dev_put(vtap->m.dev); + return 0; +} + +/* Get packet from user space buffer */ +static ssize_t macvtap_get_user(struct macvtap_dev *vtap, + const struct iovec *iv, size_t count, + int noblock) +{ + struct sk_buff *skb; + size_t len = count; + + if (unlikely(len < ETH_HLEN)) + return -EINVAL; + + skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL); + + if (!skb) { + vtap->m.dev->stats.rx_dropped++; + return -ENOMEM; + } + + skb_reserve(skb, NET_IP_ALIGN); + skb_put(skb, count); + + if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) { + vtap->m.dev->stats.rx_dropped++; + kfree_skb(skb); + return -EFAULT; + } + + skb_set_network_header(skb, ETH_HLEN); + skb->dev = vtap->m.lowerdev; + + macvlan_start_xmit(skb, vtap->m.dev); + + return count; +} + +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t pos) +{ + struct file *file = iocb->ki_filp; + ssize_t result; + struct macvtap_dev *vtap = file->private_data; + + result = macvtap_get_user(vtap, iv, iov_length(iv, count), + file->f_flags & O_NONBLOCK); + + return result; +} + +/* Put packet to the user space buffer */ +static ssize_t macvtap_put_user(struct macvtap_dev *vtap, + struct sk_buff *skb, + struct iovec *iv, int len) +{ + int ret; + + skb_push(skb, ETH_HLEN); + len = min_t(int, skb->len, len); + + ret = skb_copy_datagram_iovec(skb, 0, iv, len); + + 
vtap->m.dev->stats.rx_packets++; + vtap->m.dev->stats.rx_bytes += len; + + return ret ? ret : len; +} + +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t pos) +{ + struct file *file = iocb->ki_filp; + struct macvtap_dev *vtap = file->private_data; + DECLARE_WAITQUEUE(wait, current); + struct sk_buff *skb; + ssize_t len, ret = 0; + + if (!vtap) + return -EBADFD; + + len = iov_length(iv, count); + if (len < 0) { + ret = -EINVAL; + goto out; + } + + add_wait_queue(&vtap->wait, &wait); + while (len) { + current->state = TASK_INTERRUPTIBLE; + + /* Read frames from the queue */ + if (!(skb=skb_dequeue(&vtap->readq))) { + if (file->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + /* Nothing to read, let's sleep */ + schedule(); + continue; + } + ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len); + kfree_skb(skb); + break; + } + + current->state = TASK_RUNNING; + remove_wait_queue(&vtap->wait, &wait); + +out: + return ret; +} + +struct file_operations macvtap_fops = { + .owner = THIS_MODULE, + .open = macvtap_open, + .release = macvtap_release, + .aio_read = macvtap_aio_read, + .aio_write = macvtap_aio_write, + .llseek = no_llseek, +}; + +static int macvtap_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) +{ + struct macvtap_dev *vtap = netdev_priv(dev); + int err; + + err = macvlan_newlink(dev, tb, data); + if (err) + goto out1; + + cdev_init(&vtap->cdev, &macvtap_fops); + vtap->cdev.owner = THIS_MODULE; + err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1); + + if (err) + goto out2; + + /* + * TODO: add class dev so device node gets created automatically + * by udev. 
+ */ + pr_debug("%s:%d: added cdev %d:%d for dev %s\n", + __func__, __LINE__, MAJOR(macvtap_major), + dev->ifindex, dev->name); + + skb_queue_head_init(&vtap->readq); + init_waitqueue_head(&vtap->wait); + vtap->m.receive = macvtap_receive; + + return 0; + +out2: + macvlan_dellink(dev); +out1: + return err; +} + +static void macvtap_dellink(struct net_device *dev) +{ + struct macvtap_dev *vtap = netdev_priv(dev); + cdev_del(&vtap->cdev); + /* TODO: kill open file descriptors */ + macvlan_dellink(dev); +} + +static struct rtnl_link_ops macvtap_link_ops __read_mostly = { + .kind = "macvtap", + .priv_size = sizeof(struct macvtap_dev), + .setup = macvlan_setup, + .validate = macvlan_validate, + .newlink = macvtap_newlink, + .dellink = macvtap_dellink, +}; + +static int macvtap_init(void) +{ + int err; + + err = alloc_chrdev_region(&macvtap_major, 0, + MACVTAP_NUM_DEVS, "macvtap"); + if (err) + goto out1; + + err = rtnl_link_register(&macvtap_link_ops); + if (err) + goto out2; + + return 0; + +out2: + unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS); +out1: + return err; +} +module_init(macvtap_init); + +static void macvtap_exit(void) +{ + rtnl_link_unregister(&macvtap_link_ops); + unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS); +} +module_exit(macvtap_exit); + +MODULE_ALIAS_RTNL_LINK("macvtap"); +MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>"); +MODULE_LICENSE("GPL"); -- 1.6.0.4
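For readers who want to try the prototype from user space, the following is a rough sketch of how a program might consume frames from a macvtap character device. It relies only on what the patch shows: each read() returns one queued frame starting at the Ethernet header (macvtap_put_user pushes ETH_HLEN back onto the skb), and writes shorter than ETH_HLEN are rejected. The helper names tap_read_frame and eth_parse and the struct eth_info are made up for illustration, and since the patch does not yet create a device node automatically, the file descriptor is assumed to come from a node created by hand with the driver's major number and the interface's ifindex as minor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ETH_ALEN 6	/* bytes in a MAC address */
#define ETH_HLEN 14	/* dst MAC + src MAC + ethertype */

/* Hypothetical helper type: the fields of one Ethernet header. */
struct eth_info {
	unsigned char dst[ETH_ALEN];
	unsigned char src[ETH_ALEN];
	uint16_t ethertype;	/* converted to host byte order */
};

/*
 * Read one frame from an already opened macvtap file descriptor.
 * Assumption: one read() call returns at most one frame, as in the
 * prototype's macvtap_aio_read().
 */
ssize_t tap_read_frame(int fd, unsigned char *buf, size_t buflen)
{
	assert(buf != NULL);
	return read(fd, buf, buflen);
}

/* Split the leading Ethernet header out of a raw frame. */
int eth_parse(const unsigned char *frame, size_t len, struct eth_info *out)
{
	if (len < ETH_HLEN)
		return -1;	/* too short to hold an Ethernet header */

	memcpy(out->dst, frame, ETH_ALEN);
	memcpy(out->src, frame + ETH_ALEN, ETH_ALEN);
	out->ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	return 0;
}
```

A caller would open the device node, loop on tap_read_frame(), and hand each buffer to eth_parse(); with O_NONBLOCK set, the prototype returns -EAGAIN when the queue is empty.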
From: Arnd Bergmann <arnd at arndb.de>
Date: Thu, 6 Aug 2009 21:50:28 +0000

> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I don't know enough to say how good a solution this is for the problem, but I certainly like this driver for its utter simplicity and minimalness.
On Thu, Aug 6, 2009 at 3:50 PM, Arnd Bergmann <arnd at arndb.de> wrote:
> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I have some general questions about the intended use and benefits of VEPA, from an IT perspective:

In which virtual machine setups and technologies do you foresee this interface being used?

Is this new interface to be used within a virtual machine or container, on the master node, or both?

What interface(s) would need to be configured for a single virtual machine to use VEPA to access the network?

What are the current flexibility, security or performance limitations of tun/tap and bridge that make this new interface necessary or beneficial?

Is this new interface useful at all for VPN solutions or is it *specifically* targeted for connecting virtual machines to the network?

Is this essentially a bridge with layer-2 isolation for the virtual machine interfaces built-in? If isolation is provided, what mechanism is used to accomplish this, and how secure is it?

Does VEPA look like a regular ethernet interface (eth0) on the virtual machine side?

Are there any associated user-space tools required for configuring a VEPA?

Do you have any HOWTO-style documentation that would demonstrate how this interface would be used in production? Or a FAQ?

This seems like a very interesting effort but I don't quite have a good grasp of VEPA's benefits and limitations -- I imagine that others are in the same boat too.

Best Regards,
Daniel
Paul Congdon (UC Davis)
2009-Aug-07 19:10 UTC
[Bridge] [PATCH] macvlan: add tap device backend
Responding to Daniel's questions...

> I have some general questions about the intended use and benefits of
> VEPA, from an IT perspective:
>
> In which virtual machine setups and technologies do you foresee this
> interface being used?

The benefit of VEPA is the coordination and unification with the external network switch. So, in environments where you need or want your feature-rich, wire-speed, external network device (firewall/switch/IPS/content-filter) to provide consistent policy enforcement, and you want your VMs' traffic to be subject to that enforcement, you will want their traffic directed externally. Perhaps you have some VMs that are on a DMZ or clustering an application or implementing a multi-tier application where you would normally place a firewall in-between the tiers.

> Is this new interface to be used within a virtual machine or
> container, on the master node, or both?

It is really an interface to a new type of virtual switch. When you create a virtual network, I would imagine it being a new mode of operation (bridge, NAT, VEPA, etc).

> What interface(s) would need to be configured for a single virtual
> machine to use VEPA to access the network?

It would be the same as if that machine were configured to use a bridge to access the network, but the bridge mode would be different.

> What are the current flexibility, security or performance limitations
> of tun/tap and bridge that make this new interface necessary or
> beneficial?

If you have VMs that will be communicating with one another on the same physical machine, and you want their traffic to be exposed to an in-line network device such as an application firewall/IPS/content-filter, then without this feature you will have to have this device co-located within the same physical server. This will use up CPU cycles that you presumably purchased to run applications, it will require a lot of consistent configuration on all physical machines, and it could potentially involve a lot of software licensing, additional cost, etc. Everything would need to be replicated on each physical machine. With the VEPA capability, you can leverage all this functionality in an external network device and have it managed and configured in one place. The external implementation is likely a higher performance, silicon-based implementation. It should make it easier to migrate machines from one physical server to another and maintain the same network policy enforcement.

> Is this new interface useful at all for VPN solutions or is it
> *specifically* targeted for connecting virtual machines to the
> network?

I'm not sure I see the benefit for VPN solutions, but I'd have to understand the deployment scenario better. Certainly this is targeting connecting VMs to the adjacent physical LAN.

> Is this essentially a bridge with layer-2 isolation for the virtual
> machine interfaces built-in? If isolation is provided, what mechanism
> is used to accomplish this, and how secure is it?

That might be an oversimplification, but you can achieve layer-2 isolation if you connect to a standard external switch. If that switch has 'hairpin' forwarding, then the VMs can talk at L2, but their traffic is forced through the bridge. If that bridge is a security device (e.g. firewall), then their traffic is exposed to that. The isolation in the outbound direction is created by the way frames are forwarded. They are simply dropped on the wire, so no VMs can talk directly to one another without their traffic first going external. In the inbound direction, the isolation is created using the forwarding table.

> Does VEPA look like a regular ethernet interface (eth0) on the virtual
> machine side?

Yes.

> Are there any associated user-space tools required for configuring a
> VEPA?
>

The standard brctl utility has been augmented to enable/disable the capability.

> Do you have any HOWTO-style documentation that would demonstrate how
> this interface would be used in production? Or a FAQ?
>

None yet.

> This seems like a very interesting effort but I don't quite have a
> good grasp of VEPA's benefits and limitations -- I imagine that others
> are in the same boat too.
>

There are some seminar slides available on the IEEE 802.1 website and elsewhere. The patch had a reference to a seminar, but here is another one you might find helpful:
http://www.internet2.edu/presentations/jt2009jul/20090719-congdon.pdf

I'm happy to try to explain further...

Paul
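The forwarding behaviour described in the reply above can be modelled in a few lines of C. This is only a toy sketch under simplifying assumptions, not the macvlan or bridge code: the names vepa_port, vepa_forward_outbound, vepa_forward_inbound, VEPA_UPLINK and VEPA_DROP are invented here, the forwarding table is a plain array, and broadcast/multicast handling is left out. Outbound frames always go to the uplink regardless of destination; inbound frames are delivered by looking up the destination MAC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical return markers for this sketch. */
#define VEPA_UPLINK (-1)	/* send the frame out on the physical wire */
#define VEPA_DROP   (-2)	/* no guest port owns the destination MAC */

/* One guest-facing port, identified by its MAC address. */
struct vepa_port {
	unsigned char mac[ETH_ALEN];
};

/*
 * Outbound direction: a frame sent by a guest is never delivered to a
 * sibling guest directly, even if the destination MAC is local. It
 * always goes to the uplink, where the external switch decides.
 */
int vepa_forward_outbound(const struct vepa_port *ports, size_t nports,
			  const unsigned char *dst)
{
	(void)ports;
	(void)nports;
	(void)dst;
	return VEPA_UPLINK;
}

/*
 * Inbound direction: a frame arriving from the uplink (possibly
 * hairpinned back by the switch) is delivered to the port whose MAC
 * matches the destination. Returns the port index or VEPA_DROP.
 */
int vepa_forward_inbound(const struct vepa_port *ports, size_t nports,
			 const unsigned char *dst)
{
	size_t i;

	assert(ports != NULL && dst != NULL);
	for (i = 0; i < nports; i++)
		if (memcmp(ports[i].mac, dst, ETH_ALEN) == 0)
			return (int)i;
	return VEPA_DROP;
}
```

With two guests on the same host, a frame from guest 0 to guest 1 first yields VEPA_UPLINK; only when the external switch hairpins it back does the inbound lookup resolve to port 1.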
On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> This driver
> -----------
> While the other approaches should work as well, doing it using a tap
> interface should give additional benefits:
>
> * We can keep using the optimizations for jumbo frames that we have put
>   into the tun/tap driver.
>
> * No need for root permissions that packet sockets need, just use 'ip
>   link add link type macvtap' to create a new device and give it the right
>   permissions using udev (using one tap per macvlan netdev).
>
> * support for multiqueue network adapters by opening the tap device
>   multiple times, using one file descriptor per guest CPU/network
>   queue/interrupt (if the adapter supports multiple queues on a single
>   MAC address).
>
> * support for zero-copy receive/transmit using async I/O on the tap device
>   (if the adapter supports per MAC rx queues).
>
> * The same framework in macvlan can be used to add a third backend
>   into a future kernel based virtio-net implementation.

Could you split the patches up, to make this last easier?
patch 1 - export framework
patch 2 - code using it

> This version of the driver does not support any of those features,
> but they all appear possible to add ;).
> The driver is currently called 'macvtap', but I'd be more than happy
> to change that if anyone could suggest a better name. The code is
> still in an early stage and I wish I had found more time to polish
> it, but at this time, I'd first like to know if people agree with the
> basic concept at all.
>
> Cc: Patrick McHardy <kaber at trash.net>
> Cc: Stephen Hemminger <shemminger at linux-foundation.org>
> Cc: "David S. Miller" <davem at davemloft.net>
> Cc: "Michael S.
Tsirkin" <mst at redhat.com> > Cc: Herbert Xu <herbert at gondor.apana.org.au> > Cc: Or Gerlitz <ogerlitz at voltaire.com> > Cc: "Fischer, Anna" <anna.fischer at hp.com> > Cc: netdev at vger.kernel.org > Cc: bridge at lists.linux-foundation.org > Cc: linux-kernel at vger.kernel.org > Cc: Edge Virtual Bridging <evb at yahoogroups.com> > Signed-off-by: Arnd Bergmann <arnd at arndb.de> > > --- > > The evb mailing list eats Cc headers, please make sure to keep everybody > in your Cc list when replying there. > --- > drivers/net/Kconfig | 12 ++ > drivers/net/Makefile | 1 + > drivers/net/macvlan.c | 39 +++----- > drivers/net/macvlan.h | 37 +++++++ > drivers/net/macvtap.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++ > 5 files changed, 341 insertions(+), 24 deletions(-) > create mode 100644 drivers/net/macvlan.h > create mode 100644 drivers/net/macvtap.c > > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > index 5f6509a..0b9ac6a 100644 > --- a/drivers/net/Kconfig > +++ b/drivers/net/Kconfig > @@ -90,6 +90,18 @@ config MACVLAN > To compile this driver as a module, choose M here: the module > will be called macvlan. > > +config MACVTAP > + tristate "MAC-VLAN based tap driver (EXPERIMENTAL)" > + depends on MACVLAN > + help > + This adds a specialized tap character device driver that is based > + on the MAC-VLAN network interface, called macvtap. A macvtap device > + can be added in the same way as a macvlan device, using 'type > + macvlan', and then be accessed through the tap user space interface. > + > + To compile this driver as a module, choose M here: the module > + will be called macvtap. 
> + > config EQUALIZER > tristate "EQL (serial line load balancing) support" > ---help--- > diff --git a/drivers/net/Makefile b/drivers/net/Makefile > index ead8cab..8a2d2d7 100644 > --- a/drivers/net/Makefile > +++ b/drivers/net/Makefile > @@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o > obj-$(CONFIG_DUMMY) += dummy.o > obj-$(CONFIG_IFB) += ifb.o > obj-$(CONFIG_MACVLAN) += macvlan.o > +obj-$(CONFIG_MACVTAP) += macvtap.o > obj-$(CONFIG_DE600) += de600.o > obj-$(CONFIG_DE620) += de620.o > obj-$(CONFIG_LANCE) += lance.o > diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c > index 99eed9f..9f7dc6a 100644 > --- a/drivers/net/macvlan.c > +++ b/drivers/net/macvlan.c > @@ -30,22 +30,7 @@ > #include <linux/if_macvlan.h> > #include <net/rtnetlink.h> > > -#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) > - > -struct macvlan_port { > - struct net_device *dev; > - struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; > - struct list_head vlans; > -}; > - > -struct macvlan_dev { > - struct net_device *dev; > - struct list_head list; > - struct hlist_node hlist; > - struct macvlan_port *port; > - struct net_device *lowerdev; > -}; > - > +#include "macvlan.h" > > static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port, > const unsigned char *addr) > @@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb, > else > nskb->pkt_type = PACKET_MULTICAST; > > - netif_rx(nskb); > + vlan->receive(nskb); > } > } > } > @@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) > skb->dev = dev; > skb->pkt_type = PACKET_HOST; > > - netif_rx(skb); > + vlan->receive(skb); > return NULL; > } > > -static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev) > +int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev) > { > const struct macvlan_dev *vlan = netdev_priv(dev); > unsigned int len = skb->len; > @@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct 
net_device *dev) > } > return NETDEV_TX_OK; > } > +EXPORT_SYMBOL_GPL(macvlan_start_xmit); > > static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev, > unsigned short type, const void *daddr, > @@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = { > .ndo_validate_addr = eth_validate_addr, > }; > > -static void macvlan_setup(struct net_device *dev) > +void macvlan_setup(struct net_device *dev) > { > ether_setup(dev); > > @@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev) > dev->ethtool_ops = &macvlan_ethtool_ops; > dev->tx_queue_len = 0; > } > +EXPORT_SYMBOL_GPL(macvlan_setup); > > static int macvlan_port_create(struct net_device *dev) > { > @@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev) > } > } > > -static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) > +int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) > { > if (tb[IFLA_ADDRESS]) { > if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) > @@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) > } > return 0; > } > +EXPORT_SYMBOL_GPL(macvlan_validate); > > -static int macvlan_newlink(struct net_device *dev, > - struct nlattr *tb[], struct nlattr *data[]) > +int macvlan_newlink(struct net_device *dev, > + struct nlattr *tb[], struct nlattr *data[]) > { > struct macvlan_dev *vlan = netdev_priv(dev); > struct macvlan_port *port; > @@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev, > vlan->lowerdev = lowerdev; > vlan->dev = dev; > vlan->port = port; > + vlan->receive = netif_rx; > > err = register_netdevice(dev); > if (err < 0) > @@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev, > macvlan_transfer_operstate(dev); > return 0; > } > +EXPORT_SYMBOL_GPL(macvlan_newlink); > > -static void macvlan_dellink(struct net_device *dev) > +void macvlan_dellink(struct net_device *dev) > { > struct macvlan_dev *vlan = netdev_priv(dev); > 
struct macvlan_port *port = vlan->port; > @@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev) > if (list_empty(&port->vlans)) > macvlan_port_destroy(port->dev); > } > +EXPORT_SYMBOL_GPL(macvlan_dellink); > > static struct rtnl_link_ops macvlan_link_ops __read_mostly = { > .kind = "macvlan", > diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h > new file mode 100644 > index 0000000..3f3c6c3 > --- /dev/null > +++ b/drivers/net/macvlan.h > @@ -0,0 +1,37 @@ > +#ifndef _MACVLAN_H > +#define _MACVLAN_H > + > +#include <linux/netdevice.h> > +#include <linux/netlink.h> > +#include <linux/list.h> > + > +#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) > + > +struct macvlan_port { > + struct net_device *dev; > + struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; > + struct list_head vlans; > +}; > + > +struct macvlan_dev { > + struct net_device *dev; > + struct list_head list; > + struct hlist_node hlist; > + struct macvlan_port *port; > + struct net_device *lowerdev; > + > + int (*receive)(struct sk_buff *skb); > +}; > + > +extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev); > + > +extern void macvlan_setup(struct net_device *dev); > + > +extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]); > + > +extern int macvlan_newlink(struct net_device *dev, > + struct nlattr *tb[], struct nlattr *data[]); > + > +extern void macvlan_dellink(struct net_device *dev); > + > +#endif /* _MACVLAN_H */ > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c > new file mode 100644 > index 0000000..d99bfc0 > --- /dev/null > +++ b/drivers/net/macvtap.c > @@ -0,0 +1,276 @@ > +#include <linux/etherdevice.h> > +#include <linux/nsproxy.h> > +#include <linux/module.h> > +#include <linux/skbuff.h> > +#include <linux/cache.h> > +#include <linux/sched.h> > +#include <linux/types.h> > +#include <linux/init.h> > +#include <linux/wait.h> > +#include <linux/cdev.h> > +#include <linux/fs.h> > + > +#include <net/net_namespace.h> > 
+#include <net/rtnetlink.h> > + > +#include "macvlan.h" > + > +struct macvtap_dev { > + struct macvlan_dev m; > + struct cdev cdev; > + struct sk_buff_head readq; > + wait_queue_head_t wait; > +}; > + > +/* > + * Minor number matches netdev->ifindex, so need a large value > + */ > +static int macvtap_major; > +#define MACVTAP_NUM_DEVS 65536 > + > +static int macvtap_receive(struct sk_buff *skb) > +{ > + struct macvtap_dev *vtap = netdev_priv(skb->dev); > + > + skb_queue_tail(&vtap->readq, skb); > + wake_up(&vtap->wait); > + return 0; > +} > + > +static int macvtap_open(struct inode *inode, struct file *file) > +{ > + struct net *net = current->nsproxy->net_ns; > + int ifindex = iminor(inode); > + struct net_device *dev = dev_get_by_index(net, ifindex); > + int err; > + > + err = -ENODEV; > + if (!dev) > + goto out1; > + > + file->private_data = netdev_priv(dev); > + err = 0; > +out1: > + return err; > +} > + > +static int macvtap_release(struct inode *inode, struct file *file) > +{ > + struct macvtap_dev *vtap = file->private_data; > + > + if (!vtap) > + return 0; > + > + dev_put(vtap->m.dev); > + return 0; > +} > + > +/* Get packet from user space buffer */ > +static ssize_t macvtap_get_user(struct macvtap_dev *vtap, > + const struct iovec *iv, size_t count, > + int noblock) > +{ > + struct sk_buff *skb; > + size_t len = count; > + > + if (unlikely(len < ETH_HLEN)) > + return -EINVAL; > + > + skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL); > + > + if (!skb) { > + vtap->m.dev->stats.rx_dropped++; > + return -ENOMEM; > + } > + > + skb_reserve(skb, NET_IP_ALIGN); > + skb_put(skb, count); > + > + if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) { > + vtap->m.dev->stats.rx_dropped++; > + kfree_skb(skb); > + return -EFAULT; > + } > + > + skb_set_network_header(skb, ETH_HLEN); > + skb->dev = vtap->m.lowerdev; > + > + macvlan_start_xmit(skb, vtap->m.dev); > + > + return count; > +}With tap, we discovered that not limiting the number of outstanding skbs hurts UDP 
performance. And the solution was to limit the number of outstanding
packets - with hacks to work around the fact that userspace .

> +
> +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +				 unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	ssize_t result;
> +	struct macvtap_dev *vtap = file->private_data;
> +
> +	result = macvtap_get_user(vtap, iv, iov_length(iv, count),
> +				  file->f_flags & O_NONBLOCK);
> +
> +	return result;
> +}
> +
> +/* Put packet to the user space buffer */
> +static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
> +				struct sk_buff *skb,
> +				struct iovec *iv, int len)
> +{
> +	int ret;
> +
> +	skb_push(skb, ETH_HLEN);
> +	len = min_t(int, skb->len, len);
> +
> +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> +
> +	vtap->m.dev->stats.rx_packets++;
> +	vtap->m.dev->stats.rx_bytes += len;

where does atomicity guarantee for these counters come from?

> +
> +	return ret ? ret : len;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_dev *vtap = file->private_data;
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb;
> +	ssize_t len, ret = 0;
> +
> +	if (!vtap)
> +		return -EBADFD;
> +
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	add_wait_queue(&vtap->wait, &wait);
> +	while (len) {
> +		current->state = TASK_INTERRUPTIBLE;
> +
> +		/* Read frames from the queue */
> +		if (!(skb=skb_dequeue(&vtap->readq))) {
> +			if (file->f_flags & O_NONBLOCK) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +			/* Nothing to read, let's sleep */
> +			schedule();
> +			continue;
> +		}
> +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);

Don't cast away the constness.
Instead, fix macvtap_put_user to use skb_copy_datagram_const_iovec, which
does not modify the iovec.

> +		kfree_skb(skb);
> +		break;
> +	}
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(&vtap->wait, &wait);
> +
> +out:
> +	return ret;
> +}
> +
> +struct file_operations macvtap_fops = {
> +	.owner = THIS_MODULE,
> +	.open = macvtap_open,
> +	.release = macvtap_release,
> +	.aio_read = macvtap_aio_read,
> +	.aio_write = macvtap_aio_write,
> +	.llseek = no_llseek,
> +};
> +
> +static int macvtap_newlink(struct net_device *dev,
> +			   struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	int err;
> +
> +	err = macvlan_newlink(dev, tb, data);
> +	if (err)
> +		goto out1;
> +
> +	cdev_init(&vtap->cdev, &macvtap_fops);
> +	vtap->cdev.owner = THIS_MODULE;
> +	err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
> +
> +	if (err)
> +		goto out2;
> +
> +	/*
> +	 * TODO: add class dev so device node gets created automatically
> +	 * by udev.
> +	 */
> +	pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
> +		__func__, __LINE__, MAJOR(macvtap_major),
> +		dev->ifindex, dev->name);
> +
> +	skb_queue_head_init(&vtap->readq);
> +	init_waitqueue_head(&vtap->wait);
> +	vtap->m.receive = macvtap_receive;
> +
> +	return 0;
> +
> +out2:
> +	macvlan_dellink(dev);
> +out1:
> +	return err;
> +}
> +
> +static void macvtap_dellink(struct net_device *dev)
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	cdev_del(&vtap->cdev);
> +	/* TODO: kill open file descriptors */
> +	macvlan_dellink(dev);
> +}
> +
> +static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
> +	.kind = "macvtap",
> +	.priv_size = sizeof(struct macvtap_dev),
> +	.setup = macvlan_setup,
> +	.validate = macvlan_validate,
> +	.newlink = macvtap_newlink,
> +	.dellink = macvtap_dellink,
> +};
> +
> +static int macvtap_init(void)
> +{
> +	int err;
> +
> +	err = alloc_chrdev_region(&macvtap_major, 0,
> +				  MACVTAP_NUM_DEVS, "macvtap");
> +	if (err)
> +		goto out1;
> +
> +	err = rtnl_link_register(&macvtap_link_ops);
> +	if (err)
> +		goto out2;
> +
> +	return 0;
> +
> +out2:
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +out1:
> +	return err;
> +}
> +module_init(macvtap_init);
> +
> +static void macvtap_exit(void)
> +{
> +	rtnl_link_unregister(&macvtap_link_ops);
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +}
> +module_exit(macvtap_exit);
> +
> +MODULE_ALIAS_RTNL_LINK("macvtap");
> +MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>");
> +MODULE_LICENSE("GPL");
> --
> 1.6.0.4
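
To illustrate the remark about not casting away constness: the idea behind a "const iovec" copy is to keep the current position in local variables instead of advancing iov_base/iov_len inside the caller's array. The following is only a userspace sketch of that idea, not the kernel's skb_copy_datagram_const_iovec; the function name copy_to_const_iovec and its signature are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical userspace helper: copy up to len bytes from src into the
 * buffers described by iov[0..iovcnt-1]. The iovec array is const and is
 * never modified; progress is tracked in the local variable "copied".
 * Returns the number of bytes actually copied.
 */
size_t copy_to_const_iovec(const struct iovec *iov, size_t iovcnt,
			   const void *src, size_t len)
{
	const unsigned char *from = src;
	size_t copied = 0;
	size_t i;

	assert(iov != NULL || iovcnt == 0);

	for (i = 0; i < iovcnt && copied < len; i++) {
		size_t chunk = iov[i].iov_len;

		/* Never copy more than what is left of the source. */
		if (chunk > len - copied)
			chunk = len - copied;
		if (chunk == 0)
			continue;

		memcpy(iov[i].iov_base, from + copied, chunk);
		copied += chunk;
	}

	return copied;
}
```

Because the array is only read, the caller can pass the const pointer it received from the aio entry point straight through, and no cast is needed.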
On Sunday 09 August 2009 08:02:16 Michael S. Tsirkin wrote:
> On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> > * The same framework in macvlan can be used to add a third backend
> >   into a future kernel based virtio-net implementation.
>
> Could you split the patches up, to make this last easier?
> patch 1 - export framework
> patch 2 - code using it

Sure, will do.

> > +/* Get packet from user space buffer */
> > +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> > +				const struct iovec *iv, size_t count,
> > +				int noblock)
> > +{
> > +	struct sk_buff *skb;
> > +	size_t len = count;
> > +
> > +	if (unlikely(len < ETH_HLEN))
> > +		return -EINVAL;
> > +
> > +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> > +
> > +	if (!skb) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		return -ENOMEM;
> > +	}
> > +
> > +	skb_reserve(skb, NET_IP_ALIGN);
> > +	skb_put(skb, count);
> > +
> > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		kfree_skb(skb);
> > +		return -EFAULT;
> > +	}
> > +
> > +	skb_set_network_header(skb, ETH_HLEN);
> > +	skb->dev = vtap->m.lowerdev;
> > +
> > +	macvlan_start_xmit(skb, vtap->m.dev);
> > +
> > +	return count;
> > +}
>
> With tap, we discovered that not limiting the number of outstanding
> skbs hurts UDP performance. And the solution was to limit
> the number of outstanding packets - with hacks to work around
> the fact that userspace .

Something seems to be missing in your last sentence here. My driver
OTOH is also missing any sort of flow control in both RX and TX
direction ;)

For RX, there should probably just be a limit of frames that get
buffered in the ring.

For TX, I guess there should be a way to let the packet scheduler
handle this and give us a chance to block and unblock at the right
time. I haven't found out yet how to do that. Would it be enough to
check the dev_queue_xmit() return code for NETDEV_TX_BUSY?
How would I get notified when it gets free again?

> > +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> > +
> > +	vtap->m.dev->stats.rx_packets++;
> > +	vtap->m.dev->stats.rx_bytes += len;
>
> where does atomicity guarantee for these counters come from?

AFAIK, we never guarantee that for any driver. They are statistics only
and need not be 100% correct, so the networking stack goes for lower
overhead and 99.9% correct.

> > +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> > +				unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct macvtap_dev *vtap = file->private_data;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	struct sk_buff *skb;
> > +	ssize_t len, ret = 0;
> > +
> > +	if (!vtap)
> > +		return -EBADFD;
> > +
> > +	len = iov_length(iv, count);
> > +	if (len < 0) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	add_wait_queue(&vtap->wait, &wait);
> > +	while (len) {
> > +		current->state = TASK_INTERRUPTIBLE;
> > +
> > +		/* Read frames from the queue */
> > +		if (!(skb=skb_dequeue(&vtap->readq))) {
> > +			if (file->f_flags & O_NONBLOCK) {
> > +				ret = -EAGAIN;
> > +				break;
> > +			}
> > +			if (signal_pending(current)) {
> > +				ret = -ERESTARTSYS;
> > +				break;
> > +			}
> > +			/* Nothing to read, let's sleep */
> > +			schedule();
> > +			continue;
> > +		}
> > +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
>
> Don't cast away the constness. Instead, fix macvtap_put_user
> to use skb_copy_datagram_const_iovec which does not modify the iovec.

Ah, good catch. I had copied that from the tun driver before you fixed
it there and failed to fix it the right way when I adapted it for the
new interface.

Thanks for the review,

	Arnd <><
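
As a rough illustration of the RX side of the flow-control discussion above ("a limit of frames that get buffered"), here is a small userspace sketch of a bounded FIFO that rejects and counts frames once it is full. Everything in it is hypothetical: RXQ_LIMIT, struct rx_queue and the rxq_* functions are invented for the example. In the driver the equivalent check would presumably sit in macvtap_receive() before the skb is queued, but that placement is an assumption, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>

#define RXQ_LIMIT 4	/* hypothetical cap, chosen only for this example */

struct rx_queue {
	void *frames[RXQ_LIMIT];	/* ring of queued frame pointers */
	size_t head;			/* index of the oldest frame */
	size_t len;			/* number of frames currently queued */
	unsigned long dropped;		/* frames rejected because the queue was full */
};

void rxq_init(struct rx_queue *q)
{
	assert(q != NULL);
	q->head = 0;
	q->len = 0;
	q->dropped = 0;
}

/*
 * Queue a frame. Returns 0 on success, or -1 if the limit was reached;
 * in that case the drop counter is bumped and the caller still owns
 * the frame.
 */
int rxq_enqueue(struct rx_queue *q, void *frame)
{
	assert(q != NULL);

	if (q->len == RXQ_LIMIT) {
		q->dropped++;
		return -1;
	}

	q->frames[(q->head + q->len) % RXQ_LIMIT] = frame;
	q->len++;
	return 0;
}

/* Remove and return the oldest frame, or NULL if the queue is empty. */
void *rxq_dequeue(struct rx_queue *q)
{
	void *frame;

	assert(q != NULL);

	if (q->len == 0)
		return NULL;

	frame = q->frames[q->head];
	q->head = (q->head + 1) % RXQ_LIMIT;
	q->len--;
	return frame;
}
```

Dropping at the tail keeps the producer from ever blocking, which matters if the producer runs in the receive path; the reader simply sees a gap and the drop counter records it.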
Arnd Bergmann wrote:
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..d99bfc0
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	int ifindex = iminor(inode);
> +	struct net_device *dev = dev_get_by_index(net, ifindex);
> +	int err;
> +
> +	err = -ENODEV;
> +	if (!dev)
> +		goto out1;
> +
> +	file->private_data = netdev_priv(dev);
> +	err = 0;
> +out1:
> +	return err;
> +}

macvlan will remove all macvlan/vtap devices when the underlying device
is unregistered, at which time you need to release the device references
you're holding. I'd suggest changing the macvlan_device_event() handler
to use

	vlan->dev->rtnl_link_ops->dellink(vlan->dev)

instead of macvlan_dellink() so the macvtap_dellink callback is invoked.
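
The point of going through rtnl_link_ops->dellink rather than calling macvlan_dellink() directly can be shown with a small userspace model. This is only a sketch of the dispatch pattern; struct link_ops, struct device and all function names below are invented stand-ins, not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

struct device;

/* Stand-in for a per-driver ops table (hypothetical, not rtnl_link_ops). */
struct link_ops {
	const char *kind;
	void (*dellink)(struct device *dev);
};

struct device {
	const struct link_ops *ops;
	int base_torn_down;	/* set by the base driver's teardown */
	int extra_torn_down;	/* set only by the derived driver's teardown */
};

/* Base teardown, playing the role of the plain macvlan-style dellink. */
void base_dellink(struct device *dev)
{
	dev->base_torn_down = 1;
}

/* Derived teardown: release its own resources first, then chain to the base. */
void derived_dellink(struct device *dev)
{
	dev->extra_torn_down = 1;
	base_dellink(dev);
}

const struct link_ops base_ops = { "base", base_dellink };
const struct link_ops derived_ops = { "derived", derived_dellink };

/*
 * Event handler for "lower device went away". Dispatching through the
 * ops table lets each device run the teardown of the driver that
 * created it, instead of hard-coding the base one.
 */
void lower_device_unregistered(struct device **devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++)
		devs[i]->ops->dellink(devs[i]);
}
```

Had the handler called base_dellink() directly, the derived device would be unlinked without its extra cleanup ever running, which is the leak described above.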