thr3ads.net - Linux Ethernet Bridging - [Bridge] [PATCH] macvlan: add tap device backend [Aug 2009]

Arnd Bergmann

2009-Aug-06 21:50 UTC

[Bridge] [PATCH] macvlan: add tap device backend

This is a first prototype of a new interface into the network
stack, to eventually replace tun/tap and the bridge driver
in certain virtual machine setups.

Background
----------
The 'Edge Virtual Bridging' working group is discussing ways to overcome
the limitation of virtual bridges in hypervisors.  One important part
of this is the Virtual Ethernet Port Aggregator (VEPA), as described in
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf

In short, the idea of VEPA is that virtual machines do not communicate
with each other through direct bridging in the hypervisor but only via
an external managed switch that is already well integrated into the data
center, including network filtering, accounting and monitoring. While
we can do most of that efficiently in the Linux bridge code, doing it
externally simplifies the overall setup.

Related work
------------
Patches to implement VEPA in the Linux bridge driver have been posted by
Anna Fischer in June, see http://patchwork.ozlabs.org/patch/28702/. Those
patches are good and hopefully get merged in 2.6.32, but I think we can
take some shortcuts with an alternative approach:

The macvlan driver already has the property of forwarding all traffic
between guests and an external interface but not between the guests, just
as VEPA needs it. Also, VEPA does explicitly not want or need advanced
filtering in the way that netfilter-bridge provides, so we can use macvlan
to replace the bridge code in this setup, reducing the code path through
the kernel.  This works fine with containers and network namespaces,
but not easily with kvm/qemu because we only have a network device.
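
To illustrate the forwarding property described above, here is a minimal
user-space sketch in C. It is not taken from macvlan.c: the names
guest_port, forward_port, PORT_EXTERNAL and PORT_NONE are hypothetical,
and broadcast/multicast handling is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN      6
#define PORT_EXTERNAL (-1)	/* hypothetical marker: the physical uplink */
#define PORT_NONE     (-2)	/* hypothetical marker: no guest owns the address */

/* One entry per guest-facing virtual device (illustrative only). */
struct guest_port {
	uint8_t mac[ETH_ALEN];
};

/*
 * Decide where a unicast frame goes. src_port is the index of the
 * sending guest, or PORT_EXTERNAL if the frame arrived from the wire.
 */
int forward_port(const struct guest_port *guests, size_t n,
		 int src_port, const uint8_t dst[ETH_ALEN])
{
	size_t i;

	/*
	 * Anything a guest sends leaves through the uplink, even if the
	 * destination MAC belongs to another guest on the same host.
	 */
	if (src_port != PORT_EXTERNAL)
		return PORT_EXTERNAL;

	/* Frames from the wire are matched against the guest addresses. */
	for (i = 0; i < n; i++)
		if (memcmp(guests[i].mac, dst, ETH_ALEN) == 0)
			return (int)i;

	return PORT_NONE;
}
```

The point of the sketch is the first branch: there is no lookup at all on
the transmit side, which is what makes guest-to-guest traffic impossible
without going through the external interface.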

Or Gerlitz posted a "raw" packet socket backend for qemu to deal with
this, at http://marc.info/?l=qemu-devel&m=124653801212767 and at least
three other people have implemented similar functionality independently.

This driver
-----------
While the other approaches should work as well, doing it using a tap
interface should give additional benefits:

* We can keep using the optimizations for jumbo frames that we have put
into the tun/tap driver.

* No need for root permissions that packet sockets need, just use 'ip
link add link type macvtap' to create a new device and give it the right
permissions using udev (using one tap per macvlan netdev).

* support for multiqueue network adapters by opening the tap device
multiple times, using one file descriptor per guest CPU/network
queue/interrupt (if the adapter supports multiple queues on a single
MAC address).

* support for zero-copy receive/transmit using async I/O on the tap device
(if the adapter supports per MAC rx queues).

* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.

This version of the driver does not support any of those features,
but they all appear possible to add ;).
The driver is currently called 'macvtap', but I'd be more than happy
to change that if anyone could suggest a better name. The code is
still in an early stage and I wish I had found more time to polish
it, but at this time, I'd first like to know if people agree with the
basic concept at all.
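
The backend idea can be pictured with a small user-space model in C. This
is only a sketch under simplifying assumptions, not the kernel code: the
types vdev and frame and the functions stack_receive, tap_receive,
tap_dequeue and deliver are hypothetical stand-ins, and unlike the
'receive' member in the patch, the hook here takes the device as an
explicit argument.

```c
#include <stddef.h>

/* Stand-in for an sk_buff: just enough to link frames into a FIFO. */
struct frame {
	struct frame *next;
	size_t len;
};

/* Stand-in for a macvlan-like device with a pluggable receive hook. */
struct vdev {
	int (*receive)(struct vdev *dev, struct frame *f);
	struct frame *head, *tail;	/* FIFO standing in for a read queue */
	unsigned long stack_rx;		/* frames handed to the "stack" */
};

/* Default backend: stands in for handing the frame to the network stack. */
int stack_receive(struct vdev *dev, struct frame *f)
{
	(void)f;
	dev->stack_rx++;
	return 0;
}

/* Tap backend: queue the frame until a reader picks it up. */
int tap_receive(struct vdev *dev, struct frame *f)
{
	f->next = NULL;
	if (dev->tail)
		dev->tail->next = f;
	else
		dev->head = f;
	dev->tail = f;
	return 0;
}

/* What a read() on the tap side would do: take the oldest frame. */
struct frame *tap_dequeue(struct vdev *dev)
{
	struct frame *f = dev->head;

	if (!f)
		return NULL;
	dev->head = f->next;
	if (!dev->head)
		dev->tail = NULL;
	f->next = NULL;
	return f;
}

/* Common receive path: it does not know which backend is attached. */
int deliver(struct vdev *dev, struct frame *f)
{
	return dev->receive(dev, f);
}
```

A further backend would only need to supply another function with the same
signature; the common receive path stays unchanged.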

Cc: Patrick McHardy <kaber at trash.net>
Cc: Stephen Hemminger <shemminger at linux-foundation.org>
Cc: "David S. Miller" <davem at davemloft.net>
Cc: "Michael S. Tsirkin" <mst at redhat.com>
Cc: Herbert Xu <herbert at gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz at voltaire.com>
Cc: "Fischer, Anna" <anna.fischer at hp.com>
Cc: netdev at vger.kernel.org
Cc: bridge at lists.linux-foundation.org
Cc: linux-kernel at vger.kernel.org
Cc: Edge Virtual Bridging <evb at yahoogroups.com>
Signed-off-by: Arnd Bergmann <arnd at arndb.de>

---

The evb mailing list eats Cc headers, please make sure to keep everybody
in your Cc list when replying there.
---
 drivers/net/Kconfig   |   12 ++
 drivers/net/Makefile  |    1 +
 drivers/net/macvlan.c |   39 +++-----
 drivers/net/macvlan.h |   37 +++++++
 drivers/net/macvtap.c |  276 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 341 insertions(+), 24 deletions(-)
 create mode 100644 drivers/net/macvlan.h
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..0b9ac6a 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvtap', and then be accessed through the tap user space interface.
+	
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..8a2d2d7 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99eed9f..9f7dc6a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -30,22 +30,7 @@
 #include <linux/if_macvlan.h>
 #include <net/rtnetlink.h>
 
-#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
-
-struct macvlan_port {
-	struct net_device	*dev;
-	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
-	struct list_head	vlans;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-};
-
+#include "macvlan.h"
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
@@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 			else
 				nskb->pkt_type = PACKET_MULTICAST;
 
-			netif_rx(nskb);
+			vlan->receive(nskb);
 		}
 	}
 }
@@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_rx(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
-static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
+int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	unsigned int len = skb->len;
@@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 	return NETDEV_TX_OK;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 };
 
-static void macvlan_setup(struct net_device *dev)
+void macvlan_setup(struct net_device *dev)
 {
 	ether_setup(dev);
 
@@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev)
 	dev->ethtool_ops	= &macvlan_ethtool_ops;
 	dev->tx_queue_len	= 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_setup);
 
 static int macvlan_port_create(struct net_device *dev)
 {
@@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev)
 	}
 }
 
-static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
+int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 {
 	if (tb[IFLA_ADDRESS]) {
 		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
@@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_validate);
 
-static int macvlan_newlink(struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_newlink(struct net_device *dev,
+		    struct nlattr *tb[], struct nlattr *data[])
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = netif_rx;
 
 	err = register_netdevice(dev);
 	if (err < 0)
@@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev,
 	macvlan_transfer_operstate(dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_newlink);
 
-static void macvlan_dellink(struct net_device *dev)
+void macvlan_dellink(struct net_device *dev)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
 	.kind		= "macvlan",
diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
new file mode 100644
index 0000000..3f3c6c3
--- /dev/null
+++ b/drivers/net/macvlan.h
@@ -0,0 +1,37 @@
+#ifndef _MACVLAN_H
+#define _MACVLAN_H
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/list.h>
+
+#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
+
+struct macvlan_port {
+	struct net_device	*dev;
+	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
+	struct list_head	vlans;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+
+	int (*receive)(struct sk_buff *skb);
+};
+
+extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
+
+extern void macvlan_setup(struct net_device *dev);
+
+extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
+
+extern int macvlan_newlink(struct net_device *dev,
+		struct nlattr *tb[], struct nlattr *data[]);
+
+extern void macvlan_dellink(struct net_device *dev);
+
+#endif /* _MACVLAN_H */
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..d99bfc0
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,276 @@
+#include <linux/etherdevice.h>
+#include <linux/nsproxy.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+
+#include "macvlan.h"
+
+struct macvtap_dev {
+	struct macvlan_dev m;
+	struct cdev cdev;
+	struct sk_buff_head readq;
+	wait_queue_head_t wait;
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a large value
+ */
+static int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+
+static int macvtap_receive(struct sk_buff *skb)
+{
+	struct macvtap_dev *vtap = netdev_priv(skb->dev);
+
+	skb_queue_tail(&vtap->readq, skb);
+	wake_up(&vtap->wait);
+	return 0;
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ifindex = iminor(inode);
+	struct net_device *dev = dev_get_by_index(net, ifindex);
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out1;
+
+	file->private_data = netdev_priv(dev);
+	err = 0;
+out1:
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	struct macvtap_dev *vtap = file->private_data;
+
+	if (!vtap)
+		return 0;
+
+	dev_put(vtap->m.dev);
+	return 0;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+			       const struct iovec *iv, size_t count,
+			       int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+	if (!skb) {
+		vtap->m.dev->stats.rx_dropped++;
+		return -ENOMEM;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		vtap->m.dev->stats.rx_dropped++;
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+	skb->dev = vtap->m.lowerdev;
+
+	macvlan_start_xmit(skb, vtap->m.dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+			      unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result;
+	struct macvtap_dev *vtap = file->private_data;
+
+	result = macvtap_get_user(vtap, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
+				       struct sk_buff *skb,
+				       struct iovec *iv, int len)
+{
+	int ret;
+
+	skb_push(skb, ETH_HLEN);
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+	vtap->m.dev->stats.rx_packets++;
+	vtap->m.dev->stats.rx_bytes += len;
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+			    unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_dev *vtap = file->private_data;
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!vtap)
+		return -EBADFD;
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(&vtap->wait, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		if (!(skb=skb_dequeue(&vtap->readq))) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(&vtap->wait, &wait);
+
+out:
+	return ret;
+}
+
+struct file_operations macvtap_fops = {
+	.owner = THIS_MODULE,
+	.open = macvtap_open,
+	.release = macvtap_release,
+	.aio_read = macvtap_aio_read,
+	.aio_write = macvtap_aio_write,
+	.llseek = no_llseek,
+};
+
+static int macvtap_newlink(struct net_device *dev,
+	struct nlattr *tb[], struct nlattr *data[])
+{
+	struct macvtap_dev *vtap = netdev_priv(dev);
+	int err;
+
+	err = macvlan_newlink(dev, tb, data);
+	if (err)
+		goto out1;
+
+	cdev_init(&vtap->cdev, &macvtap_fops);
+	vtap->cdev.owner = THIS_MODULE;
+	err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
+
+	if (err)
+		goto out2;
+
+	/*
+	 * TODO: add class dev so device node gets created automatically
+	 * by udev.
+	 */
+	pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
+		__func__, __LINE__, MAJOR(macvtap_major),
+		dev->ifindex, dev->name);
+
+	skb_queue_head_init(&vtap->readq);
+	init_waitqueue_head(&vtap->wait);
+	vtap->m.receive = macvtap_receive;
+
+	return 0;
+
+out2:
+	macvlan_dellink(dev);
+out1:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev)
+{
+	struct macvtap_dev *vtap = netdev_priv(dev);
+	cdev_del(&vtap->cdev);
+	/* TODO: kill open file descriptors */
+	macvlan_dellink(dev);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind = "macvtap",
+	.priv_size = sizeof(struct macvtap_dev),
+	.setup = macvlan_setup,
+	.validate = macvlan_validate,
+	.newlink = macvtap_newlink,
+	.dellink = macvtap_dellink,
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	err = rtnl_link_register(&macvtap_link_ops);
+	if (err)
+		goto out2;
+
+	return 0;
+
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>");
+MODULE_LICENSE("GPL");
-- 
1.6.0.4

David Miller

2009-Aug-07 03:20 UTC

[Bridge] [PATCH] macvlan: add tap device backend

From: Arnd Bergmann <arnd at arndb.de>
Date: Thu,  6 Aug 2009 21:50:28 +0000
> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I don't know enough to say how good a solution this is for
the problem, but I certainly like this driver for its
utter simplicity and minimalism.

Daniel Robbins

2009-Aug-07 17:35 UTC

[Bridge] [PATCH] macvlan: add tap device backend

On Thu, Aug 6, 2009 at 3:50 PM, Arnd Bergmann <arnd at arndb.de> wrote:
> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I have some general questions about the intended use and benefits of
VEPA, from an IT perspective:

In which virtual machine setups and technologies do you foresee this
interface being used?

Is this new interface to be used within a virtual machine or
container, on the master node, or both?

What interface(s) would need to be configured for a single virtual
machine to use VEPA to access the network?

What are the current flexibility, security or performance limitations
of tun/tap and bridge that make this new interface necessary or
beneficial?

Is this new interface useful at all for VPN solutions or is it
*specifically* targeted for connecting virtual machines to the
network?

Is this essentially a bridge with layer-2 isolation for the virtual
machine interfaces built-in? If isolation is provided, what mechanism
is used to accomplish this, and how secure is it?

Does VEPA look like a regular ethernet interface (eth0) on the virtual
machine side?

Are there any associated user-space tools required for configuring a VEPA?

Do you have any HOWTO-style documentation that would demonstrate how
this interface would be used in production? Or a FAQ?

This seems like a very interesting effort but I don't quite have a
good grasp of VEPA's benefits and limitations -- I imagine that others
are in the same boat too.

Best Regards,

Daniel

Paul Congdon (UC Davis)

2009-Aug-07 19:10 UTC

[Bridge] [PATCH] macvlan: add tap device backend

Responding to Daniel's questions...

> I have some general questions about the intended use and benefits of
> VEPA, from an IT perspective:
>
> In which virtual machine setups and technologies do you foresee this
> interface being used?

The benefit of VEPA is the coordination and unification with the external
network switch.  So, in environments where you need or want your
feature-rich, wire-speed, external network device
(firewall/switch/IPS/content-filter) to provide consistent policy
enforcement, and you want your VMs' traffic to be subject to that
enforcement, you will want their traffic directed externally.  Perhaps you
have some VMs that are on a DMZ, or clustering an application, or
implementing a multi-tier application where you would normally place a
firewall in between the tiers.

> Is this new interface to be used within a virtual machine or
> container, on the master node, or both?

It is really an interface to a new type of virtual switch.  When you create
a virtual network, I would imagine it being a new mode of operation (bridge,
NAT, VEPA, etc.).

> What interface(s) would need to be configured for a single virtual
> machine to use VEPA to access the network?

It would be the same as if that machine were configured to use a bridge to
access the network, but the bridge mode would be different.

> What are the current flexibility, security or performance limitations
> of tun/tap and bridge that make this new interface necessary or
> beneficial?

If you have VMs that will be communicating with one another on the same
physical machine, and you want their traffic to be exposed to an in-line
network device such as an application firewall/IPS/content-filter, then
without this feature you will have to have this device co-located within
the same physical server.  This will use up CPU cycles that you presumably
purchased to run applications, it will require a lot of consistent
configuration on all physical machines, and it could potentially incur a
lot of software licensing, additional cost, etc.  Everything would need to
be replicated on each physical machine.  With the VEPA capability, you can
leverage all this functionality in an external network device and have it
managed and configured in one place.  The external implementation is likely
a higher-performance, silicon-based implementation.  It should make it
easier to migrate machines from one physical server to another and maintain
the same network policy enforcement.

> Is this new interface useful at all for VPN solutions or is it
> *specifically* targeted for connecting virtual machines to the
> network?

I'm not sure I see the benefit for VPN solutions, but I'd have to
understand the deployment scenario better.  Certainly this is targeting
connecting VMs to the adjacent physical LAN.

> Is this essentially a bridge with layer-2 isolation for the virtual
> machine interfaces built-in? If isolation is provided, what mechanism
> is used to accomplish this, and how secure is it?

That might be an oversimplification, but you can achieve layer-2 isolation
if you connect to a standard external switch.  If that switch has 'hairpin'
forwarding, then the VMs can talk at L2, but their traffic is forced through
the bridge.  If that bridge is a security device (e.g. a firewall), then
their traffic is exposed to that.

The isolation in the outbound direction is created by the way frames are
forwarded.  They are simply placed on the wire, so no VMs can talk directly to
one another without their traffic first going external.  In the inbound
direction, the isolation is created using the forwarding table.
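
As a rough illustration of the hairpin point, the external switch's decision
can be modelled in a few lines of C. This is a sketch under simplifying
assumptions (known unicast destinations only, one switch port per physical
host), and the names switch_egress, same_host_vms_can_talk and NO_PORT are
hypothetical.

```c
#include <stdbool.h>

#define NO_PORT (-1)	/* hypothetical marker: frame is filtered */

/*
 * External switch: a standard bridge never sends a frame back out of the
 * port it arrived on. Hairpin forwarding lifts that restriction.
 */
int switch_egress(int ingress_port, int dst_port, bool hairpin)
{
	if (dst_port == ingress_port && !hairpin)
		return NO_PORT;
	return dst_port;
}

/*
 * Two VMs on the same host sit behind the same switch port. The host
 * side sends every VM frame out on the wire, so whether they can talk
 * depends only on the external switch.
 */
bool same_host_vms_can_talk(int host_port, bool hairpin)
{
	return switch_egress(host_port, host_port, hairpin) == host_port;
}
```

With hairpin off, same_host_vms_can_talk() is false and the VMs are isolated
at layer 2; with hairpin on, the frame comes back down the same link after
the switch has had a chance to apply its policy.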

> Does VEPA look like a regular ethernet interface (eth0) on the virtual
> machine side?

Yes.

> Are there any associated user-space tools required for configuring a
> VEPA?

The standard brctl utility has been augmented to enable/disable the capability.

> Do you have any HOWTO-style documentation that would demonstrate how
> this interface would be used in production? Or a FAQ?

None yet.

> This seems like a very interesting effort but I don't quite have a
> good grasp of VEPA's benefits and limitations -- I imagine that others
> are in the same boat too.

There are some seminar slides available on the IEEE 802.1 web site and
elsewhere.  The patch had a reference to a seminar, but here is another one
you might find helpful:

http://www.internet2.edu/presentations/jt2009jul/20090719-congdon.pdf

I'm happy to try to explain further...

Paul

Michael S. Tsirkin

2009-Aug-09 08:02 UTC

[Bridge] [PATCH] macvlan: add tap device backend

On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> This driver
> -----------
> While the other approaches should work as well, doing it using a tap
> interface should give additional benefits:
> 
> * We can keep using the optimizations for jumbo frames that we have put
> into the tun/tap driver.
> 
> * No need for root permissions that packet sockets need, just use 'ip
> link add link type macvtap' to create a new device and give it the right
> permissions using udev (using one tap per macvlan netdev).
> 
> * support for multiqueue network adapters by opening the tap device
> multiple times, using one file descriptor per guest CPU/network
> queue/interrupt (if the adapter supports multiple queues on a single
> MAC address).
> 
> * support for zero-copy receive/transmit using async I/O on the tap device
> (if the adapter supports per MAC rx queues).
> 
> * The same framework in macvlan can be used to add a third backend
> into a future kernel based virtio-net implementation.

Could you split the patches up, to make this last item easier?
patch 1 - export framework
patch 2 - code using it

> This version of the driver does not support any of those features,
> but they all appear possible to add ;).
> The driver is currently called 'macvtap', but I'd be more than happy
> to change that if anyone could suggest a better name. The code is
> still in an early stage and I wish I had found more time to polish
> it, but at this time, I'd first like to know if people agree with the
> basic concept at all.
> 
> Cc: Patrick McHardy <kaber at trash.net>
> Cc: Stephen Hemminger <shemminger at linux-foundation.org>
> Cc: "David S. Miller" <davem at davemloft.net>
> Cc: "Michael S. Tsirkin" <mst at redhat.com>
> Cc: Herbert Xu <herbert at gondor.apana.org.au>
> Cc: Or Gerlitz <ogerlitz at voltaire.com>
> Cc: "Fischer, Anna" <anna.fischer at hp.com>
> Cc: netdev at vger.kernel.org
> Cc: bridge at lists.linux-foundation.org
> Cc: linux-kernel at vger.kernel.org
> Cc: Edge Virtual Bridging <evb at yahoogroups.com>
> Signed-off-by: Arnd Bergmann <arnd at arndb.de>
> 
> ---
> 
> The evb mailing list eats Cc headers, please make sure to keep everybody
> in your Cc list when replying there.
> ---
>  drivers/net/Kconfig   |   12 ++
>  drivers/net/Makefile  |    1 +
>  drivers/net/macvlan.c |   39 +++-----
>  drivers/net/macvlan.h |   37 +++++++
>  drivers/net/macvtap.c |  276 +++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 341 insertions(+), 24 deletions(-)
>  create mode 100644 drivers/net/macvlan.h
>  create mode 100644 drivers/net/macvtap.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 5f6509a..0b9ac6a 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -90,6 +90,18 @@ config MACVLAN
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called macvlan.
>  
> +config MACVTAP
> +	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
> +	depends on MACVLAN
> +	help
> +	  This adds a specialized tap character device driver that is based
> +	  on the MAC-VLAN network interface, called macvtap. A macvtap device
> +	  can be added in the same way as a macvlan device, using 'type
> +	  macvtap', and then be accessed through the tap user space interface.
> +	
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called macvtap.
> +
>  config EQUALIZER
>  	tristate "EQL (serial line load balancing) support"
>  	---help---
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index ead8cab..8a2d2d7 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
>  obj-$(CONFIG_DUMMY) += dummy.o
>  obj-$(CONFIG_IFB) += ifb.o
>  obj-$(CONFIG_MACVLAN) += macvlan.o
> +obj-$(CONFIG_MACVTAP) += macvtap.o
>  obj-$(CONFIG_DE600) += de600.o
>  obj-$(CONFIG_DE620) += de620.o
>  obj-$(CONFIG_LANCE) += lance.o
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 99eed9f..9f7dc6a 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -30,22 +30,7 @@
>  #include <linux/if_macvlan.h>
>  #include <net/rtnetlink.h>
>  
> -#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
> -
> -struct macvlan_port {
> -	struct net_device	*dev;
> -	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
> -	struct list_head	vlans;
> -};
> -
> -struct macvlan_dev {
> -	struct net_device	*dev;
> -	struct list_head	list;
> -	struct hlist_node	hlist;
> -	struct macvlan_port	*port;
> -	struct net_device	*lowerdev;
> -};
> -
> +#include "macvlan.h"
>  
>  static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
>  					       const unsigned char *addr)
> @@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
>  			else
>  				nskb->pkt_type = PACKET_MULTICAST;
>  
> -			netif_rx(nskb);
> +			vlan->receive(nskb);
>  		}
>  	}
>  }
> @@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
>  	skb->dev = dev;
>  	skb->pkt_type = PACKET_HOST;
>  
> -	netif_rx(skb);
> +	vlan->receive(skb);
>  	return NULL;
>  }
>  
> -static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	const struct macvlan_dev *vlan = netdev_priv(dev);
>  	unsigned int len = skb->len;
> @@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	return NETDEV_TX_OK;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_start_xmit);
>  
>  static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
>  			       unsigned short type, const void *daddr,
> @@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
>  	.ndo_validate_addr	= eth_validate_addr,
>  };
>  
> -static void macvlan_setup(struct net_device *dev)
> +void macvlan_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
>  
> @@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev)
>  	dev->ethtool_ops	= &macvlan_ethtool_ops;
>  	dev->tx_queue_len	= 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_setup);
>  
>  static int macvlan_port_create(struct net_device *dev)
>  {
> @@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev)
>  	}
>  }
>  
> -static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
> +int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
>  {
>  	if (tb[IFLA_ADDRESS]) {
>  		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> @@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
>  	}
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_validate);
>  
> -static int macvlan_newlink(struct net_device *dev,
> -			   struct nlattr *tb[], struct nlattr *data[])
> +int macvlan_newlink(struct net_device *dev,
> +		    struct nlattr *tb[], struct nlattr *data[])
>  {
>  	struct macvlan_dev *vlan = netdev_priv(dev);
>  	struct macvlan_port *port;
> @@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev,
>  	vlan->lowerdev = lowerdev;
>  	vlan->dev      = dev;
>  	vlan->port     = port;
> +	vlan->receive  = netif_rx;
>  
>  	err = register_netdevice(dev);
>  	if (err < 0)
> @@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev,
>  	macvlan_transfer_operstate(dev);
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_newlink);
>  
> -static void macvlan_dellink(struct net_device *dev)
> +void macvlan_dellink(struct net_device *dev)
>  {
>  	struct macvlan_dev *vlan = netdev_priv(dev);
>  	struct macvlan_port *port = vlan->port;
> @@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev)
>  	if (list_empty(&port->vlans))
>  		macvlan_port_destroy(port->dev);
>  }
> +EXPORT_SYMBOL_GPL(macvlan_dellink);
>  
>  static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
>  	.kind		= "macvlan",
> diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
> new file mode 100644
> index 0000000..3f3c6c3
> --- /dev/null
> +++ b/drivers/net/macvlan.h
> @@ -0,0 +1,37 @@
> +#ifndef _MACVLAN_H
> +#define _MACVLAN_H
> +
> +#include <linux/netdevice.h>
> +#include <linux/netlink.h>
> +#include <linux/list.h>
> +
> +#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
> +
> +struct macvlan_port {
> +	struct net_device	*dev;
> +	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
> +	struct list_head	vlans;
> +};
> +
> +struct macvlan_dev {
> +	struct net_device	*dev;
> +	struct list_head	list;
> +	struct hlist_node	hlist;
> +	struct macvlan_port	*port;
> +	struct net_device	*lowerdev;
> +
> +	int (*receive)(struct sk_buff *skb);
> +};
> +
> +extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
> +
> +extern void macvlan_setup(struct net_device *dev);
> +
> +extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
> +
> +extern int macvlan_newlink(struct net_device *dev,
> +		struct nlattr *tb[], struct nlattr *data[]);
> +
> +extern void macvlan_dellink(struct net_device *dev);
> +
> +#endif /* _MACVLAN_H */
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..d99bfc0
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> @@ -0,0 +1,276 @@
> +#include <linux/etherdevice.h>
> +#include <linux/nsproxy.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/cache.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/wait.h>
> +#include <linux/cdev.h>
> +#include <linux/fs.h>
> +
> +#include <net/net_namespace.h>
> +#include <net/rtnetlink.h>
> +
> +#include "macvlan.h"
> +
> +struct macvtap_dev {
> +	struct macvlan_dev m;
> +	struct cdev cdev;
> +	struct sk_buff_head readq;
> +	wait_queue_head_t wait;
> +};
> +
> +/*
> + * Minor number matches netdev->ifindex, so need a large value
> + */
> +static int macvtap_major;
> +#define MACVTAP_NUM_DEVS 65536
> +
> +static int macvtap_receive(struct sk_buff *skb)
> +{
> +	struct macvtap_dev *vtap = netdev_priv(skb->dev);
> +
> +	skb_queue_tail(&vtap->readq, skb);
> +	wake_up(&vtap->wait);
> +	return 0;
> +}
> +
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	int ifindex = iminor(inode);
> +	struct net_device *dev = dev_get_by_index(net, ifindex);
> +	int err;
> +
> +	err = -ENODEV;
> +	if (!dev)
> +		goto out1;
> +
> +	file->private_data = netdev_priv(dev);
> +	err = 0;
> +out1:
> +	return err;
> +}
> +
> +static int macvtap_release(struct inode *inode, struct file *file)
> +{
> +	struct macvtap_dev *vtap = file->private_data;
> +
> +	if (!vtap)
> +		return 0;
> +
> +	dev_put(vtap->m.dev);
> +	return 0;
> +}
> +
> +/* Get packet from user space buffer */
> +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> +			       const struct iovec *iv, size_t count,
> +			       int noblock)
> +{
> +	struct sk_buff *skb;
> +	size_t len = count;
> +
> +	if (unlikely(len < ETH_HLEN))
> +		return -EINVAL;
> +
> +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> +
> +	if (!skb) {
> +		vtap->m.dev->stats.rx_dropped++;
> +		return -ENOMEM;
> +	}
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +	skb_put(skb, count);
> +
> +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> +		vtap->m.dev->stats.rx_dropped++;
> +		kfree_skb(skb);
> +		return -EFAULT;
> +	}
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +	skb->dev = vtap->m.lowerdev;
> +
> +	macvlan_start_xmit(skb, vtap->m.dev);
> +
> +	return count;
> +}
With tap, we discovered that not limiting the number of outstanding
skbs hurts UDP performance. And the solution was to limit
the number of outstanding packets - with hacks to work around
the fact that userspace .


> +
> +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +			      unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	ssize_t result;
> +	struct macvtap_dev *vtap = file->private_data;
> +
> +	result = macvtap_get_user(vtap, iv, iov_length(iv, count),
> +			      file->f_flags & O_NONBLOCK);
> +
> +	return result;
> +}
> +
> +/* Put packet to the user space buffer */
> +static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
> +				       struct sk_buff *skb,
> +				       struct iovec *iv, int len)
> +{
> +	int ret;
> +
> +	skb_push(skb, ETH_HLEN);
> +	len = min_t(int, skb->len, len);
> +
> +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> +
> +	vtap->m.dev->stats.rx_packets++;
> +	vtap->m.dev->stats.rx_bytes += len;
where does atomicity guarantee for these counters come from?
> +
> +	return ret ? ret : len;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +			    unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_dev *vtap = file->private_data;
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb;
> +	ssize_t len, ret = 0;
> +
> +	if (!vtap)
> +		return -EBADFD;
> +
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	add_wait_queue(&vtap->wait, &wait);
> +	while (len) {
> +		current->state = TASK_INTERRUPTIBLE;
> +
> +		/* Read frames from the queue */
> +		if (!(skb=skb_dequeue(&vtap->readq))) {
> +			if (file->f_flags & O_NONBLOCK) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +			/* Nothing to read, let's sleep */
> +			schedule();
> +			continue;
> +		}
> +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);

Don't cast away the constness. Instead, fix macvtap_put_user
to use skb_copy_datagram_const_iovec, which does not modify the iovec.

> +		kfree_skb(skb);
> +		break;
> +	}
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(&vtap->wait, &wait);
> +
> +out:
> +	return ret;
> +}
> +
> +struct file_operations macvtap_fops = {
> +	.owner = THIS_MODULE,
> +	.open = macvtap_open,
> +	.release = macvtap_release,
> +	.aio_read = macvtap_aio_read,
> +	.aio_write = macvtap_aio_write,
> +	.llseek = no_llseek,
> +};
> +
> +static int macvtap_newlink(struct net_device *dev,
> +	struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	int err;
> +
> +	err = macvlan_newlink(dev, tb, data);
> +	if (err)
> +		goto out1;
> +
> +	cdev_init(&vtap->cdev, &macvtap_fops);
> +	vtap->cdev.owner = THIS_MODULE;
> +	err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
> +
> +	if (err)
> +		goto out2;
> +
> +	/*
> +	 * TODO: add class dev so device node gets created automatically
> +	 * by udev.
> +	 */
> +	pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
> +		__func__, __LINE__, MAJOR(macvtap_major),
> +		dev->ifindex, dev->name);
> +
> +	skb_queue_head_init(&vtap->readq);
> +	init_waitqueue_head(&vtap->wait);
> +	vtap->m.receive = macvtap_receive;
> +
> +	return 0;
> +
> +out2:
> +	macvlan_dellink(dev);
> +out1:
> +	return err;
> +}
> +
> +static void macvtap_dellink(struct net_device *dev)
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	cdev_del(&vtap->cdev);
> +	/* TODO: kill open file descriptors */
> +	macvlan_dellink(dev);
> +}
> +
> +static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
> +	.kind = "macvtap",
> +	.priv_size = sizeof(struct macvtap_dev),
> +	.setup = macvlan_setup,
> +	.validate = macvlan_validate,
> +	.newlink = macvtap_newlink,
> +	.dellink = macvtap_dellink,
> +};
> +
> +static int macvtap_init(void)
> +{
> +	int err;
> +
> +	err = alloc_chrdev_region(&macvtap_major, 0,
> +				MACVTAP_NUM_DEVS, "macvtap");
> +	if (err)
> +		goto out1;
> +
> +	err = rtnl_link_register(&macvtap_link_ops);
> +	if (err)
> +		goto out2;
> +
> +	return 0;
> +
> +out2:
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +out1:
> +	return err;
> +}
> +module_init(macvtap_init);
> +
> +static void macvtap_exit(void)
> +{
> +	rtnl_link_unregister(&macvtap_link_ops);
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +}
> +module_exit(macvtap_exit);
> +
> +MODULE_ALIAS_RTNL_LINK("macvtap");
> +MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>");
> +MODULE_LICENSE("GPL");
> -- 
> 1.6.0.4
> 

Arnd Bergmann

2009-Aug-09 20:42 UTC

[Bridge] [PATCH] macvlan: add tap device backend

On Sunday 09 August 2009 08:02:16 Michael S. Tsirkin wrote:
> On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> > * The same framework in macvlan can be used to add a third backend
> > into a future kernel based virtio-net implementation.
> 
> Could you split the patches up, to make this last easier?
> patch 1 - export framework
> patch 2 - code using it

Sure, will do.

> > +/* Get packet from user space buffer */
> > +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> > +			       const struct iovec *iv, size_t count,
> > +			       int noblock)
> > +{
> > +	struct sk_buff *skb;
> > +	size_t len = count;
> > +
> > +	if (unlikely(len < ETH_HLEN))
> > +		return -EINVAL;
> > +
> > +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> > +
> > +	if (!skb) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		return -ENOMEM;
> > +	}
> > +
> > +	skb_reserve(skb, NET_IP_ALIGN);
> > +	skb_put(skb, count);
> > +
> > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		kfree_skb(skb);
> > +		return -EFAULT;
> > +	}
> > +
> > +	skb_set_network_header(skb, ETH_HLEN);
> > +	skb->dev = vtap->m.lowerdev;
> > +
> > +	macvlan_start_xmit(skb, vtap->m.dev);
> > +
> > +	return count;
> > +}
> 
> With tap, we discovered that not limiting the number of outstanding
> skbs hurts UDP performance. And the solution was to limit
> the number of outstanding packets - with hacks to work around
> the fact that userspace .

Something seems to be missing in your last sentence here.

My driver OTOH is also missing any sort of flow control in both
RX and TX direction ;) For RX, there should probably just be
a limit of frames that get buffered in the ring.

For TX, I guess there should be a way to let the packet
scheduler handle this and give us a chance to block and
unblock at the right time. I haven't found out yet how to
do that.

Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?

How would I get notified when it gets free again?
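
The RX limit mentioned above can be sketched in user space as a bounded
queue that drops frames once a cap is reached. This is only an
illustration: struct rxq, RXQ_LIMIT, rxq_enqueue() and rxq_dequeue() are
hypothetical names and not kernel API. In the driver, the equivalent
check would presumably compare skb_queue_len(&vtap->readq) against some
limit before calling skb_queue_tail().

```c
#include <stddef.h>

#define RXQ_LIMIT 4	/* hypothetical cap on buffered frames */

struct rxq {
	const void *frames[RXQ_LIMIT];
	size_t head;		/* index of the oldest frame */
	size_t len;		/* number of frames currently queued */
	unsigned long dropped;	/* frames rejected because the queue was full */
};

/* Queue a frame, or drop it when the limit is reached. Returns 0 or -1. */
static int rxq_enqueue(struct rxq *q, const void *frame)
{
	if (q->len >= RXQ_LIMIT) {
		q->dropped++;
		return -1;
	}
	q->frames[(q->head + q->len) % RXQ_LIMIT] = frame;
	q->len++;
	return 0;
}

/* Take the oldest frame, or NULL when the queue is empty. */
static const void *rxq_dequeue(struct rxq *q)
{
	const void *frame;

	if (q->len == 0)
		return NULL;
	frame = q->frames[q->head];
	q->head = (q->head + 1) % RXQ_LIMIT;
	q->len--;
	return frame;
}
```

A reader draining the queue with rxq_dequeue() frees a slot, so the next
rxq_enqueue() succeeds again; the dropped counter plays the role that
stats.rx_dropped would play in the driver.
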
> > +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> > +
> > +	vtap->m.dev->stats.rx_packets++;
> > +	vtap->m.dev->stats.rx_bytes += len;
> 
> where does atomicity guarantee for these counters come from?

AFAIK, we never do for any driver. They are statistics only and
need not be 100% correct, so the networking stack goes for
lower overhead and 99.9% correct.

> > +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> > +			    unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct macvtap_dev *vtap = file->private_data;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	struct sk_buff *skb;
> > +	ssize_t len, ret = 0;
> > +
> > +	if (!vtap)
> > +		return -EBADFD;
> > +
> > +	len = iov_length(iv, count);
> > +	if (len < 0) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	add_wait_queue(&vtap->wait, &wait);
> > +	while (len) {
> > +		current->state = TASK_INTERRUPTIBLE;
> > +
> > +		/* Read frames from the queue */
> > +		if (!(skb=skb_dequeue(&vtap->readq))) {
> > +			if (file->f_flags & O_NONBLOCK) {
> > +				ret = -EAGAIN;
> > +				break;
> > +			}
> > +			if (signal_pending(current)) {
> > +				ret = -ERESTARTSYS;
> > +				break;
> > +			}
> > +			/* Nothing to read, let's sleep */
> > +			schedule();
> > +			continue;
> > +		}
> > +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
> 
> Don't cast away the constness. Instead, fix macvtap_put_user
> to use skb_copy_datagram_const_iovec, which does not modify the iovec.

Ah, good catch. I had copied that from the tun driver before you
fixed it there and failed to fix it the right way when I adapted
it for the new interface.
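
The difference between the two copy helpers can be illustrated in user
space. The sketch below fills the buffers described by an iovec array
while tracking an explicit byte offset, so the array itself is never
written and can stay const. copy_to_const_iovec() is a hypothetical
helper written for this illustration, not the kernel function; as far as
I recall, skb_copy_datagram_const_iovec() follows the same idea by taking
an extra offset argument instead of advancing iov_base and iov_len.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy len bytes from src into the buffers described by iv, starting
 * at byte position `offset` within the iovec array. The iovec entries
 * are only read, never updated. Returns the number of bytes copied,
 * which is less than len if the buffers are too small.
 */
static size_t copy_to_const_iovec(const struct iovec *iv, size_t iovcnt,
				  size_t offset, const void *src, size_t len)
{
	const char *from = src;
	size_t copied = 0;
	size_t i;

	for (i = 0; i < iovcnt && copied < len; i++) {
		size_t seg = iv[i].iov_len;
		size_t n;

		if (offset >= seg) {	/* segment lies before the offset */
			offset -= seg;
			continue;
		}
		n = seg - offset;
		if (n > len - copied)
			n = len - copied;
		memcpy((char *)iv[i].iov_base + offset, from + copied, n);
		copied += n;
		offset = 0;		/* later segments start at byte 0 */
	}
	return copied;
}
```

Because the caller's iovec is unchanged afterwards, the same array can be
reused for a second copy at a different offset, which is not possible
with a helper that consumes the iovec as it goes.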

Thanks for the review,

	Arnd <><

Patrick McHardy

2009-Aug-10 06:47 UTC

[Bridge] [PATCH] macvlan: add tap device backend

Arnd Bergmann wrote:
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..d99bfc0
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	int ifindex = iminor(inode);
> +	struct net_device *dev = dev_get_by_index(net, ifindex);
> +	int err;
> +
> +	err = -ENODEV;
> +	if (!dev)
> +		goto out1;
> +
> +	file->private_data = netdev_priv(dev);
> +	err = 0;
> +out1:
> +	return err;
> +}

macvlan will remove all macvlan/vtap devices when the underlying
device is unregistered, at which time you need to release the
device references you're holding. I'd suggest changing the
macvlan_device_event() handler to use

vlan->dev->rtnl_link_ops->dellink(vlan->dev)

instead of macvlan_dellink() so the macvtap_dellink callback
is invoked.
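
The effect of going through the ops table can be shown with a small
user-space model. All names below (net_device_sketch, link_ops_sketch,
base_dellink, tap_dellink, lower_device_unregistered) are hypothetical
stand-ins, not the kernel structures; the point is only that an indirect
call reaches the derived driver's teardown, while a direct call to the
base function would skip it.

```c
#include <stddef.h>

struct net_device_sketch;

/* Stand-in for the per-driver link operations table. */
struct link_ops_sketch {
	void (*dellink)(struct net_device_sketch *dev);
};

struct net_device_sketch {
	const struct link_ops_sketch *link_ops;
	int base_torn_down;	/* set by the base (macvlan-like) teardown */
	int cdev_removed;	/* set by the derived (macvtap-like) teardown */
};

static void base_dellink(struct net_device_sketch *dev)
{
	dev->base_torn_down = 1;
}

/* Derived teardown: release its own resources, then chain to the base. */
static void tap_dellink(struct net_device_sketch *dev)
{
	dev->cdev_removed = 1;
	base_dellink(dev);
}

static const struct link_ops_sketch base_ops = { .dellink = base_dellink };
static const struct link_ops_sketch tap_ops = { .dellink = tap_dellink };

/*
 * Event handler for "lower device went away": dispatch through each
 * device's ops table rather than calling base_dellink() directly.
 */
static void lower_device_unregistered(struct net_device_sketch *devs[],
				      size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		devs[i]->link_ops->dellink(devs[i]);
}
```

With one device using base_ops and one using tap_ops, the handler tears
down both, and only the tap-style device runs its extra cleanup.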
