Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 0/8] enable/disable zero copy tx dynamically
tun supports zero copy transmit since 0690899b4d4501b3505be069b9a687e68ccbe15b, however you can only enable this mode if you know your workload does not trigger heavy guest to host/host to guest traffic - otherwise you get a (minor) performance regression. This patchset addresses this problem by notifying the owner device when callback is invoked because of a data copy. This makes it possible to detect whether zero copy is appropriate dynamically: we start in zero copy mode, when we detect data copied we disable zero copy for a while. With this patch applied, I get the same performance for guest to host and guest to guest both with and without zero copy tx. Michael S. Tsirkin (8): skb: report completion status for zero copy skbs skb: api to report errors for zero copy skbs tun: report orphan frags errors to zero copy callback vhost-net: cleanup macros for DMA status tracking vhost: track zero copy failures using DMA length vhost: move -net specific code out vhost-net: select tx zero copy dynamically vhost-net: reduce vq polling on tx zerocopy drivers/net/tun.c | 1 + drivers/vhost/net.c | 109 +++++++++++++++++++++++++++++++++++++++++++--- drivers/vhost/tcm_vhost.c | 1 + drivers/vhost/vhost.c | 52 +++------------------- drivers/vhost/vhost.h | 11 ++--- include/linux/skbuff.h | 5 ++- net/core/skbuff.c | 23 +++++++++- 7 files changed, 141 insertions(+), 61 deletions(-) -- MST
Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 1/8] skb: report completion status for zero copy skbs
Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/vhost.c  | 2 +-
 drivers/vhost/vhost.h  | 2 +-
 include/linux/skbuff.h | 4 +++-
 net/core/skbuff.c      | 4 ++--
 4 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99ac2cb..92308b6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1600,7 +1600,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
 	kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(struct ubuf_info *ubuf)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, int zerocopy_status)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1125af3..eb7263c3 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -191,7 +191,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct
vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *);
+void vhost_zerocopy_callback(struct ubuf_info *, int);
 int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {                                  \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a2c34e..8bac11b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -235,11 +235,13 @@ enum {
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
+ * The zerocopy_status argument is 0 if zero copy transmit occurred,
+ * 1 on successful data copy; < 0 on out of memory error.
  * The ctx field is used to track device context.
  * The desc field is used to track userspace buffer index.
  */
 struct ubuf_info {
-	void (*callback)(struct ubuf_info *);
+	void (*callback)(struct ubuf_info *, int zerocopy_status);
 	void *ctx;
 	unsigned long desc;
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6e04b1f..eb31f6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -519,7 +519,7 @@ static void skb_release_data(struct sk_buff *skb)
 
 			uarg = skb_shinfo(skb)->destructor_arg;
 			if (uarg->callback)
-				uarg->callback(uarg);
+				uarg->callback(uarg, 0);
 		}
 
 		if (skb_has_frag_list(skb))
@@ -797,7 +797,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg);
+	uarg->callback(uarg, 1);
 
 	/* skb frags point to kernel buffers */
 	for (i = num_frags - 1; i >= 0; i--) {
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 2/8] skb: api to report errors for zero copy skbs
Orphaning frags for zero copy skbs needs to allocate data in atomic
context so is has a chance to fail. If it does we currently discard
the skb which is safe, but we don't report anything to the caller,
so it can not recover by e.g. disabling zero copy.
Add an API to free skb reporting such errors: this is used
by tun in case orphaning frags fails.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bac11b..0644432 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -568,6 +568,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff
*skb)
 }
 
 extern void kfree_skb(struct sk_buff *skb);
+extern void skb_tx_error(struct sk_buff *skb, int err);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct kmem_cache *skbuff_head_cache;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb31f6e..ad99c64 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -635,6 +635,25 @@ void kfree_skb(struct sk_buff *skb)
 EXPORT_SYMBOL(kfree_skb);
 
 /**
+ *	kfree_skb_on_error - report an sk_buff xmit error
+ *	@skb: buffer that triggered an error
+ *
+ *	Report xmit error if a device callback is tracking this skb.
+ */
+void skb_tx_error(struct sk_buff *skb, int err)
+{
+	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
+		struct ubuf_info *uarg;
+
+		uarg = skb_shinfo(skb)->destructor_arg;
+		if (uarg->callback)
+			uarg->callback(uarg, err);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+EXPORT_SYMBOL(skb_tx_error);
+
+/**
  *	consume_skb - free an skbuff
  *	@skb: buffer to free
  *
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 3/8] tun: report orphan frags errors to zero copy callback
When tun transmits a zero copy skb, it orphans the frags which might need to allocate extra memory, in atomic context. If that fails, notify ubufs callback before freeing the skb as a hint that device should disable zerocopy mode. Signed-off-by: Michael S. Tsirkin <mst at redhat.com> --- drivers/net/tun.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 3157519..613f826 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -433,6 +433,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) drop: dev->stats.tx_dropped++; + skb_tx_error(skb, -ENOMEM); kfree_skb(skb); return NETDEV_TX_OK; } -- MST
Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 8/8] vhost-net: reduce vq polling on tx zerocopy
It seems that to avoid deadlocks it is enough to poll vq before
 we are going to use the last buffer.  This should be faster than
c70aa540c7a9f67add11ad3161096fb95233aa2e.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8e9de79..3967f82 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,8 +197,16 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf,
int status)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
-
-	vhost_poll_queue(&vq->poll);
+	int cnt = atomic_read(&ubufs->kref.refcount);
+
+	/*
+	 * Trigger polling thread if guest stopped submitting new buffers:
+	 * in this case, the refcount after decrement will eventually reach 1
+	 * so here it is 2.
+	 * We also trigger polling periodically after each 16 packets.
+	 */
+	if (cnt <= 2 || !(cnt % 16))
+		vhost_poll_queue(&vq->poll);
 	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = status ?
 		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:49 UTC
[PATCH net-next 4/8] vhost-net: cleanup macros for DMA status tracking
Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c   |  3 ++-
 drivers/vhost/vhost.c |  2 +-
 drivers/vhost/vhost.h | 12 +++++++++---
 3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 072cbba..f80ae5f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -237,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
 			} else {
 				struct ubuf_info *ubuf = &vq->ubuf_info[head];
 
-				vq->heads[vq->upend_idx].len = len;
+				vq->heads[vq->upend_idx].len +					VHOST_DMA_IN_PROGRESS;
 				ubuf->callback = vhost_zerocopy_callback;
 				ubuf->ctx = vq->ubufs;
 				ubuf->desc = vq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 92308b6..906fd9f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1606,7 +1606,7 @@ void vhost_zerocopy_callback(struct ubuf_info *ubuf, int
zerocopy_status)
 	struct vhost_virtqueue *vq = ubufs->vq;
 
 	vhost_poll_queue(&vq->poll);
-	/* set len = 1 to mark this desc buffers done DMA */
+	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index eb7263c3..ad72a1f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,9 +13,15 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
- * done */
-#define VHOST_DMA_DONE_LEN	1
+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN	2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS	1
+/* Buffer unused */
 #define VHOST_DMA_CLEAR_LEN	0
 
 struct vhost_device;
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:50 UTC
[PATCH net-next 5/8] vhost: track zero copy failures using DMA length
This will be used to disable zerocopy when error rate
is high.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/vhost.c | 7 ++++---
 drivers/vhost/vhost.h | 4 ++++
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 906fd9f..5affce3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -425,7 +425,7 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 	int j = 0;
 
 	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN)) {
+		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			vhost_add_used_and_signal(vq->dev, vq,
 						  vq->heads[i].id, 0);
@@ -1600,13 +1600,14 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref
*ubufs)
 	kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(struct ubuf_info *ubuf, int zerocopy_status)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
 
 	vhost_poll_queue(&vq->poll);
 	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+	vq->heads[ubuf->desc].len = status ?
+		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ad72a1f..6fdf31d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -17,6 +17,8 @@
  * For transmit, used buffer len is unused; we override it to track buffer
  * status internally; used for zerocopy tx only.
  */
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN	3
 /* Lower device DMA done */
 #define VHOST_DMA_DONE_LEN	2
 /* Lower device DMA in progress */
@@ -24,6 +26,8 @@
 /* Buffer unused */
 #define VHOST_DMA_CLEAR_LEN	0
 
+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
 struct vhost_device;
 
 struct vhost_work;
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:50 UTC
[PATCH net-next 6/8] vhost: move -net specific code out
Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c       | 45 ++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/tcm_vhost.c |  1 +
 drivers/vhost/vhost.c     | 53 +++++++----------------------------------------
 drivers/vhost/vhost.h     | 21 +++----------------
 4 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f80ae5f..532fc88 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -126,6 +126,42 @@ static void tx_poll_start(struct vhost_net *net, struct
socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+{
+	int i;
+	int j = 0;
+
+	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+			vhost_add_used_and_signal(vq->dev, vq,
+						  vq->heads[i].id, 0);
+			++j;
+		} else
+			break;
+	}
+	if (j)
+		vq->done_idx = i;
+	return j;
+}
+
+static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
+{
+	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
+	struct vhost_virtqueue *vq = ubufs->vq;
+
+	vhost_poll_queue(&vq->poll);
+	/* set len to mark this desc buffers done DMA */
+	vq->heads[ubuf->desc].len = status ?
+		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
+	vhost_ubuf_put(ubufs);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -594,9 +630,18 @@ static int vhost_net_release(struct inode *inode, struct
file *f)
 	struct vhost_net *n = f->private_data;
 	struct socket *tx_sock;
 	struct socket *rx_sock;
+	int i;
 
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
+	vhost_dev_stop(&n->dev);
+	for (i = 0; i < n->dev.nvqs; ++i) {
+		/* Wait for all lower device DMAs done. */
+		if (n->dev.vqs[i].ubufs)
+			vhost_ubuf_put_and_wait(n->dev.vqs[i].ubufs);
+
+		vhost_zerocopy_signal_used(n, &n->dev.vqs[i]);
+	}
 	vhost_dev_cleanup(&n->dev, false);
 	if (tx_sock)
 		fput(tx_sock->file);
diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index aa31692..23c138f 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -895,6 +895,7 @@ static int vhost_scsi_release(struct inode *inode, struct
file *f)
 		vhost_scsi_clear_endpoint(s, &backend);
 	}
 
+	vhost_dev_stop(&s->dev);
 	vhost_dev_cleanup(&s->dev, false);
 	kfree(s);
 	return 0;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5affce3..ef8f598 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -26,10 +26,6 @@
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 
-#include <linux/net.h>
-#include <linux/if_packet.h>
-#include <linux/if_arp.h>
-
 #include "vhost.h"
 
 enum {
@@ -414,28 +410,16 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
-/* In case of DMA done not in order in lower device driver for some reason.
- * upend_idx is used to track end of used idx, done_idx is used to track head
- * of used idx. Once lower device DMA done contiguously, we will signal KVM
- * guest used idx.
- */
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+void vhost_dev_stop(struct vhost_dev *dev)
 {
 	int i;
-	int j = 0;
-
-	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
-			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
-			vhost_add_used_and_signal(vq->dev, vq,
-						  vq->heads[i].id, 0);
-			++j;
-		} else
-			break;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
+			vhost_poll_stop(&dev->vqs[i].poll);
+			vhost_poll_flush(&dev->vqs[i].poll);
+		}
 	}
-	if (j)
-		vq->done_idx = i;
-	return j;
 }
 
 /* Caller should have device mutex if and only if locked is set */
@@ -444,17 +428,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	int i;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
-			vhost_poll_stop(&dev->vqs[i].poll);
-			vhost_poll_flush(&dev->vqs[i].poll);
-		}
-		/* Wait for all lower device DMAs done. */
-		if (dev->vqs[i].ubufs)
-			vhost_ubuf_put_and_wait(dev->vqs[i].ubufs);
-
-		/* Signal guest as appropriate. */
-		vhost_zerocopy_signal_used(&dev->vqs[i]);
-
 		if (dev->vqs[i].error_ctx)
 			eventfd_ctx_put(dev->vqs[i].error_ctx);
 		if (dev->vqs[i].error)
@@ -1599,15 +1572,3 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref
*ubufs)
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
 	kfree(ubufs);
 }
-
-void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
-{
-	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
-	struct vhost_virtqueue *vq = ubufs->vq;
-
-	vhost_poll_queue(&vq->poll);
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = status ?
-		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
-	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6fdf31d..5e19e3d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -7,27 +7,11 @@
 #include <linux/mutex.h>
 #include <linux/poll.h>
 #include <linux/file.h>
-#include <linux/skbuff.h>
 #include <linux/uio.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-/*
- * For transmit, used buffer len is unused; we override it to track buffer
- * status internally; used for zerocopy tx only.
- */
-/* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
-/* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
-/* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS	1
-/* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN	0
-
-#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
-
 struct vhost_device;
 
 struct vhost_work;
@@ -80,6 +64,8 @@ struct vhost_ubuf_ref *vhost_ubuf_alloc(struct vhost_virtqueue
*, bool zcopy);
 void vhost_ubuf_put(struct vhost_ubuf_ref *);
 void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);
 
+struct ubuf_info;
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -177,6 +163,7 @@ long vhost_dev_init(struct vhost_dev *, struct
vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
+void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long
arg);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
@@ -201,8 +188,6 @@ bool vhost_enable_notify(struct vhost_dev *, struct
vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *, int);
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
-- 
MST
Michael S. Tsirkin
2012-Oct-29  15:50 UTC
[PATCH net-next 7/8] vhost-net: select tx zero copy dynamically
Even when vhost-net is in zero-copy transmit mode,
net core might still decide to copy the skb later
which is somewhat slower than a copy in user
context: data copy overhead is added to the cost of
page pin/unpin. The result is that enabling tx zero copy
option leads to higher CPU utilization for guest to guest
and guest to host traffic.
To fix this, suppress zero copy tx after a given number of
packets triggered late data copy. Re-enable periodically
to detect workload changes.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 532fc88..8e9de79 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,6 +42,21 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable
Experimental Zero Copy TX");
 #define VHOST_MAX_PEND 128
 #define VHOST_GOODCOPY_LEN 256
 
+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN	3
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN	2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS	1
+/* Buffer unused */
+#define VHOST_DMA_CLEAR_LEN	0
+
+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -62,8 +77,33 @@ struct vhost_net {
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
 	enum vhost_net_poll_state tx_poll_state;
+	/* Number of TX recently submitted.
+	 * Protected by tx vq lock. */
+	unsigned tx_packets;
+	/* Number of times zerocopy TX recently failed.
+	 * Protected by tx vq lock. */
+	unsigned tx_zcopy_err;
 };
 
+static void vhost_net_tx_packet(struct vhost_net *net)
+{
+	++net->tx_packets;
+	if (net->tx_packets < 1024)
+		return;
+	net->tx_packets = 0;
+	net->tx_zcopy_err = 0;
+}
+
+static void vhost_net_tx_err(struct vhost_net *net)
+{
+	++net->tx_zcopy_err;
+}
+
+static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
+{
+	return net->tx_packets / 64 >= net->tx_zcopy_err;
+}
+
 static bool vhost_sock_zcopy(struct socket *sock)
 {
 	return unlikely(experimental_zcopytx) &&
@@ -131,12 +171,15 @@ static void tx_poll_start(struct vhost_net *net, struct
socket *sock)
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
  * guest used idx.
  */
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+static int vhost_zerocopy_signal_used(struct vhost_net *net,
+				      struct vhost_virtqueue *vq)
 {
 	int i;
 	int j = 0;
 
 	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			vhost_add_used_and_signal(vq->dev, vq,
@@ -208,7 +251,7 @@ static void handle_tx(struct vhost_net *net)
 	for (;;) {
 		/* Release DMAs done buffers first */
 		if (zcopy)
-			vhost_zerocopy_signal_used(vq);
+			vhost_zerocopy_signal_used(net, vq);
 
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -263,7 +306,8 @@ static void handle_tx(struct vhost_net *net)
 		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 		if (zcopy) {
 			vq->heads[vq->upend_idx].id = head;
-			if (len < VHOST_GOODCOPY_LEN) {
+			if (!vhost_net_tx_select_zcopy(net) ||
+			    len < VHOST_GOODCOPY_LEN) {
 				/* copy don't need to wait for DMA done */
 				vq->heads[vq->upend_idx].len  							VHOST_DMA_DONE_LEN;
@@ -305,8 +349,9 @@ static void handle_tx(struct vhost_net *net)
 		if (!zcopy)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
-			vhost_zerocopy_signal_used(vq);
+			vhost_zerocopy_signal_used(net, vq);
 		total_len += len;
+		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
@@ -774,7 +819,7 @@ static long vhost_net_set_backend(struct vhost_net *n,
unsigned index, int fd)
 	if (oldubufs) {
 		vhost_ubuf_put_and_wait(oldubufs);
 		mutex_lock(&vq->mutex);
-		vhost_zerocopy_signal_used(vq);
+		vhost_zerocopy_signal_used(n, vq);
 		mutex_unlock(&vq->mutex);
 	}
 
-- 
MST
Vlad Yasevich
2012-Oct-30  15:47 UTC
[PATCH net-next 8/8] vhost-net: reduce vq polling on tx zerocopy
On 10/29/2012 11:49 AM, Michael S. Tsirkin wrote:> It seems that to avoid deadlocks it is enough to poll vq before > we are going to use the last buffer. This should be faster than > c70aa540c7a9f67add11ad3161096fb95233aa2e. > > Signed-off-by: Michael S. Tsirkin <mst at redhat.com> > --- > drivers/vhost/net.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 8e9de79..3967f82 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -197,8 +197,16 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status) > { > struct vhost_ubuf_ref *ubufs = ubuf->ctx; > struct vhost_virtqueue *vq = ubufs->vq; > - > - vhost_poll_queue(&vq->poll); > + int cnt = atomic_read(&ubufs->kref.refcount); > + > + /* > + * Trigger polling thread if guest stopped submitting new buffers: > + * in this case, the refcount after decrement will eventually reach 1 > + * so here it is 2. > + * We also trigger polling periodically after each 16 packets. > + */ > + if (cnt <= 2 || !(cnt % 16))Why 16? Does it make sense to make it configurable? -vlad> + vhost_poll_queue(&vq->poll); > /* set len to mark this desc buffers done DMA */ > vq->heads[ubuf->desc].len = status ? > VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN; >
Maybe Matching Threads
- [PATCH net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv2 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv2 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv3 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv3 net-next 0/8] enable/disable zero copy tx dynamically