Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 0/8] enable/disable zero copy tx dynamically
tun supports zero copy transmit since 0690899b4d4501b3505be069b9a687e68ccbe15b,
however you can only enable this mode if you know your workload does not
trigger heavy guest to host/host to guest traffic - otherwise you
get a (minor) performance regression.
This patchset addresses this problem by notifying the owner
device when callback is invoked because of a data copy.
This makes it possible to detect whether zero copy is appropriate
dynamically: we start in zero copy mode, when we detect
data copied we disable zero copy for a while.
With this patch applied, I get the same performance for
guest to host and guest to guest both with and without zero copy tx.
Changes from v2:
  change callback parameter from int to bool
  accordingly, drop err parameter from skb_tx_error
Changes from v1:
  Comment fixups in patches 2 and 8 suggested by Vlad Yasevich,
     no changes to other patches
Michael S. Tsirkin (8):
  skb: report completion status for zero copy skbs
  skb: api to report errors for zero copy skbs
  tun: report orphan frags errors to zero copy callback
  vhost-net: cleanup macros for DMA status tracking
  vhost: track zero copy failures using DMA length
  vhost: move -net specific code out
  vhost-net: select tx zero copy dynamically
  vhost-net: reduce vq polling on tx zerocopy
 drivers/net/tun.c         |   1 +
 drivers/vhost/net.c       | 111 +++++++++++++++++++++++++++++++++++++++++++---
 drivers/vhost/tcm_vhost.c |   1 +
 drivers/vhost/vhost.c     |  52 +++-------------------
 drivers/vhost/vhost.h     |  11 ++---
 include/linux/skbuff.h    |   5 ++-
 net/core/skbuff.c         |  24 +++++++++-
 7 files changed, 144 insertions(+), 61 deletions(-)
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 1/8] skb: report completion status for zero copy skbs
Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/vhost.c  | 2 +-
 drivers/vhost/vhost.h  | 2 +-
 include/linux/skbuff.h | 4 +++-
 net/core/skbuff.c      | 4 ++--
 4 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99ac2cb..73d08db 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1600,7 +1600,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
 	kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(struct ubuf_info *ubuf)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool status)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1125af3..2de4ce2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -191,7 +191,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct
vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *);
+void vhost_zerocopy_callback(struct ubuf_info *, bool);
 int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {                                  \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a2a0bdb..e5eae5b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -235,11 +235,13 @@ enum {
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
+ * The zerocopy_success argument is true if zero copy transmit occurred,
+ * false on data copy or out of memory error caused by data copy attempt.
  * The ctx field is used to track device context.
  * The desc field is used to track userspace buffer index.
  */
 struct ubuf_info {
-	void (*callback)(struct ubuf_info *);
+	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6e04b1f..4abdf71 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -519,7 +519,7 @@ static void skb_release_data(struct sk_buff *skb)
 
 			uarg = skb_shinfo(skb)->destructor_arg;
 			if (uarg->callback)
-				uarg->callback(uarg);
+				uarg->callback(uarg, true);
 		}
 
 		if (skb_has_frag_list(skb))
@@ -797,7 +797,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg);
+	uarg->callback(uarg, false);
 
 	/* skb frags point to kernel buffers */
 	for (i = num_frags - 1; i >= 0; i--) {
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 2/8] skb: api to report errors for zero copy skbs
Orphaning frags for zero copy skbs needs to allocate data in atomic
context so is has a chance to fail. If it does we currently discard
the skb which is safe, but we don't report anything to the caller,
so it can not recover by e.g. disabling zero copy.
Add an API to free skb reporting such errors: this is used
by tun in case orphaning frags fails.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e5eae5b..f2af494 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -568,6 +568,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff
*skb)
 }
 
 extern void kfree_skb(struct sk_buff *skb);
+extern void skb_tx_error(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct kmem_cache *skbuff_head_cache;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4abdf71..d9addea 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -635,6 +635,26 @@ void kfree_skb(struct sk_buff *skb)
 EXPORT_SYMBOL(kfree_skb);
 
 /**
+ *	skb_tx_error - report an sk_buff xmit error
+ *	@skb: buffer that triggered an error
+ *
+ *	Report xmit error if a device callback is tracking this skb.
+ *	skb must be freed afterwards.
+ */
+void skb_tx_error(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
+		struct ubuf_info *uarg;
+
+		uarg = skb_shinfo(skb)->destructor_arg;
+		if (uarg->callback)
+			uarg->callback(uarg, false);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+EXPORT_SYMBOL(skb_tx_error);
+
+/**
  *	consume_skb - free an skbuff
  *	@skb: buffer to free
  *
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 3/8] tun: report orphan frags errors to zero copy callback
When tun transmits a zero copy skb, it orphans the frags which might need to allocate extra memory, in atomic context. If that fails, notify ubufs callback before freeing the skb as a hint that device should disable zerocopy mode. Signed-off-by: Michael S. Tsirkin <mst at redhat.com> --- drivers/net/tun.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9e28768..b44d7b7 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -728,6 +728,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) drop: dev->stats.tx_dropped++; + skb_tx_error(skb); kfree_skb(skb); rcu_read_unlock(); return NETDEV_TX_OK; -- MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 4/8] vhost-net: cleanup macros for DMA status tracking
Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c   |  3 ++-
 drivers/vhost/vhost.c |  2 +-
 drivers/vhost/vhost.h | 12 +++++++++---
 3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 072cbba..f80ae5f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -237,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
 			} else {
 				struct ubuf_info *ubuf = &vq->ubuf_info[head];
 
-				vq->heads[vq->upend_idx].len = len;
+				vq->heads[vq->upend_idx].len +					VHOST_DMA_IN_PROGRESS;
 				ubuf->callback = vhost_zerocopy_callback;
 				ubuf->ctx = vq->ubufs;
 				ubuf->desc = vq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 73d08db..f23cf89 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1606,7 +1606,7 @@ void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool
status)
 	struct vhost_virtqueue *vq = ubufs->vq;
 
 	vhost_poll_queue(&vq->poll);
-	/* set len = 1 to mark this desc buffers done DMA */
+	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2de4ce2..b6538ee 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,9 +13,15 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
- * done */
-#define VHOST_DMA_DONE_LEN	1
+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN	2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS	1
+/* Buffer unused */
 #define VHOST_DMA_CLEAR_LEN	0
 
 struct vhost_device;
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 5/8] vhost: track zero copy failures using DMA length
This will be used to disable zerocopy when error rate
is high.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/vhost.c | 7 ++++---
 drivers/vhost/vhost.h | 4 ++++
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f23cf89..5affce3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -425,7 +425,7 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 	int j = 0;
 
 	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN)) {
+		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			vhost_add_used_and_signal(vq->dev, vq,
 						  vq->heads[i].id, 0);
@@ -1600,13 +1600,14 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref
*ubufs)
 	kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool status)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
 
 	vhost_poll_queue(&vq->poll);
 	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+	vq->heads[ubuf->desc].len = status ?
+		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b6538ee..464469d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -17,6 +17,8 @@
  * For transmit, used buffer len is unused; we override it to track buffer
  * status internally; used for zerocopy tx only.
  */
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN	3
 /* Lower device DMA done */
 #define VHOST_DMA_DONE_LEN	2
 /* Lower device DMA in progress */
@@ -24,6 +26,8 @@
 /* Buffer unused */
 #define VHOST_DMA_CLEAR_LEN	0
 
+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
 struct vhost_device;
 
 struct vhost_work;
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 6/8] vhost: move -net specific code out
Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c       | 45 ++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/tcm_vhost.c |  1 +
 drivers/vhost/vhost.c     | 53 +++++++----------------------------------------
 drivers/vhost/vhost.h     | 21 +++----------------
 4 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f80ae5f..532fc88 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -126,6 +126,42 @@ static void tx_poll_start(struct vhost_net *net, struct
socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+{
+	int i;
+	int j = 0;
+
+	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+			vhost_add_used_and_signal(vq->dev, vq,
+						  vq->heads[i].id, 0);
+			++j;
+		} else
+			break;
+	}
+	if (j)
+		vq->done_idx = i;
+	return j;
+}
+
+static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
+{
+	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
+	struct vhost_virtqueue *vq = ubufs->vq;
+
+	vhost_poll_queue(&vq->poll);
+	/* set len to mark this desc buffers done DMA */
+	vq->heads[ubuf->desc].len = status ?
+		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
+	vhost_ubuf_put(ubufs);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -594,9 +630,18 @@ static int vhost_net_release(struct inode *inode, struct
file *f)
 	struct vhost_net *n = f->private_data;
 	struct socket *tx_sock;
 	struct socket *rx_sock;
+	int i;
 
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
+	vhost_dev_stop(&n->dev);
+	for (i = 0; i < n->dev.nvqs; ++i) {
+		/* Wait for all lower device DMAs done. */
+		if (n->dev.vqs[i].ubufs)
+			vhost_ubuf_put_and_wait(n->dev.vqs[i].ubufs);
+
+		vhost_zerocopy_signal_used(n, &n->dev.vqs[i]);
+	}
 	vhost_dev_cleanup(&n->dev, false);
 	if (tx_sock)
 		fput(tx_sock->file);
diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index aa31692..23c138f 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -895,6 +895,7 @@ static int vhost_scsi_release(struct inode *inode, struct
file *f)
 		vhost_scsi_clear_endpoint(s, &backend);
 	}
 
+	vhost_dev_stop(&s->dev);
 	vhost_dev_cleanup(&s->dev, false);
 	kfree(s);
 	return 0;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5affce3..ef8f598 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -26,10 +26,6 @@
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 
-#include <linux/net.h>
-#include <linux/if_packet.h>
-#include <linux/if_arp.h>
-
 #include "vhost.h"
 
 enum {
@@ -414,28 +410,16 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
-/* In case of DMA done not in order in lower device driver for some reason.
- * upend_idx is used to track end of used idx, done_idx is used to track head
- * of used idx. Once lower device DMA done contiguously, we will signal KVM
- * guest used idx.
- */
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+void vhost_dev_stop(struct vhost_dev *dev)
 {
 	int i;
-	int j = 0;
-
-	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
-			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
-			vhost_add_used_and_signal(vq->dev, vq,
-						  vq->heads[i].id, 0);
-			++j;
-		} else
-			break;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
+			vhost_poll_stop(&dev->vqs[i].poll);
+			vhost_poll_flush(&dev->vqs[i].poll);
+		}
 	}
-	if (j)
-		vq->done_idx = i;
-	return j;
 }
 
 /* Caller should have device mutex if and only if locked is set */
@@ -444,17 +428,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	int i;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
-			vhost_poll_stop(&dev->vqs[i].poll);
-			vhost_poll_flush(&dev->vqs[i].poll);
-		}
-		/* Wait for all lower device DMAs done. */
-		if (dev->vqs[i].ubufs)
-			vhost_ubuf_put_and_wait(dev->vqs[i].ubufs);
-
-		/* Signal guest as appropriate. */
-		vhost_zerocopy_signal_used(&dev->vqs[i]);
-
 		if (dev->vqs[i].error_ctx)
 			eventfd_ctx_put(dev->vqs[i].error_ctx);
 		if (dev->vqs[i].error)
@@ -1599,15 +1572,3 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref
*ubufs)
 	wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
 	kfree(ubufs);
 }
-
-void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
-{
-	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
-	struct vhost_virtqueue *vq = ubufs->vq;
-
-	vhost_poll_queue(&vq->poll);
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = status ?
-		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
-	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 464469d..5e19e3d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -7,27 +7,11 @@
 #include <linux/mutex.h>
 #include <linux/poll.h>
 #include <linux/file.h>
-#include <linux/skbuff.h>
 #include <linux/uio.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-/*
- * For transmit, used buffer len is unused; we override it to track buffer
- * status internally; used for zerocopy tx only.
- */
-/* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
-/* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
-/* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS	1
-/* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN	0
-
-#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
-
 struct vhost_device;
 
 struct vhost_work;
@@ -80,6 +64,8 @@ struct vhost_ubuf_ref *vhost_ubuf_alloc(struct vhost_virtqueue
*, bool zcopy);
 void vhost_ubuf_put(struct vhost_ubuf_ref *);
 void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);
 
+struct ubuf_info;
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -177,6 +163,7 @@ long vhost_dev_init(struct vhost_dev *, struct
vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
+void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long
arg);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
@@ -201,8 +188,6 @@ bool vhost_enable_notify(struct vhost_dev *, struct
vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *, bool);
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 7/8] vhost-net: select tx zero copy dynamically
Even when vhost-net is in zero-copy transmit mode,
net core might still decide to copy the skb later
which is somewhat slower than a copy in user
context: data copy overhead is added to the cost of
page pin/unpin. The result is that enabling tx zero copy
option leads to higher CPU utilization for guest to guest
and guest to host traffic.
To fix this, suppress zero copy tx after a given number of
packets triggered late data copy. Re-enable periodically
to detect workload changes.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 532fc88..93f2d67 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,6 +42,21 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable
Experimental Zero Copy TX");
 #define VHOST_MAX_PEND 128
 #define VHOST_GOODCOPY_LEN 256
 
+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN	3
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN	2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS	1
+/* Buffer unused */
+#define VHOST_DMA_CLEAR_LEN	0
+
+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -62,8 +77,33 @@ struct vhost_net {
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
 	enum vhost_net_poll_state tx_poll_state;
+	/* Number of TX recently submitted.
+	 * Protected by tx vq lock. */
+	unsigned tx_packets;
+	/* Number of times zerocopy TX recently failed.
+	 * Protected by tx vq lock. */
+	unsigned tx_zcopy_err;
 };
 
+static void vhost_net_tx_packet(struct vhost_net *net)
+{
+	++net->tx_packets;
+	if (net->tx_packets < 1024)
+		return;
+	net->tx_packets = 0;
+	net->tx_zcopy_err = 0;
+}
+
+static void vhost_net_tx_err(struct vhost_net *net)
+{
+	++net->tx_zcopy_err;
+}
+
+static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
+{
+	return net->tx_packets / 64 >= net->tx_zcopy_err;
+}
+
 static bool vhost_sock_zcopy(struct socket *sock)
 {
 	return unlikely(experimental_zcopytx) &&
@@ -131,12 +171,15 @@ static void tx_poll_start(struct vhost_net *net, struct
socket *sock)
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
  * guest used idx.
  */
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+static int vhost_zerocopy_signal_used(struct vhost_net *net,
+				      struct vhost_virtqueue *vq)
 {
 	int i;
 	int j = 0;
 
 	for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			vhost_add_used_and_signal(vq->dev, vq,
@@ -150,15 +193,15 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 	return j;
 }
 
-static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
+static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
 
 	vhost_poll_queue(&vq->poll);
 	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = status ?
-		VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
+	vq->heads[ubuf->desc].len = success ?
+		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 	vhost_ubuf_put(ubufs);
 }
 
@@ -208,7 +251,7 @@ static void handle_tx(struct vhost_net *net)
 	for (;;) {
 		/* Release DMAs done buffers first */
 		if (zcopy)
-			vhost_zerocopy_signal_used(vq);
+			vhost_zerocopy_signal_used(net, vq);
 
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -263,7 +306,8 @@ static void handle_tx(struct vhost_net *net)
 		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 		if (zcopy) {
 			vq->heads[vq->upend_idx].id = head;
-			if (len < VHOST_GOODCOPY_LEN) {
+			if (!vhost_net_tx_select_zcopy(net) ||
+			    len < VHOST_GOODCOPY_LEN) {
 				/* copy don't need to wait for DMA done */
 				vq->heads[vq->upend_idx].len  							VHOST_DMA_DONE_LEN;
@@ -305,8 +349,9 @@ static void handle_tx(struct vhost_net *net)
 		if (!zcopy)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		else
-			vhost_zerocopy_signal_used(vq);
+			vhost_zerocopy_signal_used(net, vq);
 		total_len += len;
+		vhost_net_tx_packet(net);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
@@ -774,7 +819,7 @@ static long vhost_net_set_backend(struct vhost_net *n,
unsigned index, int fd)
 	if (oldubufs) {
 		vhost_ubuf_put_and_wait(oldubufs);
 		mutex_lock(&vq->mutex);
-		vhost_zerocopy_signal_used(vq);
+		vhost_zerocopy_signal_used(n, vq);
 		mutex_unlock(&vq->mutex);
 	}
 
-- 
MST
Michael S. Tsirkin
2012-Nov-01  19:16 UTC
[PATCHv3 net-next 8/8] vhost-net: reduce vq polling on tx zerocopy
It seems that to avoid deadlocks it is enough to poll vq before
 we are going to use the last buffer.  This is faster than
c70aa540c7a9f67add11ad3161096fb95233aa2e.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
---
 drivers/vhost/net.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 93f2d67..28ad775 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,8 +197,18 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf,
bool success)
 {
 	struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 	struct vhost_virtqueue *vq = ubufs->vq;
-
-	vhost_poll_queue(&vq->poll);
+	int cnt = atomic_read(&ubufs->kref.refcount);
+
+	/*
+	 * Trigger polling thread if guest stopped submitting new buffers:
+	 * in this case, the refcount after decrement will eventually reach 1
+	 * so here it is 2.
+	 * We also trigger polling periodically after each 16 packets
+	 * (the value 16 here is more or less arbitrary, it's tuned to trigger
+	 * less than 10% of times).
+	 */
+	if (cnt <= 2 || !(cnt % 16))
+		vhost_poll_queue(&vq->poll);
 	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
-- 
MST
David Miller
2012-Nov-03  01:31 UTC
[PATCHv3 net-next 0/8] enable/disable zero copy tx dynamically
From: "Michael S. Tsirkin" <mst at redhat.com> Date: Thu, 1 Nov 2012 21:16:17 +0200> > tun supports zero copy transmit since 0690899b4d4501b3505be069b9a687e68ccbe15b, > however you can only enable this mode if you know your workload does not > trigger heavy guest to host/host to guest traffic - otherwise you > get a (minor) performance regression. > This patchset addresses this problem by notifying the owner > device when callback is invoked because of a data copy. > This makes it possible to detect whether zero copy is appropriate > dynamically: we start in zero copy mode, when we detect > data copied we disable zero copy for a while. > > With this patch applied, I get the same performance for > guest to host and guest to guest both with and without zero copy tx.Series applied, thanks Michael.
Seemingly Similar Threads
- [PATCHv3 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv2 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCHv2 net-next 0/8] enable/disable zero copy tx dynamically
- [PATCH net-next 0/8] enable/disable zero copy tx dynamically
- [PATCH net-next 0/8] enable/disable zero copy tx dynamically