Jason Wang
2014-Oct-11  07:16 UTC
[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
Hello all: We free old transmitted packets in ndo_start_xmit() currently, so any packet must be orphaned also there. This was used to reduce the overhead of tx interrupt to achieve better performance. But this may not work for some protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to implement various optimization for small packets stream such as TCP small queue and auto corking. But orphaning packets early in ndo_start_xmit() disable such things more or less since sk_wmem_alloc was not accurate. This lead extra low throughput for TCP stream of small writes. This series tries to solve this issue by enable tx interrupts for all TCP packets other than the ones with push bit or pure ACK. This is done through the support of urgent descriptor which can force an interrupt for a specified packet. If tx interrupt was enabled for a packet, there's no need to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP can batch more for small write. More larger skb was produced by TCP in this case to improve both throughput and cpu utilization. Test shows great improvements on small write tcp streams. For most of the other cases, the throughput and cpu utilization are the same in the past. Only few cases, more cpu utilization was noticed which needs more investigation. Review and comments are welcomed. Thanks Test result: - Two Intel Corporation Xeon 5600s (8 cores) with back to back connected 82599ES: - netperf test between guest and remote host - 1 queue 2 vcpus with zercopy enabled vhost_net - both host and guest are net-next.git with the patches. - Value with '[]' means obvious difference (the significance is greater than 95%). - he significance of the differences between the two averages is calculated using unpaired T-test that takes into account the SD of the averages. Guest RX size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/ 64/1/+3.7872%/+3.2307%/+0.5390%/ 64/2/-0.2325%/+2.9552%/-3.0962%/ 64/4/[-2.0296%]/+2.2955%/[-4.2280%]/ 64/8/+0.0944%/[+2.2654%]/-2.4662%/ 256/1/+1.1947%/-2.5462%/+3.8386%/ 256/2/-1.6477%/+3.4421%/-4.9301%/ 256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/ 256/8/-3.6470%/-1.5887%/-2.0916%/ 1024/1/-4.2225%/-1.3238%/-2.9376%/ 1024/2/+0.3568%/+1.8439%/-1.4601%/ 1024/4/-0.7065%/-0.0099%/-2.3483%/ 1024/8/-1.8620%/-2.4774%/+0.6310%/ 4096/1/+0.0115%/-0.3693%/+0.3823%/ 4096/2/-0.0209%/+0.8730%/-0.8862%/ 4096/4/+0.0729%/-7.0303%/+7.6403%/ 4096/8/-2.3720%/+0.0507%/-2.4214%/ 16384/1/+0.0222%/-1.8672%/+1.9254%/ 16384/2/+0.0986%/+3.2968%/-3.0961%/ 16384/4/-1.2059%/+7.4291%/-8.0379%/ 16384/8/-1.4893%/+0.3403%/-1.8234%/ 65535/1/-0.0445%/-1.4060%/+1.3808%/ 65535/2/-0.0311%/+0.9610%/-0.9827%/ 65535/4/-0.7015%/+0.3660%/-1.0637%/ 65535/8/-3.1585%/+11.1302%/[-12.8576%]/ Guest TX size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/ 64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/ 64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/ 64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/ 64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/ 256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/ 256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/ 256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/ 256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/ 1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/ 1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/ 1024/4/[+0.5524%]/-0.0327%/+0.5853%/ 1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/ 4096/1/+0.0778%/-13.1121%/+15.1804%/ 4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/ 4096/4/+0.0218%/-1.0389%/+1.0718%/ 4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/ 16384/1/+0.0136%/-2.5043%/+2.5826%/ 16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/ 16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/ 16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/ 65535/1/+0.0686%/-5.4942%/+5.8862%/ 65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/ 65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/ 65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/ Guest TCP_RR size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/ 256/1/-0.2914%/+12.6457%/-11.4848%/ 256/25/-0.5968%/-5.0531%/+4.6935%/ 256/50/+0.0262%/+0.2079%/-0.1813%/ 4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/ 4096/25/-0.5002%/+0.5449%/-1.0395%/ 4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/ Tests on mlx4 was ongoing, will post the result in next week. Jason Wang (3): virtio: support for urgent descriptors vhost: support urgent descriptors virtio-net: conditionally enable tx interrupt drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++--------- drivers/vhost/net.c | 43 +++++++--- drivers/vhost/scsi.c | 23 ++++-- drivers/vhost/test.c | 5 +- drivers/vhost/vhost.c | 44 +++++++---- drivers/vhost/vhost.h | 19 +++-- drivers/virtio/virtio_ring.c | 75 +++++++++++++++++- include/linux/virtio.h | 14 ++++ include/uapi/linux/virtio_ring.h | 5 +- 9 files changed, 308 insertions(+), 84 deletions(-) -- 1.8.3.1
Jason Wang
2014-Oct-11  07:16 UTC
[PATCH net-next RFC 1/3] virtio: support for urgent descriptors
Below should be useful for some experiments Jason is doing.
I thought I'd send it out for early review/feedback.
event idx feature allows us to defer interrupts until
a specific # of descriptors were used.
Sometimes it might be useful to get an interrupt after
a specific descriptor, regardless.
This adds a descriptor flag for this, and an API
to create an urgent output descriptor.
This is still an RFC:
we'll need a feature bit for drivers to detect this,
but we've run out of feature bits for virtio 0.X.
For experimentation purposes, drivers can assume
this is set, or add a driver-specific feature bit.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
 include/linux/virtio.h           | 14 ++++++++
 include/uapi/linux/virtio_ring.h |  5 ++-
 3 files changed, 89 insertions(+), 5 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a..a5188c6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct
scatterlist *sg,
 
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
+				     bool urgent,
 				     struct scatterlist *sgs[],
 				     struct scatterlist *(*next)
 				       (struct scatterlist *, unsigned int *),
@@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue
*vq,
 	/* Use a single buffer which doesn't continue */
 	head = vq->free_head;
 	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+	if (urgent)
+		vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
 	vq->vring.desc[head].addr = virt_to_phys(desc);
 	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
 	kmemleak_ignore(desc);
@@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue
*vq,
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
+			        bool urgent,
 				struct scatterlist *sgs[],
 				struct scatterlist *(*next)
 				  (struct scatterlist *, unsigned int *),
@@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
{
-		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
+		head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out,
 					  total_in,
 					  out_sgs, in_sgs, gfp);
 		if (likely(head >= 0))
@@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -305,6 +317,8 @@ add_head:
 
 /**
  * virtqueue_add_sgs - expose buffers to other end
+ * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
  * @out_num: the number of scatterlists readable by other side
@@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
 		for (sg = sgs[i]; sg; sg = sg_next(sg))
 			total_in++;
 	}
-	return virtqueue_add(_vq, sgs, sg_next_chained,
+	return virtqueue_add(_vq, false, sgs, sg_next_chained,
 			     total_out, total_in, out_sgs, in_sgs, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
@@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data,
gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
 
 /**
+ * virtqueue_add_outbuf - expose output buffers to other end
+ *          in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
+ * @vq: the struct virtqueue we're talking about.
+ * @sgs: array of scatterlists (need not be terminated!)
+ * @num: the number of scatterlists readable by other side
+ * @data: the token identifying the buffer.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+			 struct scatterlist sg[], unsigned int num,
+			 void *data,
+			 gfp_t gfp)
+{
+	return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent);
+
+/**
  * virtqueue_add_inbuf - expose input buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of scatterlists (need not be terminated!)
@@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
 			void *data,
 			gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data,
gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
 
@@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+}
+EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
  * @vq: the struct virtqueue we're talking about.
@@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
+
+	START_USE(vq);
+	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+	last_used_idx = vq->last_used_idx;
+	END_USE(vq);
+	return last_used_idx;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
+
 /**
  * virtqueue_poll - query pending used buffers
  * @vq: the struct virtqueue we're talking about.
@@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *_vq)
+{
+	unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq);
+
+	return !virtqueue_poll(_vq, last_used_idx);
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b46671e..68be5f2 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -39,6 +39,12 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp);
 
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+				struct scatterlist sg[], unsigned int num,
+				void *data,
+				gfp_t gfp);
+
+
 int virtqueue_add_inbuf(struct virtqueue *vq,
 			struct scatterlist sg[], unsigned int num,
 			void *data,
@@ -61,10 +67,18 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int
*len);
 
 void virtqueue_disable_cb(struct virtqueue *vq);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *vq);
+
 bool virtqueue_enable_cb(struct virtqueue *vq);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
+
+bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
+
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq);
+
 bool virtqueue_poll(struct virtqueue *vq, unsigned);
 
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..daf5bb0 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -39,6 +39,9 @@
 #define VRING_DESC_F_WRITE	2
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT	4
+/* This means the descriptor should cause an interrupt
+ * ignoring avail event idx. */
+#define VRING_DESC_F_URGENT	8
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me
when
  * you add a buffer.  It's unreliable, so it's simply an optimization. 
Guest
@@ -48,7 +51,7 @@
  * when you consume a buffer.  It's unreliable, so it's simply an
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
-
+#define VRING_AVAIL_F_NO_URGENT_INTERRUPT	2
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
-- 
1.8.3.1
Jason Wang
2014-Oct-11  07:16 UTC
[PATCH net-next RFC 2/3] vhost: support urgent descriptors
This patches let vhost-net support urgent descriptors. For zerocopy case,
two new types of length was introduced to make it work.
Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/net.c   | 43 +++++++++++++++++++++++++++++++------------
 drivers/vhost/scsi.c  | 23 +++++++++++++++--------
 drivers/vhost/test.c  |  5 +++--
 drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 19 +++++++++++++------
 5 files changed, 91 insertions(+), 43 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..37b0bb5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,9 +48,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy
TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
+#define VHOST_DMA_FAILED_LEN	5
+/* Lower device DMA doen, urgent bit set */
+#define VHOST_DMA_DONE_LEN_URGENT	4
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
+#define VHOST_DMA_DONE_LEN	3
+/* Lower device DMA in progress, urgent bit set */
+#define VHOST_DMA_URGENT	2
 /* Lower device DMA in progress */
 #define VHOST_DMA_IN_PROGRESS	1
 /* Buffer unused */
@@ -284,11 +288,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net
*net,
 		container_of(vq, struct vhost_net_virtqueue, vq);
 	int i, add;
 	int j = 0;
+	bool urgent = false;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
 		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
 			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			urgent = urgent || vq->heads[i].len == VHOST_DMA_DONE_LEN_URGENT;
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			++j;
 		} else
@@ -296,7 +302,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net
*net,
 	}
 	while (j) {
 		add = min(UIO_MAXIOV - nvq->done_idx, j);
-		vhost_add_used_and_signal_n(vq->dev, vq,
+		vhost_add_used_and_signal_n(vq->dev, vq, urgent,
 					    &vq->heads[nvq->done_idx], add);
 		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
 		j -= add;
@@ -311,9 +317,14 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf,
bool success)
 
 	rcu_read_lock_bh();
 
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	if (success) {
+		if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+		else
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN_URGENT;
+	} else {
+		vq->heads[ubuf->desc].len = VHOST_DMA_FAILED_LEN;
+	}
 	cnt = vhost_net_ubuf_put(ubufs);
 
 	/*
@@ -363,6 +374,7 @@ static void handle_tx(struct vhost_net *net)
 	zcopy = nvq->ubufs;
 
 	for (;;) {
+		bool urgent;
 		/* Release DMAs done buffers first */
 		if (zcopy)
 			vhost_zerocopy_signal_used(net, vq);
@@ -374,7 +386,7 @@ static void handle_tx(struct vhost_net *net)
 			      % UIO_MAXIOV == nvq->done_idx))
 			break;
 
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -417,7 +429,8 @@ static void handle_tx(struct vhost_net *net)
 			ubuf = nvq->ubuf_info + nvq->upend_idx;
 
 			vq->heads[nvq->upend_idx].id = head;
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			vq->heads[nvq->upend_idx].len = urgent ?
+				VHOST_DMA_URGENT : VHOST_DMA_IN_PROGRESS;
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
@@ -445,7 +458,7 @@ static void handle_tx(struct vhost_net *net)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
 		if (!zcopy_used)
-			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+			vhost_add_used_and_signal(&net->dev, vq, urgent, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		total_len += len;
@@ -488,6 +501,7 @@ static int peek_head_len(struct sock *sk)
  *	returns number of buffer heads allocated, negative on error
  */
 static int get_rx_bufs(struct vhost_virtqueue *vq,
+		       bool *urgentp,
 		       struct vring_used_elem *heads,
 		       int datalen,
 		       unsigned *iovcount,
@@ -502,11 +516,13 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 	int r, nlogs = 0;
 
 	while (datalen > 0 && headcount < quota) {
+		bool urgent = false;
+
 		if (unlikely(seg >= UIO_MAXIOV)) {
 			r = -ENOBUFS;
 			goto err;
 		}
-		r = vhost_get_vq_desc(vq, vq->iov + seg,
+		r = vhost_get_vq_desc(vq, &urgent, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
 				      &in, log, log_num);
 		if (unlikely(r < 0))
@@ -527,6 +543,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 			nlogs += *log_num;
 			log += *log_num;
 		}
+		*urgentp = *urgentp || urgent;
 		heads[headcount].id = d;
 		heads[headcount].len = iov_length(vq->iov + seg, in);
 		datalen -= heads[headcount].len;
@@ -590,9 +607,11 @@ static void handle_rx(struct vhost_net *net)
 	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
 	while ((sock_len = peek_head_len(sock->sk))) {
+		bool urgent = false;
+
 		sock_len += sock_hlen;
 		vhost_len = sock_len + vhost_hlen;
-		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
+		headcount = get_rx_bufs(vq, &urgent, vq->heads, vhost_len,
 					&in, vq_log, &log,
 					likely(mergeable) ? UIO_MAXIOV : 1);
 		/* On error, stop handling until the next kick. */
@@ -654,7 +673,7 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq, headcount);
 			break;
 		}
-		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
+		vhost_add_used_and_signal_n(&net->dev, vq, urgent, vq->heads,
 					    headcount);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 69906ca..0a7e5bc 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -72,6 +72,8 @@ struct tcm_vhost_cmd {
 	int tvc_vq_desc;
 	/* virtio-scsi initiator task attribute */
 	int tvc_task_attr;
+	/* Descriptor urgent? */
+	bool tvc_vq_desc_urgent;
 	/* virtio-scsi initiator data direction */
 	enum dma_data_direction tvc_data_direction;
 	/* Expected data transfer length from virtio-scsi header */
@@ -606,6 +608,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct
tcm_vhost_evt *evt)
 	struct virtio_scsi_event __user *eventp;
 	unsigned out, in;
 	int head, ret;
+	bool urgent;
 
 	if (!vq->private_data) {
 		vs->vs_events_missed = true;
@@ -614,7 +617,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct
tcm_vhost_evt *evt)
 
 again:
 	vhost_disable_notify(&vs->dev, vq);
-	head = vhost_get_vq_desc(vq, vq->iov,
+	head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 			ARRAY_SIZE(vq->iov), &out, &in,
 			NULL, NULL);
 	if (head < 0) {
@@ -643,7 +646,7 @@ again:
 	eventp = vq->iov[out].iov_base;
 	ret = __copy_to_user(eventp, event, sizeof(*event));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		vq_err(vq, "Faulted on tcm_vhost_send_event\n");
 }
@@ -704,7 +707,8 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work
*work)
 		ret = copy_to_user(cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
 		if (likely(ret == 0)) {
 			struct vhost_scsi_virtqueue *q;
-			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc_urgent,
+				       cmd->tvc_vq_desc, 0);
 			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
 			vq = q - vs->vqs;
 			__set_bit(vq, signal);
@@ -947,6 +951,7 @@ static void tcm_vhost_submission_work(struct work_struct
*work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 			   struct vhost_virtqueue *vq,
+			   bool urgent,
 			   int head, unsigned out)
 {
 	struct virtio_scsi_cmd_resp __user *resp;
@@ -958,7 +963,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 	resp = vq->iov[out].iov_base;
 	ret = __copy_to_user(resp, &rsp, sizeof(rsp));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		pr_err("Faulted on virtio_scsi_cmd_resp\n");
 }
@@ -980,6 +985,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 	u8 *target, *lunp, task_attr;
 	bool hdr_pi;
 	void *req, *cdb;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	/*
@@ -993,7 +999,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 	vhost_disable_notify(&vs->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					ARRAY_SIZE(vq->iov), &out, &in,
 					NULL, NULL);
 		pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
@@ -1067,7 +1073,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 
 		/* virtio-scsi spec requires byte 0 of the lun to be 1 */
 		if (unlikely(*lunp != 1)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1075,7 +1081,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 
 		/* Target does not exist, fail the request */
 		if (unlikely(!tpg)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1187,6 +1193,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 		 * tcm_vhost_queue_data_in() and tcm_vhost_queue_status()
 		 */
 		cmd->tvc_vq_desc = head;
+		cmd->tvc_vq_desc_urgent = urgent;
 		/*
 		 * Dispatch tv_cmd descriptor for cmwq execution in process
 		 * context provided by tcm_vhost_workqueue.  This also ensures
@@ -1203,7 +1210,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct
vhost_virtqueue *vq)
 err_free:
 	vhost_scsi_free_cmd(cmd);
 err_cmd:
-	vhost_scsi_send_bad_target(vs, vq, head, out);
+	vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 out:
 	mutex_unlock(&vq->mutex);
 }
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index d9c501e..757f3a2 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -42,6 +42,7 @@ static void handle_vq(struct vhost_test *n)
 	int head;
 	size_t len, total_len = 0;
 	void *private;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	private = vq->private_data;
@@ -53,7 +54,7 @@ static void handle_vq(struct vhost_test *n)
 	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -79,7 +80,7 @@ static void handle_vq(struct vhost_test *n)
 			vq_err(vq, "Unexpected 0 len for TX\n");
 			break;
 		}
-		vhost_add_used_and_signal(&n->dev, vq, head, 0);
+		vhost_add_used_and_signal(&n->dev, vq, urgent,  head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..8a35e14 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -186,6 +186,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->last_used_idx = 0;
 	vq->signalled_used = 0;
 	vq->signalled_used_valid = false;
+	vq->urgent = false;
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
@@ -1201,7 +1202,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
  * This function returns the descriptor number found, or vq->num (which is
  * never a valid descriptor number) if none was found.  A negative code is
  * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+int vhost_get_vq_desc(struct vhost_virtqueue *vq, bool *urgent,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num)
@@ -1211,6 +1212,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	u16 last_avail_idx;
 	int ret;
 
+	*urgent = false;
+
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
 	if (unlikely(__get_user(vq->avail_idx, &vq->avail->idx))) {
@@ -1274,6 +1277,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->desc + i);
 			return -EFAULT;
 		}
+		if (desc.flags & VRING_DESC_F_URGENT)
+			*urgent = true;
 		if (desc.flags & VRING_DESC_F_INDIRECT) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
@@ -1333,11 +1338,11 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
 
 /* After we've used one of their buffers, we tell them about it.  We'll
then
  * want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+int vhost_add_used(struct vhost_virtqueue *vq, bool urgent, unsigned int head,
int len)
 {
 	struct vring_used_elem heads = { head, len };
 
-	return vhost_add_used_n(vq, &heads, 1);
+	return vhost_add_used_n(vq, urgent, &heads, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
@@ -1386,7 +1391,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 /* After we've used one of their buffers, we tell them about it.  We'll
then
  * want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+int vhost_add_used_n(struct vhost_virtqueue *vq, bool urgent,
+		     struct vring_used_elem *heads,
 		     unsigned count)
 {
 	int start, n, r;
@@ -1416,13 +1422,14 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct
vring_used_elem *heads,
 		if (vq->log_ctx)
 			eventfd_signal(vq->log_ctx, 1);
 	}
+	vq->urgent = vq->urgent || urgent;
 	return r;
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-	__u16 old, new, event;
+	__u16 old, new, event, flags;
 	bool v;
 	/* Flush out used index updates. This is paired
 	 * with the barrier that the Guest executes when enabling
@@ -1433,14 +1440,17 @@ static bool vhost_notify(struct vhost_dev *dev, struct
vhost_virtqueue *vq)
 	    unlikely(vq->avail_idx == vq->last_avail_idx))
 		return true;
 
-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
-		__u16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
-			vq_err(vq, "Failed to get flags");
-			return true;
-		}
-		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+	if (__get_user(flags, &vq->avail->flags)) {
+		vq_err(vq, "Failed to get flags");
+		return true;
 	}
+
+	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX))
+		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+
+	if (vq->urgent && !(flags & VRING_AVAIL_F_NO_URGENT_INTERRUPT))
+		return true;
+
 	old = vq->signalled_used;
 	v = vq->signalled_used_valid;
 	new = vq->signalled_used = vq->last_used_idx;
@@ -1460,17 +1470,20 @@ static bool vhost_notify(struct vhost_dev *dev, struct
vhost_virtqueue *vq)
 void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	/* Signal the Guest tell them we used something up. */
-	if (vq->call_ctx && vhost_notify(dev, vq))
+	if (vq->call_ctx && vhost_notify(dev, vq)) {
 		eventfd_signal(vq->call_ctx, 1);
+		vq->urgent = false;
+	}
 }
 EXPORT_SYMBOL_GPL(vhost_signal);
 
 /* And here's the combo meal deal.  Supersize me! */
 void vhost_add_used_and_signal(struct vhost_dev *dev,
 			       struct vhost_virtqueue *vq,
+			       bool urgent,
 			       unsigned int head, int len)
 {
-	vhost_add_used(vq, head, len);
+	vhost_add_used(vq, urgent, head, len);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
@@ -1478,9 +1491,10 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
 /* multi-buffer version of vhost_add_used_and_signal */
 void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 				 struct vhost_virtqueue *vq,
+				 bool urgent,
 				 struct vring_used_elem *heads, unsigned count)
 {
-	vhost_add_used_n(vq, heads, count);
+	vhost_add_used_n(vq, urgent, heads, count);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..61ca542 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -96,6 +96,9 @@ struct vhost_virtqueue {
 	/* Last used index value we have signalled on */
 	bool signalled_used_valid;
 
+	/* Urgent descriptor was used */
+	bool urgent;
+
 	/* Log writes to used structure. */
 	bool log_used;
 	u64 log_addr;
@@ -138,20 +141,24 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl,
void __user *argp);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
 
-int vhost_get_vq_desc(struct vhost_virtqueue *,
+int vhost_get_vq_desc(struct vhost_virtqueue *, bool *urgent,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_init_used(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
-		     unsigned count);
-void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
+int vhost_add_used(struct vhost_virtqueue *, bool urgent,
+		   unsigned int head, int len);
+int vhost_add_used_n(struct vhost_virtqueue *, bool urgent,
+		     struct vring_used_elem *heads, unsigned count);
+void vhost_add_used_and_signal(struct vhost_dev *,
+			       struct vhost_virtqueue *,
+			       bool urgent,
 			       unsigned int id, int len);
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
-			       struct vring_used_elem *heads, unsigned count);
+				 bool urgent,
+				 struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
-- 
1.8.3.1
Jason Wang
2014-Oct-11  07:16 UTC
[PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
We free transmitted packets in ndo_start_xmit() in the past to get better
performance in the past. One side effect is that skb_orphan() needs to be
called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in
fact. For TCP protocol, this means several optimization could not work well
such as TCP small queue and auto corking. This can lead extra low
throughput of small packets stream.
Thanks to the urgent descriptor support. This patch tries to solve this
issue by enable the tx interrupt selectively for stream packets. This means
we don't need to orphan TCP stream packets in ndo_start_xmit() but enable
tx interrupt for those packets. After we get tx interrupt, a tx napi was
scheduled to free those packets.
With this method, sk_wmem_alloc of TCP socket were more accurate than in
the past which let TCP can batch more through TSQ and auto corking.
Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 128 insertions(+), 36 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5810841..b450fc4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
 	/* Name of the send queue: output.$index */
 	char name[40];
+
+	struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq,
gfp_t gfp_mask)
 	return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	int sent = 0;
+
+	while (sent < budget &&
+	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		u64_stats_update_begin(&stats->tx_syncp);
+		stats->tx_bytes += skb->len;
+		stats->tx_packets++;
+		u64_stats_update_end(&stats->tx_syncp);
+
+		dev_kfree_skb_any(skb);
+		sent++;
+	}
+
+	return sent;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
+	struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(vq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi_schedule_prep(&sq->napi)) {
+		virtqueue_disable_cb(vq);
+		virtqueue_disable_cb_urgent(vq);
+		__napi_schedule(&sq->napi);
+	}
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -772,7 +799,38 @@ again:
 	return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq +		container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int r, sent = 0;
+
+again:
+	__netif_tx_lock(txq, smp_processor_id());
+	sent += free_old_xmit_skbs(sq, budget - sent);
+
+	if (sent < budget) {
+		r = virtqueue_enable_cb_prepare_urgent(sq->vq);
+		napi_complete(napi);
+		__netif_tx_unlock(txq);
+		if (unlikely(virtqueue_poll(sq->vq, r)) &&
+		    napi_schedule_prep(napi)) {
+			virtqueue_disable_cb_urgent(sq->vq);
+			__napi_schedule(napi);
+			goto again;
+		}
+	} else {
+		__netif_tx_unlock(txq);
+	}
+
+	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+	return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 		virtnet_napi_enable(&vi->rq[i]);
+		napi_enable(&vi->sq[i].napi);
 	}
 
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		u64_stats_update_begin(&stats->tx_syncp);
-		stats->tx_bytes += skb->len;
-		stats->tx_packets++;
-		u64_stats_update_end(&stats->tx_syncp);
-
-		dev_kfree_skb_any(skb);
-	}
-}
-
-static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
+static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool urgent)
 {
 	struct skb_vnet_hdr *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -908,7 +948,43 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff
*skb)
 		sg_set_buf(sq->sg, hdr, hdr_len);
 		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
-	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
+	if (urgent)
+		return virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg,
+						   skb, GFP_ATOMIC);
+	else
+		return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
+					    GFP_ATOMIC);
+}
+
+static bool virtnet_skb_needs_intr(struct sk_buff *skb)
+{
+	union {
+		unsigned char *network;
+		struct iphdr *ipv4;
+		struct ipv6hdr *ipv6;
+	} hdr;
+	struct tcphdr *th = tcp_hdr(skb);
+	u16 payload_len;
+
+	hdr.network = skb_network_header(skb);
+
+	/* Only IPv4/IPv6 with TCP is supported */
+	if ((skb->protocol == htons(ETH_P_IP)) &&
+	    hdr.ipv4->protocol == IPPROTO_TCP) {
+		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
+			      th->doff * 4;
+	} else if ((skb->protocol == htons(ETH_P_IPV6) ||
+		   hdr.ipv6->nexthdr == IPPROTO_TCP)) {
+		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
+	} else {
+		return false;
+	}
+
+	/* We don't want to dealy packet with PUSH bit and pure ACK packet */
+	if (!th->psh && payload_len)
+		return true;
+
+	return false;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -916,13 +992,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct
net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
-	int err;
+	bool urgent = virtnet_skb_needs_intr(skb);
+	int err, qsize = virtqueue_get_vring_size(sq->vq);
 
+	virtqueue_disable_cb_urgent(sq->vq);
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq);
+	free_old_xmit_skbs(sq, qsize);
 
 	/* Try to transmit */
-	err = xmit_skb(sq, skb);
+	err = xmit_skb(sq, skb, urgent);
 
 	/* This should not happen! */
 	if (unlikely(err)) {
@@ -935,22 +1013,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct
net_device *dev)
 		return NETDEV_TX_OK;
 	}
 
-	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-	nf_reset(skb);
+	if (!urgent) {
+		skb_orphan(skb);
+		nf_reset(skb);
+	}
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 		netif_stop_subqueue(dev, qnum);
+		virtqueue_disable_cb_urgent(sq->vq);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq);
+			free_old_xmit_skbs(sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
+	} else if (virtqueue_enable_cb_urgent(sq->vq)) {
+		free_old_xmit_skbs(sq, qsize);
 	}
 
 	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
@@ -1132,8 +1214,10 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_disable(&vi->rq[i].napi);
+		napi_disable(&vi->sq[i].napi);
+	}
 
 	return 0;
 }
@@ -1452,8 +1536,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 {
 	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		netif_napi_del(&vi->rq[i].napi);
+		netif_napi_del(&vi->sq[i].napi);
+	}
 
 	kfree(vi->rq);
 	kfree(vi->sq);
@@ -1607,6 +1693,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
 		napi_hash_add(&vi->rq[i].napi);
+		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
@@ -1912,8 +2000,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	if (netif_running(vi->dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			napi_disable(&vi->sq[i].napi);
 			napi_hash_del(&vi->rq[i].napi);
 			netif_napi_del(&vi->rq[i].napi);
+			netif_napi_del(&vi->sq[i].napi);
 		}
 	}
 
@@ -1938,8 +2028,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(&vi->rq[i]);
+			napi_enable(&vi->sq[i].napi);
+		}
 	}
 
 	netif_device_attach(vi->dev);
-- 
1.8.3.1
Eric Dumazet
2014-Oct-11  14:48 UTC
[PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:> We free transmitted packets in ndo_start_xmit() in the past to get better > performance in the past. One side effect is that skb_orphan() needs to be > called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in > fact. For TCP protocol, this means several optimization could not work well > such as TCP small queue and auto corking. This can lead extra low > throughput of small packets stream. > > Thanks to the urgent descriptor support. This patch tries to solve this > issue by enable the tx interrupt selectively for stream packets. This means > we don't need to orphan TCP stream packets in ndo_start_xmit() but enable > tx interrupt for those packets. After we get tx interrupt, a tx napi was > scheduled to free those packets. > > With this method, sk_wmem_alloc of TCP socket were more accurate than in > the past which let TCP can batch more through TSQ and auto corking. > > Signed-off-by: Jason Wang <jasowang at redhat.com> > --- > drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 128 insertions(+), 36 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 5810841..b450fc4 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -72,6 +72,8 @@ struct send_queue { > > /* Name of the send queue: output.$index */ > char name[40]; > + > + struct napi_struct napi; > }; > > /* Internal representation of a receive virtqueue */ > @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask) > return p; > } > > +static int free_old_xmit_skbs(struct send_queue *sq, int budget) > +{ > + struct sk_buff *skb; > + unsigned int len; > + struct virtnet_info *vi = sq->vq->vdev->priv; > + struct virtnet_stats *stats = this_cpu_ptr(vi->stats); > + int sent = 0; > + > + while (sent < budget && > + (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) { > + pr_debug("Sent skb %p\n", skb); > + > + u64_stats_update_begin(&stats->tx_syncp); > + stats->tx_bytes += skb->len; > + stats->tx_packets++; > + u64_stats_update_end(&stats->tx_syncp); > + > + dev_kfree_skb_any(skb); > + sent++; > + } > +You could accumulate skb->len in a totlen var, and perform a single u64_stats_update_begin(&stats->tx_syncp); stats->tx_bytes += totlen; stats->tx_packets += sent; u64_stats_update_end(&stats->tx_syncp); after the loop.> + return sent; > +} > +...> + > +static bool virtnet_skb_needs_intr(struct sk_buff *skb) > +{ > + union { > + unsigned char *network; > + struct iphdr *ipv4; > + struct ipv6hdr *ipv6; > + } hdr; > + struct tcphdr *th = tcp_hdr(skb); > + u16 payload_len; > + > + hdr.network = skb_network_header(skb); > + > + /* Only IPv4/IPv6 with TCP is supported */Oh well, yet another packet flow dissector :) If most packets were caught by your implementation, you could use it for fast patj and fallback to skb_flow_dissect() for encapsulated traffic. struct flow_keys keys; if (!skb_flow_dissect(skb, &keys)) return false; if (keys.ip_proto != IPPROTO_TCP) return false; then check __skb_get_poff() how to get th, and check if there is some payload...> + if ((skb->protocol == htons(ETH_P_IP)) && > + hdr.ipv4->protocol == IPPROTO_TCP) { > + payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 - > + th->doff * 4; > + } else if ((skb->protocol == htons(ETH_P_IPV6) || > + hdr.ipv6->nexthdr == IPPROTO_TCP)) { > + payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4; > + } else { > + return false; > + } > + > + /* We don't want to dealy packet with PUSH bit and pure ACK packet */ > + if (!th->psh && payload_len) > + return true; > + > + return false; > }
Michael S. Tsirkin
2014-Oct-12  09:27 UTC
[PATCH net-next RFC 1/3] virtio: support for urgent descriptors
On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:> Below should be useful for some experiments Jason is doing. > I thought I'd send it out for early review/feedback. > > event idx feature allows us to defer interrupts until > a specific # of descriptors were used. > Sometimes it might be useful to get an interrupt after > a specific descriptor, regardless. > This adds a descriptor flag for this, and an API > to create an urgent output descriptor. > This is still an RFC: > we'll need a feature bit for drivers to detect this, > but we've run out of feature bits for virtio 0.X. > For experimentation purposes, drivers can assume > this is set, or add a driver-specific feature bit. > > Signed-off-by: Michael S. Tsirkin <mst at redhat.com> > Signed-off-by: Jason Wang <jasowang at redhat.com>I see that as compared to my original patch, you have added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT I don't think it's necessary, see below. As such, I think this patch should be split: - original patch adding support for urgent descriptors - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?> --- > drivers/virtio/virtio_ring.c | 75 +++++++++++++++++++++++++++++++++++++--- > include/linux/virtio.h | 14 ++++++++ > include/uapi/linux/virtio_ring.h | 5 ++- > 3 files changed, 89 insertions(+), 5 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 4d08f45a..a5188c6 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg, > > /* Set up an indirect table of descriptors and add it to the queue. */ > static inline int vring_add_indirect(struct vring_virtqueue *vq, > + bool urgent, > struct scatterlist *sgs[], > struct scatterlist *(*next) > (struct scatterlist *, unsigned int *), > @@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq, > /* Use a single buffer which doesn't continue */ > head = vq->free_head; > vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT; > + if (urgent) > + vq->vring.desc[head].flags |= VRING_DESC_F_URGENT; > vq->vring.desc[head].addr = virt_to_phys(desc); > /* kmemleak gives a false positive, as it's hidden by virt_to_phys */ > kmemleak_ignore(desc); > @@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq, > } > > static inline int virtqueue_add(struct virtqueue *_vq, > + bool urgent, > struct scatterlist *sgs[], > struct scatterlist *(*next) > (struct scatterlist *, unsigned int *), > @@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > /* If the host supports indirect descriptor tables, and we have multiple > * buffers, then go indirect. FIXME: tune this threshold */ > if (vq->indirect && total_sg > 1 && vq->vq.num_free) { > - head = vring_add_indirect(vq, sgs, next, total_sg, total_out, > + head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out, > total_in, > out_sgs, in_sgs, gfp); > if (likely(head >= 0)) > @@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq, > for (n = 0; n < out_sgs; n++) { > for (sg = sgs[n]; sg; sg = next(sg, &total_out)) { > vq->vring.desc[i].flags = VRING_DESC_F_NEXT; > + if (urgent) { > + vq->vring.desc[head].flags |= VRING_DESC_F_URGENT; > + urgent = false; > + } > vq->vring.desc[i].addr = sg_phys(sg); > vq->vring.desc[i].len = sg->length; > prev = i; > @@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq, > for (; n < (out_sgs + in_sgs); n++) { > for (sg = sgs[n]; sg; sg = next(sg, &total_in)) { > vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE; > + if (urgent) { > + vq->vring.desc[head].flags |= VRING_DESC_F_URGENT; > + urgent = false; > + } > vq->vring.desc[i].addr = sg_phys(sg); > vq->vring.desc[i].len = sg->length; > prev = i; > @@ -305,6 +317,8 @@ add_head: > > /** > * virtqueue_add_sgs - expose buffers to other end > + * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt > + * after this descriptor was completed > * @vq: the struct virtqueue we're talking about. > * @sgs: array of terminated scatterlists. > * @out_num: the number of scatterlists readable by other side > @@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq, > for (sg = sgs[i]; sg; sg = sg_next(sg)) > total_in++; > } > - return virtqueue_add(_vq, sgs, sg_next_chained, > + return virtqueue_add(_vq, false, sgs, sg_next_chained, > total_out, total_in, out_sgs, in_sgs, data, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_sgs); > @@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq, > void *data, > gfp_t gfp) > { > - return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp); > + return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_outbuf); > > /** > + * virtqueue_add_outbuf - expose output buffers to other end > + * in case virtqueue_enable_cb_delayed was called, cause an interrupt > + * after this descriptor was completed > + * @vq: the struct virtqueue we're talking about. > + * @sgs: array of scatterlists (need not be terminated!) > + * @num: the number of scatterlists readable by other side > + * @data: the token identifying the buffer. > + * @gfp: how to do memory allocations (if necessary). > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_outbuf_urgent(struct virtqueue *vq, > + struct scatterlist sg[], unsigned int num, > + void *data, > + gfp_t gfp) > +{ > + return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp); > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent); > + > +/** > * virtqueue_add_inbuf - expose input buffers to other end > * @vq: the struct virtqueue we're talking about. > * @sgs: array of scatterlists (need not be terminated!) > @@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq, > void *data, > gfp_t gfp) > { > - return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp); > + return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_inbuf); > > @@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq) > } > EXPORT_SYMBOL_GPL(virtqueue_disable_cb); > > +void virtqueue_disable_cb_urgent(struct virtqueue *_vq) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + > + vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT; > +} > +EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent); > + > /** > * virtqueue_enable_cb_prepare - restart callbacks after disable_cb > * @vq: the struct virtqueue we're talking about. > @@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq) > } > EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare); > > +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + u16 last_used_idx; > + > + START_USE(vq); > + vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT; > + last_used_idx = vq->last_used_idx; > + END_USE(vq); > + return last_used_idx; > +} > +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent); > +You can implement virtqueue_enable_cb_prepare_urgent simply by clearing ~VRING_AVAIL_F_NO_INTERRUPT. The effect is same: host sends interrupts only if there is an urgent descriptor.> /** > * virtqueue_poll - query pending used buffers > * @vq: the struct virtqueue we're talking about. > @@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq) > } > EXPORT_SYMBOL_GPL(virtqueue_enable_cb); > > +bool virtqueue_enable_cb_urgent(struct virtqueue *_vq) > +{ > + unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq); > + > + return !virtqueue_poll(_vq, last_used_idx); > +} > +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent); > + > /** > * virtqueue_enable_cb_delayed - restart callbacks after disable_cb. > * @vq: the struct virtqueue we're talking about. > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index b46671e..68be5f2 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -39,6 +39,12 @@ int virtqueue_add_outbuf(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +int virtqueue_add_outbuf_urgent(struct virtqueue *vq, > + struct scatterlist sg[], unsigned int num, > + void *data, > + gfp_t gfp); > + > + > int virtqueue_add_inbuf(struct virtqueue *vq, > struct scatterlist sg[], unsigned int num, > void *data, > @@ -61,10 +67,18 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len); > > void virtqueue_disable_cb(struct virtqueue *vq); > > +void virtqueue_disable_cb_urgent(struct virtqueue *vq); > + > bool virtqueue_enable_cb(struct virtqueue *vq); > > +bool virtqueue_enable_cb_urgent(struct virtqueue *vq); > + > +bool virtqueue_enable_cb_urgent(struct virtqueue *vq); > + > unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq); > > +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq); > + > bool virtqueue_poll(struct virtqueue *vq, unsigned); > > bool virtqueue_enable_cb_delayed(struct virtqueue *vq); > diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h > index a99f9b7..daf5bb0 100644 > --- a/include/uapi/linux/virtio_ring.h > +++ b/include/uapi/linux/virtio_ring.h > @@ -39,6 +39,9 @@ > #define VRING_DESC_F_WRITE 2 > /* This means the buffer contains a list of buffer descriptors. */ > #define VRING_DESC_F_INDIRECT 4 > +/* This means the descriptor should cause an interrupt > + * ignoring avail event idx. */ > +#define VRING_DESC_F_URGENT 8 > > /* The Host uses this in used->flags to advise the Guest: don't kick me when > * you add a buffer. It's unreliable, so it's simply an optimization. Guest > @@ -48,7 +51,7 @@ > * when you consume a buffer. It's unreliable, so it's simply an > * optimization. */ > #define VRING_AVAIL_F_NO_INTERRUPT 1 > - > +#define VRING_AVAIL_F_NO_URGENT_INTERRUPT 2 > /* We support indirect buffer descriptors */ > #define VIRTIO_RING_F_INDIRECT_DESC 28 > > -- > 1.8.3.1
David Miller
2014-Oct-14  18:53 UTC
[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
From: Jason Wang <jasowang at redhat.com> Date: Sat, 11 Oct 2014 15:16:43 +0800> We free old transmitted packets in ndo_start_xmit() currently, so any > packet must be orphaned also there. This was used to reduce the overhead of > tx interrupt to achieve better performance. But this may not work for some > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to > implement various optimization for small packets stream such as TCP small > queue and auto corking. But orphaning packets early in ndo_start_xmit() > disable such things more or less since sk_wmem_alloc was not accurate. This > lead extra low throughput for TCP stream of small writes. > > This series tries to solve this issue by enable tx interrupts for all TCP > packets other than the ones with push bit or pure ACK. This is done through > the support of urgent descriptor which can force an interrupt for a > specified packet. If tx interrupt was enabled for a packet, there's no need > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP > can batch more for small write. More larger skb was produced by TCP in this > case to improve both throughput and cpu utilization. > > Test shows great improvements on small write tcp streams. For most of the > other cases, the throughput and cpu utilization are the same in the > past. Only few cases, more cpu utilization was noticed which needs more > investigation. > > Review and comments are welcomed.I think proper accounting and queueing (at all levels, not just TCP sockets) is more important than trying to skim a bunch of cycles by avoiding TX interrupts. Having an event to free the SKB is absolutely essential for the stack to operate correctly. And with virtio-net you don't even have the excuse of "the HW unfortunately doesn't have an appropriate TX event." So please don't play games, and instead use TX interrupts all the time. You can mitigate them in various ways, but don't turn them on selectively based upon traffic type, that's terrible. You can even use ->xmit_more to defer the TX interrupt indication to the final TX packet in the chain.
Michael S. Tsirkin
2014-Oct-14  21:51 UTC
[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:> From: Jason Wang <jasowang at redhat.com> > Date: Sat, 11 Oct 2014 15:16:43 +0800 > > > We free old transmitted packets in ndo_start_xmit() currently, so any > > packet must be orphaned also there. This was used to reduce the overhead of > > tx interrupt to achieve better performance. But this may not work for some > > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to > > implement various optimization for small packets stream such as TCP small > > queue and auto corking. But orphaning packets early in ndo_start_xmit() > > disable such things more or less since sk_wmem_alloc was not accurate. This > > lead extra low throughput for TCP stream of small writes. > > > > This series tries to solve this issue by enable tx interrupts for all TCP > > packets other than the ones with push bit or pure ACK. This is done through > > the support of urgent descriptor which can force an interrupt for a > > specified packet. If tx interrupt was enabled for a packet, there's no need > > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled > > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP > > can batch more for small write. More larger skb was produced by TCP in this > > case to improve both throughput and cpu utilization. > > > > Test shows great improvements on small write tcp streams. For most of the > > other cases, the throughput and cpu utilization are the same in the > > past. Only few cases, more cpu utilization was noticed which needs more > > investigation. > > > > Review and comments are welcomed. > > I think proper accounting and queueing (at all levels, not just TCP > sockets) is more important than trying to skim a bunch of cycles by > avoiding TX interrupts. > > Having an event to free the SKB is absolutely essential for the stack > to operate correctly. > > And with virtio-net you don't even have the excuse of "the HW > unfortunately doesn't have an appropriate TX event." > > So please don't play games, and instead use TX interrupts all the > time. You can mitigate them in various ways, but don't turn them on > selectively based upon traffic type, that's terrible.This got me thinking: how about using virtqueue_enable_cb_delayed for this mitigation? It's pretty easy to implement - I'll send a proof of concept patch separately.
Michael S. Tsirkin
2014-Oct-14  21:51 UTC
[PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
On Sat, Oct 11, 2014 at 03:16:46PM +0800, Jason Wang wrote:> We free transmitted packets in ndo_start_xmit() in the past to get better > performance in the past. One side effect is that skb_orphan() needs to be > called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in > fact. For TCP protocol, this means several optimization could not work well > such as TCP small queue and auto corking. This can lead extra low > throughput of small packets stream. > > Thanks to the urgent descriptor support. This patch tries to solve this > issue by enable the tx interrupt selectively for stream packets. This means > we don't need to orphan TCP stream packets in ndo_start_xmit() but enable > tx interrupt for those packets. After we get tx interrupt, a tx napi was > scheduled to free those packets. > > With this method, sk_wmem_alloc of TCP socket were more accurate than in > the past which let TCP can batch more through TSQ and auto corking. > > Signed-off-by: Jason Wang <jasowang at redhat.com> > --- > drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 128 insertions(+), 36 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 5810841..b450fc4 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -72,6 +72,8 @@ struct send_queue { > > /* Name of the send queue: output.$index */ > char name[40]; > + > + struct napi_struct napi; > }; > > /* Internal representation of a receive virtqueue */ > @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask) > return p; > } > > +static int free_old_xmit_skbs(struct send_queue *sq, int budget) > +{ > + struct sk_buff *skb; > + unsigned int len; > + struct virtnet_info *vi = sq->vq->vdev->priv; > + struct virtnet_stats *stats = this_cpu_ptr(vi->stats); > + int sent = 0; > + > + while (sent < budget && > + (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) { > + pr_debug("Sent skb %p\n", skb); > + > + u64_stats_update_begin(&stats->tx_syncp); > + stats->tx_bytes += skb->len; > + stats->tx_packets++; > + u64_stats_update_end(&stats->tx_syncp); > + > + dev_kfree_skb_any(skb); > + sent++; > + } > + > + return sent; > +} > + > static void skb_xmit_done(struct virtqueue *vq) > { > struct virtnet_info *vi = vq->vdev->priv; > + struct send_queue *sq = &vi->sq[vq2txq(vq)]; > > - /* Suppress further interrupts. */ > - virtqueue_disable_cb(vq); > - > - /* We were probably waiting for more output buffers. */ > - netif_wake_subqueue(vi->dev, vq2txq(vq)); > + if (napi_schedule_prep(&sq->napi)) { > + virtqueue_disable_cb(vq); > + virtqueue_disable_cb_urgent(vq);This disable_cb is no longer safe in xmit_done callback, since queue can be running at the same time. You must do it under tx lock. And yes, this likely will not work work well without event_idx. We'll probably need extra synchronization for such old hosts.> + __napi_schedule(&sq->napi); > + } > } > > static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx) > @@ -772,7 +799,38 @@ again: > return received; > } > > +static int virtnet_poll_tx(struct napi_struct *napi, int budget) > +{ > + struct send_queue *sq > + container_of(napi, struct send_queue, napi); > + struct virtnet_info *vi = sq->vq->vdev->priv; > + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq)); > + unsigned int r, sent = 0; > + > +again: > + __netif_tx_lock(txq, smp_processor_id()); > + sent += free_old_xmit_skbs(sq, budget - sent); > + > + if (sent < budget) { > + r = virtqueue_enable_cb_prepare_urgent(sq->vq); > + napi_complete(napi); > + __netif_tx_unlock(txq); > + if (unlikely(virtqueue_poll(sq->vq, r)) && > + napi_schedule_prep(napi)) { > + virtqueue_disable_cb_urgent(sq->vq); > + __napi_schedule(napi); > + goto again; > + } > + } else { > + __netif_tx_unlock(txq); > + } > + > + netif_wake_subqueue(vi->dev, vq2txq(sq->vq)); > + return sent; > +} > + > #ifdef CONFIG_NET_RX_BUSY_POLL > + > /* must be called with local_bh_disable()d */ > static int virtnet_busy_poll(struct napi_struct *napi) > { > @@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev) > if (!try_fill_recv(&vi->rq[i], GFP_KERNEL)) > schedule_delayed_work(&vi->refill, 0); > virtnet_napi_enable(&vi->rq[i]); > + napi_enable(&vi->sq[i].napi); > } > > return 0; > } > > -static void free_old_xmit_skbs(struct send_queue *sq) > -{ > - struct sk_buff *skb; > - unsigned int len; > - struct virtnet_info *vi = sq->vq->vdev->priv; > - struct virtnet_stats *stats = this_cpu_ptr(vi->stats); > - > - while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) { > - pr_debug("Sent skb %p\n", skb); > - > - u64_stats_update_begin(&stats->tx_syncp); > - stats->tx_bytes += skb->len; > - stats->tx_packets++; > - u64_stats_update_end(&stats->tx_syncp); > - > - dev_kfree_skb_any(skb); > - } > -} > - > -static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) > +static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool urgent) > { > struct skb_vnet_hdr *hdr; > const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; > @@ -908,7 +948,43 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) > sg_set_buf(sq->sg, hdr, hdr_len); > num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1; > } > - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC); > + if (urgent) > + return virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg, > + skb, GFP_ATOMIC); > + else > + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, > + GFP_ATOMIC); > +} > + > +static bool virtnet_skb_needs_intr(struct sk_buff *skb) > +{ > + union { > + unsigned char *network; > + struct iphdr *ipv4; > + struct ipv6hdr *ipv6; > + } hdr; > + struct tcphdr *th = tcp_hdr(skb); > + u16 payload_len; > + > + hdr.network = skb_network_header(skb); > + > + /* Only IPv4/IPv6 with TCP is supported */ > + if ((skb->protocol == htons(ETH_P_IP)) && > + hdr.ipv4->protocol == IPPROTO_TCP) { > + payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 - > + th->doff * 4; > + } else if ((skb->protocol == htons(ETH_P_IPV6) || > + hdr.ipv6->nexthdr == IPPROTO_TCP)) { > + payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4; > + } else { > + return false; > + } > + > + /* We don't want to dealy packet with PUSH bit and pure ACK packet */ > + if (!th->psh && payload_len) > + return true; > + > + return false; > } > > static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) > @@ -916,13 +992,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) > struct virtnet_info *vi = netdev_priv(dev); > int qnum = skb_get_queue_mapping(skb); > struct send_queue *sq = &vi->sq[qnum]; > - int err; > + bool urgent = virtnet_skb_needs_intr(skb); > + int err, qsize = virtqueue_get_vring_size(sq->vq); > > + virtqueue_disable_cb_urgent(sq->vq); > /* Free up any pending old buffers before queueing new ones. */ > - free_old_xmit_skbs(sq); > + free_old_xmit_skbs(sq, qsize); > > /* Try to transmit */ > - err = xmit_skb(sq, skb); > + err = xmit_skb(sq, skb, urgent); > > /* This should not happen! */ > if (unlikely(err)) { > @@ -935,22 +1013,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) > return NETDEV_TX_OK; > } > > - /* Don't wait up for transmitted skbs to be freed. */ > - skb_orphan(skb); > - nf_reset(skb); > + if (!urgent) { > + skb_orphan(skb); > + nf_reset(skb); > + } > > /* Apparently nice girls don't return TX_BUSY; stop the queue > * before it gets out of hand. Naturally, this wastes entries. */ > if (sq->vq->num_free < 2+MAX_SKB_FRAGS) { > netif_stop_subqueue(dev, qnum); > + virtqueue_disable_cb_urgent(sq->vq); > if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) { > /* More just got used, free them then recheck. */ > - free_old_xmit_skbs(sq); > + free_old_xmit_skbs(sq, qsize); > if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) { > netif_start_subqueue(dev, qnum); > virtqueue_disable_cb(sq->vq); > } > } > + } else if (virtqueue_enable_cb_urgent(sq->vq)) { > + free_old_xmit_skbs(sq, qsize); > } > > if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more) > @@ -1132,8 +1214,10 @@ static int virtnet_close(struct net_device *dev) > /* Make sure refill_work doesn't re-enable napi! */ > cancel_delayed_work_sync(&vi->refill); > > - for (i = 0; i < vi->max_queue_pairs; i++) > + for (i = 0; i < vi->max_queue_pairs; i++) { > napi_disable(&vi->rq[i].napi); > + napi_disable(&vi->sq[i].napi); > + } > > return 0; > } > @@ -1452,8 +1536,10 @@ static void virtnet_free_queues(struct virtnet_info *vi) > { > int i; > > - for (i = 0; i < vi->max_queue_pairs; i++) > + for (i = 0; i < vi->max_queue_pairs; i++) { > netif_napi_del(&vi->rq[i].napi); > + netif_napi_del(&vi->sq[i].napi); > + } > > kfree(vi->rq); > kfree(vi->sq); > @@ -1607,6 +1693,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi) > netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll, > napi_weight); > napi_hash_add(&vi->rq[i].napi); > + netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx, > + napi_weight); > > sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg)); > ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT); > @@ -1912,8 +2000,10 @@ static int virtnet_freeze(struct virtio_device *vdev) > if (netif_running(vi->dev)) { > for (i = 0; i < vi->max_queue_pairs; i++) { > napi_disable(&vi->rq[i].napi); > + napi_disable(&vi->sq[i].napi); > napi_hash_del(&vi->rq[i].napi); > netif_napi_del(&vi->rq[i].napi); > + netif_napi_del(&vi->sq[i].napi); > } > } > > @@ -1938,8 +2028,10 @@ static int virtnet_restore(struct virtio_device *vdev) > if (!try_fill_recv(&vi->rq[i], GFP_KERNEL)) > schedule_delayed_work(&vi->refill, 0); > > - for (i = 0; i < vi->max_queue_pairs; i++) > + for (i = 0; i < vi->max_queue_pairs; i++) { > virtnet_napi_enable(&vi->rq[i]); > + napi_enable(&vi->sq[i].napi); > + } > } > > netif_device_attach(vi->dev); > -- > 1.8.3.1
Michael S. Tsirkin
2014-Oct-14  23:06 UTC
[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:> From: Jason Wang <jasowang at redhat.com> > Date: Sat, 11 Oct 2014 15:16:43 +0800 > > > We free old transmitted packets in ndo_start_xmit() currently, so any > > packet must be orphaned also there. This was used to reduce the overhead of > > tx interrupt to achieve better performance. But this may not work for some > > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to > > implement various optimization for small packets stream such as TCP small > > queue and auto corking. But orphaning packets early in ndo_start_xmit() > > disable such things more or less since sk_wmem_alloc was not accurate. This > > lead extra low throughput for TCP stream of small writes. > > > > This series tries to solve this issue by enable tx interrupts for all TCP > > packets other than the ones with push bit or pure ACK. This is done through > > the support of urgent descriptor which can force an interrupt for a > > specified packet. If tx interrupt was enabled for a packet, there's no need > > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled > > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP > > can batch more for small write. More larger skb was produced by TCP in this > > case to improve both throughput and cpu utilization. > > > > Test shows great improvements on small write tcp streams. For most of the > > other cases, the throughput and cpu utilization are the same in the > > past. Only few cases, more cpu utilization was noticed which needs more > > investigation. > > > > Review and comments are welcomed. > > I think proper accounting and queueing (at all levels, not just TCP > sockets) is more important than trying to skim a bunch of cycles by > avoiding TX interrupts. > > Having an event to free the SKB is absolutely essential for the stack > to operate correctly. > > And with virtio-net you don't even have the excuse of "the HW > unfortunately doesn't have an appropriate TX event." > > So please don't play games, and instead use TX interrupts all the > time. You can mitigate them in various ways, but don't turn them on > selectively based upon traffic type, that's terrible. > > You can even use ->xmit_more to defer the TX interrupt indication to > the final TX packet in the chain.I guess we can just defer the kick, interrupt will naturally be deferred as well. This should solve the problem for old hosts as well. We'll also need to implement bql for this. Something like the below? Completely untested - posting here to see if I figured the API out correctly. Has to be applied on top of the previous patch. --- virtio_net: bql + xmit_more Signed-off-by: Michael S. Tsirkin <mst at redhat.com> --- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 62c059d..c245047 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -213,13 +213,15 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask) return p; } -static int free_old_xmit_skbs(struct send_queue *sq, int budget) +static int free_old_xmit_skbs(struct netdev_queue *txq, + struct send_queue *sq, int budget) { struct sk_buff *skb; unsigned int len; struct virtnet_info *vi = sq->vq->vdev->priv; struct virtnet_stats *stats = this_cpu_ptr(vi->stats); int sent = 0; + unsigned int bytes = 0; while (sent < budget && (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) { @@ -227,6 +229,7 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget) u64_stats_update_begin(&stats->tx_syncp); stats->tx_bytes += skb->len; + bytes += skb->len; stats->tx_packets++; u64_stats_update_end(&stats->tx_syncp); @@ -234,6 +237,8 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget) sent++; } + netdev_tx_completed_queue(txq, sent, bytes); + return sent; } @@ -802,7 +807,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) again: __netif_tx_lock(txq, smp_processor_id()); virtqueue_disable_cb(sq->vq); - sent += free_old_xmit_skbs(sq, budget - sent); + sent += free_old_xmit_skbs(txq, sq, budget - sent); if (sent < budget) { r = virtqueue_enable_cb_prepare(sq->vq); @@ -951,6 +956,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) int qnum = skb_get_queue_mapping(skb); struct send_queue *sq = &vi->sq[qnum]; int err, qsize = virtqueue_get_vring_size(sq->vq); + struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum); + bool kick = !skb->xmit_more || netif_xmit_stopped(txq); + unsigned int bytes = skb->len; virtqueue_disable_cb(sq->vq); @@ -967,7 +975,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) dev_kfree_skb_any(skb); return NETDEV_TX_OK; } - virtqueue_kick(sq->vq); + + netdev_tx_sent_queue(txq, bytes); + + if (kick) + virtqueue_kick(sq->vq); /* Apparently nice girls don't return TX_BUSY; stop the queue * before it gets out of hand. Naturally, this wastes entries. */ @@ -975,14 +987,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) netif_stop_subqueue(dev, qnum); if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) { /* More just got used, free them then recheck. */ - free_old_xmit_skbs(sq, qsize); + free_old_xmit_skbs(txq, sq, qsize); if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) { netif_start_subqueue(dev, qnum); virtqueue_disable_cb(sq->vq); } } } else if (virtqueue_enable_cb_delayed(sq->vq)) { - free_old_xmit_skbs(sq, qsize); + free_old_xmit_skbs(txq, sq, qsize); } return NETDEV_TX_OK;
Rusty Russell
2014-Oct-15  05:40 UTC
[PATCH net-next RFC 1/3] virtio: support for urgent descriptors
Jason Wang <jasowang at redhat.com> writes:> Below should be useful for some experiments Jason is doing. > I thought I'd send it out for early review/feedback. > > event idx feature allows us to defer interrupts until > a specific # of descriptors were used. > Sometimes it might be useful to get an interrupt after > a specific descriptor, regardless. > This adds a descriptor flag for this, and an API > to create an urgent output descriptor. > This is still an RFC: > we'll need a feature bit for drivers to detect this, > but we've run out of feature bits for virtio 0.X. > For experimentation purposes, drivers can assume > this is set, or add a driver-specific feature bit. > > Signed-off-by: Michael S. Tsirkin <mst at redhat.com> > Signed-off-by: Jason Wang <jasowang at redhat.com>The new VRING_DESC_F_URGENT bit is theoretically nicer, but for networking (which tends to take packets in order) couldn't we just set the event counter to give us a tx interrupt at the packet we want? Cheers, Rusty.
Maybe Matching Threads
- [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
- [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
- [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
- [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
- [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt