Stefan Hajnoczi
2021-May-20 14:13 UTC
[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
Request completion latency can be reduced by using polling instead of
irqs. Even Posted Interrupts or similar hardware support doesn't beat
polling. The reason is that disabling virtqueue notifications saves
critical-path CPU cycles on the host by skipping irq injection and in
the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
support to virtio_blk.

The approach taken by this patch differs from the NVMe driver's
approach. NVMe dedicates hardware queues to polling and submits
REQ_HIPRI requests only on those queues. This patch does not require
exclusive polling queues for virtio_blk. Instead, it switches between
irqs and polling when one or more REQ_HIPRI requests are in flight on a
virtqueue.

This is possible because toggling virtqueue notifications is cheap even
while the virtqueue is running. NVMe CQs can't do this because irqs are
only enabled/disabled at queue creation time.

This toggling approach requires no configuration. There is no need to
dedicate queues ahead of time or to teach users and orchestration tools
how to set up polling queues.

Possible drawbacks of this approach:

- Hardware virtio_blk implementations may find virtqueue_disable_cb()
  expensive since it requires DMA. If such devices become popular then
  the virtio_blk driver could use a similar approach to NVMe when
  VIRTIO_F_ACCESS_PLATFORM is detected in the future.

- If a blk_poll() thread is descheduled it not only hurts polling
  performance but also delays completion of non-REQ_HIPRI requests on
  that virtqueue since vq notifications are disabled.

Performance:

- Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
- Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
- Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

All numbers are IOPS:

rw         bs    hipri=0  hipri=1
---------------------------------------
randread   4k    149,426  170,763  +14%
randread   16k   118,939  134,269  +12%
randread   64k    34,886   34,906    0%
randread   128k   17,655   17,667    0%
randwrite  4k    138,578  163,600  +18%
randwrite  16k   102,089  120,950  +18%
randwrite  64k    32,364   32,561    0%
randwrite  128k   16,154   16,237    0%
read       4k    146,032  170,620  +16%
read       16k   117,097  130,437  +11%
read       64k    34,834   35,037    0%
read       128k   17,680   17,658    0%
write      4k    134,562  151,422  +12%
write      16k   101,796  107,606   +5%
write      64k    32,364   32,594    0%
write      128k   16,259   16,265    0%

Larger block sizes do not benefit from polling as much but the
improvement is worthwhile for smaller block sizes.

Signed-off-by: Stefan Hajnoczi <stefanha at redhat.com>
---
 drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fc0fb1dcd399..f0243dcd745a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
 struct virtio_blk_vq {
         struct virtqueue *vq;
         spinlock_t lock;
+
+        /* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
+        unsigned int num_lopri;
+
+        /* Number of REQ_HIPRI requests in flight. Protected by lock. */
+        unsigned int num_hipri;
+
+        /* Are vq notifications enabled? Protected by lock. */
+        bool cb_enabled;
+
         char name[VQ_NAME_LEN];
 } ____cacheline_aligned_in_smp;

@@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request *req)
         blk_mq_end_request(req, virtblk_result(vbr));
 }

-static void virtblk_done(struct virtqueue *vq)
+/* Returns true if one or more requests completed */
+static bool virtblk_complete_requests(struct virtqueue *vq)
 {
         struct virtio_blk *vblk = vq->vdev->priv;
         struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
         bool req_done = false;
+        bool last_hipri_done = false;
         struct virtblk_req *vbr;
         unsigned long flags;
         unsigned int len;

         spin_lock_irqsave(&vbq->lock, flags);
+
         do {
-                virtqueue_disable_cb(vq);
+                if (vbq->cb_enabled)
+                        virtqueue_disable_cb(vq);
                 while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
                         struct request *req = blk_mq_rq_from_pdu(vbr);

+                        if (req->cmd_flags & REQ_HIPRI) {
+                                if (--vbq->num_hipri == 0)
+                                        last_hipri_done = true;
+                        } else
+                                vbq->num_lopri--;
+
                         if (likely(!blk_should_fake_timeout(req->q)))
                                 blk_mq_complete_request(req);
                         req_done = true;
                 }
                 if (unlikely(virtqueue_is_broken(vq)))
                         break;
-        } while (!virtqueue_enable_cb(vq));
+
+                /* Enable vq notifications if non-polled requests remain */
+                if (last_hipri_done && vbq->num_lopri > 0) {
+                        last_hipri_done = false;
+                        vbq->cb_enabled = true;
+                }
+        } while (vbq->cb_enabled && !virtqueue_enable_cb(vq));

         /* In case queue is stopped waiting for more buffers. */
         if (req_done)
                 blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
         spin_unlock_irqrestore(&vbq->lock, flags);
+
+        return req_done;
+}
+
+static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
+{
+        struct virtio_blk *vblk = hctx->queue->queuedata;
+        struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
+
+        if (!virtqueue_more_used(vq))
+                return 0;
+
+        return virtblk_complete_requests(vq);
+}
+
+static void virtblk_done(struct virtqueue *vq)
+{
+        virtblk_complete_requests(vq);
 }

 static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
@@ -275,6 +319,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
         }

         spin_lock_irqsave(&vbq->lock, flags);
+
+        /* Re-enable vq notifications if first req is non-polling */
+        if (!(req->cmd_flags & REQ_HIPRI) &&
+            vbq->num_lopri == 0 && vbq->num_hipri == 0 &&
+            !vbq->cb_enabled) {
+                /* Can't return false since there are no in-flight reqs */
+                virtqueue_enable_cb(vbq->vq);
+                vbq->cb_enabled = true;
+        }
+
         err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
         if (err) {
                 virtqueue_kick(vbq->vq);
@@ -294,6 +348,21 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
                 }
         }

+        /*
+         * Disable vq notifications when polled reqs are submitted.
+         *
+         * The virtqueue lock is held so req is still valid here even if the
+         * device polls the virtqueue and completes the request before we call
+         * virtqueue_notify().
+         */
+        if (req->cmd_flags & REQ_HIPRI) {
+                if (vbq->num_hipri++ == 0 && vbq->cb_enabled) {
+                        virtqueue_disable_cb(vbq->vq);
+                        vbq->cb_enabled = false;
+                }
+        } else
+                vbq->num_lopri++;
+
         if (bd->last && virtqueue_kick_prepare(vbq->vq))
                 notify = true;
         spin_unlock_irqrestore(&vbq->lock, flags);
@@ -533,6 +602,9 @@ static int init_vq(struct virtio_blk *vblk)
         for (i = 0; i < num_vqs; i++) {
                 spin_lock_init(&vblk->vqs[i].lock);
                 vblk->vqs[i].vq = vqs[i];
+                vblk->vqs[i].num_lopri = 0;
+                vblk->vqs[i].num_hipri = 0;
+                vblk->vqs[i].cb_enabled = true;
         }
         vblk->num_vqs = num_vqs;

@@ -681,8 +753,16 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
 {
         struct virtio_blk *vblk = set->driver_data;

-        return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
-                                        vblk->vdev, 0);
+        set->map[HCTX_TYPE_DEFAULT].nr_queues = vblk->num_vqs;
+        blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT], vblk->vdev, 0);
+
+        set->map[HCTX_TYPE_READ].nr_queues = 0;
+
+        /* HCTX_TYPE_DEFAULT queues are shared with HCTX_TYPE_POLL */
+        set->map[HCTX_TYPE_POLL].nr_queues = vblk->num_vqs;
+        blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_POLL], vblk->vdev, 0);
+
+        return 0;
 }

 static const struct blk_mq_ops virtio_mq_ops = {
@@ -691,6 +771,7 @@ static const struct blk_mq_ops virtio_mq_ops = {
         .complete = virtblk_request_done,
         .init_request = virtblk_init_request,
         .map_queues = virtblk_map_queues,
+        .poll = virtblk_poll,
 };

 static unsigned int virtblk_queue_depth;
@@ -768,6 +849,7 @@ static int virtblk_probe(struct virtio_device *vdev)

         memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
         vblk->tag_set.ops = &virtio_mq_ops;
+        vblk->tag_set.nr_maps = 3; /* default, read, and poll */
         vblk->tag_set.queue_depth = queue_depth;
         vblk->tag_set.numa_node = NUMA_NO_NODE;
         vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
--
2.31.1
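For anyone who wants to try the series: fio's pvsync2 engine with
hipri=1 generates these REQ_HIPRI requests via preadv2() with RWF_HIPRI.
A minimal standalone reproducer could look like the sketch below. The
device path and I/O size are illustrative assumptions, and polling must
first be enabled on the queue (echo 1 > /sys/block/vda/queue/io_poll);
polled I/O also requires O_DIRECT.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
        void *buf;
        struct iovec iov;
        int fd = open("/dev/vda", O_RDONLY | O_DIRECT);

        if (fd < 0)
                return 1;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        iov.iov_base = buf;
        iov.iov_len = 4096;

        /* RWF_HIPRI asks the kernel to complete this read by polling
         * (blk_poll()) rather than sleeping on an irq. */
        return preadv2(fd, &iov, 1, 0, RWF_HIPRI) == 4096 ? 0 : 1;
}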
Christoph Hellwig
2021-May-24 14:59 UTC
[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> Possible drawbacks of this approach:
>
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA. If such devices become popular then
>   the virtio_blk driver could use a similar approach to NVMe when
>   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>
> - If a blk_poll() thread is descheduled it not only hurts polling
>   performance but also delays completion of non-REQ_HIPRI requests on
>   that virtqueue since vq notifications are disabled.

Yes, I think this is a dangerous configuration. What argument exists
against just using dedicated poll queues?
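For reference, the dedicated-poll-queue design being pointed at here
would look roughly like the sketch below in virtio_blk terms, modeled on
NVMe. poll_queues is a hypothetical module parameter (no such knob
exists in this series): the last poll_queues virtqueues would carry only
REQ_HIPRI requests and never have notifications enabled.

static unsigned int poll_queues; /* hypothetical module parameter */

static int virtblk_map_queues(struct blk_mq_tag_set *set)
{
        struct virtio_blk *vblk = set->driver_data;
        unsigned int num_irq_vqs = vblk->num_vqs - poll_queues;

        /* irq-driven vqs keep following the device-provided affinity */
        set->map[HCTX_TYPE_DEFAULT].nr_queues = num_irq_vqs;
        blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT], vblk->vdev, 0);

        set->map[HCTX_TYPE_READ].nr_queues = 0;

        /* Dedicated poll vqs: irqs are never enabled for these, so the
         * device never needs to check the suppression state. */
        set->map[HCTX_TYPE_POLL].nr_queues = poll_queues;
        set->map[HCTX_TYPE_POLL].queue_offset = num_irq_vqs;
        blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);

        return 0;
}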
Jason Wang
[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> Request completion latency can be reduced by using polling instead of
> irqs. Even Posted Interrupts or similar hardware support doesn't beat
> polling. The reason is that disabling virtqueue notifications saves
> critical-path CPU cycles on the host by skipping irq injection and in
> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> support to virtio_blk.
>
> [...]
>
> Possible drawbacks of this approach:
>
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA.

Note that it's probably not related to the behavior of the driver but to
the design of the event suppression mechanism. The device can choose to
ignore the suppression flag and keep sending interrupts.

> If such devices become popular then
>   the virtio_blk driver could use a similar approach to NVMe when
>   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>
> - If a blk_poll() thread is descheduled it not only hurts polling
>   performance but also delays completion of non-REQ_HIPRI requests on
>   that virtqueue since vq notifications are disabled.

Can we poll only when only high-pri requests are pending? If the backend
is a remote one, I think the polling may cost more CPU cycles.

> [...]
>
> +        /* Are vq notifications enabled? Protected by lock. */
> +        bool cb_enabled;

We had event_flag_shadow; is it sufficient to introduce a new helper
virtqueue_cb_is_enabled()?

> [...]
>
> +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
> +{
> +        struct virtio_blk *vblk = hctx->queue->queuedata;
> +        struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
> +
> +        if (!virtqueue_more_used(vq))

I'm not familiar with block polling, but what happens if a buffer is
made available after virtqueue_more_used() returns false here?

Thanks
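For context, a helper along the lines suggested above could be a
one-liner in drivers/virtio/virtio_ring.c, where to_vvq() and struct
vring_virtqueue are in scope. This is a hypothetical sketch (no such
helper exists in mainline), reusing the split ring's existing shadow of
the avail flags; the packed ring would test its event_flags_shadow the
same way.

/* Hypothetical helper: report whether vq notifications are enabled,
 * replacing the driver-side cb_enabled bool in this patch. */
bool virtqueue_cb_is_enabled(struct virtqueue *_vq)
{
        struct vring_virtqueue *vq = to_vvq(_vq);

        return !(vq->split.avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT);
}
EXPORT_SYMBOL_GPL(virtqueue_cb_is_enabled);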
Ming Lei
2021-May-27 02:44 UTC
[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> Request completion latency can be reduced by using polling instead of
> irqs. Even Posted Interrupts or similar hardware support doesn't beat
> polling. The reason is that disabling virtqueue notifications saves
> critical-path CPU cycles on the host by skipping irq injection and in
> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> support to virtio_blk.
>
> [...]
>
> This toggling approach requires no configuration. There is no need to
> dedicate queues ahead of time or to teach users and orchestration tools
> how to set up polling queues.

This approach looks good, and very neat thanks to the per-vq lock.

BTW, is there any virt-exit saved by disabling the vq interrupt? I
understand there isn't, since a virt-exit may only be involved in remote
completion via sending an IPI.

> Possible drawbacks of this approach:
>
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA. If such devices become popular then

You mean the hardware needs to consider ordering between DMA completion
and the interrupt notification? But this is disabling notifications: the
guest just calls virtqueue_get_buf() to see if a buffer is available,
and if not, it polls again.

> [...]
>
> Performance:
>
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)

4 jobs can consume all 4 vCPUs. I just ran a quick fio test with
'ioengine=io_uring --numjobs=1' on a single vq, and IOPS is improved by
~20% (hipri=1 vs hipri=0) with the 3 patches, with the virtio-blk device
still backed by an NVMe SSD.

Thanks,
Ming
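The io_uring test described above drives the same blk_poll() path
through IORING_SETUP_IOPOLL. A minimal sketch of such a test using
liburing follows; the device path, queue depth, and I/O size are
assumptions, and IOPOLL rings require the file to be opened O_DIRECT.

#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        void *buf;
        int fd = open("/dev/vda", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;
        if (io_uring_queue_init(64, &ring, IORING_SETUP_IOPOLL))
                return 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        io_uring_submit(&ring);

        /* Completion is reaped by spinning in blk_poll() on the vq
         * instead of sleeping until an irq arrives. */
        if (io_uring_wait_cqe(&ring, &cqe))
                return 1;
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}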
Stefan Hajnoczi
2021-Jun-03 15:12 UTC
[PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
On Thu, May 27, 2021 at 10:44:51AM +0800, Ming Lei wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > [...]
> >
> > This toggling approach requires no configuration. There is no need to
> > dedicate queues ahead of time or to teach users and orchestration tools
> > how to set up polling queues.
>
> This approach looks good, and very neat thanks to the per-vq lock.
>
> BTW, is there any virt-exit saved by disabling the vq interrupt? I
> understand there isn't, since a virt-exit may only be involved in remote
> completion via sending an IPI.

This patch doesn't eliminate vmexits. QEMU already has virtqueue polling
code that disables the vq notification (the virtio-pci hardware register
write that causes a vmexit). However, when both the guest driver and the
emulated device are polling, there are no vmexits or interrupt
injections with this patch.

> > Possible drawbacks of this approach:
> >
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >   expensive since it requires DMA. If such devices become popular then
>
> You mean the hardware needs to consider ordering between DMA completion
> and the interrupt notification? But this is disabling notifications: the
> guest just calls virtqueue_get_buf() to see if a buffer is available,
> and if not, it polls again.

Software devices have cheap access to guest RAM for looking at the
virtqueue_disable_cb() state before injecting an irq. Hardware devices
need to perform a DMA transaction to read that state. They have to do
this every time they want to raise an irq because the guest driver may
have changed the value. I'm not sure if the DMA overhead is acceptable.

This problem is not introduced by this patch, it's a VIRTIO spec design
issue. I was trying to express that dedicated polling queues would avoid
the DMA since the device knows that irqs are never needed for this
virtqueue.

> > Performance:
> >
> > - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> > - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
>
> 4 jobs can consume all 4 vCPUs. I just ran a quick fio test with
> 'ioengine=io_uring --numjobs=1' on a single vq, and IOPS is improved by
> ~20% (hipri=1 vs hipri=0) with the 3 patches, with the virtio-blk device
> still backed by an NVMe SSD.

Nice, thank you for sharing the data!

Stefan
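To make the software-vs-hardware distinction above concrete: before
injecting an irq, a device checks the suppression state the driver wrote
via virtqueue_disable_cb(). A software device such as QEMU reads it
straight out of guest RAM, as in this simplified split-ring sketch
(ignoring VIRTIO_RING_F_EVENT_IDX); a hardware device would need a DMA
read to perform the same check.

#include <stdbool.h>
#include <stdint.h>

#define VRING_AVAIL_F_NO_INTERRUPT 1

struct vring_avail {
        uint16_t flags; /* driver sets VRING_AVAIL_F_NO_INTERRUPT here */
        uint16_t idx;
        uint16_t ring[];
};

/* Device side: decide whether to inject an irq after using a buffer.
 * For a software device this is a plain load from guest memory. */
static bool device_should_notify(const struct vring_avail *avail)
{
        return !(avail->flags & VRING_AVAIL_F_NO_INTERRUPT);
}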