thr3ads.net - Linux Virtualization - [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap() [Mar 2019]

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

This series tries to access virtqueue metadata through a kernel virtual
address instead of the copy_user() friends, since those carry too much
overhead: checks, speculation barriers, and even hardware feature
toggling. This is done by setting up a kernel address through vmap() and
registering an MMU notifier for invalidation.

Tests show about a 24% improvement in TX PPS. TCP_STREAM doesn't see an
obvious improvement.

Thanks

Changes from V4:
- use invalidate_range() instead of invalidate_range_start()
- track dirty pages
Changes from V3:
- don't try to use vmap for file backed pages
- rebase to master
Changes from V2:
- fix buggy range overlapping check
- tear down MMU notifier during vhost ioctl to make sure invalidation
  request can read metadata userspace address and vq size without
  holding vq mutex.
Changes from V1:
- instead of pinning pages, use MMU notifier to invalidate vmaps and
  remap during metadata prefetch
- fix build warning on MIPS

Jason Wang (5):
  vhost: generalize adding used elem
  vhost: fine grain userspace memory accessors
  vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
  vhost: introduce helpers to get the size of metadata area
  vhost: access vq metadata through kernel virtual address

 drivers/vhost/net.c   |   6 +-
 drivers/vhost/vhost.c | 434 ++++++++++++++++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h |  18 ++-
 3 files changed, 407 insertions(+), 51 deletions(-)

-- 
1.8.3.1

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 1/5] vhost: generalize adding used elem

Use one generic vhost_copy_to_user() instead of two dedicated
accessors. This will simplify the conversion to fine-grained
accessors. About a 2% improvement in PPS was seen during the virtio-user
txonly test.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/vhost.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a2e5dc7..400aa78 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2251,16 +2251,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (count == 1) {
-		if (vhost_put_user(vq, heads[0].id, &used->id)) {
-			vq_err(vq, "Failed to write used id");
-			return -EFAULT;
-		}
-		if (vhost_put_user(vq, heads[0].len, &used->len)) {
-			vq_err(vq, "Failed to write used len");
-			return -EFAULT;
-		}
-	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
-- 
1.8.3.1
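
A minimal userspace sketch, not part of the posted patch, of why the
count == 1 special case removed above is redundant: one bulk copy of a
whole element leaves the same bytes as two separate field stores. It
assumes a struct mirroring the two 32-bit fields of struct
vring_used_elem; memcpy() stands in for vhost_copy_to_user(), and all
names below are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors struct vring_used_elem: two 32-bit fields, no padding. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

/* Old single-element path: two separate field stores. */
void put_used_fieldwise(struct used_elem *dst, const struct used_elem *head)
{
	dst->id = head->id;
	dst->len = head->len;
}

/* Generic path: one bulk copy, valid for any count >= 1. */
void put_used_bulk(struct used_elem *dst, const struct used_elem *heads,
		   unsigned int count)
{
	memcpy(dst, heads, count * sizeof(*heads));
}

/* Returns 1 when both paths leave identical bytes for one element. */
int used_paths_match(uint32_t id, uint32_t len)
{
	struct used_elem head = { .id = id, .len = len };
	struct used_elem a, b;

	/* Start from different garbage so a partial write would show up. */
	memset(&a, 0x00, sizeof(a));
	memset(&b, 0xff, sizeof(b));

	put_used_fieldwise(&a, &head);
	put_used_bulk(&b, &head, 1);
	return memcmp(&a, &b, sizeof(a)) == 0;
}
```

The sketch says nothing about cost; the roughly 2% PPS figure quoted in
the commit message comes from the author's measurement, not from this
example.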

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors

This is used to hide the metadata address from the virtqueue helpers. This
will allow implementing vmap-based fast access to metadata.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 400aa78..29709e7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -869,6 +869,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 	ret; \
 })
 
+static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+			      vhost_avail_event(vq));
+}
+
+static inline int vhost_put_used(struct vhost_virtqueue *vq,
+				 struct vring_used_elem *head, int idx,
+				 int count)
+{
+	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
+				  count * sizeof(*head));
+}
+
+static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+			      &vq->used->flags);
+}
+
+static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
+
+{
+	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			      &vq->used->idx);
+}
+
 #define vhost_get_user(vq, x, ptr, type)		\
 ({ \
 	int ret; \
@@ -907,6 +935,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
 		mutex_unlock(&d->vqs[i]->mutex);
 }
 
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
+				      __virtio16 *idx)
+{
+	return vhost_get_avail(vq, *idx, &vq->avail->idx);
+}
+
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
+				       __virtio16 *head, int idx)
+{
+	return vhost_get_avail(vq, *head,
+			       &vq->avail->ring[idx & (vq->num - 1)]);
+}
+
+static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
+					__virtio16 *flags)
+{
+	return vhost_get_avail(vq, *flags, &vq->avail->flags);
+}
+
+static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
+				       __virtio16 *event)
+{
+	return vhost_get_avail(vq, *event, vhost_used_event(vq));
+}
+
+static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
+				     __virtio16 *idx)
+{
+	return vhost_get_used(vq, *idx, &vq->used->idx);
+}
+
+static inline int vhost_get_desc(struct vhost_virtqueue *vq,
+				 struct vring_desc *desc, int idx)
+{
+	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
+}
+
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -1840,8 +1905,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
 	void __user *used;
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
-			   &vq->used->flags) < 0)
+	if (vhost_put_used_flags(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		/* Make sure the flag is seen before log. */
@@ -1858,8 +1922,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
-			   vhost_avail_event(vq)))
+	if (vhost_put_avail_event(vq))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		void __user *used;
@@ -1895,7 +1958,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
+	r = vhost_get_used_idx(vq, &last_used_idx);
 	if (r) {
 		vq_err(vq, "Can't access used idx at %p\n",
 		       &vq->used->idx);
@@ -2094,7 +2157,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	last_avail_idx = vq->last_avail_idx;
 
 	if (vq->avail_idx == vq->last_avail_idx) {
-		if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
 				&vq->avail->idx);
 			return -EFAULT;
@@ -2121,8 +2184,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(vhost_get_avail(vq, ring_head,
-		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
@@ -2157,8 +2219,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
-					   sizeof desc);
+		ret = vhost_get_desc(vq, &desc, i);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -2251,7 +2312,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
-	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+	if (vhost_put_used(vq, heads, start, count)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
@@ -2293,8 +2354,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 
 	/* Make sure buffer is written before we update index. */
 	smp_wmb();
-	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
-			   &vq->used->idx)) {
+	if (vhost_put_used_idx(vq)) {
 		vq_err(vq, "Failed to increment used idx");
 		return -EFAULT;
 	}
@@ -2327,7 +2387,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
+		if (vhost_get_avail_flags(vq, &flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -2341,7 +2401,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
+	if (vhost_get_used_event(vq, &event)) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -2386,7 +2446,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (vq->avail_idx != vq->last_avail_idx)
 		return false;
 
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (unlikely(r))
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
@@ -2422,7 +2482,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
-- 
1.8.3.1
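
The accessor pattern introduced above can be sketched in plain C. This
is an illustration only, assuming simplified stand-ins for struct
vring_avail and struct vhost_virtqueue; the struct and function names
are hypothetical. Callers name the field they want, and only the
accessor knows where the ring lives, which is what later lets the
implementation swap the backing address without touching callers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vring_avail with a fixed ring. */
struct avail_ring {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[8];	/* ring size must be a power of two */
};

/* Hypothetical stand-in for struct vhost_virtqueue. */
struct vq {
	unsigned int num;
	struct avail_ring *avail;
};

/* Read the producer index; the caller never sees the address. */
int get_avail_idx(const struct vq *vq, uint16_t *idx)
{
	*idx = vq->avail->idx;
	return 0;
}

/* Read one ring slot; a free-running index is wrapped with a mask. */
int get_avail_head(const struct vq *vq, uint16_t *head, unsigned int idx)
{
	/* Masking equals idx % num only when num is a power of two. */
	assert((vq->num & (vq->num - 1)) == 0);
	*head = vq->avail->ring[idx & (vq->num - 1)];
	return 0;
}
```

Because the index is masked inside the accessor, callers can pass a
free-running counter such as last_avail_idx without wrapping it first.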

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()

Rename the function to be more accurate, since it actually tries to
prefetch the vq metadata addresses in the IOTLB. It will also be used by a
following patch to prefetch metadata virtual addresses.

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/net.c   | 4 ++--
 drivers/vhost/vhost.c | 4 ++--
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df51a35..bf55f99 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -971,7 +971,7 @@ static void handle_tx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
-	if (!vq_iotlb_prefetch(vq))
+	if (!vq_meta_prefetch(vq))
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
@@ -1140,7 +1140,7 @@ static void handle_rx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
-	if (!vq_iotlb_prefetch(vq))
+	if (!vq_meta_prefetch(vq))
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 29709e7..2025543 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1309,7 +1309,7 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 	return true;
 }
 
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 	unsigned int num = vq->num;
@@ -1328,7 +1328,7 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
 			       num * sizeof(*vq->used->ring) + s,
 			       VHOST_ADDR_USED);
 }
-EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
+EXPORT_SYMBOL_GPL(vq_meta_prefetch);
 
 /* Can we log writes? */
 /* Caller should have device mutex but not vq mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 9490e7d..7a7fc00 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -209,7 +209,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len,
 		    struct iovec *iov, int count);
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
+int vq_meta_prefetch(struct vhost_virtqueue *vq);
 
 struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
 void vhost_enqueue_msg(struct vhost_dev *dev,
-- 
1.8.3.1

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/vhost.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2025543..1015464 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
 		vhost_vq_free_iovecs(dev->vqs[i]);
 }
 
+static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)
+{
+	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+	return sizeof(*vq->avail) +
+	       sizeof(*vq->avail->ring) * num + event;
+}
+
+static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
+{
+	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+	return sizeof(*vq->used) +
+	       sizeof(*vq->used->ring) * num + event;
+}
+
+static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
+{
+	return sizeof(*vq->desc) * num;
+}
+
 void vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
 {
@@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			 struct vring_used __user *used)
 
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
-	return access_ok(desc, num * sizeof *desc) &&
-	       access_ok(avail,
-			 sizeof *avail + num * sizeof *avail->ring + s) &&
-	       access_ok(used,
-			sizeof *used + num * sizeof *used->ring + s);
+	return access_ok(desc, vhost_get_desc_size(vq, num)) &&
+	       access_ok(avail, vhost_get_avail_size(vq, num)) &&
+	       access_ok(used, vhost_get_used_size(vq, num));
 }
 
 static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
@@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 
 int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 	unsigned int num = vq->num;
 
 	if (!vq->iotlb)
 		return 1;
 
 	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
-			       num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
 	       iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
-			       sizeof *vq->avail +
-			       num * sizeof(*vq->avail->ring) + s,
+			       vhost_get_avail_size(vq, num),
 			       VHOST_ADDR_AVAIL) &&
 	       iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
-			       sizeof *vq->used +
-			       num * sizeof(*vq->used->ring) + s,
-			       VHOST_ADDR_USED);
+			       vhost_get_used_size(vq, num), VHOST_ADDR_USED);
 }
 EXPORT_SYMBOL_GPL(vq_meta_prefetch);
 
@@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
 static bool vq_log_access_ok(struct vhost_virtqueue *vq,
 			     void __user *log_base)
 {
-	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
 	return vq_memory_access_ok(log_base, vq->umem,
 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
-					sizeof *vq->used +
-					vq->num * sizeof *vq->used->ring + s));
+				  vhost_get_used_size(vq, vq->num)));
 }
 
 /* Can we start vq? */
-- 
1.8.3.1
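
The size helpers above can be worked through in userspace. The sketch
below assumes structs that mirror the split-ring layouts (16-byte
descriptors, 2-byte avail entries, 8-byte used elements, 4-byte ring
headers) and takes the EVENT_IDX feature as a plain flag; the names are
hypothetical and this is not the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout stand-ins for vring_desc, vring_avail, vring_used. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];
};

struct used_elem {
	uint32_t id;
	uint32_t len;
};

struct used {
	uint16_t flags;
	uint16_t idx;
	struct used_elem ring[];
};

/* Descriptor table: num fixed-size entries, no header or trailer. */
size_t desc_size(unsigned int num)
{
	return sizeof(struct desc) * num;
}

/* Avail ring: header + num entries + optional 2-byte used_event. */
size_t avail_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct avail) + sizeof(uint16_t) * num + event;
}

/* Used ring: header + num elements + optional 2-byte avail_event. */
size_t used_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? 2 : 0;

	return sizeof(struct used) + sizeof(struct used_elem) * num + event;
}
```

For a 256-entry queue with EVENT_IDX negotiated, these layouts give
4096 bytes for the descriptor table, 518 for the avail ring and 2054
for the used ring.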

Jason Wang

2019-Mar-06 07:18 UTC

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

It was noticed that the copy_user() friends that were used to access
virtqueue metadata tend to be very expensive for dataplane
implementations like vhost, since they involve lots of software checks,
speculation barriers, and hardware feature toggling (e.g. SMAP). The
extra cost is more obvious when transferring small packets, since
the time spent on metadata access becomes more significant.

This patch tries to eliminate those overheads by accessing the metadata
through a kernel virtual address set up by vmap(). To allow the pages to
be migrated, instead of pinning them through GUP, we use MMU notifiers to
invalidate vmaps and re-establish them during each round of metadata
prefetching if necessary. It looks to me that .invalidate_range() is
sufficient for catching this, since we don't need an extra TLB flush. For
devices that don't use metadata prefetching, the memory accessors
fall back to the normal copy_user() implementation gracefully. The
invalidation is synchronized with the datapath through the vq mutex, and
in order to avoid holding the vq mutex during range checking, the MMU
notifier is torn down when trying to modify vq metadata.
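
The range check that decides whether an invalidation touches a
metadata area can be sketched in isolation. This mirrors the
comparison used in the diff further down, with the metadata area taken
as [ustart, ustart + size - 1] and the notifier range compared as
inclusive at both ends; the function name is hypothetical, and how the
notifier's end address should be interpreted is an assumption of this
sketch.

```c
/*
 * Returns 1 when [start, end] intersects the metadata area
 * [ustart, ustart + size - 1], and 0 otherwise.
 */
int range_overlaps(unsigned long ustart, unsigned long size,
		   unsigned long start, unsigned long end)
{
	/* Entirely below the area, or entirely above it: no overlap. */
	if (end < ustart || start > ustart - 1 + size)
		return 0;

	return 1;
}
```

Only when the check reports an overlap does the mapping need to be torn
down; unrelated invalidations return without taking the vq mutex.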

Dirty page tracking is done by calling set_page_dirty_lock()
explicitly for the pages that the used ring occupies after each round of
processing.

Note that this was only done when device IOTLB is not enabled. We
could use similar method to optimize it in the future.

Tests show at most about a 22% improvement in TX PPS when using
virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:

        SMAP on | SMAP off
Before: 5.0Mpps | 6.6Mpps
After:  6.1Mpps | 7.4Mpps
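
The relative gains implied by the table can be checked with a few lines
of C. This is only a worked calculation on the numbers quoted in this
message; the function name is hypothetical.

```c
/* Relative improvement of `after` over `before`, in percent. */
double improvement_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the figures above, the SMAP-on column (5.0 to 6.1 Mpps) works out
to about 22%, and the SMAP-off column (6.6 to 7.4 Mpps) to about 12%.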

Cc: <linux-mm at kvack.org>
Signed-off-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/net.c   |   2 +
 drivers/vhost/vhost.c | 281 +++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h |  16 +++
 3 files changed, 297 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index bf55f99..c276371 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -982,6 +982,7 @@ static void handle_tx(struct vhost_net *net)
 	else
 		handle_tx_copy(net, sock);
 
+	vq_meta_prefetch_done(vq);
 out:
 	mutex_unlock(&vq->mutex);
 }
@@ -1250,6 +1251,7 @@ static void handle_rx(struct vhost_net *net)
 		vhost_net_enable_vq(net, vq);
 out:
 	vhost_net_signal_used(nvq);
+	vq_meta_prefetch_done(vq);
 	mutex_unlock(&vq->mutex);
 }
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1015464..36ccf7c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -434,6 +434,74 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
 	return sizeof(*vq->desc) * num;
 }
 
+static void vhost_uninit_vmap(struct vhost_vmap *map)
+{
+	if (map->addr) {
+		vunmap(map->unmap_addr);
+		kfree(map->pages);
+		map->pages = NULL;
+		map->npages = 0;
+	}
+
+	map->addr = NULL;
+	map->unmap_addr = NULL;
+}
+
+static void vhost_invalidate_vmap(struct vhost_virtqueue *vq,
+				  struct vhost_vmap *map,
+				  unsigned long ustart,
+				  size_t size,
+				  unsigned long start,
+				  unsigned long end)
+{
+	if (end < ustart || start > ustart - 1 + size)
+		return;
+
+	dump_stack();
+	mutex_lock(&vq->mutex);
+	vhost_uninit_vmap(map);
+	mutex_unlock(&vq->mutex);
+}
+
+
+static void vhost_invalidate(struct vhost_dev *dev,
+			     unsigned long start, unsigned long end)
+{
+	int i;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		struct vhost_virtqueue *vq = dev->vqs[i];
+
+		vhost_invalidate_vmap(vq, &vq->avail_ring,
+				      (unsigned long)vq->avail,
+				      vhost_get_avail_size(vq, vq->num),
+				      start, end);
+		vhost_invalidate_vmap(vq, &vq->desc_ring,
+				      (unsigned long)vq->desc,
+				      vhost_get_desc_size(vq, vq->num),
+				      start, end);
+		vhost_invalidate_vmap(vq, &vq->used_ring,
+				      (unsigned long)vq->used,
+				      vhost_get_used_size(vq, vq->num),
+				      start, end);
+	}
+}
+
+
+static void vhost_invalidate_range(struct mmu_notifier *mn,
+				   struct mm_struct *mm,
+				   unsigned long start, unsigned long end)
+{
+	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
+					     mmu_notifier);
+
+	vhost_invalidate(dev, start, end);
+}
+
+static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
+	.invalidate_range = vhost_invalidate_range,
+};
+
 void vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
 {
@@ -449,6 +517,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	dev->worker = NULL;
 	dev->iov_limit = iov_limit;
+	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
 	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
@@ -462,6 +531,9 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
+		vq->avail_ring.addr = NULL;
+		vq->used_ring.addr = NULL;
+		vq->desc_ring.addr = NULL;
 		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
@@ -542,7 +614,13 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_cgroup;
 
+	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
+	if (err)
+		goto err_mmu_notifier;
+
 	return 0;
+err_mmu_notifier:
+	vhost_dev_free_iovecs(dev);
 err_cgroup:
 	kthread_stop(worker);
 	dev->worker = NULL;
@@ -633,6 +711,81 @@ static void vhost_clear_msg(struct vhost_dev *dev)
 	spin_unlock(&dev->iotlb_lock);
 }
 
+static int vhost_init_vmap(struct vhost_dev *dev,
+			   struct vhost_vmap *map, unsigned long uaddr,
+			   size_t size, int write)
+{
+	struct page **pages;
+	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
+	int npinned;
+	void *vaddr;
+	int err = -EFAULT;
+
+	err = -ENOMEM;
+	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		goto err_uaddr;
+
+	err = EFAULT;
+	npinned = get_user_pages_fast(uaddr, npages, write, pages);
+	if (npinned != npages)
+		goto err_gup;
+
+	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	if (!vaddr)
+		goto err_gup;
+
+	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
+	map->unmap_addr = vaddr;
+	map->npages = npages;
+	map->pages = pages;
+
+err_gup:
+	/* Don't pin pages, mmu notifier will notify us about page
+	 * migration.
+	 */
+	if (npinned > 0)
+		release_pages(pages, npinned);
+err_uaddr:
+	return err;
+}
+
+static void vhost_uninit_vq_vmaps(struct vhost_virtqueue *vq)
+{
+	vhost_uninit_vmap(&vq->avail_ring);
+	vhost_uninit_vmap(&vq->desc_ring);
+	vhost_uninit_vmap(&vq->used_ring);
+}
+
+static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
+				  unsigned long avail)
+{
+	return vhost_init_vmap(vq->dev, &vq->avail_ring, avail,
+			       vhost_get_avail_size(vq, vq->num), false);
+}
+
+static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
+				 unsigned long desc)
+{
+	return vhost_init_vmap(vq->dev, &vq->desc_ring, desc,
+			       vhost_get_desc_size(vq, vq->num), false);
+}
+
+static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
+				 unsigned long used)
+{
+	return vhost_init_vmap(vq->dev, &vq->used_ring, used,
+			       vhost_get_used_size(vq, vq->num), true);
+}
+
+static void vhost_set_vmap_dirty(struct vhost_vmap *used)
+{
+	int i;
+
+	for (i = 0; i < used->npages; i++)
+		set_page_dirty_lock(used->pages[i]);
+}
+
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
@@ -662,8 +815,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 		kthread_stop(dev->worker);
 		dev->worker = NULL;
 	}
-	if (dev->mm)
+	for (i = 0; i < dev->nvqs; i++)
+		vhost_uninit_vq_vmaps(dev->vqs[i]);
+	if (dev->mm) {
+		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
 		mmput(dev->mm);
+	}
 	dev->mm = NULL;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
@@ -892,6 +1049,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 
 static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			*((__virtio16 *)&used->ring[vq->num]) =
+				cpu_to_vhost16(vq, vq->avail_idx);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
 			      vhost_avail_event(vq));
 }
@@ -900,6 +1067,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 				 struct vring_used_elem *head, int idx,
 				 int count)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			memcpy(used->ring + idx, head,
+			       count * sizeof(*head));
+			return 0;
+		}
+	}
+
 	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
 				  count * sizeof(*head));
 }
@@ -907,6 +1084,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			used->flags = cpu_to_vhost16(vq, vq->used_flags);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
 			      &vq->used->flags);
 }
@@ -914,6 +1100,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
+			return 0;
+		}
+	}
+
 	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
 			      &vq->used->idx);
 }
@@ -959,12 +1154,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
 static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 				      __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*idx = avail->idx;
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *idx, &vq->avail->idx);
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 				       __virtio16 *head, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*head = avail->ring[idx & (vq->num - 1)];
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *head,
 			       &vq->avail->ring[idx & (vq->num - 1)]);
 }
@@ -972,24 +1185,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 					__virtio16 *flags)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*flags = avail->flags;
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *flags, &vq->avail->flags);
 }
 
 static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 				       __virtio16 *event)
 {
+	if (!vq->iotlb) {
+		struct vring_avail *avail = vq->avail_ring.addr;
+
+		if (likely(avail)) {
+			*event = (__virtio16)avail->ring[vq->num];
+			return 0;
+		}
+	}
+
 	return vhost_get_avail(vq, *event, vhost_used_event(vq));
 }
 
 static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 				     __virtio16 *idx)
 {
+	if (!vq->iotlb) {
+		struct vring_used *used = vq->used_ring.addr;
+
+		if (likely(used)) {
+			*idx = used->idx;
+			return 0;
+		}
+	}
+
 	return vhost_get_used(vq, *idx, &vq->used->idx);
 }
 
 static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 				 struct vring_desc *desc, int idx)
 {
+	if (!vq->iotlb) {
+		struct vring_desc *d = vq->desc_ring.addr;
+
+		if (likely(d)) {
+			*desc = *(d + idx);
+			return 0;
+		}
+	}
+
 	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
 }
 
@@ -1330,8 +1579,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
 {
 	unsigned int num = vq->num;
 
-	if (!vq->iotlb)
+	if (!vq->iotlb) {
+		if (unlikely(!vq->avail_ring.addr))
+			vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
+		if (unlikely(!vq->desc_ring.addr))
+			vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
+		if (unlikely(!vq->used_ring.addr))
+			vhost_setup_used_vmap(vq, (unsigned long)vq->used);
+
 		return 1;
+	}
 
 	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
 			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
@@ -1343,6 +1600,15 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vq_meta_prefetch);
 
+void vq_meta_prefetch_done(struct vhost_virtqueue *vq)
+{
+	if (vq->iotlb)
+		return;
+	if (likely(vq->used_ring.addr))
+		vhost_set_vmap_dirty(&vq->used_ring);
+}
+EXPORT_SYMBOL_GPL(vq_meta_prefetch_done);
+
 /* Can we log writes? */
 /* Caller should have device mutex but not vq mutex */
 bool vhost_log_access_ok(struct vhost_dev *dev)
@@ -1483,6 +1749,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 
 	mutex_lock(&vq->mutex);
 
+	/* Unregister MMU notifier to allow the invalidation callback
+	 * to access vq->avail, vq->desc, vq->used and vq->num
+	 * without holding vq->mutex.
+	 */
+	if (d->mm)
+		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		/* Resizing ring with an active backend?
@@ -1499,6 +1772,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 			r = -EINVAL;
 			break;
 		}
+		vhost_uninit_vq_vmaps(vq);
 		vq->num = s.num;
 		break;
 	case VHOST_SET_VRING_BASE:
@@ -1581,6 +1855,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
 		vq->log_addr = a.log_guest_addr;
 		vq->used = (void __user *)(unsigned long)a.used_user_addr;
+		vhost_uninit_vq_vmaps(vq);
 		break;
 	case VHOST_SET_VRING_KICK:
 		if (copy_from_user(&f, argp, sizeof f)) {
@@ -1656,6 +1931,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 	if (pollstart && vq->handle_kick)
 		r = vhost_poll_start(&vq->poll, vq->kick);
 
+	if (d->mm)
+		mmu_notifier_register(&d->mmu_notifier, d->mm);
 	mutex_unlock(&vq->mutex);
 
 	if (pollstop && vq->handle_kick)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 7a7fc00..146076e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,8 @@
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
+#include <linux/pagemap.h>
+#include <linux/mmu_notifier.h>
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -80,6 +82,13 @@ enum vhost_uaddr_type {
 	VHOST_NUM_ADDRS = 3,
 };
 
+struct vhost_vmap {
+	void *addr;
+	void *unmap_addr;
+	int npages;
+	struct page **pages;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -90,6 +99,11 @@ struct vhost_virtqueue {
 	struct vring_desc __user *desc;
 	struct vring_avail __user *avail;
 	struct vring_used __user *used;
+
+	struct vhost_vmap avail_ring;
+	struct vhost_vmap desc_ring;
+	struct vhost_vmap used_ring;
+
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 	struct file *kick;
 	struct eventfd_ctx *call_ctx;
@@ -158,6 +172,7 @@ struct vhost_msg_node {
 
 struct vhost_dev {
 	struct mm_struct *mm;
+	struct mmu_notifier mmu_notifier;
 	struct mutex mutex;
 	struct vhost_virtqueue **vqs;
 	int nvqs;
@@ -210,6 +225,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len,
 		    struct iovec *iov, int count);
 int vq_meta_prefetch(struct vhost_virtqueue *vq);
+void vq_meta_prefetch_done(struct vhost_virtqueue *vq);
 
 struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
 void vhost_enqueue_msg(struct vhost_dev *dev,
-- 
1.8.3.1
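
The accessor shape used throughout patch 5/5 (try the kernel mapping
first, fall back to the checked user-copy path when the mapping is
absent or the device IOTLB is in use) can be sketched in userspace.
This is an illustration under simplifying assumptions, not the kernel
code: plain pointers replace the vmap() address and the __user pointer,
a counter makes the fallback visible, and every name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Header of the used ring: the two fields the sketch touches. */
struct used_hdr {
	uint16_t flags;
	uint16_t idx;
};

/* Hypothetical stand-in for the relevant parts of vhost_virtqueue. */
struct vq {
	int iotlb;			/* non-zero: device IOTLB in use */
	struct used_hdr *used_map;	/* fast mapping, NULL when invalid */
	struct used_hdr *used_user;	/* stands in for the __user pointer */
	unsigned int slow_calls;	/* counts fallback accesses */
};

/* Stand-in for vhost_put_user(): the checked, slower path. */
static int slow_put_idx(struct vq *vq, uint16_t idx)
{
	vq->slow_calls++;
	vq->used_user->idx = idx;
	return 0;
}

/* Fast path when a mapping exists and IOTLB is off, else fall back. */
int put_used_idx(struct vq *vq, uint16_t idx)
{
	if (!vq->iotlb) {
		struct used_hdr *used = vq->used_map;

		if (used) {
			used->idx = idx;
			return 0;
		}
	}

	return slow_put_idx(vq, idx);
}

/* What an invalidation does to the mapping: drop it. */
void invalidate(struct vq *vq)
{
	vq->used_map = NULL;
}

/* What a prefetch round does: re-establish the mapping if missing. */
void prefetch(struct vq *vq, struct used_hdr *mapping)
{
	if (!vq->iotlb && !vq->used_map)
		vq->used_map = mapping;
}
```

A write issued between an invalidation and the next prefetch still
succeeds; it simply takes the slow path once, which is the graceful
fallback the commit message describes.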

Christophe de Dinechin

2019-Mar-06 10:45 UTC

[RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors

> On 6 Mar 2019, at 08:18, Jason Wang <jasowang at redhat.com> wrote:
> 
> This is used to hide the metadata address from virtqueue helpers. This
> will allow to implement a vmap based fast accessing to metadata.
> 
> Signed-off-by: Jason Wang <jasowang at redhat.com>
> ---
> drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 77 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 400aa78..29709e7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -869,6 +869,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> 	ret; \
> })
> 
> +static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> +{
> +	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> +			      vhost_avail_event(vq));
> +}
> +
> +static inline int vhost_put_used(struct vhost_virtqueue *vq,
> +				 struct vring_used_elem *head, int idx,
> +				 int count)
> +{
> +	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> +				  count * sizeof(*head));
> +}
> +
> +static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> +
> +{
> +	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> +			      &vq->used->flags);
> +}
> +
> +static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> +
> +{
> +	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> +			      &vq->used->idx);
> +}
> +
> #define vhost_get_user(vq, x, ptr, type)		\
> ({ \
> 	int ret; \
> @@ -907,6 +935,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> 		mutex_unlock(&d->vqs[i]->mutex);
> }
> 
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> +				      __virtio16 *idx)
> +{
> +	return vhost_get_avail(vq, *idx, &vq->avail->idx);
> +}
> +
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> +				       __virtio16 *head, int idx)
> +{
> +	return vhost_get_avail(vq, *head,
> +			       &vq->avail->ring[idx & (vq->num - 1)]);
> +}
> +
> +static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> +					__virtio16 *flags)
> +{
> +	return vhost_get_avail(vq, *flags, &vq->avail->flags);
> +}
> +
> +static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> +				       __virtio16 *event)
> +{
> +	return vhost_get_avail(vq, *event, vhost_used_event(vq));
> +}
> +
> +static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> +				     __virtio16 *idx)
> +{
> +	return vhost_get_used(vq, *idx, &vq->used->idx);
> +}
> +
> +static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> +				 struct vring_desc *desc, int idx)
> +{
> +	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> +}
> +
> static int vhost_new_umem_range(struct vhost_umem *umem,
> 				u64 start, u64 size, u64 end,
> 				u64 userspace_addr, int perm)
> @@ -1840,8 +1905,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> static int vhost_update_used_flags(struct vhost_virtqueue *vq)
> {
> 	void __user *used;
> -	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> -			   &vq->used->flags) < 0)
> +	if (vhost_put_used_flags(vq))
> 		return -EFAULT;
> 	if (unlikely(vq->log_used)) {
> 		/* Make sure the flag is seen before log. */
> @@ -1858,8 +1922,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
> 
> static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
> {
> -	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> -			   vhost_avail_event(vq)))
> +	if (vhost_put_avail_event(vq))
> 		return -EFAULT;
> 	if (unlikely(vq->log_used)) {
> 		void __user *used;
> @@ -1895,7 +1958,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
> 		r = -EFAULT;
> 		goto err;
> 	}
> -	r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
> +	r = vhost_get_used_idx(vq, &last_used_idx);
> 	if (r) {
> 		vq_err(vq, "Can't access used idx at %p\n",
> 		       &vq->used->idx);

From the error case, it looks like you are not entirely encapsulating
knowledge of what the accessor uses, i.e. it's not:

		vq_err(vq, "Can't access used idx at %p\n",
		       &last_used_idx);

Maybe move the error message into the accessor?

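As a rough illustration of that suggestion, here is a minimal user-space sketch, not the vhost code itself. The types `used_hdr` and `vq_sketch`, the `last_err` buffer standing in for `vq_err()`, and the function `get_used_idx()` are all hypothetical names made up for this example. The point is only that the accessor, which knows which address it touched, is also the place that reports the failure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the vhost types; not the kernel definitions. */
struct used_hdr {
	uint16_t flags;
	uint16_t idx;
};

struct vq_sketch {
	struct used_hdr *used;	/* NULL models an inaccessible used ring */
	char last_err[64];	/* models what vq_err() would have reported */
};

/*
 * The accessor owns both the access and the error report, so callers
 * never need to know which address was actually dereferenced.
 */
static int get_used_idx(struct vq_sketch *vq, uint16_t *idx)
{
	assert(vq && idx);

	if (!vq->used) {
		snprintf(vq->last_err, sizeof(vq->last_err),
			 "Can't access used idx");
		return -1;
	}

	*idx = vq->used->idx;
	return 0;
}
```

With this shape a caller only checks the return value, and the message stays correct whichever access path the accessor ends up using.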
> @@ -2094,7 +2157,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> 	last_avail_idx = vq->last_avail_idx;
> 
> 	if (vq->avail_idx == vq->last_avail_idx) {
> -		if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
> +		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> 			vq_err(vq, "Failed to access avail idx at %p\n",
> 				&vq->avail->idx);
> 			return -EFAULT;

Same here.

> @@ -2121,8 +2184,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> 
> 	/* Grab the next descriptor number they're advertising, and increment
> 	 * the index we've seen. */
> -	if (unlikely(vhost_get_avail(vq, ring_head,
> -		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
> +	if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> 		vq_err(vq, "Failed to read head: idx %d address %p\n",
> 		       last_avail_idx,
> 		       &vq->avail->ring[last_avail_idx % vq->num]);
> @@ -2157,8 +2219,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> 			       i, vq->num, head);
> 			return -EINVAL;
> 		}
> -		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
> -					   sizeof desc);
> +		ret = vhost_get_desc(vq, &desc, i);
> 		if (unlikely(ret)) {
> 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> 			       i, vq->desc + i);
> @@ -2251,7 +2312,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> 
> 	start = vq->last_used_idx & (vq->num - 1);
> 	used = vq->used->ring + start;
> -	if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> +	if (vhost_put_used(vq, heads, start, count)) {
> 		vq_err(vq, "Failed to write used");
> 		return -EFAULT;
> 	}
> @@ -2293,8 +2354,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> 
> 	/* Make sure buffer is written before we update index. */
> 	smp_wmb();
> -	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> -			   &vq->used->idx)) {
> +	if (vhost_put_used_idx(vq)) {
> 		vq_err(vq, "Failed to increment used idx");
> 		return -EFAULT;
> 	}
> @@ -2327,7 +2387,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> 
> 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
> 		__virtio16 flags;
> -		if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
> +		if (vhost_get_avail_flags(vq, &flags)) {
> 			vq_err(vq, "Failed to get flags");
> 			return true;
> 		}
> @@ -2341,7 +2401,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> 	if (unlikely(!v))
> 		return true;
> 
> -	if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
> +	if (vhost_get_used_event(vq, &event)) {
> 		vq_err(vq, "Failed to get used event idx");
> 		return true;
> 	}
> @@ -2386,7 +2446,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> 	if (vq->avail_idx != vq->last_avail_idx)
> 		return false;
> 
> -	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
> +	r = vhost_get_avail_idx(vq, &avail_idx);
> 	if (unlikely(r))
> 		return false;
> 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> @@ -2422,7 +2482,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> 	/* They could have slipped one in as we were doing that: make
> 	 * sure it's written, then check again. */
> 	smp_mb();
> -	r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
> +	r = vhost_get_avail_idx(vq, &avail_idx);
> 	if (r) {
> 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
> 		       &vq->avail->idx, r);
> -- 
> 1.8.3.1
>

Christophe de Dinechin

2019-Mar-06 10:56 UTC

[RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

> On 6 Mar 2019, at 08:18, Jason Wang <jasowang at redhat.com> wrote:
> 
> Signed-off-by: Jason Wang <jasowang at redhat.com>
> ---
> drivers/vhost/vhost.c | 46 ++++++++++++++++++++++++++++------------------
> 1 file changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2025543..1015464 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
> 		vhost_vq_free_iovecs(dev->vqs[i]);
> }
> 
> +static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)

Nit: Any reason not to make `num` unsigned or size_t?

> +{
> +	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> +	return sizeof(*vq->avail) +
> +	       sizeof(*vq->avail->ring) * num + event;
> +}
> +
> +static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
> +{
> +	size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> +
> +	return sizeof(*vq->used) +
> +	       sizeof(*vq->used->ring) * num + event;
> +}
> +
> +static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
> +{
> +	return sizeof(*vq->desc) * num;
> +}
> +
> void vhost_dev_init(struct vhost_dev *dev,
> 		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> {
> @@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> 			 struct vring_used __user *used)
> 
> {
> -	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> -	return access_ok(desc, num * sizeof *desc) &&
> -	       access_ok(avail,
> -			 sizeof *avail + num * sizeof *avail->ring + s) &&
> -	       access_ok(used,
> -			sizeof *used + num * sizeof *used->ring + s);
> +	return access_ok(desc, vhost_get_desc_size(vq, num)) &&
> +	       access_ok(avail, vhost_get_avail_size(vq, num)) &&
> +	       access_ok(used, vhost_get_used_size(vq, num));
> }
> 
> static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> @@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> 
> int vq_meta_prefetch(struct vhost_virtqueue *vq)
> {
> -	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> 	unsigned int num = vq->num;
> 
> 	if (!vq->iotlb)
> 		return 1;
> 
> 	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
> -			       num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
> +			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
> 	       iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
> -			       sizeof *vq->avail +
> -			       num * sizeof(*vq->avail->ring) + s,
> +			       vhost_get_avail_size(vq, num),
> 			       VHOST_ADDR_AVAIL) &&
> 	       iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
> -			       sizeof *vq->used +
> -			       num * sizeof(*vq->used->ring) + s,
> -			       VHOST_ADDR_USED);
> +			       vhost_get_used_size(vq, num), VHOST_ADDR_USED);
> }
> EXPORT_SYMBOL_GPL(vq_meta_prefetch);
> 
> @@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
> static bool vq_log_access_ok(struct vhost_virtqueue *vq,
> 			     void __user *log_base)
> {
> -	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> -
> 	return vq_memory_access_ok(log_base, vq->umem,
> 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
> 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
> -					sizeof *vq->used +
> -					vq->num * sizeof *vq->used->ring + s));
> +				  vhost_get_used_size(vq, vq->num)));
> }
> 
> /* Can we start vq? */
> -- 
> 1.8.3.1
>
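
To make the `num` nit above concrete, here is a small user-space sketch of the same size arithmetic with an unsigned count. The struct and function names (`desc_sketch`, `avail_size()` and so on) are illustrative stand-ins that mirror the split-ring layout; they are not the kernel's definitions, and the `event_idx` flag stands in for the VIRTIO_RING_F_EVENT_IDX feature test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirrors of the split-ring layout; names are illustrative. */
struct desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct used_elem_sketch {
	uint32_t id;
	uint32_t len;
};

struct ring_hdr_sketch {
	uint16_t flags;
	uint16_t idx;
};

static_assert(sizeof(struct desc_sketch) == 16, "descriptor is 16 bytes");
static_assert(sizeof(struct used_elem_sketch) == 8, "used elem is 8 bytes");

/* Header + num 16-bit ring entries + optional 2-byte event field. */
static size_t avail_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? sizeof(uint16_t) : 0;

	return sizeof(struct ring_hdr_sketch) +
	       sizeof(uint16_t) * (size_t)num + event;
}

/* Header + num used elements + optional 2-byte event field. */
static size_t used_size(unsigned int num, int event_idx)
{
	size_t event = event_idx ? sizeof(uint16_t) : 0;

	return sizeof(struct ring_hdr_sketch) +
	       sizeof(struct used_elem_sketch) * (size_t)num + event;
}

static size_t desc_size(unsigned int num)
{
	return sizeof(struct desc_sketch) * (size_t)num;
}
```

Taking the count as `unsigned int` matches the `unsigned int num` that the callers in the quoted patch already hold, and the explicit `size_t` cast keeps the multiplication in the wider type.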

Michael S. Tsirkin

2019-Mar-06 16:31 UTC

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> It was noticed that the copy_user() friends that were used to access
> virtqueue metadata tend to be very expensive for dataplane
> implementations like vhost, since they involve lots of software checks,
> speculation barriers and hardware feature toggling (e.g. SMAP). The
> extra cost is more obvious when transferring small packets, since
> the time spent on metadata access becomes more significant.
> 
> This patch tries to eliminate those overheads by accessing the
> metadata through a kernel virtual address set up by vmap(). So that
> the pages can still be migrated, instead of pinning them through GUP
> we use MMU notifiers to invalidate the vmaps, and re-establish them
> during each round of metadata prefetching if necessary. It looks to
> me that .invalidate_range() is sufficient for catching this, since we
> don't need an extra TLB flush. For devices that don't use metadata
> prefetching, the memory accessors fall back to the normal copy_user()
> implementation gracefully. The invalidation is synchronized with the
> datapath through the vq mutex, and to avoid holding the vq mutex
> during range checking, the MMU notifier is torn down when vq
> metadata is modified.
> 
> Dirty page tracking is done by calling set_page_dirty_lock()
> explicitly on the pages that back the used ring after each round of
> processing.
> 
> Note that this is only done when the device IOTLB is not enabled. We
> could use a similar method to optimize that case in the future.
> 
> Tests show at most about 22% improvement on TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
> 
>         SMAP on | SMAP off
> Before: 5.0Mpps | 6.6Mpps
> After:  6.1Mpps | 7.4Mpps
> 
> Cc: <linux-mm at kvack.org>
> Signed-off-by: Jason Wang <jasowang at redhat.com>
> ---
>  drivers/vhost/net.c   |   2 +
>  drivers/vhost/vhost.c | 281 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h |  16 +++
>  3 files changed, 297 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bf55f99..c276371 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -982,6 +982,7 @@ static void handle_tx(struct vhost_net *net)
>  	else
>  		handle_tx_copy(net, sock);
>  
> +	vq_meta_prefetch_done(vq);
>  out:
>  	mutex_unlock(&vq->mutex);
>  }
> @@ -1250,6 +1251,7 @@ static void handle_rx(struct vhost_net *net)
>  		vhost_net_enable_vq(net, vq);
>  out:
>  	vhost_net_signal_used(nvq);
> +	vq_meta_prefetch_done(vq);
>  	mutex_unlock(&vq->mutex);
>  }
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1015464..36ccf7c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -434,6 +434,74 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>  	return sizeof(*vq->desc) * num;
>  }
>  
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> +	if (map->addr) {
> +		vunmap(map->unmap_addr);
> +		kfree(map->pages);
> +		map->pages = NULL;
> +		map->npages = 0;
> +	}
> +
> +	map->addr = NULL;
> +	map->unmap_addr = NULL;
> +}
> +
> +static void vhost_invalidate_vmap(struct vhost_virtqueue *vq,
> +				  struct vhost_vmap *map,
> +				  unsigned long ustart,
> +				  size_t size,
> +				  unsigned long start,
> +				  unsigned long end)
> +{
> +	if (end < ustart || start > ustart - 1 + size)
> +		return;
> +
> +	dump_stack();
> +	mutex_lock(&vq->mutex);
> +	vhost_uninit_vmap(map);
> +	mutex_unlock(&vq->mutex);
> +}
> +
> +
> +static void vhost_invalidate(struct vhost_dev *dev,
> +			     unsigned long start, unsigned long end)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		vhost_invalidate_vmap(vq, &vq->avail_ring,
> +				      (unsigned long)vq->avail,
> +				      vhost_get_avail_size(vq, vq->num),
> +				      start, end);
> +		vhost_invalidate_vmap(vq, &vq->desc_ring,
> +				      (unsigned long)vq->desc,
> +				      vhost_get_desc_size(vq, vq->num),
> +				      start, end);
> +		vhost_invalidate_vmap(vq, &vq->used_ring,
> +				      (unsigned long)vq->used,
> +				      vhost_get_used_size(vq, vq->num),
> +				      start, end);
> +	}
> +}
> +
> +
> +static void vhost_invalidate_range(struct mmu_notifier *mn,
> +				   struct mm_struct *mm,
> +				   unsigned long start, unsigned long end)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +
> +	vhost_invalidate(dev, start, end);
> +}
> +
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> +	.invalidate_range = vhost_invalidate_range,
> +};
> +
>  void vhost_dev_init(struct vhost_dev *dev,
>  		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
>  {

Note that .invalidate_range seems to be called after page lock has
been dropped.

Looking at page dirty below:

> @@ -449,6 +517,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->mm = NULL;
>  	dev->worker = NULL;
>  	dev->iov_limit = iov_limit;
> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
>  	init_llist_head(&dev->work_list);
>  	init_waitqueue_head(&dev->wait);
>  	INIT_LIST_HEAD(&dev->read_list);
> @@ -462,6 +531,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		vq->avail_ring.addr = NULL;
> +		vq->used_ring.addr = NULL;
> +		vq->desc_ring.addr = NULL;
>  		mutex_init(&vq->mutex);
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
> @@ -542,7 +614,13 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  	if (err)
>  		goto err_cgroup;
>  
> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> +	if (err)
> +		goto err_mmu_notifier;
> +
>  	return 0;
> +err_mmu_notifier:
> +	vhost_dev_free_iovecs(dev);
>  err_cgroup:
>  	kthread_stop(worker);
>  	dev->worker = NULL;
> @@ -633,6 +711,81 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>  	spin_unlock(&dev->iotlb_lock);
>  }
>  
> +static int vhost_init_vmap(struct vhost_dev *dev,
> +			   struct vhost_vmap *map, unsigned long uaddr,
> +			   size_t size, int write)
> +{
> +	struct page **pages;
> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> +	int npinned;
> +	void *vaddr;
> +	int err = -EFAULT;
> +
> +	err = -ENOMEM;
> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		goto err_uaddr;
> +
> +	err = EFAULT;
> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
> +	if (npinned != npages)
> +		goto err_gup;
> +
> +	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +	if (!vaddr)
> +		goto err_gup;
> +
> +	map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
> +	map->unmap_addr = vaddr;
> +	map->npages = npages;
> +	map->pages = pages;
> +
> +err_gup:
> +	/* Don't pin pages, mmu notifier will notify us about page
> +	 * migration.
> +	 */
> +	if (npinned > 0)
> +		release_pages(pages, npinned);
> +err_uaddr:
> +	return err;
> +}
> +
> +static void vhost_uninit_vq_vmaps(struct vhost_virtqueue *vq)
> +{
> +	vhost_uninit_vmap(&vq->avail_ring);
> +	vhost_uninit_vmap(&vq->desc_ring);
> +	vhost_uninit_vmap(&vq->used_ring);
> +}
> +
> +static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
> +				  unsigned long avail)
> +{
> +	return vhost_init_vmap(vq->dev, &vq->avail_ring, avail,
> +			       vhost_get_avail_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
> +				 unsigned long desc)
> +{
> +	return vhost_init_vmap(vq->dev, &vq->desc_ring, desc,
> +			       vhost_get_desc_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
> +				 unsigned long used)
> +{
> +	return vhost_init_vmap(vq->dev, &vq->used_ring, used,
> +			       vhost_get_used_size(vq, vq->num), true);
> +}
> +
> +static void vhost_set_vmap_dirty(struct vhost_vmap *used)
> +{
> +	int i;
> +
> +	for (i = 0; i < used->npages; i++)
> +		set_page_dirty_lock(used->pages[i]);

This seems to rely on the page lock to mark the page dirty.

Could it happen that page writeback checks the page, finds it clean,
and then you mark it dirty and then the invalidate callback is called?

> +}
> +
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> @@ -662,8 +815,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  		kthread_stop(dev->worker);
>  		dev->worker = NULL;
>  	}
> -	if (dev->mm)
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_uninit_vq_vmaps(dev->vqs[i]);
> +	if (dev->mm) {
> +		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
>  		mmput(dev->mm);
> +	}
>  	dev->mm = NULL;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -892,6 +1049,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			*((__virtio16 *)&used->ring[vq->num]) =
> +				cpu_to_vhost16(vq, vq->avail_idx);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>  			      vhost_avail_event(vq));
>  }
> @@ -900,6 +1067,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  				 struct vring_used_elem *head, int idx,
>  				 int count)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			memcpy(used->ring + idx, head,
> +			       count * sizeof(*head));
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>  				  count * sizeof(*head));
>  }
> @@ -907,6 +1084,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>  			      &vq->used->flags);
>  }
> @@ -914,6 +1100,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>  			      &vq->used->idx);
>  }
> @@ -959,12 +1154,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  				      __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*idx = avail->idx;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  				       __virtio16 *head, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*head = avail->ring[idx & (vq->num - 1)];
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *head,
>  			       &vq->avail->ring[idx & (vq->num - 1)]);
>  }
> @@ -972,24 +1185,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  					__virtio16 *flags)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*flags = avail->flags;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>  }
>  
>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  				       __virtio16 *event)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_avail *avail = vq->avail_ring.addr;
> +
> +		if (likely(avail)) {
> +			*event = (__virtio16)avail->ring[vq->num];
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>  }
>  
>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  				     __virtio16 *idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_used *used = vq->used_ring.addr;
> +
> +		if (likely(used)) {
> +			*idx = used->idx;
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>  }
>  
>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  				 struct vring_desc *desc, int idx)
>  {
> +	if (!vq->iotlb) {
> +		struct vring_desc *d = vq->desc_ring.addr;
> +
> +		if (likely(d)) {
> +			*desc = *(d + idx);
> +			return 0;
> +		}
> +	}
> +
>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>  }
>  
> @@ -1330,8 +1579,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
>  {
>  	unsigned int num = vq->num;
>  
> -	if (!vq->iotlb)
> +	if (!vq->iotlb) {
> +		if (unlikely(!vq->avail_ring.addr))
> +			vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
> +		if (unlikely(!vq->desc_ring.addr))
> +			vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
> +		if (unlikely(!vq->used_ring.addr))
> +			vhost_setup_used_vmap(vq, (unsigned long)vq->used);
> +
>  		return 1;
> +	}
>  
>  	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
>  			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
> @@ -1343,6 +1600,15 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vq_meta_prefetch);
>  
> +void vq_meta_prefetch_done(struct vhost_virtqueue *vq)
> +{
> +	if (vq->iotlb)
> +		return;
> +	if (likely(vq->used_ring.addr))
> +		vhost_set_vmap_dirty(&vq->used_ring);
> +}
> +EXPORT_SYMBOL_GPL(vq_meta_prefetch_done);
> +
>  /* Can we log writes? */
>  /* Caller should have device mutex but not vq mutex */
>  bool vhost_log_access_ok(struct vhost_dev *dev)
> @@ -1483,6 +1749,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* Unregister the MMU notifier so that the invalidation callback
> +	 * can access vq->avail, vq->desc, vq->used and vq->num
> +	 * without holding vq->mutex.
> +	 */
> +	if (d->mm)
> +		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1499,6 +1772,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  			r = -EINVAL;
>  			break;
>  		}
> +		vhost_uninit_vq_vmaps(vq);
>  		vq->num = s.num;
>  		break;
>  	case VHOST_SET_VRING_BASE:
> @@ -1581,6 +1855,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  		vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
>  		vq->log_addr = a.log_guest_addr;
>  		vq->used = (void __user *)(unsigned long)a.used_user_addr;
> +		vhost_uninit_vq_vmaps(vq);
>  		break;
>  	case VHOST_SET_VRING_KICK:
>  		if (copy_from_user(&f, argp, sizeof f)) {
> @@ -1656,6 +1931,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  	if (pollstart && vq->handle_kick)
>  		r = vhost_poll_start(&vq->poll, vq->kick);
>  
> +	if (d->mm)
> +		mmu_notifier_register(&d->mmu_notifier, d->mm);
>  	mutex_unlock(&vq->mutex);
>  
>  	if (pollstop && vq->handle_kick)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 7a7fc00..146076e 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -12,6 +12,8 @@
>  #include <linux/virtio_config.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/atomic.h>
> +#include <linux/pagemap.h>
> +#include <linux/mmu_notifier.h>
>  
>  struct vhost_work;
>  typedef void (*vhost_work_fn_t)(struct vhost_work *work);
> @@ -80,6 +82,13 @@ enum vhost_uaddr_type {
>  	VHOST_NUM_ADDRS = 3,
>  };
>  
> +struct vhost_vmap {
> +	void *addr;
> +	void *unmap_addr;
> +	int npages;
> +	struct page **pages;
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -90,6 +99,11 @@ struct vhost_virtqueue {
>  	struct vring_desc __user *desc;
>  	struct vring_avail __user *avail;
>  	struct vring_used __user *used;
> +
> +	struct vhost_vmap avail_ring;
> +	struct vhost_vmap desc_ring;
> +	struct vhost_vmap used_ring;
> +
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  	struct file *kick;
>  	struct eventfd_ctx *call_ctx;
> @@ -158,6 +172,7 @@ struct vhost_msg_node {
>  
>  struct vhost_dev {
>  	struct mm_struct *mm;
> +	struct mmu_notifier mmu_notifier;
>  	struct mutex mutex;
>  	struct vhost_virtqueue **vqs;
>  	int nvqs;
> @@ -210,6 +225,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len,
>  		    struct iovec *iov, int count);
>  int vq_meta_prefetch(struct vhost_virtqueue *vq);
> +void vq_meta_prefetch_done(struct vhost_virtqueue *vq);
>  
>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
>  void vhost_enqueue_msg(struct vhost_dev *dev,
> -- 
> 1.8.3.1
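
For readers following the thread, the accessor pattern in the quoted patch (use the kernel mapping when it exists, otherwise fall back to the copy path) can be sketched in user space roughly as follows. This is only a sketch under stated assumptions: `vq_sketch`, `map_addr`, `slow_reads`, `invalidate_map()` and the two accessors are hypothetical names, `memcpy()` stands in for the copy_user() path, and none of the locking or notifier ordering discussed in this review is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct avail_sketch {
	uint16_t flags;
	uint16_t idx;
};

struct vq_sketch {
	const struct avail_sketch *user_avail;	/* models the __user pointer */
	const struct avail_sketch *map_addr;	/* models vhost_vmap.addr */
	int slow_reads;				/* counts fallback accesses */
};

/* Slow path: stands in for the copy_user()-based accessor. */
static int slow_get_avail_idx(struct vq_sketch *vq, uint16_t *idx)
{
	memcpy(idx, &vq->user_avail->idx, sizeof(*idx));
	vq->slow_reads++;
	return 0;
}

/* Fast path first; fall back when the mapping has been invalidated. */
static int get_avail_idx(struct vq_sketch *vq, uint16_t *idx)
{
	const struct avail_sketch *avail = vq->map_addr;

	assert(vq && idx);

	if (avail) {
		*idx = avail->idx;
		return 0;
	}
	return slow_get_avail_idx(vq, idx);
}

/* Models what the invalidation callback does to one mapping. */
static void invalidate_map(struct vq_sketch *vq)
{
	vq->map_addr = NULL;
}
```

After `invalidate_map()` every read takes the slow path until something re-establishes `map_addr`, which in the patch is the job of the next metadata prefetch.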

Jason Wang

2019-Mar-07 02:42 UTC

[RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area

On 2019/3/7 2:43, Souptick Joarder wrote:
> On Wed, Mar 6, 2019 at 12:48 PM Jason Wang <jasowang at redhat.com> wrote:
>> Signed-off-by: Jason Wang <jasowang at redhat.com>
> Is the change log left out for any particular reason?

Nope, will add the log.

Thanks

>> ---
>>   drivers/vhost/vhost.c | 46
++++++++++++++++++++++++++++------------------
>>   1 file changed, 28 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 2025543..1015464 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -413,6 +413,27 @@ static void vhost_dev_free_iovecs(struct vhost_dev
*dev)
>>                  vhost_vq_free_iovecs(dev->vqs[i]);
>>   }
>>
>> +static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int
num)
>> +{
>> +       size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ?
2 : 0;
>> +
>> +       return sizeof(*vq->avail) +
>> +              sizeof(*vq->avail->ring) * num + event;
>> +}
>> +
>> +static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
>> +{
>> +       size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ?
2 : 0;
>> +
>> +       return sizeof(*vq->used) +
>> +              sizeof(*vq->used->ring) * num + event;
>> +}
>> +
>> +static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>> +{
>> +       return sizeof(*vq->desc) * num;
>> +}
>> +
>>   void vhost_dev_init(struct vhost_dev *dev,
>>                      struct vhost_virtqueue **vqs, int nvqs, int
iov_limit)
>>   {
>> @@ -1253,13 +1274,9 @@ static bool vq_access_ok(struct vhost_virtqueue
*vq, unsigned int num,
>>                           struct vring_used __user *used)
>>
>>   {
>> -       size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 :
0;
>> -
>> -       return access_ok(desc, num * sizeof *desc) &&
>> -              access_ok(avail,
>> -                        sizeof *avail + num * sizeof *avail->ring +
s) &&
>> -              access_ok(used,
>> -                       sizeof *used + num * sizeof *used->ring +
s);
>> +       return access_ok(desc, vhost_get_desc_size(vq, num)) &&
>> +              access_ok(avail, vhost_get_avail_size(vq, num))
&&
>> +              access_ok(used, vhost_get_used_size(vq, num));
>>   }
>>
>>   static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
>> @@ -1311,22 +1328,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>>
>>   int vq_meta_prefetch(struct vhost_virtqueue *vq)
>>   {
>> -       size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>>          unsigned int num = vq->num;
>>
>>          if (!vq->iotlb)
>>                  return 1;
>>
>>          return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
>> -                              num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
>> +                              vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
>>                 iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
>> -                              sizeof *vq->avail +
>> -                              num * sizeof(*vq->avail->ring) + s,
>> +                              vhost_get_avail_size(vq, num),
>>                                 VHOST_ADDR_AVAIL) &&
>>                 iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
>> -                              sizeof *vq->used +
>> -                              num * sizeof(*vq->used->ring) + s,
>> -                              VHOST_ADDR_USED);
>> +                              vhost_get_used_size(vq, num), VHOST_ADDR_USED);
>>   }
>>   EXPORT_SYMBOL_GPL(vq_meta_prefetch);
>>
>> @@ -1343,13 +1356,10 @@ bool vhost_log_access_ok(struct vhost_dev *dev)
>>   static bool vq_log_access_ok(struct vhost_virtqueue *vq,
>>                               void __user *log_base)
>>   {
>> -       size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>> -
>>          return vq_memory_access_ok(log_base, vq->umem,
>>                                     vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
>>                  (!vq->log_used || log_access_ok(log_base, vq->log_addr,
>> -                                       sizeof *vq->used +
>> -                                       vq->num * sizeof *vq->used->ring + s));
>> +                                 vhost_get_used_size(vq, vq->num)));
>>   }
>>
>>   /* Can we start vq? */
>> --
>> 1.8.3.1
>>
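
For readers following the diff above: the removed lines spell out the arithmetic that the new vhost_get_*_size() helpers are meant to centralize. The following is a minimal userspace sketch of that arithmetic, not the patch's code. The struct layouts are simplified copies of the split-ring structures, the sketch_* names are hypothetical, and the bool parameter stands in for vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace copies of the split-ring layouts (assumption:
 * same field widths as the kernel's vring structures). */
struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

static_assert(sizeof(struct vring_desc) == 16, "descriptor is 16 bytes");
static_assert(sizeof(struct vring_used_elem) == 8, "used element is 8 bytes");

/* Two extra bytes for the event index field when the feature is on;
 * this is the "? 2 : 0" term from the removed lines. */
static size_t sketch_event_size(bool event_idx)
{
	return event_idx ? 2 : 0;
}

/* num * sizeof *desc */
static size_t sketch_desc_size(unsigned int num)
{
	return sizeof(struct vring_desc) * num;
}

/* sizeof *avail + num * sizeof *avail->ring + s */
static size_t sketch_avail_size(unsigned int num, bool event_idx)
{
	return sizeof(struct vring_avail) + sizeof(uint16_t) * num +
	       sketch_event_size(event_idx);
}

/* sizeof *used + num * sizeof *used->ring + s */
static size_t sketch_used_size(unsigned int num, bool event_idx)
{
	return sizeof(struct vring_used) +
	       sizeof(struct vring_used_elem) * num +
	       sketch_event_size(event_idx);
}
```

With these assumed layouts, a 256-entry ring with the event index feature negotiated needs 4096 bytes of descriptors, 518 bytes for the avail ring and 2054 bytes for the used ring.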

Michael S. Tsirkin

2019-Mar-07 15:47 UTC

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> +	.invalidate_range = vhost_invalidate_range,
> +};
> +
>  void vhost_dev_init(struct vhost_dev *dev,
>  		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
>  {

I also wonder here: when a page is write-protected, it does not look like
.invalidate_range is invoked.

E.g. mm/ksm.c calls mmu_notifier_invalidate_range_start and
mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.

Similarly, rmap in page_mkclean_one will not call
mmu_notifier_invalidate_range.

If I'm right, vhost won't get notified when a page is write-protected, since
you didn't install start/end notifiers. Note that the end notifier can be
called with the page locked, so it's not as straightforward as just adding a
call. Writing into a write-protected page isn't a good idea.

Note that the documentation says:
	it is fine to delay the mmu_notifier_invalidate_range
	call to mmu_notifier_invalidate_range_end() outside the page table lock.
implying it's called just later.

-- 
MST
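
The argument in the message above can be made concrete with a toy userspace model. This is only a sketch of the reasoning, not kernel code: the toy_* names are hypothetical, the callback table merely mirrors the shape of struct mmu_notifier_ops, and the write-protect path is assumed to signal only start/end, as the message describes for mm/ksm.c and page_mkclean_one.

```c
#include <assert.h>
#include <stddef.h>

/* Toy callback table; NOT the kernel's struct mmu_notifier_ops. */
struct toy_notifier_ops {
	void (*invalidate_range_start)(int *calls);
	void (*invalidate_range_end)(int *calls);
	void (*invalidate_range)(int *calls);
};

/* Every callback just counts that it was invoked. */
static void toy_count(int *calls)
{
	(*calls)++;
}

/* What the posted patch installs: only .invalidate_range. */
static const struct toy_notifier_ops toy_range_only_ops = {
	.invalidate_range = toy_count,
};

/* The alternative raised in the review: start/end notifiers. */
static const struct toy_notifier_ops toy_start_end_ops = {
	.invalidate_range_start = toy_count,
	.invalidate_range_end = toy_count,
};

/*
 * Assumed write-protect path: it signals start and end but never
 * .invalidate_range. Returns how many callbacks the subscriber saw.
 */
static int toy_write_protect_notifies(const struct toy_notifier_ops *ops)
{
	int calls = 0;

	assert(ops != NULL);
	if (ops->invalidate_range_start)
		ops->invalidate_range_start(&calls);
	/* the page table entry would be made read-only here */
	if (ops->invalidate_range_end)
		ops->invalidate_range_end(&calls);
	return calls;
}
```

In this model a subscriber that registers only .invalidate_range sees zero callbacks on write-protect, so it would keep writing through its stale mapping, while one that registers start/end sees both.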

Christoph Hellwig

2019-Mar-08 14:12 UTC

[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling. This is done through setup kernel address through vmap() and
> register MMU notifier for invalidation.
> 
> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> obvious improvement.

How is this going to work for CPUs with virtually tagged caches?
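
The concern behind this question is presumably cache aliasing: with a virtually indexed or tagged cache, the userspace address and the vmap() address of the same physical page can occupy different cache lines, so a write through one alias may not be visible through the other. The toy model below illustrates that effect under heavy simplification. It is a sketch with hypothetical toy_* names: a tiny direct-mapped write-back cache with one word per line, indexed and tagged purely by virtual address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINES 4	/* cache lines, indexed by virtual address */
#define TOY_MEM   8	/* words of "physical" memory */

struct toy_line {
	int valid, dirty;
	unsigned int vtag;	/* virtual address used as the tag */
	unsigned int paddr;	/* where to write back on eviction */
	uint32_t data;
};

struct toy_cache {
	struct toy_line line[TOY_LINES];
	uint32_t mem[TOY_MEM];
};

/* Find or fill the line for vaddr; paddr is the translation result. */
static struct toy_line *toy_lookup(struct toy_cache *c, unsigned int vaddr,
				   unsigned int paddr)
{
	struct toy_line *l = &c->line[vaddr % TOY_LINES];

	assert(paddr < TOY_MEM);
	if (l->valid && l->vtag == vaddr)
		return l;			/* hit on the virtual tag */
	if (l->valid && l->dirty)
		c->mem[l->paddr] = l->data;	/* write back the victim */
	l->valid = 1;
	l->dirty = 0;
	l->vtag = vaddr;
	l->paddr = paddr;
	l->data = c->mem[paddr];		/* fill from memory */
	return l;
}

static void toy_write(struct toy_cache *c, unsigned int vaddr,
		      unsigned int paddr, uint32_t val)
{
	struct toy_line *l = toy_lookup(c, vaddr, paddr);

	l->data = val;
	l->dirty = 1;
}

/* Write back and invalidate the line holding vaddr, if any. */
static void toy_flush(struct toy_cache *c, unsigned int vaddr)
{
	struct toy_line *l = &c->line[vaddr % TOY_LINES];

	if (l->valid && l->vtag == vaddr) {
		if (l->dirty)
			c->mem[l->paddr] = l->data;
		l->valid = 0;
	}
}

/* Write 7 through a "user" alias, then read through a "kernel" alias. */
static uint32_t toy_alias_demo(int flush_between)
{
	struct toy_cache c;
	const unsigned int paddr = 3, user_va = 0x10, kern_va = 0x21;

	memset(&c, 0, sizeof(c));
	toy_write(&c, user_va, paddr, 7);
	if (flush_between)
		toy_flush(&c, user_va);
	return toy_lookup(&c, kern_va, paddr)->data;
}
```

Without the flush, the read through the second alias returns the stale 0 from memory; with it, the read returns 7. In the kernel, flush_kernel_vmap_range() and invalidate_kernel_vmap_range() exist for vmap aliases on such architectures; whether this series needs them is not settled in this thread.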

Jason Wang

2019-Mar-11 07:13 UTC

[RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

On 2019/3/8 10:12 PM, Christoph Hellwig wrote:
> On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling. This is done through setup kernel address through vmap() and
>> register MMU notifier for invalidation.
>>
>> Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>> obvious improvement.
> How is this going to work for CPUs with virtually tagged caches?

Is there anything in particular that you are worried about? I can run a
test, but do you know of any archs that use virtually tagged caches?

Thanks
