thr3ads.net - Linux Virtualization - [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API [Sep 2014]

If this information is useful, please help other people find it:
Share via:

Andy Lutomirski

2014-Sep-01 17:39 UTC

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

This fixes virtio on Xen guests as well as on any other platform
that uses virtio_pci on which physical addresses don't match bus
addresses.

This can be tested with:

    virtme-run --xen xen --kimg arch/x86/boot/bzImage --console

using virtme from here:

    https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git

Without these patches, the guest hangs forever.  With these patches,
everything works.

This should be safe on all platforms that I'm aware of.  That
doesn't mean that there isn't anything that I missed.

Thanks to everyone for putting up with the development of this
series.  Hopefully it'll be the end of DMA issues in virtio. :)

Changes from v3:
 - virtio_pci only asks virtio_ring to use the DMA_API if
   !PCI_DMA_BUS_IS_PHYS.
 - Reduce tools/virtio breakage.  It's now merely as broken as before
   instead of being even more broken.
 - Drop the sg_next changes -- Rusty's version is better.

Changes from v2:
 - Reordered patches.
 - Fixed a virtio_net OOPS.

Changes from v1:
 - Using the DMA API is optional now.  It would be nice to improve the
   DMA API to the point that it could be used unconditionally, but s390
   proves that we're not there yet.
 - Includes patch 4, which fixes DMA debugging warnings from virtio_net.

Andy Lutomirski (4):
  virtio_ring: Support DMA APIs if requested
  virtio_pci: Use the DMA API for virtqueues
  virtio_net: Don't set the end flag on reusable sg entries
  virtio_net: Stop doing DMA from the stack

 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/net/virtio_net.c               |  59 +++++++----
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |  41 ++++++--
 drivers/virtio/virtio_ring.c           | 187 +++++++++++++++++++++++++++++----
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 14 files changed, 268 insertions(+), 63 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
1.9.3

Andy Lutomirski

2014-Sep-01 17:39 UTC

head link

[PATCH v4 1/4] virtio_ring: Support DMA APIs if requested

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.  virtio-net warns (correctly) about DMA from
the stack in virtnet_set_rx_mode.

This explicitly supports !CONFIG_HAS_DMA.  If vring is asked to use
the DMA API and CONFIG_HAS_DMA is not set, then vring will refuse to
create the virtqueue.

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---
 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |   3 +-
 drivers/virtio/virtio_ring.c           | 187 +++++++++++++++++++++++++++++----
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 13 files changed, 199 insertions(+), 35 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index d0a1d8a45c81..f0eafbe82ed4 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -301,7 +301,8 @@ static struct virtqueue *lg_find_vq(struct virtio_device
*vdev,
 	 * barriers.
 	 */
 	vq = vring_new_virtqueue(index, lvq->config.num, LGUEST_VRING_ALIGN, vdev,
-				 true, lvq->pages, lg_notify, callback, name);
+				 true, false, lvq->pages,
+				 lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/misc/mic/card/mic_virtio.c
b/drivers/misc/mic/card/mic_virtio.c
index f14b60080c21..d633964417b1 100644
--- a/drivers/misc/mic/card/mic_virtio.c
+++ b/drivers/misc/mic/card/mic_virtio.c
@@ -256,7 +256,7 @@ static struct virtqueue *mic_find_vq(struct virtio_device
*vdev,
 	mvdev->vr[index] = va;
 	memset_io(va, 0x0, _vr_size);
 	vq = vring_new_virtqueue(index, le16_to_cpu(config.num),
-				 MIC_VIRTIO_RING_ALIGN, vdev, false,
+				 MIC_VIRTIO_RING_ALIGN, vdev, false, false,
 				 (void __force *)va, mic_notify, callback,
 				 name);
 	if (!vq) {
diff --git a/drivers/remoteproc/remoteproc_virtio.c
b/drivers/remoteproc/remoteproc_virtio.c
index a34b50690b4e..e31f2fefa76e 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -107,8 +107,8 @@ static struct virtqueue *rp_find_vq(struct virtio_device
*vdev,
 	 * Create the new vq, and tell virtio we're not interested in
 	 * the 'weak' smp barriers, since we're talking with a real
device.
 	 */
-	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, addr,
-					rproc_virtio_notify, callback, name);
+	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, false,
+				 addr, rproc_virtio_notify, callback, name);
 	if (!vq) {
 		dev_err(dev, "vring_new_virtqueue %s failed\n", name);
 		rproc_free_vring(rvring);
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index a1349653c6d9..91abcdc196d0 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -206,7 +206,7 @@ static struct virtqueue *kvm_find_vq(struct virtio_device
*vdev,
 		goto out;
 
 	vq = vring_new_virtqueue(index, config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, true, (void *) config->address,
+				 vdev, true, false, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index d2c0b442bce5..2462a443358a 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -478,8 +478,8 @@ static struct virtqueue *virtio_ccw_setup_vq(struct
virtio_device *vdev,
 	}
 
 	vq = vring_new_virtqueue(i, info->num, KVM_VIRTIO_CCW_RING_ALIGN, vdev,
-				 true, info->queue, virtio_ccw_kvm_notify,
-				 callback, name);
+				 true, false, info->queue,
+				 virtio_ccw_kvm_notify, callback, name);
 	if (!vq) {
 		/* For now, we fail if we can't get the requested size. */
 		dev_warn(&vcdev->cdev->dev, "no vq\n");
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index c600ccfd6922..693254e52a5d 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -366,8 +366,9 @@ static struct virtqueue *vm_setup_vq(struct virtio_device
*vdev, unsigned index,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
 	/* Create the vring */
-	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
-				 true, info->queue, vm_notify, callback, name);
+	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN,
+				 vdev, true, false, info->queue,
+				 vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3d1463c6b120..a1f299fa4626 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -430,7 +430,8 @@ static struct virtqueue *setup_vq(struct virtio_device
*vdev, unsigned index,
 
 	/* create the vring */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, info->queue, vp_notify, callback, name);
+				 true, false, info->queue,
+				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a9c29..7e10770edd0f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include <linux/module.h>
 #include <linux/hrtimer.h>
 #include <linux/kmemleak.h>
+#include <linux/dma-mapping.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,6 +55,12 @@
 #define END_USE(vq)
 #endif
 
+struct vring_desc_state
+{
+	void *data;			/* Data for callback. */
+	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
+};
+
 struct vring_virtqueue
 {
 	struct virtqueue vq;
@@ -64,6 +71,9 @@ struct vring_virtqueue
 	/* Can we use weak barriers? */
 	bool weak_barriers;
 
+	/* Should we use the DMA API? */
+	bool use_dma_api;
+
 	/* Other side has made a mess, don't try any more. */
 	bool broken;
 
@@ -93,8 +103,8 @@ struct vring_virtqueue
 	ktime_t last_add_time;
 #endif
 
-	/* Tokens for callbacks. */
-	void *data[];
+	/* Per-descriptor state. */
+	struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
@@ -113,6 +123,83 @@ static inline struct scatterlist *sg_next_arr(struct
scatterlist *sg,
 	return sg + 1;
 }
 
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+				   struct scatterlist *sg,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	/*
+	 * We can't use dma_map_sg, because we don't use scatterlists in
+	 * the way it expects (we sometimes use unterminated
+	 * scatterlists, and we don't guarantee that the scatterlist
+	 * will exist for the lifetime of the mapping.
+	 */
+	if (vq->use_dma_api)
+		return dma_map_page(vq->vq.vdev->dev.parent,
+				    sg_page(sg), sg->offset, sg->length,
+				    direction);
+#endif
+
+	return sg_phys(sg);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+				   void *cpu_addr, size_t size,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	if (vq->use_dma_api)
+		return dma_map_single(vq->vq.vdev->dev.parent,
+				      cpu_addr, size,
+				      direction);
+#endif
+
+	return virt_to_phys(cpu_addr);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+			    struct vring_desc *desc)
+{
+#ifdef CONFIG_HAS_DMA
+	if (!vq->use_dma_api)
+		return;		/* Nothing to do. */
+
+	if (desc->flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vq->vq.vdev->dev.parent,
+				 desc->addr, desc->len,
+				 (desc->flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vq->vq.vdev->dev.parent,
+			       desc->addr, desc->len,
+			       (desc->flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+#endif
+}
+
+static void vring_unmap_indirect(const struct vring_virtqueue *vq,
+				 struct vring_desc *desc, int total)
+{
+	int i;
+
+	if (vq->use_dma_api)
+		for (i = 0; i < total; i++)
+			vring_unmap_one(vq, &desc[i]);
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+#ifdef CONFIG_HAS_DMA
+	return vq->use_dma_api &&
+		dma_mapping_error(vq->vq.vdev->dev.parent, addr);
+#else
+	return 0;
+#endif
+}
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
 				     struct scatterlist *sgs[],
@@ -146,7 +233,10 @@ static inline int vring_add_indirect(struct vring_virtqueue
*vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			desc[i].flags = VRING_DESC_F_NEXT;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr +				vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_free;
 			desc[i].len = sg->length;
 			desc[i].next = i+1;
 			i++;
@@ -155,7 +245,10 @@ static inline int vring_add_indirect(struct vring_virtqueue
*vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr +				vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_free;
 			desc[i].len = sg->length;
 			desc[i].next = i+1;
 			i++;
@@ -173,15 +266,26 @@ static inline int vring_add_indirect(struct
vring_virtqueue *vq,
 	/* Use a single buffer which doesn't continue */
 	head = vq->free_head;
 	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
-	vq->vring.desc[head].addr = virt_to_phys(desc);
-	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
-	kmemleak_ignore(desc);
+	vq->vring.desc[head].addr +		vring_map_single(vq,
+				 desc, i * sizeof(struct vring_desc),
+				 DMA_TO_DEVICE);
+	if (vring_mapping_error(vq, vq->vring.desc[head].addr))
+		goto unmap_free;
 	vq->vring.desc[head].len = i * sizeof(struct vring_desc);
 
 	/* Update free pointer */
 	vq->free_head = vq->vring.desc[head].next;
 
+	/* Save the indirect block */
+	vq->desc_state[head].indir_desc = desc;
+
 	return head;
+
+unmap_free:
+	vring_unmap_indirect(vq, desc, i);
+	kfree(desc);
+	return -ENOMEM;
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
@@ -197,7 +301,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
-	unsigned int i, n, avail, uninitialized_var(prev), total_sg;
+	unsigned int i, n, avail, uninitialized_var(prev), total_sg, err_idx;
 	int head;
 
 	START_USE(vq);
@@ -256,7 +360,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
-			vq->vring.desc[i].addr = sg_phys(sg);
+			vq->vring.desc[i].addr +				vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
+			if (vring_mapping_error(vq, vq->vring.desc[i].addr))
+				goto unmap_release;
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
 			i = vq->vring.desc[i].next;
@@ -265,7 +372,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			vq->vring.desc[i].addr = sg_phys(sg);
+			vq->vring.desc[i].addr +				vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, vq->vring.desc[i].addr))
+				goto unmap_release;
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
 			i = vq->vring.desc[i].next;
@@ -279,7 +389,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 add_head:
 	/* Set token. */
-	vq->data[head] = data;
+	vq->desc_state[head].data = data;
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
@@ -301,6 +411,20 @@ add_head:
 	END_USE(vq);
 
 	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_one(vq, &vq->vring.desc[i]);
+		i = vq->vring.desc[i].next;
+	}
+
+	vq->vq.num_free += total_sg;
+	return -EIO;
 }
 
 /**
@@ -480,22 +604,33 @@ static void detach_buf(struct vring_virtqueue *vq,
unsigned int head)
 	unsigned int i;
 
 	/* Clear data ptr. */
-	vq->data[head] = NULL;
+	vq->desc_state[head].data = NULL;
 
 	/* Put back on free list: find end */
 	i = head;
 
 	/* Free the indirect table */
-	if (vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT)
-		kfree(phys_to_virt(vq->vring.desc[i].addr));
+	if (vq->desc_state[head].indir_desc) {
+		u32 len = vq->vring.desc[i].len;
+
+		BUG_ON(!(vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+		vring_unmap_indirect(vq, vq->desc_state[head].indir_desc,
+				     len / sizeof(struct vring_desc));
+		kfree(vq->desc_state[head].indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	}
 
 	while (vq->vring.desc[i].flags & VRING_DESC_F_NEXT) {
+		vring_unmap_one(vq, &vq->vring.desc[i]);
 		i = vq->vring.desc[i].next;
 		vq->vq.num_free++;
 	}
 
+	vring_unmap_one(vq, &vq->vring.desc[i]);
 	vq->vring.desc[i].next = vq->free_head;
 	vq->free_head = head;
+
 	/* Plus final descriptor */
 	vq->vq.num_free++;
 }
@@ -552,13 +687,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned
int *len)
 		BAD_RING(vq, "id %u out of range\n", i);
 		return NULL;
 	}
-	if (unlikely(!vq->data[i])) {
+	if (unlikely(!vq->desc_state[i].data)) {
 		BAD_RING(vq, "id %u is not a head!\n", i);
 		return NULL;
 	}
 
 	/* detach_buf clears data, so grab it now. */
-	ret = vq->data[i];
+	ret = vq->desc_state[i].data;
 	detach_buf(vq, i);
 	vq->last_used_idx++;
 	/* If we expect an interrupt for the next entry, tell host
@@ -719,10 +854,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 	START_USE(vq);
 
 	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->data[i])
+		if (!vq->desc_state[i].data)
 			continue;
 		/* detach_buf clears data, so grab it now. */
-		buf = vq->data[i];
+		buf = vq->desc_state[i].data;
 		detach_buf(vq, i);
 		vq->vring.avail->idx--;
 		END_USE(vq);
@@ -761,6 +896,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *),
 				      void (*callback)(struct virtqueue *),
@@ -775,7 +911,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 		return NULL;
 	}
 
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+#ifndef CONFIG_HAS_DMA
+	if (use_dma_api)
+		return NULL;
+#endif
+
+	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
@@ -787,6 +929,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 	vq->vq.index = index;
 	vq->notify = notify;
 	vq->weak_barriers = weak_barriers;
+	vq->use_dma_api = use_dma_api;
 	vq->broken = false;
 	vq->last_used_idx = 0;
 	vq->num_added = 0;
@@ -805,11 +948,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
 	/* Put everything in free lists. */
 	vq->free_head = 0;
-	for (i = 0; i < num-1; i++) {
+	for (i = 0; i < num-1; i++)
 		vq->vring.desc[i].next = i+1;
-		vq->data[i] = NULL;
-	}
-	vq->data[i] = NULL;
+	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 67e06fe18c03..60f761a38a09 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -70,6 +70,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
new file mode 100644
index 000000000000..4f93af89ae16
--- /dev/null
+++ b/tools/virtio/linux/dma-mapping.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_DMA_MAPPING_H
+#define _LINUX_DMA_MAPPING_H
+
+#ifdef CONFIG_HAS_DMA
+# error Virtio userspace code does not support CONFIG_HAS_DMA
+#endif
+
+#define PCI_DMA_BUS_IS_PHYS 1
+
+enum dma_data_direction {
+	DMA_BIDIRECTIONAL = 0,
+	DMA_TO_DEVICE = 1,
+	DMA_FROM_DEVICE = 2,
+	DMA_NONE = 3,
+};
+
+#endif
diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 5a2d1f0f6bc7..5d42dc6a6201 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -78,6 +78,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 00ea679b3826..860cc89900a7 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -99,7 +99,7 @@ static void vq_info_add(struct vdev_info *dev, int num)
 	vring_init(&info->vring, num, info->ring, 4096);
 	info->vq = vring_new_virtqueue(info->idx,
 				       info->vring.num, 4096, &dev->vdev,
-				       true, info->ring,
+				       true, false, info->ring,
 				       vq_notify, vq_callback, "test");
 	assert(info->vq);
 	info->vq->priv = info;
diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
index 14a4f4cab5b9..67d3c3a1ba88 100644
--- a/tools/virtio/vringh_test.c
+++ b/tools/virtio/vringh_test.c
@@ -312,7 +312,8 @@ static int parallel_test(unsigned long features,
 		if (sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set))
 			err(1, "Could not set affinity to cpu %u", first_cpu);
 
-		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev, true,
+		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev,
+					 true, false,
 					 guest_map, fast_vringh ? no_notify_host
 					 : parallel_notify_host,
 					 never_callback_guest, "guest vq");
-- 
1.9.3

Andy Lutomirski

2014-Sep-01 17:39 UTC

head link

[PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues

A virtqueue is a coherent DMA mapping.  Use the DMA API for it.
This fixes virtio_pci on Xen.

As an optimization, this only enables asks virtio_ring to use the
DMA API if !PCI_DMA_BUS_IS_PHYS.  Eventually, once the DMA API is
known to be efficient on all relevant architectures, this
optimization can be removed.

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---
 drivers/virtio/virtio_pci.c | 40 +++++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index a1f299fa4626..226b46b08727 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -80,8 +80,9 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the virtual address of the ring queue */
-	void *queue;
+	/* the ring queue */
+	void *queue;			/* virtual address */
+	dma_addr_t queue_dma_addr;	/* bus address */
 
 	/* the list node for the virtqueues list */
 	struct list_head node;
@@ -417,20 +418,32 @@ static struct virtqueue *setup_vq(struct virtio_device
*vdev, unsigned index,
 	info->num = num;
 	info->msix_vector = msix_vec;
 
-	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
+	info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
+					  &info->queue_dma_addr, GFP_KERNEL);
 	if (info->queue == NULL) {
 		err = -ENOMEM;
 		goto out_info;
 	}
 
 	/* activate the queue */
-	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-	/* create the vring */
+	/*
+	 * Create the vring.  If there is an IOMMU of any sort, including
+	 * Xen paravirt's ersatz IOMMU, use it.  If the host wants physical
+	 * addresses instead of bus addresses, the host shouldn't expose
+	 * an IOMMU.
+	 *
+	 * As an optimization, if the platform promises to have physical
+	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
+	 * platform's DMA API implementation is well optimized, this
+	 * should have almost no effect, but that's a dangerous thing to
+	 * rely on.
+	 */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, false, info->queue,
+				 true, !PCI_DMA_BUS_IS_PHYS, info->queue,
 				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -463,7 +476,8 @@ out_assign:
 	vring_del_virtqueue(vq);
 out_activate_queue:
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(vdev->dev.parent, size,
+			  info->queue, info->queue_dma_addr);
 out_info:
 	kfree(info);
 	return ERR_PTR(err);
@@ -494,7 +508,8 @@ static void vp_del_vq(struct virtqueue *vq)
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-	free_pages_exact(info->queue, size);
+	dma_free_coherent(vq->vdev->dev.parent, size,
+			  info->queue, info->queue_dma_addr);
 	kfree(info);
 }
 
@@ -713,6 +728,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
 	if (err)
 		goto out;
 
+	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+	if (err)
+		err = dma_set_mask_and_coherent(&pci_dev->dev,
+						DMA_BIT_MASK(32));
+	if (err)
+		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
Trying to continue, but this might not work.\n");
+
 	err = pci_request_regions(pci_dev, "virtio-pci");
 	if (err)
 		goto out_enable_device;
-- 
1.9.3

Andy Lutomirski

2014-Sep-01 17:39 UTC

head link

[PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries

Every time virtio_net calls skb_to_sgvec, an end flag gets set
somewhere on the queue's scatterlist and never gets cleared.  As
soon as a larger request happens, virtio_net sends the virtqueue a
scatterlist with an end mark set in the middle.  Once the vring code
starts using for_each_sg, this will blow up.

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 59caa06f34a6..c90466a4fab0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -548,7 +548,7 @@ static int add_recvbuf_small(struct receive_queue *rq, gfp_t
gfp)
 	hdr = skb_vnet_hdr(skb);
 	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
-	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
+	skb_to_sgvec_nomark(skb, rq->sg + 1, 0, skb->len);
 
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
 	if (err < 0)
@@ -901,12 +901,12 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff
*skb)
 
 	if (can_push) {
 		__skb_push(skb, hdr_len);
-		num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+		num_sg = skb_to_sgvec_nomark(skb, sq->sg, 0, skb->len);
 		/* Pull header back to avoid skew in tx bytes calculations. */
 		__skb_pull(skb, hdr_len);
 	} else {
 		sg_set_buf(sq->sg, hdr, hdr_len);
-		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+		num_sg = skb_to_sgvec_nomark(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
 	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
1.9.3

Andy Lutomirski

2014-Sep-01 17:39 UTC

head link

[PATCH v4 4/4] virtio_net: Stop doing DMA from the stack

Now that virtio supports real DMA, drivers should play by the rules.
For virtio_net, that means that DMA should be done to and from
dynamically-allocated memory, not the kernel stack.

This should have no effect on any performance-critical code paths.

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---
 drivers/net/virtio_net.c | 53 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c90466a4fab0..25703fd2df28 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -966,31 +966,43 @@ static bool virtnet_send_command(struct virtnet_info *vi,
u8 class, u8 cmd,
 				 struct scatterlist *out)
 {
 	struct scatterlist *sgs[4], hdr, stat;
-	struct virtio_net_ctrl_hdr ctrl;
-	virtio_net_ctrl_ack status = ~0;
+
+	struct {
+		struct virtio_net_ctrl_hdr ctrl;
+		virtio_net_ctrl_ack status;
+	} *buf;
+
 	unsigned out_num = 0, tmp;
+	bool ret;
 
 	/* Caller should know better */
 	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-	ctrl.class = class;
-	ctrl.cmd = cmd;
+	buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+	if (!buf)
+		return false;
+	buf->status = ~0;
+
+	buf->ctrl.class = class;
+	buf->ctrl.cmd = cmd;
 	/* Add header */
-	sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+	sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl));
 	sgs[out_num++] = &hdr;
 
 	if (out)
 		sgs[out_num++] = out;
 
 	/* Add return status. */
-	sg_init_one(&stat, &status, sizeof(status));
+	sg_init_one(&stat, &buf->status, sizeof(buf->status));
 	sgs[out_num] = &stat;
 
 	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
 	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-	if (unlikely(!virtqueue_kick(vi->cvq)))
-		return status == VIRTIO_NET_OK;
+	if (unlikely(!virtqueue_kick(vi->cvq))) {
+		ret = (buf->status == VIRTIO_NET_OK);
+		goto out;
+	}
 
 	/* Spin for a response, the kick causes an ioport write, trapping
 	 * into the hypervisor, so the request should be handled immediately.
@@ -999,7 +1011,11 @@ static bool virtnet_send_command(struct virtnet_info *vi,
u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+	kfree(buf);
+	return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1140,7 +1156,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
+	u8 *cmdbyte;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1152,22 +1168,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+	if (!cmdbyte)
+		return;
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+	*cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
-
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+			 *cmdbyte ? "en" : "dis");
 
+	*cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 *cmdbyte ? "en" : "dis");
+
+	kfree(cmdbyte);
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
1.9.3

Benjamin Herrenschmidt

2014-Sep-01 22:16 UTC

head link

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski
wrote:> Changes from v1:
>  - Using the DMA API is optional now.  It would be nice to improve the
>    DMA API to the point that it could be used unconditionally, but s390
>    proves that we're not there yet.
>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
I'm not sure if you saw my reply on the other thread but I have a few
comments based on the above "it would be nice if ..."

So here we have both a yes and a no :-)

It would be nice to avoid those if () games all over and indeed just
use the DMA API, *however* we most certainly don't want to actually
create IOMMU mappings for the KVM virio case. This would be a massive
loss in performances on several platforms and generally doesn't make
much sense.

However, we can still use the API without that on any architecture
where the dma mapping API ends up calling the generic dma_map_ops,
it becomes just a matter of virtio setting up some special "nop" ops
when needed.

The difficulty here resides in the fact that we have never completely
made the dma_map_ops generic. The ops themselves are defined generically
as are the dma_map_* interfaces based on them, but the location of the
ops pointer is still more/less arch specific and some architectures
still chose not to use that indirection at all I believe.

Cheers,
Ben.

Andy Lutomirski

2014-Sep-02 05:55 UTC

head link

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

On Mon, Sep 1, 2014 at 3:16 PM, Benjamin Herrenschmidt
<benh at kernel.crashing.org> wrote:> On Mon, 2014-09-01 at 10:39 -0700, Andy Lutomirski wrote:
>> Changes from v1:
>>  - Using the DMA API is optional now.  It would be nice to improve the
>>    DMA API to the point that it could be used unconditionally, but s390
>>    proves that we're not there yet.
>>  - Includes patch 4, which fixes DMA debugging warnings from
virtio_net.
>
> I'm not sure if you saw my reply on the other thread but I have a few
> comments based on the above "it would be nice if ..."
>
Yeah, sorry, I sort of thought I responded, but I didn't do a very good job.
> So here we have both a yes and a no :-)
>
> It would be nice to avoid those if () games all over and indeed just
> use the DMA API, *however* we most certainly don't want to actually
> create IOMMU mappings for the KVM virio case. This would be a massive
> loss in performances on several platforms and generally doesn't make
> much sense.
>
> However, we can still use the API without that on any architecture
> where the dma mapping API ends up calling the generic dma_map_ops,
> it becomes just a matter of virtio setting up some special "nop"
ops
> when needed.
I'm not quite convinced that this is a good idea.  I think that there
are three relevant categories of virtio devices:

a) Any virtio device where the normal DMA ops are nops.  This includes
x86 without an IOMMU (e.g. in a QEMU/KVM guest), 32-bit ARM, and
probably many other architectures.  In this case, what we do only
matters for performance, not for correctness.  Ideally the arch DMA
ops are fast.

b) Virtio devices that use physical addressing on systems where DMA
ops either don't exist at all (most s390) or do something nontrivial.
In this case, we must either override the DMA ops or just not use
them.

c) Virtio devices that use bus addressing.  This includes everything
on Xen (because the "physical" addresses are nonsense) and any actual
physical PCI device that speaks virtio on a system with an IOMMU.  In
this case, we must use the DMA ops.

The issue is that, on systems with DMA ops that do something, we need
to make sure that we know whether we're in case (b) or (c).  In these
patches, I've made the assumption that, if the virtio devices lives on
the PCI bus, then it uses the same type of addressing that any other
device on that PCI bus would use.

On x86, at least, I doubt that we'll ever see a physically addressed
PCI virtio device for which ACPI advertises an IOMMU, since any sane
hypervisor will just not advertise an IOMMU for the virtio device.
But are there arm64 or PPC guests that use virtio_pci, that have
IOMMUs, and that will malfunction if the virtio_pci driver ends up
using the IOMMU?  I certainly hope not, since these systems might be
very hard-pressed to work right if someone plugged in a physical
virtio-speaking PCI device.
>
> The difficulty here resides in the fact that we have never completely
> made the dma_map_ops generic. The ops themselves are defined generically
> as are the dma_map_* interfaces based on them, but the location of the
> ops pointer is still more/less arch specific and some architectures
> still chose not to use that indirection at all I believe.
>
I'd be happy to update the patches if someone does this, but I don't
really want to attack the DMA API on all architectures right now.  In
the mean time, at least s390 requires that we be able to compile out
the DMA API calls.  I'd rather see s390 provide working no-op dma ops
for all of the struct devices that provide virtio interfaces.

On a related note, shouldn't virtio be doing something to provide dma
ops to the virtio device and any of its children?  I don't know how it
would even try to do this, given how architecture-dependent this code
currently is.  Calling dma_map_single on the virtio device (as opposed
to its parent) is currently likely to crash on x86.  Fortunately,
nothing does this.

--Andy

Seemingly Similar Threads

Search for more possibly parallel threads

Linux Virtualization - Sep 2014 - [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

[PATCH v4 1/4] virtio_ring: Support DMA APIs if requested

[PATCH v4 2/4] virtio_pci: Use the DMA API for virtqueues

[PATCH v4 3/4] virtio_net: Don't set the end flag on reusable sg entries

[PATCH v4 4/4] virtio_net: Stop doing DMA from the stack

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

[PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

Seemingly Similar Threads