This switches virtio to use the DMA API unconditionally. I'm sure it breaks things, but it seems to work on x86 using virtio-pci, with and without Xen, and using both the modern 1.0 variant and the legacy variant. This appears to work on native and Xen x86_64 using both modern and legacy virtio-pci. It also appears to work on arm and arm64. It definitely won't work as-is on s390x, and I haven't been able to test Christian's patches because I can't get virtio-ccw to work in QEMU at all. I don't know what I'm doing wrong. It doesn't work on ppc64. Ben, consider yourself pinged to send me a patch :) It doesn't work on sparc64. I didn't realize at Kernel Summit that sparc64 has the same problem as ppc64. DaveM, for background, we're trying to fix virtio to use the DMA API. That will require that every platform that uses virtio supplies valid DMA operations on devices that use virtio_ring. Unfortunately, QEMU historically ignores the IOMMU on virtio devices. On x86, this isn't really a problem. x86 has a nice way for the platform to describe which devices are behind an IOMMU, and QEMU will be adjusted accordingly. The only thing that will break is a recently-added experimental mode. Ben's plan for powerpc is to add a quirk for existing virtio-pci devices and to eventually update the devicetree stuff to allow QEMU to tell the guest which devices use the IOMMU. AFAICT sparc has a similar problem to powerpc. DaveM, can you come up with a straightforward way to get sparc's DMA API to work correctly for virtio-pci devices? NB: Sadly, the platforms I've successfully tested on don't include any big-endian platforms, so there could still be lurking endian problems. Changes from v3: - More big-endian fixes. - Added better virtio-ring APIs that handle allocation and use them in virtio-mmio and virtio-pci. - Switch to Michael's virtio-net patch. Changes from v2: - Fix vring_mapping_error incorrect argument Changes from v1: - Fix an endian conversion error causing a BUG to hit. - Fix a DMA ordering issue (swiotlb=force works now). - Minor cleanups. Andy Lutomirski (5): virtio_ring: Support DMA APIs virtio_pci: Use the DMA API virtio: Add improved queue allocation API virtio_mmio: Use the DMA API virtio_pci: Use the DMA API Michael S. Tsirkin (1): virtio-net: Stop doing DMA from the stack drivers/net/virtio_net.c | 34 ++-- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_mmio.c | 67 ++----- drivers/virtio/virtio_pci_common.h | 6 - drivers/virtio/virtio_pci_legacy.c | 42 ++--- drivers/virtio/virtio_pci_modern.c | 61 ++----- drivers/virtio/virtio_ring.c | 348 ++++++++++++++++++++++++++++++------- include/linux/virtio.h | 23 ++- include/linux/virtio_ring.h | 35 ++++ tools/virtio/linux/dma-mapping.h | 17 ++ 10 files changed, 426 insertions(+), 209 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h -- 2.4.3
Andy Lutomirski
2015-Oct-30 01:09 UTC
[PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
From: "Michael S. Tsirkin" <mst at redhat.com> Once virtio starts using the DMA API, we won't be able to safely DMA from the stack. virtio-net does a couple of config DMA requests from small stack buffers -- switch to using dynamically-allocated memory. This should have no effect on any performance-critical code paths. [I wrote the subject and commit message. mst wrote the code. --luto] Signed-off-by: Andy Lutomirski <luto at kernel.org> signed-off-by: Michael S. Tsirkin <mst at redhat.com> --- drivers/net/virtio_net.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d8838dedb7a4..f94ab786088f 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -140,6 +140,12 @@ struct virtnet_info { /* CPU hot plug notifier */ struct notifier_block nb; + + /* Control VQ buffers: protected by the rtnl lock */ + struct virtio_net_ctrl_hdr ctrl_hdr; + virtio_net_ctrl_ack ctrl_status; + u8 ctrl_promisc; + u8 ctrl_allmulti; }; struct padded_vnet_hdr { @@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; - struct virtio_net_ctrl_hdr ctrl; - virtio_net_ctrl_ack status = ~0; unsigned out_num = 0, tmp; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); - ctrl.class = class; - ctrl.cmd = cmd; + vi->ctrl_status = ~0; + vi->ctrl_hdr.class = class; + vi->ctrl_hdr.cmd = cmd; /* Add header */ - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); + sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr)); sgs[out_num++] = &hdr; if (out) sgs[out_num++] = out; /* Add return status. */ - sg_init_one(&stat, &status, sizeof(status)); + sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status)); sgs[out_num] = &stat; BUG_ON(out_num + 1 > ARRAY_SIZE(sgs)); virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC); if (unlikely(!virtqueue_kick(vi->cvq))) - return status == VIRTIO_NET_OK; + return vi->ctrl_status == VIRTIO_NET_OK; /* Spin for a response, the kick causes an ioport write, trapping * into the hypervisor, so the request should be handled immediately. @@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, !virtqueue_is_broken(vi->cvq)) cpu_relax(); - return status == VIRTIO_NET_OK; + return vi->ctrl_status == VIRTIO_NET_OK; } static int virtnet_set_mac_address(struct net_device *dev, void *p) @@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); struct scatterlist sg[2]; - u8 promisc, allmulti; struct virtio_net_ctrl_mac *mac_data; struct netdev_hw_addr *ha; int uc_count; @@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev) if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX)) return; - promisc = ((dev->flags & IFF_PROMISC) != 0); - allmulti = ((dev->flags & IFF_ALLMULTI) != 0); + vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0); + vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0); - sg_init_one(sg, &promisc, sizeof(promisc)); + sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_PROMISC, sg)) dev_warn(&dev->dev, "Failed to %sable promisc mode.\n", - promisc ? "en" : "dis"); + vi->ctrl_promisc ? "en" : "dis"); - sg_init_one(sg, &allmulti, sizeof(allmulti)); + sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_ALLMULTI, sg)) dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n", - allmulti ? "en" : "dis"); + vi->ctrl_allmulti ? "en" : "dis"); uc_count = netdev_uc_count(dev); mc_count = netdev_mc_count(dev); -- 2.4.3
virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical "virtio" device ever ends up behind an IOMMU or swiotlb. The immediate use case for me is to enable virtio on Xen guests. For that to work, we need vring to support DMA address translation as well as a corresponding change to virtio_pci or to another driver. With this patch, if enabled, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. Signed-off-by: Andy Lutomirski <luto at kernel.org> --- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_ring.c | 190 +++++++++++++++++++++++++++++++-------- tools/virtio/linux/dma-mapping.h | 17 ++++ 3 files changed, 172 insertions(+), 37 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index cab9f3f63a38..77590320d44c 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -60,7 +60,7 @@ config VIRTIO_INPUT config VIRTIO_MMIO tristate "Platform bus driver for memory mapped virtio devices" - depends on HAS_IOMEM + depends on HAS_IOMEM && HAS_DMA select VIRTIO ---help--- This drivers provides support for memory mapped virtio diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 096b857e7b75..a872eb89587f 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include <linux/module.h> #include <linux/hrtimer.h> #include <linux/kmemleak.h> +#include <linux/dma-mapping.h> #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,7 +55,14 @@ #define END_USE(vq) #endif -struct vring_virtqueue { +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ +}; + +struct vring_virtqueue +{ struct virtqueue vq; /* Actual memory layout for this queue */ @@ -92,12 +100,71 @@ struct vring_virtqueue { ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* + * The DMA ops on various arches are rather gnarly right now, and + * making all of the arch DMA ops work on the vring device itself + * is a mess. For now, we use the parent device for DMA ops. + */ +struct device *vring_dma_dev(const struct vring_virtqueue *vq) +{ + return vq->vq.vdev->dev.parent; +} + +/* Map one sg entry. */ +static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, + struct scatterlist *sg, + enum dma_data_direction direction) +{ + /* + * We can't use dma_map_sg, because we don't use scatterlists in + * the way it expects (we don't guarantee that the scatterlist + * will exist for the lifetime of the mapping). + */ + return dma_map_page(vring_dma_dev(vq), + sg_page(sg), sg->offset, sg->length, + direction); +} + +static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dma_map_single(vring_dma_dev(vq), + cpu_addr, size, direction); +} + +static void vring_unmap_one(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? + DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? + DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + return dma_mapping_error(vring_dma_dev(vq), addr); +} + static struct vring_desc *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg, gfp_t gfp) { @@ -131,7 +198,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, struct vring_virtqueue *vq = to_vvq(_vq); struct scatterlist *sg; struct vring_desc *desc; - unsigned int i, n, avail, descs_used, uninitialized_var(prev); + unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; int head; bool indirect; @@ -171,21 +238,15 @@ static inline int virtqueue_add(struct virtqueue *_vq, if (desc) { /* Use a single buffer which doesn't continue */ - vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT); - vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, virt_to_phys(desc)); - /* avoid kmemleak false positive (hidden by virt_to_phys) */ - kmemleak_ignore(desc); - vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, total_sg * sizeof(struct vring_desc)); - + indirect = true; /* Set up rest to use this indirect table. */ i = 0; descs_used = 1; - indirect = true; } else { + indirect = false; desc = vq->vring.desc; i = head; descs_used = total_sg; - indirect = false; } if (vq->vq.num_free < descs_used) { @@ -200,13 +261,14 @@ static inline int virtqueue_add(struct virtqueue *_vq, return -ENOSPC; } - /* We're about to use some buffers from the free list. */ - vq->vq.num_free -= descs_used; - for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { + dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE); + if (vring_mapping_error(vq, addr)) + goto unmap_release; + desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT); - desc[i].addr = cpu_to_virtio64(_vq->vdev, sg_phys(sg)); + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length); prev = i; i = virtio16_to_cpu(_vq->vdev, desc[i].next); @@ -214,8 +276,12 @@ static inline int virtqueue_add(struct virtqueue *_vq, } for (; n < (out_sgs + in_sgs); n++) { for (sg = sgs[n]; sg; sg = sg_next(sg)) { + dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE); + if (vring_mapping_error(vq, addr)) + goto unmap_release; + desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT | VRING_DESC_F_WRITE); - desc[i].addr = cpu_to_virtio64(_vq->vdev, sg_phys(sg)); + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length); prev = i; i = virtio16_to_cpu(_vq->vdev, desc[i].next); @@ -224,14 +290,33 @@ static inline int virtqueue_add(struct virtqueue *_vq, /* Last one doesn't continue. */ desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT); + if (indirect) { + /* Now that the indirect table is filled in, map it. */ + dma_addr_t addr = vring_map_single( + vq, desc, total_sg * sizeof(struct vring_desc), + DMA_TO_DEVICE); + if (vring_mapping_error(vq, addr)) + goto unmap_release; + + vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT); + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, addr); + + vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, total_sg * sizeof(struct vring_desc)); + } + + /* We're using some buffers from the free list. */ + vq->vq.num_free -= descs_used; + /* Update free pointer */ if (indirect) vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next); else vq->free_head = i; - /* Set token. */ - vq->data[head] = data; + /* Store token and indirect buffer state. */ + vq->desc_state[head].data = data; + if (indirect) + vq->desc_state[head].indir_desc = desc; /* Put entry in available array (but don't update avail->idx until they * do sync). */ @@ -253,6 +338,24 @@ static inline int virtqueue_add(struct virtqueue *_vq, virtqueue_kick(_vq); return 0; + +unmap_release: + err_idx = i; + i = head; + + for (n = 0; n < total_sg; n++) { + if (i == err_idx) + break; + vring_unmap_one(vq, &desc[i]); + i = vq->vring.desc[i].next; + } + + vq->vq.num_free += total_sg; + + if (indirect) + kfree(desc); + + return -EIO; } /** @@ -423,27 +526,43 @@ EXPORT_SYMBOL_GPL(virtqueue_kick); static void detach_buf(struct vring_virtqueue *vq, unsigned int head) { - unsigned int i; + unsigned int i, j; + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); /* Clear data ptr. */ - vq->data[head] = NULL; + vq->desc_state[head].data = NULL; - /* Put back on free list: find end */ + /* Put back on free list: unmap first-level descriptors and find end */ i = head; - /* Free the indirect table */ - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)) - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr))); - - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) { + while (vq->vring.desc[i].flags & nextflag) { + vring_unmap_one(vq, &vq->vring.desc[i]); i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next); vq->vq.num_free++; } + vring_unmap_one(vq, &vq->vring.desc[i]); vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head); vq->free_head = head; + /* Plus final descriptor */ vq->vq.num_free++; + + /* Free the indirect table, if any, now that it's unmapped. */ + if (vq->desc_state[head].indir_desc) { + struct vring_desc *indir_desc = vq->desc_state[head].indir_desc; + u32 len = vq->vring.desc[head].len; + + BUG_ON(!(vq->vring.desc[head].flags & + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))); + BUG_ON(len == 0 || len % sizeof(struct vring_desc)); + + for (j = 0; j < len / sizeof(struct vring_desc); j++) + vring_unmap_one(vq, &indir_desc[j]); + + kfree(vq->desc_state[head].indir_desc); + vq->desc_state[head].indir_desc = NULL; + } } static inline bool more_used(const struct vring_virtqueue *vq) @@ -498,13 +617,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len) BAD_RING(vq, "id %u out of range\n", i); return NULL; } - if (unlikely(!vq->data[i])) { + if (unlikely(!vq->desc_state[i].data)) { BAD_RING(vq, "id %u is not a head!\n", i); return NULL; } /* detach_buf clears data, so grab it now. */ - ret = vq->data[i]; + ret = vq->desc_state[i].data; detach_buf(vq, i); vq->last_used_idx++; /* If we expect an interrupt for the next entry, tell host @@ -665,10 +784,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *_vq) START_USE(vq); for (i = 0; i < vq->vring.num; i++) { - if (!vq->data[i]) + if (!vq->desc_state[i].data) continue; /* detach_buf clears data, so grab it now. */ - buf = vq->data[i]; + buf = vq->desc_state[i].data; detach_buf(vq, i); vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) - 1); END_USE(vq); @@ -721,7 +840,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, return NULL; } - vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL); + vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state), + GFP_KERNEL); if (!vq) return NULL; @@ -751,11 +871,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, /* Put everything in free lists. */ vq->free_head = 0; - for (i = 0; i < num-1; i++) { + for (i = 0; i < num-1; i++) vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1); - vq->data[i] = NULL; - } - vq->data[i] = NULL; + memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state)); return &vq->vq; } diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h new file mode 100644 index 000000000000..4f93af89ae16 --- /dev/null +++ b/tools/virtio/linux/dma-mapping.h @@ -0,0 +1,17 @@ +#ifndef _LINUX_DMA_MAPPING_H +#define _LINUX_DMA_MAPPING_H + +#ifdef CONFIG_HAS_DMA +# error Virtio userspace code does not support CONFIG_HAS_DMA +#endif + +#define PCI_DMA_BUS_IS_PHYS 1 + +enum dma_data_direction { + DMA_BIDIRECTIONAL = 0, + DMA_TO_DEVICE = 1, + DMA_FROM_DEVICE = 2, + DMA_NONE = 3, +}; + +#endif -- 2.4.3
This fixes virtio-pci on platforms and busses that have IOMMUs. This will break the experimental QEMU Q35 IOMMU support until QEMU is fixed. In exchange, it fixes physical virtio hardware as well as virtio-pci running under Xen. We should clean up the virtqueue API to do its own allocation and teach virtqueue_get_avail and virtqueue_get_used to return DMA addresses directly. Signed-off-by: Andy Lutomirski <luto at kernel.org> --- drivers/virtio/virtio_pci_common.h | 3 ++- drivers/virtio/virtio_pci_legacy.c | 19 +++++++++++++++---- drivers/virtio/virtio_pci_modern.c | 34 ++++++++++++++++++++++++---------- 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index b976d968e793..cd6196b513ad 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -38,8 +38,9 @@ struct virtio_pci_vq_info { /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ + /* the ring queue */ void *queue; + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index 48bc9797e530..b5293e5f2af4 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->msix_vector = msix_vec; size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, + &info->queue_dma_addr, + GFP_KERNEL); if (info->queue == NULL) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -169,7 +171,8 @@ out_assign: vring_del_virtqueue(vq); out_activate_queue: iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info) iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_ops = { @@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) return -ENODEV; } + rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (rc) + rc = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (rc) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy"); if (rc) return rc; diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 8e5cf194cc0b..fbe0bd1c4881 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num) return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); } -static void *alloc_virtqueue_pages(int *num) +static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, + int *num, dma_addr_t *dma_addr) { void *pages; /* TODO: allocate each queue chunk individually */ for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = alloc_pages_exact(vring_pci_size(*num), - GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); + pages = dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL|__GFP_NOWARN); if (pages) return pages; } @@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num) return NULL; /* Try to get a single page. You are my only hope! */ - return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO); + return dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL); } static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, @@ -346,7 +350,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->num = num; info->msix_vector = msix_vec; - info->queue = alloc_virtqueue_pages(&info->num); + info->queue = alloc_virtqueue_pages(vp_dev, &info->num, + &info->queue_dma_addr); if (info->queue == NULL) return ERR_PTR(-ENOMEM); @@ -361,11 +366,11 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, /* activate the queue */ vp_iowrite16(num, &cfg->queue_size); - vp_iowrite64_twopart(virt_to_phys(info->queue), + vp_iowrite64_twopart(info->queue_dma_addr, &cfg->queue_desc_lo, &cfg->queue_desc_hi); - vp_iowrite64_twopart(virt_to_phys(virtqueue_get_avail(vq)), + vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_avail(vq) - (char *)info->queue), &cfg->queue_avail_lo, &cfg->queue_avail_hi); - vp_iowrite64_twopart(virt_to_phys(virtqueue_get_used(vq)), + vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_used(vq) - (char *)info->queue), &cfg->queue_used_lo, &cfg->queue_used_hi); if (vp_dev->notify_base) { @@ -411,7 +416,8 @@ err_assign_vector: err_map_notify: vring_del_virtqueue(vq); err_new_queue: - free_pages_exact(info->queue, vring_pci_size(info->num)); + dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num), + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -457,7 +463,8 @@ static void del_vq(struct virtio_pci_vq_info *info) vring_del_virtqueue(vq); - free_pages_exact(info->queue, vring_pci_size(info->num)); + dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num), + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_nodev_ops = { @@ -641,6 +648,13 @@ int virtio_pci_modern_probe(struct virtio_pci_device *vp_dev) return -EINVAL; } + err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (err) + err = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (err) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + /* Device capability is only mandatory for devices that have * device-specific configuration. */ -- 2.4.3
Andy Lutomirski
2015-Oct-30 01:09 UTC
[PATCH v4 4/6] virtio: Add improved queue allocation API
This leaves vring_new_virtqueue alone for compatbility, but it adds two new improved APIs: vring_create_virtqueue: Creates a virtqueue backed by automatically allocated coherent memory. (Some day it this could be extended to support non-coherent memory, too, if there ends up being a platform on which it's worthwhile.) __vring_new_virtqueue: Creates a virtqueue with a manually-specified layout. This should allow mic_virtio to work much more cleanly. Signed-off-by: Andy Lutomirski <luto at kernel.org> --- drivers/virtio/virtio_ring.c | 164 +++++++++++++++++++++++++++++++++++-------- include/linux/virtio.h | 23 +++++- include/linux/virtio_ring.h | 35 +++++++++ 3 files changed, 190 insertions(+), 32 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index a872eb89587f..f269e1cba00c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -91,6 +91,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ bool (*notify)(struct virtqueue *vq); + /* DMA, allocation, and size information */ + bool we_own_ring; + size_t queue_size_in_bytes; + dma_addr_t queue_dma_addr; + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -821,36 +826,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq) } EXPORT_SYMBOL_GPL(vring_interrupt); -struct virtqueue *vring_new_virtqueue(unsigned int index, - unsigned int num, - unsigned int vring_align, - struct virtio_device *vdev, - bool weak_barriers, - void *pages, - bool (*notify)(struct virtqueue *), - void (*callback)(struct virtqueue *), - const char *name) +struct virtqueue *__vring_new_virtqueue(unsigned int index, + struct vring vring, + struct virtio_device *vdev, + bool weak_barriers, + bool (*notify)(struct virtqueue *), + void (*callback)(struct virtqueue *), + const char *name) { - struct vring_virtqueue *vq; unsigned int i; + struct vring_virtqueue *vq; - /* We assume num is a power of 2. */ - if (num & (num - 1)) { - dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num); - return NULL; - } - - vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state), + vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state), GFP_KERNEL); if (!vq) return NULL; - vring_init(&vq->vring, num, pages, vring_align); + vq->vring = vring; vq->vq.callback = callback; vq->vq.vdev = vdev; vq->vq.name = name; - vq->vq.num_free = num; + vq->vq.num_free = vring.num; vq->vq.index = index; + vq->we_own_ring = false; + vq->queue_dma_addr = 0; + vq->queue_size_in_bytes = 0; vq->notify = notify; vq->weak_barriers = weak_barriers; vq->broken = false; @@ -871,18 +871,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, /* Put everything in free lists. */ vq->free_head = 0; - for (i = 0; i < num-1; i++) + for (i = 0; i < vring.num-1; i++) vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1); - memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state)); + memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state)); return &vq->vq; } +EXPORT_SYMBOL_GPL(__vring_new_virtqueue); + +struct virtqueue *vring_create_virtqueue( + unsigned int index, + unsigned int num, + unsigned int vring_align, + struct virtio_device *vdev, + bool weak_barriers, + bool may_reduce_num, + bool (*notify)(struct virtqueue *), + void (*callback)(struct virtqueue *), + const char *name) +{ + struct virtqueue *vq; + void *queue; + dma_addr_t dma_addr; + size_t queue_size_in_bytes; + struct vring vring; + + /* We assume num is a power of 2. */ + if (num & (num - 1)) { + dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num); + return NULL; + } + + /* TODO: allocate each queue chunk individually */ + for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) { + queue = dma_zalloc_coherent( + vdev->dev.parent, vring_size(num, vring_align), + &dma_addr, GFP_KERNEL|__GFP_NOWARN); + if (queue) + break; + } + + if (!num) + return NULL; + + if (!queue) { + /* Try to get a single page. You are my only hope! */ + queue = dma_zalloc_coherent( + vdev->dev.parent, vring_size(num, vring_align), + &dma_addr, GFP_KERNEL); + } + if (!queue) + return NULL; + + queue_size_in_bytes = vring_size(num, vring_align); + vring_init(&vring, num, queue, vring_align); + + vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, + notify, callback, name); + if (!vq) { + dma_free_coherent(vdev->dev.parent, + queue_size_in_bytes, queue, + dma_addr); + return NULL; + } + + to_vvq(vq)->queue_dma_addr = dma_addr; + to_vvq(vq)->queue_size_in_bytes = queue_size_in_bytes; + to_vvq(vq)->we_own_ring = true; + + return vq; +} +EXPORT_SYMBOL_GPL(vring_create_virtqueue); + +struct virtqueue *vring_new_virtqueue(unsigned int index, + unsigned int num, + unsigned int vring_align, + struct virtio_device *vdev, + bool weak_barriers, + void *pages, + bool (*notify)(struct virtqueue *vq), + void (*callback)(struct virtqueue *vq), + const char *name) +{ + struct vring vring; + vring_init(&vring, num, pages, vring_align); + return __vring_new_virtqueue(index, vring, vdev, weak_barriers, + notify, callback, name); +} EXPORT_SYMBOL_GPL(vring_new_virtqueue); -void vring_del_virtqueue(struct virtqueue *vq) +void vring_del_virtqueue(struct virtqueue *_vq) { - list_del(&vq->list); - kfree(to_vvq(vq)); + struct vring_virtqueue *vq = to_vvq(_vq); + + if (vq->we_own_ring) { + dma_free_coherent(vring_dma_dev(vq), vq->queue_size_in_bytes, + vq->vring.desc, vq->queue_dma_addr); + } + list_del(&_vq->list); + kfree(vq); } EXPORT_SYMBOL_GPL(vring_del_virtqueue); @@ -946,20 +1033,37 @@ void virtio_break_device(struct virtio_device *dev) } EXPORT_SYMBOL_GPL(virtio_break_device); -void *virtqueue_get_avail(struct virtqueue *_vq) +dma_addr_t virtqueue_get_desc_addr(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); - return vq->vring.avail; + BUG_ON(!vq->we_own_ring); + return vq->queue_dma_addr; } -EXPORT_SYMBOL_GPL(virtqueue_get_avail); +EXPORT_SYMBOL_GPL(virtqueue_get_desc_addr); -void *virtqueue_get_used(struct virtqueue *_vq) +dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq) { struct vring_virtqueue *vq = to_vvq(_vq); - return vq->vring.used; + BUG_ON(!vq->we_own_ring); + return vq->queue_dma_addr + ((char *)vq->vring.avail - (char *)vq->vring.desc); +} +EXPORT_SYMBOL_GPL(virtqueue_get_avail_addr); + +dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + + BUG_ON(!vq->we_own_ring); + return vq->queue_dma_addr + ((char *)vq->vring.used - (char *)vq->vring.desc); +} +EXPORT_SYMBOL_GPL(virtqueue_get_used_addr); + +const struct vring *virtqueue_get_vring(struct virtqueue *vq) +{ + return &to_vvq(vq)->vring; } -EXPORT_SYMBOL_GPL(virtqueue_get_used); +EXPORT_SYMBOL_GPL(virtqueue_get_vring); MODULE_LICENSE("GPL"); diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 8f4d4bfa6d46..d5eb5479a425 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -75,8 +75,27 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *vq); bool virtqueue_is_broken(struct virtqueue *vq); -void *virtqueue_get_avail(struct virtqueue *vq); -void *virtqueue_get_used(struct virtqueue *vq); +const struct vring *virtqueue_get_vring(struct virtqueue *vq); +dma_addr_t virtqueue_get_desc_addr(struct virtqueue *vq); +dma_addr_t virtqueue_get_avail_addr(struct virtqueue *vq); +dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq); + +/* + * Legacy accessors -- in almost all cases, these are the wrong functions + * to use. + */ +static inline void *virtqueue_get_desc(struct virtqueue *vq) +{ + return virtqueue_get_vring(vq)->desc; +} +static inline void *virtqueue_get_avail(struct virtqueue *vq) +{ + return virtqueue_get_vring(vq)->avail; +} +static inline void *virtqueue_get_used(struct virtqueue *vq) +{ + return virtqueue_get_vring(vq)->used; +} /** * virtio_device - representation of a device using virtio diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h index 8e50888a6d59..b6b6821917fa 100644 --- a/include/linux/virtio_ring.h +++ b/include/linux/virtio_ring.h @@ -50,6 +50,35 @@ static inline void virtio_wmb(bool weak_barriers) struct virtio_device; struct virtqueue; +/* + * Creates a virtqueue and allocates the descriptor ring. If + * may_reduce_num is set, then this may allocate a smaller ring than + * expected. The caller should query virtqueue_get_ring_size to learn + * the actual size of the ring. + */ +struct virtqueue *vring_create_virtqueue(unsigned int index, + unsigned int num, + unsigned int vring_align, + struct virtio_device *vdev, + bool weak_barriers, + bool may_reduce_num, + bool (*notify)(struct virtqueue *vq), + void (*callback)(struct virtqueue *vq), + const char *name); + +/* Creates a virtqueue with a custom layout. */ +struct virtqueue *__vring_new_virtqueue(unsigned int index, + struct vring vring, + struct virtio_device *vdev, + bool weak_barriers, + bool (*notify)(struct virtqueue *), + void (*callback)(struct virtqueue *), + const char *name); + +/* + * Creates a virtqueue with a standard layout but a caller-allocated + * ring. + */ struct virtqueue *vring_new_virtqueue(unsigned int index, unsigned int num, unsigned int vring_align, @@ -59,7 +88,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, bool (*notify)(struct virtqueue *vq), void (*callback)(struct virtqueue *vq), const char *name); + +/* + * Destroys a virtqueue. If created with vring_create_virtqueue, this + * also frees the ring. + */ void vring_del_virtqueue(struct virtqueue *vq); + /* Filter out transport-specific feature bits. */ void vring_transport_features(struct virtio_device *vdev); -- 2.4.3
This switches to vring_create_virtqueue, simplifying the driver and adding DMA API support. Signed-off-by: Andy Lutomirski <luto at kernel.org> --- drivers/virtio/virtio_mmio.c | 67 ++++++++++---------------------------------- 1 file changed, 15 insertions(+), 52 deletions(-) diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c index f499d9da7237..2b9fab52a3cb 100644 --- a/drivers/virtio/virtio_mmio.c +++ b/drivers/virtio/virtio_mmio.c @@ -99,12 +99,6 @@ struct virtio_mmio_vq_info { /* the actual virtqueue */ struct virtqueue *vq; - /* the number of entries in the queue */ - unsigned int num; - - /* the virtual address of the ring queue */ - void *queue; - /* the list node for the virtqueues list */ struct list_head node; }; @@ -322,15 +316,13 @@ static void vm_del_vq(struct virtqueue *vq) { struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev); struct virtio_mmio_vq_info *info = vq->priv; - unsigned long flags, size; + unsigned long flags; unsigned int index = vq->index; spin_lock_irqsave(&vm_dev->lock, flags); list_del(&info->node); spin_unlock_irqrestore(&vm_dev->lock, flags); - vring_del_virtqueue(vq); - /* Select and deactivate the queue */ writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); if (vm_dev->version == 1) { @@ -340,8 +332,8 @@ static void vm_del_vq(struct virtqueue *vq) WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY)); } - size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN)); - free_pages_exact(info->queue, size); + vring_del_virtqueue(vq); + kfree(info); } @@ -356,8 +348,6 @@ static void vm_del_vqs(struct virtio_device *vdev) free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev); } - - static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, void (*callback)(struct virtqueue *vq), const char *name) @@ -365,7 +355,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev); struct virtio_mmio_vq_info *info; struct virtqueue *vq; - unsigned long flags, size; + unsigned long flags; + unsigned int num; int err; if (!name) @@ -388,66 +379,40 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, goto error_kmalloc; } - /* Allocate pages for the queue - start with a queue as big as - * possible (limited by maximum size allowed by device), drop down - * to a minimal size, just big enough to fit descriptor table - * and two rings (which makes it "alignment_size * 2") - */ - info->num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX); - - /* If the device reports a 0 entry queue, we won't be able to - * use it to perform I/O, and vring_new_virtqueue() can't create - * empty queues anyway, so don't bother to set up the device. - */ - if (info->num == 0) { + num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX); + if (num == 0) { err = -ENOENT; - goto error_alloc_pages; - } - - while (1) { - size = PAGE_ALIGN(vring_size(info->num, - VIRTIO_MMIO_VRING_ALIGN)); - /* Did the last iter shrink the queue below minimum size? */ - if (size < VIRTIO_MMIO_VRING_ALIGN * 2) { - err = -ENOMEM; - goto error_alloc_pages; - } - - info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO); - if (info->queue) - break; - - info->num /= 2; + goto error_new_virtqueue; } /* Create the vring */ - vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev, - true, info->queue, vm_notify, callback, name); + vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev, + true, true, vm_notify, callback, name); if (!vq) { err = -ENOMEM; goto error_new_virtqueue; } /* Activate the queue */ - writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); + writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); if (vm_dev->version == 1) { writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN); - writel(virt_to_phys(info->queue) >> PAGE_SHIFT, + writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); } else { u64 addr; - addr = virt_to_phys(info->queue); + addr = virtqueue_get_desc_addr(vq); writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW); writel((u32)(addr >> 32), vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH); - addr = virt_to_phys(virtqueue_get_avail(vq)); + addr = virtqueue_get_avail_addr(vq); writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_LOW); writel((u32)(addr >> 32), vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_HIGH); - addr = virt_to_phys(virtqueue_get_used(vq)); + addr = virtqueue_get_used_addr(vq); writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_USED_LOW); writel((u32)(addr >> 32), vm_dev->base + VIRTIO_MMIO_QUEUE_USED_HIGH); @@ -471,8 +436,6 @@ error_new_virtqueue: writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_READY); WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY)); } - free_pages_exact(info->queue, size); -error_alloc_pages: kfree(info); error_kmalloc: error_available: -- 2.4.3
This switches to vring_create_virtqueue, simplifying the driver and adding DMA API support. Signed-off-by: Andy Lutomirski <luto at kernel.org> --- drivers/virtio/virtio_pci_common.h | 7 ----- drivers/virtio/virtio_pci_legacy.c | 39 +++++++----------------- drivers/virtio/virtio_pci_modern.c | 61 ++++++-------------------------------- 3 files changed, 19 insertions(+), 88 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index cd6196b513ad..1a3c689d1b9e 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -35,13 +35,6 @@ struct virtio_pci_vq_info { /* the actual virtqueue */ struct virtqueue *vq; - /* the number of entries in the queue */ - int num; - - /* the ring queue */ - void *queue; - dma_addr_t queue_dma_addr; /* bus address */ - /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index b5293e5f2af4..8c4e61783441 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -119,7 +119,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, u16 msix_vec) { struct virtqueue *vq; - unsigned long size; u16 num; int err; @@ -131,29 +130,19 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN)) return ERR_PTR(-ENOENT); - info->num = num; info->msix_vector = msix_vec; - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, - &info->queue_dma_addr, - GFP_KERNEL); - if (info->queue == NULL) + /* create the vring */ + vq = vring_create_virtqueue(index, num, + VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev, + true, false, vp_notify, callback, name); + if (!vq) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - /* create the vring */ - vq = vring_new_virtqueue(index, info->num, - VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev, - true, info->queue, vp_notify, callback, name); - if (!vq) { - err = -ENOMEM; - goto out_activate_queue; - } - vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY; if (msix_vec != VIRTIO_MSI_NO_VECTOR) { @@ -161,18 +150,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, msix_vec = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR); if (msix_vec == VIRTIO_MSI_NO_VECTOR) { err = -EBUSY; - goto out_assign; + goto out_deactivate; } } return vq; -out_assign: - vring_del_virtqueue(vq); -out_activate_queue: +out_deactivate: iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - dma_free_coherent(&vp_dev->pci_dev->dev, size, - info->queue, info->queue_dma_addr); + vring_del_virtqueue(vq); return ERR_PTR(err); } @@ -180,7 +166,6 @@ static void del_vq(struct virtio_pci_vq_info *info) { struct virtqueue *vq = info->vq; struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); - unsigned long size; iowrite16(vq->index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); @@ -191,14 +176,10 @@ static void del_vq(struct virtio_pci_vq_info *info) ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); } - vring_del_virtqueue(vq); - /* Select and deactivate the queue */ iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - dma_free_coherent(&vp_dev->pci_dev->dev, size, - info->queue, info->queue_dma_addr); + vring_del_virtqueue(vq); } static const struct virtio_config_ops virtio_pci_config_ops = { diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index fbe0bd1c4881..50b0cd5a501e 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -287,35 +287,6 @@ static u16 vp_config_vector(struct virtio_pci_device *vp_dev, u16 vector) return vp_ioread16(&vp_dev->common->msix_config); } -static size_t vring_pci_size(u16 num) -{ - /* We only need a cacheline separation. */ - return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); -} - -static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, - int *num, dma_addr_t *dma_addr) -{ - void *pages; - - /* TODO: allocate each queue chunk individually */ - for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = dma_zalloc_coherent( - &vp_dev->pci_dev->dev, vring_pci_size(*num), - dma_addr, GFP_KERNEL|__GFP_NOWARN); - if (pages) - return pages; - } - - if (!*num) - return NULL; - - /* Try to get a single page. You are my only hope! */ - return dma_zalloc_coherent( - &vp_dev->pci_dev->dev, vring_pci_size(*num), - dma_addr, GFP_KERNEL); -} - static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, struct virtio_pci_vq_info *info, unsigned index, @@ -347,30 +318,22 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, /* get offset of notification word for this vq */ off = vp_ioread16(&cfg->queue_notify_off); - info->num = num; info->msix_vector = msix_vec; - info->queue = alloc_virtqueue_pages(vp_dev, &info->num, - &info->queue_dma_addr); - if (info->queue == NULL) - return ERR_PTR(-ENOMEM); - /* create the vring */ - vq = vring_new_virtqueue(index, info->num, - SMP_CACHE_BYTES, &vp_dev->vdev, - true, info->queue, vp_notify, callback, name); - if (!vq) { - err = -ENOMEM; - goto err_new_queue; - } + vq = vring_create_virtqueue(index, num, + SMP_CACHE_BYTES, &vp_dev->vdev, + true, true, vp_notify, callback, name); + if (!vq) + return ERR_PTR(-ENOMEM); /* activate the queue */ - vp_iowrite16(num, &cfg->queue_size); - vp_iowrite64_twopart(info->queue_dma_addr, + vp_iowrite16(virtqueue_get_vring_size(vq), &cfg->queue_size); + vp_iowrite64_twopart(virtqueue_get_desc_addr(vq), &cfg->queue_desc_lo, &cfg->queue_desc_hi); - vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_avail(vq) - (char *)info->queue), + vp_iowrite64_twopart(virtqueue_get_avail_addr(vq), &cfg->queue_avail_lo, &cfg->queue_avail_hi); - vp_iowrite64_twopart(info->queue_dma_addr + ((char *)virtqueue_get_used(vq) - (char *)info->queue), + vp_iowrite64_twopart(virtqueue_get_used_addr(vq), &cfg->queue_used_lo, &cfg->queue_used_hi); if (vp_dev->notify_base) { @@ -415,9 +378,6 @@ err_assign_vector: pci_iounmap(vp_dev->pci_dev, (void __iomem __force *)vq->priv); err_map_notify: vring_del_virtqueue(vq); -err_new_queue: - dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num), - info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -462,9 +422,6 @@ static void del_vq(struct virtio_pci_vq_info *info) pci_iounmap(vp_dev->pci_dev, (void __force __iomem *)vq->priv); vring_del_virtqueue(vq); - - dma_free_coherent(&vp_dev->pci_dev->dev, vring_pci_size(info->num), - info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_nodev_ops = { -- 2.4.3
On Thu, Oct 29, 2015 at 6:09 PM, Andy Lutomirski <luto at kernel.org> wrote:> This switches virtio to use the DMA API unconditionally. I'm sure > it breaks things, but it seems to work on x86 using virtio-pci, with > and without Xen, and using both the modern 1.0 variant and the > legacy variant....> Andy Lutomirski (5): > virtio_ring: Support DMA APIs > virtio_pci: Use the DMA API > virtio: Add improved queue allocation API > virtio_mmio: Use the DMA API > virtio_pci: Use the DMA APIUgh. The two virtio_pci patches should be squashed together. I'll do that for v5, but I'm not going to send it until there's more feedback. FWIW, I'm collecting this stuff here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_dma That branch includes this series (with the squash) and the s390 patches. I'll keep it up to date, since it seems silly to declare it stable enough to stop rebasing yet. --Andy
Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:> This switches virtio to use the DMA API unconditionally. I'm sure > it breaks things, but it seems to work on x86 using virtio-pci, with > and without Xen, and using both the modern 1.0 variant and the > legacy variant. > > This appears to work on native and Xen x86_64 using both modern and > legacy virtio-pci. It also appears to work on arm and arm64. > > It definitely won't work as-is on s390x, and I haven't been able to > test Christian's patches because I can't get virtio-ccw to work in > QEMU at all. I don't know what I'm doing wrong.[...]> virtio-net: Stop doing DMA from the stack > > drivers/net/virtio_net.c | 34 ++-- > drivers/virtio/Kconfig | 2 +- > drivers/virtio/virtio_mmio.c | 67 ++----- > drivers/virtio/virtio_pci_common.h | 6 - > drivers/virtio/virtio_pci_legacy.c | 42 ++--- > drivers/virtio/virtio_pci_modern.c | 61 ++----- > drivers/virtio/virtio_ring.c | 348 ++++++++++++++++++++++++++++++-------do you also have an untested patch for drivers/s390/virtio/* ?
On Thu, 29 Oct 2015 18:09:47 -0700 Andy Lutomirski <luto at kernel.org> wrote:> virtio_ring currently sends the device (usually a hypervisor) > physical addresses of its I/O buffers. This is okay when DMA > addresses and physical addresses are the same thing, but this isn't > always the case. For example, this never works on Xen guests, and > it is likely to fail if a physical "virtio" device ever ends up > behind an IOMMU or swiotlb. > > The immediate use case for me is to enable virtio on Xen guests. > For that to work, we need vring to support DMA address translation > as well as a corresponding change to virtio_pci or to another > driver. > > With this patch, if enabled, virtfs survives kmemleak and > CONFIG_DMA_API_DEBUG. > > Signed-off-by: Andy Lutomirski <luto at kernel.org> > --- > drivers/virtio/Kconfig | 2 +- > drivers/virtio/virtio_ring.c | 190 +++++++++++++++++++++++++++++++-------- > tools/virtio/linux/dma-mapping.h | 17 ++++ > 3 files changed, 172 insertions(+), 37 deletions(-) > create mode 100644 tools/virtio/linux/dma-mapping.h> static void detach_buf(struct vring_virtqueue *vq, unsigned int head) > { > - unsigned int i; > + unsigned int i, j; > + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); > > /* Clear data ptr. */ > - vq->data[head] = NULL; > + vq->desc_state[head].data = NULL; > > - /* Put back on free list: find end */ > + /* Put back on free list: unmap first-level descriptors and find end */ > i = head; > > - /* Free the indirect table */ > - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)) > - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, vq->vring.desc[i].addr))); > - > - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT)) { > + while (vq->vring.desc[i].flags & nextflag) { > + vring_unmap_one(vq, &vq->vring.desc[i]); > i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next); > vq->vq.num_free++; > } > > + vring_unmap_one(vq, &vq->vring.desc[i]); > vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head); > vq->free_head = head; > + > /* Plus final descriptor */ > vq->vq.num_free++; > + > + /* Free the indirect table, if any, now that it's unmapped. */ > + if (vq->desc_state[head].indir_desc) { > + struct vring_desc *indir_desc = vq->desc_state[head].indir_desc; > + u32 len = vq->vring.desc[head].len;This one needs to be virtio32_to_cpu(...) as well.> + > + BUG_ON(!(vq->vring.desc[head].flags & > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))); > + BUG_ON(len == 0 || len % sizeof(struct vring_desc)); > + > + for (j = 0; j < len / sizeof(struct vring_desc); j++) > + vring_unmap_one(vq, &indir_desc[j]); > + > + kfree(vq->desc_state[head].indir_desc); > + vq->desc_state[head].indir_desc = NULL; > + } > }With that change on top of your current branch, I can boot (root on virtio-blk, either virtio-1 or legacy virtio) on current qemu master with kvm enabled on s390. Haven't tried anything further.
Christian Borntraeger
2015-Oct-30 13:55 UTC
[PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
Am 30.10.2015 um 02:09 schrieb Andy Lutomirski:> From: "Michael S. Tsirkin" <mst at redhat.com> > > Once virtio starts using the DMA API, we won't be able to safely DMA > from the stack. virtio-net does a couple of config DMA requests > from small stack buffers -- switch to using dynamically-allocated > memory. > > This should have no effect on any performance-critical code paths. > > [I wrote the subject and commit message. mst wrote the code. --luto] > > Signed-off-by: Andy Lutomirski <luto at kernel.org> > signed-off-by: Michael S. Tsirkin <mst at redhat.com>I still get an error when using multiqueue: # ethtool -L eth0 combined 4 [ 33.534686] virtio_ccw 0.0.000d: DMA-API: device driver maps memory from stack [addr=00000000629e7c06] [ 33.534704] ------------[ cut here ]------------ [ 33.534705] WARNING: at lib/dma-debug.c:1169 [ 33.534706] Modules linked in: dm_multipath [ 33.534709] CPU: 1 PID: 1087 Comm: ethtool Not tainted 4.3.0-rc3+ #269 [ 33.534710] task: 00000000616f9978 ti: 00000000629e4000 task.ti: 00000000629e4000 [ 33.534712] Krnl PSW : 0704d00180000000 00000000005869d2 (check_for_stack+0xb2/0x118) [ 33.534716] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:3 Krnl GPRS: 000000000000006a 0000000000d60f44 000000000000005a 0000000064ee0870 [ 33.534718] 00000000005869ce 0000000000000000 0000000000000001 00000000629e7c06 [ 33.534719] 0000000000000000 0000000000000c06 0000000000000002 000000006467f800 [ 33.534720] 0000000064673428 00000000629e7c06 00000000005869ce 00000000629e7928 [ 33.534726] Krnl Code: 00000000005869c2: c0200024ad4e larl %r2,a1c45e 00000000005869c8: c0e5ffe6d6fc brasl %r14,2617c0 #00000000005869ce: a7f40001 brc 15,5869d0 >00000000005869d2: c010003465eb larl %r1,c135a8 00000000005869d8: e31010000012 lt %r1,0(%r1) 00000000005869de: a784000a brc 8,5869f2 00000000005869e2: e340f0b00004 lg %r4,176(%r15) 00000000005869e8: ebcff0a00004 lmg %r12,%r15,160(%r15) [ 33.534736] Call Trace: [ 33.534737] ([<00000000005869ce>] check_for_stack+0xae/0x118) [ 33.534738] [<0000000000586e3c>] debug_dma_map_page+0x114/0x160 [ 33.534740] [<00000000005a31f8>] vring_map_one_sg.isra.7+0x98/0xc0 [ 33.534742] [<00000000005a3b72>] virtqueue_add_sgs+0x1e2/0x788 [ 33.534744] [<0000000000618afc>] virtnet_send_command+0xcc/0x140 [ 33.534745] [<0000000000618c0c>] virtnet_set_queues+0x9c/0x110 [ 33.534747] [<0000000000619928>] virtnet_set_channels+0x78/0xe0 [ 33.534748] [<00000000006f63ea>] ethtool_set_channels+0x62/0x88 [ 33.534750] [<00000000006f8900>] dev_ethtool+0x10d8/0x1a48 [ 33.534752] [<000000000070c540>] dev_ioctl+0x190/0x510 [ 33.534754] [<00000000006cf2da>] sock_do_ioctl+0x7a/0x90 [ 33.534755] [<00000000006cf840>] sock_ioctl+0x1e8/0x2d0 [ 33.534758] [<00000000002e6c78>] do_vfs_ioctl+0x3a8/0x508 [ 33.534759] [<00000000002e6e7c>] SyS_ioctl+0xa4/0xb8 [ 33.534762] [<00000000008231ec>] system_call+0x244/0x264 [ 33.534763] [<000003ff922026d2>] 0x3ff922026d2 [ 33.534764] Last Breaking-Event-Address: [ 33.534765] [<00000000005869ce>] check_for_stack+0xae/0x118 [ 33.534766] ---[ end trace 2379df65f4decfc4 ]---> --- > drivers/net/virtio_net.c | 34 +++++++++++++++++++--------------- > 1 file changed, 19 insertions(+), 15 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index d8838dedb7a4..f94ab786088f 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -140,6 +140,12 @@ struct virtnet_info { > > /* CPU hot plug notifier */ > struct notifier_block nb; > + > + /* Control VQ buffers: protected by the rtnl lock */ > + struct virtio_net_ctrl_hdr ctrl_hdr; > + virtio_net_ctrl_ack ctrl_status; > + u8 ctrl_promisc; > + u8 ctrl_allmulti; > }; > > struct padded_vnet_hdr { > @@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, > struct scatterlist *out) > { > struct scatterlist *sgs[4], hdr, stat; > - struct virtio_net_ctrl_hdr ctrl; > - virtio_net_ctrl_ack status = ~0; > unsigned out_num = 0, tmp; > > /* Caller should know better */ > BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); > > - ctrl.class = class; > - ctrl.cmd = cmd; > + vi->ctrl_status = ~0; > + vi->ctrl_hdr.class = class; > + vi->ctrl_hdr.cmd = cmd; > /* Add header */ > - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); > + sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr)); > sgs[out_num++] = &hdr; > > if (out) > sgs[out_num++] = out; > > /* Add return status. */ > - sg_init_one(&stat, &status, sizeof(status)); > + sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status)); > sgs[out_num] = &stat; > > BUG_ON(out_num + 1 > ARRAY_SIZE(sgs)); > virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC); > > if (unlikely(!virtqueue_kick(vi->cvq))) > - return status == VIRTIO_NET_OK; > + return vi->ctrl_status == VIRTIO_NET_OK; > > /* Spin for a response, the kick causes an ioport write, trapping > * into the hypervisor, so the request should be handled immediately. > @@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, > !virtqueue_is_broken(vi->cvq)) > cpu_relax(); > > - return status == VIRTIO_NET_OK; > + return vi->ctrl_status == VIRTIO_NET_OK; > } > > static int virtnet_set_mac_address(struct net_device *dev, void *p) > @@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev) > { > struct virtnet_info *vi = netdev_priv(dev); > struct scatterlist sg[2]; > - u8 promisc, allmulti; > struct virtio_net_ctrl_mac *mac_data; > struct netdev_hw_addr *ha; > int uc_count; > @@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev) > if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX)) > return; > > - promisc = ((dev->flags & IFF_PROMISC) != 0); > - allmulti = ((dev->flags & IFF_ALLMULTI) != 0); > + vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0); > + vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0); > > - sg_init_one(sg, &promisc, sizeof(promisc)); > + sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc)); > > if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, > VIRTIO_NET_CTRL_RX_PROMISC, sg)) > dev_warn(&dev->dev, "Failed to %sable promisc mode.\n", > - promisc ? "en" : "dis"); > + vi->ctrl_promisc ? "en" : "dis"); > > - sg_init_one(sg, &allmulti, sizeof(allmulti)); > + sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti)); > > if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, > VIRTIO_NET_CTRL_RX_ALLMULTI, sg)) > dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n", > - allmulti ? "en" : "dis"); > + vi->ctrl_allmulti ? "en" : "dis"); > > uc_count = netdev_uc_count(dev); > mc_count = netdev_mc_count(dev); >
On Thu, Oct 29, 2015 at 06:09:45PM -0700, Andy Lutomirski wrote:> This switches virtio to use the DMA API unconditionally. I'm sure > it breaks things, but it seems to work on x86 using virtio-pci, with > and without Xen, and using both the modern 1.0 variant and the > legacy variant. > > This appears to work on native and Xen x86_64 using both modern and > legacy virtio-pci. It also appears to work on arm and arm64. > > It definitely won't work as-is on s390x, and I haven't been able to > test Christian's patches because I can't get virtio-ccw to work in > QEMU at all. I don't know what I'm doing wrong. > > It doesn't work on ppc64. Ben, consider yourself pinged to send me > a patch :) > > It doesn't work on sparc64. I didn't realize at Kernel Summit that > sparc64 has the same problem as ppc64. > > DaveM, for background, we're trying to fix virtio to use the DMA > API. That will require that every platform that uses virtio > supplies valid DMA operations on devices that use virtio_ring. > Unfortunately, QEMU historically ignores the IOMMU on virtio > devices. > > On x86, this isn't really a problem. x86 has a nice way for the > platform to describe which devices are behind an IOMMU, and QEMU > will be adjusted accordingly. The only thing that will break is a > recently-added experimental mode.Well that's not exactly true. I think we would like to make it possible to put virtio devices behind an IOMMU on x86, but if this means existing guests break, then many people won't be able to use this option: having to find out which kernel version your guest is running is a significant burden. So on the host side, we need to detect guests that don't program the IOMMU and make QEMU ignore it. I think we need to figure out a way to do this before we commit to the guest change. Additionally, IOMMU overhead is very high when running within the VM. So for uses such as VFIO, we'd like a way to make something like iommu-pt the default.> Ben's plan for powerpc is to add a quirk for existing virtio-pci > devices and to eventually update the devicetree stuff to allow QEMU > to tell the guest which devices use the IOMMU. > > AFAICT sparc has a similar problem to powerpc. DaveM, can you come > up with a straightforward way to get sparc's DMA API to work > correctly for virtio-pci devices? > > NB: Sadly, the platforms I've successfully tested on don't include any > big-endian platforms, so there could still be lurking endian problems. > > Changes from v3: > - More big-endian fixes. > - Added better virtio-ring APIs that handle allocation and use them in > virtio-mmio and virtio-pci. > - Switch to Michael's virtio-net patch. > > Changes from v2: > - Fix vring_mapping_error incorrect argument > > Changes from v1: > - Fix an endian conversion error causing a BUG to hit. > - Fix a DMA ordering issue (swiotlb=force works now). > - Minor cleanups. > > Andy Lutomirski (5): > virtio_ring: Support DMA APIs > virtio_pci: Use the DMA API > virtio: Add improved queue allocation API > virtio_mmio: Use the DMA API > virtio_pci: Use the DMA API > > Michael S. Tsirkin (1): > virtio-net: Stop doing DMA from the stack > > drivers/net/virtio_net.c | 34 ++-- > drivers/virtio/Kconfig | 2 +- > drivers/virtio/virtio_mmio.c | 67 ++----- > drivers/virtio/virtio_pci_common.h | 6 - > drivers/virtio/virtio_pci_legacy.c | 42 ++--- > drivers/virtio/virtio_pci_modern.c | 61 ++----- > drivers/virtio/virtio_ring.c | 348 ++++++++++++++++++++++++++++++------- > include/linux/virtio.h | 23 ++- > include/linux/virtio_ring.h | 35 ++++ > tools/virtio/linux/dma-mapping.h | 17 ++ > 10 files changed, 426 insertions(+), 209 deletions(-) > create mode 100644 tools/virtio/linux/dma-mapping.h > > -- > 2.4.3
On 09/11/2015 13:15, Michael S. Tsirkin wrote:> Well that's not exactly true. I think we would like to make > it possible to put virtio devices behind an IOMMU on x86, > but if this means existing guests break, then many people won't be able > to use this option: having to find out which kernel version your guest > is running is a significant burden. > > So on the host side, we need to detect guests that > don't program the IOMMU and make QEMU ignore it. > I think we need to figure out a way to do this > before we commit to the guest change.What is the usecase for putting virtio devices behind an IOMMU, apart from: 1) "because you can" 2) using VFIO within the guest ? Case 1 can be ignored, and in case 2 the guest will do the right thing.> Additionally, IOMMU overhead is very high when running within the VM. > So for uses such as VFIO, we'd like a way to make something like > iommu-pt the default.That's not something that the kernel cares about. It's just a configuration issue. Paolo
Benjamin Herrenschmidt
2015-Nov-09 22:58 UTC
[PATCH v4 0/6] virtio core DMA API conversion
So ... I've finally tried to sort that out for powerpc and I can't find a way to make that work that isn't a complete pile of stinking shit. I'm very tempted to go back to my original idea: virtio itself should indicate it's "bypassing ability" via the virtio config space or some other bit (like the ProgIf of the PCI header etc...). I don't see how I can make it work otherwise. The problem with the statement "it's a platform matter" is that: ? - It's not entirely correct. It's actually a feature of a specific virtio implementation (qemu's) that it bypasses all the platform IOMMU facilities. ? - The platforms (on powerpc there's at least 3 or 4 that have qemu emulation and support some form of PCI) don't have an existing way to convey the information that a device bypasses the IOMMU (if any). ? - Even if we add such a mechanism (new property in the device-tree), we end up with a big turd: Because we need to be compatible with older qemus, we essentially need a quirk that will make all virtio devices assume that property is present. That will of course break whenever we try to use another implementation of virtio on powerpc which doesn't bypass the iommu. We don't have a way to differenciate a virtio device that does the bypass from one that doesn't and the backward compatibility requirement forces us to treat all existing virtio devices as doing bypass. The only way out of this while keeping the "platform" stuff would be to also bump some kind of version in the virtio config (or PCI header). I have no other way to differenciate between "this is an old qemu that doesn't do the 'bypass property' yet" from "this is a virtio device that doesn't bypass". Any better idea ? Cheers, Ben.