Jason Wang
2020-Dec-24 03:01 UTC
[RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 10:17 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang at redhat.com> wrote:
>>
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of a message
>>> mechanism to forward the config operations from the vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply to
>>> those control messages.
>>>
>>> In the data path, the VDUSE driver implements an MMU-based on-chip
>>> IOMMU driver which supports mapping the kernel dma buffer to a
>>> userspace iova region dynamically. Userspace can access those
>>> iova regions via mmap(). Besides, the eventfd mechanism is used to
>>> trigger interrupt callbacks and receive virtqueue kicks in userspace.
>>>
>>> Now we only support the virtio-vdpa bus driver with this patch applied.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji at bytedance.com>
>>> ---
>>>  Documentation/driver-api/vduse.rst                 |   74 ++
>>>  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>  drivers/vdpa/Kconfig                               |    8 +
>>>  drivers/vdpa/Makefile                              |    1 +
>>>  drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>  drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
>>>  drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
>>>  drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
>>>  drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
>>>  drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
>>>  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
>>>  include/uapi/linux/vdpa.h                          |    1 +
>>>  include/uapi/linux/vduse.h                         |   99 ++
>>>  13 files changed, 2173 insertions(+)
>>>  create mode 100644 Documentation/driver-api/vduse.rst
>>>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>>>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>>>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>  create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> new file mode 100644
>>> index 000000000000..da9b3040f20a
>>> --- /dev/null
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -0,0 +1,74 @@
>>> +==================================
>>> +VDUSE - "vDPA Device in Userspace"
>>> +==================================
>>> +
>>> +A vDPA (virtio data path acceleration) device is a device that uses a
>>> +datapath complying with the virtio specification together with a
>>> +vendor-specific control path. vDPA devices can be either physically
>>> +located on the hardware or emulated by software. VDUSE is a framework
>>> +that makes it possible to implement software-emulated vDPA devices in
>>> +userspace.
>>> +
>>> +How VDUSE works
>>> +---------------
>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
>>> +to the new resources will be returned, which can be used to implement the
>>> +userspace vDPA device's control path and data path.
>>> +
>>> +To implement the control path, read/write operations on the file descriptor
>>> +will be used to receive/reply to the control messages from/to the VDUSE
>>> +driver.
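
As an illustration of this read()/reply loop, a userspace device daemon might do
something like the following sketch. The struct vduse_dev_msg layout and its
req/resp members are assumptions made for the sake of the example; the
authoritative definitions are the ones this patch adds in
include/uapi/linux/vduse.h, and the VDUSE_* message types are the ones listed
just below:

    /*
     * Hypothetical sketch of a userspace control-path service loop.
     * The vduse_dev_msg layout and its fields are assumed here, not
     * taken verbatim from the uapi header added by this patch.
     */
    #include <unistd.h>
    #include <linux/vduse.h>

    static void serve_control_path(int dev_fd)
    {
            struct vduse_dev_msg msg;

            for (;;) {
                    /* receive a control message forwarded by the VDUSE driver */
                    if (read(dev_fd, &msg, sizeof(msg)) != sizeof(msg))
                            break;

                    switch (msg.req.type) {
                    case VDUSE_SET_VQ_NUM:
                            /* record the virtqueue size from msg.req ... */
                            break;
                    case VDUSE_GET_FEATURES:
                            /* fill msg.resp with the device features ... */
                            break;
                    default:
                            /* handle the other VDUSE_* messages ... */
                            break;
                    }

                    /* reply to the VDUSE driver */
                    if (write(dev_fd, &msg, sizeof(msg)) != sizeof(msg))
                            break;
            }
    }
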
>>> +Those control messages are based on the vdpa_config_ops which defines a
>>> +unified interface to control different types of vDPA devices.
>>> +
>>> +The following types of messages are provided by the VDUSE framework now:
>>> +
>>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different parts of the virtqueue
>>> +
>>> +- VDUSE_SET_VQ_NUM: Set the size of the virtqueue
>>> +
>>> +- VDUSE_SET_VQ_READY: Set the ready status of the virtqueue
>>> +
>>> +- VDUSE_GET_VQ_READY: Get the ready status of the virtqueue
>>> +
>>> +- VDUSE_SET_FEATURES: Set the virtio features supported by the driver
>>> +
>>> +- VDUSE_GET_FEATURES: Get the virtio features supported by the device
>>> +
>>> +- VDUSE_SET_STATUS: Set the device status
>>> +
>>> +- VDUSE_GET_STATUS: Get the device status
>>> +
>>> +- VDUSE_SET_CONFIG: Write to the device-specific configuration space
>>> +
>>> +- VDUSE_GET_CONFIG: Read from the device-specific configuration space
>>> +
>>> +Please see include/linux/vdpa.h for details.
>>> +
>>> +In the data path, the VDUSE framework implements an MMU-based on-chip IOMMU
>>> +driver which supports mapping the kernel dma buffer to a userspace iova
>>> +region dynamically. The userspace iova region can be created by passing
>>> +the userspace vDPA device fd to mmap(2).
>>> +
>>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
>>> +vDPA device fd are provided to support that:
>>> +
>>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for a virtqueue; this eventfd is
>>> +  used by the VDUSE driver to notify userspace to consume the vring.
>>> +
>>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for a virtqueue; this eventfd is
>>> +  used by userspace to notify the VDUSE driver to trigger interrupt
>>> +  callbacks.
>>> +
>>> +MMU-based IOMMU Driver
>>> +----------------------
>>> +The basic idea behind the IOMMU driver is treating the MMU (VA->PA) as an
>>> +IOMMU (IOVA->PA). This driver will set up an MMU mapping instead of an IOMMU
>>> +mapping for DMA transfers so that the userspace process is able to use its
>>> +virtual address to access the dma buffer in the kernel.
>>> +
>>> +To avoid security issues, a bounce-buffering mechanism is introduced to
>>> +prevent userspace from directly accessing the original buffer, which may
>>> +contain other kernel data. During mapping and unmapping, the driver will
>>> +copy the data from the original buffer to the bounce buffer and back,
>>> +depending on the direction of the transfer. And the bounce-buffer addresses
>>> +will be mapped into the user address space instead of the original ones.
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                             Comments
>>>  'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>  '|'   00-7F  linux/media.h
>>>  0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>  0x89  00-06  arch/x86/include/asm/sockios.h
>>>  0x89  0B-DF  linux/sockios.h
>>>  0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index 4be7be39be26..211cc449cbd3 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -21,6 +21,14 @@ config VDPA_SIM
>>>           to RX. This device is used for testing, prototyping and
>>>           development of vDPA.
>>>
>>> +config VDPA_USER
>>> +        tristate "VDUSE (vDPA Device in Userspace) support"
>>> +        depends on EVENTFD && MMU && HAS_DMA
>>> +        default n
>>
>> The "default n" is not necessary.
>>
> OK.
>
>>> +        help
>>> +          With VDUSE it is possible to emulate a vDPA Device
>>> +          in a userspace program.
>>> +
>>>  config IFCVF
>>>          tristate "Intel IFC VF vDPA driver"
>>>          depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>>  # SPDX-License-Identifier: GPL-2.0
>>>  obj-$(CONFIG_VDPA) += vdpa.o
>>>  obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>  obj-$(CONFIG_IFCVF) += ifcvf/
>>>  obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..b7645e36992b
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
>>
>> Do we really need eventfd.o here considering we've selected it?
>>
> Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?

My bad, I confused this with the common eventfd. So the code is fine here.

>
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
>>> new file mode 100644
>>> index 000000000000..dbffddb08908
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
>>> @@ -0,0 +1,221 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji at bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "eventfd.h"
>>> +
>>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
>>> +
>>> +static void vduse_virqfd_shutdown(struct work_struct *work)
>>> +{
>>> +        u64 cnt;
>>> +        struct vduse_virqfd *virqfd = container_of(work,
>>> +                                        struct vduse_virqfd, shutdown);
>>> +
>>> +        eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
>>> +        flush_work(&virqfd->inject);
>>> +        eventfd_ctx_put(virqfd->ctx);
>>> +        kfree(virqfd);
>>> +}
>>> +
>>> +static void vduse_virqfd_inject(struct work_struct *work)
>>> +{
>>> +        struct vduse_virqfd *virqfd = container_of(work,
>>> +                                        struct vduse_virqfd, inject);
>>> +        struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +        spin_lock_irq(&vq->irq_lock);
>>> +        if (vq->ready && vq->cb)
>>> +                vq->cb(vq->private);
>>> +        spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
>>> +{
>>> +        queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
>>> +}
>>> +
>>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>>> +                               int sync, void *key)
>>> +{
>>> +        struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
>>> +        struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> +        __poll_t flags = key_to_poll(key);
>>> +
>>> +        if (flags & EPOLLIN)
>>> +                schedule_work(&virqfd->inject);
>>> +
>>> +        if (flags & EPOLLHUP) {
>>> +                spin_lock(&vq->irq_lock);
>>> +                if (vq->virqfd == virqfd) {
>>> +                        vq->virqfd = NULL;
>>> +                        virqfd_deactivate(virqfd);
>>> +                }
>>> +                spin_unlock(&vq->irq_lock);
>>> +        }
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
>>> +                        wait_queue_head_t *wqh, poll_table *pt)
>>> +{
>>> +        struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
>>> +
>>> +        add_wait_queue(wqh, &virqfd->wait);
>>> +}
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                        struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +        struct vduse_virqfd *virqfd;
>>> +        struct fd irqfd;
>>> +        struct eventfd_ctx *ctx;
>>> +        struct vduse_virtqueue *vq;
>>> +        __poll_t events;
>>> +        int ret;
>>> +
>>> +        if (eventfd->index >= dev->vq_num)
>>> +                return -EINVAL;
>>> +
>>> +        vq = &dev->vqs[eventfd->index];
>>> +        virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
>>> +        if (!virqfd)
>>> +                return -ENOMEM;
>>> +
>>> +        INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
>>> +        INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
>>
>> Any reason that a workqueue is a must here?
>>
> Mainly for performance considerations. It makes sure the push() and pop()
> for the used vring can be asynchronous.

I see.

>
>>> +
>>> +        ret = -EBADF;
>>> +        irqfd = fdget(eventfd->fd);
>>> +        if (!irqfd.file)
>>> +                goto err_fd;
>>> +
>>> +        ctx = eventfd_ctx_fileget(irqfd.file);
>>> +        if (IS_ERR(ctx)) {
>>> +                ret = PTR_ERR(ctx);
>>> +                goto err_ctx;
>>> +        }
>>> +
>>> +        virqfd->vq = vq;
>>> +        virqfd->ctx = ctx;
>>> +        spin_lock(&vq->irq_lock);
>>> +        if (vq->virqfd)
>>> +                virqfd_deactivate(virqfd);
>>> +        vq->virqfd = virqfd;
>>> +        spin_unlock(&vq->irq_lock);
>>> +
>>> +        init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
>>> +        init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
>>> +
>>> +        events = vfs_poll(irqfd.file, &virqfd->pt);
>>> +
>>> +        /*
>>> +         * Check if there was an event already pending on the eventfd
>>> +         * before we registered and trigger it as if we didn't miss it.
>>> +         */
>>> +        if (events & EPOLLIN)
>>> +                schedule_work(&virqfd->inject);
>>> +
>>> +        fdput(irqfd);
>>> +
>>> +        return 0;
>>> +err_ctx:
>>> +        fdput(irqfd);
>>> +err_fd:
>>> +        kfree(virqfd);
>>> +        return ret;
>>> +}
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev)
>>> +{
>>> +        int i;
>>> +
>>> +        for (i = 0; i < dev->vq_num; i++) {
>>> +                struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +                spin_lock(&vq->irq_lock);
>>> +                if (vq->virqfd) {
>>> +                        virqfd_deactivate(vq->virqfd);
>>> +                        vq->virqfd = NULL;
>>> +                }
>>> +                spin_unlock(&vq->irq_lock);
>>> +        }
>>> +        flush_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +int vduse_virqfd_init(void)
>>> +{
>>> +        vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
>>> +                                                 WQ_UNBOUND, 0);
>>> +        if (!vduse_irqfd_cleanup_wq)
>>> +                return -ENOMEM;
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +void vduse_virqfd_exit(void)
>>> +{
>>> +        destroy_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
>>> +{
>>> +        spin_lock(&vq->kick_lock);
>>> +        if (vq->ready && vq->kickfd)
>>> +                eventfd_signal(vq->kickfd, 1);
>>> +        spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                        struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +        struct eventfd_ctx *ctx;
>>> +        struct vduse_virtqueue *vq;
>>> +
>>> +        if (eventfd->index >= dev->vq_num)
>>> +                return -EINVAL;
>>> +
>>> +        vq = &dev->vqs[eventfd->index];
>>> +        ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +        if (IS_ERR(ctx))
>>> +                return PTR_ERR(ctx);
>>> +
>>> +        spin_lock(&vq->kick_lock);
>>> +        if (vq->kickfd)
>>> +                eventfd_ctx_put(vq->kickfd);
>>> +        vq->kickfd = ctx;
>>> +        spin_unlock(&vq->kick_lock);
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev)
>>> +{
>>> +        int i;
>>> +
>>> +        for (i = 0; i < dev->vq_num; i++) {
>>> +                struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +                spin_lock(&vq->kick_lock);
>>> +                if (vq->kickfd) {
>>> +                        eventfd_ctx_put(vq->kickfd);
>>> +                        vq->kickfd = NULL;
>>> +                }
>>> +                spin_unlock(&vq->kick_lock);
>>> +        }
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
>>> new file mode 100644
>>> index 000000000000..14269ff27f47
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
>>> @@ -0,0 +1,48 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji at bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_EVENTFD_H
>>> +#define _VDUSE_EVENTFD_H
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "vduse.h"
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_virqfd {
>>> +        struct eventfd_ctx *ctx;
>>> +        struct vduse_virtqueue *vq;
>>> +        struct work_struct inject;
>>> +        struct work_struct shutdown;
>>> +        wait_queue_entry_t wait;
>>> +        poll_table pt;
>>> +};
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> +                        struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev);
>>> +
>>> +int vduse_virqfd_init(void);
>>> +
>>> +void vduse_virqfd_exit(void);
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                        struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev);
>>> +
>>> +#endif /* _VDUSE_EVENTFD_H */
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..27022157abc6
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,442 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji at bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/genalloc.h>
>>> +#include <linux/dma-mapping.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_CHUNK_SHIFT 26
>>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
>>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
>>> +
>>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
>>> +
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +struct vduse_mmap_vma {
>>> +        struct vm_area_struct *vma;
>>> +        struct list_head list;
>>> +};
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova)
>>> +{
>>> +        unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +        unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +        unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> +        return domain->chunks[index].bounce_pages[pgindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, struct page *page)
>>> +{
>>> +        unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +        unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +        unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> +        domain->chunks[index].bounce_pages[pgindex] = page;
>>> +}
>>> +
>>> +static inline struct vduse_iova_map *
>>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova)
>>> +{
>>> +        unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +        unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +        unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> +        return domain->chunks[index].iova_map[mapindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, struct vduse_iova_map *map)
>>> +{
>>> +        unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +        unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> +        unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> +        domain->chunks[index].iova_map[mapindex] = map;
>>> +}
>>> +
>>> +static int
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, size_t size)
>>> +{
>>> +        struct page *page;
>>> +        size_t walk_sz = 0;
>>> +        int frees = 0;
>>> +
>>> +        while (walk_sz < size) {
>>> +                page = vduse_domain_get_bounce_page(domain, iova);
>>> +                if (page) {
>>> +                        vduse_domain_set_bounce_page(domain, iova, NULL);
>>> +                        put_page(page);
>>> +                        frees++;
>>> +                }
>>> +                iova += PAGE_SIZE;
>>> +                walk_sz += PAGE_SIZE;
>>> +        }
>>> +
>>> +        return frees;
>>> +}
>>> +
>>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
>>> +                                struct vm_area_struct *vma)
>>> +{
>>> +        unsigned long size = vma->vm_end - vma->vm_start;
>>> +        struct vduse_mmap_vma *mmap_vma;
>>> +
>>> +        if (WARN_ON(size != domain->size))
>>> +                return -EINVAL;
>>> +
>>> +        mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>>> +        if (!mmap_vma)
>>> +                return -ENOMEM;
>>> +
>>> +        mmap_vma->vma = vma;
>>> +        mutex_lock(&domain->vma_lock);
>>> +        list_add(&mmap_vma->list, &domain->vma_list);
>>> +        mutex_unlock(&domain->vma_lock);
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
>>> +                                struct vm_area_struct *vma)
>>> +{
>>> +        struct vduse_mmap_vma *mmap_vma;
>>> +
>>> +        mutex_lock(&domain->vma_lock);
>>> +        list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> +                if (mmap_vma->vma == vma) {
>>> +                        list_del(&mmap_vma->list);
>>> +                        kfree(mmap_vma);
>>> +                        break;
>>> +                }
>>> +        }
>>> +        mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, unsigned long orig,
>>> +                                size_t size, enum dma_data_direction dir)
>>> +{
>>> +        struct vduse_iova_map *map;
>>> +        unsigned long last = iova + size;
>>> +
>>> +        map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
>>> +        if (!map)
>>> +                return -ENOMEM;
>>> +
>>> +        map->iova = iova;
>>> +        map->orig = orig;
>>> +        map->size = size;
>>> +        map->dir = dir;
>>> +
>>> +        while (iova < last) {
>>> +                vduse_domain_set_iova_map(domain, iova, map);
>>> +                iova += IOVA_ALLOC_SIZE;
>>> +        }
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +struct vduse_iova_map *
>>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova)
>>> +{
>>> +        return vduse_domain_get_iova_map(domain, iova);
>>> +}
>>> +
>>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
>>> +                                struct vduse_iova_map *map)
>>> +{
>>> +        unsigned long iova = map->iova;
>>> +        unsigned long last = iova + map->size;
>>> +
>>> +        while (iova < last) {
>>> +                vduse_domain_set_iova_map(domain, iova, NULL);
>>> +                iova += IOVA_ALLOC_SIZE;
>>> +        }
>>> +}
>>> +
>>> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, size_t size)
>>> +{
>>> +        struct vduse_mmap_vma *mmap_vma;
>>> +        unsigned long uaddr;
>>> +
>>> +        mutex_lock(&domain->vma_lock);
>>> +        list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> +                mmap_read_lock(mmap_vma->vma->vm_mm);
>>> +                uaddr = iova + mmap_vma->vma->vm_start;
>>> +                zap_page_range(mmap_vma->vma, uaddr, size);
>>> +                mmap_read_unlock(mmap_vma->vma->vm_mm);
>>> +        }
>>> +        mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
>>> +                        struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> +        unsigned long uaddr = iova + vma->vm_start;
>>> +        unsigned long start = iova & PAGE_MASK;
>>> +        unsigned long last = start + PAGE_SIZE - 1;
>>> +        unsigned long offset;
>>> +        struct vduse_iova_map *map;
>>> +        struct page *page = NULL;
>>> +
>>> +        map = vduse_domain_get_iova_map(domain, iova);
>>> +        if (map) {
>>> +                offset = last - map->iova;
>>> +                page = virt_to_page(map->orig + offset);
>>> +        }
>>> +
>>> +        return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>
>> So as we discussed before, we need to find a way to make vhost work. And
>> it's better to make vhost transparent to VDUSE. One idea is to implement
>> a shadow virtqueue here, that is, instead of trying to insert the pages
>> into the VDUSE userspace, we use the shadow virtqueue to relay the
>> descriptors to userspace. With this, we don't need stuff like shmfd, etc.
>>
> Good idea! The disadvantage is performance will go down (one more
> thread-switch overhead, and the vhost-like kworker will become a
> bottleneck without multi-thread support).

Yes, the disadvantage is the performance. But it should be simpler (I
guess) and we know it can succeed.

> I think I can try this in v3. And the
> MMU-based IOMMU implementation can be a future optimization in the
> virtio-vdpa case. What's your opinion?

Maybe I was wrong, but I think we can try what has been proposed here
first and use the shadow virtqueue as a backup plan if we fail.

>
>>> +
>>> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova, unsigned long orig,
>>> +                                size_t size, enum dma_data_direction dir)
>>> +{
>>> +        unsigned int offset = offset_in_page(iova);
>>> +
>>> +        while (size) {
>>> +                struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> +                size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
>>> +                void *addr;
>>> +
>>> +                if (p) {
>>> +                        addr = page_address(p) + offset;
>>> +                        if (dir == DMA_TO_DEVICE)
>>> +                                memcpy(addr, (void *)orig, copy_len);
>>> +                        else if (dir == DMA_FROM_DEVICE)
>>> +                                memcpy((void *)orig, addr, copy_len);
>>> +                }
>>
>> I think I missed something: for DMA_FROM_DEVICE, if p doesn't exist, how
>> is it expected to work? Or do we need to warn here in this case?
>>
> Yes, I think we need a WARN_ON here.

Ok.

>
>>> +                size -= copy_len;
>>> +                orig += copy_len;
>>> +                iova += copy_len;
>>> +                offset = 0;
>>> +        }
>>> +}
>>> +
>>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
>>> +                        struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> +        unsigned long uaddr = iova + vma->vm_start;
>>> +        unsigned long start = iova & PAGE_MASK;
>>> +        unsigned long offset = 0;
>>> +        bool found = false;
>>> +        struct vduse_iova_map *map;
>>> +        struct page *page;
>>> +
>>> +        mutex_lock(&domain->map_lock);
>>> +
>>> +        page = vduse_domain_get_bounce_page(domain, iova);
>>> +        if (page)
>>> +                goto unlock;
>>> +
>>> +        page = alloc_page(GFP_KERNEL);
>>> +        if (!page)
>>> +                goto unlock;
>>> +
>>> +        while (offset < PAGE_SIZE) {
>>> +                unsigned int src_offset = 0, dst_offset = 0;
>>> +                void *src, *dst;
>>> +                size_t copy_len;
>>> +
>>> +                map = vduse_domain_get_iova_map(domain, start + offset);
>>> +                if (!map) {
>>> +                        offset += IOVA_ALLOC_SIZE;
>>> +                        continue;
>>> +                }
>>> +
>>> +                found = true;
>>> +                offset += map->size;
>>> +                if (map->dir == DMA_FROM_DEVICE)
>>> +                        continue;
>>> +
>>> +                if (start > map->iova)
>>> +                        src_offset = start - map->iova;
>>> +                else
>>> +                        dst_offset = map->iova - start;
>>> +
>>> +                src = (void *)(map->orig + src_offset);
>>> +                dst = page_address(page) + dst_offset;
>>> +                copy_len = min_t(size_t, map->size - src_offset,
>>> +                                PAGE_SIZE - dst_offset);
>>> +                memcpy(dst, src, copy_len);
>>> +        }
>>> +        if (!found) {
>>> +                put_page(page);
>>> +                page = NULL;
>>> +        }
>>> +        vduse_domain_set_bounce_page(domain, iova, page);
>>> +unlock:
>>> +        mutex_unlock(&domain->map_lock);
>>> +
>>> +        return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>> +
>>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
>>> +                                unsigned long iova)
>>> +{
>>> +        unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> +        struct vduse_iova_chunk *chunk = &domain->chunks[index];
>>> +
>>> +        return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
>>> +}
>>> +
>>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
>>> +                                size_t size, enum iova_map_type type)
>>> +{
>>> +        struct vduse_iova_chunk *chunk;
>>> +        unsigned long iova = 0;
>>> +        int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
>>> +        struct genpool_data_align data = { .align = align };
>>> +        int i;
>>> +
>>> +        for (i = 0; i < domain->chunk_num; i++) {
>>> +                chunk = &domain->chunks[i];
>>> +                if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
>>> +                        atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
>>> +
>>> +                if (atomic_read(&chunk->map_type) != type)
>>> +                        continue;
>>> +
>>> +                iova = gen_pool_alloc_algo(chunk->pool, size,
>>> +                                        gen_pool_first_fit_align, &data);
>>> +                if (iova)
>>> +                        break;
>>> +        }
>>> +
>>> +        return iova;
>>
>> I wonder why not just reuse the iova domain implementation in
>> drivers/iommu/iova.c?
>>
> The iova domain in drivers/iommu/iova.c is only an iova allocator, which
> is implemented by the genpool memory allocator in our case. The other
> part in our iova domain is chunk management and iova_map management.
> We need different chunks to distinguish different dma mapping types:
> consistent mapping or streaming mapping. We can only use the
> bouncing mechanism in the streaming-mapping case.

To differentiate DMA mappings, you can use two iova domains with
different ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU
drivers are using the iova domain.)

Thanks
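
For illustration, the two-domain idea could look roughly like the sketch
below. It assumes the standard allocator API from drivers/iommu/iova.c
(iova_cache_get()/init_iova_domain()/alloc_iova()); the structure names,
helper names and the 1G range split are hypothetical:

    /*
     * Hypothetical sketch: carve the device's iova space into two
     * ranges, one iova_domain per dma mapping type, instead of the
     * per-chunk map_type tracking above. The 1G split is arbitrary.
     */
    #include <linux/iova.h>
    #include <linux/sizes.h>

    #define VDUSE_BOUNCE_PFN_START  1                       /* streaming range */
    #define VDUSE_BOUNCE_PFN_LAST   ((SZ_1G >> PAGE_SHIFT) - 1)
    #define VDUSE_DIRECT_PFN_START  (SZ_1G >> PAGE_SHIFT)   /* consistent range */

    struct vduse_iova_domains {
            struct iova_domain stream;      /* bounce-buffered (streaming) mappings */
            struct iova_domain consistent;  /* directly mapped coherent buffers */
    };

    static int vduse_iova_domains_init(struct vduse_iova_domains *doms)
    {
            int ret = iova_cache_get();

            if (ret)
                    return ret;

            init_iova_domain(&doms->stream, PAGE_SIZE, VDUSE_BOUNCE_PFN_START);
            init_iova_domain(&doms->consistent, PAGE_SIZE, VDUSE_DIRECT_PFN_START);
            return 0;
    }

    static dma_addr_t vduse_iova_alloc(struct vduse_iova_domains *doms,
                                       size_t size, bool streaming)
    {
            /* pick the domain by mapping type; the ranges never overlap */
            struct iova_domain *iovad = streaming ? &doms->stream :
                                                    &doms->consistent;
            unsigned long limit = streaming ? VDUSE_BOUNCE_PFN_LAST : ULONG_MAX;
            unsigned long pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
            struct iova *iova = alloc_iova(iovad, pages, limit, true);

            return iova ? iova_dma_addr(iovad, iova) : 0;
    }

With this layout, whether an iova is bounced or directly mapped can be decided
by a simple range check on the address itself, so the chunk map_type atomics
would no longer be needed.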