The following patches apply over mst's vhost branch and were tested
against that branch and also mkp's 5.13 branch, which has some
vhost-scsi changes.

These patches allow us to support multiple vhost workers per device. I
ended up just doing Stefan's original idea where userspace has the
kernel create a worker and we pass back the pid. This has the benefit
over the workqueue and userspace-thread approaches that we only have
one'ish code path in the kernel. The kernel patches here allow us to do
N workers per device and also to share workers across devices.

I included a patch for qemu so you can get an idea of how it works.

TODO:
-----
- polling
- Allow sharing workers across devices. Kernel support is added and I
  hacked up userspace to test, but I'm still working on a sane way to
  manage it in userspace.
- Bind to specific CPUs. Commands like "virsh emulatorpin" work with
  these patches and allow us to set the group of vhost threads to
  different CPUs. But we can also set a specific vq's worker to run on
  a CPU.
- I'm handling old kernels by just checking for EPERM. Does this
  require a feature?

Results:
--------
When running with the null_blk driver and vhost-scsi I can get 1.2
million IOPs by just running a simple

fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=128 --numjobs=8 --time_based \
    --group_reporting --name=iops --runtime=60 --eta-newline=1

The VM has 8 vCPUs and sda has 8 virtqueues, and we can do a total of
1024 cmds per device. To get 1.2 million IOPs I did have to tune and
ran the virsh emulatorpin command so the vhost threads were running on
different CPUs than the VM. If the vhost threads share CPUs then I get
around 800K.

For a more real device that is also a CPU hog, like iscsi, I can still
get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
(natively it gets 1.1 million IOPs).
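To make that concrete, here is a rough sketch of how a userspace tool
would drive the proposed interface (the ioctl and struct come from
patches 11 and 14 below; vhost_fd and nvqs are placeholders and error
handling is elided):

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Bind each vq to its own newly created vhost worker thread. */
static int create_worker_per_vq(int vhost_fd, int nvqs)
{
        struct vhost_vring_worker w;
        int i;

        for (i = 0; i < nvqs; i++) {
                w.index = i;
                w.pid = -1;     /* -1 asks the kernel to create a worker */
                if (ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w) < 0)
                        return -1;
                /*
                 * On return w.pid holds the new worker's pid. Passing
                 * that pid back in on another vq shares the worker.
                 */
        }
        return 0;
}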
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 01/14] vhost: remove work arg from vhost_work_flush
vhost_work_flush doesn't do anything with the work arg. This patch
drops it and then renames vhost_work_flush to vhost_work_dev_flush to
reflect that the function flushes all the works in the dev and not just
a specific queue or work item.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
Acked-by: Jason Wang <jasowang at redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
---
 drivers/vhost/scsi.c  | 4 ++--
 drivers/vhost/vhost.c | 8 ++++----
 drivers/vhost/vhost.h | 2 +-
 drivers/vhost/vsock.c | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index d16c04dcc144..0fd596da1834 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1455,8 +1455,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	/* Flush both the vhost poll and vhost work */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
 		vhost_scsi_flush_vq(vs, i);
-	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
-	vhost_work_flush(&vs->dev, &vs->vs_event_work);
+	vhost_work_dev_flush(&vs->dev);
+	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5ccb0705beae..b9e853e6094d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,7 +231,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
 
@@ -243,13 +243,13 @@ void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
 		wait_for_completion(&flush.wait_event);
 	}
 }
-EXPORT_SYMBOL_GPL(vhost_work_flush);
+EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_dev_flush(poll->dev);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
@@ -538,7 +538,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
-	vhost_work_flush(dev, &attach.work);
+	vhost_work_dev_flush(dev);
 	return attach.ret;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b063324c7669..1ba8e814989d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -46,7 +46,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
+void vhost_work_dev_flush(struct vhost_dev *dev);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 
 struct vhost_log {
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 5e78fb719602..f954f4d29c95 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -663,7 +663,7 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
 		if (vsock->vqs[i].handle_kick)
 			vhost_poll_flush(&vsock->vqs[i].poll);
-	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+	vhost_work_dev_flush(&vsock->dev);
 }
 
 static void vhost_vsock_reset_orphans(struct sock *sk)
--
2.25.1
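For context, the flush that just got renamed is implemented as a work
item that fires a completion, so waiting on the completion guarantees
every work queued before it has run. The relevant existing definitions
in drivers/vhost/vhost.c are roughly:

struct vhost_flush_struct {
        struct vhost_work work;
        struct completion wait_event;
};

static void vhost_flush_work(struct vhost_work *work)
{
        struct vhost_flush_struct *s;

        /* Runs on the worker thread: everything queued earlier is done. */
        s = container_of(work, struct vhost_flush_struct, work);
        complete(&s->wait_event);
}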
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 1/1] QEMU vhost-scsi: add support for VHOST_SET_VRING_WORKER
This patch adds support for the proposed ioctl that allows userspace to
create virtqueue workers. For vhost-scsi you can set virtqueue_workers
to:

 0: default behavior where we have 1 worker for all vqs.
-1: create a worker per vq.
>0: create N workers and allow the vqs to share them by assigning a vq
    a worker by just doing round robin.

TODO:
- Allow sharing workers across devices.
- Bind to specific CPUs. Commands like "virsh emulatorpin" allow us to
  set the group of vhost threads to different CPUs. But we can also set
  a specific vq's worker to run on a CPU.
- I'm handling old kernels by just checking for EPERM. Does this
  require a feature?

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 hw/scsi/vhost-scsi.c                         | 85 +++++++++++++++++++-
 hw/virtio/vhost-backend.c                    |  8 ++
 include/hw/virtio/vhost-backend.h            |  4 +
 include/hw/virtio/virtio-scsi.h              |  1 +
 include/standard-headers/linux/vhost_types.h |  9 +++
 linux-headers/linux/vhost.h                  |  7 ++
 6 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 4d70fa036bbe..9e3653d158c3 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -163,6 +163,76 @@ static const VMStateDescription vmstate_virtio_vhost_scsi = {
     .pre_save = vhost_scsi_pre_save,
 };
 
+static int vhost_scsi_set_workers(VHostSCSICommon *vsc, int vq_workers)
+{
+    struct vhost_dev *dev = &vsc->dev;
+    int worker_index = 0, num_workers = 0;
+    struct vhost_vring_worker w;
+    pid_t *workers = NULL;
+    int i, ret;
+
+    if (vq_workers < -1)
+        return -EINVAL;
+
+    if (vq_workers > 0) {
+        if (vq_workers > dev->nvqs)
+            vq_workers = dev->nvqs;
+
+        workers = g_malloc0(vq_workers * sizeof(pid_t));
+    }
+
+    w.pid = -1;
+    for (i = 0; i < dev->nvqs; i++) {
+        w.index = i;
+
+        switch (vq_workers) {
+        case -1:
+            /*
+             * ctl/evt share the first worker since it will be rare for them
+             * to send cmds while IO is running. The rest of the vqs get their
+             * own worker.
+             */
+            if (i > VHOST_SCSI_VQ_NUM_FIXED)
+                w.pid = -1;
+            break;
+        case 0:
+            /* All vqs share 1 worker. Pass back the pid we got the first run */
+            break;
+        default:
+            /* Each worker handles N vqs. */
+            if (num_workers == vq_workers) {
+                w.pid = workers[worker_index];
+
+                worker_index++;
+                if (worker_index == vq_workers)
+                    worker_index = 0;
+            } else {
+                w.pid = -1;
+            }
+
+            break;
+        }
+
+        ret = dev->vhost_ops->vhost_set_vring_worker(dev, &w);
+        /* Ignore for now. Add feature in final patch */
+        if (ret == -EPERM) {
+            ret = 0;
+            goto free_workers;
+        } else if (ret)
+            goto free_workers;
+
+        if (vq_workers > 0 && num_workers < vq_workers) {
+            workers[num_workers] = w.pid;
+            num_workers++;
+        }
+    }
+
+free_workers:
+    if (workers)
+        g_free(workers);
+    return ret;
+}
+
 static void vhost_scsi_realize(DeviceState *dev, Error **errp)
 {
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
@@ -226,6 +296,13 @@ static void vhost_scsi_realize(DeviceState *dev, Error **errp)
         goto free_vqs;
     }
 
+    ret = vhost_scsi_set_workers(vsc, vs->conf.virtqueue_workers);
+    if (ret < 0) {
+        error_setg(errp, "vhost-scsi: vhost worker setup failed: %s",
+                   strerror(-ret));
+        goto free_vqs;
+    }
+
     /* At present, channel and lun both are 0 for bootable vhost-scsi disk */
     vsc->channel = 0;
     vsc->lun = 0;
@@ -271,18 +348,20 @@ static Property vhost_scsi_properties[] = {
     DEFINE_PROP_STRING("wwpn", VirtIOSCSICommon, conf.wwpn),
     DEFINE_PROP_UINT32("boot_tpgt", VirtIOSCSICommon, conf.boot_tpgt, 0),
     DEFINE_PROP_UINT32("num_queues", VirtIOSCSICommon, conf.num_queues,
-                       VIRTIO_SCSI_AUTO_NUM_QUEUES),
+                       8),
     DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSICommon, conf.virtqueue_size,
-                       128),
+                       1024),
     DEFINE_PROP_BOOL("seg_max_adjust", VirtIOSCSICommon, conf.seg_max_adjust,
                      true),
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSICommon, conf.max_sectors,
                        0xFFFF),
-    DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSICommon, conf.cmd_per_lun, 128),
+    DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSICommon, conf.cmd_per_lun, 1024),
     DEFINE_PROP_BIT64("t10_pi", VHostSCSICommon, host_features,
                       VIRTIO_SCSI_F_T10_PI,
                       false),
     DEFINE_PROP_BOOL("migratable", VHostSCSICommon, migratable, false),
+    DEFINE_PROP_INT32("virtqueue_workers", VirtIOSCSICommon,
+                      conf.virtqueue_workers, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 31b33bde37b2..0dc9acfca7ec 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -150,6 +150,13 @@ static int vhost_kernel_set_vring_busyloop_timeout(struct vhost_dev *dev,
     return vhost_kernel_call(dev, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, s);
 }
 
+static int vhost_kernel_set_vring_worker(struct vhost_dev *dev,
+                                         struct vhost_vring_worker *worker)
+{
+    return vhost_kernel_call(dev, VHOST_SET_VRING_WORKER, worker);
+}
+
+
 static int vhost_kernel_set_features(struct vhost_dev *dev,
                                      uint64_t features)
 {
@@ -311,6 +318,7 @@ static const VhostOps kernel_ops = {
         .vhost_set_vring_call = vhost_kernel_set_vring_call,
         .vhost_set_vring_busyloop_timeout =
                                 vhost_kernel_set_vring_busyloop_timeout,
+        .vhost_set_vring_worker = vhost_kernel_set_vring_worker,
         .vhost_set_features = vhost_kernel_set_features,
         .vhost_get_features = vhost_kernel_get_features,
         .vhost_set_backend_cap = vhost_kernel_set_backend_cap,
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 8a6f8e2a7a30..375fd6e79d8f 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -33,6 +33,7 @@ struct vhost_memory;
 struct vhost_vring_file;
 struct vhost_vring_state;
 struct vhost_vring_addr;
+struct vhost_vring_worker;
 struct vhost_scsi_target;
 struct vhost_iotlb_msg;
 struct vhost_virtqueue;
@@ -70,6 +71,8 @@ typedef int (*vhost_set_vring_call_op)(struct vhost_dev *dev,
                                        struct vhost_vring_file *file);
 typedef int (*vhost_set_vring_busyloop_timeout_op)(struct vhost_dev *dev,
                                                    struct vhost_vring_state *r);
+typedef int (*vhost_set_vring_worker_op)(struct vhost_dev *dev,
+                                         struct vhost_vring_worker *worker);
 typedef int (*vhost_set_features_op)(struct vhost_dev *dev,
                                      uint64_t features);
 typedef int (*vhost_get_features_op)(struct vhost_dev *dev,
@@ -145,6 +148,7 @@ typedef struct VhostOps {
     vhost_set_vring_kick_op vhost_set_vring_kick;
     vhost_set_vring_call_op vhost_set_vring_call;
     vhost_set_vring_busyloop_timeout_op vhost_set_vring_busyloop_timeout;
+    vhost_set_vring_worker_op vhost_set_vring_worker;
     vhost_set_features_op vhost_set_features;
     vhost_get_features_op vhost_get_features;
     vhost_set_backend_cap_op vhost_set_backend_cap;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 543681bc1838..694221601dad 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -58,6 +58,7 @@ struct VirtIOSCSIConf {
 #ifdef CONFIG_VHOST_SCSI
     char *vhostfd;
     char *wwpn;
+    int virtqueue_workers;
 #endif
     CharBackend chardev;
     uint32_t boot_tpgt;
diff --git a/include/standard-headers/linux/vhost_types.h b/include/standard-headers/linux/vhost_types.h
index 0bd2684a2ae4..0d81ff6b2f1f 100644
--- a/include/standard-headers/linux/vhost_types.h
+++ b/include/standard-headers/linux/vhost_types.h
@@ -47,6 +47,15 @@ struct vhost_vring_addr {
 	uint64_t log_guest_addr;
 };
 
+struct vhost_vring_worker {
+	unsigned int index;
+	/*
+	 * The pid of the vhost worker that the vq will be bound to. If -1,
+	 * a new worker will be created and its pid will be returned in pid.
+	 */
+	pid_t pid;
+};
+
 /* no alignment requirement */
 struct vhost_iotlb_msg {
 	uint64_t iova;
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index c998860d7bbc..24569f89611b 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -70,6 +70,13 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Create/bind a vhost worker to a virtqueue. If pid > 0 and matches an existing
+ * vhost_worker thread it will be bound to the vq. If pid is -1, then a new
+ * worker will be created and bound to the vq.
+ */
+#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
+/* Return the vq's worker's pid. If no worker is set pid is -1 */
+#define VHOST_GET_VRING_WORKER _IOR(VHOST_VIRTIO, 0x16, struct vhost_vring_worker)
 
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
--
2.25.1
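With the QEMU patch applied the knob is just another device property,
so a worker per IO vq would be requested with something along the lines
of -device vhost-scsi-pci,wwpn=naa.xxx,num_queues=8,virtqueue_workers=-1
(a sketch; the wwpn and the rest of the device options are placeholders
for whatever your setup already uses).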
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 02/14] vhost-scsi: remove extra flushes
The vhost work flush function was flushing the entire work queue, so
there is no need for the double vhost_work_dev_flush calls in
vhost_scsi_flush.

And we do not need to call vhost_poll_flush for each poller because
that call also ends up flushing the same work queue thread the
vhost_work_dev_flush call flushed.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha at redhat.com>
Acked-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/scsi.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 0fd596da1834..b3e6fe9b1767 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1430,11 +1430,6 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
 	vhost_scsi_handle_vq(vs, vq);
 }
 
-static void vhost_scsi_flush_vq(struct vhost_scsi *vs, int index)
-{
-	vhost_poll_flush(&vs->vqs[index].vq.poll);
-}
-
 /* Callers must hold dev mutex */
 static void vhost_scsi_flush(struct vhost_scsi *vs)
 {
@@ -1453,9 +1448,6 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 		kref_put(&old_inflight[i]->kref, vhost_scsi_done_inflight);
 
 	/* Flush both the vhost poll and vhost work */
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
-		vhost_scsi_flush_vq(vs, i);
-
 	vhost_work_dev_flush(&vs->dev);
 	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 03/14] vhost-scsi: reduce flushes during endpoint clearing
vhost_scsi_flush will flush everything, so we can clear the backends
then flush, then destroy. We don't need to flush before each vq
destruction because after the flush we will have made sure there can be
no new cmds started and there are no running cmds.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/scsi.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index b3e6fe9b1767..46f897e41217 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1732,11 +1732,12 @@ vhost_scsi_clear_endpoint(struct vhost_scsi *vs,
 			mutex_lock(&vq->mutex);
 			vhost_vq_set_backend(vq, NULL);
 			mutex_unlock(&vq->mutex);
-			/*
-			 * Make sure cmds are not running before tearing them
-			 * down.
-			 */
-			vhost_scsi_flush(vs);
+		}
+		/* Make sure cmds are not running before tearing them down. */
+		vhost_scsi_flush(vs);
+
+		for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
+			vq = &vs->vqs[i].vq;
 			vhost_scsi_destroy_vq_cmds(vq);
 		}
 	}
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 04/14] vhost: fix poll coding style
We use like 3 coding styles in this struct. Switch to just tabs.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha at redhat.com>
Acked-by: Jason Wang <jasowang at redhat.com>
---
 drivers/vhost/vhost.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1ba8e814989d..575c8180caad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -28,12 +28,12 @@ struct vhost_work {
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
-	poll_table                table;
-	wait_queue_head_t        *wqh;
-	wait_queue_entry_t              wait;
-	struct vhost_work	  work;
-	__poll_t		  mask;
-	struct vhost_dev	 *dev;
+	poll_table		table;
+	wait_queue_head_t	*wqh;
+	wait_queue_entry_t	wait;
+	struct vhost_work	work;
+	__poll_t		mask;
+	struct vhost_dev	*dev;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 05/14] vhost: move worker thread fields to new struct
This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers
that will be used in the next patches.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |  9 ++++-
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b9e853e6094d..0cd19b1a832e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,59 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -595,31 +644,18 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
@@ -712,13 +748,9 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
+	dev->kcov_handle = 0;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 575c8180caad..5b1e4cdc7756 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,12 @@ struct vhost_work {
 	unsigned long flags;
 };
 
+struct vhost_worker {
+	struct task_struct *task;
+	struct llist_head work_list;
+	struct vhost_dev *dev;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -149,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
--
2.25.1
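The thread created above runs the standard llist-draining kthread loop.
Simplified from the existing vhost_worker() (kcov and mm handling
dropped), it looks roughly like this:

#include <linux/kthread.h>
#include <linux/llist.h>
#include <linux/sched.h>

static int vhost_worker_fn(void *data)
{
        struct vhost_worker *worker = data;
        struct vhost_work *work, *work_next;
        struct llist_node *node;

        for (;;) {
                /* mb paired w/ kthread_stop */
                set_current_state(TASK_INTERRUPTIBLE);

                if (kthread_should_stop()) {
                        __set_current_state(TASK_RUNNING);
                        break;
                }

                /* Grab everything queued so far in one atomic op. */
                node = llist_del_all(&worker->work_list);
                if (!node)
                        schedule();     /* wait for the next wake_up_process() */

                /* llist is LIFO; reverse to run works in queueing order. */
                node = llist_reverse_order(node);
                llist_for_each_entry_safe(work, work_next, node, node) {
                        clear_bit(VHOST_WORK_QUEUED, &work->flags);
                        __set_current_state(TASK_RUNNING);
                        work->fn(work);
                        if (need_resched())
                                schedule();
                }
        }
        return 0;
}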
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 06/14] vhost: move vhost worker creation to kick setup
The next patch will add new ioctls that allow userspace to create
workers and bind them to devs and vqs after VHOST_SET_OWNER. To support
older tools, newer tools that want to go wild with worker threads, and
newer tools that want the old/default behavior, this patch moves the
default worker setup to the kick setup.

After the first vq's kick/poll setup is done we could start to get
work queued, so this is the point where we must have a worker set up.
If we are using older tools, or the newer tools just want the default
single vhost thread per dev behavior, then it will automatically be
done here. If the tools are using the newer ioctls and have already
created the workers, then we also detect that here and do nothing.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0cd19b1a832e..a291cde95c43 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -629,6 +629,15 @@ static int vhost_worker_create(struct vhost_dev *dev)
 	return ret;
 }
 
+/* Caller must have device mutex */
+static int vhost_worker_try_create_def(struct vhost_dev *dev)
+{
+	if (!dev->use_worker || dev->worker)
+		return 0;
+
+	return vhost_worker_create(dev);
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -643,11 +652,6 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	vhost_attach_mm(dev);
 
 	dev->kcov_handle = kcov_common_handle();
-	if (dev->use_worker) {
-		err = vhost_worker_create(dev);
-		if (err)
-			goto err_worker;
-	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
@@ -655,8 +659,6 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	return 0;
 err_iovecs:
-	vhost_worker_free(dev);
-err_worker:
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
 err_mm:
@@ -1665,6 +1667,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 			r = -EFAULT;
 			break;
 		}
+
+		if (f.fd != VHOST_FILE_UNBIND) {
+			r = vhost_worker_try_create_def(d);
+			if (r)
+				break;
+		}
+
 		eventfp = f.fd == VHOST_FILE_UNBIND ? NULL : eventfd_fget(f.fd);
 		if (IS_ERR(eventfp)) {
 			r = PTR_ERR(eventfp);
--
2.25.1
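From userspace the resulting setup ordering looks roughly like this
(vhost_fd and kick_fd are hypothetical open fds, VHOST_SET_VRING_WORKER
is the ioctl added later in this series, and error handling is
omitted):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static void setup_vq0(int vhost_fd, int kick_fd, bool own_worker)
{
        struct vhost_vring_worker w = { .index = 0, .pid = -1 };
        struct vhost_vring_file f = { .index = 0, .fd = kick_fd };

        /* No worker is created here anymore. */
        ioctl(vhost_fd, VHOST_SET_OWNER);

        /* Newer tools can create their own workers first. */
        if (own_worker)
                ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w);

        /*
         * For older tools the default worker is created on the first
         * kick setup, so they keep the old single-thread behavior.
         */
        ioctl(vhost_fd, VHOST_SET_VRING_KICK, &f);
}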
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 07/14] vhost: modify internal functions to take a vhost_worker
The final patches will allow us to have multiple vhost_workers per
device and to share them across devices. To prepare for that, this
patch allows our internal work queueing, flush and cgroup attach
functions to take a vhost_worker as an arg.

The poll code required a change to the driver, so the next patch will
convert that code.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c | 136 ++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |   4 +-
 2 files changed, 94 insertions(+), 46 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a291cde95c43..4bfa9a7a51bb 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,20 +231,6 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_work_dev_flush(struct vhost_dev *dev)
-{
-	struct vhost_flush_struct flush;
-
-	if (dev->worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
-
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
-	}
-}
-EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
-
 /* Flush any work that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
@@ -253,26 +239,62 @@ void vhost_poll_flush(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_queue_on(struct vhost_work *work,
+				struct vhost_worker *worker)
 {
-	if (!dev->worker)
-		return;
-
 	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
 		/* We can only add the work to the list after we're
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->task);
+		llist_add(&work->node, &worker->work_list);
+		wake_up_process(worker->task);
 	}
 }
+
+void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+	if (!dev->workers)
+		return;
+	/*
+	 * devs with use_worker=true created by tools that do not support the
+	 * worker creation ioctl will always have at least one worker.
+	 */
+	vhost_work_queue_on(work, dev->workers[0]);
+}
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+static void vhost_work_dev_flush_on(struct vhost_worker *worker)
+{
+	struct vhost_flush_struct flush;
+
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
+
+	vhost_work_queue_on(&flush.work, worker);
+	wait_for_completion(&flush.wait_event);
+}
+
+void vhost_work_dev_flush(struct vhost_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->num_workers; i++)
+		vhost_work_dev_flush_on(dev->workers[i]);
+}
+EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return dev->worker && !llist_empty(&dev->worker->work_list);
+	int i;
+
+	for (i = 0; i < dev->num_workers; i++) {
+		if (!llist_empty(&dev->workers[i]->work_list))
+			return true;
+	}
+
+	return false;
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -482,7 +504,8 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
 	dev->iotlb = NULL;
 	dev->mm = NULL;
-	dev->worker = NULL;
+	dev->workers = NULL;
+	dev->num_workers = 0;
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
@@ -531,14 +554,14 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
+static int vhost_attach_cgroups_on(struct vhost_worker *worker)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
 	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
+	vhost_work_queue_on(&attach.work, worker);
+	vhost_work_dev_flush_on(worker);
 	return attach.ret;
 }
 
@@ -579,20 +602,29 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_free(struct vhost_worker *worker)
 {
-	struct vhost_worker *worker = dev->worker;
-
-	if (!worker)
-		return;
-
-	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
 	kthread_stop(worker->task);
 	kfree(worker);
 }
 
-static int vhost_worker_create(struct vhost_dev *dev)
+static void vhost_workers_free(struct vhost_dev *dev)
+{
+	int i;
+
+	if (!dev->workers)
+		return;
+
+	for (i = 0; i < dev->num_workers; i++)
+		vhost_worker_free(dev->workers[i]);
+
+	kfree(dev->workers);
+	dev->num_workers = 0;
+	dev->workers = NULL;
+}
+
+static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
 	struct task_struct *task;
@@ -600,42 +632,56 @@ static int vhost_worker_create(struct vhost_dev *dev)
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
 	if (!worker)
-		return -ENOMEM;
+		return NULL;
 
-	dev->worker = worker;
+	worker->id = dev->num_workers;
 	worker->dev = dev;
 	init_llist_head(&worker->work_list);
 
 	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	if (IS_ERR(task))
 		goto free_worker;
-	}
 
 	worker->task = task;
 	wake_up_process(task); /* avoid contributing to loadavg */
 
-	ret = vhost_attach_cgroups(dev);
+	ret = vhost_attach_cgroups_on(worker);
 	if (ret)
 		goto stop_worker;
 
-	return 0;
+	return worker;
 
 stop_worker:
 	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
-	dev->worker = NULL;
-	return ret;
+	return NULL;
 }
 
 /* Caller must have device mutex */
 static int vhost_worker_try_create_def(struct vhost_dev *dev)
 {
-	if (!dev->use_worker || dev->worker)
+	struct vhost_worker *worker;
+
+	if (!dev->use_worker || dev->workers)
 		return 0;
 
-	return vhost_worker_create(dev);
+	dev->workers = kcalloc(1, sizeof(struct vhost_worker *), GFP_KERNEL);
+	if (!dev->workers)
+		return -ENOMEM;
+
+	worker = vhost_worker_create(dev);
+	if (!worker)
+		goto free_workers;
+
+	dev->workers[worker->id] = worker;
+	dev->num_workers++;
+	return 0;
+
+free_workers:
+	kfree(dev->workers);
+	dev->workers = NULL;
+	return -ENOMEM;
 }
 
/* Caller should have device mutex */
@@ -750,7 +796,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	vhost_worker_free(dev);
+	vhost_workers_free(dev);
 	vhost_detach_mm(dev);
 	dev->kcov_handle = 0;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 5b1e4cdc7756..9eb7c3bf0bd6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,6 +29,7 @@ struct vhost_worker {
 	struct task_struct *task;
 	struct llist_head work_list;
 	struct vhost_dev *dev;
+	int id;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -155,7 +156,8 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct vhost_worker *worker;
+	struct vhost_worker **workers;
+	int num_workers;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 08/14] vhost: allow vhost_polls to use different vhost_workers
This allows vhost_polls to use the worker of the vq the poll is
associated with.

This also exports a helper, vhost_vq_work_queue, which is used here
internally and will be used in the vhost-scsi patches.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/net.c   |  6 ++++--
 drivers/vhost/vhost.c | 19 ++++++++++++++++---
 drivers/vhost/vhost.h |  6 +++++-
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..948bc3d361ab 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1334,8 +1334,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
 		       NULL);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 	n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4bfa9a7a51bb..3cc1196d465b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev)
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
 	poll->dev = dev;
 	poll->wqh = NULL;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -298,9 +300,15 @@ bool vhost_has_work(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+	vhost_work_queue_on(work, vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -359,6 +367,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->busyloop_timeout = 0;
 	vq->umem = NULL;
 	vq->iotlb = NULL;
+	vq->worker = NULL;
 	vhost_vring_call_reset(&vq->call_ctx);
 	__vhost_vq_meta_reset(vq);
 }
@@ -527,7 +536,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					EPOLLIN, dev);
+					EPOLLIN, dev, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 9eb7c3bf0bd6..56a6806be8f6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -41,14 +41,17 @@ struct vhost_poll {
 	struct vhost_work	work;
 	__poll_t		mask;
 	struct vhost_dev	*dev;
+	struct vhost_virtqueue	*vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     __poll_t mask, struct vhost_dev *dev);
+		     __poll_t mask, struct vhost_dev *dev,
+		     struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -76,6 +79,7 @@ struct vhost_vring_call {
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
+	struct vhost_worker *worker;
 
 	/* The actual ring of buffers. */
 	struct mutex mutex;
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 09/14] vhost-scsi: flush IO vqs then send TMF rsp
With one worker we will always send the scsi cmd responses and then
send the TMF rsp, because LIO will always complete the scsi cmds first,
which calls vhost_scsi_release_cmd to add them to the work queue.

When the next patches add multiple worker support, the worker threads
could still be sending their responses when the TMF's work is run. So
this patch has vhost-scsi flush the IO vqs on other worker threads
before we send the TMF response.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/scsi.c  | 17 ++++++++++++++---
 drivers/vhost/vhost.c |  6 ++++++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 46f897e41217..96462032a972 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1168,12 +1168,23 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 {
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
-	int resp_code;
+	int resp_code, i;
+
+	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+		/*
+		 * When processing a TMF, lio completes the cmds then the
+		 * TMF, so with one worker the TMF always completes after
+		 * cmds. For multiple worker support, we must flush the
+		 * IO vqs that do not share a worker with the ctl vq (vqs
+		 * 3 and up) to make sure they have completed their cmds.
+		 */
+		for (i = 1; i < tmf->vhost->dev.num_workers; i++)
+			vhost_vq_work_flush(&tmf->vhost->vqs[i + VHOST_SCSI_VQ_IO].vq);
 
-	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	else
+	} else {
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+	}
 
 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3cc1196d465b..345ade0af133 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -300,6 +300,12 @@ bool vhost_has_work(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
+void vhost_vq_work_flush(struct vhost_virtqueue *vq)
+{
+	vhost_work_dev_flush_on(vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
+
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
 	vhost_work_queue_on(work, vq->worker);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 56a6806be8f6..973889ec7d62 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -47,6 +47,7 @@ struct vhost_poll {
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
+void vhost_vq_work_flush(struct vhost_virtqueue *vq);
 void vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 10/14] vhost-scsi: make SCSI cmd completion per vq
This patch separates the scsi cmd completion code paths so we can
complete cmds based on their vq instead of having all cmds complete on
the same worker/CPU. This will be useful with the next patch, which
allows us to create multiple worker threads and bind them to different
vqs, so we can have completions running on different threads/CPUs.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/scsi.c | 48 +++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 96462032a972..2f84cf602ab3 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -176,6 +176,7 @@ enum {
 
 struct vhost_scsi_virtqueue {
 	struct vhost_virtqueue vq;
+	struct vhost_scsi *vs;
 	/*
 	 * Reference counting for inflight reqs, used for flush operation. At
 	 * each time, one reference tracks new commands submitted, while we
@@ -190,6 +191,9 @@ struct vhost_scsi_virtqueue {
 	struct vhost_scsi_cmd *scsi_cmds;
 	struct sbitmap scsi_tags;
 	int max_cmds;
+
+	struct vhost_work completion_work;
+	struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -200,9 +204,6 @@ struct vhost_scsi {
 	struct vhost_dev dev;
 	struct vhost_scsi_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
-	struct vhost_work vs_completion_work; /* cmd completion work item */
-	struct llist_head vs_completion_list; /* cmd completion queue */
-
 	struct vhost_work vs_event_work; /* evt injection work item */
 	struct llist_head vs_event_list; /* evt injection queue */
 
@@ -377,10 +378,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
-		struct vhost_scsi *vs = cmd->tvc_vhost;
+		struct vhost_scsi_virtqueue *svq = container_of(cmd->tvc_vq,
+					struct vhost_scsi_virtqueue, vq);
 
-		llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-		vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
 	}
 }
 
@@ -543,18 +545,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_completion_work);
-	DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
+	struct vhost_scsi_virtqueue *svq = container_of(work,
+				struct vhost_scsi_virtqueue, completion_work);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct vhost_scsi_cmd *cmd, *t;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
 	struct iov_iter iov_iter;
-	int ret, vq;
+	bool signal = false;
+	int ret;
 
-	bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
-	llnode = llist_del_all(&vs->vs_completion_list);
+	llnode = llist_del_all(&svq->completion_list);
 	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
 		se_cmd = &cmd->tvc_se_cmd;
 
@@ -574,21 +575,16 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 			      cmd->tvc_in_iovs, sizeof(v_rsp));
 		ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
 		if (likely(ret == sizeof(v_rsp))) {
-			struct vhost_scsi_virtqueue *q;
+			signal = true;
 			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
-			vq = q - vs->vqs;
-			__set_bit(vq, signal);
 		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_release_cmd_res(se_cmd);
 	}
 
-	vq = -1;
-	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
-		< VHOST_SCSI_MAX_VQ)
-		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+	if (signal)
+		vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1799,6 +1795,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+	struct vhost_scsi_virtqueue *svq;
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
@@ -1811,7 +1808,6 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
@@ -1822,8 +1818,14 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick = vhost_scsi_ctl_handle_kick;
 	vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick = vhost_scsi_evt_handle_kick;
 	for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
-		vqs[i] = &vs->vqs[i].vq;
-		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
+		svq = &vs->vqs[i];
+
+		vqs[i] = &svq->vq;
+		svq->vs = vs;
+		init_llist_head(&svq->completion_list);
+		vhost_work_init(&svq->completion_work,
+				vhost_scsi_complete_cmd_work);
+		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
 		       VHOST_SCSI_WEIGHT, 0, true, NULL);
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 11/14] vhost: allow userspace to create workers
This patch allows userspace to create workers and bind them to vqs, so
you can have N workers per dev and also share N workers with M vqs. The
next patch will allow sharing across devices.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c            | 95 +++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h            |  3 +
 include/uapi/linux/vhost.h       |  6 ++
 include/uapi/linux/vhost_types.h |  9 +++
 4 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 345ade0af133..fecdae0d18c7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -42,6 +42,9 @@ module_param(max_iotlb_entries, int, 0444);
 MODULE_PARM_DESC(max_iotlb_entries,
 	"Maximum number of iotlb entries. (default: 2048)");
 
+static LIST_HEAD(vhost_workers_list);
+static DEFINE_SPINLOCK(vhost_workers_lock);
+
 enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
@@ -617,8 +620,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_worker *worker)
+static void vhost_worker_put(struct vhost_worker *worker)
 {
+	spin_lock(&vhost_workers_lock);
+	if (!refcount_dec_and_test(&worker->refcount)) {
+		spin_unlock(&vhost_workers_lock);
+		return;
+	}
+	list_del(&worker->list);
+	spin_unlock(&vhost_workers_lock);
+
 	WARN_ON(!llist_empty(&worker->work_list));
 	kthread_stop(worker->task);
 	kfree(worker);
@@ -632,7 +643,7 @@ static void vhost_workers_free(struct vhost_dev *dev)
 		return;
 
 	for (i = 0; i < dev->num_workers; i++)
-		vhost_worker_free(dev->workers[i]);
+		vhost_worker_put(dev->workers[i]);
 
 	kfree(dev->workers);
 	dev->num_workers = 0;
@@ -652,6 +663,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	worker->id = dev->num_workers;
 	worker->dev = dev;
 	init_llist_head(&worker->work_list);
+	INIT_LIST_HEAD(&worker->list);
+	refcount_set(&worker->refcount, 1);
 
 	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
 	if (IS_ERR(task))
@@ -664,6 +677,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	if (ret)
 		goto stop_worker;
 
+	spin_lock(&vhost_workers_lock);
+	list_add_tail(&worker->list, &vhost_workers_list);
+	spin_unlock(&vhost_workers_lock);
 	return worker;
 
 stop_worker:
@@ -673,6 +689,71 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	return NULL;
 }
 
+static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
+{
+	struct vhost_worker *worker;
+
+	/* TODO hash on pid? */
+	spin_lock(&vhost_workers_lock);
+	list_for_each_entry(worker, &vhost_workers_list, list) {
+		if (worker->task->pid != pid)
+			continue;
+
+		/* tmp - next patch allows sharing across devs */
+		if (worker->dev != dev) {
+			spin_unlock(&vhost_workers_lock);
+			return NULL;
+		}
+
+		refcount_inc(&worker->refcount);
+		spin_unlock(&vhost_workers_lock);
+		return worker;
+	}
+	spin_unlock(&vhost_workers_lock);
+	return NULL;
+}
+
+/* Caller must have device mutex */
+static int vhost_vq_set_worker(struct vhost_virtqueue *vq,
+			       struct vhost_vring_worker *info)
+{
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_worker *worker;
+
+	if (vq->worker) {
+		/* TODO - support changing while works are running */
+		return -EBUSY;
+	}
+
+	if (info->pid == -1) {
+		worker = vhost_worker_create(dev);
+		if (!worker)
+			return -ENOMEM;
+
+		info->pid = worker->task->pid;
+	} else {
+		worker = vhost_worker_find(dev, info->pid);
+		if (!worker)
+			return -ENODEV;
+	}
+
+	if (!dev->workers) {
+		dev->workers = kcalloc(vq->dev->nvqs,
+				       sizeof(struct vhost_worker *),
+				       GFP_KERNEL);
+		if (!dev->workers) {
+			vhost_worker_put(worker);
+			return -ENOMEM;
+		}
+	}
+
+	vq->worker = worker;
+
+	dev->workers[dev->num_workers] = worker;
+	dev->num_workers++;
+	return 0;
+}
+
 /* Caller must have device mutex */
 static int vhost_worker_try_create_def(struct vhost_dev *dev)
 {
@@ -1680,6 +1761,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 	struct eventfd_ctx *ctx = NULL;
 	u32 __user *idxp = argp;
 	struct vhost_virtqueue *vq;
+	struct vhost_vring_worker w;
 	struct vhost_vring_state s;
 	struct vhost_vring_file f;
 	u32 idx;
@@ -1794,6 +1876,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 		if (copy_to_user(argp, &s, sizeof(s)))
 			r = -EFAULT;
 		break;
+	case VHOST_SET_VRING_WORKER:
+		if (copy_from_user(&w, argp, sizeof(w))) {
+			r = -EFAULT;
+			break;
+		}
+		r = vhost_vq_set_worker(vq, &w);
+		if (!r && copy_to_user(argp, &w, sizeof(w)))
+			r = -EFAULT;
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 973889ec7d62..64dc00337389 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -14,6 +14,7 @@
 #include <linux/atomic.h>
 #include <linux/vhost_iotlb.h>
 #include <linux/irqbypass.h>
+#include <linux/refcount.h>
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -28,6 +29,8 @@ struct vhost_work {
 struct vhost_worker {
 	struct task_struct *task;
 	struct llist_head work_list;
+	struct list_head list;
+	refcount_t refcount;
 	struct vhost_dev *dev;
 	int id;
 };
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c998860d7bbc..61a57f5366ee 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -70,6 +70,12 @@
 #define VHOST_VRING_BIG_ENDIAN 1
 #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
 #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Create/bind a vhost worker to a virtqueue. If pid > 0 and matches an existing
+ * vhost_worker thread it will be bound to the vq. If pid is -1, then a new
+ * worker will be created and bound to the vq.
+ */
+#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
+
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index f7f6a3a28977..216f1658d0b6 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -47,6 +47,15 @@ struct vhost_vring_addr {
 	__u64 log_guest_addr;
 };
 
+struct vhost_vring_worker {
+	unsigned int index;
+	/*
+	 * The pid of the vhost worker that the vq will be bound to. If -1,
+	 * a new worker will be created and its pid will be returned in pid.
+	 */
+	__kernel_pid_t pid;
+};
+
 /* no alignment requirement */
 struct vhost_iotlb_msg {
 	__u64 iova;
--
2.25.1
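As a usage sketch, the two modes of the new ioctl look like this from
userspace (vhost_fd is hypothetical, errors ignored; note that with
this patch the pid must belong to a worker of the same device, and
sharing across devices comes in the next patch):

#include <sys/ioctl.h>
#include <linux/vhost.h>

static void share_one_worker(int vhost_fd)
{
        struct vhost_vring_worker w = { .index = 0, .pid = -1 };

        /* pid == -1: the kernel creates a worker and fills in its pid. */
        ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w);

        /* pid > 0: bind another vq to the worker created above. */
        w.index = 1;
        ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w);
}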
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 12/14] vhost: add vhost_dev pointer to vhost_work
The next patch allows a vhost_worker to handle different devices. To
prepare for that, this patch adds a pointer to the device on the work
so we can get to the different mms in the vhost_worker thread.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/scsi.c  |  7 ++++---
 drivers/vhost/vhost.c | 24 ++++++++++++++----------
 drivers/vhost/vhost.h | 10 ++++++----
 drivers/vhost/vsock.c |  3 ++-
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2f84cf602ab3..0e862503b6a8 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1808,7 +1808,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
+	vhost_work_init(&vs->dev, &vs->vs_event_work, vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
@@ -1823,7 +1823,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		vqs[i] = &svq->vq;
 		svq->vs = vs;
 		init_llist_head(&svq->completion_list);
-		vhost_work_init(&svq->completion_work,
+		vhost_work_init(&vs->dev, &svq->completion_work,
 				vhost_scsi_complete_cmd_work);
 		svq->vq.handle_kick = vhost_scsi_handle_kick;
 	}
@@ -2017,7 +2017,8 @@ static int vhost_scsi_port_link(struct se_portal_group *se_tpg,
 	if (!tmf)
 		return -ENOMEM;
 	INIT_LIST_HEAD(&tmf->queue_entry);
-	vhost_work_init(&tmf->vwork, vhost_scsi_tmf_resp_work);
+	vhost_work_init(&tpg->vhost_scsi->dev, &tmf->vwork,
+			vhost_scsi_tmf_resp_work);
 
 	mutex_lock(&vhost_scsi_mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fecdae0d18c7..7ba0c303bb98 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -181,10 +181,12 @@ static int vhost_poll_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
+		     vhost_work_fn_t fn)
 {
 	clear_bit(VHOST_WORK_QUEUED, &work->flags);
 	work->fn = fn;
+	work->dev = dev;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
@@ -200,7 +202,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 	poll->wqh = NULL;
 	poll->vq = vq;
 
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(dev, &poll->work, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -269,12 +271,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
-static void vhost_work_dev_flush_on(struct vhost_worker *worker)
+static void vhost_work_dev_flush_on(struct vhost_dev *dev,
+				    struct vhost_worker *worker)
 {
 	struct vhost_flush_struct flush;
 
 	init_completion(&flush.wait_event);
-	vhost_work_init(&flush.work, vhost_flush_work);
+	vhost_work_init(dev, &flush.work, vhost_flush_work);
 
 	vhost_work_queue_on(&flush.work, worker);
 	wait_for_completion(&flush.wait_event);
@@ -285,7 +288,7 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
 	int i;
 
 	for (i = 0; i < dev->num_workers; i++)
-		vhost_work_dev_flush_on(dev->workers[i]);
+		vhost_work_dev_flush_on(dev, dev->workers[i]);
 }
 EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
 
@@ -305,7 +308,7 @@ EXPORT_SYMBOL_GPL(vhost_has_work);
 
 void vhost_vq_work_flush(struct vhost_virtqueue *vq)
 {
-	vhost_work_dev_flush_on(vq->worker);
+	vhost_work_dev_flush_on(vq->dev, vq->worker);
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
 
@@ -572,14 +575,15 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups_on(struct vhost_worker *worker)
+static int vhost_attach_cgroups_on(struct vhost_dev *dev,
+				   struct vhost_worker *worker)
 {
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
 	vhost_work_queue_on(&attach.work, worker);
-	vhost_work_dev_flush_on(worker);
+	vhost_work_dev_flush_on(dev, worker);
 	return attach.ret;
 }
 
@@ -673,7 +677,7 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	worker->task = task;
 	wake_up_process(task); /* avoid contributing to loadavg */
 
-	ret = vhost_attach_cgroups_on(worker);
+	ret = vhost_attach_cgroups_on(dev, worker);
 	if (ret)
 		goto stop_worker;
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 64dc00337389..051dea4e3ab6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -21,9 +21,10 @@ typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 #define VHOST_WORK_QUEUED 1
 
 struct vhost_work {
-	struct llist_node node;
-	vhost_work_fn_t fn;
-	unsigned long flags;
+	struct llist_node	node;
+	vhost_work_fn_t		fn;
+	unsigned long		flags;
+	struct vhost_dev	*dev;
 };
 
 struct vhost_worker {
@@ -47,7 +48,8 @@ struct vhost_poll {
 	struct vhost_virtqueue	*vq;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
+		     vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 bool vhost_has_work(struct vhost_dev *dev);
 void vhost_vq_work_flush(struct vhost_virtqueue *vq);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f954f4d29c95..302415b6460b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -648,7 +648,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	file->private_data = vsock;
 	spin_lock_init(&vsock->send_pkt_list_lock);
 	INIT_LIST_HEAD(&vsock->send_pkt_list);
-	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
+	vhost_work_init(&vsock->dev, &vsock->send_pkt_work,
+			vhost_transport_send_pkt_work);
 
 	return 0;
 
 out:
--
2.25.1
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 13/14] vhost: support sharing workers across devs
This allows a worker to handle multiple devices' vqs.

TODO:
- How to handle if the devices are in different cgroups/VMs.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c | 18 +++++++-----------
 drivers/vhost/vhost.h |  1 -
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7ba0c303bb98..b2d567a4cd53 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -387,12 +387,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
+	struct vhost_dev *dev;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -411,15 +409,20 @@ static int vhost_worker(void *data)
 		smp_wmb();
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
+			dev = work->dev;
+
+			kthread_use_mm(dev->mm);
+
 			__set_current_state(TASK_RUNNING);
 			kcov_remote_start_common(dev->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
 				schedule();
+
+			kthread_unuse_mm(dev->mm);
 		}
 	}
-	kthread_unuse_mm(dev->mm);
 	return 0;
 }
 
@@ -665,7 +668,6 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 		return NULL;
 
 	worker->id = dev->num_workers;
-	worker->dev = dev;
 	init_llist_head(&worker->work_list);
 	INIT_LIST_HEAD(&worker->list);
 	refcount_set(&worker->refcount, 1);
@@ -703,12 +705,6 @@ static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
 		if (worker->task->pid != pid)
 			continue;
 
-		/* tmp - next patch allows sharing across devs */
-		if (worker->dev != dev) {
-			spin_unlock(&vhost_workers_lock);
-			return NULL;
-		}
-
 		refcount_inc(&worker->refcount);
 		spin_unlock(&vhost_workers_lock);
 		return worker;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 051dea4e3ab6..6d97fdf231c2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -32,7 +32,6 @@ struct vhost_worker {
 	struct llist_head	work_list;
 	struct list_head	list;
 	refcount_t		refcount;
-	struct vhost_dev	*dev;
 	int			id;
 };
 
-- 
2.25.1
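For illustration, here is a minimal userspace sketch of the flow this
patch enables (not part of the series; it assumes a kernel and headers
with this RFC applied, and the { index, pid } layout of struct
vhost_vring_worker from patch 11):

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Create a worker on device A's vq0, then bind device B's vq0 to the
 * same worker by passing back the pid the kernel returned.
 */
static int share_worker(int vhost_fd_a, int vhost_fd_b)
{
	struct vhost_vring_worker w = { .index = 0, .pid = -1 };

	/* pid == -1 asks the kernel to create a new worker; on success
	 * the kernel fills in w.pid with the worker thread's pid.
	 */
	if (ioctl(vhost_fd_a, VHOST_SET_VRING_WORKER, &w) < 0)
		return -1;

	/* Reuse that worker for device B's vq0; this patch lifts the
	 * same-device restriction in vhost_worker_find().
	 */
	w.index = 0;
	return ioctl(vhost_fd_b, VHOST_SET_VRING_WORKER, &w);
}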
Mike Christie
2021-Apr-28 22:37 UTC
[PATCH RFC 14/14] vhost: allow userspace to query vq worker mappings
Add an ioctl cmd to allow userspace to figure out the vq's worker.

Signed-off-by: Mike Christie <michael.christie at oracle.com>
---
 drivers/vhost/vhost.c      | 10 ++++++++++
 include/uapi/linux/vhost.h |  3 ++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b2d567a4cd53..e6148acbe928 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1885,6 +1885,16 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 		if (!r && copy_to_user(argp, &w, sizeof(w)))
 			r = -EFAULT;
 		break;
+	case VHOST_GET_VRING_WORKER:
+		w.index = idx;
+		w.pid = -1;
+
+		if (vq->worker)
+			w.pid = vq->worker->task->pid;
+
+		if (copy_to_user(argp, &w, sizeof(w)))
+			r = -EFAULT;
+		break;
 	default:
 		r = -ENOIOCTLCMD;
 	}
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 61a57f5366ee..24569f89611b 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -75,7 +75,8 @@
  * worker will be created and bound to the vq.
  */
 #define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
-
+/* Return the vq worker's pid. If no worker is set pid is -1 */
+#define VHOST_GET_VRING_WORKER _IOR(VHOST_VIRTIO, 0x16, struct vhost_vring_worker)
 /* The following ioctls use eventfd file descriptors to signal and poll
  * for events. */
-- 
2.25.1
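For illustration, a minimal userspace sketch of the query side (again
not part of the series; assumes headers with this RFC applied and the
{ index, pid } fields of struct vhost_vring_worker):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Ask which worker services vq 'idx'; the kernel reads the index and
 * writes the worker's pid back, or -1 if no worker is bound yet.
 */
static pid_t vq_worker_pid(int vhost_fd, unsigned int idx)
{
	struct vhost_vring_worker w = { .index = idx };

	if (ioctl(vhost_fd, VHOST_GET_VRING_WORKER, &w) < 0)
		return -1;

	return w.pid;
}

Paired with VHOST_SET_VRING_WORKER, this would let management tools map
vqs to worker pids, e.g. to pin an individual vq's worker to a CPU the
way the cover letter pins the whole group with "virsh emulatorpin".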
Stefano Garzarella
2021-May-04 15:30 UTC
[PATCH RFC 11/14] vhost: allow userspace to create workers
On Wed, Apr 28, 2021 at 05:37:11PM -0500, Mike Christie wrote:
>This patch allows userspace to create workers and bind them to vqs, so you
>can have N workers per dev and also share N workers with M vqs. The next
>patch will allow sharing across devices.
>
>Signed-off-by: Mike Christie <michael.christie at oracle.com>
>---
> drivers/vhost/vhost.c            | 95 +++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.h            |  3 +
> include/uapi/linux/vhost.h       |  6 ++
> include/uapi/linux/vhost_types.h |  9 +++
> 4 files changed, 111 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>index 345ade0af133..fecdae0d18c7 100644
>--- a/drivers/vhost/vhost.c
>+++ b/drivers/vhost/vhost.c
>@@ -42,6 +42,9 @@ module_param(max_iotlb_entries, int, 0444);
> MODULE_PARM_DESC(max_iotlb_entries,
> 	"Maximum number of iotlb entries. (default: 2048)");
>
>+static LIST_HEAD(vhost_workers_list);
>+static DEFINE_SPINLOCK(vhost_workers_lock);
>+
> enum {
> 	VHOST_MEMORY_F_LOG = 0x1,
> };
>@@ -617,8 +620,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> 	dev->mm = NULL;
> }
>
>-static void vhost_worker_free(struct vhost_worker *worker)
>+static void vhost_worker_put(struct vhost_worker *worker)
> {
>+	spin_lock(&vhost_workers_lock);
>+	if (!refcount_dec_and_test(&worker->refcount)) {
>+		spin_unlock(&vhost_workers_lock);
>+		return;
>+	}
>+	list_del(&worker->list);
>+	spin_unlock(&vhost_workers_lock);
>+
> 	WARN_ON(!llist_empty(&worker->work_list));
> 	kthread_stop(worker->task);
> 	kfree(worker);
>@@ -632,7 +643,7 @@ static void vhost_workers_free(struct vhost_dev *dev)
> 		return;
>
> 	for (i = 0; i < dev->num_workers; i++)
>-		vhost_worker_free(dev->workers[i]);
>+		vhost_worker_put(dev->workers[i]);
>
> 	kfree(dev->workers);
> 	dev->num_workers = 0;
>@@ -652,6 +663,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> 	worker->id = dev->num_workers;
> 	worker->dev = dev;
> 	init_llist_head(&worker->work_list);
>+	INIT_LIST_HEAD(&worker->list);
>+	refcount_set(&worker->refcount, 1);
>
> 	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
> 	if (IS_ERR(task))
>@@ -664,6 +677,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> 	if (ret)
> 		goto stop_worker;
>
>+	spin_lock(&vhost_workers_lock);
>+	list_add_tail(&worker->list, &vhost_workers_list);
>+	spin_unlock(&vhost_workers_lock);
> 	return worker;
>
> stop_worker:
>@@ -673,6 +689,71 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> 	return NULL;
> }
>
>+static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
>+{
>+	struct vhost_worker *worker;
>+
>+	/* TODO hash on pid? */
>+	spin_lock(&vhost_workers_lock);
>+	list_for_each_entry(worker, &vhost_workers_list, list) {
>+		if (worker->task->pid != pid)
>+			continue;
>+
>+		/* tmp - next patch allows sharing across devs */
>+		if (worker->dev != dev) {
>+			spin_unlock(&vhost_workers_lock);
>+			return NULL;
>+		}
>+
>+		refcount_inc(&worker->refcount);
>+		spin_unlock(&vhost_workers_lock);
>+		return worker;
>+	}
>+	spin_unlock(&vhost_workers_lock);
>+	return NULL;

I would like to have a single point where we release the lock to avoid
future issues, how about changing vhost_worker_find() to:

static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
{
	struct vhost_worker *worker, *found_worker = NULL;

	spin_lock(&vhost_workers_lock);
	list_for_each_entry(worker, &vhost_workers_list, list) {
		if (worker->task->pid == pid) {
			/* tmp - next patch allows sharing across devs */
			if (worker->dev != dev)
				break;

			found_worker = worker;
			refcount_inc(&found_worker->refcount);
			break;
		}
	}
	spin_unlock(&vhost_workers_lock);

	return found_worker;
}

>+}
>+
>+/* Caller must have device mutex */
>+static int vhost_vq_set_worker(struct vhost_virtqueue *vq,
>+			       struct vhost_vring_worker *info)
>+{
>+	struct vhost_dev *dev = vq->dev;
>+	struct vhost_worker *worker;
>+
>+	if (vq->worker) {
>+		/* TODO - support changing while works are running */
>+		return -EBUSY;
>+	}
>+
>+	if (info->pid == -1) {
>+		worker = vhost_worker_create(dev);
>+		if (!worker)
>+			return -ENOMEM;
>+
>+		info->pid = worker->task->pid;
>+	} else {
>+		worker = vhost_worker_find(dev, info->pid);
>+		if (!worker)
>+			return -ENODEV;
>+	}
>+
>+	if (!dev->workers) {
>+		dev->workers = kcalloc(vq->dev->nvqs,
>+				       sizeof(struct vhost_worker *),
>+				       GFP_KERNEL);
>+		if (!dev->workers) {
>+			vhost_worker_put(worker);
>+			return -ENOMEM;
>+		}
>+	}
>+
>+	vq->worker = worker;
>+
>+	dev->workers[dev->num_workers] = worker;
>+	dev->num_workers++;
>+	return 0;
>+}
>+
> /* Caller must have device mutex */
> static int vhost_worker_try_create_def(struct vhost_dev *dev)
> {
>@@ -1680,6 +1761,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> 	struct eventfd_ctx *ctx = NULL;
> 	u32 __user *idxp = argp;
> 	struct vhost_virtqueue *vq;
>+	struct vhost_vring_worker w;
> 	struct vhost_vring_state s;
> 	struct vhost_vring_file f;
> 	u32 idx;
>@@ -1794,6 +1876,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> 		if (copy_to_user(argp, &s, sizeof(s)))
> 			r = -EFAULT;
> 		break;
>+	case VHOST_SET_VRING_WORKER:
>+		if (copy_from_user(&w, argp, sizeof(w))) {
>+			r = -EFAULT;
>+			break;
>+		}
>+		r = vhost_vq_set_worker(vq, &w);
>+		if (!r && copy_to_user(argp, &w, sizeof(w)))
>+			r = -EFAULT;
>+		break;
> 	default:
> 		r = -ENOIOCTLCMD;
> 	}
>diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>index 973889ec7d62..64dc00337389 100644
>--- a/drivers/vhost/vhost.h
>+++ b/drivers/vhost/vhost.h
>@@ -14,6 +14,7 @@
> #include <linux/atomic.h>
> #include <linux/vhost_iotlb.h>
> #include <linux/irqbypass.h>
>+#include <linux/refcount.h>
>
> struct vhost_work;
> typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>@@ -28,6 +29,8 @@ struct vhost_work {
> struct vhost_worker {
> 	struct task_struct	*task;
> 	struct llist_head	work_list;
>+	struct list_head	list;
>+	refcount_t		refcount;
> 	struct vhost_dev	*dev;
> 	int			id;
> };
>diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>index c998860d7bbc..61a57f5366ee 100644
>--- a/include/uapi/linux/vhost.h
>+++ b/include/uapi/linux/vhost.h
>@@ -70,6 +70,12 @@
> #define VHOST_VRING_BIG_ENDIAN 1
> #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
> #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
>+/* Create/bind a vhost worker to a virtqueue. If pid > 0 and matches an existing
>+ * vhost_worker thread it will be bound to the vq. If pid is -1, then a new

What about adding a macro for -1? (e.g. VHOST_VRING_NEW_WORKER)

Thanks,
Stefano
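A sketch of how that suggestion might look at a call site (the macro
name is Stefano's example, nothing merged; assumes headers with this
RFC applied):

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Proposed spelling for the "create a new worker" sentinel. */
#ifndef VHOST_VRING_NEW_WORKER
#define VHOST_VRING_NEW_WORKER	(-1)
#endif

static int vq_new_worker(int vhost_fd, unsigned int idx)
{
	struct vhost_vring_worker w = {
		.index = idx,
		.pid = VHOST_VRING_NEW_WORKER,	/* instead of a bare -1 */
	};

	return ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &w);
}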
Stefano Garzarella
2021-May-04 15:33 UTC
[PATCH RFC 12/14] vhost: add vhost_dev pointer to vhost_work
On Wed, Apr 28, 2021 at 05:37:12PM -0500, Mike Christie wrote:
>The next patch allows a vhost_worker to handle different devices. To
>prepare for that, this patch adds a pointer to the device on the work so
>we can get to the different mms in the vhost_worker thread.
>
>Signed-off-by: Mike Christie <michael.christie at oracle.com>
>---
> drivers/vhost/scsi.c  |  7 ++++---
> drivers/vhost/vhost.c | 24 ++++++++++++++----------
> drivers/vhost/vhost.h | 10 ++++++----
> drivers/vhost/vsock.c |  3 ++-
> 4 files changed, 26 insertions(+), 18 deletions(-)
>
>diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>index 2f84cf602ab3..0e862503b6a8 100644
>--- a/drivers/vhost/scsi.c
>+++ b/drivers/vhost/scsi.c
>@@ -1808,7 +1808,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> 	if (!vqs)
> 		goto err_vqs;
>
>-	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
>+	vhost_work_init(&vs->dev, &vs->vs_event_work, vhost_scsi_evt_work);
>
> 	vs->vs_events_nr = 0;
> 	vs->vs_events_missed = false;
>@@ -1823,7 +1823,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> 		vqs[i] = &svq->vq;
> 		svq->vs = vs;
> 		init_llist_head(&svq->completion_list);
>-		vhost_work_init(&svq->completion_work,
>+		vhost_work_init(&vs->dev, &svq->completion_work,
> 				vhost_scsi_complete_cmd_work);
> 		svq->vq.handle_kick = vhost_scsi_handle_kick;
> 	}
>@@ -2017,7 +2017,8 @@ static int vhost_scsi_port_link(struct se_portal_group *se_tpg,
> 	if (!tmf)
> 		return -ENOMEM;
> 	INIT_LIST_HEAD(&tmf->queue_entry);
>-	vhost_work_init(&tmf->vwork, vhost_scsi_tmf_resp_work);
>+	vhost_work_init(&tpg->vhost_scsi->dev, &tmf->vwork,
>+		       vhost_scsi_tmf_resp_work);
>

`checkpatch.pl --strict` complains here:

    CHECK: Alignment should match open parenthesis
    #74: FILE: drivers/vhost/scsi.c:2036:
    +	vhost_work_init(&tpg->vhost_scsi->dev, &tmf->vwork,
    +		       vhost_scsi_tmf_resp_work);

> 	mutex_lock(&vhost_scsi_mutex);
>
>diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>index fecdae0d18c7..7ba0c303bb98 100644
>--- a/drivers/vhost/vhost.c
>+++ b/drivers/vhost/vhost.c
>@@ -181,10 +181,12 @@ static int vhost_poll_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync,
> 	return 0;
> }
>
>-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
>+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
>+		     vhost_work_fn_t fn)
> {
> 	clear_bit(VHOST_WORK_QUEUED, &work->flags);
> 	work->fn = fn;
>+	work->dev = dev;
> }
> EXPORT_SYMBOL_GPL(vhost_work_init);
>
>@@ -200,7 +202,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> 	poll->wqh = NULL;
> 	poll->vq = vq;
>
>-	vhost_work_init(&poll->work, fn);
>+	vhost_work_init(dev, &poll->work, fn);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_init);
>
>@@ -269,12 +271,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
>
>-static void vhost_work_dev_flush_on(struct vhost_worker *worker)
>+static void vhost_work_dev_flush_on(struct vhost_dev *dev,
>+				    struct vhost_worker *worker)
> {
> 	struct vhost_flush_struct flush;
>
> 	init_completion(&flush.wait_event);
>-	vhost_work_init(&flush.work, vhost_flush_work);
>+	vhost_work_init(dev, &flush.work, vhost_flush_work);
>
> 	vhost_work_queue_on(&flush.work, worker);
> 	wait_for_completion(&flush.wait_event);
>@@ -285,7 +288,7 @@ void vhost_work_dev_flush(struct vhost_dev *dev)
> 	int i;
>
> 	for (i = 0; i < dev->num_workers; i++)
>-		vhost_work_dev_flush_on(dev->workers[i]);
>+		vhost_work_dev_flush_on(dev, dev->workers[i]);
> }
> EXPORT_SYMBOL_GPL(vhost_work_dev_flush);
>
>@@ -305,7 +308,7 @@ EXPORT_SYMBOL_GPL(vhost_has_work);
>
> void vhost_vq_work_flush(struct vhost_virtqueue *vq)
> {
>-	vhost_work_dev_flush_on(vq->worker);
>+	vhost_work_dev_flush_on(vq->dev, vq->worker);
> }
> EXPORT_SYMBOL_GPL(vhost_vq_work_flush);
>
>@@ -572,14 +575,15 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
> 	s->ret = cgroup_attach_task_all(s->owner, current);
> }
>
>-static int vhost_attach_cgroups_on(struct vhost_worker *worker)
>+static int vhost_attach_cgroups_on(struct vhost_dev *dev,
>+				   struct vhost_worker *worker)
> {
> 	struct vhost_attach_cgroups_struct attach;
>
> 	attach.owner = current;
>-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
>+	vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
> 	vhost_work_queue_on(&attach.work, worker);
>-	vhost_work_dev_flush_on(worker);
>+	vhost_work_dev_flush_on(dev, worker);
> 	return attach.ret;
> }
>
>@@ -673,7 +677,7 @@ static struct vhost_worker
>*vhost_worker_create(struct vhost_dev *dev)
> 	worker->task = task;
> 	wake_up_process(task); /* avoid contributing to loadavg */
>
>-	ret = vhost_attach_cgroups_on(worker);
>+	ret = vhost_attach_cgroups_on(dev, worker);
> 	if (ret)
> 		goto stop_worker;
>
>diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>index 64dc00337389..051dea4e3ab6 100644
>--- a/drivers/vhost/vhost.h
>+++ b/drivers/vhost/vhost.h
>@@ -21,9 +21,10 @@ typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>
> #define VHOST_WORK_QUEUED	1
> struct vhost_work {
>-	struct llist_node	  node;
>-	vhost_work_fn_t		  fn;
>-	unsigned long		  flags;
>+	struct llist_node	node;
>+	vhost_work_fn_t		fn;
>+	unsigned long		flags;

Maybe we should move these changes in another patch since they are not
related.

Thanks,
Stefano

>+	struct vhost_dev	*dev;
> };
>
> struct vhost_worker {
>@@ -47,7 +48,8 @@ struct vhost_poll {
> 	struct vhost_virtqueue	*vq;
> };
>
>-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
>+void vhost_work_init(struct vhost_dev *dev, struct vhost_work *work,
>+		     vhost_work_fn_t fn);
> void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> bool vhost_has_work(struct vhost_dev *dev);
> void vhost_vq_work_flush(struct vhost_virtqueue *vq);
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index f954f4d29c95..302415b6460b 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -648,7 +648,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
> 	file->private_data = vsock;
> 	spin_lock_init(&vsock->send_pkt_list_lock);
> 	INIT_LIST_HEAD(&vsock->send_pkt_list);
>-	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
>+	vhost_work_init(&vsock->dev, &vsock->send_pkt_work,
>+			vhost_transport_send_pkt_work);
> 	return 0;
>
> out:
>--
>2.25.1
>
Hi Mike,

On Wed, Apr 28, 2021 at 05:36:59PM -0500, Mike Christie wrote:
>The following patches apply over mst's vhost branch and were tested
>againt that branch and also mkp's 5.13 branch which has some vhost-scsi
>changes.
>
>These patches allow us to support multiple vhost workers per device. I
>ended up just doing Stefan's original idea where userspace has the
>kernel create a worker and we pass back the pid. This has the benefit
>over the workqueue and userspace thread approach where we only have
>one'ish code path in the kernel.
>
>The kernel patches here allow us to then do N workers device and also
>share workers across devices.

I took a first look and left a few comments.

In general it looks good to me. I'm just not sure whether it's okay to
use the kthread pid directly, or whether it would be better to use an
internal id. For example, I wonder whether exposing the pid could
interact badly with PID namespaces or open up security issues.

I admit I don't know much about this topic, so this might be silly.

>
>I included a patch for qemu so you can get an idea of how it works.
>
>TODO:
>-----
>- polling
>- Allow sharing workers across devices. Kernel support is added and I
>hacked up userspace to test, but I'm still working on a sane way to
>manage it in userspace.
>- Bind to specific CPUs. Commands like "virsh emulatorpin" work with
>these patches and allow us to set the group of vhost threads to different
>CPUs. But we can also set a specific vq's worker to run on a CPU.
>- I'm handling old kernel by just checking for EPERM. Does this require
>a feature?

Do you mean when the new ioctls are not available? Don't we return
ENOIOCTLCMD in that case? If so, I think checking the return value is
fine.

Thanks,
Stefano
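For the old-kernel probing being discussed, a hedged userspace sketch
(assuming headers with this RFC applied; on kernels without the series
the default -ENOIOCTLCMD from vhost_vring_ioctl should reach userspace
as ENOTTY, and the cover letter additionally mentions EPERM):

#include <errno.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Probe for the worker ioctls; treat ENOTTY (and EPERM, per the cover
 * letter) as "running on a kernel without this series". Any other
 * errno means the ioctl exists but the arguments were rejected.
 */
static bool vhost_worker_ioctls_supported(int vhost_fd)
{
	struct vhost_vring_worker w = { .index = 0 };

	if (ioctl(vhost_fd, VHOST_GET_VRING_WORKER, &w) == 0)
		return true;

	return errno != ENOTTY && errno != EPERM;
}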