Mike Christie
2021-Sep-16 21:20 UTC
[PATCH 0/8] Use copy_process/create_io_thread in vhost layer
The following patches were made over Linus's tree and also apply over Jens's 5.15 io_uring branch and Michael's vhost branch. The patchset allows the vhost layer to do a copy_process on the thread that does the VHOST_SET_OWNER ioctl, like how io_uring does a copy_process against its userspace app (Jens, the patches make create_io_thread more generic, so that's why you are cc'd). This allows the vhost layer's worker threads to inherit cgroups, namespaces, address space, etc., and the worker thread will also be accounted for against that owner/parent process's RLIMIT_NPROC limit.

Here is a more detailed problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI, etc. IO and management operations from worker threads created by the kthread API. Because the kthread API does a copy_process on the kthreadd thread, the vhost layer has to use kthread_use_mm to access the Qemu thread's memory and cgroup_attach_task_all to add itself to the Qemu thread's cgroups.

The problem with this approach is that we then have to add new functions/args/functionality for everything we want to inherit. I started doing that here: https://lkml.org/lkml/2021/6/23/1233 for the RLIMIT_NPROC check, but it seems it might be easier to just inherit everything from the beginning, because I'd need to do something like that patch several times. For example, the current approach does not support cgroups v2, so commands like virsh emulatorpin do not work. The qemu process can go over its RLIMIT_NPROC. And for future vhost interfaces where we export the vhost thread pid we will want the namespace info.
The vhost layer has similar requirements to io_uring: its worker threads need to access the userspace thread's memory, want to inherit the parent's cgroups and namespaces, and should be checked against the parent's RLIMITs. Right now, the vhost layer uses the kthread API, which has kthread_use_mm for mem access, and those threads can use cgroup_attach_task_all for v1 cgroups, but there are no helpers for the other items. This adds a helper to clone a process so we can inherit everything we want in one call. It's a more generic version of create_io_thread which will be used by the vhost layer and io_uring in later patches in this set. This patch also exports __set_task_comm and wake_up_new_task, which are needed by modules to use the new helper. io_uring calls these functions already, but it's always built into the kernel, so the exports were not needed before.

Signed-off-by: Mike Christie <michael.christie at oracle.com> --- fs/exec.c | 7 +++++++ include/linux/sched/task.h | 3 +++ kernel/fork.c | 29 +++++++++++++++++++++++++++++ kernel/sched/core.c | 4 +++- 4 files changed, 42 insertions(+), 1 deletion(-) diff --git a/fs/exec.c b/fs/exec.c index a098c133d8d7..9fc4bb0c5c7e 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1220,6 +1220,12 @@ EXPORT_SYMBOL_GPL(__get_task_comm); * so that a new one can be started */ +/** + * __set_task_comm - set the task's executable name + * @tsk: task_struct to modify + * @buf: executable name + * @exec: true if called during a process exec, false for name changes. + */ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) { task_lock(tsk); @@ -1228,6 +1234,7 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) task_unlock(tsk); perf_event_comm(tsk, exec); } +EXPORT_SYMBOL_GPL(__set_task_comm); /* * Calling this is the point of no return.
None of the failures will be diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index ef02be869cf2..c55f1eb69d41 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -84,6 +84,9 @@ extern void exit_itimers(struct signal_struct *); extern pid_t kernel_clone(struct kernel_clone_args *kargs); struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); +struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, + unsigned long clone_flags, + int io_thread); struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); diff --git a/kernel/fork.c b/kernel/fork.c index 38681ad44c76..cec7b6011beb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2532,6 +2532,35 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) return copy_process(NULL, 0, node, &args); } +/** + * kernel_copy_process - create a copy of a process to be used by the kernel + * @fn: function for the new task to run + * @arg: data to be passed to fn + * @node: numa node to allocate task from + * @clone_flags: CLONE flags + * @io_thread: 1 if this will be a PF_IO_WORKER else 0. + * + * This returns a created task, or an error pointer. The returned task is + * inactive, and the caller must fire it up through wake_up_new_task(p). If + * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked. + */ +struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, + unsigned long clone_flags, + int io_thread) +{ + struct kernel_clone_args args = { + .flags = ((lower_32_bits(clone_flags) | CLONE_VM | + CLONE_UNTRACED) & ~CSIGNAL), + .exit_signal = (lower_32_bits(clone_flags) & CSIGNAL), + .stack = (unsigned long)fn, + .stack_size = (unsigned long)arg, + .io_thread = io_thread, + }; + + return copy_process(NULL, 0, node, &args); +} +EXPORT_SYMBOL_GPL(kernel_copy_process); + /* * Ok, this is the main fork-routine.
* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1bba4128a3e6..a0b9508ea202 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4429,8 +4429,9 @@ unsigned long to_ratio(u64 period, u64 runtime) return div64_u64(runtime << BW_SHIFT, period); } -/* +/** * wake_up_new_task - wake up a newly created task for the first time. + * @p: task to wake up * * This function will do some initial scheduler statistics housekeeping * that must be done for every newly created context, then puts the task @@ -4476,6 +4477,7 @@ void wake_up_new_task(struct task_struct *p) #endif task_rq_unlock(rq, p, &rf); } +EXPORT_SYMBOL_GPL(wake_up_new_task); #ifdef CONFIG_PREEMPT_NOTIFIERS -- 2.25.1
The kthread API creates threads that ignore all signals by default, so modules like vhost that move from that API to kernel_copy_process will not be expecting signals. This patch exports ignore_signals so those modules can keep their existing behavior.

Signed-off-by: Mike Christie <michael.christie at oracle.com> --- kernel/signal.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index 952741f6d0f9..8fb79200c18b 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -534,6 +534,10 @@ void flush_itimer_signals(void) } #endif +/** + * ignore_signals - setup task to ignore all signals + * @t: task to setup + */ void ignore_signals(struct task_struct *t) { int i; @@ -543,6 +547,7 @@ void ignore_signals(struct task_struct *t) flush_signals(t); } +EXPORT_SYMBOL_GPL(ignore_signals); /* * Flush all handlers for a task. -- 2.25.1
Mike Christie
2021-Sep-16 21:20 UTC
[PATCH 3/8] fork: add option to not clone or dup files
Each vhost device gets a thread that is used to perform IO and management operations. Unlike a thread that merely accesses a device, this thread is part of the device, so when it calls kernel_copy_process we can't dup or clone the parent's (the Qemu thread that does the VHOST_SET_OWNER ioctl) files/fds, because that would take an extra reference on the device's own files. Later, when the Qemu process exits: do_exit -> exit_files -> put_files_struct -> close_files we would leak the device's resources because of that extra refcount on the fd or files_struct. This patch adds a no_files option so these worker threads can avoid taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie at oracle.com> --- include/linux/sched/task.h | 3 ++- kernel/fork.c | 14 +++++++++++--- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index c55f1eb69d41..d0b0872f56cc 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -32,6 +32,7 @@ struct kernel_clone_args { size_t set_tid_size; int cgroup; int io_thread; + int no_files; struct cgroup *cgrp; struct css_set *cset; }; @@ -86,7 +87,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs); struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, unsigned long clone_flags, - int io_thread); + int io_thread, int no_files); struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); diff --git a/kernel/fork.c b/kernel/fork.c index cec7b6011beb..a0468e30b27e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1532,7 +1532,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) return 0; } -static int copy_files(unsigned long clone_flags, struct task_struct *tsk) +static int copy_files(unsigned long clone_flags, struct 
task_struct *tsk, + int no_files) { struct files_struct *oldf, *newf; int error = 0; @@ -1544,6 +1545,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk) if (!oldf) goto out; + if (no_files) { + tsk->files = NULL; + goto out; + } + if (clone_flags & CLONE_FILES) { atomic_inc(&oldf->count); goto out; @@ -2179,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process( retval = copy_semundo(clone_flags, p); if (retval) goto bad_fork_cleanup_security; - retval = copy_files(clone_flags, p); + retval = copy_files(clone_flags, p, args->no_files); if (retval) goto bad_fork_cleanup_semundo; retval = copy_fs(clone_flags, p); @@ -2539,6 +2545,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) * @node: numa node to allocate task from * @clone_flags: CLONE flags * @io_thread: 1 if this will be a PF_IO_WORKER else 0. + * @no_files: Do not duplicate or copy the parent's open files. * * This returns a created task, or an error pointer. The returned task is * inactive, and the caller must fire it up through wake_up_new_task(p). If @@ -2546,7 +2553,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) */ struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, unsigned long clone_flags, - int io_thread) + int io_thread, int no_files) { struct kernel_clone_args args = { .flags = ((lower_32_bits(clone_flags) | CLONE_VM | @@ -2555,6 +2562,7 @@ struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, .stack = (unsigned long)fn, .stack_size = (unsigned long)arg, .io_thread = io_thread, + .no_files = no_files, }; return copy_process(NULL, 0, node, &args); -- 2.25.1
Mike Christie
2021-Sep-16 21:20 UTC
[PATCH 4/8] fork: move PF_IO_WORKER's kernel frame setup to new flag
The vhost worker threads need the same frame setup as io_uring's worker threads, but do not handle signals and may not even be executing IO. This patch moves the frame setup behind a new flag, PF_USER_WORKER. Signed-off-by: Mike Christie <michael.christie at oracle.com> --- arch/x86/kernel/process.c | 4 ++-- include/linux/sched.h | 1 + include/linux/sched/task.h | 1 + kernel/fork.c | 3 +++ 4 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 1d9463e3096b..ccdec11cd120 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -178,9 +178,9 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg, task_user_gs(p) = get_user_gs(current_pt_regs()); #endif - if (unlikely(p->flags & PF_IO_WORKER)) { + if (unlikely(p->flags & PF_USER_WORKER)) { /* - * An IO thread is a user space thread, but it doesn't + * A user worker thread is a user space thread, but it doesn't * return to ret_after_fork(). * * In order to indicate that to tools like gdb, diff --git a/include/linux/sched.h b/include/linux/sched.h index e12b524426b0..817d3a7bec77 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1664,6 +1664,7 @@ extern struct pid *cad_pid; #define PF_VCPU 0x00000001 /* I'm a virtual CPU */ #define PF_IDLE 0x00000002 /* I am an IDLE thread */ #define PF_EXITING 0x00000004 /* Getting shut down */ +#define PF_USER_WORKER 0x00000008 /* Userspace kernel worker */ #define PF_IO_WORKER 0x00000010 /* Task is an IO worker */ #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */ #define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */ diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index d0b0872f56cc..4a6100a24894 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -33,6 +33,7 @@ struct kernel_clone_args { int cgroup; int io_thread; int no_files; + int user_worker; struct cgroup *cgrp; struct css_set *cset; }; diff --git a/kernel/fork.c 
b/kernel/fork.c index a0468e30b27e..1dda1d4ea77b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2041,6 +2041,8 @@ static __latent_entropy struct task_struct *copy_process( siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); } + if (args->user_worker) + p->flags |= PF_USER_WORKER; /* * This _must_ happen before we call free_task(), i.e. before we jump * to any of the bad_fork_* labels. This is to avoid freeing @@ -2563,6 +2565,7 @@ struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, .stack_size = (unsigned long)arg, .io_thread = io_thread, .no_files = no_files, + .user_worker = 1, }; return copy_process(NULL, 0, node, &args); -- 2.25.1
Convert io_uring and io-wq to use kernel_copy_process. Signed-off-by: Mike Christie <michael.christie at oracle.com> --- fs/io-wq.c | 9 +++++++-- fs/io_uring.c | 5 ++++- include/linux/sched/task.h | 1 - kernel/fork.c | 22 ---------------------- 4 files changed, 11 insertions(+), 26 deletions(-) diff --git a/fs/io-wq.c b/fs/io-wq.c index 6c55362c1f99..6fccba5bdc65 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -68,6 +68,9 @@ struct io_worker { #define IO_WQ_NR_HASH_BUCKETS (1u << IO_WQ_HASH_ORDER) +#define IO_WQ_CLONE_FLAGS CLONE_FS|CLONE_FILES|CLONE_SIGHAND| \ + CLONE_THREAD|CLONE_IO + struct io_wqe_acct { unsigned nr_workers; unsigned max_workers; @@ -687,7 +690,8 @@ static void create_worker_cont(struct callback_head *cb) worker = container_of(cb, struct io_worker, create_work); clear_bit_unlock(0, &worker->create_state); wqe = worker->wqe; - tsk = create_io_thread(io_wqe_worker, worker, wqe->node); + tsk = kernel_copy_process(io_wqe_worker, worker, wqe->node, + IO_WQ_CLONE_FLAGS, 1, 0); if (!IS_ERR(tsk)) { io_init_new_worker(wqe, worker, tsk); io_worker_release(worker); @@ -757,7 +761,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) if (index == IO_WQ_ACCT_BOUND) worker->flags |= IO_WORKER_F_BOUND; - tsk = create_io_thread(io_wqe_worker, worker, wqe->node); + tsk = kernel_copy_process(io_wqe_worker, worker, wqe->node, + IO_WQ_CLONE_FLAGS, 1, 0); if (!IS_ERR(tsk)) { io_init_new_worker(wqe, worker, tsk); } else if (!io_should_retry_thread(PTR_ERR(tsk))) { diff --git a/fs/io_uring.c b/fs/io_uring.c index 16fb7436043c..2493a78ddd7d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8519,6 +8519,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, fdput(f); } if (ctx->flags & IORING_SETUP_SQPOLL) { + unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND| + CLONE_THREAD|CLONE_IO; struct task_struct *tsk; struct io_sq_data *sqd; bool attached; @@ -8560,7 +8562,8 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, 
sqd->task_pid = current->pid; sqd->task_tgid = current->tgid; - tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE); + tsk = kernel_copy_process(io_sq_thread, sqd, NUMA_NO_NODE, + flags, 1, 0); if (IS_ERR(tsk)) { ret = PTR_ERR(tsk); goto err_sqpoll; diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 4a6100a24894..f43e81c907e1 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -85,7 +85,6 @@ extern void exit_files(struct task_struct *); extern void exit_itimers(struct signal_struct *); extern pid_t kernel_clone(struct kernel_clone_args *kargs); -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); struct task_struct *kernel_copy_process(int (*fn)(void *), void *arg, int node, unsigned long clone_flags, int io_thread, int no_files); diff --git a/kernel/fork.c b/kernel/fork.c index 1dda1d4ea77b..9011cbe83fe8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2518,28 +2518,6 @@ struct mm_struct *copy_init_mm(void) return dup_mm(NULL, &init_mm); } -/* - * This is like kernel_clone(), but shaved down and tailored to just - * creating io_uring workers. It returns a created task, or an error pointer. - * The returned task is inactive, and the caller must fire it up through - * wake_up_new_task(p). All signals are blocked in the created task. - */ -struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) -{ - unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD| - CLONE_IO; - struct kernel_clone_args args = { - .flags = ((lower_32_bits(flags) | CLONE_VM | - CLONE_UNTRACED) & ~CSIGNAL), - .exit_signal = (lower_32_bits(flags) & CSIGNAL), - .stack = (unsigned long)fn, - .stack_size = (unsigned long)arg, - .io_thread = 1, - }; - - return copy_process(NULL, 0, node, &args); -} - /** * kernel_copy_process - create a copy of a process to be used by the kernel * @fn: thread stack -- 2.25.1
Mike Christie
2021-Sep-16 21:20 UTC
[PATCH 6/8] vhost: move worker thread fields to new struct
This is just a prep patch. It moves the worker related fields to a new vhost_worker struct and moves the code around to create some helpers that will be used in the next patches. Signed-off-by: Mike Christie <michael.christie at oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha at redhat.com> --- drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++--------------- drivers/vhost/vhost.h | 11 +++-- 2 files changed, 72 insertions(+), 37 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 59edb5a1ffe2..c9a1f706989c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->work_list); - wake_up_process(dev->worker); + llist_add(&work->node, &dev->worker->work_list); + wake_up_process(dev->worker->task); } } EXPORT_SYMBOL_GPL(vhost_work_queue); @@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return !llist_empty(&dev->work_list); + return dev->worker && !llist_empty(&dev->worker->work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev, static int vhost_worker(void *data) { - struct vhost_dev *dev = data; + struct vhost_worker *worker = data; + struct vhost_dev *dev = worker->dev; struct vhost_work *work, *work_next; struct llist_node *node; @@ -358,7 +359,7 @@ static int vhost_worker(void *data) break; } - node = llist_del_all(&dev->work_list); + node = llist_del_all(&worker->work_list); if (!node) schedule(); @@ -368,7 +369,7 @@ static int vhost_worker(void *data) llist_for_each_entry_safe(work, work_next, node, node) { clear_bit(VHOST_WORK_QUEUED, &work->flags); __set_current_state(TASK_RUNNING); - kcov_remote_start_common(dev->kcov_handle); + 
kcov_remote_start_common(worker->kcov_handle); work->fn(work); kcov_remote_stop(); if (need_resched()) @@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev, dev->byte_weight = byte_weight; dev->use_worker = use_worker; dev->msg_handler = msg_handler; - init_llist_head(&dev->work_list); init_waitqueue_head(&dev->wait); INIT_LIST_HEAD(&dev->read_list); INIT_LIST_HEAD(&dev->pending_list); @@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev) dev->mm = NULL; } +static void vhost_worker_free(struct vhost_dev *dev) +{ + struct vhost_worker *worker = dev->worker; + + if (!worker) + return; + + dev->worker = NULL; + WARN_ON(!llist_empty(&worker->work_list)); + kthread_stop(worker->task); + kfree(worker); +} + +static int vhost_worker_create(struct vhost_dev *dev) +{ + struct vhost_worker *worker; + struct task_struct *task; + int ret; + + worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); + if (!worker) + return -ENOMEM; + + dev->worker = worker; + worker->dev = dev; + worker->kcov_handle = kcov_common_handle(); + init_llist_head(&worker->work_list); + + task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid); + if (IS_ERR(task)) { + ret = PTR_ERR(task); + goto free_worker; + } + + worker->task = task; + wake_up_process(task); /* avoid contributing to loadavg */ + + ret = vhost_attach_cgroups(dev); + if (ret) + goto stop_worker; + + return 0; + +stop_worker: + kthread_stop(worker->task); +free_worker: + kfree(worker); + dev->worker = NULL; + return ret; +} + /* Caller should have device mutex */ long vhost_dev_set_owner(struct vhost_dev *dev) { - struct task_struct *worker; int err; /* Is there an owner already? 
*/ @@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev) vhost_attach_mm(dev); - dev->kcov_handle = kcov_common_handle(); if (dev->use_worker) { - worker = kthread_create(vhost_worker, dev, - "vhost-%d", current->pid); - if (IS_ERR(worker)) { - err = PTR_ERR(worker); - goto err_worker; - } - - dev->worker = worker; - wake_up_process(worker); /* avoid contributing to loadavg */ - - err = vhost_attach_cgroups(dev); + err = vhost_worker_create(dev); if (err) - goto err_cgroup; + goto err_worker; } err = vhost_dev_alloc_iovecs(dev); if (err) - goto err_cgroup; + goto err_iovecs; return 0; -err_cgroup: - if (dev->worker) { - kthread_stop(dev->worker); - dev->worker = NULL; - } +err_iovecs: + vhost_worker_free(dev); err_worker: vhost_detach_mm(dev); - dev->kcov_handle = 0; err_mm: return err; } @@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev) dev->iotlb = NULL; vhost_clear_msg(dev); wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM); - WARN_ON(!llist_empty(&dev->work_list)); - if (dev->worker) { - kthread_stop(dev->worker); - dev->worker = NULL; - dev->kcov_handle = 0; - } + vhost_worker_free(dev); vhost_detach_mm(dev); } EXPORT_SYMBOL_GPL(vhost_dev_cleanup); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 638bb640d6b4..102ce25e4e13 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -25,6 +25,13 @@ struct vhost_work { unsigned long flags; }; +struct vhost_worker { + struct task_struct *task; + struct llist_head work_list; + struct vhost_dev *dev; + u64 kcov_handle; +}; + /* Poll a file (eventfd or socket) */ /* Note: there's nothing vhost specific about this structure. 
*/ struct vhost_poll { @@ -148,8 +155,7 @@ struct vhost_dev { struct vhost_virtqueue **vqs; int nvqs; struct eventfd_ctx *log_ctx; - struct llist_head work_list; - struct task_struct *worker; + struct vhost_worker *worker; struct vhost_iotlb *umem; struct vhost_iotlb *iotlb; spinlock_t iotlb_lock; @@ -159,7 +165,6 @@ struct vhost_dev { int iov_limit; int weight; int byte_weight; - u64 kcov_handle; bool use_worker; int (*msg_handler)(struct vhost_dev *dev, struct vhost_iotlb_msg *msg); -- 2.25.1
Mike Christie
2021-Sep-16 21:20 UTC
[PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups
For vhost workers we use the kthread API, which inherits its values from, and checks limits against, the kthreadd thread. This results in cgroups v2 not working and the wrong RLIMITs being checked. This patch has us use the kernel_copy_process function, which inherits its values from and checks limits against the thread that owns the device. Note: this patch only does the conversion. The next patch removes the code that is no longer needed. Signed-off-by: Mike Christie <michael.christie at oracle.com> --- drivers/vhost/vhost.c | 49 +++++++++++++++++++++++++++---------------- drivers/vhost/vhost.h | 7 ++++++- 2 files changed, 37 insertions(+), 19 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index c9a1f706989c..6e58417b13fc 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -344,17 +344,14 @@ static void vhost_vq_reset(struct vhost_dev *dev, static int vhost_worker(void *data) { struct vhost_worker *worker = data; - struct vhost_dev *dev = worker->dev; struct vhost_work *work, *work_next; struct llist_node *node; - kthread_use_mm(dev->mm); - for (;;) { /* mb paired w/ kthread_stop */ set_current_state(TASK_INTERRUPTIBLE); - if (kthread_should_stop()) { + if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) { __set_current_state(TASK_RUNNING); break; } @@ -376,8 +373,9 @@ static int vhost_worker(void *data) schedule(); } } - kthread_unuse_mm(dev->mm); - return 0; + + complete(worker->exit_done); + do_exit(0); } static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq) @@ -579,6 +577,16 @@ static void vhost_detach_mm(struct vhost_dev *dev) dev->mm = NULL; } +static void vhost_worker_stop(struct vhost_worker *worker) +{ + DECLARE_COMPLETION_ONSTACK(exit_done); + + worker->exit_done = &exit_done; + set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags); + wake_up_process(worker->task); + wait_for_completion(worker->exit_done); +} + static void vhost_worker_free(struct vhost_dev *dev) { struct vhost_worker *worker = dev->worker; @@ -588,14 +596,16 @@ static void 
vhost_worker_free(struct vhost_dev *dev) dev->worker = NULL; WARN_ON(!llist_empty(&worker->work_list)); - kthread_stop(worker->task); + vhost_worker_stop(worker); kfree(worker); } static int vhost_worker_create(struct vhost_dev *dev) { + DECLARE_COMPLETION_ONSTACK(start_done); struct vhost_worker *worker; struct task_struct *task; + char buf[TASK_COMM_LEN]; int ret; worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); @@ -603,27 +613,30 @@ static int vhost_worker_create(struct vhost_dev *dev) return -ENOMEM; dev->worker = worker; - worker->dev = dev; worker->kcov_handle = kcov_common_handle(); init_llist_head(&worker->work_list); - task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid); + /* + * vhost used to use the kthread API which ignores all signals by + * default and the drivers expect this behavior. So we do not want to + * inherit the parent's signal handlers and set our worker to ignore + * everything below. + */ + task = kernel_copy_process(vhost_worker, worker, NUMA_NO_NODE, + CLONE_FS|CLONE_CLEAR_SIGHAND, 0, 1); if (IS_ERR(task)) { ret = PTR_ERR(task); goto free_worker; } worker->task = task; - wake_up_process(task); /* avoid contributing to loadavg */ - ret = vhost_attach_cgroups(dev); - if (ret) - goto stop_worker; + snprintf(buf, sizeof(buf), "vhost-%d", current->pid); + set_task_comm(task, buf); + + ignore_signals(task); + wake_up_new_task(task); return 0; -stop_worker: - kthread_stop(worker->task); free_worker: kfree(worker); dev->worker = NULL; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 102ce25e4e13..09748694cb66 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -25,11 +25,16 @@ struct vhost_work { unsigned long flags; }; +enum { + VHOST_WORKER_FLAG_STOP, +}; + struct vhost_worker { struct task_struct *task; + struct completion *exit_done; struct llist_head work_list; - struct vhost_dev *dev; u64 kcov_handle; + unsigned long flags; }; /* Poll a file (eventfd or 
socket) */ -- 2.25.1
We inherit v1 and v2 cgroups from copy_process now, so we can drop the v1 only code. Signed-off-by: Mike Christie <michael.christie at oracle.com> --- drivers/vhost/vhost.c | 26 -------------------------- 1 file changed, 26 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 6e58417b13fc..b561c5ea00fc 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -22,7 +22,6 @@ #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/kthread.h> -#include <linux/cgroup.h> #include <linux/module.h> #include <linux/sort.h> #include <linux/sched/mm.h> @@ -515,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev) } EXPORT_SYMBOL_GPL(vhost_dev_check_owner); -struct vhost_attach_cgroups_struct { - struct vhost_work work; - struct task_struct *owner; - int ret; -}; - -static void vhost_attach_cgroups_work(struct vhost_work *work) -{ - struct vhost_attach_cgroups_struct *s; - - s = container_of(work, struct vhost_attach_cgroups_struct, work); - s->ret = cgroup_attach_task_all(s->owner, current); -} - -static int vhost_attach_cgroups(struct vhost_dev *dev) -{ - struct vhost_attach_cgroups_struct attach; - - attach.owner = current; - vhost_work_init(&attach.work, vhost_attach_cgroups_work); - vhost_work_queue(dev, &attach.work); - vhost_work_dev_flush(dev); - return attach.ret; -} - /* Caller should have device mutex */ bool vhost_dev_has_owner(struct vhost_dev *dev) { -- 2.25.1
Hillf Danton
2021-Sep-19 08:24 UTC
[PATCH 7/8] vhost: use kernel_copy_process to check RLIMITs and inherit cgroups
On Thu, 16 Sep 2021 16:20:50 -0500 Mike Christie wrote:

> static int vhost_worker_create(struct vhost_dev *dev) > { > + DECLARE_COMPLETION_ONSTACK(start_done);

Nit, cut it.

> struct vhost_worker *worker; > struct task_struct *task; > + char buf[TASK_COMM_LEN]; > int ret; > > worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); > @@ -603,27 +613,30 @@ static int vhost_worker_create(struct vhost_dev *dev) > return -ENOMEM; > > dev->worker = worker; > - worker->dev = dev; > worker->kcov_handle = kcov_common_handle(); > init_llist_head(&worker->work_list); > > - task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid); > - if (IS_ERR(task)) { > - ret = PTR_ERR(task); > + /* > + * vhost used to use the kthread API which ignores all signals by > + * default and the drivers expect this behavior. So we do not want to > + * ineherit the parent's signal handlers and set our worker to ignore > + * everything below. > + */ > + task = kernel_copy_process(vhost_worker, worker, NUMA_NO_NODE, > + CLONE_FS|CLONE_CLEAR_SIGHAND, 0, 1); > + if (IS_ERR(task)) > goto free_worker; > - } > > worker->task = task; > - wake_up_process(task); /* avoid contributing to loadavg */ > > - ret = vhost_attach_cgroups(dev); > - if (ret) > - goto stop_worker; > + snprintf(buf, sizeof(buf), "vhost-%d", current->pid); > + set_task_comm(task, buf); > + > + ignore_signals(task); > > + wake_up_new_task(task); > return 0; > > -stop_worker: > - kthread_stop(worker->task); > free_worker: > kfree(worker); > dev->worker = NULL; > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > index 102ce25e4e13..09748694cb66 100644 > --- a/drivers/vhost/vhost.h > +++ b/drivers/vhost/vhost.h > @@ -25,11 +25,16 @@ struct vhost_work { > unsigned long flags; > }; > > +enum { > + VHOST_WORKER_FLAG_STOP, > +}; > + > struct vhost_worker { > struct task_struct *task; > + struct completion *exit_done; > struct llist_head work_list; > - struct vhost_dev *dev; > u64 kcov_handle; > + unsigned long flags; > }; > 
> /* Poll a file (eventfd or socket) */ > -- > 2.25.1