thr3ads.net - Virtualization - [PATCH V6 01/10] Use copy

Mike Christie

2021-Dec-22 17:32 UTC

[PATCH V6 01/10] Use copy_process in vhost layer

On 12/21/21 6:20 PM, Eric W. Biederman wrote:
> michael.christie at oracle.com writes:
> 
>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>> Mike Christie <michael.christie at oracle.com> writes:
>>>
>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>> io_uring does a copy_process against its userspace app. This allows the
>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>> space, etc and this worker thread will also be accounted for against that
>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>
>>>> If you are not familiar with qemu and vhost here is more detailed
>>>> problem description:
>>>>
>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>> etc IO and management operations from worker threads created by the
>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>> thread's cgroups.
>>>>
>>>> The problem with this approach is that we then have to add new functions/
>>>> args/functionality for every thing we want to inherit. I started doing
>>>> that here:
>>>>
>>>>
https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$
>>>>
>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>> inherit everything from the beginning, because I'd need to do something
>>>> like that patch several times.
>>>
>>> I read through the code and I don't see why you want to make these
>>> almost threads of a process not actually threads of that process
>>> (like the io_uring threads are).
>>>
>>> As a separate process there are many things that will continue to be
>>> disjoint.  If the thread changes cgroups for example your new process
>>> won't follow.
>>>
>>> If you want them to be actually processes with a lifetime independent
>>> of the creating process I expect you want to reparent them to the local
>>> init process.  Just so they don't confuse the process tree.  Plus init
>>> processes know how to handle unexpected children.
>>>
>>> What are the semantics you are aiming for?
>>>
>>
>> Hi Eric,
>>
>> Right now, for vhost we need the thread being created:
>>
>> 1. added to the caller's cgroup.
>> 2. to share the mm struct with the caller.
>> 3. to be accounted for under the caller's nproc rlimit value.
>>
>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>> already.
>>
>> This patchset started with me just trying to handle #3 by modifying kthreads
>> like here:
>>
>>
https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$
>>
>> So we can use kthreads and the existing helpers and add:
>>
>> A. a ucounts version of the above patches in the link
>>
>> or
>>
>> B. a helper that does something like copy_process's use of
>> is_ucounts_overlimit and vhost can call that.
>>
>> instead of this patchset.
> 
> I don't fundamentally hate the patchset.  I do have concerns about
> the completely broken patch.
> 
> With respect to this patchset my gut says decide.  Are you a thread of the
> process (just use create_io_worker) are you a separate process forked
> from the caller (use a cousin of create_io_worker but don't touch
> create_io_worker).  I think being a process vs being a thread is such a
> fundamental difference we don't want to mix the helpers.
> 
>> Before we even get to the next section below, do you consider items 1 - 3
>> something we need an API based on copy_process for?
> 
> I think 3 staying in the caller's nproc strongly suggests you want to
> reuse copy_process.  Which gets back to my question do you want
> a thread or do you want a process.
> 
> 
> For me a key detail is what is the lifetime of the vhost device?
> 
> Does the vhost go away when the caller goes away?
Yes. When the caller, normally qemu in our case, that created the worker
thread exits, then we free the vhost devices and stop and free the worker
threads we are creating in this patchset.

However, I'm not sure if it makes a difference to you, but we also have a
second way to free a vhost device and its worker thread. The user can run a
command that instructs the qemu process to free the vhost device and its
worker thread.
> 
>   If so you can create a thread in the caller's process that only performs
>   work in kernel space.  At which point you are essentially
>   create_io_thread.
> 
> If the vhost device can live after the caller goes away how is that managed?
When the caller goes away we free the devices and their worker threads.

Either before the caller exits it does an explicit close to release the device,
which frees the device and its worker thread, or when the process exits and the
kernel does a put on its open devices, that will trigger the vhost device's
release function and we free the device and its thread at that time.

>   Especially when you are tied to the caller's mm.
> 
>   If your device lives on after the caller then you should be a separate
>   process forked from the caller.
> 
> 
>> Do you think I should just do A or B above, or do you have another idea? If
>> so can we get agreement on that from everyone?
> 
> Like I said.  I don't hate this patchset.  I think you just tried to be
> a little too general.
> 
> With kthreads you are fighting very hard to get something that is not
> part of the process tree to act like it is part of the process tree.
> Which strongly suggests you should be part of the process tree.
> 
> As long as you don't have important state you will need to fight to get
> separate from the process.
> 
> If you do have important state that needs to not come from the process
> figuring out how to make a kthread work may be safer.
> 
> I don't currently know vhost devices well enough to answer that question.
> 
>> I thought my patches in that link were a little hacky in how they passed
>> around the user/creds info. I thought maybe it shouldn't be passed around like
>> that, so switched to the copy_process based approach which did everything for
>> me. And I thought io_uring needed something similar as us so I made it generic.
>>
>> I don't have a preference. You and Christian are the experts, so I'll leave it
>> to you guys.
> 
> I hope this gives you some useful direction.
> 
> Eric
> 
> 
>>> I can see sense in generalizing some of the pieces of create_io_thread
>>> but I think generalizing create_io_thread itself is premature.  The code
>>> lives in kernel/fork.c because it is a very special thing that we want
>>> to keep our eyes on.
>>>
>>> Some of your generalization makes it much more difficult to tell what
>>> your code is going to use because you remove hard codes that are there
>>> to simplify the analysis of the situation.
>>>
>>> My gut says adding a new create_vhost_worker and putting that in
>>> kernel/fork.c is a lot safer and will allow much better code analysis.
>>>
>>> If there really are commonalities between creating a userspace process
>>> that runs completely in the kernel and creating an additional userspace
>>> thread we refactor the code and simplify things.
>>>
>>> I am especially nervous about generalizing the io_uring code as its
>>> signal handling just barely works, and any generalization will cause it
>>> to break.  So you are in the process of generalizing code that simply
>>> can not handle the general case.  That scares me

Eric W. Biederman

2021-Dec-22 18:24 UTC

[PATCH V6 01/10] Use copy_process in vhost layer

Mike Christie <michael.christie at oracle.com> writes:
> On 12/21/21 6:20 PM, Eric W. Biederman wrote:
>> michael.christie at oracle.com writes:
>> 
>>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>>> Mike Christie <michael.christie at oracle.com> writes:
>>>>
>>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>>> io_uring does a copy_process against its userspace app. This allows the
>>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>>> space, etc and this worker thread will also be accounted for against that
>>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>>
>>>>> If you are not familiar with qemu and vhost here is more detailed
>>>>> problem description:
>>>>>
>>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>>> etc IO and management operations from worker threads created by the
>>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>>> thread's cgroups.
>>>>>
>>>>> The problem with this approach is that we then have to add new functions/
>>>>> args/functionality for every thing we want to inherit. I started doing
>>>>> that here:
>>>>>
>>>>>
https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$
>>>>>
>>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>>> inherit everything from the beginning, because I'd need to do something
>>>>> like that patch several times.
>>>>
>>>> I read through the code and I don't see why you want to make these
>>>> almost threads of a process not actually threads of that process
>>>> (like the io_uring threads are).
>>>>
>>>> As a separate process there are many things that will continue to be
>>>> disjoint.  If the thread changes cgroups for example your new process
>>>> won't follow.
>>>>
>>>> If you want them to be actually processes with a lifetime independent
>>>> of the creating process I expect you want to reparent them to the local
>>>> init process.  Just so they don't confuse the process tree.  Plus init
>>>> processes know how to handle unexpected children.
>>>>
>>>> What are the semantics you are aiming for?
>>>>
>>>
>>> Hi Eric,
>>>
>>> Right now, for vhost we need the thread being created:
>>>
>>> 1. added to the caller's cgroup.
>>> 2. to share the mm struct with the caller.
>>> 3. to be accounted for under the caller's nproc rlimit value.
>>>
>>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>>> already.
>>>
>>> This patchset started with me just trying to handle #3 by modifying kthreads
>>> like here:
>>>
>>>
https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$
>>>
>>> So we can use kthreads and the existing helpers and add:
>>>
>>> A. a ucounts version of the above patches in the link
>>>
>>> or
>>>
>>> B. a helper that does something like copy_process's use of
>>> is_ucounts_overlimit and vhost can call that.
>>>
>>> instead of this patchset.
>> 
>> I don't fundamentally hate the patchset.  I do have concerns about
>> the completely broken patch.
>> 
>> With respect to this patchset my gut says decide.  Are you a thread of the
>> process (just use create_io_worker) are you a separate process forked
>> from the caller (use a cousin of create_io_worker but don't touch
>> create_io_worker).  I think being a process vs being a thread is such a
>> fundamental difference we don't want to mix the helpers.
>> 
>>> Before we even get to the next section below, do you consider items 1 - 3
>>> something we need an API based on copy_process for?
>> 
>> I think 3 staying in the caller's nproc strongly suggests you want to
>> reuse copy_process.  Which gets back to my question do you want
>> a thread or do you want a process.
>> 
>> 
>> For me a key detail is what is the lifetime of the vhost device?
>> 
>> Does the vhost go away when the caller goes away?
>
> Yes. When the caller, normally qemu in our case, that created the worker
> thread exits, then we free the vhost devices and stop and free the worker
> threads we are creating in this patchset.
>
> However, I'm not sure if it makes a difference to you, but we also have a
> second way to free a vhost device and its worker thread. The user can run a
> command that instructs the qemu process to free the vhost device and its
> worker thread.
I dug a little deeper to understand how this works, and it appears to be
a standard file descriptor based API.  The last close of the file
descriptor is what causes the vhost_dev_cleanup to be called which shuts
down the thread.

This means that in rare cases the file descriptor can be passed to
another process and be held open there, even after the main process
exits.

This says to me that much as it might be handy your thread does not
strictly share the same lifetime as your qemu process.
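
The "last close" behaviour can be seen from userspace with any file
descriptor that two processes hold. The sketch below is only an analogy and
does not touch vhost: it uses a pipe, and the function name
release_waits_for_last_close() is made up for illustration. The parent closes
its copy of the write end first, yet the object stays usable until the child
closes the last copy.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the pipe's write side stayed alive after the parent closed
 * its copy and was only torn down by the child's close, 0 if not, and -1
 * on a setup error. Hypothetical helper, for illustration only.
 */
int release_waits_for_last_close(void)
{
	int data[2], sync_pipe[2];
	char byte = 0;
	int status = 0;
	pid_t pid;

	if (pipe(data) < 0 || pipe(sync_pipe) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: the "other process" still holding the write end. */
		close(data[0]);
		close(sync_pipe[1]);
		/* Wait until the parent has closed its own copy. */
		if (read(sync_pipe[0], &byte, 1) != 1)
			_exit(1);
		/* The object is still alive: our copy still works. */
		if (write(data[1], "x", 1) != 1)
			_exit(1);
		close(data[1]);	/* last close: only now can the reader see EOF */
		_exit(0);
	}

	/* Parent: drop our copy of the write end, then tell the child. */
	close(data[1]);
	close(sync_pipe[0]);
	if (write(sync_pipe[1], "g", 1) != 1)
		return -1;
	close(sync_pipe[1]);

	/* Data arrives even though the parent's copy is long closed. */
	int got_data = (read(data[0], &byte, 1) == 1 && byte == 'x');
	/* EOF shows up only after the child's final close. */
	int got_eof = (read(data[0], &byte, 1) == 0);

	close(data[0]);
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return got_data && got_eof &&
	       WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The same reasoning applies to a descriptor passed over a UNIX socket: whoever
holds the final reference decides when the release function runs.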

>>   If so you can create a thread in the caller's process that only performs
>>   work in kernel space.  At which point you are essentially
>>   create_io_thread.
>> 
>> If the vhost device can live after the caller goes away how is that managed?
>
> When the caller goes away we free the devices and their worker threads.
>
> Either before the caller exits it does an explicit close to release the device,
> which frees the device and its worker thread, or when the process exits and the
> kernel does a put on its open devices, that will trigger the vhost device's
> release function and we free the device and its thread at that time.
All of which says to me that the vhost devices semantically work well as
separate processes (that never run userspace code) not as threads of the
creating userspace process.

So I would recommend creating a minimal version of the kthread api,
using create_process targeted only at the vhost case.  Essentially what
you have done with this patchset, but without any configuration knobs
from the caller's perspective.

Which means that you can hard code calling ignore_signals, and the
like, instead of needing to have a separate configuration knob for each
place io_workers are different from vhost_workers.

In the future I can see io_workers evolving into a general user space
thread that only runs code in the kernel abstraction, and I can see
vhost_workers evolving into a general user space process that only runs
code in the kernel abstraction.

For now we don't need that generality so please just create a
create_vhost_process helper akin to create_io_thread that does just what
you need.

I don't know if it is better to base it on kernel_clone or on
copy_process.  All I am certain of is that you need to set
"args->exit_signal = -1;".  This prevents having to play games with
do_notify_parent.

Eric

Mike Christie

2022-Jan-17 16:41 UTC

[PATCH V6 01/10] Use copy_process in vhost layer

On 12/22/21 12:24 PM, Eric W. Biederman wrote:
> All I am certain of is that you need to set
> "args->exit_signal = -1;".  This prevents having to play games with
> do_notify_parent.
Hi Eric,

I have all your review comments handled except this one. It's looking like it's
more difficult than just setting the exit_signal=-1, so I wanted to check that
I understood you.

Here is what I'm currently looking at:

1. I can't just set args->exit_signal to -1, because we end up with a task_struct
that's partially set up like a CLONE_THREAD task. What happens is copy_process
will set the task's exit_signal to -1 and then thread_group_leader() will return
false. When code like the thread_group_leader check in copy_process runs, we will
then go down the CLONE_THREAD paths, which are not set up, and hit crashes.

We would need changes like the following which does not crash anymore but is not
correct for many reasons. I am just posting this code as an example of the issue
I am hitting.

@@ -1637,11 +1637,13 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig)
 	posix_cputimers_group_init(pct, cpu_limit);
 }

-static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_signal(unsigned long clone_flags, struct task_struct *tsk,
+		       struct kernel_clone_args *args)
 {
 	struct signal_struct *sig;

-	if (clone_flags & CLONE_THREAD)
+	if (clone_flags & CLONE_THREAD || args->exit_signal == -1)
 		return 0;

 	sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
@@ -2194,7 +2244,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_sighand(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_fs;
-	retval = copy_signal(clone_flags, p);
+	retval = copy_signal(clone_flags, p, args);
 	if (retval)
 		goto bad_fork_cleanup_sighand;
 	retval = copy_mm(clone_flags, p);
@@ -2277,6 +2327,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		p->tgid = current->tgid;
+	} else if (args->exit_signal == -1) {
+		p->group_leader = current->group_leader;
+		p->tgid = p->pid;
 	} else {
 		p->group_leader = p;
 		p->tgid = p->pid;
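
The mismatch described in point 1 can be modeled in userspace. This is a toy
sketch, not kernel code: struct toy_task and every toy_* name are made up for
illustration. The one thing taken from the kernel is that
thread_group_leader() tests exit_signal >= 0, so a task created without
CLONE_THREAD but with exit_signal == -1 ends up with two fields that disagree
about whether it leads its group.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for task_struct; only the fields under discussion. */
struct toy_task {
	int exit_signal;		/* -1 reads as "not a group leader" */
	struct toy_task *group_leader;	/* who leads this task's group */
};

/* Same test the kernel's thread_group_leader() applies. */
static bool toy_thread_group_leader(const struct toy_task *p)
{
	return p->exit_signal >= 0;
}

/*
 * Set up a task the way the non-CLONE_THREAD branch would: the task is
 * its own group leader, and exit_signal comes from the caller's args.
 */
void toy_init_process(struct toy_task *p, int exit_signal)
{
	assert(p != NULL);
	p->exit_signal = exit_signal;
	p->group_leader = p;
}

/*
 * True when the two ways of asking "is this a leader?" disagree, which
 * is the half-thread, half-process state that later code trips over.
 */
bool toy_is_inconsistent(const struct toy_task *p)
{
	bool leader_by_pointer = (p->group_leader == p);

	return leader_by_pointer != toy_thread_group_leader(p);
}
```

With exit_signal set to SIGCHLD both views agree; with -1 the pointer says
"leader" while the exit_signal test says "thread", so any code that branches
on thread_group_leader() takes the thread path for a task that was never set
up as one.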

2. Instead of #1, I could add some code where we just set
task_struct->exit_signal to -1. We could do this towards the end of copy_process
or after it has returned, but before we do do_exit. However, that will have similar
issues as #1 during the exit handling.

For example, __exit_signal will call thread_group_leader which would return false.
__unhash_process would then not detach the pid and we would later hit crashes due
to the task_struct being freed already. I could add code like above to the exit
related code paths, but it gets messy like above.

3. I thought I could separate the leader detection from the exit signal by adding
a flag/field to kernel_clone_args and task_struct. But then I get to the point
where I just need a check for USER/VHOST_WORKER tasks in exit_notify which is
similar to the patch you didn't like where I added the check in do_notify_parent.
So I thought you might not like this approach.

Note:
We can't set our task's exit_signal to SIGCHLD and get autoreaped like suggested in
another mail. The original idea for the do_notify_parent was we wanted the behavior
that kthreads have where they get autoreaped on exit. kthreads get autoreaped there
because the kthreadd task that is the parent ignores all signals and so we hit the
parent SIG_IGN check:

        psig = tsk->parent->sighand;
        spin_lock_irqsave(&psig->siglock, flags);
        if (!tsk->ptrace && sig == SIGCHLD &&
            (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
             (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {

Our parent, the qemu task, does not ignore SIGCHLD and so will not hit the code
above.

4. Maybe I am going in the wrong direction and we need kthreads. I could add a:

if (!is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC,
			  rlimit(RLIMIT_NPROC)))
	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);

to vhost.c or to kthread.c when some new arg is passed in.

What do you think?
