thr3ads.net - Virtualization - [syzbot] WARNING in vhost_dev

Mike Christie

2022-Feb-18 17:53 UTC

[syzbot] WARNING in vhost_dev_cleanup (2)

On 2/17/22 3:48 AM, Stefano Garzarella wrote:
> 
> On Thu, Feb 17, 2022 at 8:50 AM Michael S. Tsirkin <mst at redhat.com> wrote:
>>
>> On Thu, Feb 17, 2022 at 03:39:48PM +0800, Jason Wang wrote:
>>> On Thu, Feb 17, 2022 at 3:36 PM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>
>>>> On Thu, Feb 17, 2022 at 03:34:13PM +0800, Jason Wang wrote:
>>>>> On Thu, Feb 17, 2022 at 10:01 AM syzbot
>>>>> <syzbot+1e3ea63db39f2b4440e0 at syzkaller.appspotmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following issue on:
>>>>>>
>>>>>> HEAD commit:    c5d9ae265b10 Merge tag 'for-linus' of git://git.kernel.org..
>>>>>> git tree:       upstream
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=132e687c700000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=a78b064590b9f912
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1e3ea63db39f2b4440e0
>>>>>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>>>>>>
>>>>>> Unfortunately, I don't have any reproducer for this issue yet.
>>>>>>
>>>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>>>>> Reported-by: syzbot+1e3ea63db39f2b4440e0 at syzkaller.appspotmail.com
>>>>>>
>>>>>> WARNING: CPU: 1 PID: 10828 at drivers/vhost/vhost.c:715 vhost_dev_cleanup+0x8b8/0xbc0 drivers/vhost/vhost.c:715
>>>>>> Modules linked in:
>>>>>> CPU: 0 PID: 10828 Comm: syz-executor.0 Not tainted 5.17.0-rc4-syzkaller-00051-gc5d9ae265b10 #0
>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>> RIP: 0010:vhost_dev_cleanup+0x8b8/0xbc0 drivers/vhost/vhost.c:715
>>>>>
>>>>> Probably a hint that we are missing a flush.
>>>>>
>>>>> Looking at vhost_vsock_stop() that is called by vhost_vsock_dev_release():
>>>>>
>>>>> static int vhost_vsock_stop(struct vhost_vsock *vsock)
>>>>> {
>>>>>         size_t i;
>>>>>         int ret;
>>>>>
>>>>>         mutex_lock(&vsock->dev.mutex);
>>>>>
>>>>>         ret = vhost_dev_check_owner(&vsock->dev);
>>>>>         if (ret)
>>>>>                 goto err;
>>>>>
>>>>> Where it could fail so the device is not actually stopped.
>>>>>
>>>>> I wonder if this is something related.
>>>>>
>>>>> Thanks
>>>>
>>>>
>>>> But then if that is not the owner then no work should be running, right?
>>>
>>> Could it be a buggy user space that passes the fd to another process
>>> and changes the owner just before the mutex_lock() above?
>>>
>>> Thanks
>>
>> Maybe, but can you be a bit more explicit? what is the set of
>> conditions you see that can lead to this?
> 
> I think the issue could be in the vhost_vsock_stop() as Jason mentioned, 
> but not related to fd passing, but related to the do_exit() function.
> 
> Looking at the stack trace, we are in exit_task_work(), that is called 
> after exit_mm(), so the vhost_dev_check_owner() can fail because 
> current->mm should be NULL at that point.
> 
> It seems the fput work is queued by fput_many() in a worker queue, and 
> in some cases (maybe a lot of files opened?) the work is still queued 
> when we enter in do_exit().

It normally happens if userspace doesn't do a close() when the VM
is shut down and instead lets the kernel's reaper code clean up. The qemu
vhost-scsi code doesn't do a close() during shutdown and so this is our
normal code path. It also happens when something like qemu is not
gracefully shut down, like during a crash.

So fire up qemu, start IO, then crash it or kill -9 it while IO is still
running and you can hit it.
> 
> That said, I don't know if we can simply remove that check in 
> vhost_vsock_stop(), or check if current->mm is NULL, to understand if 
> the process is exiting.
> 
Should the caller do the vhost_dev_check_owner or tell vhost_vsock_stop
when to check?

- vhost_vsock_dev_ioctl always wants to check for ownership right?

- For vhost_vsock_dev_release ownership doesn't matter because we
always want to clean up or it doesn't hurt too much.

For the case where we just do open then close and no ioctls then
running vhost_vq_set_backend in vhost_vsock_stop is just a minor
hit of extra work. If we've done ioctls, but are now in
vhost_vsock_dev_release then we know for the graceful and ungraceful
case that nothing is going to be accessing this device in the future
and it's getting completely freed so we must completely clean it up.
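
To make the question above concrete, here is a minimal userspace sketch of
the "caller tells stop when to check" idea. It is not kernel code and not a
proposed patch: struct mm, struct model_dev, and every model_* function are
hypothetical stand-ins invented for this illustration, and the owner check
only mirrors the idea of vhost_dev_check_owner().

```c
/* Userspace model only: all names below are hypothetical stand-ins,
 * not the kernel's vhost API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm { int id; };                 /* stand-in for struct mm_struct */

struct model_dev {
    const struct mm *owner_mm;         /* mm recorded when the owner was set */
    bool backend_attached;             /* true while the vqs have a backend */
};

/* Mirrors the idea of vhost_dev_check_owner(): 0 if the caller owns dev. */
static int model_check_owner(const struct model_dev *dev,
                             const struct mm *current_mm)
{
    return dev->owner_mm == current_mm ? 0 : -1;
}

/* The caller decides whether ownership matters for this stop request. */
static int model_stop(struct model_dev *dev, const struct mm *current_mm,
                      bool check_owner)
{
    assert(dev != NULL);
    if (check_owner && model_check_owner(dev, current_mm) != 0)
        return -1;                     /* ioctl path: refuse, dev untouched */
    dev->backend_attached = false;     /* detach the backend */
    return 0;
}

/* ioctl-style caller: ownership is always checked. */
static int model_ioctl_stop(struct model_dev *dev, const struct mm *current_mm)
{
    return model_stop(dev, current_mm, true);
}

/* release-style caller: the device is being freed, so skip the check. */
static void model_release(struct model_dev *dev, const struct mm *current_mm)
{
    (void)model_stop(dev, current_mm, false);
}
```

In this model, an ioctl-style stop from a task whose mm does not match the
owner leaves the backend attached, while the release-style stop always
detaches it, even when the caller's mm is NULL as in the exit path.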

Mike Christie

2022-Feb-18 18:23 UTC


[syzbot] WARNING in vhost_dev_cleanup (2)

On 2/18/22 11:53 AM, Mike Christie wrote:
> On 2/17/22 3:48 AM, Stefano Garzarella wrote:
>>
>> On Thu, Feb 17, 2022 at 8:50 AM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>
>>> On Thu, Feb 17, 2022 at 03:39:48PM +0800, Jason Wang wrote:
>>>> On Thu, Feb 17, 2022 at 3:36 PM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>
>>>>> On Thu, Feb 17, 2022 at 03:34:13PM +0800, Jason Wang wrote:
>>>>>> On Thu, Feb 17, 2022 at 10:01 AM syzbot
>>>>>> <syzbot+1e3ea63db39f2b4440e0 at syzkaller.appspotmail.com> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> syzbot found the following issue on:
>>>>>>>
>>>>>>> HEAD commit:    c5d9ae265b10 Merge tag 'for-linus' of git://git.kernel.org..
>>>>>>> git tree:       upstream
>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=132e687c700000
>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=a78b064590b9f912
>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1e3ea63db39f2b4440e0
>>>>>>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>>>>>>>
>>>>>>> Unfortunately, I don't have any reproducer for this issue yet.
>>>>>>>
>>>>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>>>>>> Reported-by: syzbot+1e3ea63db39f2b4440e0 at syzkaller.appspotmail.com
>>>>>>>
>>>>>>> WARNING: CPU: 1 PID: 10828 at drivers/vhost/vhost.c:715 vhost_dev_cleanup+0x8b8/0xbc0 drivers/vhost/vhost.c:715
>>>>>>> Modules linked in:
>>>>>>> CPU: 0 PID: 10828 Comm: syz-executor.0 Not tainted 5.17.0-rc4-syzkaller-00051-gc5d9ae265b10 #0
>>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>>> RIP: 0010:vhost_dev_cleanup+0x8b8/0xbc0 drivers/vhost/vhost.c:715
>>>>>>
>>>>>> Probably a hint that we are missing a flush.
>>>>>>
>>>>>> Looking at vhost_vsock_stop() that is called by vhost_vsock_dev_release():
>>>>>>
>>>>>> static int vhost_vsock_stop(struct vhost_vsock *vsock)
>>>>>> {
>>>>>>         size_t i;
>>>>>>         int ret;
>>>>>>
>>>>>>         mutex_lock(&vsock->dev.mutex);
>>>>>>
>>>>>>         ret = vhost_dev_check_owner(&vsock->dev);
>>>>>>         if (ret)
>>>>>>                 goto err;
>>>>>>
>>>>>> Where it could fail so the device is not actually stopped.
>>>>>>
>>>>>> I wonder if this is something related.
>>>>>>
>>>>>> Thanks
>>>>>
>>>>>
>>>>> But then if that is not the owner then no work should be running, right?
>>>>
>>>> Could it be a buggy user space that passes the fd to another process
>>>> and changes the owner just before the mutex_lock() above?
>>>>
>>>> Thanks
>>>
>>> Maybe, but can you be a bit more explicit? what is the set of
>>> conditions you see that can lead to this?
>>
>> I think the issue could be in the vhost_vsock_stop() as Jason mentioned,
>> but not related to fd passing, but related to the do_exit() function.
>>
>> Looking at the stack trace, we are in exit_task_work(), that is called 
>> after exit_mm(), so the vhost_dev_check_owner() can fail because 
>> current->mm should be NULL at that point.
>>
>> It seems the fput work is queued by fput_many() in a worker queue, and 
>> in some cases (maybe a lot of files opened?) the work is still queued 
>> when we enter in do_exit().
> It normally happens if userspace doesn't do a close() when the VM

Just one clarification. I meant to say it "always" happens when userspace
doesn't do a close.

It doesn't have anything to do with lots of files or something like that.
We are actually running the vhost device's release function from
do_exit->task_work_run and so all those __fputs are done from something
like qemu's context (current == that process).

We are *not* hitting the case:

do_exit->exit_files->put_files_struct->filp_close->fput->fput_many

and then in there hitting the schedule_delayed_work path. For that
the last __fput would be done from a workqueue thread and so the current
pointer would point to a completely different thread.
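
The ordering described above can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions: struct task, struct
model_dev, and the model_* functions are hypothetical stand-ins, and the
model reduces do_exit() to the two steps discussed in this thread (exit_mm()
clears the task's mm, then the task work runs the device's release in the
same task).

```c
/* Userspace model only: all names below are hypothetical stand-ins,
 * not kernel APIs. */
#include <assert.h>
#include <stddef.h>

struct mm { int id; };                     /* stand-in for struct mm_struct */
struct task { const struct mm *mm; };      /* stand-in for task_struct */
struct model_dev {
    const struct mm *owner_mm;             /* mm recorded at set-owner time */
    int stopped;                           /* 1 once the device was stopped */
};

/* Mirrors the idea of vhost_dev_check_owner(): 0 if current owns dev. */
static int model_check_owner(const struct model_dev *dev,
                             const struct task *current)
{
    return dev->owner_mm == current->mm ? 0 : -1;
}

/* Release as discussed in the thread: stop only if the owner check passes. */
static void model_release(struct model_dev *dev, const struct task *current)
{
    if (model_check_owner(dev, current) == 0)
        dev->stopped = 1;
}

/* Explicit close() while the process is alive: its mm is still set. */
static void model_close(struct task *current, struct model_dev *dev)
{
    model_release(dev, current);
}

/* Exit without close(): the mm is dropped before the task work runs. */
static void model_do_exit(struct task *current, struct model_dev *dev)
{
    assert(current != NULL);
    current->mm = NULL;                    /* models exit_mm() */
    model_release(dev, current);           /* models the task-work fput */
}
```

With an explicit close the owner check passes and the device is stopped. On
the exit path the check compares the recorded owner mm against NULL, fails,
and the device is left running, which matches the missing-flush warning the
report points at.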


> is shut down and instead lets the kernel's reaper code clean up. The qemu
> vhost-scsi code doesn't do a close() during shutdown and so this is our
> normal code path. It also happens when something like qemu is not
> gracefully shut down, like during a crash.
> 
> So fire up qemu, start IO, then crash it or kill -9 it while IO is still
> running and you can hit it.
> 
>>
>> That said, I don't know if we can simply remove that check in 
>> vhost_vsock_stop(), or check if current->mm is NULL, to understand if
>> the process is exiting.
>>
> 
> Should the caller do the vhost_dev_check_owner or tell vhost_vsock_stop
> when to check?
> 
> - vhost_vsock_dev_ioctl always wants to check for ownership right?
> 
> - For vhost_vsock_dev_release ownership doesn't matter because we
> always want to clean up or it doesn't hurt too much.
> 
> For the case where we just do open then close and no ioctls then
> running vhost_vq_set_backend in vhost_vsock_stop is just a minor
> hit of extra work. If we've done ioctls, but are now in
> vhost_vsock_dev_release then we know for the graceful and ungraceful
> case that nothing is going to be accessing this device in the future
> and it's getting completely freed so we must completely clean it up.
> 
