thr3ads.net - Virtualization - [PATCH v8 10/10] Documentation: Add documentation for VDUSE [Jul 2021]

Jason Wang

2021-Jul-07 09:24 UTC

[PATCH v8 10/10] Documentation: Add documentation for VDUSE

On 2021/7/7 at 4:55, Stefan Hajnoczi wrote:
> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
>> On 2021/7/7 at 1:11, Stefan Hajnoczi wrote:
>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha at redhat.com> wrote:
>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/5 at 8:49, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/4 at 5:49, Yongji Xie wrote:
>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>>>>>>>> configuration space is generally used for rarely-changing or
>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>> ioctl should not be called frequently.
>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements.
>>>>>>>>>> Here the language (especially the word "generally") is weaker and means
>>>>>>>>>> there may be exceptions.
>>>>>>>>>>
>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field
>>>>>>>>>> containing an error code if the device encounters a problem unrelated to
>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error
>>>>>>>>>> code to 0, saving the driver an extra configuration space write access
>>>>>>>>>> and possibly race conditions. It isn't possible to implement those
>>>>>>>>>> semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>>>>>>> Note that though you're correct, my understanding is that config space is
>>>>>>>> not suitable for this kind of error propagation. And it would be very hard
>>>>>>>> to implement this kind of semantics in some transports. A virtqueue should be
>>>>>>>> much better. As Yongji quoted, the config space is used for
>>>>>>>> "rarely-changing or initialization-time parameters".
>>>>>>>>
>>>>>>>>
>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>>>>>>>> handle the message failure, I'm going to add a return value to
>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device
>>>>>>>>> driver can be modified to handle that.
>>>>>>>>>
>>>>>>>>> Jason and Stefan, what do you think of this way?
>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>>>>>>>
>>>>>>> The VIRTIO spec provides no way for the device to report errors from
>>>>>>> config space accesses.
>>>>>>>
>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid
>>>>>>> virtio_config_read*() and silently discards virtio_config_write*()
>>>>>>> accesses.
>>>>>>>
>>>>>>> VDUSE can take the same approach with
>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>>>>>>>
>>>>>>>> I'd like to stick to the current assumption that get_config won't fail.
>>>>>>>> That is to say,
>>>>>>>>
>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can
>>>>>>>> always succeed
>>>>>>>> 2) introduce an ioctl for the vduse userspace to update the config space.
>>>>>>>> 3) we can synchronize with the vduse userspace during set_config
>>>>>>>>
>>>>>>>> Does this work?
>>>>>>> I noticed that caching is also allowed by the vhost-user protocol
>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
>>>>>>> know whether or not caching is in effect. The interface you outlined
>>>>>>> above requires caching.
>>>>>>>
>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the
>>>>>>> configuration space?
>>>>>> Because:
>>>>>>
>>>>>> 1) The kernel cannot wait forever in get_config(); this is the major
>>>>>> difference from vhost-user.
>>>>> virtio_cread() can sleep:
>>>>>
>>>>>     #define virtio_cread(vdev, structname, member, ptr)                  \
>>>>>             do {                                                         \
>>>>>                     typeof(((structname*)0)->member) virtio_cread_v;     \
>>>>>                                                                          \
>>>>>                     might_sleep();                                       \
>>>>>                     ^^^^^^^^^^^^^^
>>>>>
>>>>> Which code path cannot sleep?
>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a
>>>> buggy/malicious userspace may refuse to respond to the get_config.
>>>>
>>>> It looks to me that the ideal case, with the current virtio spec, for
>>>> VDUSE is to
>>>>
>>>> 1) maintain the device and its state in the kernel, userspace may sync
>>>> with the kernel device via ioctls
>>>> 2) offload the datapath (virtqueue) to the userspace
>>>>
>>>> This seems more robust and safe than simply relaying everything to
>>>> userspace and waiting for its response.
>>>>
>>>> And we know for sure this model can work, an example is TUN/TAP:
>>>> netdevice is abstracted in the kernel and datapath is done via
>>>> sendmsg()/recvmsg().
>>>>
>>>> Maintaining the config in the kernel follows this model and it can
>>>> simplify the device generation implementation.
>>>>
>>>> For config space write, it requires more thought but fortunately it's
>>>> not commonly used. So VDUSE can choose to filter out the
>>>> device/features that depend on the config write.
>>> This is the problem. There are other messages like SET_FEATURES where I
>>> guess we'll face the same challenge.
>>
>> Probably not, userspace device can tell the kernel about the device_features
>> and mandated_features during creation, and the feature negotiation could be
>> done purely in the kernel without bothering the userspace.
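
As a rough illustration of the kernel-only negotiation described in the quoted text above, here is a minimal sketch in plain C. The struct and function names are hypothetical (they are not taken from the VDUSE patches); only the device_features/mandated_features idea comes from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device feature state; the field names follow the wording
 * in this thread, not any merged VDUSE structure. */
struct sketch_features {
    uint64_t device_features;   /* offered by the userspace device at creation */
    uint64_t mandated_features; /* bits the driver is required to accept */
    uint64_t driver_features;   /* negotiated result, kept in the kernel */
};

/* Validate the driver's selection without any round trip to userspace. */
static bool sketch_set_features(struct sketch_features *f, uint64_t requested)
{
    assert(f);
    if (requested & ~f->device_features)
        return false; /* driver asked for a bit that was never offered */
    if ((requested & f->mandated_features) != f->mandated_features)
        return false; /* a mandatory bit is missing */
    f->driver_features = requested;
    return true;
}
```

Because both masks are fixed at creation time, a slow or hostile userspace process cannot stall this path.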

(For some reason I dropped the list accidentally; adding it back, sorry.)

> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> interface where the driver waits for the device.

It depends on how we define "synchronous" here. If I understand
correctly, the spec doesn't expect any kind of failure from the
set_status operation itself.

Instead, whenever the driver wants synchronization, it should be done via
get_status():

1) re-read the device status to make sure FEATURES_OK is set during feature
negotiation
2) re-read the device status until it is 0 to make sure the device has
finished the reset
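
To make the two re-read steps above concrete, here is a small self-contained sketch. Everything named sketch_* is hypothetical: it models a device that applies a status write only after a few polls, which stands in for a userspace process that answers late. The only value taken from the VIRTIO spec is the FEATURES_OK bit (8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_S_FEATURES_OK 8 /* FEATURES_OK bit value from the VIRTIO spec */

/* Hypothetical device model: set_status only records the request; the
 * device side applies it some number of get_status polls later. */
struct sketch_dev {
    uint8_t status;        /* what the device currently reports */
    uint8_t requested;     /* last value written by the driver */
    unsigned polls_to_ack; /* polls left until the device catches up */
};

/* Never fails and never waits, matching the reading of the spec above. */
static void sketch_set_status(struct sketch_dev *d, uint8_t val)
{
    assert(d);
    d->requested = val;
}

static uint8_t sketch_get_status(struct sketch_dev *d)
{
    assert(d);
    if (d->polls_to_ack > 0)
        d->polls_to_ack--;
    else
        d->status = d->requested;
    return d->status;
}

/* Step 1: set FEATURES_OK, then re-read until the device confirms it. */
static bool sketch_confirm_features(struct sketch_dev *d, unsigned max_polls)
{
    sketch_set_status(d, d->status | SKETCH_S_FEATURES_OK);
    for (unsigned i = 0; i < max_polls; i++)
        if (sketch_get_status(d) & SKETCH_S_FEATURES_OK)
            return true;
    return false;
}

/* Step 2: write 0, then re-read until the device reports 0. */
static bool sketch_reset(struct sketch_dev *d, unsigned max_polls)
{
    sketch_set_status(d, 0);
    for (unsigned i = 0; i < max_polls; i++)
        if (sketch_get_status(d) == 0)
            return true;
    return false;
}
```

The bounded poll count is only a stand-in for whatever waiting policy a real driver uses; the point is that the write itself returns immediately and all waiting happens on the read side.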

>
> VDUSE currently doesn't wait for the device emulation process to handle
> this message (no reply is needed) but I think this is a mistake because
> VDUSE is not following the VIRTIO device model.

With the trick that is done for FEATURES_OK above, I think we don't need
to wait for the reply.

If userspace takes too long to respond, it can be detected since
get_status() doesn't return the expected value for a long time.

And for the case that needs a timeout, we can probably use NEEDS_RESET.
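
One possible shape for the timeout idea, purely as a sketch: the kernel side remembers when it forwarded a status change and, if userspace has not acknowledged it by a deadline, reports NEEDS_RESET from get_status(). The tracker struct, the tick-based clock, and all sketch_* names are assumptions made for illustration; only the DEVICE_NEEDS_RESET bit value (0x40) comes from the VIRTIO spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_S_NEEDS_RESET 0x40 /* DEVICE_NEEDS_RESET bit from the VIRTIO spec */

/* Hypothetical kernel-side tracker for one forwarded status change. */
struct sketch_status {
    uint8_t acked;     /* last status acknowledged by userspace */
    bool pending;      /* a status change is waiting for userspace */
    uint64_t deadline; /* tick after which the wait counts as timed out */
};

/* Forward a status change without blocking the caller. */
static void sketch_set_status(struct sketch_status *s, uint64_t now,
                              uint64_t timeout)
{
    assert(s);
    s->pending = true;
    s->deadline = now + timeout;
}

/* Called when the userspace process acknowledges the change. */
static void sketch_ack(struct sketch_status *s, uint8_t status)
{
    assert(s);
    s->acked = status;
    s->pending = false;
}

/* get_status never blocks: an overdue reply surfaces as NEEDS_RESET. */
static uint8_t sketch_get_status(const struct sketch_status *s, uint64_t now)
{
    assert(s);
    if (s->pending && now > s->deadline)
        return s->acked | SKETCH_S_NEEDS_RESET;
    return s->acked;
}
```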

>
> I strongly suggest designing the VDUSE interface to match the VIRTIO
> device model (or at least the vDPA interface).

I fully agree with you and that is what we want to achieve in this series.

> Defining a custom
> interface for VDUSE avoids some implementation complexity and makes it
> easier to deal with untrusted userspace, but it's impossible to
> implement certain VIRTIO features or devices. It also fragments VIRTIO
> more than necessary; we have a standard, let's stick to it.

Yes.

>
>>> I agree that caching the contents of configuration space in the kernel
>>> helps, but if there are other VDUSE messages with the same problem then
>>> an attacker will exploit them instead.
>>>
>>> I think a systematic solution is needed. It would be necessary to
>>> enumerate the virtio_vdpa and vhost_vdpa cases separately to figure out
>>> where VDUSE messages are synchronous/time-sensitive.
>>
>> This is the case of reset and needs more thought. We should stick to a
>> consistent uAPI for the userspace.
>>
>> For vhost-vDPA, it needs to be synchronized with the userspace and we can
>> wait forever.
> The VMM should still be able to handle signals when a vhost_vdpa ioctl
> is waiting for a reply from the VDUSE userspace process. Or if that's
> not possible then there needs to be a way to force disconnection from
> VDUSE so the VMM can be killed.

Note that VDUSE works under the vDPA bus, so vhost should be transparent to VDUSE.

But we can detect this via whether or not the bounce buffer is used.

Thanks

>
> Stefan

Stefan Hajnoczi

2021-Jul-07 15:54 UTC


[PATCH v8 10/10] Documentation: Add documentation for VDUSE

On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
>
> On 2021/7/7 at 4:55, Stefan Hajnoczi wrote:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > On 2021/7/7 at 1:11, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha at redhat.com> wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/5 at 8:49, Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/4 at 5:49, Yongji Xie wrote:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > >
> > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that config space is
> > > > > > > > > not suitable for this kind of error propagation. And it would be very hard
> > > > > > > > > to implement this kind of semantics in some transports. A virtqueue should be
> > > > > > > > > much better. As Yongji quoted, the config space is used for
> > > > > > > > > "rarely-changing or initialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > >
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > >
> > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > config space accesses.
> > > > > > > >
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > >
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > I'd like to stick to the current assumption that get_config won't fail.
> > > > > > > > > That is to say,
> > > > > > > > >
> > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > always succeed
> > > > > > > > > 2) introduce an ioctl for the vduse userspace to update the config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > >
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > above requires caching.
> > > > > > > >
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > configuration space?
> > > > > > > Because:
> > > > > > >
> > > > > > > 1) The kernel cannot wait forever in get_config(); this is the major
> > > > > > > difference from vhost-user.
> > > > > > virtio_cread() can sleep:
> > > > > >
> > > > > >     #define virtio_cread(vdev, structname, member, ptr)                  \
> > > > > >             do {                                                         \
> > > > > >                     typeof(((structname*)0)->member) virtio_cread_v;     \
> > > > > >                                                                          \
> > > > > >                     might_sleep();                                       \
> > > > > >                     ^^^^^^^^^^^^^^
> > > > > >
> > > > > > Which code path cannot sleep?
> > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > >
> > > > > It looks to me that the ideal case, with the current virtio spec, for
> > > > > VDUSE is to
> > > > >
> > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > with the kernel device via ioctls
> > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > >
> > > > > This seems more robust and safe than simply relaying everything to
> > > > > userspace and waiting for its response.
> > > > >
> > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > sendmsg()/recvmsg().
> > > > >
> > > > > Maintaining the config in the kernel follows this model and it can
> > > > > simplify the device generation implementation.
> > > > >
> > > > > For config space write, it requires more thought but fortunately it's
> > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > device/features that depend on the config write.
> > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > guess we'll face the same challenge.
> > >
> > > Probably not, userspace device can tell the kernel about the device_features
> > > and mandated_features during creation, and the feature negotiation could be
> > > done purely in the kernel without bothering the userspace.
> 
> 
> (For some reason I dropped the list accidentally; adding it back, sorry.)
> 
> 
> > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > interface where the driver waits for the device.
> 
> 
> It depends on how we define "synchronous" here. If I understand correctly,
> the spec doesn't expect any kind of failure from the set_status operation
> itself.
>
> Instead, whenever the driver wants synchronization, it should be done via
> get_status():
>
> 1) re-read the device status to make sure FEATURES_OK is set during feature
> negotiation
> 2) re-read the device status until it is 0 to make sure the device has
> finished the reset
> 
> 
> > 
> > VDUSE currently doesn't wait for the device emulation process to handle
> > this message (no reply is needed) but I think this is a mistake because
> > VDUSE is not following the VIRTIO device model.
> 
> 
> With the trick that is done for FEATURES_OK above, I think we don't need to
> wait for the reply.
>
> If userspace takes too long to respond, it can be detected since
> get_status() doesn't return the expected value for a long time.
>
> And for the case that needs a timeout, we can probably use NEEDS_RESET.

I think you're right. get_status is the synchronization point, not
set_status.

Currently there is no VDUSE GET_STATUS message. The
VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
that the device emulation program can participate in emulating the
Device Status field. This change could affect VDUSE's VIRTIO feature
interface since the device emulation program can reject features by not
setting FEATURES_OK.
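
To illustrate what "rejecting features by not setting FEATURES_OK" might look like on the device emulation side, here is a hedged sketch. No such VDUSE SET_STATUS message exists at this point in the thread, so the handler, its signature, and the sketch_device struct are all hypothetical; only the FEATURES_OK bit value (8) is from the VIRTIO spec.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_S_FEATURES_OK 8 /* FEATURES_OK bit value from the VIRTIO spec */

/* Hypothetical state held by the device emulation program. */
struct sketch_device {
    uint64_t supported_features; /* what this device can actually do */
    uint8_t status;              /* Device Status field as the device reports it */
};

/* Hypothetical SET_STATUS handler. It returns the status value the device
 * is willing to report, which the driver later observes by re-reading. */
static uint8_t sketch_handle_set_status(struct sketch_device *dev,
                                        uint8_t requested,
                                        uint64_t driver_features)
{
    assert(dev);
    /* Reject the negotiation by leaving FEATURES_OK clear when the driver
     * selected a feature bit this device does not support. */
    if ((requested & SKETCH_S_FEATURES_OK) &&
        (driver_features & ~dev->supported_features))
        requested &= (uint8_t)~SKETCH_S_FEATURES_OK;
    dev->status = requested;
    return dev->status;
}
```

A driver that re-reads the status after writing FEATURES_OK would then see the bit missing and treat the negotiation as failed.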

Stefan