thr3ads.net - Virtualization - [PATCH v8 10/10] Documentation: Add documentation for VDUSE [Jul 2021]

If this information is useful, please help other people find it:
Share via:
Stefan Hajnoczi
2021-Jul-01 13:15 UTC
[PATCH v8 10/10] Documentation: Add documentation for VDUSE

On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie
wrote:> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha at
redhat.com> wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha at
redhat.com> wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > +     static void *iova_to_va(int dev_fd, uint64_t
iova, uint64_t *len)
> > > > > +     {
> > > > > +             int fd;
> > > > > +             void *addr;
> > > > > +             size_t size;
> > > > > +             struct vduse_iotlb_entry entry;
> > > > > +
> > > > > +             entry.start = iova;
> > > > > +             entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can
create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the
caller the
> > > length of the contiguous iova region from the specified iova. And
the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the
first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always
+1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.
I see, thanks for explaining!
> > > > > +             return addr + iova - entry.start;
> > > > > +     }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature
negotiation
> > > > works. There must be a way for userspace to report the
device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE
device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example,
if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes,
how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is
used
> to do feature negotiation.
This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

 
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.
> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration
space and inject a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are
cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices doin't need to generate
the contents
> > > > on demand, so I think this is okay but I had expected a
different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > >
> > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG.
We
> > > will need lots of modification of virtio codes to support that.
So to
> > > make it simple, we choose this way:
> > >
> > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> > > > I think you can leave it the way it is, but I wanted to
mention this in
> > > > case someone thinks it's important to support generating
the contents of
> > > > the configuration space on demand.
> > > >
> > >
> > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG
and
> > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> >
> > If the contents of the configuration space change continuously, then
the
> > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > conditions. For example, imagine a device where the driver can read a
> > timer from the configuration space. I think the VIRTIO device model
> > allows that although I'm not aware of any devices that do
something like
> > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > called frequently to keep the timer value updated even though the
guest
> > driver probably isn't accessing it.
> >
> 
> OK, I get you now. Since the VIRTIO specification says "Device
> configuration space is generally used for rarely-changing or
> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> ioctl should not be called frequently.
The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and
means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.
> > What's worse is that there might be race conditions where other
> > driver->device operations are supposed to update the configuration
space
> > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching
an
> > outdated copy.
> >
> 
> I'm not sure. Should the device and driver be able to access the same
> fields concurrently?
Yes. The VIRTIO spec has a generation count to handle multi-field
accesses so that consistency can be ensured:
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004
> 
> > Again, I don't think it's a problem for existing devices in
the VIRTIO
> > specification. But I'm not 100% sure and future devices might
require
> > what I've described, so the VDUSE_DEV_SET_CONFIG interface could
become
> > a problem.
> >
> 
> If so, maybe a new interface can be added at that time. The
> VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good
> way for failure handling.
I'm not aware of the details of why the current approach was necessary,
so I don't have any concrete suggestions. Sorry!

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL:
<http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20210701/6083fe47/attachment.sig>
Virtualization - Jul 2021 - [PATCH v8 10/10] Documentation: Add documentation for VDUSE

[PATCH v8 10/10] Documentation: Add documentation for VDUSE