Michael S. Tsirkin
2019-Sep-03 11:26 UTC
[RFC v3] vhost: introduce mdev based hardware vhost backend
On Wed, Aug 28, 2019 at 01:37:12PM +0800, Tiwei Bie wrote:> Details about this can be found here: > > https://lwn.net/Articles/750770/ > > What's new in this version > =========================> > There are three choices based on the discussion [1] in RFC v2: > > > #1. We expose a VFIO device, so we can reuse the VFIO container/group > > based DMA API and potentially reuse a lot of VFIO code in QEMU. > > > > But in this case, we have two choices for the VFIO device interface > > (i.e. the interface on top of VFIO device fd): > > > > A) we may invent a new vhost protocol (as demonstrated by the code > > in this RFC) on VFIO device fd to make it work in VFIO's way, > > i.e. regions and irqs. > > > > B) Or as you proposed, instead of inventing a new vhost protocol, > > we can reuse most existing vhost ioctls on the VFIO device fd > > directly. There should be no conflicts between the VFIO ioctls > > (type is 0x3B) and VHOST ioctls (type is 0xAF) currently. > > > > #2. Instead of exposing a VFIO device, we may expose a VHOST device. > > And we will introduce a new mdev driver vhost-mdev to do this. > > It would be natural to reuse the existing kernel vhost interface > > (ioctls) on it as much as possible. But we will need to invent > > some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a > > choice, but it's too heavy and doesn't support vIOMMU by itself). > > This version is more like a quick PoC to try Jason's proposal on > reusing vhost ioctls. And the second way (#1/B) in above three > choices was chosen in this version to demonstrate the idea quickly. > > Now the userspace API looks like this: > > - VFIO's container/group based IOMMU API is used to do the > DMA programming. > > - Vhost's existing ioctls are used to setup the device. > > And the device will report device_api as "vfio-vhost". > > Note that, there are dirty hacks in this version. If we decide to > go this way, some refactoring in vhost.c/vhost.h may be needed. > > PS. The direct mapping of the notify registers isn't implemented > in this version. > > [1] https://lkml.org/lkml/2019/7/9/101 > > Signed-off-by: Tiwei Bie <tiwei.bie at intel.com>....> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd, > + unsigned long arg) > +{ > + void __user *argp = (void __user *)arg; > + struct vhost_mdev *vdpa; > + unsigned long minsz; > + int ret = 0; > + > + if (!mdev) > + return -EINVAL; > + > + vdpa = mdev_get_drvdata(mdev); > + if (!vdpa) > + return -ENODEV; > + > + switch (cmd) { > + case VFIO_DEVICE_GET_INFO: > + { > + struct vfio_device_info info; > + > + minsz = offsetofend(struct vfio_device_info, num_irqs); > + > + if (copy_from_user(&info, (void __user *)arg, minsz)) { > + ret = -EFAULT; > + break; > + } > + > + if (info.argsz < minsz) { > + ret = -EINVAL; > + break; > + } > + > + info.flags = VFIO_DEVICE_FLAGS_VHOST; > + info.num_regions = 0; > + info.num_irqs = 0; > + > + if (copy_to_user((void __user *)arg, &info, minsz)) { > + ret = -EFAULT; > + break; > + } > + > + break; > + } > + case VFIO_DEVICE_GET_REGION_INFO: > + case VFIO_DEVICE_GET_IRQ_INFO: > + case VFIO_DEVICE_SET_IRQS: > + case VFIO_DEVICE_RESET: > + ret = -EINVAL; > + break; > + > + case VHOST_MDEV_SET_STATE: > + ret = vhost_set_state(vdpa, argp); > + break; > + case VHOST_GET_FEATURES: > + ret = vhost_get_features(vdpa, argp); > + break; > + case VHOST_SET_FEATURES: > + ret = vhost_set_features(vdpa, argp); > + break; > + case VHOST_GET_VRING_BASE: > + ret = vhost_get_vring_base(vdpa, argp); > + break; > + default: > + ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp); > + if (ret == -ENOIOCTLCMD) > + ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp); > + } > + > + return ret; > +} > +EXPORT_SYMBOL(vhost_mdev_ioctl);I don't have a problem with this approach. A small question: would it make sense to have two fds: send vhost ioctls on one and vfio ioctls on another? We can then pass vfio fd to the vhost fd with a SET_BACKEND ioctl. What do you think? -- MST
Tiwei Bie
2019-Sep-04 02:48 UTC
[RFC v3] vhost: introduce mdev based hardware vhost backend
On Tue, Sep 03, 2019 at 07:26:03AM -0400, Michael S. Tsirkin wrote:> On Wed, Aug 28, 2019 at 01:37:12PM +0800, Tiwei Bie wrote: > > Details about this can be found here: > > > > https://lwn.net/Articles/750770/ > > > > What's new in this version > > =========================> > > > There are three choices based on the discussion [1] in RFC v2: > > > > > #1. We expose a VFIO device, so we can reuse the VFIO container/group > > > based DMA API and potentially reuse a lot of VFIO code in QEMU. > > > > > > But in this case, we have two choices for the VFIO device interface > > > (i.e. the interface on top of VFIO device fd): > > > > > > A) we may invent a new vhost protocol (as demonstrated by the code > > > in this RFC) on VFIO device fd to make it work in VFIO's way, > > > i.e. regions and irqs. > > > > > > B) Or as you proposed, instead of inventing a new vhost protocol, > > > we can reuse most existing vhost ioctls on the VFIO device fd > > > directly. There should be no conflicts between the VFIO ioctls > > > (type is 0x3B) and VHOST ioctls (type is 0xAF) currently. > > > > > > #2. Instead of exposing a VFIO device, we may expose a VHOST device. > > > And we will introduce a new mdev driver vhost-mdev to do this. > > > It would be natural to reuse the existing kernel vhost interface > > > (ioctls) on it as much as possible. But we will need to invent > > > some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a > > > choice, but it's too heavy and doesn't support vIOMMU by itself). > > > > This version is more like a quick PoC to try Jason's proposal on > > reusing vhost ioctls. And the second way (#1/B) in above three > > choices was chosen in this version to demonstrate the idea quickly. > > > > Now the userspace API looks like this: > > > > - VFIO's container/group based IOMMU API is used to do the > > DMA programming. > > > > - Vhost's existing ioctls are used to setup the device. > > > > And the device will report device_api as "vfio-vhost". > > > > Note that, there are dirty hacks in this version. If we decide to > > go this way, some refactoring in vhost.c/vhost.h may be needed. > > > > PS. The direct mapping of the notify registers isn't implemented > > in this version. > > > > [1] https://lkml.org/lkml/2019/7/9/101 > > > > Signed-off-by: Tiwei Bie <tiwei.bie at intel.com> > > .... > > > +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd, > > + unsigned long arg) > > +{ > > + void __user *argp = (void __user *)arg; > > + struct vhost_mdev *vdpa; > > + unsigned long minsz; > > + int ret = 0; > > + > > + if (!mdev) > > + return -EINVAL; > > + > > + vdpa = mdev_get_drvdata(mdev); > > + if (!vdpa) > > + return -ENODEV; > > + > > + switch (cmd) { > > + case VFIO_DEVICE_GET_INFO: > > + { > > + struct vfio_device_info info; > > + > > + minsz = offsetofend(struct vfio_device_info, num_irqs); > > + > > + if (copy_from_user(&info, (void __user *)arg, minsz)) { > > + ret = -EFAULT; > > + break; > > + } > > + > > + if (info.argsz < minsz) { > > + ret = -EINVAL; > > + break; > > + } > > + > > + info.flags = VFIO_DEVICE_FLAGS_VHOST; > > + info.num_regions = 0; > > + info.num_irqs = 0; > > + > > + if (copy_to_user((void __user *)arg, &info, minsz)) { > > + ret = -EFAULT; > > + break; > > + } > > + > > + break; > > + } > > + case VFIO_DEVICE_GET_REGION_INFO: > > + case VFIO_DEVICE_GET_IRQ_INFO: > > + case VFIO_DEVICE_SET_IRQS: > > + case VFIO_DEVICE_RESET: > > + ret = -EINVAL; > > + break; > > + > > + case VHOST_MDEV_SET_STATE: > > + ret = vhost_set_state(vdpa, argp); > > + break; > > + case VHOST_GET_FEATURES: > > + ret = vhost_get_features(vdpa, argp); > > + break; > > + case VHOST_SET_FEATURES: > > + ret = vhost_set_features(vdpa, argp); > > + break; > > + case VHOST_GET_VRING_BASE: > > + ret = vhost_get_vring_base(vdpa, argp); > > + break; > > + default: > > + ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp); > > + if (ret == -ENOIOCTLCMD) > > + ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp); > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(vhost_mdev_ioctl); > > > I don't have a problem with this approach. A small question: > would it make sense to have two fds: send vhost ioctls > on one and vfio ioctls on another? > We can then pass vfio fd to the vhost fd with a > SET_BACKEND ioctl. > > What do you think?I like this idea! I will give it a try. So we can introduce /dev/vhost-mdev to have the vhost fd, and let userspace pass vfio fd to the vhost fd with a SET_BACKEND ioctl. Thanks a lot! Tiwei> > -- > MST
Jason Wang
2019-Sep-04 04:32 UTC
[RFC v3] vhost: introduce mdev based hardware vhost backend
On 2019/9/4 ??10:48, Tiwei Bie wrote:> On Tue, Sep 03, 2019 at 07:26:03AM -0400, Michael S. Tsirkin wrote: >> On Wed, Aug 28, 2019 at 01:37:12PM +0800, Tiwei Bie wrote: >>> Details about this can be found here: >>> >>> https://lwn.net/Articles/750770/ >>> >>> What's new in this version >>> =========================>>> >>> There are three choices based on the discussion [1] in RFC v2: >>> >>>> #1. We expose a VFIO device, so we can reuse the VFIO container/group >>>> based DMA API and potentially reuse a lot of VFIO code in QEMU. >>>> >>>> But in this case, we have two choices for the VFIO device interface >>>> (i.e. the interface on top of VFIO device fd): >>>> >>>> A) we may invent a new vhost protocol (as demonstrated by the code >>>> in this RFC) on VFIO device fd to make it work in VFIO's way, >>>> i.e. regions and irqs. >>>> >>>> B) Or as you proposed, instead of inventing a new vhost protocol, >>>> we can reuse most existing vhost ioctls on the VFIO device fd >>>> directly. There should be no conflicts between the VFIO ioctls >>>> (type is 0x3B) and VHOST ioctls (type is 0xAF) currently. >>>> >>>> #2. Instead of exposing a VFIO device, we may expose a VHOST device. >>>> And we will introduce a new mdev driver vhost-mdev to do this. >>>> It would be natural to reuse the existing kernel vhost interface >>>> (ioctls) on it as much as possible. But we will need to invent >>>> some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a >>>> choice, but it's too heavy and doesn't support vIOMMU by itself). >>> This version is more like a quick PoC to try Jason's proposal on >>> reusing vhost ioctls. And the second way (#1/B) in above three >>> choices was chosen in this version to demonstrate the idea quickly. >>> >>> Now the userspace API looks like this: >>> >>> - VFIO's container/group based IOMMU API is used to do the >>> DMA programming. >>> >>> - Vhost's existing ioctls are used to setup the device. >>> >>> And the device will report device_api as "vfio-vhost". >>> >>> Note that, there are dirty hacks in this version. If we decide to >>> go this way, some refactoring in vhost.c/vhost.h may be needed. >>> >>> PS. The direct mapping of the notify registers isn't implemented >>> in this version. >>> >>> [1] https://lkml.org/lkml/2019/7/9/101 >>> >>> Signed-off-by: Tiwei Bie <tiwei.bie at intel.com> >> .... >> >>> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd, >>> + unsigned long arg) >>> +{ >>> + void __user *argp = (void __user *)arg; >>> + struct vhost_mdev *vdpa; >>> + unsigned long minsz; >>> + int ret = 0; >>> + >>> + if (!mdev) >>> + return -EINVAL; >>> + >>> + vdpa = mdev_get_drvdata(mdev); >>> + if (!vdpa) >>> + return -ENODEV; >>> + >>> + switch (cmd) { >>> + case VFIO_DEVICE_GET_INFO: >>> + { >>> + struct vfio_device_info info; >>> + >>> + minsz = offsetofend(struct vfio_device_info, num_irqs); >>> + >>> + if (copy_from_user(&info, (void __user *)arg, minsz)) { >>> + ret = -EFAULT; >>> + break; >>> + } >>> + >>> + if (info.argsz < minsz) { >>> + ret = -EINVAL; >>> + break; >>> + } >>> + >>> + info.flags = VFIO_DEVICE_FLAGS_VHOST; >>> + info.num_regions = 0; >>> + info.num_irqs = 0; >>> + >>> + if (copy_to_user((void __user *)arg, &info, minsz)) { >>> + ret = -EFAULT; >>> + break; >>> + } >>> + >>> + break; >>> + } >>> + case VFIO_DEVICE_GET_REGION_INFO: >>> + case VFIO_DEVICE_GET_IRQ_INFO: >>> + case VFIO_DEVICE_SET_IRQS: >>> + case VFIO_DEVICE_RESET: >>> + ret = -EINVAL; >>> + break; >>> + >>> + case VHOST_MDEV_SET_STATE: >>> + ret = vhost_set_state(vdpa, argp); >>> + break; >>> + case VHOST_GET_FEATURES: >>> + ret = vhost_get_features(vdpa, argp); >>> + break; >>> + case VHOST_SET_FEATURES: >>> + ret = vhost_set_features(vdpa, argp); >>> + break; >>> + case VHOST_GET_VRING_BASE: >>> + ret = vhost_get_vring_base(vdpa, argp); >>> + break; >>> + default: >>> + ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp); >>> + if (ret == -ENOIOCTLCMD) >>> + ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp); >>> + } >>> + >>> + return ret; >>> +} >>> +EXPORT_SYMBOL(vhost_mdev_ioctl); >> >> I don't have a problem with this approach. A small question: >> would it make sense to have two fds: send vhost ioctls >> on one and vfio ioctls on another? >> We can then pass vfio fd to the vhost fd with a >> SET_BACKEND ioctl. >> >> What do you think? > I like this idea! I will give it a try. > So we can introduce /dev/vhost-mdev to have the vhost fd,You still need to think about how to connect it to current sysfs based mdev management interface, or you want to invent another API, or just use the /dev/vhost-net but pass vfio fd through ioctl to the file. Thanks> and let > userspace pass vfio fd to the vhost fd with a SET_BACKEND ioctl. > > Thanks a lot! > Tiwei > >> -- >> MST
Possibly Parallel Threads
- [RFC v3] vhost: introduce mdev based hardware vhost backend
- [RFC v3] vhost: introduce mdev based hardware vhost backend
- [RFC v3] vhost: introduce mdev based hardware vhost backend
- [RFC v3] vhost: introduce mdev based hardware vhost backend
- [RFC v3] vhost: introduce mdev based hardware vhost backend