On Thu, May 12, 2022 at 3:44 AM Eugenio Perez Martin
<eperezma at redhat.com> wrote:
>
> This is a proposal to restore the state of the vhost-vdpa device at
> the destination after a live migration. It reuses as many features
> already available in the device and in qemu as possible, so the
> communication stays simple and the merging process is sped up.
When we finalize the design, we can formalize it in the kernel's Documentation/.
>
> # Initializing a vhost-vdpa device.
>
> Without the context of live migration, the steps to initialize the
> device from vhost-vdpa at qemu startup are (a raw-ioctl sketch of
> them follows the list):
> 1) [vhost] Open the vdpa device with a plain open().
> 2) [vhost+virtio] Get the device features. These are expected not to
> change during the device's lifetime, so we can save them. Qemu issues
> a VHOST_GET_FEATURES ioctl and vdpa forwards it to the backend driver
> through the get_device_features() callback.
For "virtio" do you mean it's an action that is defined in the
spec?
> 3) [vhost+virtio] Get its max_queue_pairs if _F_MQ and _F_CTRL_VQ are
> offered. It is obtained using VHOST_VDPA_GET_CONFIG, and that request
> is forwarded to the device using the get_config callback. QEMU expects
> the device not to change it during its lifetime.
> 4) [vhost] Set the vdpa device status (_S_ACKNOWLEDGE, _S_DRIVER);
> still no FEATURES_OK or DRIVER_OK. The ioctl is VHOST_VDPA_SET_STATUS,
> and the vdpa backend driver callback is set_status.
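>
> A minimal sketch of steps 1-4 as raw ioctls (error handling omitted;
> the device node path and the function name are only examples, not
> qemu's actual code):
>
>   #include <fcntl.h>
>   #include <stddef.h>
>   #include <string.h>
>   #include <sys/ioctl.h>
>   #include <linux/vhost.h>
>   #include <linux/virtio_config.h>
>   #include <linux/virtio_net.h>
>
>   static int vdpa_device_init(__u64 *features, __u16 *max_queue_pairs)
>   {
>       /* 1) open the vdpa device */
>       int fd = open("/dev/vhost-vdpa-0", O_RDWR);
>
>       /* 2) get the device features, saved for the device's lifetime */
>       ioctl(fd, VHOST_GET_FEATURES, features);
>
>       /* 3) read max_virtqueue_pairs from the device config space */
>       char buf[sizeof(struct vhost_vdpa_config) + sizeof(__u16)];
>       struct vhost_vdpa_config *cfg = (struct vhost_vdpa_config *)buf;
>       cfg->off = offsetof(struct virtio_net_config, max_virtqueue_pairs);
>       cfg->len = sizeof(__u16);
>       ioctl(fd, VHOST_VDPA_GET_CONFIG, cfg);
>       memcpy(max_queue_pairs, cfg->buf, sizeof(*max_queue_pairs));
>
>       /* 4) _S_ACKNOWLEDGE | _S_DRIVER, no FEATURES_OK / DRIVER_OK yet */
>       __u8 status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
>       ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
>       return fd;
>   }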
>
> These are the steps used to initialize the device, in qemu
> terminology, with some redundancies removed to keep them simple.
>
> Now the guest's driver sets FEATURES_OK and DRIVER_OK, qemu detects
> it, and so it *starts* the device.
>
> # Starting a vhost-vdpa device
>
> At virtio_net_vhost_status we have two important variables here:
> int cvq = _F_CTRL_VQ ? 1 : 0;
> int queue_pairs = _F_CTRL_VQ && _F_MQ ? (max_queue_pairs of step 3) : 0;
>
> Now comes the identification of the cvq index. Qemu *knows* that the
> device will expose it as the last queue (index max_queue_pairs * 2) if
> _F_MQ has been acknowledged by the guest's driver, or at index 2 if
> not. It cannot depend on any data sent to the device via the cvq,
> because we could not get the command status if the index changed.
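>
> Just as a minimal illustration of that computation (the helper name is
> ours, not qemu's):
>
>   #include <stdbool.h>
>
>   /* Data vqs occupy indexes 0 .. max_queue_pairs * 2 - 1, so the cvq
>    * comes right after them; without _F_MQ there is a single queue
>    * pair (indexes 0 and 1) and the cvq sits at index 2. */
>   static unsigned int net_cvq_index(bool mq_acked, unsigned int max_queue_pairs)
>   {
>       return mq_acked ? max_queue_pairs * 2 : 2;
>   }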
>
> Now we start the vhost device. The current workflow is (a raw-ioctl
> sketch follows the list):
>
> 5) [virtio+vhost] The first step is to send the acknowledgement of the
> virtio features and the vhost/vdpa backend features to the device, so
> it knows how to configure itself. This is done using the same calls as
> step 4 with these feature bits added.
> 6) [virtio] Set the size, base, addr, kick and call fd for each queue
> (SET_VRING_ADDR, SET_VRING_NUM, ...; forwarded with set_vq_address,
> set_vq_state, ...).
> 7) [vdpa] Set up the host notifiers and *send SET_VRING_ENABLE = 1*
> for each queue. The ioctl is VHOST_VDPA_SET_VRING_ENABLE, forwarded to
> the vdpa backend using the set_vq_ready callback.
> 8) [virtio + vdpa] Send memory translations & set DRIVER_OK.
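>
> A hedged sketch of steps 5-8 for a single virtqueue, again as raw
> ioctls, with error handling and the IOTLB (memory translation) updates
> left out; every parameter comes from the caller:
>
>   #include <sys/ioctl.h>
>   #include <linux/vhost.h>
>   #include <linux/virtio_config.h>
>
>   static void vdpa_start_vq(int fd, __u64 features, __u64 backend_features,
>                             unsigned int idx, unsigned int ring_size,
>                             __u64 desc, __u64 avail, __u64 used,
>                             int kick_fd, int call_fd)
>   {
>       /* 5) acknowledge the virtio and vhost/vdpa backend features */
>       ioctl(fd, VHOST_SET_FEATURES, &features);
>       ioctl(fd, VHOST_SET_BACKEND_FEATURES, &backend_features);
>
>       /* 6) size, base, addresses, kick and call fd of the vring */
>       struct vhost_vring_state num = { .index = idx, .num = ring_size };
>       ioctl(fd, VHOST_SET_VRING_NUM, &num);
>       struct vhost_vring_state base = { .index = idx, .num = 0 };
>       ioctl(fd, VHOST_SET_VRING_BASE, &base);
>       struct vhost_vring_addr addr = {
>           .index = idx, .desc_user_addr = desc,
>           .avail_user_addr = avail, .used_user_addr = used,
>       };
>       ioctl(fd, VHOST_SET_VRING_ADDR, &addr);
>       struct vhost_vring_file kick = { .index = idx, .fd = kick_fd };
>       ioctl(fd, VHOST_SET_VRING_KICK, &kick);
>       struct vhost_vring_file call = { .index = idx, .fd = call_fd };
>       ioctl(fd, VHOST_SET_VRING_CALL, &call);
>
>       /* 7) enable the vring (set_vq_ready in the backend driver) */
>       struct vhost_vring_state enable = { .index = idx, .num = 1 };
>       ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &enable);
>
>       /* 8) memory translations (not shown), then DRIVER_OK */
>       __u8 status;
>       ioctl(fd, VHOST_VDPA_GET_STATUS, &status);
>       status |= VIRTIO_CONFIG_S_DRIVER_OK;
>       ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
>   }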
>
> If we follow the current workflow, the device is now allowed to start
> receiving, but only on vq pair 0, since the multiqueue state has not
> been restored yet. This could cause the guest to receive packets in
> unexpected queues, breaking RSS.
>
> # Proposal
>
> Our proposal diverges at step 7: instead of enabling *all* the
> virtqueues, only enable the CVQ. After that, send the DRIVER_OK and
> queue all the control commands needed to restore the device state (MQ,
> RSS, ...). Once all of them have been acknowledged (that is, the
> "device", or the emulated cvq in the host vdpa backend driver, has
> used all the cvq buffers), enable (SET_VRING_ENABLE, set_vq_ready) all
> the other queues.
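>
> A hedged sketch of the proposed ordering, reusing fd and cvq_idx from
> the sketches above and a hypothetical restore_net_state_via_cvq()
> helper that queues the control commands and waits until the device has
> used all the cvq buffers:
>
>   /* 7) enable only the control virtqueue */
>   struct vhost_vring_state cvq_en = { .index = cvq_idx, .num = 1 };
>   ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &cvq_en);
>
>   /* 8) DRIVER_OK, then queue the MQ/RSS/... commands on the cvq */
>   __u8 status;
>   ioctl(fd, VHOST_VDPA_GET_STATUS, &status);
>   status |= VIRTIO_CONFIG_S_DRIVER_OK;
>   ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
>   restore_net_state_via_cvq(fd);   /* hypothetical helper */
>
>   /* only then enable the remaining (data) virtqueues */
>   for (unsigned int i = 0; i < cvq_idx; i++) {
>       struct vhost_vring_state en = { .index = i, .num = 1 };
>       ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &en);
>   }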
>
> Everything needed for this is already implemented in the kernel as far
> as I can see; only a small modification in qemu is needed. Thus the
> device state is restored without creating a maintenance burden.
Yes, one of the major motivations is to try to reuse the existing APIs
as much as possible as a start. That doesn't mean we can't invent a new
API, and having a dedicated save/restore uAPI looks fine. But that
looks more like work that needs to be finalized in the virtio spec
first, otherwise we may end up with code that is hard to maintain.
Thanks
>
> A lot of optimizations can be applied on top without adding anything
> to the migration protocol or the vDPA uAPI, like pre-warming the vdpa
> queues or adding more capabilities to the emulated CVQ.
>
> Other optimizations, like applying the state out of band, can also be
> added so they run in parallel with the migration, but that requires a
> bigger change in the qemu migration protocol and, in my opinion, would
> make us lose focus on achieving at least basic device migration.
>
> Thoughts?
>