On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz <alvaro.karsz at solid-run.com> wrote:
>
> I think that we should add a timeout to the control virtqueue commands.
> If the hypervisor crashes while handling a control command, the guest
> will spin forever.
> This may not be necessary in a virtual environment, where both the
> hypervisor and the guest OS run on the same bare-metal host, but it
> is needed for a physical network device compatible with VirtIO.
>
> (In these cases, the network device acts as the hypervisor, and the
> server acts as the guest OS).
>
> The network device may fail to answer a control command, or may crash,
> leading to a stall in the server.
>
> My idea is to add a big enough timeout, to allow the slow devices to
> complete the command.
>
> I wrote a simple patch that returns false from virtnet_send_command in
> case of timeouts.
>
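To make the bounded wait concrete, here is a minimal userspace sketch of
polling with a deadline. It is not the patch mentioned above and not
kernel code: poll_ctrl_with_timeout, ctrl_done_fn and the two simulated
devices are hypothetical names, and the done() callback only stands in
for "the device has returned the control buffer".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for "has the device returned the control buffer?" */
typedef bool (*ctrl_done_fn)(void *ctx);

/* Monotonic clock in nanoseconds, unaffected by wall-clock changes. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Busy-poll until done() reports completion or timeout_ns has elapsed.
 * Returns true on completion and false on timeout.
 */
static bool poll_ctrl_with_timeout(ctrl_done_fn done, void *ctx,
                                   uint64_t timeout_ns)
{
    uint64_t deadline;

    assert(done != NULL);
    deadline = now_ns() + timeout_ns;
    for (;;) {
        if (done(ctx))
            return true;
        if (now_ns() >= deadline)
            return false;
    }
}

/* Simulated device that never answers (models a crashed device). */
static bool device_never_answers(void *ctx)
{
    (void)ctx;
    return false;
}

/* Simulated slow device: answers once the poll counter in ctx reaches 0. */
static bool device_answers_after_n_polls(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

With device_never_answers the call returns false once the deadline
passes instead of spinning forever. What the caller should do after a
false return is the open question in the rest of this mail.
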
> The timeout approach introduces some serious problems in cases when
> the network device does answer the control command, but after the
> timeout.
>
> * The device will think that the command succeeded, while the server
> won't.
> This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
> The server may receive packets in an unexpected queue.
>
> * virtqueue_get_buf will return the previous response for the next
> control command.
>
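The second point above can be shown with a toy model of the used ring.
This is only a sketch under assumptions: toy_used_ring, toy_get_buf and
toy_get_matching_buf are hypothetical, and giving each command its own
token is an assumption made for illustration, not something the driver
is stated to do here.

```c
#include <assert.h>
#include <stddef.h>

#define USED_RING_SIZE 8

/* Toy model of the control queue's used ring: a FIFO of completed tokens. */
struct toy_used_ring {
    int tokens[USED_RING_SIZE];
    size_t head;
    size_t tail;
};

/* Device side: report the command identified by token as completed. */
static void toy_device_complete(struct toy_used_ring *r, int token)
{
    assert(r->tail - r->head < USED_RING_SIZE);
    r->tokens[r->tail++ % USED_RING_SIZE] = token;
}

/*
 * Driver side: pop the oldest completion, or return -1 if there is none
 * (loosely analogous to virtqueue_get_buf returning NULL).
 */
static int toy_get_buf(struct toy_used_ring *r)
{
    if (r->head == r->tail)
        return -1;
    return r->tokens[r->head++ % USED_RING_SIZE];
}

/*
 * Hypothetical mitigation: drop completions whose token does not match
 * the command being waited for. Returns 1 if the expected completion
 * was found, 0 if the ring ran empty first.
 */
static int toy_get_matching_buf(struct toy_used_ring *r, int expected)
{
    int token;

    while ((token = toy_get_buf(r)) != -1) {
        if (token == expected)
            return 1;
    }
    return 0;
}
```

If command 1 times out and the device completes it late, a plain
toy_get_buf call made while waiting for command 2 returns token 1, which
is the stale response. Matching on tokens would only address this
second point; it does nothing for the first one, where the device has
already applied a command that the server considers failed.
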
> Addressing this case by adding a timeout to the spec won't be easy,
> since the network device and the server have different clocks, and the
> server won't know when exactly the network device noticed the kick.
>
> So maybe we should call virtnet_remove if we reach a timeout.
Or reset. But can we simply use an interrupt instead of busy waiting?

Thanks

>
> Or maybe we can just assume that the network device crashed after a
> long timeout, and nothing should be done.
>
> What do you guys think?
>
> Alvaro
>