Hey,
I have a Linux host (Linux 4.12.5) running Linux guest VMs using the
virtio_net driver.
After upgrading the guests to a newer kernel, the guests started
experiencing network hangs every few hours.
Once a guest gets into this "hung" state, no data seems to reach its
network interface, even though the host can see packets supposedly
destined for it.
I bisected the problem to 8d622d21d248 ("virtio: fix up
virtio_disable_cb").
Reverting that commit in the guest's kernel resolves the issue described
above for my setup.
I've further verified that the only portion of the patch that I needed
to revert to fix my network hangs is:
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 992cb1cbec93..809ff4a58b8e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -742,10 +742,7 @@ static void virtqueue_disable_cb_split(struct virtqueue *_vq)
>
>          if (!(vq->split.avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
>                  vq->split.avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> -                if (vq->event)
> -                        /* TODO: this is a hack. Figure out a cleaner value to write. */
> -                        vring_used_event(&vq->split.vring) = 0x0;
> -                else
> +                if (!vq->event)
I also noticed it didn't reproduce for guests running on more modern
host kernels, so I also bisected which host kernels were affected,
which led me to
8d65843c4426 ('Revert "vhost: cache used event for better
performance"').
The commit it reverts (809ecb9bca6a) was present in 4.10 to 4.12
(inclusive), and I can only reproduce the above hang with hosts in that
version range.
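For anyone who wants a quick way to tell whether a host might be in that
range, here is a minimal sketch. It assumes an upstream-style release
string such as the one `uname -r` prints on the host, and the helper name
host_in_affected_range is made up for illustration. It is only a
heuristic: a distro kernel in this range may already carry the revert, so
a match means "worth checking", not "definitely affected".

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of any kernel or tool.
 * Returns 1 if the release string (e.g. "4.12.5") is in the 4.10 to 4.12
 * range where I could reproduce the hang, 0 if it is outside that range,
 * and -1 if the string does not start with "<major>.<minor>".
 */
static int host_in_affected_range(const char *release)
{
	unsigned int major, minor;

	if (sscanf(release, "%u.%u", &major, &minor) != 2)
		return -1;
	return major == 4 && minor >= 10 && minor <= 12;
}
```

Calling host_in_affected_range("4.12.5") flags my host, while a string
like "4.9.0" or "5.10.0" does not.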
From that, my current understanding is that "virtio: fix up
virtio_disable_cb" isn't buggy itself, but it triggers a bug in older
vhost drivers, which previously seemed to be latent (I at least never
ran into it before).
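For context on what the hunk above touches: when the event index feature
is negotiated, the device side decides whether to interrupt the guest by
comparing the guest-written used_event value against the used index it
just advanced. The sketch below restates that check as a standalone
function. It mirrors the vring_need_event() helper from the virtio ring
header as I understand it; the name need_event and the standalone form
are mine, and this is not a diagnosis of the vhost bug itself.

```c
#include <stdint.h>

/*
 * Standalone restatement of the event index check (my own naming).
 * event_idx: the used_event value the guest driver last wrote
 * new_idx:   the device's used index after adding buffers
 * old_idx:   the device's used index before adding buffers
 * Returns nonzero if the device should signal the guest, i.e. if
 * event_idx lies in the window [old_idx, new_idx) modulo 2^16.
 */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

With old_idx = 5 and new_idx = 6, an event_idx of 5 asks for a signal,
while an event_idx of 0 (the value the removed lines used to write) does
not. So the result depends on the device reading an up-to-date
used_event value.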
So, what's the point of this email?
Primarily, I want to just make y'all aware, and especially make sure
this info is out there for anyone else who runs into this interaction.
I couldn't find reports of anyone else hitting this bug, but surely
there are other people out there with old host kernels.
I also want to ask: should anything be done on the virtio guest driver
side of things?
I know this is an old vhost bug, and I'm personally just updating my
host (better late than never), but I'm also curious if there's some
prior art or policy here.
Do the virtio guest drivers try to avoid regressions for the last X host
kernel versions? For all host kernels with relevant vhost drivers?
Somewhere in between?
- Euan
P.S. Just wanted to throw in a thanks for the excellent virtio drivers!
The rest of the email's dry, but I do appreciate all the work going into 'em!