thr3ads.net - Virtualization - [PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find

If this information is useful, please help other people find it:
Share via:

Guenter Roeck

2022-Aug-15 20:50 UTC

[PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"

On Mon, Aug 15, 2022 at 04:42:51PM -0400, Michael S. Tsirkin
wrote:> On Mon, Aug 15, 2022 at 01:34:26PM -0700, Guenter Roeck wrote:
> > On Mon, Aug 15, 2022 at 05:16:50AM -0400, Michael S. Tsirkin wrote:
> > > This reverts commit 762faee5a2678559d3dc09d95f8f2c54cd0466a7.
> > > 
> > > This has been reported to trip up guests on GCP (Google Cloud). 
Why is
> > > not yet clear - to be debugged, but the patch itself has several
other
> > > issues:
> > > 
> > > - It treats unknown speed as < 10G
> > > - It leaves userspace no way to find out the ring size set by
hypervisor
> > > - It tests speed when link is down
> > > - It ignores the virtio spec advice:
> > >         Both \field{speed} and \field{duplex} can change, thus
the driver
> > >         is expected to re-read these values after receiving a
> > >         configuration change notification.
> > > - It is not clear the performance impact has been tested properly
> > > 
> > > Revert the patch for now.
> > > 
> > > Link:
https://lore.kernel.org/r/20220814212610.GA3690074%40roeck-us.net
> > > Link:
https://lore.kernel.org/r/20220815070203.plwjx7b3cyugpdt7%40awork3.anarazel.de
> > > Link:
https://lore.kernel.org/r/3df6bb82-1951-455d-a768-e9e1513eb667%40www.fastmail.com
> > > Link:
https://lore.kernel.org/r/FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE%40anarazel.de
> > > Fixes: 762faee5a267 ("virtio_net: set the default max ring
size by find_vqs()")
> > > Cc: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
> > > Cc: Jason Wang <jasowang at redhat.com>
> > > Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
> > > Tested-by: Andres Freund <andres at anarazel.de>
> > 
> > I ran this patch through a total of 14 syskaller tests, 2 test runs
each on
> > 7 different crashes reported by syzkaller (as reported to the
linux-kernel
> > mailing list). No problems were reported. I also ran a single
cross-check
> > with one of the syzkaller runs on top of v6.0-rc1, without this patch.
> > That test run failed.
> > 
> > Overall, I think we can call this fixed.
> > 
> > Guenter
> 
> It's more of a work around though since we don't yet have the root
> cause for this. I suspect a GCP hypervisor bug at the moment.
> This is excercising a path we previously only took on GFP_KERNEL
> allocation failures during probe, I don't think that happens a lot.
>
Even a hypervisor bug should not trigger crashes like this one,
though, or at least I think so. Any idea what to look for on the
hypervisor side, and/or what it might be doing wrong ?

Thanks,
Guenter

Michael S. Tsirkin

2022-Aug-15 21:04 UTC

head link

[PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"

On Mon, Aug 15, 2022 at 01:50:53PM -0700, Guenter Roeck
wrote:> On Mon, Aug 15, 2022 at 04:42:51PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Aug 15, 2022 at 01:34:26PM -0700, Guenter Roeck wrote:
> > > On Mon, Aug 15, 2022 at 05:16:50AM -0400, Michael S. Tsirkin
wrote:
> > > > This reverts commit
762faee5a2678559d3dc09d95f8f2c54cd0466a7.
> > > > 
> > > > This has been reported to trip up guests on GCP (Google
Cloud).  Why is
> > > > not yet clear - to be debugged, but the patch itself has
several other
> > > > issues:
> > > > 
> > > > - It treats unknown speed as < 10G
> > > > - It leaves userspace no way to find out the ring size set
by hypervisor
> > > > - It tests speed when link is down
> > > > - It ignores the virtio spec advice:
> > > >         Both \field{speed} and \field{duplex} can change,
thus the driver
> > > >         is expected to re-read these values after receiving
a
> > > >         configuration change notification.
> > > > - It is not clear the performance impact has been tested
properly
> > > > 
> > > > Revert the patch for now.
> > > > 
> > > > Link:
https://lore.kernel.org/r/20220814212610.GA3690074%40roeck-us.net
> > > > Link:
https://lore.kernel.org/r/20220815070203.plwjx7b3cyugpdt7%40awork3.anarazel.de
> > > > Link:
https://lore.kernel.org/r/3df6bb82-1951-455d-a768-e9e1513eb667%40www.fastmail.com
> > > > Link:
https://lore.kernel.org/r/FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE%40anarazel.de
> > > > Fixes: 762faee5a267 ("virtio_net: set the default max
ring size by find_vqs()")
> > > > Cc: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
> > > > Cc: Jason Wang <jasowang at redhat.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst at redhat.com>
> > > > Tested-by: Andres Freund <andres at anarazel.de>
> > > 
> > > I ran this patch through a total of 14 syskaller tests, 2 test
runs each on
> > > 7 different crashes reported by syzkaller (as reported to the
linux-kernel
> > > mailing list). No problems were reported. I also ran a single
cross-check
> > > with one of the syzkaller runs on top of v6.0-rc1, without this
patch.
> > > That test run failed.
> > > 
> > > Overall, I think we can call this fixed.
> > > 
> > > Guenter
> > 
> > It's more of a work around though since we don't yet have the
root
> > cause for this. I suspect a GCP hypervisor bug at the moment.
> > This is excercising a path we previously only took on GFP_KERNEL
> > allocation failures during probe, I don't think that happens a
lot.
> >
> 
> Even a hypervisor bug should not trigger crashes like this one,
> though, or at least I think so. Any idea what to look for on the
> hypervisor side, and/or what it might be doing wrong ?
> 
> Thanks,
> Guenter
Sure!
So virtio has a queue_size register. When read, it will give you
originally the maximum queue size. Normally we just read it and
use it as queue size.

However, when queue memory allocation fails, and unconditionally with a
network device with the problematic patch, driver is asking the
hypervisor to make the ring smaller by writing a smaller value into this
register.

I suspect that what happens is hypervisor still uses the original value
somewhere. Then it things the ring is bigger than what driver allocated.
If we get lucky and nothing important landed in the several pages
covered by the larger ting, then the only effect is that driver does not
see the data hypervisor writes in the ring, and this is the network
failures observed - most likely DHCP responses get lost and
guest never gets an IP. OTOH if something important lands there then when
hypervisor overwrites that memory it gets corrupted and we get crashes.


-- 
MST

Virtualization - Aug 2022 - [PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"

[PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"

[PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"