On Wed, Aug 11, 2021 at 11:38:59AM +0800, Jason Wang wrote:
> On Tue, Aug 10, 2021 at 11:31 PM Michael S. Tsirkin <mst at redhat.com> wrote:
> >
> > On Mon, Aug 02, 2021 at 04:23:12PM -0500, Ivan wrote:
> > > On Mon, Aug 2, 2021 at 2:52 PM Michael S. Tsirkin <mst at redhat.com> wrote:
> > > >
> > > > On Mon, Aug 02, 2021 at 01:32:05PM -0500, Ivan wrote:
> > > > > On Tue, Jul 27, 2021 at 4:11 AM Michael S. Tsirkin <mst at redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jul 26, 2021 at 07:44:43PM -0500, Ivan wrote:
> > > > > > > On Sat, Jul 24, 2021 at 11:18 PM Ivan <ivan at prestigetransportation.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Jul 24, 2021 at 7:17 PM Ivan <ivan at prestigetransportation.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jul 23, 2021 at 7:33 AM Ivan <ivan at prestigetransportation.com> wrote:
> > > > > > > > >>
> > > > > > > > >> On Fri, Jul 23, 2021 at 7:10 AM Michael S. Tsirkin <mst at redhat.com> wrote:
> > > > > > > > >>>
> > > > > > > > >>> On Fri, Jul 23, 2021 at 03:06:04AM -0500, Ivan wrote:
> > > > > > > > >>> > On Fri, Jul 23, 2021 at 2:59 AM Michael S. Tsirkin <mst at redhat.com> wrote:
> > > > > > > > >>> > >
> > > > > > > > >>> > > On Thu, Jul 22, 2021 at 11:50:11PM -0500, Ivan wrote:
> > > > > > > > >>> > > > On Thu, Jul 22, 2021 at 11:25 PM Jason Wang <jasowang at redhat.com> wrote:
> > > > > > > > >>> > > > > On 2021/7/23 10:54 AM, Ivan wrote:
> > > > > > > > >>> > > > > > On Thu, Jul 22, 2021 at 9:37 PM Jason Wang <jasowang at redhat.com> wrote:
> > > > > > > > >>> > > > > >> Does it work if you turn off lro before enabling the forwarding?
> > > > > > > > >>> > > > > > 0 root at NuRaid:~# ethtool -K eth0 lro off
> > > > > > > > >>> > > > > > Actual changes:
> > > > > > > > >>> > > > > > rx-lro: on [requested off]
> > > > > > > > >>> > > > > > Could not change any device features
> > > > > > > > >>> > > > > >
> > > > > > > > >>> > > > > Ok, it looks like the device misses the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> > > > > > > > >>> > > > > which makes it impossible to change the LRO setting.
> > > > > > > > >>> > > > >
> > > > > > > > >>> > > > > Did you use qemu? If yes, what's the qemu version you've used?
> > > > > > > > >>> > > > >
> > > > > > > > >>> > > > These are VirtualBox machines, which I've been using for years with
> > > > > > > > >>> > > > longterm kernels 4.19, and I never had such a problem. But now that I
> > > > > > > > >>> > > > tried upgrading to kernels 5.10 or 5.13 -- the panics started. These
> > > > > > > > >>> > > > are just generic kernel builds, and a minimalistic userspace.
> > > > > > > > >>> > > >
> > > > > > > > >>> > > It would be useful to see the features your virtualbox instance provides
> > > > > > > > >>> > > >
> > > > > > > > >>> > > cat /sys/class/net/eth0/device/features
> > > > > > > > >>> > >
> > > > > > > > >>> > # cat /sys/class/net/eth0/device/features
> > > > > > > > >>> > 1100010110111011111100000000000000000000000000000000000000000000
> > > > > > > > >>> >
> > > > > > > > >>> I was able to reproduce the warning but not the panic.
> > > > > > > > >>> OTOH if LRO stays on when enabling forwarding that
> > > > > > > > >>> is already a problem. Any chance you can bisect to
> > > > > > > > >>> find out which change introduced the panic?
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> Any kernels up to 4.19.198 don't panic.
> > > > > > > > >> Any kernels 5.10+ panic immediately upon starting forwarding.
> > > > > > > > >> I have not tested any kernels between 4.19 and 5.10.
> > > > > > > > >> I guess I can build a few kernels in between, and try to pinpoint where it starts.
> > > > > > > > >> That may take a day or so. I'll get on with it now, and report my findings.
> > > > > > > > >
> > > > > > > > > So, I narrowed it down: the panics start with kernel 5.0-rc.
> > > > > > > >
> > > > > > > > More narrowly, the problem seems to be coming from commit
> > > > > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56.
> > > > > > > > Just to test my suspicion, I deleted a few lines from that code,
> > > > > > > > and the panic went away. Hope that helps you guys figure out
> > > > > > > > what the problem might be.
> > > > > >
> > > > > > Well it disables LRO but we knew this :( It'd help if we knew
> > > > > > where it panics; all we see is the warning, which is
> > > > > > related for sure but not the immediate root cause ...
> > > > > >
> > > > > > > >
> > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > @@ -2978,11 +2978,6 @@
> > > > > > > >         }
> > > > > > > >         if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> > > > > > > >                 dev->features |= NETIF_F_RXCSUM;
> > > > > > > > -       if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> > > > > > > > -           virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6))
> > > > > > > > -               dev->features |= NETIF_F_LRO;
> > > > > > > > -       if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
> > > > > > > > -               dev->hw_features |= NETIF_F_LRO;
> > > > > > > >
> > > > > > > >         dev->vlan_features = dev->features;
> > > > > > >
> > > > > > > Just FYI, Google turned up two similar bug reports...
> > > > > > > Apr 14, 2020 -- https://github.com/containers/podman/issues/5815
> > > > > > > Oct 09, 2020 -- https://bugzilla.kernel.org/show_bug.cgi?id=209593
> > > > > > >
> > > > > > > Is there any sensible thing I could do, temporarily, until this
> > > > > > > problem is sorted out?
> > > > > > > Or am I simply stuck to kernels 4.19 on these machines for now?
> > > > > >
> > > > > >
> > > > > > Something like this I guess:
> > > > > >
> > > > > >
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index 8a58a2f013af..cc5982193a40 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -3063,6 +3063,8 @@ static int virtnet_validate(struct virtio_device *vdev)
> > > > > >                 __virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
> > > > > >         }
> > > > > >
> > > > > > +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO4);
> > > > > > +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO6);
> > > > > >         return 0;
> > > > > >  }
> > > > >
> > > > > When I apply your patch, then I see drastic (more than half)
> > > > > reductions in speed. (confirmed with iperf).
> > > > >
> > > > > But if instead I just remove a few lines from commit
> > > > > a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
> > > > > as in my earlier post, then I'm back to full speed
> > > > >
> > > > > I understand that this is just a temporary workaround, until we figure this out.
> > > >
> > > >
> > > > Oh weird. So it's not about getting some weird LRO packet. We will get it with
> > > > VIRTIO_NET_F_GUEST_TSO4 anyway. It's about the LRO flag being set in
> > > > features.
> > > >
> > > > How about this then? Just pretend to Linux that we disabled LRO.
> > > >
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 8a58a2f013af..8e7e4cea176b 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -2651,8 +2651,9 @@ static int virtnet_set_features(struct net_device *dev,
> > > >                                 ~GUEST_OFFLOAD_LRO_MASK;
> > > >
> > > >                 err = virtnet_set_guest_offloads(vi, offloads);
> > > > -               if (err)
> > > > -                       return err;
> > > > +               WARN_ON(err);
> > > > +               //if (err)
> > > > +               //      return err;
> > > >                 vi->guest_offloads = offloads;
> > > >         }
> > >
> > > No. With this applied, the problem persists:
> > >
> > > # echo "1" > /proc/sys/net/ipv4/ip_forward
> > >
> > > kernel: ------------[ cut here ]------------
> > > kernel: netdevice: eth0: failed to disable LRO!
> > > kernel: WARNING: CPU: 0 PID: 452 at net/core/dev.c:1768
> > > dev_disable_lro+0x108/0x150
> > > kernel: Modules linked in: sg nls_iso8859_1 nls_cp437 vfat fat
> > > hid_generic usbhid hid virtio_net net_failover failover aesni_intel
> > > libaes crypto_simd ohci_pci ahci libahci cryptd rapl ehci_pci ohci_hcd
> > > ehci_hcd usbcore usb_common libata evdev lpc_ich mfd_core rng_core
> > > i2c_piix4 i2c_core virtio_pci virtio_pci_modern_dev virtio_ring virtio
> > > rtc_cmos atkbd libps2 i8042 serio battery ac button loop unix
> > > kernel: CPU: 0 PID: 452 Comm: bash Not tainted 5.13.7-gnu.1-NuMini #1
> > > kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> > > VirtualBox 12/01/2006
> > > kernel: RIP: 0010:dev_disable_lro+0x108/0x150
> >
> > Again the warning isn't a big deal. I agree we should address - Jason
> > any update?
>
> I still think using NETIF_F_LRO might not be correct. Since we're
> basically receiving GSO packets.
>
> And it might cause a lot of issues if the device doesn't have
> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
>
> I see two possible fixes:
>
> 1) using NETIF_F_GRO_HW instead (the patch is attached)

It's unfortunate you didn't inline. Anyway.
Ivan could you test the patch and report?

>
> or

Hmm. I am not sure we always preserve the GRO_HW requirement that
packets can be re-segmented to reconstruct the original packet stream.
Do all backends guarantee this?
Could you explain why?

> 2) set NETIF_F_LRO only if the device has CTRL_GUEST_OFFLOADS
>
> Thanks

This one would slow guests on old hosts down significantly.

I am not sure why this didn't trigger previously btw -
we used not to have CTRL_GUEST_OFFLOADS after all.

> > But the main issue is you lose connectivity. That still
> > persists with this? Can't you get a serial connection
> > out? I know qemu Did the kernel oops afterwards?
> >
> > --
> > MST
> >
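For readers following along, here is a minimal C sketch of how the sysfs feature string Ivan posted earlier in the thread can be read. It assumes the string lists feature bit 0 first, one character per bit, and uses the bit numbers from the virtio-net specification. The helper names feature_set() and lro_stuck_on() are made up for this illustration and are not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bit numbers as defined by the virtio-net specification. */
enum {
    VIRTIO_NET_F_GUEST_CSUM          = 1,
    VIRTIO_NET_F_CTRL_GUEST_OFFLOADS = 2,
    VIRTIO_NET_F_GUEST_TSO4          = 7,
    VIRTIO_NET_F_GUEST_TSO6          = 8,
};

/* Hypothetical helper: test one bit in a sysfs-style feature string,
 * where character i is '1' when feature bit i is offered. */
static bool feature_set(const char *sysfs_bits, unsigned int bit)
{
    return bit < strlen(sysfs_bits) && sysfs_bits[bit] == '1';
}

/* Hypothetical helper: true when the device offers guest TSO (so the
 * driver discussed in this thread sets the LRO flag) but lacks
 * CTRL_GUEST_OFFLOADS (so the flag cannot be cleared at runtime). */
static bool lro_stuck_on(const char *sysfs_bits)
{
    bool guest_tso = feature_set(sysfs_bits, VIRTIO_NET_F_GUEST_TSO4) ||
                     feature_set(sysfs_bits, VIRTIO_NET_F_GUEST_TSO6);

    return guest_tso &&
           !feature_set(sysfs_bits, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS);
}
```

Passing the VirtualBox string from the thread to lro_stuck_on() should return true under these assumptions, which matches the "rx-lro: on [requested off]" result from ethtool.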
On 2021/8/11 3:39 PM, Michael S. Tsirkin wrote:
> On Wed, Aug 11, 2021 at 11:38:59AM +0800, Jason Wang wrote:
>> On Tue, Aug 10, 2021 at 11:31 PM Michael S. Tsirkin <mst at redhat.com> wrote:
>>> On Mon, Aug 02, 2021 at 04:23:12PM -0500, Ivan wrote:
>>>> On Mon, Aug 2, 2021 at 2:52 PM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>> On Mon, Aug 02, 2021 at 01:32:05PM -0500, Ivan wrote:
>>>>>> On Tue, Jul 27, 2021 at 4:11 AM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>>> On Mon, Jul 26, 2021 at 07:44:43PM -0500, Ivan wrote:
>>>>>>>> On Sat, Jul 24, 2021 at 11:18 PM Ivan <ivan at prestigetransportation.com> wrote:
>>>>>>>>> On Sat, Jul 24, 2021 at 7:17 PM Ivan <ivan at prestigetransportation.com> wrote:
>>>>>>>>>> On Fri, Jul 23, 2021 at 7:33 AM Ivan <ivan at prestigetransportation.com> wrote:
>>>>>>>>>>> On Fri, Jul 23, 2021 at 7:10 AM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>>>>>>>> On Fri, Jul 23, 2021 at 03:06:04AM -0500, Ivan wrote:
>>>>>>>>>>>>> On Fri, Jul 23, 2021 at 2:59 AM Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 11:50:11PM -0500, Ivan wrote:
>>>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 11:25 PM Jason Wang <jasowang at redhat.com> wrote:
>>>>>>>>>>>>>>>> On 2021/7/23 10:54 AM, Ivan wrote:
>>>>>>>>>>>>>>>>> On Thu, Jul 22, 2021 at 9:37 PM Jason Wang <jasowang at redhat.com> wrote:
>>>>>>>>>>>>>>>>>> Does it work if you turn off lro before enabling the forwarding?
>>>>>>>>>>>>>>>>> 0 root at NuRaid:~# ethtool -K eth0 lro off
>>>>>>>>>>>>>>>>> Actual changes:
>>>>>>>>>>>>>>>>> rx-lro: on [requested off]
>>>>>>>>>>>>>>>>> Could not change any device features
>>>>>>>>>>>>>>>> Ok, it looks like the device misses the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>>>>>>>>>>>>>>>> which makes it impossible to change the LRO setting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Did you use qemu? If yes, what's the qemu version you've used?
>>>>>>>>>>>>>>> These are VirtualBox machines, which I've been using for years with
>>>>>>>>>>>>>>> longterm kernels 4.19, and I never had such a problem. But now that I
>>>>>>>>>>>>>>> tried upgrading to kernels 5.10 or 5.13 -- the panics started. These
>>>>>>>>>>>>>>> are just generic kernel builds, and a minimalistic userspace.
>>>>>>>>>>>>>> It would be useful to see the features your virtualbox instance provides
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cat /sys/class/net/eth0/device/features
>>>>>>>>>>>>> # cat /sys/class/net/eth0/device/features
>>>>>>>>>>>>> 1100010110111011111100000000000000000000000000000000000000000000
>>>>>>>>>>>> I was able to reproduce the warning but not the panic.
>>>>>>>>>>>> OTOH if LRO stays on when enabling forwarding that
>>>>>>>>>>>> is already a problem. Any chance you can bisect to
>>>>>>>>>>>> find out which change introduced the panic?
>>>>>>>>>>>
>>>>>>>>>>> Any kernels up to 4.19.198 don't panic.
>>>>>>>>>>> Any kernels 5.10+ panic immediately upon starting forwarding.
>>>>>>>>>>> I have not tested any kernels between 4.19 and 5.10.
>>>>>>>>>>> I guess I can build a few kernels in between, and try to pinpoint where it starts.
>>>>>>>>>>> That may take a day or so. I'll get on with it now, and report my findings.
>>>>>>>>>> So, I narrowed it down: the panics start with kernel 5.0-rc.
>>>>>>>>> More narrowly, the problem seems to be coming from commit
>>>>>>>>> a02e8964eaf9271a8a5fcc0c55bd13f933bafc56.
>>>>>>>>> Just to test my suspicion, I deleted a few lines from that code,
>>>>>>>>> and the panic went away. Hope that helps you guys figure out
>>>>>>>>> what the problem might be.
>>>>>>> Well it disables LRO but we knew this :( It'd help if we knew
>>>>>>> where it panics; all we see is the warning, which is
>>>>>>> related for sure but not the immediate root cause ...
>>>>>>>
>>>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>>>> @@ -2978,11 +2978,6 @@
>>>>>>>>>         }
>>>>>>>>>         if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>>>>>>>>>                 dev->features |= NETIF_F_RXCSUM;
>>>>>>>>> -       if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>>>>>>>>> -           virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6))
>>>>>>>>> -               dev->features |= NETIF_F_LRO;
>>>>>>>>> -       if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))
>>>>>>>>> -               dev->hw_features |= NETIF_F_LRO;
>>>>>>>>>
>>>>>>>>>         dev->vlan_features = dev->features;
>>>>>>>> Just FYI, Google turned up two similar bug reports...
>>>>>>>> Apr 14, 2020 -- https://github.com/containers/podman/issues/5815
>>>>>>>> Oct 09, 2020 -- https://bugzilla.kernel.org/show_bug.cgi?id=209593
>>>>>>>>
>>>>>>>> Is there any sensible thing I could do, temporarily, until this
>>>>>>>> problem is sorted out?
>>>>>>>> Or am I simply stuck to kernels 4.19 on these machines for now?
>>>>>>>
>>>>>>> Something like this I guess:
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 8a58a2f013af..cc5982193a40 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -3063,6 +3063,8 @@ static int virtnet_validate(struct virtio_device *vdev)
>>>>>>>                 __virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
>>>>>>>         }
>>>>>>>
>>>>>>> +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO4);
>>>>>>> +       __virtio_clear_bit(vdev, VIRTIO_NET_F_GUEST_TSO6);
>>>>>>>         return 0;
>>>>>>>  }
>>>>>> When I apply your patch, then I see drastic (more than half)
>>>>>> reductions in speed. (confirmed with iperf).
>>>>>>
>>>>>> But if instead I just remove a few lines from commit
>>>>>> a02e8964eaf9271a8a5fcc0c55bd13f933bafc56
>>>>>> as in my earlier post, then I'm back to full speed
>>>>>>
>>>>>> I understand that this is just a temporary workaround, until we figure this out.
>>>>>
>>>>> Oh weird. So it's not about getting some weird LRO packet. We will get it with
>>>>> VIRTIO_NET_F_GUEST_TSO4 anyway. It's about the LRO flag being set in
>>>>> features.
>>>>>
>>>>> How about this then? Just pretend to Linux that we disabled LRO.
>>>>>
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 8a58a2f013af..8e7e4cea176b 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -2651,8 +2651,9 @@ static int virtnet_set_features(struct net_device *dev,
>>>>>                                 ~GUEST_OFFLOAD_LRO_MASK;
>>>>>
>>>>>                 err = virtnet_set_guest_offloads(vi, offloads);
>>>>> -               if (err)
>>>>> -                       return err;
>>>>> +               WARN_ON(err);
>>>>> +               //if (err)
>>>>> +               //      return err;
>>>>>                 vi->guest_offloads = offloads;
>>>>>         }
>>>> No. With this applied, the problem persists:
>>>>
>>>> # echo "1" > /proc/sys/net/ipv4/ip_forward
>>>>
>>>> kernel: ------------[ cut here ]------------
>>>> kernel: netdevice: eth0: failed to disable LRO!
>>>> kernel: WARNING: CPU: 0 PID: 452 at net/core/dev.c:1768
>>>> dev_disable_lro+0x108/0x150
>>>> kernel: Modules linked in: sg nls_iso8859_1 nls_cp437 vfat fat
>>>> hid_generic usbhid hid virtio_net net_failover failover aesni_intel
>>>> libaes crypto_simd ohci_pci ahci libahci cryptd rapl ehci_pci ohci_hcd
>>>> ehci_hcd usbcore usb_common libata evdev lpc_ich mfd_core rng_core
>>>> i2c_piix4 i2c_core virtio_pci virtio_pci_modern_dev virtio_ring virtio
>>>> rtc_cmos atkbd libps2 i8042 serio battery ac button loop unix
>>>> kernel: CPU: 0 PID: 452 Comm: bash Not tainted 5.13.7-gnu.1-NuMini #1
>>>> kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
>>>> VirtualBox 12/01/2006
>>>> kernel: RIP: 0010:dev_disable_lro+0x108/0x150
>>> Again the warning isn't a big deal. I agree we should address - Jason
>>> any update?
>> I still think using NETIF_F_LRO might not be correct. Since we're
>> basically receiving GSO packets.
>>
>> And it might cause a lot of issues if the device doesn't have
>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
>>
>> I see two possible fixes:
>>
>> 1) using NETIF_F_GRO_HW instead (the patch is attached)
> It's unfortunate you didn't inline. Anyway.
> Ivan could you test the patch and report?
>
>> or
> Hmm. I am not sure we always preserve the GRO_HW requirement that
> packets can be re-segmented to reconstruct the original packet stream.
> Do all backends guarantee this?

I think we can't.

> Could you explain why?

Or we probably need another new netdev feature like rx-gso?

>
>
>
>> 2) set NETIF_F_LRO only if the device has CTRL_GUEST_OFFLOADS
>>
>> Thanks
>
> This one would slow guests on old hosts down significantly.

Actually, it's not this proposal but see below.

>
> I am not sure why this didn't trigger previously

It looks to me like it was caused by
a02e8964eaf9271a8a5fcc0c55bd13f933bafc56 ("virtio-net: ethtool
configurable LRO").

Before this commit we didn't even advertise NETIF_F_LRO, so
dev_disable_lro() wouldn't warn. After this commit, we advertise LRO and
dev_disable_lro() will try to disable all guest offloads, which will:

1) slow the traffic
2) warn if "lro" can't be disabled on the device without ctrl guest
   offloads (e.g. the VirtualBox host)

Thanks

> btw -
> we used not to have CTRL_GUEST_OFFLOADS after all.
>
>
>
>>> But the main issue is you lose connectivity. That still
>>> persists with this? Can't you get a serial connection
>>> out? I know qemu Did the kernel oops afterwards?
>>>
>>> --
>>> MST
>>>
>
>
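The before/after behaviour described above, together with fix 2), can be put into a small user-space C model. This is only a sketch of the reasoning in this thread, not driver code: the DEV_* flags, the lro_policy enum and both functions are invented for the illustration, and "changeable" stands in for NETIF_F_LRO being present in hw_features.

```c
#include <stdbool.h>

/* Model-only device flags; the values are arbitrary, not the kernel's. */
enum {
    DEV_GUEST_TSO           = 1 << 0, /* GUEST_TSO4 or GUEST_TSO6 offered */
    DEV_CTRL_GUEST_OFFLOADS = 1 << 1, /* CTRL_GUEST_OFFLOADS offered */
};

/* The three behaviours discussed in the thread. */
enum lro_policy {
    LRO_NEVER,                  /* before a02e8964: LRO not advertised */
    LRO_IF_GUEST_TSO,           /* after a02e8964: advertised with guest TSO */
    LRO_IF_CTRL_GUEST_OFFLOADS, /* fix 2): only with CTRL_GUEST_OFFLOADS */
};

struct lro_state {
    bool advertised; /* LRO flag set in the device's features */
    bool changeable; /* LRO flag can be toggled at runtime */
};

/* Hypothetical model of what the driver would report at probe time. */
static struct lro_state lro_probe(unsigned int dev, enum lro_policy policy)
{
    bool tso  = (dev & DEV_GUEST_TSO) != 0;
    bool ctrl = (dev & DEV_CTRL_GUEST_OFFLOADS) != 0;
    struct lro_state s = { false, false };

    switch (policy) {
    case LRO_NEVER:
        break;
    case LRO_IF_GUEST_TSO:
        s.advertised = tso;
        s.changeable = ctrl;
        break;
    case LRO_IF_CTRL_GUEST_OFFLOADS:
        s.advertised = tso && ctrl;
        s.changeable = ctrl;
        break;
    }
    return s;
}

/* Enabling forwarding asks for LRO to be turned off; the warning fires
 * when LRO is advertised but cannot be changed. */
static bool disable_lro_would_warn(struct lro_state s)
{
    return s.advertised && !s.changeable;
}
```

In this model a device with guest TSO but without CTRL_GUEST_OFFLOADS, like the VirtualBox host here, warns only under LRO_IF_GUEST_TSO. The model says nothing about throughput, which is the separate concern raised about each fix.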