Michael S. Tsirkin
2022-Aug-31 12:27 UTC
[PATCH] virtio_net: Abort driver initialization if device fails
On Mon, Aug 29, 2022 at 04:49:02AM +0000, Eli Cohen wrote:> > From: Michael S. Tsirkin <mst at redhat.com> > > Sent: Sunday, August 28, 2022 11:08 PM > > To: Eli Cohen <elic at nvidia.com> > > Cc: jasowang at redhat.com; eperezma at redhat.com; si-wei.liu at oracle.com; > > virtualization at lists.linux-foundation.org > > Subject: Re: [PATCH] virtio_net: Abort driver initialization if device fails > > > > On Sun, Aug 28, 2022 at 06:48:20PM +0300, Eli Cohen wrote: > > > Read the status bit after virtio_device_ready() to check if device > > > initialization was successful. If it was not, abort driver > > > initialization to avoid further attempts to access device resources. > > > > > > Abort is required per virtio spec v1.1 > > > > > > 3.1.1 > > > ... > > > If any of these steps go irrecoverably wrong, the driver SHOULD set the > > > FAILED status bit to indicate that it has given up on the device (it can > > > reset the device later to restart if desired). The driver MUST NOT > > > continue initialization in that case. > > > > I don't see a requirement to read the status bit though > > which is what the patch does. > > > > No explicit requirement to read the status but we are required to abort in > case of a failure and reading the status seems legitimate way to detect this situation.Isn't this supposed to be VIRTIO_CONFIG_S_NEEDS_RESET then? Here's my analysis: device might set VIRTIO_CONFIG_S_NEEDS_RESET on failure. Normally it will assert a config change interrupt allowing us to check that. However, if failure triggers before DRIVER_OK it set, interrupt can not be sent. To address this, check after DRIVER_OK. Maybe add a comment along these lines. Also how about doing this check inside virtio_device_ready and marking device as broken, and returning error code?> > > This fixes an issue that was discovered when mlx5_vdpa initialization > > > failed due to firmware error > > > > A bit more detail would be nice. which function failed exactly? > > > > > and subsequent attempts to send control VQ > > > commands failed with a call trace: > > > > > > watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:2610] > > > Modules linked in: virtio_net(+) net_failover failover virtio_vdpa > > mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib ib_uverbs ib_core mlx5_core > > mlxfw psample tls pci> > > > drm ghash_clmulni_intel serio_raw usb_storage scsi_transport_sas hpwdt > > wmi target_core_mod [last unloaded: ib_core] > > > CPU: 62 PID: 2610 Comm: systemd-udevd Tainted: G I 6.0.0-rc2+ #7 > > > Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS > > U30 04/08/2020 > > > RIP: 0010:virtnet_send_command+0xfa/0x140 [virtio_net] > > > Code: ec f0 80 e0 85 c0 0f 88 0c 68 00 00 48 8b 7b 08 e8 9b d6 80 e0 84 c0 > > 75 11 eb 43 48 8b 7b 08 e8 7c c3 80 e0 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d > > 74 24 > > > > RSP: 0018:ffffb6fba2037b78 EFLAGS: 00000246 > > > RAX: 0000000000000000 RBX: ffff99c6043a99c0 RCX: 0000000000000000 > > > RDX: 0000000000000000 RSI: ffffb6fba2037b7c RDI: ffff99c6042c6500 > > > RBP: ffffb6fba2037bc0 R08: 0000000000000001 R09: 0000000000000000 > > > R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002 > > > R13: 0000000000000004 R14: 0000000000000000 R15: ffff99f634258400 > > > FS: 00007f5894eefb40(0000) GS:ffff99dd9fb80000(0000) > > knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 000055acd4a80000 CR3: 00000018605a4002 CR4: 00000000007706e0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > PKRU: 55555554 > > > Call Trace: > > > <TASK> > > > _virtnet_set_queues+0x7f/0xc0 [virtio_net] > > > virtnet_probe+0x989/0xae0 [virtio_net] > > > virtio_dev_probe+0x1ab/0x260 > > > really_probe+0xde/0x390 > > > ? pm_runtime_barrier+0x50/0x90 > > > __driver_probe_device+0x78/0x180 > > > driver_probe_device+0x1e/0x90 > > > __driver_attach+0xc4/0x1e0 > > > ? __device_attach_driver+0xe0/0xe0 > > > ? __device_attach_driver+0xe0/0xe0 > > > bus_for_each_dev+0x61/0x90 > > > bus_add_driver+0x1a9/0x200 > > > driver_register+0x8f/0xf0 > > > ? 0xffffffffc0fdb000 > > > virtio_net_driver_init+0x70/0x1000 [virtio_net] > > > do_one_initcall+0x41/0x210 > > > ? kmem_cache_alloc_trace+0x16d/0x2c0 > > > do_init_module+0x4c/0x1f0 > > > __do_sys_finit_module+0x9f/0x100 > > > do_syscall_64+0x38/0x90 > > > entry_SYSCALL_64_after_hwframe+0x63/0xcd > > > RIP: 0033:0x7f589510aecd > > > Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 > > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 > > > > > > RSP: 002b:00007ffde61fb398 EFLAGS: 00000246 ORIG_RAX: > > 0000000000000139 > > > RAX: ffffffffffffffda RBX: 000055acd48ac950 RCX: 00007f589510aecd > > > RDX: 0000000000000000 RSI: 00007f589584332c RDI: 0000000000000012 > > > RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002 > > > R10: 0000000000000012 R11: 0000000000000246 R12: 00007f589584332c > > > R13: 000055acd4867c70 R14: 0000000000000007 R15: 000055acd48980a0 > > > </TASK> > > > > > > Fixes: commit 4baf1e33d084 ("virtio_net: enable VQs early") > > > Signed-off-by: Eli Cohen <elic at nvidia.com> > > > --- > > > drivers/net/virtio_net.c | 7 ++++++- > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > index 9cce7dec7366..4698d9a28a6f 100644 > > > --- a/drivers/net/virtio_net.c > > > +++ b/drivers/net/virtio_net.c > > > @@ -3900,6 +3900,11 @@ static int virtnet_probe(struct virtio_device > > *vdev) > > > } > > > > > > virtio_device_ready(vdev); > > > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_FAILED) { > > > + err = -EINVAL; > > > + rtnl_unlock(); > > > + goto unregister_ndev; > > > + } > > > > > > rtnl_unlock(); > > > > > > > I don't get it. What set the failed status? > > > > > > > @@ -3934,7 +3939,7 @@ static int virtnet_probe(struct virtio_device > > *vdev) > > > > > > free_unregister_netdev: > > > virtio_reset_device(vdev); > > > - > > > +unregister_ndev: > > > unregister_netdev(dev); > > > free_failover: > > > net_failover_destroy(vi->failover); > > > -- > > > 2.35.1 >