Michael S. Tsirkin
2022-Aug-28 20:07 UTC
[PATCH] virtio_net: Abort driver initialization if device fails
On Sun, Aug 28, 2022 at 06:48:20PM +0300, Eli Cohen wrote:> Read the status bit after virtio_device_ready() to check if device > initialization was successful. If it was not, abort driver > initialization to avoid further attempts to access device resources. > > Abort is required per virtio spec v1.1 > > 3.1.1 > ... > If any of these steps go irrecoverably wrong, the driver SHOULD set the > FAILED status bit to indicate that it has given up on the device (it can > reset the device later to restart if desired). The driver MUST NOT > continue initialization in that case.I don't see a requirement to read the status bit though which is what the patch does.> This fixes an issue that was discovered when mlx5_vdpa initialization > failed due to firmware errorA bit more detail would be nice. which function failed exactly?> and subsequent attempts to send control VQ > commands failed with a call trace: > > watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:2610] > Modules linked in: virtio_net(+) net_failover failover virtio_vdpa mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib ib_uverbs ib_core mlx5_core mlxfw psample tls pci> > drm ghash_clmulni_intel serio_raw usb_storage scsi_transport_sas hpwdt wmi target_core_mod [last unloaded: ib_core] > CPU: 62 PID: 2610 Comm: systemd-udevd Tainted: G I 6.0.0-rc2+ #7 > Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/08/2020 > RIP: 0010:virtnet_send_command+0xfa/0x140 [virtio_net] > Code: ec f0 80 e0 85 c0 0f 88 0c 68 00 00 48 8b 7b 08 e8 9b d6 80 e0 84 c0 75 11 eb 43 48 8b 7b 08 e8 7c c3 80 e0 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d 74 24 > > RSP: 0018:ffffb6fba2037b78 EFLAGS: 00000246 > RAX: 0000000000000000 RBX: ffff99c6043a99c0 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: ffffb6fba2037b7c RDI: ffff99c6042c6500 > RBP: ffffb6fba2037bc0 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002 > R13: 0000000000000004 R14: 0000000000000000 R15: ffff99f634258400 > FS: 00007f5894eefb40(0000) GS:ffff99dd9fb80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000055acd4a80000 CR3: 00000018605a4002 CR4: 00000000007706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > _virtnet_set_queues+0x7f/0xc0 [virtio_net] > virtnet_probe+0x989/0xae0 [virtio_net] > virtio_dev_probe+0x1ab/0x260 > really_probe+0xde/0x390 > ? pm_runtime_barrier+0x50/0x90 > __driver_probe_device+0x78/0x180 > driver_probe_device+0x1e/0x90 > __driver_attach+0xc4/0x1e0 > ? __device_attach_driver+0xe0/0xe0 > ? __device_attach_driver+0xe0/0xe0 > bus_for_each_dev+0x61/0x90 > bus_add_driver+0x1a9/0x200 > driver_register+0x8f/0xf0 > ? 0xffffffffc0fdb000 > virtio_net_driver_init+0x70/0x1000 [virtio_net] > do_one_initcall+0x41/0x210 > ? kmem_cache_alloc_trace+0x16d/0x2c0 > do_init_module+0x4c/0x1f0 > __do_sys_finit_module+0x9f/0x100 > do_syscall_64+0x38/0x90 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > RIP: 0033:0x7f589510aecd > Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 > > RSP: 002b:00007ffde61fb398 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > RAX: ffffffffffffffda RBX: 000055acd48ac950 RCX: 00007f589510aecd > RDX: 0000000000000000 RSI: 00007f589584332c RDI: 0000000000000012 > RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002 > R10: 0000000000000012 R11: 0000000000000246 R12: 00007f589584332c > R13: 000055acd4867c70 R14: 0000000000000007 R15: 000055acd48980a0 > </TASK> > > Fixes: commit 4baf1e33d084 ("virtio_net: enable VQs early") > Signed-off-by: Eli Cohen <elic at nvidia.com> > --- > drivers/net/virtio_net.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 9cce7dec7366..4698d9a28a6f 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3900,6 +3900,11 @@ static int virtnet_probe(struct virtio_device *vdev) > } > > virtio_device_ready(vdev); > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_FAILED) { > + err = -EINVAL; > + rtnl_unlock(); > + goto unregister_ndev; > + } > > rtnl_unlock(); >I don't get it. What set the failed status?> @@ -3934,7 +3939,7 @@ static int virtnet_probe(struct virtio_device *vdev) > > free_unregister_netdev: > virtio_reset_device(vdev); > - > +unregister_ndev: > unregister_netdev(dev); > free_failover: > net_failover_destroy(vi->failover); > -- > 2.35.1
Jason Wang
2022-Aug-29 00:33 UTC
[PATCH] virtio_net: Abort driver initialization if device fails
On Mon, Aug 29, 2022 at 4:07 AM Michael S. Tsirkin <mst at redhat.com> wrote:> > On Sun, Aug 28, 2022 at 06:48:20PM +0300, Eli Cohen wrote: > > Read the status bit after virtio_device_ready() to check if device > > initialization was successful. If it was not, abort driver > > initialization to avoid further attempts to access device resources. > > > > Abort is required per virtio spec v1.1 > > > > 3.1.1 > > ... > > If any of these steps go irrecoverably wrong, the driver SHOULD set the > > FAILED status bit to indicate that it has given up on the device (it can > > reset the device later to restart if desired). The driver MUST NOT > > continue initialization in that case. > > I don't see a requirement to read the status bit though > which is what the patch does. > > > This fixes an issue that was discovered when mlx5_vdpa initialization > > failed due to firmware error > > A bit more detail would be nice. which function failed exactly? > > > and subsequent attempts to send control VQ > > commands failed with a call trace: > > > > watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:2610] > > Modules linked in: virtio_net(+) net_failover failover virtio_vdpa mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib ib_uverbs ib_core mlx5_core mlxfw psample tls pci> > > drm ghash_clmulni_intel serio_raw usb_storage scsi_transport_sas hpwdt wmi target_core_mod [last unloaded: ib_core] > > CPU: 62 PID: 2610 Comm: systemd-udevd Tainted: G I 6.0.0-rc2+ #7 > > Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/08/2020 > > RIP: 0010:virtnet_send_command+0xfa/0x140 [virtio_net] > > Code: ec f0 80 e0 85 c0 0f 88 0c 68 00 00 48 8b 7b 08 e8 9b d6 80 e0 84 c0 75 11 eb 43 48 8b 7b 08 e8 7c c3 80 e0 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d 74 24 > > > RSP: 0018:ffffb6fba2037b78 EFLAGS: 00000246 > > RAX: 0000000000000000 RBX: ffff99c6043a99c0 RCX: 0000000000000000 > > RDX: 0000000000000000 RSI: ffffb6fba2037b7c RDI: ffff99c6042c6500 > > RBP: ffffb6fba2037bc0 R08: 0000000000000001 R09: 0000000000000000 > > R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002 > > R13: 0000000000000004 R14: 0000000000000000 R15: ffff99f634258400 > > FS: 00007f5894eefb40(0000) GS:ffff99dd9fb80000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 000055acd4a80000 CR3: 00000018605a4002 CR4: 00000000007706e0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > PKRU: 55555554 > > Call Trace: > > <TASK> > > _virtnet_set_queues+0x7f/0xc0 [virtio_net]Btw the calltrace seems like the driver is waiting for the completion of mq setting command. There's some recent discussion of using timeout or interrupt for ctrl command which may help to solve this issue. Thanks> > virtnet_probe+0x989/0xae0 [virtio_net] > > virtio_dev_probe+0x1ab/0x260 > > really_probe+0xde/0x390 > > ? pm_runtime_barrier+0x50/0x90 > > __driver_probe_device+0x78/0x180 > > driver_probe_device+0x1e/0x90 > > __driver_attach+0xc4/0x1e0 > > ? __device_attach_driver+0xe0/0xe0 > > ? __device_attach_driver+0xe0/0xe0 > > bus_for_each_dev+0x61/0x90 > > bus_add_driver+0x1a9/0x200 > > driver_register+0x8f/0xf0 > > ? 0xffffffffc0fdb000 > > virtio_net_driver_init+0x70/0x1000 [virtio_net] > > do_one_initcall+0x41/0x210 > > ? kmem_cache_alloc_trace+0x16d/0x2c0 > > do_init_module+0x4c/0x1f0 > > __do_sys_finit_module+0x9f/0x100 > > do_syscall_64+0x38/0x90 > > entry_SYSCALL_64_after_hwframe+0x63/0xcd > > RIP: 0033:0x7f589510aecd > > Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 > > > RSP: 002b:00007ffde61fb398 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > RAX: ffffffffffffffda RBX: 000055acd48ac950 RCX: 00007f589510aecd > > RDX: 0000000000000000 RSI: 00007f589584332c RDI: 0000000000000012 > > RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002 > > R10: 0000000000000012 R11: 0000000000000246 R12: 00007f589584332c > > R13: 000055acd4867c70 R14: 0000000000000007 R15: 000055acd48980a0 > > </TASK> > > > > Fixes: commit 4baf1e33d084 ("virtio_net: enable VQs early") > > Signed-off-by: Eli Cohen <elic at nvidia.com> > > --- > > drivers/net/virtio_net.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 9cce7dec7366..4698d9a28a6f 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -3900,6 +3900,11 @@ static int virtnet_probe(struct virtio_device *vdev) > > } > > > > virtio_device_ready(vdev); > > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_FAILED) { > > + err = -EINVAL; > > + rtnl_unlock(); > > + goto unregister_ndev; > > + } > > > > rtnl_unlock(); > > > > I don't get it. What set the failed status? > > > > @@ -3934,7 +3939,7 @@ static int virtnet_probe(struct virtio_device *vdev) > > > > free_unregister_netdev: > > virtio_reset_device(vdev); > > - > > +unregister_ndev: > > unregister_netdev(dev); > > free_failover: > > net_failover_destroy(vi->failover); > > -- > > 2.35.1 >