Dragos Tatulea
2023-May-16 09:58 UTC
[PATCH] vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister
Currently the vdpa device is unregistered after the workqueue that
processes vq commands is disabled. However, the device unregister
process can still send commands to the cvq (a vlan delete for example),
which leads to a hang because the handling workqueue has been disabled
and the command never finishes:

[ 2263.095764] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2263.096307] rcu: 9-....: (5250 ticks this GP) idle=dac4/1/0x4000000000000000 softirq=111009/111009 fqs=2544
[ 2263.097154] rcu: (t=5251 jiffies g=393549 q=347 ncpus=10)
[ 2263.097648] CPU: 9 PID: 94300 Comm: kworker/u20:2 Not tainted 6.3.0-rc6_for_upstream_min_debug_2023_04_14_00_02 #1
[ 2263.098535] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 2263.099481] Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
[ 2263.100143] RIP: 0010:virtnet_send_command+0x109/0x170
[ 2263.100621] Code: 1d df f5 ff 85 c0 78 5c 48 8b 7b 08 e8 d0 c5 f5 ff 84 c0 75 11 eb 22 48 8b 7b 08 e8 01 b7 f5 ff 84 c0 75 15 f3 90 48 8b 7b 08 <48> 8d 74 24 04 e8 8d c5 f5 ff 48 85 c0 74 de 48 8b 83 f8 00 00 00
[ 2263.102148] RSP: 0018:ffff888139cf36e8 EFLAGS: 00000246
[ 2263.102624] RAX: 0000000000000000 RBX: ffff888166bea940 RCX: 0000000000000001
[ 2263.103244] RDX: 0000000000000000 RSI: ffff888139cf36ec RDI: ffff888146763800
[ 2263.103864] RBP: ffff888139cf3710 R08: ffff88810d201000 R09: 0000000000000000
[ 2263.104473] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000002
[ 2263.105082] R13: 0000000000000002 R14: ffff888114528400 R15: ffff888166bea000
[ 2263.105689] FS: 0000000000000000(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
[ 2263.106404] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2263.106925] CR2: 00007f31f394b000 CR3: 000000010615b006 CR4: 0000000000370ea0
[ 2263.107542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2263.108163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2263.108769] Call Trace:
[ 2263.109059] <TASK>
[ 2263.109320] ? check_preempt_wakeup+0x11f/0x230
[ 2263.109750] virtnet_vlan_rx_kill_vid+0x5a/0xa0
[ 2263.110180] vlan_vid_del+0x9c/0x170
[ 2263.110546] vlan_device_event+0x351/0x760 [8021q]
[ 2263.111004] raw_notifier_call_chain+0x41/0x60
[ 2263.111426] dev_close_many+0xcb/0x120
[ 2263.111808] unregister_netdevice_many_notify+0x130/0x770
[ 2263.112297] ? wq_worker_running+0xa/0x30
[ 2263.112688] unregister_netdevice_queue+0x89/0xc0
[ 2263.113128] unregister_netdev+0x18/0x20
[ 2263.113512] virtnet_remove+0x4f/0x230
[ 2263.113885] virtio_dev_remove+0x31/0x70
[ 2263.114273] device_release_driver_internal+0x18f/0x1f0
[ 2263.114746] bus_remove_device+0xc6/0x130
[ 2263.115146] device_del+0x173/0x3c0
[ 2263.115502] ? kernfs_find_ns+0x35/0xd0
[ 2263.115895] device_unregister+0x1a/0x60
[ 2263.116279] unregister_virtio_device+0x11/0x20
[ 2263.116706] device_release_driver_internal+0x18f/0x1f0
[ 2263.117182] bus_remove_device+0xc6/0x130
[ 2263.117576] device_del+0x173/0x3c0
[ 2263.117929] ? vdpa_dev_remove+0x20/0x20 [vdpa]
[ 2263.118364] device_unregister+0x1a/0x60
[ 2263.118752] mlx5_vdpa_dev_del+0x4c/0x80 [mlx5_vdpa]
[ 2263.119232] vdpa_match_remove+0x21/0x30 [vdpa]
[ 2263.119663] bus_for_each_dev+0x71/0xc0
[ 2263.120054] vdpa_mgmtdev_unregister+0x57/0x70 [vdpa]
[ 2263.120520] mlx5v_remove+0x12/0x20 [mlx5_vdpa]
[ 2263.120953] auxiliary_bus_remove+0x18/0x30
[ 2263.121356] device_release_driver_internal+0x18f/0x1f0
[ 2263.121830] bus_remove_device+0xc6/0x130
[ 2263.122223] device_del+0x173/0x3c0
[ 2263.122581] ? devl_param_driverinit_value_get+0x29/0x90
[ 2263.123070] mlx5_rescan_drivers_locked+0xc4/0x2d0 [mlx5_core]
[ 2263.123633] mlx5_unregister_device+0x54/0x80 [mlx5_core]
[ 2263.124169] mlx5_uninit_one+0x54/0x150 [mlx5_core]
[ 2263.124656] mlx5_sf_dev_remove+0x45/0x90 [mlx5_core]
[ 2263.125153] auxiliary_bus_remove+0x18/0x30
[ 2263.125560] device_release_driver_internal+0x18f/0x1f0
[ 2263.126052] bus_remove_device+0xc6/0x130
[ 2263.126451] device_del+0x173/0x3c0
[ 2263.126815] mlx5_sf_dev_remove+0x39/0xf0 [mlx5_core]
[ 2263.127318] mlx5_sf_dev_state_change_handler+0x178/0x270 [mlx5_core]
[ 2263.127920] blocking_notifier_call_chain+0x5a/0x80
[ 2263.128379] mlx5_vhca_state_work_handler+0x151/0x200 [mlx5_core]
[ 2263.128951] process_one_work+0x1bb/0x3c0
[ 2263.129355] ? process_one_work+0x3c0/0x3c0
[ 2263.129766] worker_thread+0x4d/0x3c0
[ 2263.130140] ? process_one_work+0x3c0/0x3c0
[ 2263.130548] kthread+0xb9/0xe0
[ 2263.130895] ? kthread_complete_and_exit+0x20/0x20
[ 2263.131349] ret_from_fork+0x1f/0x30
[ 2263.131717] </TASK>

The fix is to disable and destroy the workqueue after the device
unregister. It is expected that vhost will not trigger kicks after
the unregister. But even if it did, the wq is already disabled by
setting the pointer to NULL (as done in the referenced commit).

Fixes: ad6dc1daaf29 ("vdpa/mlx5: Avoid processing works if workqueue was destroyed")
Signed-off-by: Dragos Tatulea <dtatulea at nvidia.com>
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index e29e32b306ad..279ac6a558d2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3349,10 +3349,10 @@ static void mlx5_vdpa_dev_del(struct vdpa_mgmt_dev *v_mdev, struct vdpa_device *
 	mlx5_vdpa_remove_debugfs(ndev->debugfs);
 	ndev->debugfs = NULL;
 	unregister_link_notifier(ndev);
+	_vdpa_unregister_device(dev);
 	wq = mvdev->wq;
 	mvdev->wq = NULL;
 	destroy_workqueue(wq);
-	_vdpa_unregister_device(dev);
 	mgtdev->ndev = NULL;
 }

--
2.40.1
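The last point of the commit message relies on the guard introduced by the
Fixes commit: once mvdev->wq has been set to NULL, a late control-VQ kick is
simply dropped instead of being queued on a destroyed workqueue. Below is a
rough sketch of that pattern; the handler name, the is_ctrl_vq_idx() helper,
the work item field and notify_hw() are illustrative assumptions, not code
copied from mlx5_vnet.c.

/* Sketch (assumed names): a control-VQ kick that arrives after
 * mvdev->wq was cleared is ignored, so no work can be queued on a
 * workqueue that is about to be (or already has been) destroyed.
 */
static void mlx5_vdpa_kick_vq(struct vdpa_device *vdev, u16 idx)
{
	struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);

	if (is_ctrl_vq_idx(mvdev, idx)) {
		if (!mvdev->wq)
			return;		/* wq torn down: drop the kick */
		queue_work(mvdev->wq, &mvdev->cvq_work);
		return;
	}
	notify_hw(mvdev, idx);		/* data VQs are kicked directly */
}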
Jason Wang
2023-May-17 05:24 UTC
[PATCH] vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister
On Tue, May 16, 2023 at 5:58 PM Dragos Tatulea <dtatulea at nvidia.com> wrote:
>
> Currently the vdpa device is unregistered after the workqueue that
> processes vq commands is disabled. However, the device unregister
> process can still send commands to the cvq (a vlan delete for example),
> which leads to a hang because the handling workqueue has been disabled
> and the command never finishes:
>
> [ 2263.095764] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 2263.096307] rcu: 9-....: (5250 ticks this GP) idle=dac4/1/0x4000000000000000 softirq=111009/111009 fqs=2544
> [ 2263.097154] rcu: (t=5251 jiffies g=393549 q=347 ncpus=10)
> [ 2263.097648] CPU: 9 PID: 94300 Comm: kworker/u20:2 Not tainted 6.3.0-rc6_for_upstream_min_debug_2023_04_14_00_02 #1
> [ 2263.098535] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [ 2263.099481] Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
> [ 2263.100143] RIP: 0010:virtnet_send_command+0x109/0x170
> [ 2263.100621] Code: 1d df f5 ff 85 c0 78 5c 48 8b 7b 08 e8 d0 c5 f5 ff 84 c0 75 11 eb 22 48 8b 7b 08 e8 01 b7 f5 ff 84 c0 75 15 f3 90 48 8b 7b 08 <48> 8d 74 24 04 e8 8d c5 f5 ff 48 85 c0 74 de 48 8b 83 f8 00 00 00
> [ 2263.102148] RSP: 0018:ffff888139cf36e8 EFLAGS: 00000246
> [ 2263.102624] RAX: 0000000000000000 RBX: ffff888166bea940 RCX: 0000000000000001
> [ 2263.103244] RDX: 0000000000000000 RSI: ffff888139cf36ec RDI: ffff888146763800
> [ 2263.103864] RBP: ffff888139cf3710 R08: ffff88810d201000 R09: 0000000000000000
> [ 2263.104473] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000002
> [ 2263.105082] R13: 0000000000000002 R14: ffff888114528400 R15: ffff888166bea000
> [ 2263.105689] FS: 0000000000000000(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
> [ 2263.106404] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2263.106925] CR2: 00007f31f394b000 CR3: 000000010615b006 CR4: 0000000000370ea0
> [ 2263.107542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2263.108163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2263.108769] Call Trace:
> [ 2263.109059] <TASK>
> [ 2263.109320] ? check_preempt_wakeup+0x11f/0x230
> [ 2263.109750] virtnet_vlan_rx_kill_vid+0x5a/0xa0
> [ 2263.110180] vlan_vid_del+0x9c/0x170
> [ 2263.110546] vlan_device_event+0x351/0x760 [8021q]
> [ 2263.111004] raw_notifier_call_chain+0x41/0x60
> [ 2263.111426] dev_close_many+0xcb/0x120
> [ 2263.111808] unregister_netdevice_many_notify+0x130/0x770
> [ 2263.112297] ? wq_worker_running+0xa/0x30
> [ 2263.112688] unregister_netdevice_queue+0x89/0xc0
> [ 2263.113128] unregister_netdev+0x18/0x20
> [ 2263.113512] virtnet_remove+0x4f/0x230
> [ 2263.113885] virtio_dev_remove+0x31/0x70
> [ 2263.114273] device_release_driver_internal+0x18f/0x1f0
> [ 2263.114746] bus_remove_device+0xc6/0x130
> [ 2263.115146] device_del+0x173/0x3c0
> [ 2263.115502] ? kernfs_find_ns+0x35/0xd0
> [ 2263.115895] device_unregister+0x1a/0x60
> [ 2263.116279] unregister_virtio_device+0x11/0x20
> [ 2263.116706] device_release_driver_internal+0x18f/0x1f0
> [ 2263.117182] bus_remove_device+0xc6/0x130
> [ 2263.117576] device_del+0x173/0x3c0
> [ 2263.117929] ? vdpa_dev_remove+0x20/0x20 [vdpa]
> [ 2263.118364] device_unregister+0x1a/0x60
> [ 2263.118752] mlx5_vdpa_dev_del+0x4c/0x80 [mlx5_vdpa]
> [ 2263.119232] vdpa_match_remove+0x21/0x30 [vdpa]
> [ 2263.119663] bus_for_each_dev+0x71/0xc0
> [ 2263.120054] vdpa_mgmtdev_unregister+0x57/0x70 [vdpa]
> [ 2263.120520] mlx5v_remove+0x12/0x20 [mlx5_vdpa]
> [ 2263.120953] auxiliary_bus_remove+0x18/0x30
> [ 2263.121356] device_release_driver_internal+0x18f/0x1f0
> [ 2263.121830] bus_remove_device+0xc6/0x130
> [ 2263.122223] device_del+0x173/0x3c0
> [ 2263.122581] ? devl_param_driverinit_value_get+0x29/0x90
> [ 2263.123070] mlx5_rescan_drivers_locked+0xc4/0x2d0 [mlx5_core]
> [ 2263.123633] mlx5_unregister_device+0x54/0x80 [mlx5_core]
> [ 2263.124169] mlx5_uninit_one+0x54/0x150 [mlx5_core]
> [ 2263.124656] mlx5_sf_dev_remove+0x45/0x90 [mlx5_core]
> [ 2263.125153] auxiliary_bus_remove+0x18/0x30
> [ 2263.125560] device_release_driver_internal+0x18f/0x1f0
> [ 2263.126052] bus_remove_device+0xc6/0x130
> [ 2263.126451] device_del+0x173/0x3c0
> [ 2263.126815] mlx5_sf_dev_remove+0x39/0xf0 [mlx5_core]
> [ 2263.127318] mlx5_sf_dev_state_change_handler+0x178/0x270 [mlx5_core]
> [ 2263.127920] blocking_notifier_call_chain+0x5a/0x80
> [ 2263.128379] mlx5_vhca_state_work_handler+0x151/0x200 [mlx5_core]
> [ 2263.128951] process_one_work+0x1bb/0x3c0
> [ 2263.129355] ? process_one_work+0x3c0/0x3c0
> [ 2263.129766] worker_thread+0x4d/0x3c0
> [ 2263.130140] ? process_one_work+0x3c0/0x3c0
> [ 2263.130548] kthread+0xb9/0xe0
> [ 2263.130895] ? kthread_complete_and_exit+0x20/0x20
> [ 2263.131349] ret_from_fork+0x1f/0x30
> [ 2263.131717] </TASK>
>
> The fix is to disable and destroy the workqueue after the device
> unregister. It is expected that vhost will not trigger kicks after
> the unregister. But even if it did, the wq is already disabled by
> setting the pointer to NULL (as done in the referenced commit).
>
> Fixes: ad6dc1daaf29 ("vdpa/mlx5: Avoid processing works if workqueue was destroyed")
> Signed-off-by: Dragos Tatulea <dtatulea at nvidia.com>

Acked-by: Jason Wang <jasowang at redhat.com>

Thanks

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index e29e32b306ad..279ac6a558d2 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3349,10 +3349,10 @@ static void mlx5_vdpa_dev_del(struct vdpa_mgmt_dev *v_mdev, struct vdpa_device *
>  	mlx5_vdpa_remove_debugfs(ndev->debugfs);
>  	ndev->debugfs = NULL;
>  	unregister_link_notifier(ndev);
> +	_vdpa_unregister_device(dev);
>  	wq = mvdev->wq;
>  	mvdev->wq = NULL;
>  	destroy_workqueue(wq);
> -	_vdpa_unregister_device(dev);
>  	mgtdev->ndev = NULL;
>  }
>
> --
> 2.40.1
>
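For reference, the resulting teardown order in mlx5_vdpa_dev_del() with the
patch applied, reassembled from the diff hunk above. The local variable
declarations at the top of the function are not part of the hunk and are
assumed from context.

static void mlx5_vdpa_dev_del(struct vdpa_mgmt_dev *v_mdev, struct vdpa_device *dev)
{
	/* declarations below are assumed, not shown in the hunk */
	struct mlx5_vdpa_mgmtdev *mgtdev = container_of(v_mdev, struct mlx5_vdpa_mgmtdev, mgtdev);
	struct mlx5_vdpa_dev *mvdev = to_mvdev(dev);
	struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
	struct workqueue_struct *wq;

	mlx5_vdpa_remove_debugfs(ndev->debugfs);
	ndev->debugfs = NULL;
	unregister_link_notifier(ndev);
	_vdpa_unregister_device(dev);	/* unregister while the wq can still serve cvq commands */
	wq = mvdev->wq;
	mvdev->wq = NULL;		/* any later kick sees a NULL wq and is dropped */
	destroy_workqueue(wq);
	mgtdev->ndev = NULL;
}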