Michael S. Tsirkin
2021-Jul-17 20:46 UTC
[PATCH 4/4] virtio_pci: Support surprise removal of virtio pci device
On Sat, Jul 17, 2021 at 10:42:58AM +0300, Parav Pandit wrote:> When a virtio pci device undergo surprise removal (aka async removaln intypo> PCIe spec), mark the device is broken so that any upper layer drivers can > abort any outstanding operation. > > When a virtio net pci device undergo surprise removal which is used by a > NetworkManager, a below call trace was observed. > > kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/1:1:27059] > watchdog: BUG: soft lockup - CPU#1 stuck for 52s! [kworker/1:1:27059] > CPU: 1 PID: 27059 Comm: kworker/1:1 Tainted: G S W I L 5.13.0-hotplug+ #8 > Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.9.4 11/06/2020 > Workqueue: events linkwatch_event > RIP: 0010:virtnet_send_command+0xfc/0x150 [virtio_net] > Call Trace: > virtnet_set_rx_mode+0xcf/0x2a7 [virtio_net] > ? __hw_addr_create_ex+0x85/0xc0 > __dev_mc_add+0x72/0x80 > igmp6_group_added+0xa7/0xd0 > ipv6_mc_up+0x3c/0x60 > ipv6_find_idev+0x36/0x80 > addrconf_add_dev+0x1e/0xa0 > addrconf_dev_config+0x71/0x130 > addrconf_notify+0x1f5/0xb40 > ? rtnl_is_locked+0x11/0x20 > ? __switch_to_asm+0x42/0x70 > ? finish_task_switch+0xaf/0x2c0 > ? raw_notifier_call_chain+0x3e/0x50 > raw_notifier_call_chain+0x3e/0x50 > netdev_state_change+0x67/0x90 > linkwatch_do_dev+0x3c/0x50 > __linkwatch_run_queue+0xd2/0x220 > linkwatch_event+0x21/0x30 > process_one_work+0x1c8/0x370 > worker_thread+0x30/0x380 > ? process_one_work+0x370/0x370 > kthread+0x118/0x140 > ? set_kthread_struct+0x40/0x40 > ret_from_fork+0x1f/0x30 > > Hence, add the ability to abort the command on surprise removal > which prevents infinite loop and system lockup. > > Signed-off-by: Parav Pandit <parav at nvidia.com>OK that's nice, but I suspect this is not enough. For example we need to also fix up all config space reads to handle all-ones patterns specially. E.g. /* After writing 0 to device_status, the driver MUST wait for a read of * device_status to return 0 before reinitializing the device. * This will flush out the status write, and flush in device writes, * including MSI-X interrupts, if any. */ while (vp_modern_get_status(mdev)) msleep(1); lots of code in drivers needs to be fixed too. I guess we need to annotate drivers somehow to mark up whether they support surprise removal? And maybe find a way to let host know?> --- > drivers/virtio/virtio_pci_common.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c > index 222d630c41fc..b35bb2d57f62 100644 > --- a/drivers/virtio/virtio_pci_common.c > +++ b/drivers/virtio/virtio_pci_common.c > @@ -576,6 +576,13 @@ static void virtio_pci_remove(struct pci_dev *pci_dev) > struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); > struct device *dev = get_device(&vp_dev->vdev.dev); > > + /* > + * Device is marked broken on surprise removal so that virtio upper > + * layers can abort any ongoing operation. > + */ > + if (!pci_device_is_present(pci_dev)) > + virtio_break_device(&vp_dev->vdev); > + > pci_disable_sriov(pci_dev); > > unregister_virtio_device(&vp_dev->vdev); > -- > 2.27.0
Parav Pandit
2021-Jul-19 05:44 UTC
[PATCH 4/4] virtio_pci: Support surprise removal of virtio pci device
> From: Michael S. Tsirkin <mst at redhat.com> > Sent: Sunday, July 18, 2021 2:17 AM > > On Sat, Jul 17, 2021 at 10:42:58AM +0300, Parav Pandit wrote: > > When a virtio pci device undergo surprise removal (aka async removaln > > in > > typoFixing it. Checkpatch.pl and codespell, both didn't catch it. ?> > OK that's nice, but I suspect this is not enough. > For example we need to also fix up all config space reads to handle all-ones > patterns specially. > > E.g. > > /* After writing 0 to device_status, the driver MUST wait for a read of > * device_status to return 0 before reinitializing the device. > * This will flush out the status write, and flush in device writes, > * including MSI-X interrupts, if any. > */ > while (vp_modern_get_status(mdev)) > msleep(1); > > lots of code in drivers needs to be fixed too.Above one particularly known to us in the hot plug area. Above fix is needed to close the race of hot plug and unplug happening in parallel, which is occurring today, but less common. It is in my todo list to fix it. Will take care of it in near future in other series.> > I guess we need to annotate drivers somehow to mark up whether they > support surprise removal? And maybe find a way to let host know?What is the benefit of it? Who can make use of that information? In virtio pci case, it is similar improvement to what nvme pci driver went through few years back to support hot plug/unplug. Lets complete this of fixes to make it little more robust like nvme.> > > > > --- > > drivers/virtio/virtio_pci_common.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/drivers/virtio/virtio_pci_common.c > > b/drivers/virtio/virtio_pci_common.c > > index 222d630c41fc..b35bb2d57f62 100644 > > --- a/drivers/virtio/virtio_pci_common.c > > +++ b/drivers/virtio/virtio_pci_common.c > > @@ -576,6 +576,13 @@ static void virtio_pci_remove(struct pci_dev > *pci_dev) > > struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); > > struct device *dev = get_device(&vp_dev->vdev.dev); > > > > + /* > > + * Device is marked broken on surprise removal so that virtio upper > > + * layers can abort any ongoing operation. > > + */ > > + if (!pci_device_is_present(pci_dev)) > > + virtio_break_device(&vp_dev->vdev); > > + > > pci_disable_sriov(pci_dev); > > > > unregister_virtio_device(&vp_dev->vdev); > > -- > > 2.27.0
Cornelia Huck
2021-Jul-19 09:40 UTC
[PATCH 4/4] virtio_pci: Support surprise removal of virtio pci device
On Sat, Jul 17 2021, "Michael S. Tsirkin" <mst at redhat.com> wrote:> On Sat, Jul 17, 2021 at 10:42:58AM +0300, Parav Pandit wrote: >> When a virtio pci device undergo surprise removal (aka async removaln in > > typo > >> PCIe spec), mark the device is broken so that any upper layer drivers can >> abort any outstanding operation. >> >> When a virtio net pci device undergo surprise removal which is used by a >> NetworkManager, a below call trace was observed. >> >> kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/1:1:27059] >> watchdog: BUG: soft lockup - CPU#1 stuck for 52s! [kworker/1:1:27059] >> CPU: 1 PID: 27059 Comm: kworker/1:1 Tainted: G S W I L 5.13.0-hotplug+ #8 >> Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.9.4 11/06/2020 >> Workqueue: events linkwatch_event >> RIP: 0010:virtnet_send_command+0xfc/0x150 [virtio_net] >> Call Trace: >> virtnet_set_rx_mode+0xcf/0x2a7 [virtio_net] >> ? __hw_addr_create_ex+0x85/0xc0 >> __dev_mc_add+0x72/0x80 >> igmp6_group_added+0xa7/0xd0 >> ipv6_mc_up+0x3c/0x60 >> ipv6_find_idev+0x36/0x80 >> addrconf_add_dev+0x1e/0xa0 >> addrconf_dev_config+0x71/0x130 >> addrconf_notify+0x1f5/0xb40 >> ? rtnl_is_locked+0x11/0x20 >> ? __switch_to_asm+0x42/0x70 >> ? finish_task_switch+0xaf/0x2c0 >> ? raw_notifier_call_chain+0x3e/0x50 >> raw_notifier_call_chain+0x3e/0x50 >> netdev_state_change+0x67/0x90 >> linkwatch_do_dev+0x3c/0x50 >> __linkwatch_run_queue+0xd2/0x220 >> linkwatch_event+0x21/0x30 >> process_one_work+0x1c8/0x370 >> worker_thread+0x30/0x380 >> ? process_one_work+0x370/0x370 >> kthread+0x118/0x140 >> ? set_kthread_struct+0x40/0x40 >> ret_from_fork+0x1f/0x30 >> >> Hence, add the ability to abort the command on surprise removal >> which prevents infinite loop and system lockup. >> >> Signed-off-by: Parav Pandit <parav at nvidia.com> > > OK that's nice, but I suspect this is not enough. > For example we need to also fix up all config space reads > to handle all-ones patterns specially. > > E.g. > > /* After writing 0 to device_status, the driver MUST wait for a read of > * device_status to return 0 before reinitializing the device. > * This will flush out the status write, and flush in device writes, > * including MSI-X interrupts, if any. > */ > while (vp_modern_get_status(mdev)) > msleep(1); > > lots of code in drivers needs to be fixed too. > > I guess we need to annotate drivers somehow to mark up > whether they support surprise removal? And maybe find a > way to let host know?I'm wondering whether virtio-pci surprise removal would need more support in drivers than virtio-ccw surprise removal; given that virtio-ccw *only* supports surprise removal and I don't remember any problem reports, the situation is probably not that bad. Is surprise removal of block devices still a big problem? We have some support for (non-virtio) ccw devices (e.g. dasd) via a 'disconnected' state that was designed to mitigate problems with block devices that are suddenly gone.