thr3ads.net - libvirt users - PCI Passthrough and Surprise Hotplug [Oct 2020]

Marc Smith

2020-Oct-05 15:05 UTC

PCI Passthrough and Surprise Hotplug

Hi,

I'm using QEMU/KVM on RHEL (CentOS) 7.8.2003:
# cat /etc/redhat-release
CentOS Linux release 7.8.2003

I'm passing an NVMe drive into a Linux KVM virtual machine
(<type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>), which has
the following 'hostdev' entry:
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev5'/>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x0f' function='0x0'/>
    </hostdev>
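
For readers matching this XML against the logs further down, here is a minimal Python sketch (not part of the original post) that converts the two <address> elements into the usual domain:bus:slot.function notation. The function names are illustrative only, and it uses nothing beyond the standard library XML parser.

```python
import xml.etree.ElementTree as ET

# The hostdev entry as quoted in the post.
HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev5'/>
  <rom bar='off'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x0f' function='0x0'/>
</hostdev>
"""


def pci_address(elem):
    """Format a libvirt <address> element as domain:bus:slot.function."""
    domain = int(elem.get("domain"), 16)
    bus = int(elem.get("bus"), 16)
    slot = int(elem.get("slot"), 16)
    function = int(elem.get("function"), 16)
    return f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}"


def hostdev_addresses(xml_text):
    """Return (host_address, guest_address) for a single <hostdev> element."""
    hostdev = ET.fromstring(xml_text)
    host = pci_address(hostdev.find("./source/address"))  # device on the host
    guest = pci_address(hostdev.find("./address"))        # slot seen by the guest
    return host, guest


if __name__ == "__main__":
    host, guest = hostdev_addresses(HOSTDEV_XML)
    print("host :", host)
    print("guest:", guest)
```

The host address is the one that appears in the host kernel messages (0000:42:00.0), and the guest address is the one the guest's nvme driver reports (0000:01:0f.0).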

This all works fine during normal operation, but I noticed when we
remove the NVMe drive (surprise hotplug event), the PCIe EP then seems
"stuck"... here we see the link-down event on the host (when the drive
is removed):
[67720.177959] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Down
[67720.178027] vfio-pci 0000:42:00.0: Relaying device request to user (#0)

And naturally, inside of the Linux VM, we see the NVMe controller drop:
[ 1203.491536] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 1203.522759] blk_update_request: I/O error, dev nvme1n2, sector 33554304 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 1203.560505] nvme 0000:01:0f.0: Refused to change power state, currently in D3
[ 1203.561104] nvme nvme1: Removing after probe failure status: -19
[ 1203.583506] Buffer I/O error on dev nvme1n2, logical block 4194288, async page read
[ 1203.583514] blk_update_request: I/O error, dev nvme1n1, sector 33554304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

We see this EP is found at IOMMU group '76':
# readlink /sys/bus/pci/devices/0000\:42\:00.0/iommu_group
../../../../kernel/iommu_groups/76
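
The same lookup can be scripted. The following is a small sketch, assuming the standard sysfs layout shown in the readlink output above; the function name and the sysfs_root parameter (there only so the code can be pointed at a test directory) are illustrative.

```python
import os


def iommu_group(bdf, sysfs_root="/sys"):
    """Return the IOMMU group number of a PCI device such as '0000:42:00.0'.

    Reads the 'iommu_group' symlink of the device and takes the last path
    component of its target, as 'readlink' does in the shell example.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "iommu_group")
    target = os.readlink(link)  # e.g. ../../../../kernel/iommu_groups/76
    return int(os.path.basename(target))
```

On the host described here, iommu_group("0000:42:00.0") would be expected to give 76, which is the number used in the /dev/vfio/76 path.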

And it is no longer bound to the 'vfio-pci' driver (expected) on the
host. I was expecting to see all of the FDs for the /dev/vfio/NN
character devices closed, but it seems they are still open:
# lsof | grep "vfio/76"
qemu-kvm  242364              qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242502       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242511       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242518       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242531       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242533       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242542       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242550       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
qemu-kvm  242364 242554       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
SPICE     242364 242559       qemu   70u      CHR              235,4       0t0    3925324 /dev/vfio/76
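
Note that every lsof row above has the same PID (242364) and the same descriptor (70u); the second column on the other rows is a thread ID, so this is one open file descriptor in one QEMU process rather than ten. A rough sketch of an equivalent check that walks /proc directly and reports each (PID, fd) pair once is shown below. The function name and the proc_root parameter are illustrative, and it needs enough privilege to read other processes' fd directories.

```python
import os


def pids_holding(path, proc_root="/proc"):
    """Return sorted (pid, fd) pairs for processes that have `path` open."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():          # skip /proc/self, /proc/meminfo, ...
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:                  # process exited or access denied
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:              # descriptor closed while scanning
                continue
            if target == path:
                holders.append((int(entry), int(fd)))
    return sorted(holders)
```

For the situation in this post, pids_holding("/dev/vfio/76") would be expected to return a single entry for the qemu-kvm process.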

Roughly 100 seconds after the NVMe drive was removed, the following
kernel messages start to appear on the host:
[67820.179749] vfio-pci 0000:42:00.0: Relaying device request to user (#10)
[67900.272468] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
[67900.272652] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
[67900.319284] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars

I also noticed that these messages for the EP that is currently down
seem to continue indefinitely on the host (every 100 seconds):
[67920.181882] vfio-pci 0000:42:00.0: Relaying device request to user (#20)
[68020.184945] vfio-pci 0000:42:00.0: Relaying device request to user (#30)
[68120.188209] vfio-pci 0000:42:00.0: Relaying device request to user (#40)
[68220.190397] vfio-pci 0000:42:00.0: Relaying device request to user (#50)
[68320.192575] vfio-pci 0000:42:00.0: Relaying device request to user (#60)

But perhaps that is expected behavior. In any case, the problem comes
when I re-insert the NVMe drive into the system... on the host, we see
the link-up event:
[68418.595101] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Up
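
The "every 100 seconds" observation about the relay messages can be checked mechanically. The sketch below is only an illustration: it parses the host log lines quoted in this post, with a regular expression written for exactly that message format, and reports the request counter and the spacing between messages.

```python
import re

# Host kernel messages quoted in the post (unrelated lines are ignored).
HOST_LOG = """\
[67720.177959] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Down
[67720.178027] vfio-pci 0000:42:00.0: Relaying device request to user (#0)
[67820.179749] vfio-pci 0000:42:00.0: Relaying device request to user (#10)
[67900.272468] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
[67920.181882] vfio-pci 0000:42:00.0: Relaying device request to user (#20)
[68020.184945] vfio-pci 0000:42:00.0: Relaying device request to user (#30)
[68120.188209] vfio-pci 0000:42:00.0: Relaying device request to user (#40)
[68220.190397] vfio-pci 0000:42:00.0: Relaying device request to user (#50)
[68320.192575] vfio-pci 0000:42:00.0: Relaying device request to user (#60)
"""

RELAY_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+vfio-pci\s+(?P<bdf>\S+):\s+"
    r"Relaying device request to user \(#(?P<count>\d+)\)"
)


def relay_events(log_text):
    """Return (timestamp, device, counter) for each relay message."""
    events = []
    for line in log_text.splitlines():
        m = RELAY_RE.search(line)
        if m:
            events.append((float(m["ts"]), m["bdf"], int(m["count"])))
    return events


def relay_intervals(events):
    """Return the gaps between consecutive relay messages, in whole seconds."""
    return [round(b[0] - a[0]) for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = relay_events(HOST_LOG)
    print("counters :", [count for _, _, count in events])
    print("intervals:", relay_intervals(events))
```

The counter goes up by 10 for every logged message while the messages are about 100 seconds apart. That would be consistent with one unplug request roughly every 10 seconds, with only every tenth one logged, but this is an inference from the numbers and is not stated anywhere in the thread.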

But the device is not bound to the 'vfio-pci' driver:
# ls -ltr /sys/bus/pci/devices/0000\:42\:00.0/driver
ls: cannot access /sys/bus/pci/devices/0000:42:00.0/driver: No such file or directory

And binding it manually appears to fail:
# echo "0000:42:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
-bash: echo: write error: No such device

Device is enabled:
# cat /sys/bus/pci/devices/0000\:42\:00.0/enable
1
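
The three sysfs checks above (device present, driver bound, enable flag) can be gathered in one place. This is a minimal sketch assuming the sysfs paths used in the commands above; the function name, the returned dictionary keys, and the sysfs_root parameter are illustrative.

```python
import os


def device_state(bdf, sysfs_root="/sys"):
    """Summarise presence, bound driver and enable flag of a PCI device."""
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", bdf)
    if not os.path.isdir(dev):
        return {"present": False, "driver": None, "enabled": None}

    # 'driver' is a symlink to the bound driver; it is absent when unbound.
    driver_link = os.path.join(dev, "driver")
    driver = None
    if os.path.islink(driver_link):
        driver = os.path.basename(os.readlink(driver_link))

    # 'enable' holds 0 or 1, as shown by the 'cat' command in the post.
    try:
        with open(os.path.join(dev, "enable")) as f:
            enabled = f.read().strip() == "1"
    except OSError:
        enabled = None

    return {"present": True, "driver": driver, "enabled": enabled}
```

For the state described in this post, the expected result for 0000:42:00.0 after re-insertion would be a device that is present and enabled but has no driver.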

So, I'm wondering if this is expected behavior? Stopping the VM and
starting it (virsh destroy/start) allows the device to work in the VM
again, but for my particular use case, this is not an option. I need
the surprise hotplug functionality to work with the PCIe EP passed into
the VM. And perhaps this is an issue elsewhere (e.g., vfio-pci). Any
tips/suggestions on where to dig more would be appreciated.

Thanks for your time.


--Marc

Alex Williamson

2020-Oct-07 02:21 UTC

Re: PCI Passthrough and Surprise Hotplug

Sorry, but nothing about what you're trying to accomplish is supported.
vfio-pci only supports cooperative hotplug, and that's what it's trying
to implement here.  The internal kernel PCI object is being torn down
even after the device has been physically removed; the PCI core is
trying to unbind it from the driver, which is where you're seeing the
device requests being relayed to the user.  The user (QEMU or guest) is
probably hung up trying to access the device that no longer exists to
respond to these unplug requests.
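
As an illustration of what a cooperative (planned) swap could look like from the libvirt side, here is a sketch that only builds the virsh command lines and does not run anything. The domain name "nvme-vm" and the file "hostdev5.xml" (assumed to contain the <hostdev> element from the original post) are hypothetical. Whether the detach completes also depends on the guest cooperating with the unplug request, which this sketch does not check.

```python
import shlex


def cooperative_swap_plan(domain, hostdev_xml_path):
    """Return the ordered virsh commands for a planned (non-surprise) swap.

    Step 1 asks the running guest to release the device before the drive is
    pulled. The physical swap happens between the two steps. Step 2 hot-adds
    the device described by the same XML file to the running guest.
    """
    return [
        ["virsh", "detach-device", domain, hostdev_xml_path, "--live"],
        ["virsh", "attach-device", domain, hostdev_xml_path, "--live"],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for step in cooperative_swap_plan("nvme-vm", "hostdev5.xml"):
        print(shlex.join(step))
```

This does not give surprise hotplug; it only shows the order of operations that the cooperative model expects, with the decision to re-attach made explicitly by an administrator or management tool.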

Finally, you've added the device back, but there's an entire chain of
policy decisions that needs to decide to bind that new device to
vfio-pci, decide that this guest should have access to that device, and
initiate a hot-add to the VM.  That simply doesn't exist.  Should this
guest still have access to the device at that bus address?  Why?  What
if it's an entirely new and different device?  Who decides?

Someone needs to decide that this is a worthwhile feature to implement
and invest time to work out all these details before it "just works".
Perhaps you could share your use case to add weight to whether this is
something that should be pursued.  The behavior you see is expected and
there is currently no ETA (or active development that I'm aware of) for
the behavior you desire.  Thanks,

Alex

Marc Smith

2020-Oct-08 00:21 UTC

Re: PCI Passthrough and Surprise Hotplug

On Tue, Oct 6, 2020 at 10:21 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
> Sorry, but nothing about what you're trying to accomplish is supported.
> vfio-pci only supports cooperative hotplug, and that's what it's trying
> to implement here.  The internal kernel PCI object is being torn down
> even after the device has been physically removed; the PCI core is
> trying to unbind it from the driver, which is where you're seeing the
> device requests being relayed to the user.  The user (QEMU or guest) is
> probably hung up trying to access the device that no longer exists to
> respond to these unplug requests.
>
> Finally, you've added the device back, but there's an entire chain of
> policy decisions that needs to decide to bind that new device to
> vfio-pci, decide that this guest should have access to that device, and
> initiate a hot-add to the VM.  That simply doesn't exist.  Should this
> guest still have access to the device at that bus address?  Why?  What
> if it's an entirely new and different device?  Who decides?
Understood, not supported currently.

>
> Someone needs to decide that this is a worthwhile feature to implement
> and invest time to work out all these details before it "just works".
> Perhaps you could share your use case to add weight to whether this is
> something that should be pursued.  The behavior you see is expected and
> there is currently no ETA (or active development that I'm aware of) for
> the behavior you desire.  Thanks,
In this case, I'm passing NVMe drives into a KVM virtual machine --
the VM is then the "application" that uses these NVMe storage devices.
Why? Good question. =)

Knowing how the current implementation works now, I may rethink this a
bit. Thanks for your time and information.

--Marc
