On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
> Hello.
>
> I have a host with an NVIDIA RTX 3090. I configured PCI passthrough
> and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
>
> The problem comes sometimes on rebooting the virtual machine. It doesn't
> happen 100% of the time, but eventually, after 3 or 4 reboots, the PCI
> device stops working. The only solution is to reboot the host.
>
> The weird thing is this only happens when rebooting the VM. After a host
> reboot, if we shut down the virtual machine and start it again,
> it works fine. I wrote a small script that does that a hundred times
> just to make sure. Only a reboot triggers the problem.
>
> When it fails I run "nvidia-smi" in the virtual machine and I get:
>
> No devices were found
>
> I also spotted some errors in syslog:
>
> NVRM: installed in this system is not supported by the
> NVIDIA 460.91.03 driver release.
> NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
> NVRM: the NVIDIA kernel module is unloaded.
> NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
> NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
>
> The device is still there, because when I type lspci I can see its information:
>
> 0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation
> Device [10de:2204] (rev a1)
> Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
> Kernel driver in use: nvidia
> Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
>
> I tried different Nvidia drivers and Linux kernels in the host and
> the virtual machine with the same results.
Hi,

this question is better suited for vfio-users at redhat.com. Once the GPU is
bound to the vfio-pci driver, it's out of libvirt's hands.
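
If you want to double-check on the host which driver the card's function is
currently bound to, reading the sysfs driver link is enough; here is a minimal
Python sketch (the PCI address is a placeholder - on the host it will differ
from the 0000:01:01.0 the guest sees, "lspci -nnk" on the host shows the right
one):

#!/usr/bin/env python3
# Print which host kernel driver a PCI function is currently bound to.
import os

BDF = "0000:01:00.0"  # placeholder: the GPU's PCI address on the host
link = f"/sys/bus/pci/devices/{BDF}/driver"

if os.path.islink(link):
    print(BDF, "is bound to", os.path.basename(os.readlink(link)))
else:
    print(BDF, "is not bound to any driver")

While the VM is running it should show vfio-pci.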
AFAIR NVIDIA only enabled PCI device assignment for GeForce cards in Windows 10
VMs, but you are running a Linux VM. Back when I worked on the vGPU stuff,
which is supported only on the Tesla cards, I remember being told that the
host and guest drivers communicate with each other. Applying the same to
GeForce, I would not be surprised if the NVIDIA host driver detected that the
corresponding guest driver is not a Windows 10 one and skipped a proper GPU
reset in between VM reboots - hence the need to reboot the host. There used to
be a similar bus reset bug in the AMD host driver not so long ago which, on
every single VM shutdown/reboot, left the card unusable until the host was
rebooted. Be that as it may, I can only speculate, and since your scenario is
officially not supported by NVIDIA, I wish you the best of luck :)
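
If you want to experiment, you could also try to reset the function from the
host yourself after shutting the VM off, instead of rebooting the whole host.
Here is a minimal sketch using the generic PCI sysfs knobs (run as root; the
address is again a placeholder, and it only helps if the device actually
honours the reset):

#!/usr/bin/env python3
# Try to recover the GPU from the host: function reset if available,
# otherwise remove the device and rescan the bus (run as root, VM shut off).
import os
import time

BDF = "0000:01:00.0"  # placeholder: host PCI address of the GPU
dev = f"/sys/bus/pci/devices/{BDF}"

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

if os.path.exists(f"{dev}/reset"):
    write(f"{dev}/reset", "1")         # kernel-driven function reset
else:
    write(f"{dev}/remove", "1")        # detach the device...
    time.sleep(1)
    write("/sys/bus/pci/rescan", "1")  # ...and re-probe it via a bus rescan

No guarantees this brings a wedged GeForce back, but it is cheaper to try than
a host reboot.
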
Regards,
Erik