Diet version:

faux-Quadro 6000 VGA passthrough, works lovely when first started.
Starting it using this script:

modprobe xen-pciback

# USB Controllers
xl pci-assignable-add 0000:00:1a.0
xl pci-assignable-add 0000:00:1d.2

# Audio
xl pci-assignable-add 0000:0d:00.0

# NIC
xl pci-assignable-add 0000:02:00.0

# GPU
xl pci-assignable-add 0000:0b:00.0
xl pci-assignable-add 0000:0b:00.1

xl create /etc/xen/edi

vinagre :0

But - shut down the domU (XP x64). Start it up again, and:

Jun 30 18:59:08 normandy kernel: irq 18: nobody cared (try booting with the "irqpoll" option)
Jun 30 18:59:08 normandy kernel: Pid: 0, comm: swapper/0 Tainted: PF O 3.9.5-1.el6xen.x86_64 #1
Jun 30 18:59:08 normandy kernel: Call Trace:
Jun 30 18:59:08 normandy kernel: NVRM: VM: nv_alloc_contig_pages: failed to DMA-map memory
Jun 30 18:59:08 normandy kernel: <IRQ> [<ffffffff810d416d>] __report_bad_irq+0x3d/0xe0
Jun 30 18:59:08 normandy kernel: [<ffffffff810d4366>] note_interrupt+0x156/0x210
Jun 30 18:59:08 normandy kernel: [<ffffffff810d1b99>] handle_irq_event_percpu+0xc9/0x210
Jun 30 18:59:08 normandy kernel: [<ffffffff810d1d21>] handle_irq_event+0x41/0x70
Jun 30 18:59:08 normandy kernel: [<ffffffff810d4a99>] handle_fasteoi_irq+0x59/0xf0
Jun 30 18:59:08 normandy kernel: [<ffffffff81302780>] __xen_evtchn_do_upcall+0x240/0x380
Jun 30 18:59:08 normandy kernel: [<ffffffff813028ff>] xen_evtchn_do_upcall+0x2f/0x50
Jun 30 18:59:08 normandy kernel: [<ffffffff8155c57e>] xen_do_hypervisor_callback+0x1e/0x30
Jun 30 18:59:08 normandy kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jun 30 18:59:08 normandy kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jun 30 18:59:08 normandy kernel: [<ffffffff8100a2a0>] ? xen_safe_halt+0x10/0x20
Jun 30 18:59:08 normandy kernel: [<ffffffff8101d166>] ? default_idle+0x46/0x100
Jun 30 18:59:08 normandy kernel: [<ffffffff8101ca99>] ? cpu_idle+0xd9/0x120
Jun 30 18:59:08 normandy kernel: [<ffffffff8153a0d5>] ? rest_init+0x75/0x80
Jun 30 18:59:08 normandy kernel: [<ffffffff81801200>] ? start_kernel+0x40e/0x41b
Jun 30 18:59:08 normandy kernel: [<ffffffff81800c10>] ? repair_env_string+0x5b/0x5b
Jun 30 18:59:08 normandy kernel: [<ffffffff818005f1>] ? x86_64_start_reservations+0x2a/0x2c
Jun 30 18:59:08 normandy kernel: [<ffffffff818045ae>] ? xen_start_kernel+0x56e/0x570
Jun 30 18:59:08 normandy kernel: handlers:
Jun 30 18:59:08 normandy kernel: [<ffffffff813ca4c0>] usb_hcd_irq
Jun 30 18:59:08 normandy kernel: [<ffffffffa02c3820>] i801_isr [i2c_i801]
Jun 30 18:59:08 normandy kernel: Disabling IRQ #18

This happens reasonably consistently. The domU comes up only in VGA
16-colour mode on VNC.

But - I've found that shutting it down, doing:
xl mem-set 0 48G
(the machine has 48GB of RAM) makes it work fine again.

IRQ18 is used by two USB controllers (of which one I'm PCI
passthrough-ing) and a SMBus (i2c_i801) controller.

The thing that bothers me is that NVRM seems to be what's complaining,
but the GPU being passed through is firmly under control of xen-pciback.

I don't suppose anyone might have an idea on how to gain some useful
debug info for a bug report out of this?

Gordan
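For a bug report it may help to capture exactly which handlers share IRQ 18 at the moment the problem occurs. The following is a minimal sketch, assuming bash and awk in dom0; the function name irq_line and its optional file argument are made up for illustration and are not part of any Xen tool.

```shell
#!/bin/bash
# Print the /proc/interrupts line for a single IRQ number, so that the
# drivers sharing that line can be pasted into a report.
# $1 = IRQ number, $2 = file to read (defaults to /proc/interrupts;
#      the override exists only so a saved copy can be inspected).
irq_line() {
    local irq=$1 file=${2:-/proc/interrupts}
    # The first field of each row is "<irq>:", so match it exactly.
    awk -v want="${irq}:" '$1 == want' "$file"
}

# Example (run in dom0 before and after the domU restart):
#   irq_line 18
```

Comparing the output from before the first domU start with the output after the failing restart would show whether the set of handlers on the line has changed.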
Gordan Bobic
2013-Jul-01 18:34 UTC
Re: Odd domU Reboot Bug (possibly VGA passthrough related)
On 07/01/2013 07:22 PM, Gordan Bobic wrote:
[...]
> This happens reasonably consistently. The domU comes up only in VGA
> 16-colour mode on VNC.
>
> But - I've found that shutting it down, doing:
> xl mem-set 0 48G
> (the machine has 48GB of RAM) makes it work fine again.
[...]

Additional - after a while this stops helping - and every time the VM is
restarted, the dom0 memory drops by a further 2GB, i.e.

Initial: dom0: 48GB
start domU: 46GB/2GB
xl mem-set 0 48GB: dom0: 48GB
start domU, crash: 44GB/2GB
xl mem-set 0 48GB: dom0: 48GB
start domU, crash: 42GB/2GB

I could have sworn this wasn't happening before. I'll try to work out if
a recent XSA patch broke something again.

Gordan
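To turn the 2GB-per-restart drop into something that can be attached to a report, the dom0 allocation can be logged around every step. This is only a sketch, assuming bash, awk and the usual `xl list` column layout (Name, ID, Mem in MiB, ...); the helper names dom0_mem_mib and log_dom0_mem are invented here.

```shell
#!/bin/bash
# Read `xl list` output on stdin and print the Mem column (in MiB)
# of the domain whose ID is 0, i.e. dom0.
dom0_mem_mib() {
    awk '$2 == "0" { print $3 }'
}

# Print a timestamped line with the current dom0 allocation.
# $1 = free-form label for the step, e.g. "before start" or "after crash".
log_dom0_mem() {
    printf '%s %s dom0=%sMiB\n' "$(date '+%F %T')" "$1" "$(xl list | dom0_mem_mib)"
}

# Example:
#   log_dom0_mem "before start"; xl create /etc/xen/edi; log_dom0_mem "after start"
```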
Ian Campbell
2013-Jul-02 08:42 UTC
Re: Odd domU Reboot Bug (possibly VGA passthrough related)
On Mon, 2013-07-01 at 19:22 +0100, Gordan Bobic wrote:
> The thing that bothers me is that NVRM seems to be what's complaining,
> but the GPU being passed through is firmly under control of xen-pciback.

Do the xl -vvv logs or the logs under /var/log/xen/ say anything about
rebinding the device at all?

AIUI pci-assignable-add is supposed to unbind the original driver and
bind to pciback and nothing is supposed to rebind until
pci-assignable-remove, but perhaps something is (inadvertently)
happening on domain shutdown too?

If you examine /sys you should be able to see which driver is bound to
the device, which might give a clue.

If you just nuke the NV driver from dom0 altogether does that help? What
about if you hide the device via the kernel command line rather than
dynamically (assuming that works in your setup)?

Ian.
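The /sys check suggested above can be sketched as follows. It relies on the standard sysfs layout, where each PCI device directory has a `driver` symlink when a driver is bound; the function name bound_driver and the optional sysfs-root argument are hypothetical and exist only to make the snippet easy to try out.

```shell
#!/bin/bash
# Print the name of the driver bound to a PCI device, or "(none)".
# $1 = full PCI address (e.g. 0000:0b:00.0)
# $2 = sysfs root, defaults to /sys (override only for experimenting)
bound_driver() {
    local bdf=$1 sysfs=${2:-/sys}
    local link="$sysfs/bus/pci/devices/$bdf/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/pciback
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

# Example, using the addresses from the script earlier in the thread:
#   for dev in 0000:0b:00.0 0000:0b:00.1 0000:0d:00.0; do
#       printf '%s %s\n' "$dev" "$(bound_driver "$dev")"
#   done
```

Running the loop before the first start, after shutdown and after the failing restart would show whether anything other than pciback ever claims the devices.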
Gordan Bobic
2013-Jul-02 20:44 UTC
Re: Odd domU Reboot Bug (possibly VGA passthrough related)
On 07/02/2013 09:42 AM, Ian Campbell wrote:
> On Mon, 2013-07-01 at 19:22 +0100, Gordan Bobic wrote:
>> The thing that bothers me is that NVRM seems to be what's complaining,
>> but the GPU being passed through is firmly under control of xen-pciback.
>
> Do the xl -vvv logs or the logs under /var/log/xen/ say anything about
> rebinding the device at all?

Nothing at all.

> AIUI pci-assignable-add is supposed to unbind the original driver and
> bind to pciback and nothing is supposed to rebind until
> pci-assignable-remove, but perhaps something is (inadvertently)
> happening on domain shutdown too?
>
> If you examine /sys you should be able to see which driver is bound to
> the device, which might give a clue.

I'm quite certain it never unbinds - lspci -vvv shows the device still
being handled by the pciback driver.

> If you just nuke the NV driver from dom0 altogether does that help? What
> about if you hide the device via the kernel command line rather than
> dynamically (assuming that works in your setup)?

I added xen-pciback module to initramfs and made sure it loads. I still
have to add the USB controllers manually, though, because the USB driver
appears to be built in on my kernel. Either way, this doesn't change the
situation, still works fine after a fresh reboot, but not after a full
VM shutdown.

The pattern of events is quite consistent:

1) Fresh boot - all works fine. Shut down the domU.
See attached qemu-dm-edi.log.3

2) Try booting the domU - locks up during boot as soon as it tries to
initialize the GPU (there's a flash of desktop background and the mouse
pointer, but it goes black before the login screen shows up and never
comes back). Have to terminate it using "xl destroy edi".
See attached qemu-dm-edi.2

3) Try booting domU again - it will get to the desktop in VNC, but only
in 16 colour VGA mode, while still thinking it's running on the Quadro
card. Shuts down cleanly.
See attached qemu-dm-edi.1

4) Try booting domU again - hard lock-up of the host. Have to hard-reset
it (actually, not sure if it's a complete hard lock-up on the host, I
haven't yet tried ssh-ing to it after that happens).

Just looking through /var/log/messages for clues, and I can see this on
the 2nd domU start:

Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: device [8086:340a] error status/mask=00004000/00000000
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: [14] Completion Timeout (First)
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0d:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.0: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pciback 0000:0b:00.1: xen-pciback device is not connected or owned by HVM, kill it
Jul 2 21:13:46 normandy kernel: pcieport 0000:00:03.0: AER: Device recovery successful

lspci shows that device 00:03.0 is the Intel PCIe bridge on which three
of the passed through devices are:

1) Quadro 6000 (well, a modified GTX480, but close enough to make no
difference)
2) Nvidia audio on the Nvidia card
3) Sound Blaster PCIe

So I'm wondering if this might be a problem with either:

1) another PCI memory stomp going on, since the symptoms are similar to
what I was seeing before with > 2GB assigned to domU (but why would it
only happen on second and subsequent domU startups (domU restarts
trigger it, too)?)

or

2) a PCIe bridging anomaly due to the VGA card being on the same bridge
as another device - Thinking about it, I did add the sound card to the
machine recently, and not only is it on the same Intel PCIe bridge ->
Nvidia NF200 PCIe bridge, but the sound card has its own PCIe->PCI
bridge on it, so it's doubly bridged for extra weirdness.

Time to start experimenting with different slots again, it seems...

Gordan
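When moving cards between slots, it can be useful to record which bridges sit above each passed-through device. A minimal sketch, assuming bash with GNU readlink and the standard sysfs layout, in which the resolved path of a PCI device lists every upstream bridge as a directory component; the function name upstream_chain and the optional sysfs-root argument are made up for this example.

```shell
#!/bin/bash
# Print the chain of PCI addresses from the root port down to the device,
# one per line, by resolving the device's sysfs symlink.
# $1 = full PCI address (e.g. 0000:0b:00.0)
# $2 = sysfs root, defaults to /sys (override only for experimenting)
upstream_chain() {
    local bdf=$1 sysfs=${2:-/sys}
    local real
    real=$(readlink -f "$sysfs/bus/pci/devices/$bdf") || return 1
    # Keep only path components that look like domain:bus:device.function.
    printf '%s\n' "$real" | tr '/' '\n' |
        grep -E '^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$'
}

# Example, comparing the GPU with the sound card:
#   upstream_chain 0000:0b:00.0
#   upstream_chain 0000:0d:00.0
```

Two devices whose chains begin with the same entries share those bridges, which is the situation described above for 00:03.0; `lspci -tv` shows the same information as a tree.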
Ian Campbell
2013-Jul-03 09:18 UTC
Re: Odd domU Reboot Bug (possibly VGA passthrough related)
On Tue, 2013-07-02 at 21:44 +0100, Gordan Bobic wrote:
> On 07/02/2013 09:42 AM, Ian Campbell wrote:
> > On Mon, 2013-07-01 at 19:22 +0100, Gordan Bobic wrote:
> >> The thing that bothers me is that NVRM seems to be what's complaining,
> >> but the GPU being passed through is firmly under control of xen-pciback.
> >
> > Do the xl -vvv logs or the logs under /var/log/xen/ say anything about
> > rebinding the device at all?
>
> Nothing at all.
>
> > AIUI pci-assignable-add is supposed to unbind the original driver and
> > bind to pciback and nothing is supposed to rebind until
> > pci-assignable-remove, but perhaps something is (inadvertently)
> > happening on domain shutdown too?
> >
> > If you examine /sys you should be able to see which driver is bound to
> > the device, which might give a clue.
>
> I'm quite certain it never unbinds - lspci -vvv shows the device still
> being handled by the pciback driver.

Very strange that the NV driver is getting involved then.

> > If you just nuke the NV driver from dom0 altogether does that help? What
> > about if you hide the device via the kernel command line rather than
> > dynamically (assuming that works in your setup)?
>
> I added xen-pciback module to initramfs and made sure it loads. I still
> have to add the USB controllers manually, though, because the USB driver
> appears to be built in on my kernel. Either way, this doesn't change the
> situation, still works fine after a fresh reboot, but not after a full
> VM shutdown.

But did you remove the nv.ko from dom0 altogether, ensuring it is never
loaded?

> The pattern of events is quite consistent:
[...]
> Time to start experimenting with different slots again, it seems...

I'm afraid most of the intricacies of this stuff are completely beyond
me. Your theory about bridges and slots sounds plausible so far as I am
qualified to comment though.

Ian.
Gordan Bobic
2013-Jul-03 10:41 UTC
Re: Odd domU Reboot Bug (possibly VGA passthrough related)
On Wed, 3 Jul 2013 10:18:57 +0100, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2013-07-02 at 21:44 +0100, Gordan Bobic wrote:
>> On 07/02/2013 09:42 AM, Ian Campbell wrote:
>> > On Mon, 2013-07-01 at 19:22 +0100, Gordan Bobic wrote:
>> >> The thing that bothers me is that NVRM seems to be what's complaining,
>> >> but the GPU being passed through is firmly under control of xen-pciback.
>> >
>> > Do the xl -vvv logs or the logs under /var/log/xen/ say anything about
>> > rebinding the device at all?
>>
>> Nothing at all.
>>
>> > AIUI pci-assignable-add is supposed to unbind the original driver and
>> > bind to pciback and nothing is supposed to rebind until
>> > pci-assignable-remove, but perhaps something is (inadvertently)
>> > happening on domain shutdown too?
>> >
>> > If you examine /sys you should be able to see which driver is bound to
>> > the device, which might give a clue.
>>
>> I'm quite certain it never unbinds - lspci -vvv shows the device still
>> being handled by the pciback driver.
>
> Very strange that the NV driver is getting involved then.

That may have been just a fluke - it doesn't happen every time. Once the
PCI memory space starts getting stomped all over, all bets are off WRT
what might happen.

Speaking of which - does qemu-xen in 4.2.x allocate the BARs
consistently / deterministically? I'm wondering if this could be caused
by the first initialization getting one set of BAR ranges, but the
second time it gets mapped somewhere else, and something between
qemu-xen, the driver and the card itself gets confused and goes wrong.

Which also leads me to wondering if always ensuring that pBAR = vBAR
might be a good and desirable thing for everything (which might also
improve passthrough compatibility with VGA and other BAR-heavy devices).

>> > If you just nuke the NV driver from dom0 altogether does that help? What
>> > about if you hide the device via the kernel command line rather than
>> > dynamically (assuming that works in your setup)?
>>
>> I added xen-pciback module to initramfs and made sure it loads. I still
>> have to add the USB controllers manually, though, because the USB driver
>> appears to be built in on my kernel. Either way, this doesn't change the
>> situation, still works fine after a fresh reboot, but not after a full
>> VM shutdown.
>
> But did you remove the nv.ko from dom0 altogether, ensuring it is never
> loaded?

If you are referring to nvidia.ko, no, I didn't - I need it for dom0 to
work properly. nvidiafb.ko is explicitly blacklisted (as is nvidia.ko
but the nvidia Xorg driver loads it anyway).

>> The pattern of events is quite consistent:
> [...]
>> Time to start experimenting with different slots again, it seems...
>
> I'm afraid most of the intricacies of this stuff are completely beyond
> me. Your theory about bridges and slots sounds plausible so far as I am
> qualified to comment though.

Last time I was fighting this with PCI memory stomps, making sure that
the VGA card was the only thing on the PCIe bridge chain seemed to help,
and the symptoms were very similar WRT AER errors getting thrown all
over the place. Potentially another problem that might implicitly go
away if pBAR=vBAR were to become the default...

Gordan
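The host-side half of the pBAR/vBAR comparison can be recorded from dom0. The sketch below assumes bash and the standard sysfs `resource` file, which holds one "start end flags" line per region in hexadecimal; the function name host_bars and the optional sysfs-root argument are hypothetical. It does not show the guest-side (vBAR) addresses, which would have to be read from inside the domU or from the device model's own logs.

```shell
#!/bin/bash
# List the non-empty regions (BARs and the expansion ROM) of a PCI device
# as seen by the host, with their start address and size in bytes.
# $1 = full PCI address (e.g. 0000:0b:00.0)
# $2 = sysfs root, defaults to /sys (override only for experimenting)
host_bars() {
    local bdf=$1 sysfs=${2:-/sys}
    local idx=0 start end flags
    while read -r start end flags; do
        # Unused regions are reported as all zeroes; skip them.
        if [ "$((end))" -ne 0 ]; then
            printf 'region %d: start=%s size=%d bytes\n' \
                "$idx" "$start" "$(( end - start + 1 ))"
        fi
        idx=$((idx + 1))
    done < "$sysfs/bus/pci/devices/$bdf/resource"
}

# Example: save the output before the first domU start and after the
# failing restart, then diff the two files.
#   host_bars 0000:0b:00.0 > bars.before
```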