Whrrr, xen-users! Once more someone having trouble with the elusive, gnarly, extensive device passthrough is here, and this time, that someone is I. Ahem.

--- The background ---

Here's what I'm trying to do: I have a machine with an Asus P8B WS mainboard and an Intel Xeon E3-1230 processor which I'd like to have running three Xen domains: the dom0 Heterodyne, and the two domUs Quail and Furn. (The software identifiers are all lowercase.) All domains are running Debian GNU/Linux with at least Linux 3.0.0, though the userspace configuration varies considerably. Furn is currently a PV domain and Quail an HVM domain, though I might change them to both be HVM later on. I'm planning to partition the CPU cores between the domains by pinning the vcpus, in case that makes any difference.

This machine has a number of PCI devices, both on the mainboard and offboard, which I would like to divide up between the domains. I'm aiming for the following allocation, based on dom0 lspci:

Heterodyne:
  00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
  05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
  06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
  07:00.0 USB Controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
  [and everything else]

Quail:
  00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
  01:00.0 VGA compatible controller: ATI Technologies Inc RV710 [Radeon HD 4350]
  01:00.1 Audio device: ATI Technologies Inc RV710/730
  08:00.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
  08:03.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)

Furn (optional):
  00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
  00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 05)

I do _not_ necessarily want to attach the VGA as one that Quail's HVM BIOS can find; I'm happy with it being registered as a secondary graphics card, with the primary one being the emulated Cirrus VGA accessible over VNC, and then having the guest OS load drivers for the device and reinitialize it. This would seem to sidestep a lot of the work listed in the XenVGAPassthrough page on the wiki, since that refers to making the VGA work for the BIOS.

I'm currently running Xen 4.1.1 (Debian 4.1.1-2). The mainboard has VT-d IOMMU support, and it is enabled in the BIOS (which is currently version 0605 from Asus, though I'm planning to upgrade to 0704, since that supposedly fixes the bogus ECC issue in the current version). Xen and the dom0 Linux, in the plainest configuration, are loaded with:

  /boot/xen-4.1-amd64.gz placeholder iommu=1 console=com1,vga com1=38400,8n1 dom0_mem=512M dom0_max_vcpus=2 dom0_vcpus_pin
  /boot/vmlinuz-3.0.0-1-amd64 placeholder root=UUID=<...> ro quiet console=tty0 console=hvc0 mem=512M
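(For anyone reproducing this: those lines come out of the usual Debian update-grub machinery, I believe via /etc/grub.d/20_linux_xen. A minimal sketch of the /etc/default/grub variables that would produce roughly those command lines, with the variable names assumed from that helper rather than pasted from my actual configuration, is:

  # /etc/default/grub fragment (sketch only; check /etc/grub.d/20_linux_xen for the exact variable names)
  GRUB_CMDLINE_XEN_DEFAULT="iommu=1 console=com1,vga com1=38400,8n1 dom0_mem=512M dom0_max_vcpus=2 dom0_vcpus_pin"
  GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=hvc0 mem=512M"

followed by « update-grub ». Note that GRUB_CMDLINE_LINUX_DEFAULT also applies to non-Xen boot entries.)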
--- The saga ---

I'm going to reproduce these steps as I'm writing this, so it should be a fairly accurate accounting as far as results go, though it may not reflect exactly what I went through earlier. My first priorities are to get the Radeon, one USB controller, and the PCI audio device attached to Quail, in that order; once I can manage that, it seems likely that Furn's configuration will fall into place.

All of the below is done with Furn absent as a Xen domain. Auxiliary files that are too large to include inline are available from http://dasyatidae.net/2011/xen-users-265/.

ATTEMPT #1: "Hot Cross Plugs"

To avoid interference, I preëmptively blacklist the drivers radeon, radeonfb, snd_hda_intel, and snd_virtuoso in the dom0 Linux configuration, using /etc/modprobe.d/blacklist.conf. (snd_hda_intel would also normally bind to the secondary function of the Radeon HD 4350.) The dom0 is running Debian testing (wheezy), and the domU is running unstable (sid).

Boot. lsmod reveals that none of the aforementioned modules have been loaded; there is still a VGA console visible through the Radeon. dmesg output is in 01-dom0-dmesg.txt and 01-xen-dmesg.txt. Note that at boot, dom0 complains about PCI address space overlapping video ROM. The IOMMU is in fact initialized according to Xen.

Boot domU, with VNC open and configuration from 01-quail-cfg.txt. VNC has a getty running on the emulated Cirrus VGA, and xm console attaches to the HVM serial device properly. Linux in the domU is loaded as:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1

Full dmesg output is in 01-quail-dmesg-boot.txt.

Run « stubify 0000:01:00.0 » and « stubify 0000:01:00.1 » on the dom0; stubify is a shell script that rebinds a PCI device to the pci-stub driver (a rough sketch of the idea appears at the end of this attempt). xen-pcifront and xen-pciback are nowhere to be found on either the dom0 or the domU kernel, as far as I can detect---I suppose the Debian configuration doesn't come with either of them.

  root@heterodyne:~# xm pci-list-assignable-devices
  0000:01:00.0
  0000:01:00.1
  root@heterodyne:~# xm pci-attach quail 0000:01:00.0
  root@heterodyne:~# xm console quail
  [:: ... log in as root ... ::]
  root@quail:~# modprobe pci-hotplug
  root@quail:~# echo 1 >/sys/bus/pci/rescan
  root@quail:~# [ 450.499603] radeon 0000:00:04.0: Fatal error during GPU init
  [ 450.518377] [TTM] Trying to take down uninitialized memory manager type 1
  root@quail:~#

Quail's next dmesg entries (01-quail-dmesg-hotadd.txt) suggest that it can't assign PCI resources for the video RAM, and the Radeon driver falls over with a zero access window. qemu-dm spits out messages in 01-qemu-log-hotadd.txt.
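(stubify is nothing exotic; the idea is the standard sysfs dance to hand a device to pci-stub. A minimal sketch of that idea, not the literal script, and assuming pci-stub is built in or already loaded, would be:

  #!/bin/sh
  # Sketch of a stubify-style helper: rebind one PCI function to pci-stub.
  # Usage: stubify 0000:01:00.0
  set -e
  dev="$1"
  sysdev="/sys/bus/pci/devices/$dev"
  # Vendor and device IDs, e.g. 0x1002 and 0x954f.
  vendor=$(cat "$sysdev/vendor")
  device=$(cat "$sysdev/device")
  # Detach whatever driver currently owns the function, if any.
  if [ -e "$sysdev/driver" ]; then
      echo "$dev" > "$sysdev/driver/unbind"
  fi
  # Teach pci-stub about this ID; it should then claim the unbound device.
  echo "$vendor $device" > /sys/bus/pci/drivers/pci-stub/new_id
  # Bind explicitly in case the new_id write didn't trigger a probe.
  if [ ! -e "$sysdev/driver" ]; then
      echo "$dev" > /sys/bus/pci/drivers/pci-stub/bind
  fi

That's the gist of it, anyway.)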
ATTEMPT #2: "A Bridge over Stormy SeaBIOS"

My guess is that the HVM emulated PCI bridge isn't leaving a large enough window to allocate MMIO space for the video RAM. So I tentatively append pci=nocrs to Quail's Linux command line to see whether that'll convince it to use better window areas. The full domU command line is now:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs

dmesg output is in 02-quail-dmesg-boot.txt. I do the same xm pci-attach and rescan as before. This time, Quail hangs hard for a while, then does this:

  [ 172.050122] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:1686]
  [ 172.053355] Stack:
  [ 172.054119] Call Trace:
  [ 172.054119] Code: 88 03 00 00 b8 ea ff ff ff 78 22 3b b7 70 03 00 00 77 1a c1 e6 03 48 81 e2 00 f0 ff ff 48 63 f6 48 83 ca 67 48 8d 34 31 48 89 16

« xm pci-detach quail 0000:01:00.0 » yields "Error: Timed out waiting for device model action", so I destroy the domain with « xm destroy quail ». 02-qemu-log-hotadd.txt suggests that Quail tried to map the graphics card into a high (>= 4 GiB) MMIO region and that qemu-dm's PCI emulation blew chunks all over the memory map as a result. /var/log/messages from Quail contains the fragment in 02-quail-messages-hotadd.txt, which seems to confirm this.

ATTEMPT #3: "Boom! Address Space Klotski"

I change Quail's configuration to set memory = '3500', leading to 03-quail-cfg.txt. pci=nocrs is still in effect. After boot and hot-add, 03-quail-dmesg.txt results; it seems to have mapped the GPU correctly and initialized it. Indeed, at this point the DVI output of the Radeon card is correctly showing Quail's Linux framebuffer console. However, dom0 grimaces and spits out the following on the serial console:

  (XEN) vmsi.c:122:d32767 Unsupported delivery mode 3

Oops. qemu-dm has meanwhile generated 03-qemu-log.txt.

I create quail:/etc/X11/xorg.conf with the following text:

  Section "Device"
      Identifier "primary screen"
      Driver "radeon"
      BusID "PCI:0:4:0"
  EndSection

to force it to use the Radeon card. (I haven't forced pci-attach to use a specific virtual slot yet, so this isn't necessarily reliable between boots, but that should be easy enough to fix ex post facto; see the note after this attempt.) Starting X as root yields an X display on the Radeon! glxinfo yields 03-glxinfo.txt, suggesting that the Radeon DRI is working. Running glxgears claims that it is synchronized to the vertical retrace, but no output appears. Whoops. 2D output to X mostly works, but is horribly slow and full of tearing artifacts. This suggests to me that the vsync is broken. The vmsi error above suggests that this is because an MSI from the Radeon card is not being delivered. Right then!
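(On the slot question: if I remember the xm syntax right, pci-attach takes an optional virtual-slot argument after the device address, so pinning the guest-side address, and hence keeping the BusID in xorg.conf valid across boots, should be a one-argument change. A sketch, with the slot number assumed rather than tested:

  # Request virtual slot 4 so the card shows up as 00:04.0 in the guest;
  # double-check « xm help pci-attach » before relying on this.
  xm pci-attach quail 0000:01:00.0 4

)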
ATTEMPT #4: "Friendly MSI have been destroy!"

I alter Quail's Linux boot line to include pci=nomsi (combining it with the existing option). The new full Linux command line is:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs,nomsi

dmesg after domU boot yields 04-quail-dmesg-boot.txt. xm pci-attach, then rescan the PCI bus in the domU. The Radeon is initialized successfully. I start X as root. glxinfo returns the exact same results. Running glxgears displays some rotating gears:

  328 frames in 5.0 seconds = 65.401 FPS

This is indeed correct!

But then I attempt to attach the USB controller. I run « stubify 0000:00:1d.0 » in the dom0, and get a correct-looking:

  [ 3068.112581] ehci_hcd 0000:00:1d.0: remove, state 4
  [ 3068.112588] usb usb2: USB disconnect, device number 1
  [ 3068.112590] usb 2-1: USB disconnect, device number 2
  [ 3068.131201] ehci_hcd 0000:00:1d.0: USB bus 2 deregistered
  [ 3068.131242] ehci_hcd 0000:00:1d.0: PCI INT A disabled
  [ 3068.131341] pci-stub 0000:00:1d.0: claimed by stub

And the moment of truth:

  root@heterodyne:~# xm pci-list-assignable-devices
  0000:00:1d.0
  root@heterodyne:~# xm pci-attach quail 0000:00:1d.0

But the serial console coughs up:

  (XEN) physdev.c:182: dom7: no free pirq

Uh-oh. Quail's dmesg reveals:

  [ 259.241751] pci 0000:00:05.0: [8086:1c26] type 0 class 0x000c03
  [ 259.242090] pci 0000:00:05.0: reg 10: [mem 0x00000000-0x00000fff]
  [ 259.244192] pci 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
  [ 259.244196] pci 0000:00:05.0: BAR 0: assigned [mem 0xdc030000-0xdc030fff]
  [ 259.244276] pci 0000:00:05.0: BAR 0: set to [mem 0xdc030000-0xdc030fff] (PCI address [0xdc030000-0xdc030fff])
  [ 259.304377] usbcore: registered new interface driver usbfs
  [ 259.304392] usbcore: registered new interface driver hub
  [ 259.304412] usbcore: registered new device driver usb
  [ 259.305635] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
  [ 259.305727] xen map irq failed -22
  [ 259.305729] ehci_hcd 0000:00:05.0: PCI INT A: failed to register GSI

And indeed, qemu-dm complains of:

  dm-command: hot insert pass-through pci dev
  register_real_device: Assigning real physical device 00:1d.0 ...
  register_real_device: Enable MSI translation via per device option
  register_real_device: Disable power management
  pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1d.0x0
  pt_register_regions: IO region registered (size=0x00000400 base_addr=0xfe706000)
  register_real_device: Error: Mapping irq failed, rc = -1
  register_real_device: Real physical device 00:1d.0 registered successfuly!
  IRQ type = INTx
  pt_pci_write_config: Warning: Guest attempt to set address to unused Base Address Register. [00:05.0][Offset:30h][Length:4]
  pt_iomem_map: e_phys=dc030000 maddr=fe706000 type=0 len=4096 index=0 first_map=1

Ow! (The full log is in 04-qemu-log.txt.) The rescan also convinces Xen that:

  (XEN) irq.c:1817: dom7: invalid pirq -28 or emuirq 36

At this point I got sort of stumped and started paging through Xen source code trying to figure out what in the world was going on, though I didn't get very far.

--- The quandary ---

My best guess at this stage is that:

- Xen is deriving the number of GSIs and MSIs available from the host APIC, and this is somehow carrying over to the domU interrupts.
- For some reason, the MSI format being used is not supported by the PCI passthrough code, or else it's misdetecting what sort of MSI is coming down the line (I'm not intimately familiar enough with PCI to know how this works (yet)).
- Since Xen and qemu-dm only support attaching devices to separate virtual PCI buses in the domU, they can't share GSI interrupts. (I wouldn't expect them to work on the same virtual bus anyway, given that some are PCI Express and expect a point-to-point link, but maybe it'd work if they're not too picky.)
- On this modern machine, the number of GSIs available is severely diminished under the expectation that almost all mainboard and add-on devices will use MSI. (E.g., there's only one PCI (as opposed to PCI-E) slot.) This carries over to the domU. Combined with the above, we run out of GSIs trying to attach the second device. (A quick way to eyeball this from inside the guest is sketched after this list.)
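(For the GSI-exhaustion part, the crude check I have in mind is just to look at what the guest has already claimed before the hot-add, something like the following; none of this is conclusive, it just shows which interrupt lines are occupied and how the emulated firmware routed them:

  # Inside Quail: which IRQ lines are already in use, and by what.
  cat /proc/interrupts
  # How the emulated ACPI/IOAPIC routed PCI interrupts at boot.
  dmesg | grep -i -e ioapic -e 'pci interrupt link' -e gsi

)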
Note that:

- None of the devices involved support function-level reset that I know of (via « lspci -vv » in dom0). This does not appear to actually be a problem in practice; I have successfully attached the USB controller by itself to Quail, or the PCI audio device by itself. I can give « lspci -vv » output if it's useful, but not right now, since I'd have to reboot the dom0 again to do it and I'm running out of time.
- I'm not currently attaching the second function of the Radeon card. This again does not appear to cause a problem, even though theoretically one is supposed to attach both functions. I think I tried attaching both of them simultaneously, and it didn't work due to this not being a bleeding-edge Xen, but I don't have the details at the moment; attaching them both as separate PCI passthrough devices didn't obviously do anything useful. (I'd actually prefer to avoid attaching that second function if possible; it'd be very convenient to not have to bother with a worthless second audio device gunking up Quail's ALSA configuration.)
- Last I checked, trying to boot Quail with any PCI passthrough devices already attached (a sketch of what I mean follows these notes) caused a failure to find the root filesystem. I haven't gone back through this to check again, but I believe what is happening is that the Xen domU platform PCI device that the PV-on-HVM drivers use(?) is found, and the driver disables the emulated disks as a result, but then its interrupt attachment goes haywire because the available interrupts have all been used up by the passthrough devices, so communication with the Xen backends breaks down, which grinds everything to a halt because various essential devices (such as the disks) are already gone. I can go back and try this again if more specific information on it would be useful.
- Previously I was running the dom0 kernel with pci=nocrs as well, which avoided the PCI address space overlapping video ROM message present in 01-dom0-dmesg.txt. This doesn't seem to have made a significant difference in behavior at any of the steps.
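(By "already attached" above I mean the ordinary pci list in the domU configuration instead of xm pci-attach. A sketch of the fragment, using the usual xm config syntax rather than anything pasted from my actual 03-quail-cfg.txt, would be:

  # Pass these host devices through at domain creation time (sketch only).
  pci = [ '0000:01:00.0', '0000:00:1d.0' ]

)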
My questions:

- Does my tentative analysis seem sound? If not, what have I miscalculated? Or am I on the wrong track from the very beginning?
- Will upgrading to a bleeding-edge Xen make any difference? I'd rather not do this unless necessary, since I'd prefer a stabler machine with a more well-tested hypervisor, but I'm well aware that what I'm trying to do is fairly obtuse and may require jiggering software around.
  + A diff of xen/arch/x86/hvm/vmsi.c suggests that it will _not_ make the MSIs work if the delivery mode is being detected properly. vmsi_deliver seems to have the same code regarding delivery mode 3 (i.e., none; it's dest__reserved_1 in xen/include/asm-x86/io_apic.h).
  + A grep of tools/ioemu-qemu-xen/hw/pass-through.c suggests that it _might_ allow using 5000 MB of memory for the domU rather than 3500 MB without bogotifying the PCI windows, since the "Guest attempt to set high MMIO Base Address" message is gone, which would mean 64-bit BARs work; but I haven't confirmed this. Granting the domU more memory would be very good, but is not essential.
- What is the best way to proceed that might allow me to attach these devices to an HVM domU at the same time? Or am I merely hosed? In particular:
  + Could I convince Xen and/or the device model to emulate the HVM domU's APIC with more available GSIs rather than copying the model from the host (if that is indeed what it's doing)?
  + Could I patch vmsi.c to make delivery mode 3 work so that MSIs from the Radeon card will be delivered? Or is delivery mode 3 truly nonexistent, meaning that there is something else wrong with MSI passthrough? If the latter, where might I look to find out more? Searching the mailing list archives yields various patches that have appeared in the past regarding vmsi.c, but nothing that looks both relevant and unapplied.
- I would like to not have to run the domU kernel with pci=nocrs. Is there a reasonable way to make the device model allocate the PCI host bridge resources differently to make this unnecessary? It doesn't seem important by comparison to the above, but is continuing to run with pci=nocrs an indicator that I may run into problems later if I upgrade Xen or qemu-dm?
- Similarly, would running the dom0 kernel with pci=nocrs help or hurt in any particular way? It doesn't seem to have much effect, but maybe I'm missing something.

To anyone who made it this far: I am at least theoretically open to arranging reasonable bribes for the first people to help me get this to work. Inquire privately if you wish to take advantage of this. Otherwise, you still get my thanks in advance. :-)

Kyaieee!
   ---> Drake Wilson

Post Scriptum: BIOS 0704 has been unhelpful in either getting ECC enabled or doing anything to the PCI passthrough. Alas. Also, « lspci -vv » output is now available as 05-lspci-vv.txt.
Drake Wilson
2011-Sep-23 05:25 UTC
Re: [Xen-users] The saga of Heterodyne's PCI Passthrough
[...]

Hmm. Loopiness follows.

So I finally chased down somewhere on the Web saying that MSI delivery modes are similar to IPI delivery modes, and then I found the IPI types in the AMD64 architecture manual. Apparently mode 3 is a remote read from a separate local APIC. This sounds totally bogus for an interrupt coming from a graphics card, but I don't know for sure.

But after poring over the dmesg a bit more, I realized a bunch of lower interrupts were being eaten by PnP devices. I suppose these are attached to things like the emulated PS/2 keyboard. Since I don't need those (I use the serial device as an emergency maintenance port), I put pnpacpi=off pnpbios=off on the domU kernel line (the resulting command line is sketched below), and magically there are enough interrupts to pass through both the Radeon and the USB controller, or so it seems.

So that may be a good enough workaround for now. But I still want to know why those MSIs are apparently not being passed through.

Also, apparently the interrupt limit is an architectural one, which I should have known since it was hardcoded, but somewhere in there I made a logic error, I suppose (keeping in mind that I really don't know what I'm doing here).

Some primitive attempts to trace back where in the world mode 3 is coming from yielded not much, except for a twisty maze that points either into the guest OS or maybe into some sort of structure corruption in the device model; the only places I could find where a Linux write_msi happens that would alter the MSI Message Data register that way (as opposed to setting it to hardcoded elements) are related to IOMMU stuff, such as io_apic.c:3051, and some further primitive attempts to probe this yielded nothing useful.

I do notice that Linux xen_hvm_setup_msi_irqs doesn't seem to get called anywhere, or at least its printk lines don't show up in dmesg, but whether this is relevant I cannot say.

Kyaieee!
   ---> Drake Wilson
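(For concreteness, the domU kernel line with that workaround folded in would be roughly the following; this is a sketch assembled from the options mentioned above, not a paste from the actual configuration:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs,nomsi pnpacpi=off pnpbios=off

)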
Pasi Kärkkäinen
2011-Sep-26 19:07 UTC
Re: [Xen-users] The saga of Heterodyne's PCI Passthrough
On Fri, Sep 23, 2011 at 12:25:32AM -0500, Drake Wilson wrote:
> [...]

Wow, lots of detailed info.. you might want to discuss this stuff on the xen-devel mailing list :)

-- Pasi