Whrrr, xen-users! Once more someone having trouble with the elusive, gnarly, extensive device passthrough is here, and this time, that someone is I. Ahem.

--- The background ---

Here's what I'm trying to do: I have a machine with an Asus P8B WS mainboard and an Intel Xeon E3-1230 processor which I'd like to have running three Xen domains: the dom0 Heterodyne, and the two domUs Quail and Furn. (The software identifiers are all lowercase.) All domains are running Debian GNU/Linux with at least Linux 3.0.0, though the userspace configuration varies considerably. Furn is currently a PV domain and Quail an HVM domain, though I might change them to both be HVM later on. I'm planning to partition the CPU cores between the domains by pinning the vcpus, in case that makes any difference.

This machine has a number of PCI devices, both on the mainboard and offboard, which I would like to divide up between the domains. I'm aiming for the following allocation, based on dom0 lspci:

Heterodyne:
  00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
  05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
  06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
  07:00.0 USB Controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
  [and everything else]

Quail:
  00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
  01:00.0 VGA compatible controller: ATI Technologies Inc RV710 [Radeon HD 4350]
  01:00.1 Audio device: ATI Technologies Inc RV710/730
  08:00.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
  08:03.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)

Furn (optional):
  00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
  00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 05)

I do _not_ necessarily want to attach the VGA as one that Quail's HVM BIOS can find; I'm happy with it being registered as a secondary graphics card, with the primary one being the emulated Cirrus VGA accessible over VNC, and then having the guest OS load drivers for the device and reinitialize it. This would seem to sidestep a lot of the work listed in the XenVGAPassthrough page on the wiki, since that refers to making the VGA work for the BIOS.

I'm currently running Xen 4.1.1 (Debian 4.1.1-2). The mainboard has VT-d IOMMU support, and it is enabled in the BIOS (which is currently version 0605 from Asus, though I'm planning to upgrade to 0704, since that supposedly fixes the bogus ECC issue in the current version). Xen and the dom0 Linux, in the plainest configuration, are loaded with:

  /boot/xen-4.1-amd64.gz placeholder iommu=1 console=com1,vga com1=38400,8n1 dom0_mem=512M dom0_max_vcpus=2 dom0_vcpus_pin
  /boot/vmlinuz-3.0.0-1-amd64 placeholder root=UUID=<...> ro quiet console=tty0 console=hvc0 mem=512M
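(For anyone reproducing this: those lines come out of the usual Debian update-grub machinery, I believe via /etc/grub.d/20_linux_xen. A minimal sketch of the /etc/default/grub variables that would produce roughly those command lines, with the variable names assumed from that helper rather than pasted from my actual configuration, is:

  # /etc/default/grub fragment (sketch only; check /etc/grub.d/20_linux_xen for the exact variable names)
  GRUB_CMDLINE_XEN_DEFAULT="iommu=1 console=com1,vga com1=38400,8n1 dom0_mem=512M dom0_max_vcpus=2 dom0_vcpus_pin"
  GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=hvc0 mem=512M"

followed by « update-grub ». Note that GRUB_CMDLINE_LINUX_DEFAULT also applies to non-Xen boot entries.)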
--- The saga ---

I'm going to reproduce these steps as I'm writing this, so it should be a fairly accurate accounting as far as results go, though it may not reflect exactly what I went through earlier. My first priorities are to get the Radeon, one USB controller, and the PCI audio device attached to Quail, in that order; once I can manage that, it seems likely that Furn's configuration will fall into place.

All of the below is done with Furn absent as a Xen domain. Auxiliary files that are too large to include inline are available from http://dasyatidae.net/2011/xen-users-265/.

ATTEMPT #1: "Hot Cross Plugs"

To avoid interference, I preëmptively blacklist the drivers radeon, radeonfb, snd_hda_intel, and snd_virtuoso in the dom0 Linux configuration, using /etc/modprobe.d/blacklist.conf. (snd_hda_intel would also normally bind to the secondary function of the Radeon HD 4350.) The dom0 is running Debian testing (wheezy), and the domU is running unstable (sid).

Boot. lsmod reveals that none of the aforementioned modules have been loaded; there is still a VGA console visible through the Radeon. dmesg output is in 01-dom0-dmesg.txt and 01-xen-dmesg.txt. Note that at boot, dom0 complains about PCI address space overlapping video ROM. The IOMMU is in fact initialized according to Xen.

Boot domU, with VNC open and configuration from 01-quail-cfg.txt. VNC has a getty running on the emulated Cirrus VGA, and xm console attaches to the HVM serial device properly. Linux in the domU is loaded as:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1

Full dmesg output is in 01-quail-dmesg-boot.txt.

Run « stubify 0000:01:00.0 » and « stubify 0000:01:00.1 » on the dom0; stubify is a shell script that rebinds a PCI device to the pci-stub driver (a rough sketch of the idea appears at the end of this attempt). xen-pcifront and xen-pciback are nowhere to be found on either the dom0 or the domU kernel, as far as I can detect---I suppose the Debian configuration doesn't come with either of them.

  root@heterodyne:~# xm pci-list-assignable-devices
  0000:01:00.0
  0000:01:00.1
  root@heterodyne:~# xm pci-attach quail 0000:01:00.0
  root@heterodyne:~# xm console quail
  [:: ... log in as root ... ::]
  root@quail:~# modprobe pci-hotplug
  root@quail:~# echo 1 >/sys/bus/pci/rescan
  root@quail:~# [ 450.499603] radeon 0000:00:04.0: Fatal error during GPU init
  [ 450.518377] [TTM] Trying to take down uninitialized memory manager type 1
  root@quail:~#

Quail's next dmesg entries (01-quail-dmesg-hotadd.txt) suggest that it can't assign PCI resources for the video RAM, and the Radeon driver falls over with a zero access window. qemu-dm spits out messages in 01-qemu-log-hotadd.txt.
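(stubify is nothing exotic; the idea is the standard sysfs dance to hand a device to pci-stub. A minimal sketch of that idea, not the literal script, and assuming pci-stub is built in or already loaded, would be:

  #!/bin/sh
  # Sketch of a stubify-style helper: rebind one PCI function to pci-stub.
  # Usage: stubify 0000:01:00.0
  set -e
  dev="$1"
  sysdev="/sys/bus/pci/devices/$dev"
  # Vendor and device IDs, e.g. 0x1002 and 0x954f.
  vendor=$(cat "$sysdev/vendor")
  device=$(cat "$sysdev/device")
  # Detach whatever driver currently owns the function, if any.
  if [ -e "$sysdev/driver" ]; then
      echo "$dev" > "$sysdev/driver/unbind"
  fi
  # Teach pci-stub about this ID; it should then claim the unbound device.
  echo "$vendor $device" > /sys/bus/pci/drivers/pci-stub/new_id
  # Bind explicitly in case the new_id write didn't trigger a probe.
  if [ ! -e "$sysdev/driver" ]; then
      echo "$dev" > /sys/bus/pci/drivers/pci-stub/bind
  fi

That's the gist of it, anyway.)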
ATTEMPT #2: "A Bridge over Stormy SeaBIOS"

My guess is that the HVM emulated PCI bridge isn't leaving a large enough window to allocate MMIO space for the video RAM. So I tentatively append pci=nocrs to Quail's Linux command line to see whether that'll convince it to use better window areas. The full domU command line is now:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs

dmesg output is in 02-quail-dmesg-boot.txt. I do the same xm pci-attach and rescan as before. This time, Quail hangs hard for a while, then does this:

  [ 172.050122] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:1686]
  [ 172.053355] Stack:
  [ 172.054119] Call Trace:
  [ 172.054119] Code: 88 03 00 00 b8 ea ff ff ff 78 22 3b b7 70 03 00 00 77 1a c1 e6 03 48 81 e2 00 f0 ff ff 48 63 f6 48 83 ca 67 48 8d 34 31 48 89 16

« xm pci-detach quail 0000:01:00.0 » yields "Error: Timed out waiting for device model action", so I destroy the domain with « xm destroy quail ». 02-qemu-log-hotadd.txt suggests that Quail tried to map the graphics card into a high (>= 4 GiB) MMIO region and that qemu-dm's PCI emulation blew chunks all over the memory map as a result. /var/log/messages from Quail contains the fragment in 02-quail-messages-hotadd.txt, which seems to confirm this.

ATTEMPT #3: "Boom! Address Space Klotski"

I change Quail's configuration to set memory = '3500', leading to 03-quail-cfg.txt. pci=nocrs is still in effect. After boot and hot-add, 03-quail-dmesg.txt results; it seems to have mapped the GPU correctly and initialized it. Indeed, at this point the DVI output of the Radeon card is correctly showing Quail's Linux framebuffer console. However, dom0 grimaces and spits out the following on the serial console:

  (XEN) vmsi.c:122:d32767 Unsupported delivery mode 3

Oops. qemu-dm has meanwhile generated 03-qemu-log.txt.

I create quail:/etc/X11/xorg.conf with the following text:

  Section "Device"
      Identifier "primary screen"
      Driver "radeon"
      BusID "PCI:0:4:0"
  EndSection

to force it to use the Radeon card. (I haven't forced pci-attach to use a specific virtual slot yet, so this isn't necessarily reliable between boots, but that should be easy enough to fix ex post facto; see the note after this attempt.) Starting X as root yields an X display on the Radeon! glxinfo yields 03-glxinfo.txt, suggesting that the Radeon DRI is working. Running glxgears claims that it is synchronized to the vertical retrace, but no output appears. Whoops. 2D output to X mostly works, but is horribly slow and full of tearing artifacts. This suggests to me that the vsync is broken. The vmsi error above suggests that this is because an MSI from the Radeon card is not being delivered. Right then!
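(On the slot question: if I remember the xm syntax right, pci-attach takes an optional virtual-slot argument after the device address, so pinning the guest-side address, and hence keeping the BusID in xorg.conf valid across boots, should be a one-argument change. A sketch, with the slot number assumed rather than tested:

  # Request virtual slot 4 so the card shows up as 00:04.0 in the guest;
  # double-check « xm help pci-attach » before relying on this.
  xm pci-attach quail 0000:01:00.0 4

)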
ATTEMPT #4: "Friendly MSI have been destroy!"

I alter Quail's Linux boot line to include pci=nomsi (combining it with the existing option). The new full Linux command line is:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs,nomsi

dmesg after domU boot yields 04-quail-dmesg-boot.txt. xm pci-attach, then rescan the PCI bus in the domU. The Radeon is initialized successfully. I start X as root. glxinfo returns the exact same results. Running glxgears displays some rotating gears:

  328 frames in 5.0 seconds = 65.401 FPS

This is indeed correct!

But then I attempt to attach the USB controller. I run « stubify 0000:00:1d.0 » in the dom0, and get a correct-looking:

  [ 3068.112581] ehci_hcd 0000:00:1d.0: remove, state 4
  [ 3068.112588] usb usb2: USB disconnect, device number 1
  [ 3068.112590] usb 2-1: USB disconnect, device number 2
  [ 3068.131201] ehci_hcd 0000:00:1d.0: USB bus 2 deregistered
  [ 3068.131242] ehci_hcd 0000:00:1d.0: PCI INT A disabled
  [ 3068.131341] pci-stub 0000:00:1d.0: claimed by stub

And the moment of truth:

  root@heterodyne:~# xm pci-list-assignable-devices
  0000:00:1d.0
  root@heterodyne:~# xm pci-attach quail 0000:00:1d.0

But the serial console coughs up:

  (XEN) physdev.c:182: dom7: no free pirq

Uh-oh. Quail's dmesg reveals:

  [ 259.241751] pci 0000:00:05.0: [8086:1c26] type 0 class 0x000c03
  [ 259.242090] pci 0000:00:05.0: reg 10: [mem 0x00000000-0x00000fff]
  [ 259.244192] pci 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
  [ 259.244196] pci 0000:00:05.0: BAR 0: assigned [mem 0xdc030000-0xdc030fff]
  [ 259.244276] pci 0000:00:05.0: BAR 0: set to [mem 0xdc030000-0xdc030fff] (PCI address [0xdc030000-0xdc030fff])
  [ 259.304377] usbcore: registered new interface driver usbfs
  [ 259.304392] usbcore: registered new interface driver hub
  [ 259.304412] usbcore: registered new device driver usb
  [ 259.305635] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
  [ 259.305727] xen map irq failed -22
  [ 259.305729] ehci_hcd 0000:00:05.0: PCI INT A: failed to register GSI

And indeed, qemu-dm complains of:

  dm-command: hot insert pass-through pci dev
  register_real_device: Assigning real physical device 00:1d.0 ...
  register_real_device: Enable MSI translation via per device option
  register_real_device: Disable power management
  pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1d.0x0
  pt_register_regions: IO region registered (size=0x00000400 base_addr=0xfe706000)
  register_real_device: Error: Mapping irq failed, rc = -1
  register_real_device: Real physical device 00:1d.0 registered successfuly!
  IRQ type = INTx
  pt_pci_write_config: Warning: Guest attempt to set address to unused Base Address Register. [00:05.0][Offset:30h][Length:4]
  pt_iomem_map: e_phys=dc030000 maddr=fe706000 type=0 len=4096 index=0 first_map=1

Ow! (The full log is in 04-qemu-log.txt.) The rescan also convinces Xen that:

  (XEN) irq.c:1817: dom7: invalid pirq -28 or emuirq 36

At this point I got sort of stumped and started paging through Xen source code trying to figure out what in the world was going on, though I didn't get very far.

--- The quandary ---

My best guess at this stage is that:

- Xen is deriving the number of GSIs and MSIs available from the host APIC, and this is somehow carrying over to the domU interrupts.
- For some reason, the MSI format being used is not supported by the PCI passthrough code, or else it's misdetecting what sort of MSI is coming down the line (I'm not intimately familiar enough with PCI to know how this works (yet)).
- Since Xen and qemu-dm only support attaching devices to separate virtual PCI buses in the domU, they can't share GSI interrupts. (I wouldn't expect them to work on the same virtual bus anyway, given that some are PCI Express and expect a point-to-point link, but maybe it'd work if they're not too picky.)
- On this modern machine, the number of GSIs available is severely diminished under the expectation that almost all mainboard and add-on devices will use MSI. (E.g., there's only one PCI (as opposed to PCI-E) slot.) This carries over to the domU. Combined with the above, we run out of GSIs trying to attach the second device. (A quick way to eyeball this from inside the guest is sketched after this list.)
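(For the GSI-exhaustion part, the crude check I have in mind is just to look at what the guest has already claimed before the hot-add, something like the following; none of this is conclusive, it just shows which interrupt lines are occupied and how the emulated firmware routed them:

  # Inside Quail: which IRQ lines are already in use, and by what.
  cat /proc/interrupts
  # How the emulated ACPI/IOAPIC routed PCI interrupts at boot.
  dmesg | grep -i -e ioapic -e 'pci interrupt link' -e gsi

)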
Note that:

- None of the devices involved support function-level reset that I know of (via « lspci -vv » in dom0). This does not appear to actually be a problem in practice; I have successfully attached the USB controller by itself to Quail, or the PCI audio device by itself. I can give « lspci -vv » output if it's useful, but not right now, since I'd have to reboot the dom0 again to do it and I'm running out of time.
- I'm not currently attaching the second function of the Radeon card. This again does not appear to cause a problem, even though theoretically one is supposed to attach both functions. I think I tried attaching both of them simultaneously, and it didn't work due to this not being a bleeding-edge Xen, but I don't have the details at the moment; attaching them both as separate PCI passthrough devices didn't obviously do anything useful. (I'd actually prefer to avoid attaching that second function if possible; it'd be very convenient to not have to bother with a worthless second audio device gunking up Quail's ALSA configuration.)
- Last I checked, trying to boot Quail with any PCI passthrough devices already attached (a sketch of what I mean follows these notes) caused a failure to find the root filesystem. I haven't gone back through this to check again, but I believe what is happening is that the Xen domU platform PCI device that the PV-on-HVM drivers use(?) is found, and the driver disables the emulated disks as a result, but then its interrupt attachment goes haywire because the available interrupts have all been used up by the passthrough devices, so communication with the Xen backends breaks down, which grinds everything to a halt because various essential devices (such as the disks) are already gone. I can go back and try this again if more specific information on it would be useful.
- Previously I was running the dom0 kernel with pci=nocrs as well, which avoided the PCI address space overlapping video ROM message present in 01-dom0-dmesg.txt. This doesn't seem to have made a significant difference in behavior at any of the steps.
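(By "already attached" above I mean the ordinary pci list in the domU configuration instead of xm pci-attach. A sketch of the fragment, using the usual xm config syntax rather than anything pasted from my actual 03-quail-cfg.txt, would be:

  # Pass these host devices through at domain creation time (sketch only).
  pci = [ '0000:01:00.0', '0000:00:1d.0' ]

)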
My questions:

- Does my tentative analysis seem sound? If not, what have I miscalculated? Or am I on the wrong track from the very beginning?
- Will upgrading to a bleeding-edge Xen make any difference? I'd rather not do this unless necessary, since I'd prefer a stabler machine with a more well-tested hypervisor, but I'm well aware that what I'm trying to do is fairly obtuse and may require jiggering software around.
  + A diff of xen/arch/x86/hvm/vmsi.c suggests that it will _not_ make the MSIs work if the delivery mode is being detected properly. vmsi_deliver seems to have the same code regarding delivery mode 3 (i.e., none; it's dest__reserved_1 in xen/include/asm-x86/io_apic.h).
  + A grep of tools/ioemu-qemu-xen/hw/pass-through.c suggests that it _might_ allow using 5000 MB of memory for the domU rather than 3500 MB without bogotifying the PCI windows, since the "Guest attempt to set high MMIO Base Address" message is gone, which would mean 64-bit BARs work; but I haven't confirmed this. Granting the domU more memory would be very good, but is not essential.
- What is the best way to proceed that might allow me to attach these devices to an HVM domU at the same time? Or am I merely hosed? In particular:
  + Could I convince Xen and/or the device model to emulate the HVM domU's APIC with more available GSIs rather than copying the model from the host (if that is indeed what it's doing)?
  + Could I patch vmsi.c to make delivery mode 3 work so that MSIs from the Radeon card will be delivered? Or is delivery mode 3 truly nonexistent, meaning that there is something else wrong with MSI passthrough? If the latter, where might I look to find out more? Searching the mailing list archives yields various patches that have appeared in the past regarding vmsi.c, but nothing that looks both relevant and unapplied.
- I would like to not have to run the domU kernel with pci=nocrs. Is there a reasonable way to make the device model allocate the PCI host bridge resources differently to make this unnecessary? It doesn't seem important by comparison to the above, but is continuing to run with pci=nocrs an indicator that I may run into problems later if I upgrade Xen or qemu-dm?
- Similarly, would running the dom0 kernel with pci=nocrs help or hurt in any particular way? It doesn't seem to have much effect, but maybe I'm missing something.

To anyone who made it this far: I am at least theoretically open to arranging reasonable bribes for the first people to help me get this to work. Inquire privately if you wish to take advantage of this. Otherwise, you still get my thanks in advance. :-)

Kyaieee!
   ---> Drake Wilson

Post Scriptum: BIOS 0704 has been unhelpful in either getting ECC enabled or doing anything to the PCI passthrough. Alas. Also, « lspci -vv » output is now available as 05-lspci-vv.txt.
Drake Wilson
2011-Sep-23 05:25 UTC
Re: [Xen-users] The saga of Heterodyne's PCI Passthrough
[...]

Hmm. Loopiness follows.

So I finally chased down somewhere on the Web saying that MSI delivery modes are similar to IPI delivery modes, and then I found the IPI types in the AMD64 architecture manual. Apparently mode 3 is a remote read from a separate local APIC. This sounds totally bogus for an interrupt coming from a graphics card, but I don't know for sure.

But after poring over the dmesg a bit more, I realized a bunch of lower interrupts were being eaten by PnP devices. I suppose these are attached to things like the emulated PS/2 keyboard. Since I don't need those (I use the serial device as an emergency maintenance port), I put pnpacpi=off pnpbios=off on the domU kernel line (the resulting command line is sketched below), and magically there are enough interrupts to pass through both the Radeon and the USB controller, or so it seems.

So that may be a good enough workaround for now. But I still want to know why those MSIs are apparently not being passed through.

Also, apparently the interrupt limit is an architectural one, which I should have known since it was hardcoded, but somewhere in there I made a logic error, I suppose (keeping in mind that I really don't know what I'm doing here).

Some primitive attempts to trace back where in the world mode 3 is coming from yielded not much, except for a twisty maze that points either into the guest OS or maybe into some sort of structure corruption in the device model; the only places I could find where a Linux write_msi happens that would alter the MSI Message Data register that way (as opposed to setting it to hardcoded elements) are related to IOMMU stuff, such as io_apic.c:3051, and some further primitive attempts to probe this yielded nothing useful.

I do notice that Linux xen_hvm_setup_msi_irqs doesn't seem to get called anywhere, or at least its printk lines don't show up in dmesg, but whether this is relevant I cannot say.

Kyaieee!
   ---> Drake Wilson
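(For concreteness, the domU kernel line with that workaround folded in would be roughly the following; this is a sketch assembled from the options mentioned above, not a paste from the actual configuration:

  /boot/vmlinuz-3.0.0-1-amd64 root=UUID=<...> ro quiet console=ttyS0,38400n1 pci=nocrs,nomsi pnpacpi=off pnpbios=off

)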
Pasi Kärkkäinen
2011-Sep-26 19:07 UTC
Re: [Xen-users] The saga of Heterodyne's PCI Passthrough
On Fri, Sep 23, 2011 at 12:25:32AM -0500, Drake Wilson wrote:
> [...]

Wow, lots of detailed info.. you might want to discuss this stuff on the xen-devel mailing list :)

-- Pasi