I am passing through to a domU (among other things) two USB controllers. Here is the lspci -v output on the dom0:

00:1d.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #1 (rev 04) (prog-if 20 [EHCI])
        Subsystem: ASRock Incorporation Device 1e26
        Flags: bus master, medium devsel, latency 0, IRQ 23
        Memory at f7d17000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Capabilities: [98] PCI Advanced Features
        Kernel driver in use: pciback

And here is the same device's output in the domU:

00:07.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #1 (rev 04) (prog-if 20 [EHCI])
        Subsystem: ASRock Incorporation Device 1e26
        Flags: bus master, medium devsel, latency 64, IRQ 44
        Memory at f3056000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [50] Power Management version 2
        Kernel driver in use: ehci_hcd

The output for the other controller is essentially the same.

The peculiar thing here is that the domU thinks it has a 4K memory area while the dom0 says it's just 1K. The controllers work, and I don't know enough about the PCI subsystems to say if this could cause issues, but it seems things could go wrong if the domU ever decides to use the other 3K of memory.

I had a look at how this value was calculated. I found that the guest writes all ones to the BAR and then reads it back, and the size of the memory area is determined by how many bits come back as zero (per the PCI specs). In qemu, in hw/pass-through.c, pt_bar_reg_write and pt_bar_reg_read are responsible for emulating the writing and reading. In pt_bar_reg_read, there is:

    /* align resource size (memory type only) */
    PT_GET_EMUL_SIZE(base->bar_flag, r_size);

For memory type BAR this macro changes r_size to:

    (((r_size) + XC_PAGE_SIZE - 1) & ~(XC_PAGE_SIZE - 1));

This looks like it rounds r_size up to the next multiple of XC_PAGE_SIZE, and logging confirms this is changing r_size from 0x400 to 0x1000. This ends up giving the guest the rounded up size, instead of the real size.

So,
* is this an actual potential problem, or will something else ensure that the guest isn't going to try to use the extra memory?
* if it needs fixing, how can it be done? I've looked through the code but I'm not sure how to fix it without breaking other things.
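
To make the arithmetic concrete, here is a minimal sketch in C of the sizing handshake and the rounding step described in this message. It is not qemu code: the function names and EXAMPLE_PAGE_SIZE are made up for illustration, and it assumes a 32-bit memory BAR and a 4 KiB page size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, standing in for XC_PAGE_SIZE. */
#define EXAMPLE_PAGE_SIZE 0x1000u

/* Value a 32-bit memory BAR of the given power-of-two size returns
 * after all ones have been written to it. The low 4 bits carry the
 * type/prefetch flags and are not part of the address. */
static uint32_t bar_readback_for_size(uint32_t size, uint32_t flags)
{
    assert(size >= 16u && (size & (size - 1u)) == 0u);
    return ~(size - 1u) | (flags & 0xFu);
}

/* Recover the size from that read-back value: mask off the flag
 * bits, invert, and add one. */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    uint32_t mask = readback & ~0xFu;
    return ~mask + 1u;
}

/* Same arithmetic as the rounding expression quoted above. */
static uint32_t round_up_to_page(uint32_t size)
{
    return (size + EXAMPLE_PAGE_SIZE - 1u) & ~(EXAMPLE_PAGE_SIZE - 1u);
}
```

With these helpers, a real size of 0x400 reads back as 0xFFFFFC00, while the rounded size 0x1000 reads back as 0xFFFFF000, which is why the guest ends up reporting 4K for a device that dom0 reports as 1K.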
Jan Beulich
2012-Jun-29 07:51 UTC
Re: A question about PCI passthrough device BAR memory size
>>> On 29.06.12 at 01:12, Rolu <rolu@roce.org> wrote:
> I am passing through to a domU (among other things) two USB
> controllers. Here is the lspci -v output on the dom0:
>
> 00:1d.0 USB controller: Intel Corporation Panther Point USB Enhanced
> Host Controller #1 (rev 04) (prog-if 20 [EHCI])
>         Subsystem: ASRock Incorporation Device 1e26
>         Flags: bus master, medium devsel, latency 0, IRQ 23
>         Memory at f7d17000 (32-bit, non-prefetchable) [size=1K]
>         Capabilities: [50] Power Management version 2
>         Capabilities: [58] Debug port: BAR=1 offset=00a0
>         Capabilities: [98] PCI Advanced Features
>         Kernel driver in use: pciback
>
> And here is the same device's output in the domU:
>
> 00:07.0 USB controller: Intel Corporation Panther Point USB Enhanced
> Host Controller #1 (rev 04) (prog-if 20 [EHCI])
>         Subsystem: ASRock Incorporation Device 1e26
>         Flags: bus master, medium devsel, latency 64, IRQ 44
>         Memory at f3056000 (32-bit, non-prefetchable) [size=4K]
>         Capabilities: [50] Power Management version 2
>         Kernel driver in use: ehci_hcd
>
> The output for the other controller is essentially the same.
>
> The peculiar thing here is that the domU thinks it has a 4K memory
> area while the dom0 says it's just 1K. The controllers work, and I
> don't know enough about the PCI subsystems to say if this could cause
> issues, but it seems things could go wrong if the domU ever decides to
> use the other 3K of memory.
>
> I had a look at how this value was calculated. I found that the guest
> will write all ones to the BAR and then reads it, and the size of the
> memory area is determined by how many bits come back as zero (per the
> PCI specs). In qemu, in hw/pass-through.c, pt_bar_reg_write and
> pt_bar_reg_read are responsible for emulating the writing and reading.
> In pt_bar_reg_read, there is:
>
>     /* align resource size (memory type only) */
>     PT_GET_EMUL_SIZE(base->bar_flag, r_size);
>
> For memory type BAR this macro changes r_size to:
>
>     (((r_size) + XC_PAGE_SIZE - 1) & ~(XC_PAGE_SIZE - 1));
>
> This looks like it rounds r_size up to the next multiple of
> XC_PAGE_SIZE, and logging confirms this is changing r_size from 0x400
> to 0x1000. This ends up giving the guest the rounded up size, instead
> of the real size.
>
> So,
> * is this an actual potential problem, or will something else ensure
> that the guest isn't going to try to use the extra memory?

I think it is wrong for qemu-dm to not honor the original size. A driver handling different device versions/implementations could look at the BAR size and adapt its behavior accordingly (and would then likely fail).

The second aspect to this - making sure the guest doesn't access some other guest's (or the host's) MMIO space - is actually something to be taken care of in the host. The host has to re-assign (or assign in the first place, should the firmware not have done so) resources such that no two devices to be passed through to a guest share the same PAGE_SIZE region for their MMIO blocks. In the non-pvops kernel we have special code and command line options for this, but I believe this has by now become redundant with other code and options in the upstream kernels (I just never got around to checking how much redundancy there is and hence how much could be eliminated).

In any case, these are things that - afaict - need manual admin action to get right _before_ passing through any device to a guest.

> * if it needs fixing, how can it be done? I've looked through the code
> but I'm not sure how to fix it without breaking other things.

Since qemu ought to be able to find out the real device's BAR sizes, it shouldn't be that difficult to make it use that value in the config space access emulation rather than the rounded up one - in the worst case it would have to track two values instead of one.

Jan
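
For illustration only, the "track two values" idea could be sketched in C as below. This is not the qemu-dm implementation: struct example_bar, its fields, the helper names and EXAMPLE_MAP_GRANULE are all hypothetical, and the sketch assumes a 32-bit memory BAR with a power-of-two size and 4 KiB pages.

```c
#include <stdint.h>

/* Assumption: 4 KiB mapping granularity, standing in for XC_PAGE_SIZE. */
#define EXAMPLE_MAP_GRANULE 0x1000u

/* Hypothetical per-BAR state holding both sizes. */
struct example_bar {
    uint32_t real_size; /* size of the physical device's BAR */
    uint32_t map_size;  /* page-rounded size, used only for mapping */
    uint32_t flags;     /* low 4 bits of the BAR (type, prefetch) */
};

/* Record the real size and derive the rounded mapping size from it. */
static struct example_bar example_bar_init(uint32_t real_size, uint32_t flags)
{
    struct example_bar bar;
    bar.real_size = real_size;
    bar.map_size = (real_size + EXAMPLE_MAP_GRANULE - 1u)
                   & ~(EXAMPLE_MAP_GRANULE - 1u);
    bar.flags = flags & 0xFu;
    return bar;
}

/* Value returned to the guest when it reads the BAR after writing all
 * ones: built from real_size, so the guest sees the device's own size. */
static uint32_t example_bar_sizing_read(const struct example_bar *bar)
{
    return ~(bar->real_size - 1u) | bar->flags;
}
```

In this sketch the rounded value is kept only for mapping purposes, while the emulated config space read is derived from the unrounded size, so a 1K BAR would read back as 0xFFFFFC00 rather than 0xFFFFF000.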