Hello,

I'm trying to get SR-IOV working under Xen (4.2). It almost works, except for a memory bug. This is easily reproducible just in Dom0.

I have a ConnectX-3 card with the latest firmware and OFED 2.0-3 drivers. I tried the 3.2 kernel from Debian, the 3.10 kernel from Debian, and a vanilla 3.11.5 kernel. All behave the same.

As soon as I issue the ibv_devinfo command, it produces the following messages in dmesg. The problem is that with the ib_rdma_bw command I get more of those messages and, moreover, the OOM killer gets confused and kills almost all processes.

[23502.645455] mlx4_core 0000:06:00.0: mlx4_ib: Port 1 logical link is up
[23550.181907] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
[23550.183822] swap_free: Unused swap offset entry 00000001
[23550.183868] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.183939] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b83c0480 index:380fe0882
[23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.195382] Pid: 13813, comm: ibv_devinfo Tainted: G O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.195461] Call Trace:
[23550.195508] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.195553] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.195601] [<ffffffff810c68dd>] ? __add_page_to_lru_list+0x53/0x53
[23550.195647] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.195694] [<ffffffff8100d722>] ? __switch_to+0x23b/0x2b1
[23550.195741] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.195788] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
[23550.195832] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.195875] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.195921] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
[23550.195965] Disabling lock debugging due to kernel taint
[23550.196412] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
[23550.198303] swap_free: Unused swap offset entry 00000001
[23550.198348] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.198424] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b83c09a0 index:380fe0082
[23550.198508] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.198558] Pid: 13813, comm: ibv_devinfo Tainted: G B O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.198637] Call Trace:
[23550.198680] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.198730] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.198775] [<ffffffff810c68dd>] ? __add_page_to_lru_list+0x53/0x53
[23550.198820] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.198865] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[23550.198913] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.198959] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
[23550.199005] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.199052] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.199096] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
[23550.199766] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
[23550.201661] swap_free: Unused swap offset entry 00000001
[23550.201706] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.201776] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b83c0ec0 index:380fdf882
[23550.201861] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.201908] Pid: 13813, comm: ibv_devinfo Tainted: G B O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.201990] Call Trace:
[23550.202032] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.202081] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.202125] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.202169] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[23550.202217] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.202267] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
[23550.202312] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.202355] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.202398] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
[23550.202925] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
[23550.213336] swap_free: Unused swap offset entry 00000001
[23550.213377] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.213448] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b6bd8ec0 index:380fdf082
[23550.213527] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.213573] Pid: 13813, comm: ibv_devinfo Tainted: G B O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.213651] Call Trace:
[23550.213775] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.213820] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.213863] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.213907] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[23550.213951] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.213996] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
[23550.214041] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.214084] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.214127] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
[23550.214461] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
[23550.215924] swap_free: Unused swap offset entry 00000001
[23550.215974] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.216049] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b8f381f0 index:380fff085
[23550.216133] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.216184] Pid: 13813, comm: ibv_devinfo Tainted: G B O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.216267] Call Trace:
[23550.216306] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.216351] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.216395] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.216443] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[23550.216487] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.216532] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
[23550.216581] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.216628] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.216677] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
[23550.216728] swap_free: Unused swap offset entry 00000001
[23550.216777] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
[23550.216846] addr:00007f7ef5e16000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b8f381f0 index:380fff485
[23550.216925] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
[23550.216980] Pid: 13813, comm: ibv_devinfo Tainted: G B O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
[23550.217077] Call Trace:
[23550.217124] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
[23550.217169] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
[23550.217212] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
[23550.217256] [<ffffffff8100d6b0>] ? __switch_to+0x1c9/0x2b1
[23550.217300] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
[23550.217349] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
[23550.217396] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
[23550.217443] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b

This happens only when running under Xen. A native kernel of the same version is OK.

Is it a known bug, or is something wrong with the BIOS/firmware?

--
Lukáš Hejtmánek
On 10/21/2013 7:57 AM, Lukas Hejtmanek wrote:
> Hello,
>
> I'm trying to get SR-IOV working under Xen (4.2). It almost works except
> memory bug. This is easily reproducible just in Dom0.
>
> I have Connect-X3 card with the latest firmware. OFED 2.0-3 drivers. I tried
> 3.2 kernel from Debian, 3.10 kernel from Debian and vanila 3.11.5 kernel. All
> are the same.

Ha! Funny you mention that. I had been looking at this.

> As soon as I issue ibv_devinfo command, it produces the following messages
> into dmesg. Problem is that with ib_rdma_bw command, I get more of those
> messages and moreover, oom killer gets confused and kills almost all
> processes.
>
> [23502.645455] mlx4_core 0000:06:00.0: mlx4_ib: Port 1 logical link is up
> [23550.181907] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
> [23550.183822] swap_free: Unused swap offset entry 00000001
> [23550.183868] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
> [23550.183939] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b83c0480 index:380fe0882
> [23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]
> [23550.195382] Pid: 13813, comm: ibv_devinfo Tainted: G O 3.2.0-0.bpo.4-amd64 #1 Debian 3.2.41-2+deb7u2~bpo60+1+zs4
> [23550.195461] Call Trace:
> [23550.195508] [<ffffffff810d9009>] ? print_bad_pte+0x1f5/0x20d
> [23550.195553] [<ffffffff810db083>] ? unmap_vmas+0x5fe/0x814
> [23550.195601] [<ffffffff810c68dd>] ? __add_page_to_lru_list+0x53/0x53
> [23550.195647] [<ffffffff810df2de>] ? unmap_region+0x9f/0x102
> [23550.195694] [<ffffffff8100d722>] ? __switch_to+0x23b/0x2b1
> [23550.195741] [<ffffffff8103d870>] ? pick_next_task_fair+0xfc/0x10c
> [23550.195788] [<ffffffff810463a2>] ? finish_task_switch+0x53/0xc7
> [23550.195832] [<ffffffff810e01f7>] ? do_munmap+0x281/0x2eb
> [23550.195875] [<ffffffff810e02a0>] ? sys_munmap+0x3f/0x55
> [23550.195921] [<ffffffff8136e51c>] ? system_call_fastpath+0x16/0x1b
> [23550.195965] Disabling lock debugging due to kernel taint
> [23550.196412] <mlx4_ib> check_flow_steering_support: Device managed flow steering is unavailable for IB port in multifunction env.
> [23550.198303] swap_free: Unused swap offset entry 00000001
> [23550.198348] BUG: Bad page map in process ibv_devinfo pte:00000200 pmd:1b7df4067
> [23550.198424] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma: (null) mapping:ffff8801b83c09a0 index:380fe0082

..

> this happens only if running under Xen. Native kernel in the same version is OK.
>
> Is it a known bug or is something wrong with BIOS/firmware?
>

It is a bug in the drivers, I believe. The issue is that the mapping created for the second mmap call is done without VM_IO and on a PFN that is RAM (and not the BAR). But I am not entirely sure; hopefully this week I will have a better idea and a fix. Stay tuned.
>>> On 21.10.13 at 13:57, Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
> I'm trying to get SR-IOV working under Xen (4.2). It almost works except
> memory bug. This is easily reproducible just in Dom0.

So without any SR-IOV then, I suppose?

> [23502.645455] mlx4_core 0000:06:00.0: mlx4_ib: Port 1 logical link is up
> [23550.181907] <mlx4_ib> check_flow_steering_support: Device managed flow
> steering is unavailable for IB port in multifunction env.
> [23550.183822] swap_free: Unused swap offset entry 00000001
> [23550.183868] BUG: Bad page map in process ibv_devinfo pte:00000200
> pmd:1b7df4067
> [23550.183939] addr:00007f7ef5e18000 vm_flags:400844fa anon_vma:
> (null) mapping:ffff8801b83c0480 index:380fe0882
> [23550.184022] vma->vm_file->f_op->mmap: ib_uverbs_mmap+0x0/0x2d [ib_uverbs]

Looking at ib_uverbs_mmap() and its necessary (for mlx4) descendant mlx4_ib_mmap(), I see that the latter calls io_remap_pfn_range(), but afaict there's nowhere _PAGE_IOMAP would get set here (as opposed to arch/x86/pci/i386.c:pci_mmap_page_range(), for example). Could you check whether adding that flag helps?

(I'm copying the kernel maintainers so that they can correct me if I'm wrong here - it would seem to me that this could equally be the reason why there are other reports of certain things not working as expected in domains with more than 4Gb.)

You could also consider trying an openSUSE kernel - there, unlike upstream, there's no need for each and every caller of io_remap_pfn_range() to take care of setting _PAGE_IOMAP (and I vaguely recall having discussed this a couple of years back with Konrad et al).

Jan
>>> On 21.10.13 at 14:59, konrad wilk <konrad.wilk@oracle.com> wrote:
> It is a bug in the drivers I believe. The issue is that the mapping
> created for the second mmap
> call is done without VM_IO and on an PFN that is RAM (and not the BAR).

So while putting together the reply that I had sent to Lukas a minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP translation, and wasn't able to find it anywhere. As you say it nevertheless exists - what am I overlooking (and why would then pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP by hand)?

Jan
On 10/21/2013 9:18 AM, Jan Beulich wrote:
>>>> On 21.10.13 at 14:59, konrad wilk <konrad.wilk@oracle.com> wrote:
>> It is a bug in the drivers I believe. The issue is that the mapping
>> created for the second mmap
>> call is done without VM_IO and on an PFN that is RAM (and not the BAR).
> So while putting together the reply that I had sent to Lukas a
> minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP
> translation, and wasn't able to find it anywhere. As you say it
> nevertheless exists - what am I overlooking (and why would then
> pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP
> by hand)?

The P2M (arch/x86/xen/p2m.c) is consulted, which for the MMIO gaps and E820_RESV has the MFNs set to the PFN. This is the 1-1 pfn/mfn stuff that I implemented some time ago - as hpa was opposed to having _PAGE_IOMAP stuck on any macro call to pgprot_writecombine|noncached|etc. Or perhaps that was on the arch_something_prot.

Anyhow, the odd thing is that looking at the code:

        if (io_remap_pfn_range(vma, vma->vm_start,
                               to_mucontext(context)->uar.pfn +
                               dev->dev->caps.num_uars,
                               PAGE_SIZE, vma->vm_page_prot))

The PFN in question (uar.pfn) is set in mlx4_uar_alloc to:

        uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + offset;

So is the BAR not in the MMIO region? Or is it the 64-bit type MMIO that lies outside the 4GB, and hence when the P2M is consulted it thinks it's INVALID_P2M_ENTRY?

Which comes back to the bug you (Jan) discovered when you pointed out that PVH needs to set up MMIO entries for 64-bit MMIO regions which can be outside the 4GB region <sigh>. And that is something the pvops kernel completely ignores, as it assumes that any region past the E820 can be used for ballooning.

Anyhow, one easy thing to figure out is to get the lspci -v output from the InfiniBand card to see where its BARs are, and also the start of the kernel log. You should see an E820 map (please also boot with "debug" on the Linux command line).

> Jan
>
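The lookup behaviour described in this message can be modelled in user space. What follows is a toy sketch, not the code in arch/x86/xen/p2m.c: the types and the function `toy_classify_pfn` are hypothetical names made up for illustration. It assumes that PFNs inside explicitly listed identity ranges (MMIO gaps, reserved E820 regions) map 1-1, that other PFNs below the table limit are ordinary RAM, and that anything at or past the limit is reported as invalid, which is the case suspected here for a 64-bit BAR placed above 4GB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How the toy table classifies a PFN (hypothetical names throughout). */
enum toy_p2m_kind {
    TOY_P2M_RAM,      /* backed by some machine frame (not modelled further) */
    TOY_P2M_IDENTITY, /* mfn == pfn, as for MMIO gaps and reserved regions */
    TOY_P2M_INVALID   /* not covered by the table at all */
};

/* One identity-mapped PFN range [start, end). */
struct toy_range {
    uint64_t start;
    uint64_t end;
};

/* Toy model of a P2M: only PFNs below max_pfn are tracked. */
struct toy_p2m {
    uint64_t max_pfn;                 /* first PFN the table does not cover */
    const struct toy_range *identity; /* ranges that map pfn -> pfn */
    size_t n_identity;
};

/* Classify a PFN the way the message describes the lookup. */
enum toy_p2m_kind toy_classify_pfn(const struct toy_p2m *p2m, uint64_t pfn)
{
    assert(p2m != NULL);

    if (pfn >= p2m->max_pfn)
        return TOY_P2M_INVALID;       /* past the table: treated as invalid */

    for (size_t i = 0; i < p2m->n_identity; i++) {
        if (pfn >= p2m->identity[i].start && pfn < p2m->identity[i].end)
            return TOY_P2M_IDENTITY;  /* 1-1 range */
    }
    return TOY_P2M_RAM;
}
```

With one identity range covering a PCI hole below 4GB and a table limit at the end of RAM, a PFN inside the hole classifies as identity, while a PFN taken from a BAR far above the limit classifies as invalid.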
On 10/21/2013 9:39 AM, konrad wilk wrote:
>
> On 10/21/2013 9:18 AM, Jan Beulich wrote:
>>>>> On 21.10.13 at 14:59, konrad wilk <konrad.wilk@oracle.com> wrote:
>>> It is a bug in the drivers I believe. The issue is that the mapping
>>> created for the second mmap
>>> call is done without VM_IO and on an PFN that is RAM (and not the BAR).
>> So while putting together the reply that I had sent to Lukas a
>> minute ago I was actually hunting for that VM_IO -> _PAGE_IOMAP
>> translation, and wasn't able to find it anywhere. As you say it
>> nevertheless exists - what am I overlooking (and why would then
>> pci_mmap_page_range() nevertheless have to set _PAGE_IOMAP
>> by hand)?
>
> The P2M (arch/x86/xen/p2m.c) is consulted which for the MMIO gaps and
> E820_RESV has the MFNs set to the PFN. This is the 1-1 pfn/mfn stuff
> that I implemented
> some time ago - as hpa was opposed to having the _PAGE_IOMAP being
> stuck on any macro
> call to pgprot_writecombine|noncached|etc. Or perhaps that was on the
> arch_something_prot.

This is the one that Jeremy cooked up some time ago:
http://lkml.indiana.edu/hypermail/linux/kernel/1010.2/03012.html

And here was the thread:
http://www.spinics.net/lists/linux-rdma/msg07085.html

which I thought had been fixed by the P2M identity code.

>
> Anyhow, the odd thing is that looking at the code:
>
>         if (io_remap_pfn_range(vma, vma->vm_start,
>                                to_mucontext(context)->uar.pfn +
>                                dev->dev->caps.num_uars,
>                                PAGE_SIZE,
>                                vma->vm_page_prot))
>
> The PFN in question (uar.pfn) is in mlx4_uar_alloc is set to:
>
>         uar->pfn = (pci_resource_start(dev->pdev, 2) >>
>                     PAGE_SHIFT) + offset;
>
> So is the BAR not in the MMIO region? Or is it the 64-bit type MMIO
> that lays outside the 4GB and
> hence when the P2M is consulted it thinks its INVALID_P2M_ENTRY?
>
> Which comes back to the bug you (Jan) discovered when you pointed out
> that PVH needs to setup MMIO entries
> for 64-bit MMIO regions which can be outside the 4GB region <sigh>.
> And that is something the pvops kernel
> completly ignores as it assumes that any region past the E820 can be
> used for ballooning.
>
> Anyhow, one easy thing to figure out is to get the lspci -v output
> from the InfiniBand card
> to see where its BARs are, and also the start of the kernel. You
> should see an E820 map (please also boot with
> "debug" on the Linux command line).
>
>> Jan
>>
>
On Mon, Oct 21, 2013 at 09:39:33AM -0400, konrad wilk wrote:
> Anyhow, one easy thing to figure out is to get the lspci -v output
> from the InfiniBand card
> to see where its BARs are, and also the start of the kernel. You
> should see an E820 map (please also boot with
> "debug" on the Linux command line).

Note, adding _PAGE_IO as Jan suggested fixed those mem errors.

Here is lspci from the card and its virtual functions.

06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies Device 0017
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 42
    Region 0: Memory at dfa00000 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
    Expansion ROM at df900000 [disabled] [size=1M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] Vital Product Data
        Product Name: CX353A - ConnectX-3 QSFP
        Read-only fields:
            [PN] Part number: MCX353A-QCBT
            [EC] Engineering changes: A4
            [SN] Serial number: MT1327X00814
            [V0] Vendor specific: PCIe Gen3 x8
            [RV] Reserved: checksum good, 0 byte(s) reserved
        Read/write fields:
            [V1] Vendor specific: N/A
            [YA] Asset tag: N/A
            [RW] Read-write area: 105 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 252 byte(s) free
        End
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [148 v1] Device Serial Number 00-02-c9-03-00-b6-fc-70
    Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 4, Function Dependency Link: 00
        VF offset: 1, stride: 1, Device ID: 1004
        Supported Page Size: 000007ff, System Page Size: 00000001
        Region 2: Memory at 0000380fdf000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [154 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [18c v1] #19
    Kernel driver in use: mlx4_core

06:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
    Subsystem: Mellanox Technologies Device 61b0
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Region 2: [virtual] Memory at 380fdf000000 (64-bit, prefetchable) [size=8M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
        Vector table: BAR=2 offset=00002000
        PBA: BAR=2 offset=00003000
    Kernel driver in use: mlx4_core

06:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
    Subsystem: Mellanox Technologies Device 61b0
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Region 2: [virtual] Memory at 380fdf800000 (64-bit, prefetchable) [size=8M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
        Vector table: BAR=2 offset=00002000
        PBA: BAR=2 offset=00003000
    Kernel driver in use: mlx4_core

06:00.3 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
    Subsystem: Mellanox Technologies Device 61b0
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Region 2: [virtual] Memory at 380fe0000000 (64-bit, prefetchable) [size=8M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
        Vector table: BAR=2 offset=00002000
        PBA: BAR=2 offset=00003000
    Kernel driver in use: mlx4_core

06:00.4 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
    Subsystem: Mellanox Technologies Device 61b0
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Region 2: [virtual] Memory at 380fe0800000 (64-bit, prefetchable) [size=8M]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM unknown, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
        Vector table: BAR=2 offset=00002000
        PBA: BAR=2 offset=00003000
    Kernel driver in use: mlx4_core

And this is from dmesg:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000090fff] usable
[ 0.000000] Xen: [mem 0x0000000000091800-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x000000007dd76fff] usable
[ 0.000000] Xen: [mem 0x000000007dd77000-0x000000007ddb5fff] reserved
[ 0.000000] Xen: [mem 0x000000007ddb6000-0x000000007debefff] ACPI data
[ 0.000000] Xen: [mem 0x000000007debf000-0x000000007e0dafff] ACPI NVS
[ 0.000000] Xen: [mem 0x000000007e0db000-0x000000007f357fff] reserved
[ 0.000000] Xen: [mem 0x000000007f358000-0x000000007f7fffff] ACPI NVS
[ 0.000000] Xen: [mem 0x0000000080000000-0x000000008fffffff] reserved
[ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec01fff] reserved
[ 0.000000] Xen: [mem 0x00000000fec40000-0x00000000fec40fff] reserved
[ 0.000000] Xen: [mem 0x00000000fed1c000-0x00000000fed3ffff] reserved
[ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] Xen: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] Xen: [mem 0x0000000100000000-0x000000107fffffff] usable
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x1080000 max_arch_pfn = 0x400000000
[ 0.000000] e820: last_pfn = 0x7dd77 max_arch_pfn = 0x400000000
[ 0.000000] e820: [mem 0x90000000-0xfebfffff] available for PCI devices
[ 23.917733] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
[ 24.587366] e820: reserve RAM buffer [mem 0x00091000-0x0009ffff]
[ 24.587468] e820: reserve RAM buffer [mem 0x7dd77000-0x7fffffff]

Do you need anything else?

--
Lukáš Hejtmánek
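To make the comparison concrete, the BAR addresses from the lspci output can be turned into page frame numbers and checked against the last_pfn value in the E820 log. This is only an arithmetic sketch: `bar_to_pfn`, `pfn_beyond_e820` and `report_bars` are hypothetical helper names, and 4 KiB pages (a shift of 12) are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Convert a physical BAR address to its page frame number. */
uint64_t bar_to_pfn(uint64_t bar_addr)
{
    return bar_addr >> TOY_PAGE_SHIFT;
}

/* True if the PFN lies at or above the last PFN derived from the E820 map. */
bool pfn_beyond_e820(uint64_t pfn, uint64_t last_pfn)
{
    return pfn >= last_pfn;
}

/* Print one line per BAR and return how many lie beyond last_pfn. */
int report_bars(const uint64_t *bars, int n, uint64_t last_pfn)
{
    int beyond = 0;

    assert(bars != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        uint64_t pfn = bar_to_pfn(bars[i]);
        bool out = pfn_beyond_e820(pfn, last_pfn);

        printf("BAR 0x%llx -> pfn 0x%llx %s\n",
               (unsigned long long)bars[i],
               (unsigned long long)pfn,
               out ? "(beyond E820)" : "(within E820 range)");
        if (out)
            beyond++;
    }
    return beyond;
}
```

With the values above (Region 0 at 0xdfa00000, Region 2 at 0x380fff000000, last_pfn = 0x1080000), Region 0 gives PFN 0xdfa00, which is below last_pfn, while Region 2 gives PFN 0x380fff000, which is far past it. The VF BAR at 0x380fdf000000 lands past it in the same way.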
On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:> On Mon, Oct 21, 2013 at 09:39:33AM -0400, konrad wilk wrote: > > Anyhow, one easy thing to figure out is to get the lspci -v output > > from the InfiniBand card > > to see where its BARs are, and also the start of the kernel. You > > should see an E820 map (please also boot with > > "debug" on the Linux command line). > > note, adding _PAGE_IO as Jan suggested fixed those mem errors.<nods> Right.> > here is lspci from the card and its virtual functions. > > 06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3] > Subsystem: Mellanox Technologies Device 0017 > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin A routed to IRQ 42 > Region 0: Memory at dfa00000 (64-bit, non-prefetchable) [size=1M] > Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]Wow.> 06:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] > Subsystem: Mellanox Technologies Device 61b0 > Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0 > Region 2: [virtual] Memory at 380fdf000000 (64-bit, prefetchable) [size=8M]Wow again. .. 
snip..

> and this is from dmesg:
>
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] Xen: [mem 0x0000000000000000-0x0000000000090fff] usable
> [ 0.000000] Xen: [mem 0x0000000000091800-0x00000000000fffff] reserved
> [ 0.000000] Xen: [mem 0x0000000000100000-0x000000007dd76fff] usable
> [ 0.000000] Xen: [mem 0x000000007dd77000-0x000000007ddb5fff] reserved
> [ 0.000000] Xen: [mem 0x000000007ddb6000-0x000000007debefff] ACPI data
> [ 0.000000] Xen: [mem 0x000000007debf000-0x000000007e0dafff] ACPI NVS
> [ 0.000000] Xen: [mem 0x000000007e0db000-0x000000007f357fff] reserved
> [ 0.000000] Xen: [mem 0x000000007f358000-0x000000007f7fffff] ACPI NVS
> [ 0.000000] Xen: [mem 0x0000000080000000-0x000000008fffffff] reserved
> [ 0.000000] Xen: [mem 0x00000000fec00000-0x00000000fec01fff] reserved
> [ 0.000000] Xen: [mem 0x00000000fec40000-0x00000000fec40fff] reserved
> [ 0.000000] Xen: [mem 0x00000000fed1c000-0x00000000fed3ffff] reserved
> [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> [ 0.000000] Xen: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [ 0.000000] Xen: [mem 0x0000000100000000-0x000000107fffffff] usable

Odd, there should be messages about 1-1 mapping when you use 'debug'.

But either way - the problem (bug) is what I suspected - we treat any region past the E820 as INVALID_P2M_ENTRY, and hence any set_pte(..) operation will fetch a 0 value, which in turn means that the PTE is zero (with the 0x200 _PAGE_SPECIAL b/c of VMA tracking).

Now the fix is to determine _where_ the end of real memory is so that we can make sure that ballooning will work (in case of the dom0_mem_max parameter). And then anything past that PFN can be treated as IDENTITY_FRAME.
Naively, I think this patch would do it:

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 09f3059..3871554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
 
 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
 	}
+	/* Anything past the balloon area is marked as identity. */
+	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
+		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
 }
 
 static unsigned long __init xen_do_chunk(unsigned long start,

But this is not even compile tested :-(
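The failure mode described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `toy_p2m`, `toy_pfn_to_mfn`, `toy_dom0` and `TOY_INVALID_ENTRY` are hypothetical stand-ins rather than the kernel's P2M code, the RAM translation is a made-up constant offset, and holes inside the E820 map are ignored. It only illustrates that the PFN of the ConnectX-3 BAR at 0x380fff000000 lies far beyond the last E820 PFN (0x1080000 in the dmesg output) and therefore looks up as invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12          /* 4 KiB pages, as on x86 */
#define TOY_INVALID_ENTRY UINT64_MAX  /* hypothetical stand-in for INVALID_P2M_ENTRY */

/* Toy P2M: every PFN below end_pfn is treated as RAM, everything else is untracked. */
struct toy_p2m {
	uint64_t end_pfn;     /* first PFN past the E820-described range */
	uint64_t mfn_offset;  /* made-up translation: mfn = pfn + mfn_offset */
};

/* Convert a physical (bus) address to its page frame number. */
static uint64_t toy_addr_to_pfn(uint64_t addr)
{
	return addr >> TOY_PAGE_SHIFT;
}

/* Look up a PFN; anything past the tracked range comes back as invalid. */
static uint64_t toy_pfn_to_mfn(const struct toy_p2m *p2m, uint64_t pfn)
{
	assert(p2m != NULL);
	if (pfn < p2m->end_pfn)
		return pfn + p2m->mfn_offset;
	return TOY_INVALID_ENTRY;
}

/* end_pfn matches "last_pfn = 0x1080000" from the dmesg; the offset is invented. */
static const struct toy_p2m toy_dom0 = { 0x1080000, 0x100 };
```

With this model, `toy_pfn_to_mfn(&toy_dom0, toy_addr_to_pfn(0x380fff000000ULL))` yields `TOY_INVALID_ENTRY`, which is the situation that ends up as an empty PTE when the BAR is mapped.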
>>> On 21.10.13 at 16:06, Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
> here is lspci from the card and its virtual functions.
>
> 06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> 	Subsystem: Mellanox Technologies Device 0017
> 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 0, Cache Line Size: 64 bytes
> 	Interrupt: pin A routed to IRQ 42
> 	Region 0: Memory at dfa00000 (64-bit, non-prefetchable) [size=1M]
> 	Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]

Which confirms what Konrad said regarding MMIO above 4Gb.

Jan
On Mon, Oct 21, 2013 at 10:18:55AM -0400, Konrad Rzeszutek Wilk wrote:
> Odd, there should be messages about 1-1 mapping when you use 'debug'.

cat /proc/cmdline
placeholder root=UUID=b5711e0a-3fc8-44ec-940f-112e60d8f143 ro debug

so I suppose I did it right. Maybe I didn't compile something important in?

-- 
Lukáš Hejtmánek
>>> On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
>> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
>...
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
>
> 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> 	}
> +	/* Anything past the balloon area is marked as identity. */
> +	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
> +		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));

Hardly - MAX_DOMAIN_PAGES derives from CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated to where MMIO might be. Should you perhaps simply start from an all 1:1 mapping, inserting the RAM translations as you find them?

Jan
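The "start from an all 1:1 mapping, inserting the RAM translations as you find them" idea can be sketched in userspace C as follows. Everything here is an assumption made for illustration: `toy_ram_range`, `toy_ram` and `toy_lookup` are hypothetical names, the `mfn_base` values are invented, and the two RAM ranges are simply the large "usable" E820 entries from the dmesg converted to PFNs. Real P2M code stores its entries very differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One RAM translation: PFNs in [start_pfn, end_pfn) map to mfn_base onwards. */
struct toy_ram_range {
	uint64_t start_pfn;
	uint64_t end_pfn;   /* exclusive */
	uint64_t mfn_base;  /* invented machine frame base */
};

/* The two big "usable" E820 ranges from the log, expressed as PFNs. */
static const struct toy_ram_range toy_ram[] = {
	{ 0x100,    0x7dd77,   0x200000 },
	{ 0x100000, 0x1080000, 0x400000 },
};

#define TOY_RAM_RANGES (sizeof(toy_ram) / sizeof(toy_ram[0]))

/*
 * Default-identity lookup: RAM ranges carry a real translation,
 * every other PFN (MMIO, MMCONFIG, holes) maps to itself.
 */
static uint64_t toy_lookup(const struct toy_ram_range *ram, size_t n, uint64_t pfn)
{
	size_t i;

	assert(ram != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pfn >= ram[i].start_pfn && pfn < ram[i].end_pfn)
			return ram[i].mfn_base + (pfn - ram[i].start_pfn);
	}
	return pfn; /* not RAM: treat as 1:1 */
}
```

Under this scheme the BAR PFN 0x380fff000 and the MMCONFIG PFN 0x80000 both map to themselves without needing an upper bound such as MAX_DOMAIN_PAGES, which is the point of the objection above.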
On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
> >>> On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
> >> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
> >...
> > --- a/arch/x86/xen/setup.c
> > +++ b/arch/x86/xen/setup.c
> > @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
> >
> > 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> > 	}
> > +	/* Anything past the balloon area is marked as identity. */
> > +	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
> > +		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
>
> Hardly - MAX_DOMAIN_PAGES derives from
> CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
> to where MMIO might be. Should you perhaps simply start from

Looks like your mailer ate some words.

> an all 1:1 mapping, inserting the RAM translations as you find
> them?

Yeah, as this code can be called for the regions under 4GB. Definitely needs more analysis.

Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?

>
> Jan
>
>>> On 21.10.13 at 16:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
>> >>> On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
>> >> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
>> >...
>> > --- a/arch/x86/xen/setup.c
>> > +++ b/arch/x86/xen/setup.c
>> > @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
>> >
>> > 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
>> > 	}
>> > +	/* Anything past the balloon area is marked as identity. */
>> > +	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
>> > +		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
>>
>> Hardly - MAX_DOMAIN_PAGES derives from
>> CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
>> to where MMIO might be. Should you perhaps simply start from
>
> Looks like your mailer ate some words.

I don't think so - they're all there in the text you quoted.

>> an all 1:1 mapping, inserting the RAM translations as you find
>> them?
>
>
> Yeah, as this code can be called for the regions under 4GB. Definitely
> needs more analysis.
>
> Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?

That was for PVH, and is obviously fragile, as there can be MMIO regions not matched by any PCI device's BAR. We could hope for all of them to be below 4Gb, but I think (based on logs I got to see recently from a certain vendor's upcoming systems) this isn't going to work out.

Jan
On Mon, Oct 21, 2013 at 04:12:56PM +0100, Jan Beulich wrote:
> >>> On 21.10.13 at 16:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
> >> >>> On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >> > On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
> >> >> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
> >> >...
> >> > --- a/arch/x86/xen/setup.c
> >> > +++ b/arch/x86/xen/setup.c
> >> > @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
> >> >
> >> > 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> >> > 	}
> >> > +	/* Anything past the balloon area is marked as identity. */
> >> > +	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
> >> > +		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
> >>
> >> Hardly - MAX_DOMAIN_PAGES derives from
> >> CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
> >> to where MMIO might be. Should you perhaps simply start from
> >
> > Looks like your mailer ate some words.
>
> I don't think so - they're all there in the text you quoted.
>
> >> an all 1:1 mapping, inserting the RAM translations as you find
> >> them?
> >
> >
> > Yeah, as this code can be called for the regions under 4GB. Definitely
> > needs more analysis.
> >
> > Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?
>
> That was for PVH, and is obviously fragile, as there can be MMIO
> regions not matched by any PCI device's BAR. We could hope for
> all of them to be below 4Gb, but I think (based on logs I got to see
> recently from a certain vendor's upcoming systems) this isn't going
> to work out.

This is the patch I had in mind that I think will fix these issues. But I would appreciate testing it and naturally send me the dmesg if possible.
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index b232908..258e3f9 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -133,6 +133,25 @@ static void balloon_append(struct page *page)
 	adjust_managed_page_count(page, -1);
 }
 
+/*
+ * Check if any the balloon pages overlap with the supplied
+ * pfn and its range.
+ */
+bool balloon_pfn(unsigned long pfn, unsigned long nr)
+{
+	struct page *page;
+
+	if (list_empty(&ballooned_pages))
+		return false;
+
+	list_for_each_entry(page, &ballooned_pages, lru) {
+		unsigned long b_pfn = page_to_pfn(page);
+
+		if (b_pfn >= pfn && b_pfn < pfn + nr)
+			return true;
+	}
+	return false;
+}
 /* balloon_retrieve: rescue a page from the balloon, if it is not empty. */
 static struct page *balloon_retrieve(bool prefer_highmem)
 {
diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 18fff88..7e5ff49 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -17,11 +17,16 @@
  * Author: Weidong Han <weidong.han@intel.com>
  */
 
+#define DEBUG 1
+
 #include <linux/pci.h>
 #include <linux/acpi.h>
 #include <xen/xen.h>
 #include <xen/interface/physdev.h>
 #include <xen/interface/xen.h>
+#include <xen/interface/memory.h>
+#include <xen/page.h>
+#include <xen/balloon.h>
 
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
@@ -123,10 +128,78 @@ static int xen_add_device(struct device *dev)
 		r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add,
 					  &manage_pci);
 	}
-
 	return r;
 }
 
+static void xen_p2m_add_device(struct device *dev)
+{
+	int i;
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+
+	/* Verify whether the MMIO BARs are 1-1 in the P2M. */
+	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+		unsigned long pfn, start, end, ok_pfns;
+		char bus_addr[64];
+		char *fmt;
+
+		if (!pci_resource_len(pci_dev, i))
+			continue;
+
+		if (pci_resource_flags(pci_dev, i) == IORESOURCE_IO)
+			fmt = " (bus address [%#06llx-%#06llx])";
+		else
+			fmt = " (bus address [%#010llx-%#010llx])";
+
+		snprintf(bus_addr, sizeof(bus_addr), fmt,
+			 (unsigned long long) (pci_resource_start(pci_dev, i)),
+			 (unsigned long long) (pci_resource_end(pci_dev, i)));
+
+		start = pci_resource_start(pci_dev, i) >> PAGE_SHIFT;
+		end = pci_resource_end(pci_dev, i) >> PAGE_SHIFT;
+
+		/*
+		 * We don't worry about the balloon scratch page as it has a
+		 * valid PFN - which means we will catch in the loop below.
+		 */
+		if (balloon_pfn(start, end - start)) {
+			dev_warn(dev, "%s is within balloon pages!\n", bus_addr);
+			continue;
+		}
+
+		for (ok_pfns = 0, pfn = start; pfn < end; pfn ++) {
+			unsigned long mfn = pfn_to_mfn(pfn);
+
+			if (mfn == pfn) {
+				ok_pfns ++;
+				continue;
+			}
+			if (mfn != INVALID_P2M_ENTRY) { /* RAM */
+				dev_warn(dev, "%s is within RAM [%lx] region!\n", bus_addr, pfn);
+				break;
+			}
+		}
+		dev_dbg(dev, "%s pfn:%lx, s:%lx, e:%lx ok:%ld\n", bus_addr, pfn, start, end, ok_pfns);
+		if (pfn != end - 1) /* We broke out of the loop above. */
+			continue;
+
+		if (ok_pfns == end - start) /* All good. */
+			continue;
+
+		dev_dbg(dev, "%s [%lx->%lx]\n", bus_addr, start, end);
+
+		/* This BAR was not detected during E820 parsing. */
+		for (pfn = start; pfn < end; pfn ++) {
+			if (!set_phys_to_machine(pfn, pfn))
+				break;
+		}
+		WARN(pfn != end - 1, "Only set %ld instead of %ld PFNs!\n",
+		     end - pfn, end - start);
+
+		dev_info(dev, "%s set %ld page(s) to 1-1 mapping.\n",
+			 bus_addr, end - pfn);
+	}
+}
+
 static int xen_remove_device(struct device *dev)
 {
 	int r;
@@ -164,10 +237,14 @@ static int xen_pci_notifier(struct notifier_block *nb,
 
 	switch (action) {
 	case BUS_NOTIFY_ADD_DEVICE:
-		r = xen_add_device(dev);
+		if (xen_initial_domain())
+			r = xen_add_device(dev);
+		if (r == 0)
+			xen_p2m_add_device(dev);
 		break;
 	case BUS_NOTIFY_DEL_DEVICE:
-		r = xen_remove_device(dev);
+		if (xen_initial_domain())
+			r = xen_remove_device(dev);
 		break;
 	default:
 		return NOTIFY_DONE;
@@ -185,9 +262,8 @@ static struct notifier_block device_nb = {
 
 static int __init register_xen_pci_notifier(void)
 {
-	if (!xen_initial_domain())
+	if (!xen_domain())
 		return 0;
-
 	return bus_register_notifier(&pci_bus_type, &device_nb);
 }
 
diff --git a/include/xen/balloon.h b/include/xen/balloon.h
index a4c1c6a..60ecc50 100644
--- a/include/xen/balloon.h
+++ b/include/xen/balloon.h
@@ -41,3 +41,4 @@ static inline int register_xen_selfballooning(struct device *dev)
 	return -ENOSYS;
 }
 #endif
+bool balloon_pfn(unsigned long pfn, unsigned long nr);

>
> Jan
>
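For anyone checking the PFN arithmetic in a patch like this, the following is a small userspace sketch of turning a BAR's byte range into a page range. It assumes, as I understand the kernel's resource convention, that the end address returned by pci_resource_end() is the last byte of the BAR (inclusive), so a half-open PFN range needs one added to the shifted end. The names `toy_pfn_range`, `toy_bar_to_pfns`, `toy_range_pages` and `toy_ranges_overlap` are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* 4 KiB pages */

/* Half-open PFN range: start is included, end is one past the last PFN. */
struct toy_pfn_range {
	uint64_t start;
	uint64_t end;
};

/* Convert an inclusive byte range [first_byte, last_byte] to a half-open PFN range. */
static struct toy_pfn_range toy_bar_to_pfns(uint64_t first_byte, uint64_t last_byte)
{
	struct toy_pfn_range r;

	assert(last_byte >= first_byte);
	r.start = first_byte >> TOY_PAGE_SHIFT;
	r.end = (last_byte >> TOY_PAGE_SHIFT) + 1;
	return r;
}

/* Number of pages covered by the range. */
static uint64_t toy_range_pages(struct toy_pfn_range r)
{
	return r.end - r.start;
}

/* True if the two half-open ranges share at least one PFN. */
static bool toy_ranges_overlap(struct toy_pfn_range a, struct toy_pfn_range b)
{
	return a.start < b.end && b.start < a.end;
}
```

For Region 2 of the card (8M at 0x380fff000000, so the last byte is 0x380fff7fffff) this gives PFNs 0x380fff000 up to but not including 0x380fff800, i.e. 2048 pages, and no overlap with the RAM range ending at PFN 0x1080000.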
>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 10/23/13 5:37 PM >>>
>On Mon, Oct 21, 2013 at 04:12:56PM +0100, Jan Beulich wrote:
>> > Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?
>>
>> That was for PVH, and is obviously fragile, as there can be MMIO
>> regions not matched by any PCI device's BAR. We could hope for
>> all of them to be below 4Gb, but I think (based on logs I got to see
>> recently from a certain vendor's upcoming systems) this isn't going
>> to work out.
>
>This is the patch I had in mind that I think will fix these issues. But
>I would appreciate testing it and naturally send me the dmesg if possible.

So this indeed is only about PCI devices (i.e. not taking into account the comment I made earlier [above]).

Further, a brute force loop over all balloon pages seems like a pretty bad thing to do when the balloon is rather big.

And finally - did you check that the bus notification happens after resource assignment?

Jan
On Wed, Oct 23, 2013 at 04:45:37PM +0100, Jan Beulich wrote:
> >>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 10/23/13 5:37 PM >>>
> >On Mon, Oct 21, 2013 at 04:12:56PM +0100, Jan Beulich wrote:
> >> > Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?
> >>
> >> That was for PVH, and is obviously fragile, as there can be MMIO
> >> regions not matched by any PCI device's BAR. We could hope for
> >> all of them to be below 4Gb, but I think (based on logs I got to see
> >> recently from a certain vendor's upcoming systems) this isn't going
> >> to work out.
> >
> >This is the patch I had in mind that I think will fix these issues. But
> >I would appreciate testing it and naturally send me the dmesg if possible.
>
> So this indeed is only about PCI devices (i.e. not taking into account the
> comment I made earlier [above]).

Correct. What are some of those devices? It would help to understand what those are.

>
> Further, a brute force loop over all balloon pages seems like a pretty
> bad thing to do when the balloon is rather big.

Sure.

>
> And finally - did you check that the bus notification happens after resource
> assignment?

They do occur during the normal resource assignment. But I presume you meant during resource re-assignment?

>
> Jan
>
>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 10/23/13 6:05 PM >>>
>On Wed, Oct 23, 2013 at 04:45:37PM +0100, Jan Beulich wrote:
> So this indeed is only about PCI devices (i.e. not taking into account the
> comment I made earlier [above]).
>
>Correct.
>What are some of those devices? It would help to understand what those are.

The simplest possible example is MCFG ranges, which aren't required to be present in the E820 map.

>> And finally - did you check that the bus notification happens after resource
>> assignment?
>
>They do occur during the normal resource assignment. But I presume you
>meant during resource re-assignment?

No, I meant only assignment - iirc, re-assignment is still unimplemented.

Jan
On 23/10/13 16:36, Konrad Rzeszutek Wilk wrote:
> On Mon, Oct 21, 2013 at 04:12:56PM +0100, Jan Beulich wrote:
>>>>> On 21.10.13 at 16:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>> On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
>>>>>>> On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>>>> On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
>>>>>> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
>>>>> ...
>>>>> --- a/arch/x86/xen/setup.c
>>>>> +++ b/arch/x86/xen/setup.c
>>>>> @@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
>>>>>
>>>>> 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
>>>>> 	}
>>>>> +	/* Anything past the balloon area is marked as identity. */
>>>>> +	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
>>>>> +		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
>>>>
>>>> Hardly - MAX_DOMAIN_PAGES derives from
>>>> CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
>>>> to where MMIO might be. Should you perhaps simply start from
>>>
>>> Looks like your mailer ate some words.
>>
>> I don't think so - they're all there in the text you quoted.
>>
>>>> an all 1:1 mapping, inserting the RAM translations as you find
>>>> them?
>>>
>>>
>>> Yeah, as this code can be called for the regions under 4GB. Definitely
>>> needs more analysis.
>>>
>>> Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?
>>
>> That was for PVH, and is obviously fragile, as there can be MMIO
>> regions not matched by any PCI device's BAR. We could hope for
>> all of them to be below 4Gb, but I think (based on logs I got to see
>> recently from a certain vendor's upcoming systems) this isn't going
>> to work out.
>
> This is the patch I had in mind that I think will fix these issues. But
> I would appreciate testing it and naturally send me the dmesg if possible.

I think there is a simpler way to handle this.
If INVALID_P2M_ENTRY implies 1:1 and we arrange:

a) pfn_to_mfn() to return pfn if the mfn is missing in the p2m
b) mfn_to_pfn() to return mfn if p2m(m2p(mfn)) != mfn and there is no m2p override.

Then:

a) The identity p2m entries can be removed.
b) _PAGE_IOMAP becomes unnecessary.

David
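A rough userspace sketch of the two lookup rules proposed above, under loud assumptions: the tables `toy_p2m` and `toy_m2p`, the `TOY_MISSING` sentinel and the `toy_*` functions are all hypothetical, the table contents are invented, and the m2p override case is left out entirely. It is meant only to make the "missing means identity" behaviour concrete, not to mirror the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MISSING UINT64_MAX  /* hypothetical marker for "no entry" */
#define TOY_P2M_LEN 4
#define TOY_M2P_LEN 8

/* Invented tables: PFN 0->MFN 5, 1->6, 2 is missing, 3->7. */
static const uint64_t toy_p2m[TOY_P2M_LEN] = { 5, 6, TOY_MISSING, 7 };
static const uint64_t toy_m2p[TOY_M2P_LEN] = {
	TOY_MISSING, TOY_MISSING, TOY_MISSING, TOY_MISSING,
	TOY_MISSING, 0, 1, 3,
};

/* Rule a): a PFN with no p2m entry is treated as 1:1. */
static uint64_t toy_pfn_to_mfn(uint64_t pfn)
{
	if (pfn >= TOY_P2M_LEN || toy_p2m[pfn] == TOY_MISSING)
		return pfn;
	return toy_p2m[pfn];
}

/* Rule b): if the m2p/p2m round trip does not come back to mfn, treat it as 1:1. */
static uint64_t toy_mfn_to_pfn(uint64_t mfn)
{
	uint64_t pfn;

	if (mfn >= TOY_M2P_LEN)
		return mfn;
	pfn = toy_m2p[mfn];
	if (pfn >= TOY_P2M_LEN || toy_p2m[pfn] != mfn)
		return mfn;
	return pfn;
}

/* Small self-check showing both the translated and the identity cases. */
static void toy_self_check(void)
{
	assert(toy_pfn_to_mfn(0) == 5);                     /* RAM: translated */
	assert(toy_pfn_to_mfn(2) == 2);                     /* missing: identity */
	assert(toy_mfn_to_pfn(5) == 0);                     /* round trip succeeds */
	assert(toy_mfn_to_pfn(0x380fff000ULL) == 0x380fff000ULL); /* MMIO: identity */
}
```

Note that in this toy model a ballooned-out page (a missing entry) is indistinguishable from an MMIO page, which is one reason the "missing implies 1:1" assumption deserves care.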
On Fri, Oct 25, 2013 at 12:08:21AM +0100, David Vrabel wrote:
> On 23/10/13 16:36, Konrad Rzeszutek Wilk wrote:
> >On Mon, Oct 21, 2013 at 04:12:56PM +0100, Jan Beulich wrote:
> >>>>>On 21.10.13 at 16:44, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >>>On Mon, Oct 21, 2013 at 03:27:50PM +0100, Jan Beulich wrote:
> >>>>>>>On 21.10.13 at 16:18, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >>>>>On Mon, Oct 21, 2013 at 04:06:07PM +0200, Lukas Hejtmanek wrote:
> >>>>>> Region 2: Memory at 380fff000000 (64-bit, prefetchable) [size=8M]
> >>>>>...
> >>>>>--- a/arch/x86/xen/setup.c
> >>>>>+++ b/arch/x86/xen/setup.c
> >>>>>@@ -92,6 +92,9 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
> >>>>>
> >>>>> 		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> >>>>> 	}
> >>>>>+	/* Anything past the balloon area is marked as identity. */
> >>>>>+	for (pfn = xen_max_p2m_pfn; pfn < MAX_DOMAIN_PAGES; pfn++)
> >>>>>+		__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn));
> >>>>
> >>>>Hardly - MAX_DOMAIN_PAGES derives from
> >>>>CONFIG_XEN_MAX_DOMAIN_MEMORY, which in turn is unrelated
> >>>>to where MMIO might be. Should you perhaps simply start from
> >>>
> >>>Looks like your mailer ate some words.
> >>
> >>I don't think so - they're all there in the text you quoted.
> >>
> >>>>an all 1:1 mapping, inserting the RAM translations as you find
> >>>>them?
> >>>
> >>>
> >>>Yeah, as this code can be called for the regions under 4GB. Definitely
> >>>needs more analysis.
> >>>
> >>>Were you suggesting a lookup when we scan the PCI devices? (xen_add_device)?
> >>
> >>That was for PVH, and is obviously fragile, as there can be MMIO
> >>regions not matched by any PCI device's BAR. We could hope for
> >>all of them to be below 4Gb, but I think (based on logs I got to see
> >>recently from a certain vendor's upcoming systems) this isn't going
> >>to work out.
> >
> >This is the patch I had in mind that I think will fix these issues. But
> >I would appreciate testing it and naturally send me the dmesg if possible.
>
> I think there is a simpler way to handle this.
>
> If INVALID_P2M_ENTRY implies 1:1 and we arrange:

I am a bit afraid to make that assumption.

>
> a) pfn_to_mfn() to return pfn if the mfn is missing in the p2m

The balloon pages are of missing type (initially). And they should return INVALID_P2M_ENTRY at start - later on they will return the scratch_page.

> b) mfn_to_pfn() to return mfn if p2m(m2p(mfn)) != mfn and there is
> no m2p override.

The toolstack can map pages for which p2m(m2p(mfn)) != mfn and which have no m2p override.

>
> Then:
>
> a) The identity p2m entries can be removed.
> b) _PAGE_IOMAP becomes unnecessary.

You still need it for the toolstack to map other guests' pages (xen_privcmd_map).

I think for right now, to fix this issue, going ahead and setting 1-1 in the P2M for affected devices (PCI and MCFG) is simpler, b/c:

- We only do it when said device is in the guest (so if you launch a PCI PV guest you can still migrate it - after unplugging the device). Assuming all 1-1 regions might not be healthy (I had a heck of a time fixing all of the migration issues when I wrote the 1:1 code).
- It will make the PVH hypercall to mark I/O regions easier. Instead of assuming that all non-RAM space is I/O regions, it will be able to selectively set up the entries for said regions. I think that is what Jan suggested?
- This is a bug - so let's fix it as a bug first. Redoing the P2M is certainly an option but I am not signing up for that this year.

Let me post my two patches that fix this for PCI devices and MCFG areas.