Gordan Bobic
2013-Jul-23 22:34 UTC
Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU passthrough without crashes. Unfortunately, the same crashes still happen. Massive frame buffer corruption on domU before it locks up solid. It seems the PCI memory stomp is still happening.

I am using qemu-dm, as I did on Xen 4.2.x. So whatever fix for this went into 4.3.0 didn't fix it for me. Passing less than 2GB of RAM to domU still works fine.

I have attached:

qemu-dm log for domU
xl dmesg

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Jul-24 14:08 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
> passthrough without crashes. Unfortunately, the same crashes still
> happen. Massive frame buffer corruption on domU before it locks up
> solid. It seems the PCI memory stomp is still happening.

If you boot Xen with guest_loglvl=all

and then run the guest, the console (xl dmesg) should also have the output from QEMU - that will help in seeing how it constructs the E820 (which was the problem last time).

Are you also able to get the serial log from the guest? (If this is Linux?) I usually have this in my guest config:

serial='pty'

and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug', which will output everything to the 'xl console <guest> | tee /tmp/log'.

> I am using qemu-dm, as I did on Xen 4.2.x.
>
> So whatever fix for this went into 4.3.0 didn't fix it for me.
> Passing less than 2GB of RAM to domU still works fine.
>
> I have attached:
>
> qemu-dm log for domU
> xl dmesg

> domid: 1
> Using file /dev/zvol/ssd/edi in read-write mode
> Watching /local/domain/0/device-model/1/logdirty/cmd
> Watching /local/domain/0/device-model/1/command
> Watching /local/domain/1/cpu
> char device redirected to /dev/pts/3
> qemu_map_cache_init nr_buckets = 10000 size 4194304
> shared page at pfn feffd
> buffered io page at pfn feffb
> Guest uuid = a57e6840-e9f5-4a14-a822-b2cc662c177f
> populating video RAM at ff000000
> mapping video RAM from ff000000
> Register xen platform.
> Done register platform.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> xs_read(/local/domain/0/device-model/1/xen_extended_power_mgmt): read error
> xs_read(): vncpasswd get error. /vm/a57e6840-e9f5-4a14-a822-b2cc662c177f/vncpasswd.
> Log-dirty: no command yet.
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> vcpu-set: watch node error.
> [xenstore_process_vcpu_set_event]: /local/domain/1/cpu has no CPU!
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> xs_read(/local/domain/1/log-throttling): read error
> qemu: ignoring not-understood drive `/local/domain/1/log-throttling'
> medium change watch on `/local/domain/1/log-throttling' - unknown device, ignored
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 00:1a.1 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1a.0x1
> pt_register_regions: IO region registered (size=0x00000020 base_addr=0x00009a01)
> pci_intx: intx=2
> register_real_device: Real physical device 00:1a.1 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 0d:00.0 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0xd:0x0.0x0
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xd7efc000)
> pci_intx: intx=1
> register_real_device: Real physical device 0d:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.0 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x0
> pt_register_regions: IO region registered (size=0x02000000 base_addr=0xf8000000)
> pt_register_regions: IO region registered (size=0x08000000 base_addr=0xb800000c)
> pt_register_regions: IO region registered (size=0x04000000 base_addr=0xb400000c)
> pt_register_regions: IO region registered (size=0x00000080 base_addr=0x0000df81)
> pt_register_regions: Expansion ROM registered (size=0x00080000 base_addr=0xfbd00000)
> pt_msi_setup: msi mapped with pirq 4f
> pci_intx: intx=1
> register_real_device: Real physical device 08:00.0 registered successfuly!
> IRQ type = MSI-INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.1 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x1
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xfbdfc000)
> pt_msi_setup: msi mapped with pirq 4e
> pci_intx: intx=2
> register_real_device: Real physical device 08:00.1 registered successfuly!
> IRQ type = MSI-INTx
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=1
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=1
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=1
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=1
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=1
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=1
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=1
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.
> Unknown PV product 2 loaded in guest
> PV driver build 1
> region type 0 at [ef880000,ef8a0000).
> squash iomem [ef880000, ef8a0000).
> region type 1 at [c180,c1c0).
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_ioport_map: e_phys=ffff pio_base=9a00 len=32 index=4 first_map=0
> pt_pci_write_config: [00:05:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:06:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:08:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=9a00 len=32 index=4 first_map=0
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=0

>  __  __            _  _    _____  ___    _      _  __
>  \ \/ /___ _ __   | || |  |___ / / _ \  / |  ___| |/ /_
>   \  // _ \ '_ \  | || |_   |_ \| | | |__| | / _ \ | '_ \
>   /  \  __/ | | | |__   _| ___) | |_| |__| ||  __/ | (_) |
>  /_/\_\___|_| |_|   |_|(_)____(_)___/ |_(_)___|_|\___/
>
> (XEN) Xen version 4.3.0 (root@shatteredsilicon.net) (gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6)) debug=n Tue Jul 23 14:28:40 BST 2013
> (XEN) Latest ChangeSet:
> (XEN) Bootloader: GNU GRUB 0.97
> (XEN) Command line: noreboot dom0_vcpus_pin
> (XEN) Video information:
> (XEN)  VGA is text mode 80x25, font 8x16
> (XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
> (XEN) Disc information:
> (XEN)  Found 4 MBR signatures
> (XEN)  Found 4 EDD information structures
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009d400 (usable)
> (XEN)  000000000009d400 - 00000000000a0000 (reserved)
> (XEN)  00000000000e0000 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 000000003f790000 (usable)
> (XEN)  000000003f790000 - 000000003f79e000 (ACPI data)
> (XEN)  000000003f79e000 - 000000003f7d0000 (ACPI NVS)
> (XEN)  000000003f7d0000 - 000000003f7e0000 (reserved)
> (XEN)  000000003f7e7000 - 0000000040000000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee01000 (reserved)
> (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 0000000cc0000000 (usable)
> (XEN) ACPI: RSDP 000F9F70, 0024 (r2 ACPIAM)
> (XEN) ACPI: XSDT 3F790100, 0064 (r1 042413 XSDT1438 20130424 MSFT 97)
> (XEN) ACPI: FACP 3F790290, 00F4 (r4 042413 FACP1438 20130424 MSFT 97)
> (XEN) ACPI: DSDT 3F7904F0, 58A3 (r2 1W555 1W555A58 A58 INTL 20051117)
> (XEN) ACPI: FACS 3F79E000, 0040
> (XEN) ACPI: APIC 3F790390, 0118 (r2 042413 APIC1438 20130424 MSFT 97)
> (XEN) ACPI: MCFG 3F7904B0, 003C (r1 042413 OEMMCFG 20130424 MSFT 97)
> (XEN) ACPI: OEMB 3F79E040, 0082 (r1 042413 OEMB1438 20130424 MSFT 97)
> (XEN) ACPI: SRAT 3F79A4F0, 0250 (r2 042413 OEMSRAT 1 INTL 1)
> (XEN) ACPI: HPET 3F79A740, 0038 (r1 042413 OEMHPET 20130424 MSFT 97)
> (XEN) ACPI: DMAR 3F79E0D0, 0120 (r1 AMI OEMDMAR 1 MSFT 97)
> (XEN) ACPI: SSDT 3F7A4C70, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
> (XEN) System RAM: 49143MB (50322612kB)
> (XEN) Domain heap initialised DMA width 32 bits
> (XEN) Processor #0 6:12 APIC version 21
> (XEN) Processor #2 6:12 APIC version 21
> (XEN) Processor #4 6:12 APIC version 21
> (XEN) Processor #16 6:12 APIC version 21
> (XEN) Processor #18 6:12 APIC version 21
> (XEN) Processor #20 6:12 APIC version 21
> (XEN) Processor #32 6:12 APIC version 21
> (XEN) Processor #34 6:12 APIC version 21
> (XEN) Processor #36 6:12 APIC version 21
> (XEN) Processor #48 6:12 APIC version 21
> (XEN) Processor #50 6:12 APIC version 21
> (XEN) Processor #52 6:12 APIC version 21
> (XEN) Processor #1 6:12 APIC version 21
> (XEN) Processor #3 6:12 APIC version 21
> (XEN) Processor #5 6:12 APIC version 21
> (XEN) Processor #17 6:12 APIC version 21
> (XEN) Processor #19 6:12 APIC version 21
> (XEN) Processor #21 6:12 APIC version 21
> (XEN) Processor #33 6:12 APIC version 21
> (XEN) Processor #35 6:12 APIC version 21
> (XEN) Processor #37 6:12 APIC version 21
> (XEN) Processor #49 6:12 APIC version 21
> (XEN) Processor #51 6:12 APIC version 21
> (XEN) Processor #53 6:12 APIC version 21
> (XEN) IOAPIC[0]: apic_id 6, version 32, address 0xfec00000, GSI 0-23
> (XEN) IOAPIC[1]: apic_id 7, version 32, address 0xfec8a000, GSI 24-47
> (XEN) Enabling APIC mode: Phys. Using 2 I/O APICs
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Detected 3321.755 MHz processor.
> (XEN) Initing memory sharing.
> (XEN) PCI: Not using MCFG for segment 0000 bus 00-ff
> (XEN) Intel VT-d iommu 0 supported page sizes: 4kB.
> (XEN) Intel VT-d Snoop Control enabled.
> (XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
> (XEN) Intel VT-d Queued Invalidation enabled.
> (XEN) Intel VT-d Interrupt Remapping not enabled.
> (XEN) Intel VT-d Shared EPT tables not enabled.
> (XEN) I/O virtualisation enabled
> (XEN)  - Dom0 mode: Relaxed
> (XEN) Interrupt remapping disabled
> (XEN) Enabled directed EOI with ioapic_ack_old on!
> (XEN) ENABLING IO-APIC IRQs
> (XEN)  -> Using old ACK method
> (XEN) Platform timer is 14.318MHz HPET
> (XEN) Allocated console ring of 64 KiB.
> (XEN) VMX: Supported advanced features:
> (XEN)  - APIC MMIO access virtualisation
> (XEN)  - APIC TPR shadow
> (XEN)  - Extended Page Tables (EPT)
> (XEN)  - Virtual-Processor Identifiers (VPID)
> (XEN)  - Virtual NMI
> (XEN)  - MSR direct-access bitmap
> (XEN)  - Unrestricted Guest
> (XEN) HVM: ASIDs enabled.
> (XEN) HVM: VMX enabled
> (XEN) HVM: Hardware Assisted Paging (HAP) detected
> (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
> (XEN) Brought up 24 CPUs
> (XEN) verify_tsc_reliability: TSC warp detected, disabling TSC_RELIABLE
> (XEN) *** LOADING DOMAIN 0 ***
> (XEN) Xen kernel: 64-bit, lsb, compat32
> (XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1f70000
> (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (XEN)  Dom0 alloc.: 0000000420000000->0000000430000000 (12302085 pages to be allocated)
> (XEN)  Init. ramdisk: 0000000cbbdc3000->0000000cbffff400
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: ffffffff81000000->ffffffff81f70000
> (XEN)  Init. ramdisk: ffffffff81f70000->ffffffff861ac400
> (XEN)  Phys-Mach map: ffffffff861ad000->ffffffff8c029a10
> (XEN)  Start info: ffffffff8c02a000->ffffffff8c02a4b4
> (XEN)  Page tables: ffffffff8c02b000->ffffffff8c090000
> (XEN)  Boot stack: ffffffff8c090000->ffffffff8c091000
> (XEN)  TOTAL: ffffffff80000000->ffffffff8c400000
> (XEN)  ENTRY ADDRESS: ffffffff818091e0
> (XEN) Dom0 has maximum 24 VCPUs
> (XEN) Scrubbing Free RAM: .done.
> (XEN) Initial low memory virq threshold set at 0x4000 pages.
> (XEN) Std. Loglevel: Errors and warnings
> (XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
> (XEN) Xen is relinquishing VGA console.
> (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
> (XEN) Freed 272kB init memory.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
Gordan Bobic
2013-Jul-24 14:17 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, 24 Jul 2013 10:08:13 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
>> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
>> passthrough without crashes. Unfortunately, the same crashes still
>> happen. Massive frame buffer corruption on domU before it locks up
>> solid. It seems the PCI memory stomp is still happening.
>
> If you boot Xen with guest_loglvl=all
>
> and then run the guest, the console (xl dmesg) should also have
> the output from QEMU - that will help in seeing how it constructs
> the E820 (which was the problem last time).

I will gather this tonight - apologies, I forgot that I removed the loglvl=all options from my boot config.

> Are you also able to get the serial log from the guest? (If this is
> Linux?) I usually have this in my guest config:
>
> serial='pty'
>
> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
> which will output everything to the 'xl console <guest> | tee
> /tmp/log'.

The intended guest is XP64. I will, however, get a Linux guest up and running with the exact same domU config (apart from the disk volume) for debugging this.

Gordan
Konrad Rzeszutek Wilk
2013-Jul-24 16:06 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 03:17:50PM +0100, Gordan Bobic wrote:
> On Wed, 24 Jul 2013 10:08:13 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
>>> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
>>> passthrough without crashes. Unfortunately, the same crashes still
>>> happen. Massive frame buffer corruption on domU before it locks up
>>> solid. It seems the PCI memory stomp is still happening.
>>
>> If you boot Xen with guest_loglvl=all
>>
>> and then run the guest, the console (xl dmesg) should also have
>> the output from QEMU - that will help in seeing how it constructs
>> the E820 (which was the problem last time).
>
> I will gather this tonight - apologies, I forgot that I removed
> the loglvl=all options from my boot config.

Take your time.

>> Are you also able to get the serial log from the guest? (If this is
>> Linux?) I usually have this in my guest config:
>>
>> serial='pty'
>>
>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>> which will output everything to the 'xl console <guest> | tee
>> /tmp/log'.
>
> The intended guest is XP64. I will, however, get a Linux guest up

Ah, I am not actually sure how Linux will work. I hadn't had a chance to test that recently :-(

> and running with the exact same domU config (apart from the disk
> volume) for debugging this.
>
> Gordan
Gordan Bobic
2013-Jul-24 16:14 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>> Are you also able to get the serial log from the guest? (If this is
>>> Linux?) I usually have this in my guest config:
>>>
>>> serial='pty'
>>>
>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>> which will output everything to the 'xl console <guest> | tee
>>> /tmp/log'.
>>
>> The intended guest is XP64. I will, however, get a Linux guest up
>
> Ah, I am not actually sure how Linux will work. I hadn't had a chance
> to test that recently :-(

As long as it brings up the serial console, that should be sufficient, but working VNC to text console login would be convenient. The main thing I want to find on it is the BAR mapping addresses from lspci, to compare against the e820 map from dmesg.

I wouldn't expect the memory map provided by SeaBIOS and the BAR mappings configured by qemu-dm to differ depending on the domU OS. Or am I wrong here?

If there is any overlap, the problem should be obvious. If there is no overlap, then something even more bizarre is going on, but we can worry about that later. :)

Gordan
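[Editorial note: the BAR-vs-e820 comparison Gordan describes is a plain interval-intersection check. A minimal sketch follows; the BAR placements are the e_phys/len values from the pt_iomem_map lines in the qemu-dm log above, and the RAM ranges are illustrative guest E820 "usable" entries, not taken verbatim from any attached file.]

```python
def overlaps(a, b):
    """True if half-open ranges a=[a0,a1) and b=[b0,b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

# Illustrative guest E820 "usable" RAM ranges (start, end):
ram = [(0x00000000, 0x0009e000), (0x00100000, 0xe0000000)]

# Guest-physical BAR placements from the pt_iomem_map log entries
# (e_phys, e_phys + len):
bars = [(0xe0000000, 0xe0000000 + 0x08000000),  # 128M prefetchable BAR
        (0xe8000000, 0xe8000000 + 0x04000000),  # 64M prefetchable BAR
        (0xec000000, 0xec000000 + 0x02000000)]  # 32M non-prefetchable BAR

# Any (RAM, BAR) pair that intersects indicates a guest-side clash:
clashes = [(r, b) for r in ram for b in bars if overlaps(r, b)]
print(clashes)  # [] - BARs sit above the top of low RAM, no overlap
```

With these numbers the list is empty, which matches the later finding in the thread that the guest-side map itself has no overlaps.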
Konrad Rzeszutek Wilk
2013-Jul-24 16:31 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 05:14:32PM +0100, Gordan Bobic wrote:
> On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
>>>> Are you also able to get the serial log from the guest? (If this is
>>>> Linux?) I usually have this in my guest config:
>>>>
>>>> serial='pty'
>>>>
>>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>>> which will output everything to the 'xl console <guest> | tee
>>>> /tmp/log'.
>>>
>>> The intended guest is XP64. I will, however, get a Linux guest up
>>
>> Ah, I am not actually sure how Linux will work. I hadn't had a chance
>> to test that recently :-(
>
> As long as it brings up the serial console, that should be
> sufficient, but working VNC to text console login would be
> convenient. The main thing I want to find on it is the
> BAR mapping addresses from lspci and compare that to the
> e820 map from dmesg.

I see. That should work for you.

> I wouldn't expect the memory map provided by SeaBIOS and
> the BAR mappings configured by qemu-dm to differ
> depending on the domU OS. Or am I wrong here?

They might. The patches to fix the 2GB limit went in qemu-xen-traditional, meaning you have to use:

device_model_version = 'qemu-xen-traditional'

in your guest config (which I think you are already doing). I don't recall what the situation is with upstream SeaBIOS.

> If there is any overlap, the problem should be obvious.
> If there is no overlap, then something even more
> bizarre is going on, but we can worry about that
> later. :)
Gordan Bobic
2013-Jul-24 17:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/24/2013 05:31 PM, Konrad Rzeszutek Wilk wrote:
>> I wouldn't expect the memory map provided by SeaBIOS and
>> the BAR mappings configured by qemu-dm to differ
>> depending on the domU OS. Or am I wrong here?
>
> They might. The patches to fix the 2GB limit went in qemu-xen-traditional,
> meaning you have to use:
>
> device_model_version = 'qemu-xen-traditional'
>
> in your guest config (which I think you are already doing).

Yes and no. I am using a self-built 4.3.0 rpm based fairly closely on the CRC 4.2.x rpms for EL6. This includes a patch to only build qemu-dm and make it the default, which, presumably, means that I don't have to explicitly specify device_model_version. But maybe I'm wrong. I'll try specifying it explicitly and see if that helps.

Gordan
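[Editorial note: for readers following along, the options discussed so far combine into an xl guest config along these lines. This is a hedged sketch, not a config from the thread: the memory size, disk line, and vga setting are placeholders; the pci BDFs are the passed-through devices from the qemu-dm log.]

```
builder               = 'hvm'
memory                = 8192
device_model_version  = 'qemu-xen-traditional'   # the DM carrying the 2GB fix
serial                = 'pty'                    # guest serial -> xl console
vga                   = 'stdvga'
disk                  = [ 'phy:/dev/zvol/ssd/edi,hda,w' ]
pci                   = [ '08:00.0', '08:00.1', '0d:00.0', '00:1a.1' ]
```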
Gordan Bobic
2013-Jul-24 22:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB (screen corruption + domU crash + sometimes dom0 crashing with it).

I can see in the xl-dmesg log in the 8GB case that there is memory remapping going on to allow for the lowmem MMIO hole, but it doesn't seem to help.

I will get a Linux VM up and running tomorrow and get a comparison of domU BARs vs. e820 map.
George Dunlap
2013-Jul-25 19:18 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
> (screen corruption + domU crash + sometimes dom0 crashing with it).
>
> I can see in the xl-dmesg log in the 8GB case that there is memory remapping
> going on to allow for the lowmem MMIO hole, but it doesn't seem to help.

Gordan,

There's a possibility that it's actually got nothing to do with relocation, but with bugs in your hardware. Can you try:

* Set the guest memory to 3600
* Boot the guest, and check that xl dmesg shows memory was *not* relocated
* Report whether it crashes

If it's a bug in the hardware, I would expect to see that memory was not relocated, but that the system will lock up anyway.

Can you also do lspci -vvv in dom0 before assigning the device and attach the output?

The hardware bug we've seen is this: In order for the IOMMU to work properly, *all* DMA transactions must be passed up to the root bridge so the IOMMU can translate the addresses from guest address to host address. Unfortunately, an awful lot of bridges will not do this properly, which means that the address is not translated properly, which means that if a *guest* memory address overlaps a *host* MMIO range, badness ensues. There's nothing we can do about this in Xen other than make the guest MMIO hole the same size as the host MMIO hole.

Thanks,
 -George
Gordan Bobic
2013-Jul-25 21:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Attached are:

domU-2GB dmesg, lspci
domU-8GB dmesg, lspci
map-2GB - memory map, e820 + PCI
map-8GB - memory map, e820 + PCI

There are no overlaps. In fact, the map is identical with 2040MB and 8192MB, except for the top usable range being bigger. So according to this, there _shouldn't_ be any memory clobbering going on within domU. Which leads on to what George said earlier, which I will reply to in a separate email.

What puzzles me, however, is that I thought that in 4.3.0 all 64-bit BARs should automatically be re-mapped to memory > 4GB, and that doesn't appear to be happening here. Or is the remapping only happening if there is not enough 32-bit space for all the BARs?

Gordan

On 07/24/2013 05:31 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 24, 2013 at 05:14:32PM +0100, Gordan Bobic wrote:
>> On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>>
>>>>> Are you also able to get the serial log from the guest? (If this is
>>>>> Linux?) I usually have this in my guest config:
>>>>>
>>>>> serial='pty'
>>>>>
>>>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>>>> which will output everything to the 'xl console <guest> | tee
>>>>> /tmp/log'.
>>>>
>>>> The intended guest is XP64. I will, however, get a Linux guest up
>>>
>>> Ah, I am not actually sure how Linux will work. I hadn't had a chance
>>> to test that recently :-(
>>
>> As long as it brings up the serial console, that should be
>> sufficient, but working VNC to text console login would be
>> convenient. The main thing I want to find on it is the
>> BAR mapping addresses from lspci and compare that to the
>> e820 map from dmesg.
>
> I see. That should work for you.
>
>> I wouldn't expect the memory map provided by SeaBIOS and
>> the BAR mappings configured by qemu-dm to differ
>> depending on the domU OS. Or am I wrong here?
>
> They might. The patches to fix the 2GB limit went in qemu-xen-traditional,
> meaning you have to use:
>
> device_model_version = 'qemu-xen-traditional'
>
> in your guest config (which I think you are already doing).
>
> I don't recall what the situation is with upstream SeaBIOS.
>
>> If there is any overlap, the problem should be obvious.
>> If there is no overlap, then something even more
>> bizarre is going on, but we can worry about that
>> later. :)
Gordan Bobic
2013-Jul-25 21:48 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/25/2013 08:18 PM, George Dunlap wrote:
> On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
>> (screen corruption + domU crash + sometimes dom0 crashing with it).
>>
>> I can see in the xl-dmesg log in the 8GB case that there is memory remapping
>> going on to allow for the lowmem MMIO hole, but it doesn't seem to help.
>
> There's a possibility that it's actually got nothing to do with
> relocation, but with bugs in your hardware.

That wouldn't surprise me at all, unfortunately. :(

> Can you try:
> * Set the guest memory to 3600
> * Boot the guest, and check that xl dmesg shows memory was *not* relocated
> * Report whether it crashes

xl dmesg from booting a Linux domU with 3600MB is attached. The crash is never immediate; both Linux and Windows boot fine. But when a large 3D application like a game loads, there is frame buffer corruption immediately visible, and the domU will typically lock up some seconds later. Infrequently, it will take the host down with it.

> If it's a bug in the hardware, I would expect to see that memory was
> not relocated, but that the system will lock up anyway.

That is indeed what seems to happen - the memory map looks OK, with no overlaps between PCI memory and ROM ranges and the usable or reserved e820 regions.

> Can you also do lspci -vvv in dom0 before assigning the device and
> attach the output?

I have attached it, but not before assigning - I'll need to reboot for that. Do you expect there to be a difference in mapping in dom0 before and after assigning the device to domU?

> The hardware bug we've seen is this: In order for the IOMMU to work
> properly, *all* DMA transactions must be passed up to the root bridge
> so the IOMMU can translate the addresses from guest address to host
> address. Unfortunately, an awful lot of bridges will not do this
> properly, which means that the address is not translated properly,
> which means that if a *guest* memory address overlaps a *host*
> MMIO range, badness ensues.

Hmm, looking at xl dmesg vs dom0 lspci, that does appear to be the case:

xl dmesg:

(XEN) HVM24: E820 table:
(XEN) HVM24:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(XEN) HVM24:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(XEN) HVM24:  HOLE: 00000000:000a0000 - 00000000:000e0000
(XEN) HVM24:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(XEN) HVM24:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
(XEN) HVM24:  HOLE: 00000000:e0000000 - 00000000:fc000000
(XEN) HVM24:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
(XEN) HVM24:  [05]: 00000001:00000000 - 00000001:00800000: RAM

lspci:

08:00.0 VGA compatible controller: nVidia Corporation GF100
        Region 0: Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=32M]
        Region 1: Memory at b8000000 (64-bit, prefetchable) [disabled] [size=128M]
        Region 3: Memory at b4000000 (64-bit, prefetchable) [disabled] [size=64M]

Unless I'm reading this wrong, it means that physical GPU region 0 is in the domU reserved area, and GPU regions 1 and 3 are in the domU RAM area.

b4000000 = 2880MB

So in theory, that might mean that I should be able to get away with up to 2880MB of RAM for domU without encountering frame buffer corruption and the crash. I will test this shortly.

> There's nothing we can do about this in
> Xen other than make the guest MMIO hole the same size as the host MMIO
> hole.

Not sure I follow. Do you mean make it so that pBAR = vBAR?

Gordan
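[Editorial note: the arithmetic behind the "up to 2880MB" conclusion above can be checked directly from the dom0 lspci values quoted in the message; the hex-to-MiB conversion is the only step involved.]

```python
# Host-side BAR bases of the passed-through GPU, from dom0 lspci:
bar1 = 0xb8000000  # Region 1, 128M prefetchable
bar3 = 0xb4000000  # Region 3, 64M prefetchable (lowest host BAR)
bar0 = 0xf8000000  # Region 0, 32M non-prefetchable (in guest RESERVED area)

MiB = 2**20
print(bar3 // MiB)        # 2880 - so host MMIO starts at 2880MB
print(0xe0000000 // MiB)  # 3584 - top of guest low RAM per the E820 table

# Guest RAM pages in [2880MB, 3584MB) share guest-physical addresses with
# the host GPU BARs: any DMA that bypasses the IOMMU translation will hit
# the device instead of RAM, which is the failure mode George describes.
```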
Gordan Bobic
2013-Jul-25 22:23 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/25/2013 10:48 PM, Gordan Bobic wrote:
> On 07/25/2013 08:18 PM, George Dunlap wrote:
>> On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
>>> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
>>> (screen corruption + domU crash + sometimes dom0 crashing with it).
>>>
>>> I can see in the xl-dmesg log in the 8GB case that there is memory
>>> remapping going on to allow for the lowmem MMIO hole, but it doesn't
>>> seem to help.
>>
>> There's a possibility that it's actually got nothing to do with
>> relocation, but with bugs in your hardware.
>
> That wouldn't surprise me at all, unfortunately. :(
>
>> Can you try:
>>  * Set the guest memory to 3600
>>  * Boot the guest, and check to make sure that xl dmesg shows it does
>>    *not* relocate memory?
>>  * Report whether it crashes?
>
> xl dmesg from booting a Linux domU with 3600MB is attached.
> The crash is never immediate; both Linux and Windows boot fine. But when
> a large 3D application like a game loads, there is frame buffer
> corruption immediately visible, and the domU will typically lock up some
> seconds later. Infrequently, it will take the host down with it.
>
>> If it's a bug in the hardware, I would expect to see that memory was
>> not relocated, but that the system will lock up anyway.
>
> That is indeed what seems to happen - the memory map looks OK, with no
> overlaps between the PCI memory and ROM ranges and the usable or
> reserved e820 regions.
>
>> Can you also do lspci -vvv in dom0 before assigning the device and
>> attach the output?
>
> I have attached it, but not before assigning - I'll need to reboot for
> that. Do you expect there to be a difference in mapping in dom0 before
> and after assigning the device to domU?
>
>> The hardware bug we've seen is this: In order for the IOMMU to work
>> properly, *all* DMA transactions must be passed up to the root bridge
>> so the IOMMU can translate the addresses from guest address to host
>> address. Unfortunately, an awful lot of bridges will not do this
>> properly, which means that the address is not translated properly,
>> which means that if a *guest* memory address overlaps a *host*
>> MMIO range, badness ensues.
>
> Hmm, looking at xl dmesg vs dom0 lspci, that does appear to be the case:
>
> xl dmesg:
> (XEN) HVM24: E820 table:
> (XEN) HVM24:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
> (XEN) HVM24:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
> (XEN) HVM24:  HOLE: 00000000:000a0000 - 00000000:000e0000
> (XEN) HVM24:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
> (XEN) HVM24:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
> (XEN) HVM24:  HOLE: 00000000:e0000000 - 00000000:fc000000
> (XEN) HVM24:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
> (XEN) HVM24:  [05]: 00000001:00000000 - 00000001:00800000: RAM
>
> lspci:
> 08:00.0 VGA compatible controller: nVidia Corporation GF100
>         Region 0: Memory at f8000000 (32-bit, non-prefetchable)
>         [disabled] [size=32M]
>         Region 1: Memory at b8000000 (64-bit, prefetchable) [disabled]
>         [size=128M]
>         Region 3: Memory at b4000000 (64-bit, prefetchable) [disabled]
>         [size=64M]
>
> Unless I'm reading this wrong, it means that physical GPU region 0 is in
> the domU reserved area, and GPU regions 1 and 2 are in the domU RAM area.
>
> b4000000 = 2880MB

Correction - my other GPU has a BAR mapped lower, at 0xa8000000, which
is 2688MB. So I upped my memory mapping to 2688MB, and lo and behold,
that doesn't crash and games work just fine without the frame buffer
getting corrupted.

Now, if I am understanding the basic nature of the problem correctly,
this _could_ be worked around by ensuring that vBAR = pBAR, since in
that case there is no room for the mis-mapped memory overwrites to
occur. Is that correct?

I guess I could test this easily enough by applying the vBAR = pBAR hack.

Gordan
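[Editorial note: the MB figures used above convert directly from the hex BAR bases; a throwaway arithmetic check, nothing Xen-specific:]

```python
# Check the MB figures quoted above: a BAR base address, read as a
# guest-physical RAM ceiling, converts to MiB by dividing by 2**20.

def addr_to_mib(addr):
    return addr // (1 << 20)

print(addr_to_mib(0xb4000000))  # lowest BAR of the first GPU  -> 2880
print(addr_to_mib(0xa8000000))  # lowest BAR of the other GPU  -> 2688
```

So the lowest-mapped BAR across both GPUs (0xa8000000) gives the 2688MB ceiling that was found to work.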
Ian Campbell
2013-Jul-26 00:21 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
> Now, if I am understanding the basic nature of the problem correctly,
> this _could_ be worked around by ensuring that vBAR = pBAR since in
> that case there is no room for the mis-mapped memory overwrites to
> occur. Is that correct?

AIUI (which is not very well...) it's not so much vBAR=pBAR but making
the guest e820 (memory map) have the same MMIO holes as the host, so
that there can't be any clash between v- or p-BAR and RAM in the guest.

> I guess I could test this easily enough by applying the vBAR = pBAR hack.

Does the e820_host=1 option help? That might be PV only though, I can't
remember...

Ian.
Andrew Bobulsky
2013-Jul-26 01:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>> Now, if I am understanding the basic nature of the problem correctly,
>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>> that case there is no room for the mis-mapped memory overwrites to
>> occur. Is that correct?
>
> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
> the guest e820 (memory map) have the same MMIO holes as the host, so
> that there can't be any clash between v- or p-BAR and RAM in the guest.
>
>> I guess I could test this easily enough by applying the vBAR = pBAR hack.
>
> Does the e820_host=1 option help? That might be PV only though, I can't
> remember...

Alas, yes. The man pages list it under "PV Guest Specific Options":
http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html

You got my hopes up! ;)

Carry on! I'll be sitting here metaphorically munching popcorn with
anticipation :P

-Andrew
Gordan Bobic
2013-Jul-26 09:23 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 26 Jul 2013 01:21:24 +0100, Ian Campbell
<ian.campbell@citrix.com> wrote:
> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>> Now, if I am understanding the basic nature of the problem correctly,
>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>> that case there is no room for the mis-mapped memory overwrites to
>> occur. Is that correct?
>
> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
> the guest e820 (memory map) have the same MMIO holes as the host, so
> that there can't be any clash between v- or p-BAR and RAM in the guest.

Sure, I understand that - but unless I am overlooking something,
vBAR=pBAR implicitly ensures that.

The question, then, is what happens in the null translation instance.
Specifically, if the PCIe bridge/router is broken (and NF200 is, it
seems), it would imply that when the driver talks to the device, the
operation will get sent to the vBAR (=pBAR, i.e. straight to the
hardware). This then gets translated to the pBAR. But - with a broken
bridge, and vBAR=pBAR, the MMIO request hits the pBAR directly from the
guest. Does it then still get intercepted by the hypervisor, translated
(null operation), and re-transmitted? If so, this would lead to the card
receiving everything twice, resulting either in things outright breaking
or going half as fast at best.

Now, all this could be a good thing or a bad thing, depending on how
exactly you spin it. If the bridge is broken and doesn't route all the
way back to the root bridge, this could actually be a performance
optimizing feature. If we set vBAR=pBAR and disable any translation
thereafter, this avoids the overhead of passing everything to/from the
root PCIe bridge, and we can just directly DMA everything. I'm sure
there are security implications here, but since NF200 doesn't do PCIe
ACS either, any concept of security goes out the window pre-emptively.
So, my question is:

1) If vBAR = pBAR, does the hypervisor still do any translation? I
presume it does, because it expects the traffic to pass up from the
root bridge, to the hypervisor and then back, to ensure security. If
indeed it does do this, where could I optionally disable it, and is
there an easy-to-follow bit of example code for how to plumb in a boot
parameter option for this?

2) Further, I'm finding myself motivated to write that auto-set (as
opposed to hard-coded) vBAR=pBAR patch discussed briefly a week or so
ago (have an init script read the BAR info from dom0 and put it in
xenstore, plus a patch to make the pBAR=vBAR reservations built
dynamically rather than statically, based on this data). Now, I'm quite
fluent in C, but my familiarity with the Xen source code is nearly
non-existent (limited to studying an old unsupported patch every now
and then in order to make it apply to a more recent code release). Can
anyone help me out with a high-level view WRT where this would be best
plumbed in (which files, and the flow of control between the affected
files)?

The added bonus of this (if it can be made to work) is that it might
just make unmodified GeForce cards work, too, which probably makes it
worthwhile on its own.

>> I guess I could test this easily enough by applying the vBAR = pBAR
>> hack.
>
> Does the e820_host=1 option help? That might be PV only though, I
> can't remember...

Thanks for pointing this one out, I just found this post in the
archives:
http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html

With a broken PCIe router, would I also need iommu=soft?

Gordan
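[Editorial note: for the init-script half of idea 2), dom0's kernel already exposes each device's regions in /sys/bus/pci/devices/<BDF>/resource as lines of "start end flags" in hex, which is easier to parse than lspci output. The sketch below only shows that parsing step; the xenstore key path in the comment is invented for illustration, not an existing interface.]

```python
# Sketch of the "init script reads BAR info from dom0" half of the idea
# discussed above. Linux exposes each PCI device's regions in
# /sys/bus/pci/devices/<BDF>/resource as lines of "start end flags" in
# hex; unused slots are all zeros. The xenstore side is only hinted at in
# a comment, since the key layout would be part of the proposed patch.

def parse_resource(text):
    """Parse the sysfs 'resource' file format into (start, end, flags)
    tuples, skipping empty slots."""
    bars = []
    for line in text.splitlines():
        start, end, flags = (int(field, 16) for field in line.split())
        if start or end:
            bars.append((start, end, flags))
    return bars

# Example input in the sysfs format, using the GF100 BARs quoted earlier
# in the thread (the flags values here are made up for illustration).
sample = """\
0x00000000f8000000 0x00000000f9ffffff 0x0000000000040200
0x00000000b8000000 0x00000000bfffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000b4000000 0x00000000b7ffffff 0x000000000014220c
"""

for start, end, flags in parse_resource(sample):
    # A real init script would then do something along the lines of:
    #   xenstore-write /local/domain/0/pci-bars/<BDF>/<n> "<start>,<end>"
    # (hypothetical key path; the toolstack side would define the layout)
    print(f"{start:#x}-{end:#x} ({(end - start + 1) >> 20} MiB)")
```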
Gordan Bobic
2013-Jul-26 09:28 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, 25 Jul 2013 21:15:10 -0400, Andrew Bobulsky
<rulerof@gmail.com> wrote:
> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
> <ian.campbell@citrix.com> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>
>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>> hack.
>>
>> Does the e820_host=1 option help? That might be PV only though, I
>> can't remember...
>
> Alas, yes. The man pages list it under "PV Guest Specific Options":
> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html

Now that is interesting - if this makes the memory holes the same
between the guest and the host, does it also implicitly make vBAR=pBAR?

Gordan
Gordan Bobic
2013-Jul-26 13:11 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 26 Jul 2013 10:28:12 +0100, Gordan Bobic <gordan@bobich.net> wrote:
> On Thu, 25 Jul 2013 21:15:10 -0400, Andrew Bobulsky
> <rulerof@gmail.com> wrote:
>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>> <ian.campbell@citrix.com> wrote:
>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>> that case there is no room for the mis-mapped memory overwrites to
>>>> occur. Is that correct?
>>>
>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>
>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>> hack.
>>>
>>> Does the e820_host=1 option help? That might be PV only though, I
>>> can't remember...
>>
>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>
> Now that is interesting - if this makes the memory holes the same
> between the guest and the host, does it also implicitly make vBAR=pBAR?

Another thing occurred to me that might be useful to check - it is
pretty easy to modify the BAR size on Nvidia cards. The defaults are
64MB and 128MB for the two BARs. They can be made much, much larger, and
there is often an advantage to enlarging them to at least be equal to
the VRAM size. Soooooo... if I boost the BAR from 128MB to 2GB, it being
a 64-bit BAR, that might make the BIOS do the sane thing and map it
above 4GB. With the other BAR also suitably enlarged, and the same done
on the second GPU as well, there is no obvious option but to map them
above 4GB (unless the BIOS is broken, which it may well be, in which
case all bets are off).

Which may just alleviate the memory issue, if not completely fix the
problem.

Will try this and see what happens.

Gordan
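[Editorial note: the reasoning above holds up arithmetically - the low MMIO hole in the guest map quoted earlier (0xe0000000-0xfc000000) is far too small for an enlarged BAR, so a sane BIOS has nowhere to place a 2GB 64-bit BAR except above 4GB. A quick check, plain arithmetic and no Xen code:]

```python
# Why enlarging the BARs should force a >4GB mapping: the low MMIO hole
# in the guest map quoted earlier runs from 0xe0000000 to 0xfc000000.
# A 2GB 64-bit BAR cannot possibly fit inside it.

hole_start, hole_end = 0xe0000000, 0xfc000000
hole_mib = (hole_end - hole_start) >> 20
print(hole_mib)  # only 448 MiB of low MMIO space

bar_size = 2 << 30  # a 2GB BAR
fits_low = bar_size <= (hole_end - hole_start)
print(fits_low)  # False: the only place left is above 4GB
```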
Konrad Rzeszutek Wilk
2013-Jul-28 10:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Andrew Bobulsky <rulerof@gmail.com> wrote:
> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com>
> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>
>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>> hack.
>>
>> Does the e820_host=1 option help? That might be PV only though, I
>> can't remember...
>
> Alas, yes. The man pages list it under "PV Guest Specific Options":
> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>
> You got my hopes up! ;)
>
> Carry on! I'll be sitting here metaphorically munching popcorn with
> anticipation :P
>
> -Andrew

We could implement that for HVM guests too. But I am not sure about the
consequences of this for migration (say you unplug the device beforehand
and then migrate to another host which has a different E820). That part
requires a bit of pondering.
Gordan Bobic
2013-Jul-28 21:24 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
> Andrew Bobulsky <rulerof@gmail.com> wrote:
>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com>
>> wrote:
>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>> that case there is no room for the mis-mapped memory overwrites to
>>>> occur. Is that correct?
>>>
>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>
>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>> hack.
>>>
>>> Does the e820_host=1 option help? That might be PV only though, I
>>> can't remember...
>>
>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>
>> You got my hopes up! ;)
>>
>> Carry on! I'll be sitting here metaphorically munching popcorn with
>> anticipation :P
>
> We could implement that for HVM guests too. But I am not sure about the
> consequences of this for migration (say you unplug the device beforehand
> and then migrate to another host which has a different E820). That part
> requires a bit of pondering.

Just out of interest, what happens in the case where PV guests get
migrated with e820_host=1 set?

Gordan
Konrad Rzeszutek Wilk
2013-Jul-28 23:17 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Gordan Bobic <gordan@bobich.net> wrote:
> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>> <ian.campbell@citrix.com> wrote:
>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>> occur. Is that correct?
>>>>
>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>
>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>> hack.
>>>>
>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>> can't remember...
>>>
>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>
>>> You got my hopes up! ;)
>>>
>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>> anticipation :P
>>
>> We could implement that for HVM guests too. But I am not sure about
>> the consequences of this for migration (say you unplug the device
>> beforehand and then migrate to another host which has a different
>> E820). That part requires a bit of pondering.
>
> Just out of interest, what happens in the case where PV guests get
> migrated with e820_host=1 set?
>
> Gordan

We disallow it (I think?) as there is no way we can guarantee the E820
map. I guess your point is that since we disallow this on PV with this
parameter, there is not much difference in allowing an HVM guest with
this.
Gordan Bobic
2013-Jul-28 23:30 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/29/2013 12:17 AM, Konrad Rzeszutek Wilk wrote:
> Gordan Bobic <gordan@bobich.net> wrote:
>> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>>> <ian.campbell@citrix.com> wrote:
>>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>>> occur. Is that correct?
>>>>>
>>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>>
>>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>>> hack.
>>>>>
>>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>>> can't remember...
>>>>
>>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>>
>>>> You got my hopes up! ;)
>>>>
>>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>>> anticipation :P
>>>
>>> We could implement that for HVM guests too. But I am not sure about
>>> the consequences of this for migration (say you unplug the device
>>> beforehand and then migrate to another host which has a different
>>> E820). That part requires a bit of pondering.
>>
>> Just out of interest, what happens in the case where PV guests get
>> migrated with e820_host=1 set?
>>
>> Gordan
>
> We disallow it (I think?) as there is no way we can guarantee the
> E820 map. I guess your point is that since we disallow this on
> PV with this parameter there is not much difference in allowing
> HVM guest with this.

That is indeed where I was pondering going with this, yes - apply the
same restriction in the HVM case that exists in the PV case.

Regarding the e820_host=1 case, which of the following is true:

1) The dom0 BAR areas are simply reserved/holes, and the domU still maps
   its own BARs elsewhere in the memory space?
2) domU is free to map BARs into any of the host E820 map holes of
   appropriate size?
3) vBAR=pBAR?
4) Other?

Thanks.

Gordan
Ian Campbell
2013-Jul-29 09:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Sun, 2013-07-28 at 19:17 -0400, Konrad Rzeszutek Wilk wrote:
> Gordan Bobic <gordan@bobich.net> wrote:
>> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>>> <ian.campbell@citrix.com> wrote:
>>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>>> occur. Is that correct?
>>>>>
>>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>>
>>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>>> hack.
>>>>>
>>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>>> can't remember...
>>>>
>>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>>
>>>> You got my hopes up! ;)
>>>>
>>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>>> anticipation :P
>>>
>>> We could implement that for HVM guests too. But I am not sure about
>>> the consequences of this for migration (say you unplug the device
>>> beforehand and then migrate to another host which has a different
>>> E820). That part requires a bit of pondering.
>>
>> Just out of interest, what happens in the case where PV guests get
>> migrated with e820_host=1 set?
>>
>> Gordan
>
> We disallow (I think?) as there is no way we can guarantee the E820
> map. I guess your point is that since we disallow this on PV with
> this parameter there is not much difference in allowing HVM guest with
> this.

Yes, I don't think it is unreasonable to disallow migration when
hardware-specific workarounds have been applied (which is really what
e820_host is, for either PV or HVM).

Ian.
Ian Campbell
2013-Jul-29 11:14 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 2013-07-26 at 10:23 +0100, Gordan Bobic wrote:
> On Fri, 26 Jul 2013 01:21:24 +0100, Ian Campbell
> <ian.campbell@citrix.com> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>
> Sure, I understand that - but unless I am overlooking something,
> vBAR=pBAR implicitly ensures that.

Not quite, because you need to ensure that guest RAM and guest MMIO
space do not overlap. So setting vBAR=pBAR is not sufficient; you also
need to ensure that there is no RAM at those addresses.

Depending on your PCI bus topology/hardware functionality, it may be
sufficient to only ensure the memory map is the same as the host, so
long as the vBARs all fall within the MMIO regions. On other systems you
may require vBAR=pBAR in addition to that. Obviously doing both is most
likely to work.

> The question, then, is what happens in the null translation instance.
> Specifically, if the PCIe bridge/router is broken (and NF200 is, it
> seems), it would imply that when the driver talks to the device, the
> operation will get sent to the vBAR (=pBAR, i.e. straight to the
> hardware). This then gets translated to the pBAR. But - with a
> broken bridge, and vBAR=pBAR, the MMIO request hits the pBAR
> directly from the guest. Does it then still get intercepted by
> the hypervisor, translated (null operation), and re-transmitted?
> If so, this would lead to the card receiving everything twice,
> resulting either in things outright breaking or going half as
> fast at best.

AIUI the issue is not so much with a device seeing an IO access twice,
but with two devices seeing the same IO access (one sees it translated,
the other untranslated) and thinking it is for them, and who "wins" when
such shadowing occurs, which will differ depending on which device (or
the host CPU) is doing the IO.

It is not the hypervisor which is intercepting and translating, but the
hardware. A single bit of hardware should never see things twice.

Perhaps a diagram (intended to be more illustrative than "real"):

            CPU
             |
        MMU & IOMMU
             |
             |                  RAM
BUS 1:       `---+---------------'
                 |
               BRIDGE
                 |
BUS 2:           `--- BUS 2 -------------
                        |           |
                     DEVICE A    DEVICE B

vBAR->pBAR translation happens at the IOMMU. So if the CPU accesses a
RAM address, it will be translated by the MMU and go to the correct
address in RAM.

Lets assume that the bridge knows that accesses it forwards on need to
be translated. So if DEVICE A tries to access RAM then the BRIDGE will
translate things (by talking to the IOMMU) and the access will again go
to the right place. Likewise if the CPU tries to talk to DEVICE A then
the MMIO accesses will be translated and go to the right place.

However, lets imagine DEVICE B happens to have a pBAR which is the same
as the memory which DEVICE A is trying to access. Lets also assume that
the BRIDGE has a bug which would allow DEVICE B to see DEVICE A's
accesses directly instead of laundering them via the IOMMU (perhaps it
is really a shared bus like I've drawn it rather than a PCI-e thing with
lanes etc).

So now DEVICE A's memory access could be seen and acted on by both the
RAM (translated, probably) and DEVICE B. Weirdness will ensue; perhaps
the DMA read done via DEVICE A gets serviced by DEVICE B and not RAM, or
maybe the DMA write causes a side effect in DEVICE B. Furthermore, the
"winner" might even be different for an access from DEVICE A vs an
access from the CPU etc.

This is something vaguely like the real bug, but only vaguely, because
my understanding of the real bug is a bit vague. I hope it is
illustrative of the sort of issue we are talking about.

> Now, all this could be a good thing or a bad thing, depending on
> how exactly you spin it. If the bridge is broken and doesn't
> route all the way back to the root bridge, this could actually be
> a performance optimizing feature. If we set vBAR=pBAR and disable
> any translation thereafter, this avoids the overhead of passing
> everything to/from the root PCIe bridge, and we can just directly
> DMA everything.

I'm not sure how much perf overhead there is in practice since ISTR that
the translations can be cached in the bridge and need explicit flushing
etc when they are modified. Obviously there will be some overhead but I
don't think it will be anything like doubling the traffic.

> I'm sure there are security implications here, but since NF200
> doesn't do PCIe ACS either, any concept of security goes out
> the window pre-emptively.
>
> So, my question is:
> 1) If vBAR = pBAR, does the hypervisor still do any translation?

I would assume so.

> I presume it does because it expects the traffic to pass up
> from the root bridge, to the hypervisor and then back, to
> ensure security.

NB: Not to the hypervisor (software) but to some bit of hardware which
interprets a table provided by the hypervisor.

> If indeed it does do this, where could I
> optionally disable it, and is there an easy to follow bit of
> example code for how to plumb in a boot parameter option for
> this?

I'm afraid I've no clue... Perhaps if you started from the hypercall
which the toolstacks use to plumb stuff through you would be able to
trace it down? XEN_DOMCTL_memory_mapping perhaps?
(I'm wary of saying too much because there is every chance I am sending
you on some wild goose chase)

> 2) Further, I'm finding myself motivated to write that
> auto-set (as opposed to hard coded) vBAR=pBAR patch discussed
> briefly a week or so ago (have an init script read the BAR
> info from dom0 and put it in xenstore, plus a patch to
> make pBAR=vBAR reservations built dynamically rather than
> statically, based on this data. Now, I'm quite fluent in C,
> but my familiarity with Xen source code is nearly non-existent
> (limited to studying an old unsupported patch every now and then
> in order to make it apply to a more recent code release).
> Can anyone help me out with a high level view WRT where
> this would be best plumbed in (which files and the flow of
> control between the affected files)?

I'm not sure, but the places I would start are the bits of libxc which
call things like XEN_DOMCTL_memory_mapping and the bits of libxl which
call into them.

It would also be worth looking at the PCI setup code in hvmloader
(tools/firmware/hvmloader/); I have a feeling that is where the code
responsible for PCI BAR allocation/layout within the guest's memory map
lives.

Perhaps you might want to implement a mode where libxl/libxc end up
writing the desired vBAR (==pBAR, in your case) values into xenstore for
hvmloader to pick up and implement. Not being a maintainer for that area
I'm not sure if that would be acceptable or not.

> The added bonus of this (if it can be made to work) is that
> it might just make unmodified GeForce cards work, too,
> which probably makes it worthwhile on its own.
>
> >> I guess I could test this easily enough by applying the vBAR = pBAR
> >> hack.
> >
> > Does the e820_host=1 option help? That might be PV only though, I
> > can't remember...
>
> Thanks for pointing this one out, I just found this post in the
> archives:
> http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html
>
> With a broken PCIe router, would I also need iommu=soft?

I'm not sure that isn't also a PV only thing. Sorry :-/

> Gordan
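[Editorial note: Ian's bus diagram can be restated as a toy model. The code below is purely illustrative, like the diagram itself - it routes a single DMA from DEVICE A and shows how a bridge that skips the IOMMU lets a peer device's pBAR "shadow" the access. The addresses reuse the GF100 BAR values from earlier in the thread; the translation table entry is made up.]

```python
# Toy model of the bridge bug sketched in the diagram above (illustrative
# only). A correct bridge launders every upstream access through the
# IOMMU; the buggy one lets a peer device claim an untranslated address
# that happens to match its physical BAR.

IOMMU = {0xb8000000: 0x40000000}  # guest address -> host RAM address (made up)

DEVICE_B_BAR = (0xb8000000, 0xb8000000 + (128 << 20))  # peer device's pBAR

def access(addr, bridge_translates):
    """Route a DMA issued by DEVICE A: return which component services it."""
    if bridge_translates:
        addr = IOMMU.get(addr, addr)          # correct bridge: translate first
    elif DEVICE_B_BAR[0] <= addr < DEVICE_B_BAR[1]:
        return "DEVICE B"                     # bug: peer claims the raw access
    return "RAM"

print(access(0xb8000000, bridge_translates=True))   # RAM (translated)
print(access(0xb8000000, bridge_translates=False))  # DEVICE B (shadowed)
```

The same model also shows why capping guest RAM below the lowest pBAR works: if no guest RAM address ever coincides with a peer pBAR, the buggy untranslated path has nothing to shadow.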
Konrad Rzeszutek Wilk
2013-Jul-29 18:04 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
> So, my question is: > 1) If vBAR = pBAR, does the hypervisor still do any translation? > I presume it does because it expects the traffic to pass up > from the root bridge, to the hypervisor and then back, to > ensure security. If indeed it does do this, where could I > optionally disable it, and is there an easy to follow bit of > example code for how to plumb in a boot parameter option for > this?It should.> > 2) Further, I''m finding myself motivated to write that > auto-set (as opposed to hard coded) vBAR=pBAR patch discussed > briefly a week or so ago (have an init script read the BAR > info from dom0 and put it in xenstore, plus a patch to > make pBAR=vBAR reservations built dynamically rather than > statically, based on this data. Now, I''m quite fluent in C, > but my familiarity with Xen soruce code is nearly non-existant > (limited to studying an old unsupported patch every now and then > in order to make it apply to a more recent code release). > Can anyone help me out with a high level view WRT where > this would be best plumbed in (which files and the flow of > control between the affected files)?hvmloader probably and the libxl e820 code. What from a high view needs to happen is that: 1). Need to relax the check in libxl for e820_hole to also do it for HVM guests. Said code just iterates over the host E820 and sanitizes it a bit and makes a E820 hypercall to set it for the guest. 2). Figure out whether the E820 hypercall (which sets the E820 layout for a guest) can be run on HVM guests. I think it could not and Mukesh in his PVH patches posted a patch to enable that - "..Move e820 fields out of pv_domain struct" 2). Hvmloader should do an E820 get machine memory hypercall to see if there is anything there. If there is - that means the toolstack has request a "new" type of E820. Iterate over the E820 and make it look like that. You can look in the Linux arch/x86/xen/setup.c to see how it does that. 
The complication there is that hvmloader needs to fit the ACPI
code (the guest-type one) and such. Presumably you can just re-use
the existing spaces that the host has marked as E820_RESERVED or
E820_ACPI. Then the SMBIOS would need to move, and the BIOS might
need to be relocated - but I think those are relocatable in some form.

> The added bonus of this (if it can be made to work) is that
> it might just make unmodified GeForce cards work, too,
> which probably makes it worthwhile on its own.

Well, I am more than happy to help you with this.

> >> I guess I could test this easily enough by applying the vBAR
> >> pBAR hack.
> >
> > Does the e820_host=1 option help? That might be PV only though, I
> > can't remember...
>
> Thanks for pointing this one out, I just found this post in the
> archives:
> http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html
>
> With a broken PCIe router, would I also need iommu=soft?

No. iommu=soft is not needed with recent pvops Linux kernels.
But broken PCIe routers don't have much to do with the kernel -
it is the hypervisor's decision whether to allow a guest (either
PV or HVM) to have said device.

> Gordan
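The first step Konrad describes - iterating over the host E820 and sanitizing it into a guest map - can be sketched roughly as follows. All type and function names here are illustrative stand-ins, not the actual libxl/Xen definitions (the real E820 types live in Xen's public headers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for Xen's E820 types. */
#define E820_RAM      1
#define E820_RESERVED 2
#define E820_ACPI     3
#define E820_MAX      32

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Build a guest E820 that mirrors the host's holes: copy every
 * non-RAM entry verbatim, and re-emit RAM entries clamped to the
 * guest's memory size.  Returns the number of guest entries.
 * This is only a sketch of the "iterate and sanitize" step, not
 * the actual libxl code.
 */
static unsigned mirror_host_holes(const struct e820entry *host, unsigned nr,
                                  uint64_t guest_ram, struct e820entry *out)
{
    unsigned i, n = 0;
    uint64_t ram_left = guest_ram;

    for (i = 0; i < nr && n < E820_MAX; i++) {
        if (host[i].type != E820_RAM) {
            out[n++] = host[i];            /* keep the hole where it is */
        } else if (ram_left) {
            uint64_t sz = host[i].size < ram_left ? host[i].size : ram_left;
            out[n] = host[i];
            out[n].size = sz;              /* shrink RAM to fit the guest */
            ram_left -= sz;
            n++;
        }
    }
    return n;
}
```

The real code additionally has to keep the map sorted and merged, and leave room for the ACPI/SMBIOS regions hvmloader places.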
George Dunlap
2013-Jul-31 17:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> Now that is interesting - if this makes the memory holes the same between
>> the guest and the host, does it also implicitly vBAR=pBAR?
>
> Another thing that occurred to me might be useful to check - it is
> pretty easy to modify the BAR size on Nvidia cards. The defaults are
> 64MB and 128MB for the two BARs. They can be made much, much larger,
> and there is often an advantage to enlarging them to at least be equal
> to VRAM size. Soooooo... If I boost the BAR from 128MB to 2GB, it
> being a 64-bit BAR, it might make the BIOS do the sane thing and map
> it above 4GB. With the other BAR also suitably enlarged, and it being
> done on the second GPU as well, there is no obvious option but to map
> them above 4GB (unless the BIOS is broken, which it may well be, in
> which case all bets are off).
>
> Which may just alleviate the memory issue, if not completely fix
> the problem.
>
> Will try this and see what happens.

I believe XenServer has a patch that allows the toolstack (in this
case xapi) to set the default size of the MMIO hole. Andrew, did that
ever make it upstream?

Unfortunately, it is unlikely to work with upstream qemu until we fix
the memory relocation issue...

 -George
Andrew Cooper
2013-Jul-31 17:56 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 31/07/13 18:53, George Dunlap wrote:
> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
> I believe XenServer has a patch that allows the toolstack (in this
> case xapi) to set the default size of the MMIO hole. Andrew, did that
> ever make it upstream?
>
> Unfortunately, it is unlikely to work with upstream qemu until we fix
> the memory relocation issue...
>
> -George

I believe it did - the patch does not exist in our patch queue any more.

~Andrew
Gordan Bobic
2013-Jul-31 19:35 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/31/2013 06:53 PM, George Dunlap wrote:
> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
> I believe XenServer has a patch that allows the toolstack (in this
> case xapi) to set the default size of the MMIO hole. Andrew, did that
> ever make it upstream?
>
> Unfortunately, it is unlikely to work with upstream qemu until we fix
> the memory relocation issue...

Interesting you should mention something like this. I've been pondering
whether it might be easier (even if it is a bodge) to simply always set
the domU E820 map to have 0x80000000 - 0xFFFFFFFF (2GB->4GB) reserved.
I have not yet seen a motherboard that maps 32-bit BARs below 2GB.

Note: Admittedly, I haven't tested what happens when you have multiple
Nvidia cards, each with a 1GB 32-bit BAR; I fully expect weirdness. And
Nvidia cards can have the 32-bit BAR0 up to 2GB in size! But I cannot
see a good reason to use such a configuration, since it's the 64-bit
BAR1 (up to 64GB in size) that provides the direct VRAM mapping.

Anyway, if the whole 2GB->4GB area was reserved, then presumably Xen
would map the 32-bit BARs below 2GB, which, provided there's enough
memory for the OS kernel to load and the BARs, shouldn't be a problem
(I cannot think of a sane case where this wouldn't hold). 64-bit BARs
can get re-mapped somewhere sky-high in the domU address space; a
non-broken BIOS implementation (of which there seem to be fewer than
I'd like to believe) would probably set those just above the size of
RAM in the machine, so 2^48 minus the BAR size would possibly be a safe
place to map them.

Yes, I know it's a bodge. Yes, I know it wouldn't solve the GeForce
passthrough problem. Yes, host E820 with vBAR = pBAR (possibly without
IOMMU involvement) would be an awesome feature to have. But the bodge
of just punching a 2GB hole at 2GB might just be a lot easier to
implement as a quick fix.

Gordan
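The "bodge" described above - always reserving the 2GB->4GB range and relocating any displaced RAM above 4GB - would produce an E820 along these lines. The types and the function are illustrative only, not actual Xen code:

```c
#include <assert.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2

/* Illustrative stand-in for Xen's E820 entry type. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

#define HOLE_START 0x80000000ULL   /* 2 GiB */
#define HOLE_END   0x100000000ULL  /* 4 GiB */

/*
 * Given the guest's RAM size, emit an E820 with the whole 2GB->4GB
 * range reserved, relocating any RAM that would have landed in the
 * hole to above 4GB.  Returns the number of entries written.
 */
static unsigned punch_2g_hole(uint64_t ram_bytes, struct e820entry *out)
{
    unsigned n = 0;
    uint64_t low = ram_bytes < HOLE_START ? ram_bytes : HOLE_START;

    out[n++] = (struct e820entry){ 0, low, E820_RAM };
    out[n++] = (struct e820entry){ HOLE_START, HOLE_END - HOLE_START,
                                   E820_RESERVED };
    if (ram_bytes > low)   /* remainder goes above 4 GiB */
        out[n++] = (struct e820entry){ HOLE_END, ram_bytes - low, E820_RAM };
    return n;
}
```

A guest given 3GB of RAM would end up with 2GB below the hole and 1GB starting at the 4GB boundary - which is exactly why non-PAE 32-bit guests (as George notes later in the thread) could not see the relocated memory.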
Gordan Bobic
2013-Jul-31 19:36 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/31/2013 06:56 PM, Andrew Cooper wrote:
> On 31/07/13 18:53, George Dunlap wrote:
>> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
>> I believe XenServer has a patch that allows the toolstack (in this
>> case xapi) to set the default size of the MMIO hole. Andrew, did that
>> ever make it upstream?
>>
>> Unfortunately, it is unlikely to work with upstream qemu until we fix
>> the memory relocation issue...
>
> I believe it did - the patch does not exist in our patch queue any more.

Can anyone point me at the relevant commit / docs on this patch?

Gordan
George Dunlap
2013-Aug-01 09:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 31/07/13 20:35, Gordan Bobic wrote:
> On 07/31/2013 06:53 PM, George Dunlap wrote:
[snip]
> Interesting you should mention something like this. I've been
> pondering whether it might be easier (even if it is a bodge) to simply
> always set the domU E820 map to have 0x80000000 - 0xFFFFFFFF
> (2GB->4GB) reserved. I have not yet seen a motherboard that maps
> 32-bit BARs below 2GB.

I'm pretty sure we've seen a memory hole larger than 2GiB, in a box
loaded up with a boatload of GPUs.

The main problem with doing this unconditionally is that the relocated
memory isn't available to non-PAE 32-bit guests.

I think we should have a work-around in place for 4.4 that will avoid a
collision between the host MMIO and guest memory addresses; but it will
need to be off by default, at least for guests that don't have a
passed-through device.

 -George
Fabio Fantoni
2013-Aug-01 13:10 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 01/08/2013 11:15, George Dunlap wrote:
> On 31/07/13 20:35, Gordan Bobic wrote:
[snip]
> I'm pretty sure we've seen a memory hole larger than 2GiB, in a box
> loaded up with a boatload of GPUs.
>
> The main problem with doing this unconditionally is that the relocated
> memory isn't available to non-PAE 32-bit guests. I think we should
> have a work-around in place for 4.4 that will avoid a collision
> between the host MMIO and guest memory addresses; but it will need to
> be off by default, at least for guests that don't have a
> passed-through device.
>
> -George

I see this recent patch on qemu:
http://git.qemu.org/?p=qemu.git;a=commit;h=398489018183d613306ab022653552247d93919f
Is it related, and can it solve the problem, or am I wrong?
George Dunlap
2013-Aug-02 14:43 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, Aug 1, 2013 at 2:10 PM, Fabio Fantoni <fabio.fantoni@m2r.biz> wrote:
> On 01/08/2013 11:15, George Dunlap wrote:
>> On 31/07/13 20:35, Gordan Bobic wrote:
[snip]
> I see this recent patch on qemu:
> http://git.qemu.org/?p=qemu.git;a=commit;h=398489018183d613306ab022653552247d93919f
> Is it related, and can it solve the problem, or am I wrong?

It doesn't look like it to me, but thanks for looking.

 -George
Gordan Bobic
2013-Sep-03 13:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Mon, 29 Jul 2013 14:04:31 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:

Hi Konrad,

Apologies it took me a month to get back to this.

>> Can anyone help me out with a high-level view WRT where
>> this would be best plumbed in (which files and the flow of
>> control between the affected files)?
>
> hvmloader probably, and the libxl e820 code. What from a
> high view needs to happen is:
> 1). Relax the check in libxl for e820_hole so that it is also
>     done for HVM guests. Said code just iterates over the
>     host E820, sanitizes it a bit, and makes an E820 hypercall to
>     set it for the guest.

I'm looking at the libxl code at the moment.

In cases where e820_host is seen as PV-specific, would the correct
thing to do be to move it out of the PV/HVM-specific blocks so that it
applies to both?

In libxl/libxl_x86.c:libxl__e820_alloc I have thus far changed the
code to remove the PV check, and, having moved the e820_host option to
be common to both VM types, I changed the e820-related instances from

    b_info->u.pv.e820_host

to

    b_info->e820_host

Is this the correct/preferred way this should be handled? Or would it
be better to have e820_host in both the PV and HVM options, and refer
to it as such (u.pv.e820_host / u.hvm.e820_host)?

The e820 sanitizer is called with the b_info->u.pv.slack_memkb
parameter. What does that parameter actually mean? I googled it and
couldn't find any documentation specific to it, and it doesn't appear
to be documented as settable in the config file. What would the
equivalent be in the case of HVM?

> 2). Figure out whether the E820 hypercall (which sets the E820
>     layout for a guest) can be run on HVM guests. I think it
>     could not, and Mukesh in his PVH patches posted a patch
>     to enable that - "..Move e820 fields out of pv_domain struct".
> 3). hvmloader should do an E820 get machine memory hypercall
>     to see if there is anything there. If there is - that means
>     the toolstack has requested a "new" type of E820. Iterate
>     over the E820 and make it look like that.
>     You can look in the Linux arch/x86/xen/setup.c to see how
>     it does that.
>
> The complication there is that hvmloader needs to fit the
> ACPI code (the guest-type one) and such.
> Presumably you can just re-use the existing spaces that
> the host has marked as E820_RESERVED or E820_ACPI..

Yup, I get it. Not only that, but it should also ideally (not strictly
necessary, but it'd be handy) map the IOMEM for the devices it is
passed so that pBAR=vBAR (as opposed to just leaving all the host e820
reserved areas well alone - which would work for most things).

> Then there is the SMBIOS would need to move and the BIOS
> might need to be relocated - but I think those are relocatable
> in some form.

OK, I'll look at that once I have a workable patch for the libxl part.

>> The added bonus of this (if it can be made to work) is that
>> it might just make unmodified GeForce cards work, too,
>> which probably makes it worthwhile on its own.
>
> Well, I am more than happy to help you with this.

Thanks, much appreciated. :)

Gordan
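The refactor being discussed - hoisting e820_host out of the PV-only union so the same check works for both guest types - can be mocked up like this. The struct here is a hypothetical stand-in for libxl_domain_build_info, not the real type (which is generated from libxl_types.idl):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of libxl's build info, for illustration only. */
typedef enum { GUEST_PV, GUEST_HVM } guest_type;

struct build_info {
    guest_type type;
    bool e820_host;     /* hoisted out of the u.pv union, as discussed */
};

/*
 * With e820_host made common, the "is this a PV guest?" guard goes
 * away, and the decision to mirror the host E820 depends only on the
 * option itself - for PV and HVM alike.
 */
static bool want_host_e820(const struct build_info *b_info)
{
    return b_info->e820_host;   /* no b_info->type check any more */
}
```

This mirrors the first alternative Gordan proposes (one common field) rather than the second (duplicating it in u.pv and u.hvm).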
Konrad Rzeszutek Wilk
2013-Sep-03 14:59 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Tue, Sep 03, 2013 at 02:53:06PM +0100, Gordan Bobic wrote:
> On Mon, 29 Jul 2013 14:04:31 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
> Hi Konrad,
>
> Apologies it took me a month to get back to this.

Hey Gordan,

That is OK. Time flies fast!

> I'm looking at the libxl code at the moment.
>
> In cases where e820_host is seen as PV-specific, would the
> correct thing to do be to move it out of the PV/HVM-specific
> blocks so it applies to both?

Yes.

> In libxl/libxl_x86.c:libxl__e820_alloc
>
> I have thus far changed the code to remove the PV check,
> and, having moved the e820_host option to be common to both VM
> types, I changed the e820-related instances from
>     b_info->u.pv.e820_host
> to
>     b_info->e820_host
>
> Is this the correct/preferred way this should be handled?

Yes.

> Or would it be better to make e820_host be in both PV and
> HVM options, and refer to it as such
> (u.pv.e820_host / u.hvm.e820_host)?

No. Let's make it work across the board.

> The e820 sanitizer is called with the b_info->u.pv.slack_memkb
> parameter. What does that parameter actually mean? I googled
> it and couldn't find any documentation specific to it, and
> it doesn't appear to be documented as settable in the config
> file. What would the equivalent be in the case of HVM?

0. If my memory serves me right, it is some amount of memory that a PV
guest has that it does not use normally. It is used by the frontend and
backend drivers to communicate. Kind of like a shadow memory. Only
ancient kernels use it, but those still have to be supported.

> Yup, I get it. Not only that, but it should also ideally (not
> strictly necessary, but it'd be handy) map the IOMEM for the devices
> it is passed so that pBAR=vBAR (as opposed to just leaving all
> the host e820 reserved areas well alone - which would work for
> most things).

Yes. That is an extra complication that could be done in subsequent
patches. But in theory, if you have the E820 mirrored from the host,
pBAR=vBAR should be easy enough, as the values from the host BARs can
easily fit in the E820 gaps.

> > Then there is the SMBIOS would need to move and the BIOS
> > might need to be relocated - but I think those are relocatable
> > in some form.
>
> OK, I'll look at that once I have a workable patch for the libxl
> part.

Aye.

> >> The added bonus of this (if it can be made to work) is that
> >> it might just make unmodified GeForce cards work, too,
> >> which probably makes it worthwhile on its own.
> >
> > Well, I am more than happy to help you with this.
>
> Thanks, much appreciated. :)

Yeeey! Vict^H^H^H^volunteer :-)! <maniacal laughter in the background>

I am also reachable on IRC (FreeNode mostly) as either darnok or konrad
if that would be more convenient to discuss this.

> Gordan
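Konrad's point that the host BAR values should "easily fit in the E820 gaps" of a mirrored map amounts to a simple containment check, sketched here with illustrative types (not Xen's actual definitions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

#define E820_RAM      1
#define E820_RESERVED 2

/* Illustrative stand-in for Xen's E820 entry type. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * A physical BAR can be placed at its host address (pBAR=vBAR) only
 * if it lies entirely within one of the non-RAM gaps of the guest's
 * (host-mirrored) E820.
 */
static bool bar_fits_in_hole(const struct e820entry *map, unsigned nr,
                             uint64_t bar_addr, uint64_t bar_size)
{
    for (unsigned i = 0; i < nr; i++) {
        if (map[i].type == E820_RAM)
            continue;
        if (bar_addr >= map[i].addr &&
            bar_addr + bar_size <= map[i].addr + map[i].size)
            return true;   /* whole BAR inside this hole */
    }
    return false;
}
```

If the guest E820 is a faithful mirror of the host's, this check holds by construction, since the host firmware already placed every BAR inside a host hole.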
Gordan Bobic
2013-Sep-03 19:47 UTC
HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 03:59 PM, Konrad Rzeszutek Wilk wrote:

>>> hvmloader probably, and the libxl e820 code. What from a
>>> high view needs to happen is:
>>> 1). Relax the check in libxl for e820_hole so that it is also
>>>     done for HVM guests. Said code just iterates over the
>>>     host E820, sanitizes it a bit, and makes an E820 hypercall to
>>>     set it for the guest.

[snip]

OK, I have attached a preliminary patch against 4.3.0 for the libxl
part. It compiles. I haven't tried running it to see if it actually
works or does something, but my packages build.

Please let me know if I've missed anything. On its own, I don't think
this patch will do much (apart from maybe break HVM hosts with
e820_host=1 set).

>>> 2). Figure out whether the E820 hypercall (which sets the E820
>>>     layout for a guest) can be run on HVM guests. I think it
>>>     could not, and Mukesh in his PVH patches posted a patch
>>>     to enable that - "..Move e820 fields out of pv_domain struct".

Is this already in 4.3.0, or is this an out-of-tree patch? Do you have
a link to it handy?

>>> 3). hvmloader should do an E820 get machine memory hypercall
>>>     to see if there is anything there. If there is - that means
>>>     the toolstack has requested a "new" type of E820. Iterate
>>>     over the E820 and make it look like that.
>>>     You can look in the Linux arch/x86/xen/setup.c to see how
>>>     it does that.
>>>
>>> The complication there is that hvmloader needs to fit the
>>> ACPI code (the guest-type one) and such.
>>> Presumably you can just re-use the existing spaces that
>>> the host has marked as E820_RESERVED or E820_ACPI..
>>
>> Yup, I get it. Not only that, but it should also ideally (not
>> strictly necessary, but it'd be handy) map the IOMEM for the devices
>> it is passed so that pBAR=vBAR (as opposed to just leaving all
>> the host e820 reserved areas well alone - which would work for
>> most things).
>
> Yes. That is an extra complication that could be done in subsequent
> patches. But in theory, if you have the E820 mirrored from the host,
> pBAR=vBAR should be easy enough, as the values from the host BARs can
> easily fit in the E820 gaps.

Agreed. Let's leave the pBAR=vBAR part for a separate patch set. I'll
have to figure out a sensible way to query the IOMEM regions for each
of the devices passed to the VM and make sure they are in the same
hole.

>>> Then there is the SMBIOS would need to move and the BIOS
>>> might need to be relocated - but I think those are relocatable
>>> in some form.

[bit above left for later reference]

>>> Well, I am more than happy to help you with this.
>>
>> Thanks, much appreciated. :)
>
> Yeeey! Vict^H^H^H^volunteer :-)! <maniacal laughter in the background>
>
> I am also reachable on IRC (FreeNode mostly) as either darnok or konrad
> if that would be more convenient to discuss this.

Thanks. I'll keep that in mind. :)

Gordan
Gordan Bobic
2013-Sep-03 20:35 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
First attempt at a test run predictably failed. I added e820_host=1 to
a VM config and tried starting it:

[root@normandy ~]# xl create /etc/xen/edi
Parsing config from /etc/xen/edi
libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed while collecting E820 with: -3 (errno:-1)
libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot (re-)build domain: -3
libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not find device-model's pid for dom 1
libxl: error: libxl.c:1415:libxl__destroy_domid: libxl__destroy_device_model failed for 1

xl-edi.log and qemu-dm-edi.log are attached. Both actually look
identical to the previous logs from before the patch.

Is this something that is clearly a consequence of the patch being
incomplete? Or did I break something?

Gordan

On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> On 09/03/2013 03:59 PM, Konrad Rzeszutek Wilk wrote:
[snip]
Gordan Bobic
2013-Sep-03 20:49 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
I spoke too soon - even with e820_host=0, the same error occurs. What did I break? The code in question is this:

if (libxl_defbool_val(d_config->b_info.e820_host)) {
    ret = libxl__e820_alloc(gc, domid, d_config);
    if (ret) {
        LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
                         "Failed while collecting E820 with: %d (errno:%d)\n",
                         ret, errno);
    }
}

With e820_host=0, that outer block should evaluate to false, should it not? In libxl_create.c, if I am understanding the code correctly, e820_host is defaulted to false, too. What am I missing?

Gordan

On 09/03/2013 09:35 PM, Gordan Bobic wrote:
> First attempt at a test run predictably failed. I added e820_host=1 to a
> VM config and tried starting it:
>
> [root@normandy ~]# xl create /etc/xen/edi
> Parsing config from /etc/xen/edi
> libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed while
> collecting E820 with: -3 (errno:-1)
>
> libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot
> (re-)build domain: -3
> libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not
> find device-model's pid for dom 1
> libxl: error: libxl.c:1415:libxl__destroy_domid:
> libxl__destroy_device_model failed for 1
>
> xl-edi.log, qemu-dm-edi.log attached.
> Both actually look identical to previous logs before the patch.
>
> Is this something that is clearly a consequence of the patch being
> incomplete? Or did I break something?
>
> Gordan
>
> On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> [snip]
Konrad Rzeszutek Wilk
2013-Sep-03 21:08 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 09:35:50PM +0100, Gordan Bobic wrote:
> First attempt at a test run predictably failed. I added e820_host=1
> to a VM config and tried starting it:
>
> [root@normandy ~]# xl create /etc/xen/edi
> Parsing config from /etc/xen/edi
> libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed
> while collecting E820 with: -3 (errno:-1)
>
> libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot
> (re-)build domain: -3
> libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not
> find device-model's pid for dom 1
> libxl: error: libxl.c:1415:libxl__destroy_domid:
> libxl__destroy_device_model failed for 1
>
> xl-edi.log, qemu-dm-edi.log attached.
> Both actually look identical to previous logs before the patch.
>
> Is this something that is clearly a consequence of the patch being
> incomplete? Or did I break something?

You are missing the hypervisor patch to set the E820 for HVM guests:

http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html

That should make it possible to "stash" the E820 in the hypervisor. Then, after that, you will need to implement the XENMEM_memory_map hypercall in hvmloader to get the E820 and do something with it.
Oh, and something like this probably should do it - not compile tested in any way:

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 1fcaed0..7b38890 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = do_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
@@ -3216,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)

     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = compat_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 2e05e93..86fb20a 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, unsigned int nr)
     }
 }

+static const char *e820_names(int type)
+{
+    switch (type) {
+    case E820_RAM: return "RAM";
+    case E820_RESERVED: return "Reserved";
+    case E820_ACPI: return "ACPI";
+    case E820_NVS: return "ACPI NVS";
+    case E820_UNUSABLE: return "Unusable";
+    default: break;
+    }
+    return "Unknown";
+}
+
 /* Create an E820 table based on memory parameters provided in hvm_info. */
 int build_e820_table(struct e820entry *e820,
                      unsigned int lowmem_reserved_base,
                      unsigned int bios_image_base)
 {
     unsigned int nr = 0;
+    struct xen_memory_map op;
+    struct e820entry map[E820MAX];
+    int rc;

     if ( !lowmem_reserved_base )
         lowmem_reserved_base = 0xA0000;

+    op.nr_entries = E820MAX;
+    set_xen_guest_handle(op.buffer, map);
+
+    rc = hypercall_memory_op ( XENMEM_memory_map, &op);
+    if ( rc != -ENOSYS) { /* It works!? */
+        int i;
+        for ( i = 0; i < op.nr_entries; i++ )
+            printf(" %lx -> %lx %s\n", map[i].addr >> 12,
+                   (map[i].addr + map[i].size) >> 12, e820_names(map[i].type));
+    }
+
     /* Lowmem must be at least 512K to keep Windows happy) */
     ASSERT ( lowmem_reserved_base > 512<<10 );

>
> Gordan
>
> On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> [snip]
>
> domid: 1
> Using file /dev/zvol/ssd/edi in read-write mode
> Watching /local/domain/0/device-model/1/logdirty/cmd
> Watching /local/domain/0/device-model/1/command
> Watching /local/domain/1/cpu
> char device redirected to /dev/pts/3
> qemu_map_cache_init nr_buckets = 10000 size 4194304
> shared page at pfn feffd
> buffered io page at pfn feffb
> Guest uuid = a57e6840-e9f5-4a14-a822-b2cc662c177f
> populating video RAM at ff000000
> mapping video RAM from ff000000
> Register xen platform.
> Done register platform.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> xs_read(/local/domain/0/device-model/1/xen_extended_power_mgmt): read error
> xs_read(): vncpasswd get error. /vm/a57e6840-e9f5-4a14-a822-b2cc662c177f/vncpasswd.
> Log-dirty: no command yet.
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> vcpu-set: watch node error.
> [xenstore_process_vcpu_set_event]: /local/domain/1/cpu has no CPU!
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> xs_read(/local/domain/1/log-throttling): read error
> qemu: ignoring not-understood drive `/local/domain/1/log-throttling'
> medium change watch on `/local/domain/1/log-throttling' - unknown device, ignored
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.0 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x0
> pt_register_regions: IO region registered (size=0x02000000 base_addr=0xf8000000)
> pt_register_regions: IO region registered (size=0x08000000 base_addr=0xb800000c)
> pt_register_regions: IO region registered (size=0x04000000 base_addr=0xb400000c)
> pt_register_regions: IO region registered (size=0x00000080 base_addr=0x0000cf81)
> pt_register_regions: Expansion ROM registered (size=0x00080000 base_addr=0xfbc00000)
> pci_intx: intx=1
> register_real_device: Real physical device 08:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.1 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x1
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xfbcfc000)
> pci_intx: intx=2
> register_real_device: Real physical device 08:00.1 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 0c:00.0 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0xc:0x0.0x0
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xd7efc000)
> pci_intx: intx=1
> register_real_device: Real physical device 0c:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 00:1a.1 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1a.0x1
> pt_register_regions: IO region registered (size=0x00000020 base_addr=0x00008a01)
> pci_intx: intx=2
> register_real_device: Real physical device 00:1a.1 registered successfuly!
> IRQ type = INTx
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=1
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=1
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=1
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=1
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=1
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=1
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=1
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.
> Unknown PV product 2 loaded in guest
> PV driver build 1
> region type 0 at [ef880000,ef8a0000).
> squash iomem [ef880000, ef8a0000).
> region type 1 at [c180,c1c0).
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:06:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:07:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> pt_pci_write_config: [00:08:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> shutdown requested in cpu_handle_ioreq
> Issued domain 1 poweroff

> Waiting for domain edi (domid 1) to die [pid 8363]
> Domain 1 has shut down, reason code 0 0x0
> Action for shutdown reason code 0 is destroy
> Domain 1 needs to be cleaned up: destroying the domain
> libxl: error: libxl_pci.c:990:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:08:00.0
> libxl: error: libxl_pci.c:990:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:08:00.1
> Done. Exiting now
Konrad Rzeszutek Wilk
2013-Sep-03 21:10 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
> I spoke too soon - even with e820_host=0, the same error occurs.
> What did I break? The code in question is this:
>
> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>     ret = libxl__e820_alloc(gc, domid, d_config);
>     if (ret) {
>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>                          ret, errno);
>     }
> }
>
> With e820_host=0, that outer block should evaluate to false, should
> it not? In libxl_create.c, if I am understanding the code correctly,
> e820_host is defaulted to false, too. What am I missing?

Just sent you an email, but I believe what is failing is:

241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);

You can add some extra LIBXL__LOG_ERRNO calls to check each 'rc' to see which one of them failed.

Hm, perhaps it might make sense to have libxl__e820_alloc itself use LIBXL__LOG_ERRNO to log more details..

> Gordan
>
> On 09/03/2013 09:35 PM, Gordan Bobic wrote:
> [snip]
Gordan Bobic
2013-Sep-03 21:24 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>> I spoke too soon - even with e820_host=0, the same error occurs.
>> What did I break? The code in question is this:
>>
>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>     if (ret) {
>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>>                          ret, errno);
>>     }
>> }
>>
>> With e820_host=0, that outer block should evaluate to false, should
>> it not? In libxl_create.c, if I am understanding the code correctly,
>> e820_host is defaulted to false, too. What am I missing?
>
> Just sent you an email but I believe what is failing is:
>
> 241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);

But with e820_host=0 set in the config, libxl__e820_alloc() should not be getting called in the first place. That function only gets called from line 303, inside that if block I pasted above. That is what is puzzling me.

> You can add some extra LIBXL__LOG_ERRNO to check each 'rc' to see
> which one of them failed.
>
> Hm, perhaps it might make sense to actually have the libxl__e820_alloc
> also use the LIBXL__LOG_ERRNO to log more details..

OK, I'll add some debug and see what I find.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-03 21:30 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote:
> On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>>> I spoke too soon - even with e820_host=0, the same error occurs.
>>> What did I break? The code in question is this:
>>>
>>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>>     if (ret) {
>>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>>>                          ret, errno);
>>>     }
>>> }
>>>
>>> With e820_host=0, that outer block should evaluate to false, should
>>> it not? In libxl_create.c, if I am understanding the code correctly,
>>> e820_host is defaulted to false, too. What am I missing?

Does your config have 'pci' in it? The patch you sent had this:

+    if (d_config->num_pcidevs)
+        libxl_defbool_set(&b_info->e820_host, true);

Which means that even if you did not set e820_host in the config, it will be automatically enabled if you have PCI devices.

>>
>> Just sent you an email but I believe what is failing is:
>>
>> 241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
>
> But with e820_host=0 set in the config, libxl__e820_alloc() should
> not be getting called in the first place. That function only gets
> called from line 303, inside that if block I pasted above. That is
> what is puzzling me.
>
>> You can add some extra LIBXL__LOG_ERRNO to check each 'rc' to see
>> which one of them failed.
>>
>> Hm, perhaps it might make sense to actually have the libxl__e820_alloc
>> also use the LIBXL__LOG_ERRNO to log more details..
>
> OK, I'll add some debug and see what I find.
>
> Gordan
Gordan Bobic
2013-Sep-04 00:18 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote:
>> On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>>>> I spoke too soon - even with e820_host=0, the same error occurs.
>>>> What did I break? The code in question is this:
>>>>
>>>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>>>     if (ret) {
>>>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>>>             "Failed while collecting E820 with: %d (errno:%d)\n",
>>>>             ret, errno);
>>>>     }
>>>> }
>>>>
>>>> With e820_host=0, that outer block should evaluate to false, should
>>>> it not? In libxl_create.c, if I am understanding the code correctly,
>>>> e820_host is defaulted to false, too. What am I missing?
>
> Does your config have 'pci' in it? The patch you sent had this:
>
> +    if (d_config->num_pcidevs)
> +        libxl_defbool_set(&b_info->e820_host, true);
>
> Which means that even if you did not have e820_host it will be automatically
> set if you have PCI devices.

OK - that was embarrassing. Caffeine underflow error. :(
I backed out that block. I don't think e820_host should be implicit
in HVM when PCI devices are passed.

That makes the adjusted patch fragment:

--- xl_cmdimpl.c.orig	2013-09-04 00:42:57.424337503 +0100
+++ xl_cmdimpl.c	2013-09-04 00:43:21.213886356 +0100
@@ -1293,7 +1293,7 @@
             d_config->num_pcidevs++;
         }
         if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
-            libxl_defbool_set(&b_info->u.pv.e820_host, true);
+            libxl_defbool_set(&b_info->e820_host, true);
     }

     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {

This should maintain the old behaviour for backward compatibility
when e820_host is not set. I just tested it and it works (with
e820_host=1 I get the previous error; with e820_host=0, everything
works fine).

I will have a play with the other two patches tomorrow.

Gordan
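[Editor's note: the policy the adjusted fragment aims for can be expressed as a small decision function. This is a hedged sketch of the intended behaviour, not libxl code: PV guests with passed-through PCI devices keep the historical implicit e820_host=1 regardless of the config; HVM guests get it only when explicitly requested.]

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TYPE_PV, TYPE_HVM } domain_type;

/* Effective e820_host policy after the adjusted patch (illustrative):
 * - PV + PCI passthrough: always on (preserves the old PV behaviour);
 * - otherwise: whatever the user explicitly set, defaulting to off. */
static bool want_host_e820(domain_type t, int num_pcidevs,
                           bool explicit_set, bool explicit_val)
{
    if (t == TYPE_PV && num_pcidevs > 0)
        return true;                  /* historical implicit enable */
    return explicit_set ? explicit_val : false;
}
```

Under this policy an HVM guest with pci=... but no e820_host setting behaves exactly as before the patch, which matches the backward-compatibility claim above.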
Gordan Bobic
2013-Sep-04 09:21 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> You are missing the hypervisor patch to set the E820 for HVM guests.
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>
> And that should make it possible to "stash" the E820 in the
> hypervisor.

Regarding Jan's comment on the thread here:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01649.html

Should this, instead of:

==
@@ -595,7 +595,7 @@ void arch_domain_destroy(struct domain *d)
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
     else
-        xfree(d->arch.pv_domain.e820);
+        xfree(d->arch.e820);

     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

be something like:

==
@@ -595,7 +595,6 @@ void arch_domain_destroy(struct domain *d)
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+    xfree(d->arch.e820);

     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

The question I have is: will d->arch.e820 always be there and set even
with e820_host=0? Or does there need to be an extra check here?

> Then after that you will need to implement in the hvmloader.c the
> XENMEM_memory_map hypercall to get the E820 and do something with it.
> > > Oh, and something like this probably should do it - not compile > tested > in any way: > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 1fcaed0..7b38890 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > @@ -3216,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = compat_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > > diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..86fb20a 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, > unsigned int nr) > } > } > > +static const char *e820_names(int type) > +{ > + switch (type) { > + case E820_RAM: return "RAM"; > + case E820_RESERVED: return "Reserved"; > + case E820_ACPI: return "ACPI"; > + case E820_NVS: return "ACPI NVS"; > + case E820_UNUSABLE: return "Unusable"; > + default: break; > + } > + return "Unknown"; > +} > + > + > /* Create an E820 table based on memory parameters provided in > hvm_info. 
*/ > int build_e820_table(struct e820entry *e820, > unsigned int lowmem_reserved_base, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_op, &op); > + if ( rc != -ENOSYS) { /* It works!? */ > + int i; > + for ( i = 0; i < op.nr_entries; i++ ) > + printf(" %lx -> %lx %s\n", map[i].addr >> 12, > + (map[i].addr + map[i].size) >> 12, > e820_names(map[i].type)); > + } > /* Lowmem must be at least 512K to keep Windows happy) */ > ASSERT ( lowmem_reserved_base > 512<<10 );Thanks. :) Will try that when I''ve verified the first two patches (mine and Mukesh''s) build cleanly in my 4.3.0 package build. Gordan
Gordan Bobic
2013-Sep-04 11:01 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> Oh, and something like this probably should do it - not compile > tested > in any way: > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 1fcaed0..7b38890 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1;This seems to work better. :) --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) switch ( cmd & MEMOP_CMD_MASK ) { - case XENMEM_memory_map: case XENMEM_machine_memory_map: case XENMEM_machphys_mapping: return -ENOSYS; + case XENMEM_memory_map: case XENMEM_decrease_reservation: rc = do_memory_op(cmd, arg); current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;> diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..86fb20a 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, > unsigned int nr) > } > } > > +static const char *e820_names(int type) > +{ > + switch (type) { > + case E820_RAM: return "RAM"; > + case E820_RESERVED: return "Reserved"; > + case E820_ACPI: return "ACPI"; > + case E820_NVS: return "ACPI NVS"; > + case E820_UNUSABLE: return "Unusable"; > + default: break; > + } > + return "Unknown"; > +}To make this work I also added: --- tools/firmware/hvmloader/e820.h.orig 2013-09-04 10:55:38.317275183 +0100 +++ tools/firmware/hvmloader/e820.h 2013-09-04 10:56:14.374595809 +0100 @@ -8,6 +8,7 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define 
E820_NVS 4 +#define E820_UNUSABLE 5 struct e820entry { uint64_t addr; Is that OK?> /* Create an E820 table based on memory parameters provided in > hvm_info. */ > int build_e820_table(struct e820entry *e820, > unsigned int lowmem_reserved_base, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_op, &op);Where is XENMEM_memory_op defined? Should that be XENMEM_memory_map? Or maybe XENMEM_populate_physmap? Gordan
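[Editor's note: the E820 type codes used by the e820_names() helper follow the standard BIOS/ACPI address-range convention, in which type 5 is "unusable" - so the new #define should read E820_UNUSABLE with that value. A standalone, compilable version of the helper (illustrative, not the hvmloader source) looks like:]

```c
#include <assert.h>
#include <string.h>

/* Standard E820 address-range type codes (BIOS/ACPI convention),
 * matching the values in tools/firmware/hvmloader/e820.h. */
#define E820_RAM       1
#define E820_RESERVED  2
#define E820_ACPI      3
#define E820_NVS       4
#define E820_UNUSABLE  5

/* Map an E820 type code to a human-readable name for log output. */
static const char *e820_name(int type)
{
    switch (type) {
    case E820_RAM:      return "RAM";
    case E820_RESERVED: return "Reserved";
    case E820_ACPI:     return "ACPI";
    case E820_NVS:      return "ACPI NVS";
    case E820_UNUSABLE: return "Unusable";
    default:            return "Unknown";
    }
}
```

Unknown codes fall through to "Unknown" rather than faulting, which is the safe choice when dumping a table handed over by the hypervisor.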
Gordan Bobic
2013-Sep-04 13:11 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
I have this at the point where it actually builds.
Otherwise completely untested (will do that later today).

Attached are:

1) libxl patch
Modified from the original patch to _not_ implicitly enable
e820_host when PCI devices are passed.

2) Mukesh's hypervisor e820 patch from here:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
Modified slightly to attempt to address Jan's comment on the same
thread, and to adjust the diff line pointers to match against
4.3.0 release code.

3) A patch based on Konrad's earlier in this thread, with
a few additions and changes to make it all compile.

Some peer review would be most welcome - this is my first
venture into Xen code, so please do assume that I have
no idea what I'm doing at the moment. :)

I added yet another E820MAX #define, this time to
tools/firmware/hvmloader/e820.h

If there is a better place to #include that from in e820.c,
please point me in the right direction.

Gordan

On Wed, 04 Sep 2013 12:01:09 +0100, Gordan Bobic <gordan@bobich.net> wrote:
> On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
>> Oh, and something like this probably should do it - not compile
>> tested in any way:
>>
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 1fcaed0..7b38890 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd,
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>     case XENMEM_machine_memory_map:
>>     case XENMEM_machphys_mapping:
>>         return -ENOSYS;
>> +    case XENMEM_memory_map:
>>     case XENMEM_decrease_reservation:
>>         rc = do_memory_op(cmd, arg);
>>         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
>
> This seems to work better.
:) > > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > > >> diff --git a/tools/firmware/hvmloader/e820.c >> b/tools/firmware/hvmloader/e820.c >> index 2e05e93..86fb20a 100644 >> --- a/tools/firmware/hvmloader/e820.c >> +++ b/tools/firmware/hvmloader/e820.c >> @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, >> unsigned int nr) >> } >> } >> >> +static const char *e820_names(int type) >> +{ >> + switch (type) { >> + case E820_RAM: return "RAM"; >> + case E820_RESERVED: return "Reserved"; >> + case E820_ACPI: return "ACPI"; >> + case E820_NVS: return "ACPI NVS"; >> + case E820_UNUSABLE: return "Unusable"; >> + default: break; >> + } >> + return "Unknown"; >> +} > > To make this work I also added: > > --- tools/firmware/hvmloader/e820.h.orig 2013-09-04 > 10:55:38.317275183 +0100 > +++ tools/firmware/hvmloader/e820.h 2013-09-04 10:56:14.374595809 > +0100 > @@ -8,6 +8,7 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSBLE 5 > > struct e820entry { > uint64_t addr; > > It that OK? > >> /* Create an E820 table based on memory parameters provided in >> hvm_info. */ >> int build_e820_table(struct e820entry *e820, >> unsigned int lowmem_reserved_base, >> unsigned int bios_image_base) >> { >> unsigned int nr = 0; >> + struct xen_memory_map op; >> + struct e820entry map[E820MAX]; >> + int rc; >> >> if ( !lowmem_reserved_base ) >> lowmem_reserved_base = 0xA0000; >> >> + set_xen_guest_handle(op.buffer, map); >> + >> + rc = hypercall_memory_op ( XENMEM_memory_op, &op); > > Where is XENMEM_memory_op defined? 
> Should that be XENMEM_memory_map? Or maybe XENMEM_populate_physmap?
>
> Gordan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-04 14:08 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote:> On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: > >On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: > >>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: > >>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: > >>>>I spoke too soon - even with e820_host=0, the same error occurs. > >>>>What did I break? The code in question is this: > >>>> > >>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { > >>>> ret = libxl__e820_alloc(gc, domid, d_config); > >>>> if (ret) { > >>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > >>>> "Failed while collecting E820 with: %d (errno:%d)\n", > >>>> ret, errno); > >>>> } > >>>>} > >>>> > >>>>With e820_host=0, that outer black should evaluate to false, should > >>>>it not? In libxl_create.c, if I am understanding the code correctly, > >>>>e820_host is defaulted to false, too. What am I missing? > > > >Does your config have ''pci'' in it? The patch you sent had this: > > > >+ if (d_config->num_pcidevs) > >+ libxl_defbool_set(&b_info->e820_host, true); > > > >Which means that even if you did not have e820_host it will be automatically > >set if you have PCI devices. > > OK - that was embarrasing. Caffeine underflow error. :( > I backed out that block. I don''t think e820_host should be implicit > in hvm when PCI devices are passed. > > That makes the adjusted patch fragment: > --- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 > +++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 > @@ -1293,7 +1293,7 @@ > d_config->num_pcidevs++; > } > if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)I think you also want to get rid of the c_info->type check?> - libxl_defbool_set(&b_info->u.pv.e820_host, true); > + libxl_defbool_set(&b_info->e820_host, true); > } > > switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { > > > This should maintain the old behaviour for backward compatibility > when e820_host is not set. 
I just tested it and it works (with
> e820_host=1 I get the previous error; with e820_host=0, everything
> works fine).

I think it might make sense to relax the PV check. That way the only
way the e820_host capability gets activated is if the guest config
has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
be done.

Or maybe a negative check - if a 'pci' stanza is there we automatically
turn on e820_host=1 (right now that is how it works). If the user
has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
That way if something is odd we can turn this off?

> I will have a play with the other two patches tomorrow.
>
> Gordan
Gordan Bobic
2013-Sep-04 14:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, 4 Sep 2013 10:08:37 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote: >> On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: >> >On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: >> >>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: >> >>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: >> >>>>I spoke too soon - even with e820_host=0, the same error occurs. >> >>>>What did I break? The code in question is this: >> >>>> >> >>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { >> >>>> ret = libxl__e820_alloc(gc, domid, d_config); >> >>>> if (ret) { >> >>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >> >>>> "Failed while collecting E820 with: %d >> (errno:%d)\n", >> >>>> ret, errno); >> >>>> } >> >>>>} >> >>>> >> >>>>With e820_host=0, that outer black should evaluate to false, >> should >> >>>>it not? In libxl_create.c, if I am understanding the code >> correctly, >> >>>>e820_host is defaulted to false, too. What am I missing? >> > >> >Does your config have ''pci'' in it? The patch you sent had this: >> > >> >+ if (d_config->num_pcidevs) >> >+ libxl_defbool_set(&b_info->e820_host, true); >> > >> >Which means that even if you did not have e820_host it will be >> automatically >> >set if you have PCI devices. >> >> OK - that was embarrasing. Caffeine underflow error. :( >> I backed out that block. I don''t think e820_host should be implicit >> in hvm when PCI devices are passed. >> >> That makes the adjusted patch fragment: >> --- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 >> +++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 >> @@ -1293,7 +1293,7 @@ >> d_config->num_pcidevs++; >> } >> if (d_config->num_pcidevs && c_info->type == >> LIBXL_DOMAIN_TYPE_PV) > > I think you also want to get rid of the c_info->type check?That would alter the current PV behaviour of implicitly enabling e820_host with PCI devices passed, would it not? 
I was hoping to maintain current behaviours intact, and
only affect what happens when e820_host=1 is set for HVMs.

>> -            libxl_defbool_set(&b_info->u.pv.e820_host, true);
>> +            libxl_defbool_set(&b_info->e820_host, true);
>>     }
>>
>>     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
>>
>> This should maintain the old behaviour for backward compatibility
>> when e820_host is not set. I just tested it and it works (with
>> e820_host=1 I get the previous error, with e820_host=0, everything
>> works fine).
>
> I think it might make sense to relax the PV check. That way the only
> way the e820_host capability gets activated is if the guest config
> has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
> be done.

While I think these two checks should be separate in both cases,
I don't know that this won't break something for PV instances. And
I would prefer not to have to also debug that code path at this
point. :)

> Or maybe a negative check - if a 'pci' stanza is there we automatically
> turn on e820_host=1 (right now that is how it works). If the user
> has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
> That way if something is odd we can turn this off?

I am not disagreeing at all - I just really don't want to change
the current PV behaviour since that will potentially require
extra debugging. Current PV behaviour seems to be that if
PCI devices are passed, e820_host=1 is always set regardless
of whether it is explicitly enabled or disabled in the config.

And I have no idea what will happen with a PV domain with
PCI devices if e820_host is explicitly disabled.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-04 18:00 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 03:23:40PM +0100, Gordan Bobic wrote:> On Wed, 4 Sep 2013 10:08:37 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote: > >>On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: > >>>On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: > >>>>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: > >>>>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: > >>>>>>I spoke too soon - even with e820_host=0, the same error occurs. > >>>>>>What did I break? The code in question is this: > >>>>>> > >>>>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { > >>>>>> ret = libxl__e820_alloc(gc, domid, d_config); > >>>>>> if (ret) { > >>>>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > >>>>>> "Failed while collecting E820 with: %d > >>(errno:%d)\n", > >>>>>> ret, errno); > >>>>>> } > >>>>>>} > >>>>>> > >>>>>>With e820_host=0, that outer black should evaluate to false, > >>should > >>>>>>it not? In libxl_create.c, if I am understanding the code > >>correctly, > >>>>>>e820_host is defaulted to false, too. What am I missing? > >>> > >>>Does your config have ''pci'' in it? The patch you sent had this: > >>> > >>>+ if (d_config->num_pcidevs) > >>>+ libxl_defbool_set(&b_info->e820_host, true); > >>> > >>>Which means that even if you did not have e820_host it will be > >>automatically > >>>set if you have PCI devices. > >> > >>OK - that was embarrasing. Caffeine underflow error. :( > >>I backed out that block. I don''t think e820_host should be implicit > >>in hvm when PCI devices are passed. > >> > >>That makes the adjusted patch fragment: > >>--- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 > >>+++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 > >>@@ -1293,7 +1293,7 @@ > >> d_config->num_pcidevs++; > >> } > >> if (d_config->num_pcidevs && c_info->type => >>LIBXL_DOMAIN_TYPE_PV) > > > >I think you also want to get rid of the c_info->type check? 
> That would alter the current PV behaviour of implicitly
> enabling e820_host with PCI devices passed, would it not?
> I was hoping to maintain current behaviours intact, and
> only affect what happens when e820_host=1 is set for HVMs.
>
>>> -            libxl_defbool_set(&b_info->u.pv.e820_host, true);
>>> +            libxl_defbool_set(&b_info->e820_host, true);
>>>     }
>>>
>>>     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
>>>
>>> This should maintain the old behaviour for backward compatibility
>>> when e820_host is not set. I just tested it and it works (with
>>> e820_host=1 I get the previous error, with e820_host=0, everything
>>> works fine).
>>
>> I think it might make sense to relax the PV check. That way the only
>> way the e820_host capability gets activated is if the guest config
>> has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
>> be done.
>
> While I think these two checks should be separate in both cases,
> I don't know that this won't break something for PV instances. And
> I would prefer not to have to also debug that code path at this
> point. :)

OK.

>> Or maybe a negative check - if a 'pci' stanza is there we automatically
>> turn on e820_host=1 (right now that is how it works). If the user
>> has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
>> That way if something is odd we can turn this off?
>
> I am not disagreeing at all - I just really don't want to change
> the current PV behaviour since that will potentially require
> extra debugging. Current PV behaviour seems to be that if
> PCI devices are passed, e820_host=1 is always set regardless
> of whether it is explicitly enabled or disabled in the config.

Right.

> And I have no idea what will happen with a PV domain with
> PCI devices if e820_host is explicitly disabled.

It will boot - but if you have more than 2GB the PCI devices will
most likely not work.

> Gordan
Gordan Bobic
2013-Sep-04 20:18 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
OK, I have done some preliminary testing. Details below.

On 09/04/2013 02:11 PM, Gordan Bobic wrote:
> I have this at the point where it actually builds.
> Otherwise completely untested (will do that later today).
>
> Attached are:
>
> 1) libxl patch
> Modified from the original patch to _not_ implicitly enable
> e820_host when PCI devices are passed.

Builds, works with e820_host=0.

> 2) Mukesh's hypervisor e820 patch from here:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
> Modified slightly to attempt to address Jan's comment on the same
> thread, and to adjust the diff line pointers to match against
> 4.3.0 release code.

Builds, works with e820_host=0.

> 3) A patch based on Konrad's earlier in this thread, with
> a few additions and changes to make it all compile.

Causes the domU to fail to start. No obvious errors in any logs, but
the qemu-dm log simply stops before the usual point. There is a blank
white screen on the VNC console. It looks like the domU crashes before
it even starts loading the OS.

I have attached two qemu-dm logs:

qemu-dm-edi.log - without patch 3
qemu-dm-edi.log.2 - with patch 3

I also attached the output of xl dmesg in each case. With the 3rd
patch applied, everything seems to stop just as the hypervisor is
about to log the E820 table for HVM1 (obvious if you diff them).

This may be related to what I did to get your patch to build, Konrad.
The map never gets output, so either rc=-ENOSYS, or it crashes during
the hypercall. With e820_host=0, the e820 map should be exactly the
same as it would have been anyway, but something seems to go wrong
during:

rc = hypercall_memory_op ( XENMEM_memory_map, &op);

Thoughts?

Gordan
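[Editor's note: for context on what a correct guest E820 must guarantee here - the original ">2GB with passthrough" corruption is guest RAM overlapping the PCI MMIO hole, which a host-derived (sanitized) E820 is meant to prevent. The following self-contained checker is illustrative only (it is not Xen's e820_sanitize) and expresses that invariant: no RAM entry may overlap a non-RAM entry.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal E820 entry, matching the layout used in hvmloader. */
struct e820entry { uint64_t addr, size; uint32_t type; };

#define E820_RAM 1

/* Half-open interval overlap test: [a0,a1) intersects [b0,b1)? */
static int ranges_overlap(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1)
{
    return a0 < b1 && b0 < a1;
}

/* Returns 1 if no RAM entry overlaps any reserved/ACPI/etc. entry -
 * the property whose violation shows up as PCI "memory stomp". */
static int e820_ram_clear_of_reserved(const struct e820entry *map, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (map[i].type != E820_RAM)
            continue;
        for (int j = 0; j < nr; j++) {
            if (j == i || map[j].type == E820_RAM)
                continue;
            if (ranges_overlap(map[i].addr, map[i].addr + map[i].size,
                               map[j].addr, map[j].addr + map[j].size))
                return 0;   /* RAM stomps a reserved region */
        }
    }
    return 1;
}
```

A guest given 3GB of RAM with a flat map starting at 0 fails this check against a reserved hole below 4GB, while a map that relocates the RAM above the hole passes - which is the effect the host-E820 patches are trying to achieve.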
Konrad Rzeszutek Wilk
2013-Sep-05 02:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
> I have this at the point where it actually builds.
> Otherwise completely untested (will do that later today).
>
> Attached are:
>
> 1) libxl patch
> Modified from the original patch to _not_ implicitly enable
> e820_host when PCI devices are passed.
>
> 2) Mukesh's hypervisor e820 patch from here:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
> Modified slightly to attempt to address Jan's comment on the same
> thread, and to adjust the diff line pointers to match against
> 4.3.0 release code.

I think that was the old version. I spotted a bug in it that was
causing a hang, and also one that explains why libxl would refuse to
set up the E820. The problem was that in XENMEM_set_memory_map there
was a check to make sure that the guest launched was not HVM. Also,
there was a bug in the initial domain creation where the spinlock was
only set for PV and not for HVM.

> 3) A patch based on Konrad's earlier in this thread, with
> a few additions and changes to make it all compile.
>
> Some peer review would be most welcome - this is my first
> venture into Xen code, so please do assume that I have
> no idea what I'm doing at the moment. :)
>
> I added yet another E820MAX #define, this time to
> tools/firmware/hvmloader/e820.h
>
> If there is a better place to #include that from in e820.c,
> please point me in the right direction.

I think I saw that #define in tools/libxc/xenctrl.h. But since
tools/firmware cannot link to libxc (b/c it is a mini contained OS)
I believe just having the #define in hvmloader/e820.h is the right
call.

Good first pass. I altered it a bit and got the E820 entries printed
out in the HVM guest.
Here is a big giant diff: diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c index 2e05e93..3c80241 100644 --- a/tools/firmware/hvmloader/e820.c +++ b/tools/firmware/hvmloader/e820.c @@ -22,6 +22,9 @@ #include "config.h" #include "util.h" +#include "hypercall.h" +#include <xen/memory.h> +#include <errno.h> void dump_e820_table(struct e820entry *e820, unsigned int nr) { @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, unsigned int bios_image_base) { unsigned int nr = 0; + struct xen_memory_map op; + struct e820entry map[E820MAX]; + int rc; if ( !lowmem_reserved_base ) lowmem_reserved_base = 0xA0000; + set_xen_guest_handle(op.buffer, map); + + rc = hypercall_memory_op ( XENMEM_memory_map, &op); + if ( rc != -ENOSYS) { /* It works!? */ + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, op.nr_entries); + dump_e820_table(&map[0], op.nr_entries); + } /* Lowmem must be at least 512K to keep Windows happy) */ ASSERT ( lowmem_reserved_base > 512<<10 ); diff --git a/tools/firmware/hvmloader/e820.h b/tools/firmware/hvmloader/e820.h index b2ead7f..2fa700d 100644 --- a/tools/firmware/hvmloader/e820.h +++ b/tools/firmware/hvmloader/e820.h @@ -8,6 +8,9 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define E820_NVS 4 +#define E820_UNUSABLE 5 + +#define E820MAX 128 struct e820entry { uint64_t addr; diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 0c32d0b..d8e2346 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -208,6 +208,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, libxl_defbool_setdefault(&b_info->disable_migrate, false); + libxl_defbool_setdefault(&b_info->e820_host, false); + switch (b_info->type) { case LIBXL_DOMAIN_TYPE_HVM: if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) @@ -280,7 +282,6 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, break; case LIBXL_DOMAIN_TYPE_PV: - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); if 
(b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT)
            b_info->shadow_memkb = 0;
        if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 85341a0..fd6389a 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -299,6 +299,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("irqs", Array(uint32, "num_irqs")),
     ("iomem", Array(libxl_iomem_range, "num_iomem")),
     ("claim_mode", libxl_defbool),
+    # Use host's E820 for PCI passthrough.
+    ("e820_host", libxl_defbool),
     ("u", KeyedUnion(None, libxl_domain_type, "type",
                 [("hvm", Struct(None, [("firmware", string),
                                        ("bios", libxl_bios_type),
@@ -345,8 +347,6 @@ libxl_domain_build_info = Struct("domain_build_info",[
                                        ("cmdline", string),
                                        ("ramdisk", string),
                                        ("features", string, {'const': True}),
-                                       # Use host's E820 for PCI passthrough.
-                                       ("e820_host", libxl_defbool),
                                        ])),
                  ("invalid", Struct(None, [])),
                  ], keyvar_init_val = "LIBXL_DOMAIN_TYPE_INVALID")),
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index a78c91d..94515a5 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
     struct e820entry map[E820MAX];
     libxl_domain_build_info *b_info;
 
-    if (d_config == NULL || d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM)
-        return ERROR_INVAL;
-
     b_info = &d_config->b_info;
-    if (!libxl_defbool_val(b_info->u.pv.e820_host))
+    if (!libxl_defbool_val(b_info->e820_host)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, __LINE__);
         return ERROR_INVAL;
-
+    }
     rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
     if (rc < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, __LINE__);
         errno = rc;
         return ERROR_FAIL;
     }
     nr = rc;
-    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
-                       (b_info->max_memkb - b_info->target_memkb) +
-                       b_info->u.pv.slack_memkb);
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, __LINE__, nr);
+    if (d_config == NULL || d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
+        rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                           (b_info->max_memkb - b_info->target_memkb));
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
+    } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) {
+        rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                           (b_info->max_memkb - b_info->target_memkb) +
+                           b_info->u.pv.slack_memkb);
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
+    }
+
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
     if (rc)
         return ERROR_FAIL;
 
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, nr:%d",__func__, __LINE__, rc, nr);
+
     rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
 
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
     if (rc < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
         errno = rc;
         return ERROR_FAIL;
     }
@@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
         xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL);
     }
 
-    if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV &&
-        libxl_defbool_val(d_config->b_info.u.pv.e820_host)) {
+    if (libxl_defbool_val(d_config->b_info.e820_host)) {
         ret = libxl__e820_alloc(gc, domid, d_config);
         if (ret) {
             LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index ed99622..d98ca24 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1291,11 +1291,7 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0))
         pci_permissive = l;
 
-    /* To be reworked (automatically enabled) once the auto ballooning
-     * after guest starts is done (with PCI devices passed in). */
-    if (c_info->type == LIBXL_DOMAIN_TYPE_PV) {
-        xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
-    }
+    xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, 0);
 
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         d_config->num_pcidevs = 0;
@@ -1314,7 +1310,7 @@ skip_vfb:
             d_config->num_pcidevs++;
         }
         if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
-            libxl_defbool_set(&b_info->u.pv.e820_host, true);
+            libxl_defbool_set(&b_info->e820_host, true);
     }
 
     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c
index a16a025..f34f0ba 100644
--- a/tools/libxl/xl_sxp.c
+++ b/tools/libxl/xl_sxp.c
@@ -87,6 +87,10 @@ void printf_info_sexp(int domid, libxl_domain_config *d_config)
         }
     }
 
+    printf("\t(e820_host %s)\n",
+           libxl_defbool_to_string(b_info->e820_host));
+
+
     printf("\t(image\n");
     switch (c_info->type) {
     case LIBXL_DOMAIN_TYPE_HVM:
@@ -150,8 +154,6 @@ void printf_info_sexp(int domid, libxl_domain_config *d_config)
         printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel);
         printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline);
         printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk);
-        printf("\t\t\t(e820_host %s)\n",
-               libxl_defbool_to_string(b_info->u.pv.e820_host));
         printf("\t\t)\n");
         break;
     default:
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 874742c..4796221 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
-
-        spin_lock_init(&d->arch.pv_domain.e820_lock);
     }
 
+    spin_lock_init(&d->arch.e820_lock);
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
     spin_lock_init(&d->arch.vtsc_lock);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 54b1e6a..6c9b58c 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = do_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
@@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = compat_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7f0e13..4c3ce9a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4740,19 +4740,13 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return rc;
         }
 
-        if ( is_hvm_domain(d) )
-        {
-            rcu_unlock_domain(d);
-            return -EPERM;
-        }
-
         e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries);
         if ( e820 == NULL )
         {
             rcu_unlock_domain(d);
             return -ENOMEM;
         }
-
+
         if ( copy_from_guest(e820, fmap.map.buffer, fmap.map.nr_entries) )
         {
             xfree(e820);
@@ -4760,11 +4754,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -EFAULT;
         }
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
-        xfree(d->arch.pv_domain.e820);
-        d->arch.pv_domain.e820 = e820;
-        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
+        xfree(d->arch.e820);
+        d->arch.e820 = e820;
+        d->arch.nr_e820 = fmap.map.nr_entries;
+        spin_unlock(&d->arch.e820_lock);
 
         rcu_unlock_domain(d);
         return rc;
@@ -4778,26 +4772,26 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
        if ( copy_from_guest(&map, arg, 1) )
            return -EFAULT;
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
 
        /* Backwards compatibility. */
-        if ( (d->arch.pv_domain.nr_e820 == 0) ||
-             (d->arch.pv_domain.e820 == NULL) )
+        if ( (d->arch.nr_e820 == 0) ||
+             (d->arch.e820 == NULL) )
        {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
            return -ENOSYS;
        }
 
-        map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820);
-        if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820,
+        map.nr_entries = min(map.nr_entries, d->arch.nr_e820);
+        if ( copy_to_guest(map.buffer, d->arch.e820,
                            map.nr_entries) ||
            __copy_to_guest(arg, &map, 1) )
        {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
            return -EFAULT;
        }
 
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_unlock(&d->arch.e820_lock);
        return 0;
    }
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index d79464d..c3f9f8e 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -234,11 +234,6 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
-
-    /* Pseudophysical e820 map (XENMEM_memory_map). */
-    spinlock_t e820_lock;
-    struct e820entry *e820;
-    unsigned int nr_e820;
 };
 
 struct arch_domain
@@ -313,6 +308,11 @@ struct arch_domain
                                         (possibly other cases in the future */
     uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */
     uint64_t vtsc_usercount; /* not used for hvm */
+
+    /* Pseudophysical e820 map (XENMEM_memory_map). */
+    spinlock_t e820_lock;
+    struct e820entry *e820;
+    unsigned int nr_e820;
 } __cacheline_aligned;
 
 #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list))
Gordan Bobic
2013-Sep-05 09:41 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Hmm...

gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes -Wdeclaration-after-statement -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o debug.o
gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes -Wdeclaration-after-statement -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o domain.o
domain.c: In function ‘arch_domain_destroy’:
domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’
make[4]: *** [domain.o] Error 1

It would seem you omitted this block from the original patch:

==
@@ -592,8 +592,8 @@ void arch_domain_destroy(struct domain *d)
 {
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+
+    xfree(d->arch.e820);
 
     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

Was that intentional?
Does that block look OK to you? Should I re-add it?

Gordan

On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
>> I have this at the point where it actually builds.
>> Otherwise completely untested (will do that later today).
>>
>> Attached are:
>>
>> 1) libxl patch
>> Modified from the original patch to _not_ implicitly enable
>> e820_host when PCI devices are passed.
>>
>> 2) Mukesh's hypervisor e820 patch from here:
>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>> Modified slightly to attempt to address Jan's comment on the same
>> thread, and to adjust the diff line pointers to match against
>> 4.3.0 release code.
>
> I think that was the old version. I spotted a bug in it that
> was causing a hang. And also the one that explains why libxl
> would refuse to setup the E820.
>
> The problem was that in the XENMEM_set_memory_map there was
> a check to make sure that the guest launched was not HVM.
>
> Also there was bug in the initial domain creation where
> the spinlock was only set for PV and not for HVM.
>
>> 3) A patch based on Konrad's earlier in this thread, with
>> a few additions and changes to make it all compile.
>>
>> Some peer review would be most welcome - this is my first
>> venture into Xen code, so please do assume that I have
>> no idea what I'm doing at the moment. :)
>>
>> I added yet another E820MAX #define, this time to
>> tools/firmware/hvmloader/e820.h
>>
>> If there is a better place to #include that via from
>> e820.c, please point me in the right direction.
>
> I think I saw that #define in tools/libxc/xenctrl.h. But since
> the tools/firmware cannot link to the libxc (b/c it is a Minicontained
> OS) I believe just having the #define in hvmloader/e820.h is
> the right call.
>
> Good first pass. I altered it a bit and got in the HVM guest
> the E820 entries printed out.
Here is a big giant diff: > > diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..3c80241 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -22,6 +22,9 @@ > > #include "config.h" > #include "util.h" > +#include "hypercall.h" > +#include <xen/memory.h> > +#include <errno.h> > > void dump_e820_table(struct e820entry *e820, unsigned int nr) > { > @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_map, &op); > + if ( rc != -ENOSYS) { /* It works!? */ > + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, > op.nr_entries); > + dump_e820_table(&map[0], op.nr_entries); > + } > /* Lowmem must be at least 512K to keep Windows happy) */ > ASSERT ( lowmem_reserved_base > 512<<10 ); > > diff --git a/tools/firmware/hvmloader/e820.h > b/tools/firmware/hvmloader/e820.h > index b2ead7f..2fa700d 100644 > --- a/tools/firmware/hvmloader/e820.h > +++ b/tools/firmware/hvmloader/e820.h > @@ -8,6 +8,9 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSABLE 5 > + > +#define E820MAX 128 > > struct e820entry { > uint64_t addr; > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > index 0c32d0b..d8e2346 100644 > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -208,6 +208,8 @@ int libxl__domain_build_info_setdefault(libxl__gc > *gc, > > libxl_defbool_setdefault(&b_info->disable_migrate, false); > > + libxl_defbool_setdefault(&b_info->e820_host, false); > + > switch (b_info->type) { > case LIBXL_DOMAIN_TYPE_HVM: > if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) > @@ -280,7 +282,6 @@ int 
libxl__domain_build_info_setdefault(libxl__gc > *gc, > > break; > case LIBXL_DOMAIN_TYPE_PV: > - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); > if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) > b_info->shadow_memkb = 0; > if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) > diff --git a/tools/libxl/libxl_types.idl > b/tools/libxl/libxl_types.idl > index 85341a0..fd6389a 100644 > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -299,6 +299,8 @@ libxl_domain_build_info = > Struct("domain_build_info",[ > ("irqs", Array(uint32, "num_irqs")), > ("iomem", Array(libxl_iomem_range, "num_iomem")), > ("claim_mode", libxl_defbool), > + # Use host's E820 for PCI passthrough. > + ("e820_host", libxl_defbool), > ("u", KeyedUnion(None, libxl_domain_type, "type", > [("hvm", Struct(None, [("firmware", string), > ("bios", > libxl_bios_type), > @@ -345,8 +347,6 @@ libxl_domain_build_info = > Struct("domain_build_info",[ > ("cmdline", string), > ("ramdisk", string), > ("features", string, {'const': > True}), > - # Use host's E820 for PCI > passthrough. 
> - ("e820_host", libxl_defbool), > ])), > ("invalid", Struct(None, [])), > ], keyvar_init_val = "LIBXL_DOMAIN_TYPE_INVALID")), > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index a78c91d..94515a5 100644 > --- a/tools/libxl/libxl_x86.c > +++ b/tools/libxl/libxl_x86.c > @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, > uint32_t domid, > struct e820entry map[E820MAX]; > libxl_domain_build_info *b_info; > > - if (d_config == NULL || d_config->c_info.type == > LIBXL_DOMAIN_TYPE_HVM) > - return ERROR_INVAL; > - > b_info = &d_config->b_info; > - if (!libxl_defbool_val(b_info->u.pv.e820_host)) > + if (!libxl_defbool_val(b_info->e820_host)) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, > __LINE__); > return ERROR_INVAL; > - > + } > rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > if (rc < 0) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, > __LINE__); > errno = rc; > return ERROR_FAIL; > } > nr = rc; > - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > - (b_info->max_memkb - b_info->target_memkb) + > - b_info->u.pv.slack_memkb); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, > __LINE__, nr); > + if (d_config == NULL || d_config->c_info.type => LIBXL_DOMAIN_TYPE_HVM) { > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - > b_info->target_memkb)); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > + } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - > b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > + } > + > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > if (rc) > return ERROR_FAIL; > > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, > nr:%d",__func__, __LINE__, rc, nr); 
> + > rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); > > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > if (rc < 0) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > errno = rc; > return ERROR_FAIL; > } > @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, > libxl_domain_config *d_config, > xc_shadow_control(ctx->xch, domid, > XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); > } > > - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && > - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { > + if (libxl_defbool_val(d_config->b_info.e820_host)) { > ret = libxl__e820_alloc(gc, domid, d_config); > if (ret) { > LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index ed99622..d98ca24 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -1291,11 +1291,7 @@ skip_vfb: > if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) > pci_permissive = l; > > - /* To be reworked (automatically enabled) once the auto > ballooning > - * after guest starts is done (with PCI devices passed in). 
*/ > - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { > - xlu_cfg_get_defbool(config, "e820_host", > &b_info->u.pv.e820_host, 0); > - } > + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, 0); > > if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { > d_config->num_pcidevs = 0; > @@ -1314,7 +1310,7 @@ skip_vfb: > d_config->num_pcidevs++; > } > if (d_config->num_pcidevs && c_info->type == > LIBXL_DOMAIN_TYPE_PV) > - libxl_defbool_set(&b_info->u.pv.e820_host, true); > + libxl_defbool_set(&b_info->e820_host, true); > } > > switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { > diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c > index a16a025..f34f0ba 100644 > --- a/tools/libxl/xl_sxp.c > +++ b/tools/libxl/xl_sxp.c > @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, > libxl_domain_config *d_config) > } > } > > + printf("\t(e820_host %s)\n", > + libxl_defbool_to_string(b_info->e820_host)); > + > + > printf("\t(image\n"); > switch (c_info->type) { > case LIBXL_DOMAIN_TYPE_HVM: > @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, > libxl_domain_config *d_config) > printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); > printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); > printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); > - printf("\t\t\t(e820_host %s)\n", > - libxl_defbool_to_string(b_info->u.pv.e820_host)); > printf("\t\t)\n"); > break; > default: > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index 874742c..4796221 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, > unsigned int domcr_flags) > { > /* 64-bit PV guest by default. 
*/ > d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; > - > - spin_lock_init(&d->arch.pv_domain.e820_lock); > } > > + spin_lock_init(&d->arch.e820_lock); > /* initialize default tsc behavior in case tools don't */ > tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); > spin_lock_init(&d->arch.vtsc_lock); > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 54b1e6a..6c9b58c 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = compat_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index e7f0e13..4c3ce9a 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > return rc; > } > > - if ( is_hvm_domain(d) ) > - { > - rcu_unlock_domain(d); > - return -EPERM; > - } > - > e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); > if ( e820 == NULL ) > { > rcu_unlock_domain(d); > return -ENOMEM; > } > - > + > if ( copy_from_guest(e820, fmap.map.buffer, > fmap.map.nr_entries) ) > { > xfree(e820); > @@ -4760,11 +4754,11 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > return -EFAULT; > } > > - 
spin_lock(&d->arch.pv_domain.e820_lock); > - xfree(d->arch.pv_domain.e820); > - d->arch.pv_domain.e820 = e820; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > + xfree(d->arch.e820); > + d->arch.e820 = e820; > + d->arch.nr_e820 = fmap.map.nr_entries; > + spin_unlock(&d->arch.e820_lock); > > rcu_unlock_domain(d); > return rc; > @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > - spin_lock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > > /* Backwards compatibility. */ > - if ( (d->arch.pv_domain.nr_e820 == 0) || > - (d->arch.pv_domain.e820 == NULL) ) > + if ( (d->arch.nr_e820 == 0) || > + (d->arch.e820 == NULL) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -ENOSYS; > } > > - map.nr_entries = min(map.nr_entries, > d->arch.pv_domain.nr_e820); > - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, > + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); > + if ( copy_to_guest(map.buffer, d->arch.e820, > map.nr_entries) || > __copy_to_guest(arg, &map, 1) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -EFAULT; > } > > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return 0; > } > > diff --git a/xen/include/asm-x86/domain.h > b/xen/include/asm-x86/domain.h > index d79464d..c3f9f8e 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -234,11 +234,6 @@ struct pv_domain > > /* map_domain_page() mapping cache. */ > struct mapcache_domain mapcache; > - > - /* Pseudophysical e820 map (XENMEM_memory_map). 
*/ > - spinlock_t e820_lock; > - struct e820entry *e820; > - unsigned int nr_e820; > }; > > struct arch_domain > @@ -313,6 +308,11 @@ struct arch_domain > (possibly other cases in the future > */ > uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ > uint64_t vtsc_usercount; /* not used for hvm */ > + > + /* Pseudophysical e820 map (XENMEM_memory_map). */ > + spinlock_t e820_lock; > + struct e820entry *e820; > + unsigned int nr_e820; > } __cacheline_aligned; > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Gordan Bobic
2013-Sep-05 10:00 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, 05 Sep 2013 10:41:09 +0100, Gordan Bobic <gordan@bobich.net> wrote:> Hmm... > > gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common > -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith > -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default > -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs > -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables > -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include > /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI > -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o > debug.o > gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common > -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith > -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default > -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs > -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables > -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include > /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI > -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o > domain.o > domain.c: In function ‘arch_domain_destroy’: > domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’ > make[4]: *** [domain.o] Error 1 > > It would seem you omitted this block from the original patch: > > ==> @@ -592,8 +592,8 @@ void arch_domain_destroy(struct domain *d) > { > if ( is_hvm_domain(d) ) > 
        hvm_domain_destroy(d);
> -    else
> -        xfree(d->arch.pv_domain.e820);
> +
> +    xfree(d->arch.e820);
>
>     free_domain_pirqs(d);
>     if ( !is_idle_domain(d) )
> ==
>
> Was that intentional? Does that block look OK to you? Should I re-add
> it?

Just to clarify - re-adding this block fixes the build issue.
Will test tonight whether it runs.

What I really wanted to know is whether this is the correct way
to handle the cleanup in this case.

Gordan

> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
>>> I have this at the point where it actually builds.
>>> Otherwise completely untested (will do that later today).
>>>
>>> Attached are:
>>>
>>> 1) libxl patch
>>> Modified from the original patch to _not_ implicitly enable
>>> e820_host when PCI devices are passed.
>>>
>>> 2) Mukesh's hypervisor e820 patch from here:
>>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>>> Modified slightly to attempt to address Jan's comment on the same
>>> thread, and to adjust the diff line pointers to match against
>>> 4.3.0 release code.
>>
>> I think that was the old version. I spotted a bug in it that
>> was causing a hang. And also the one that explains why libxl
>> would refuse to setup the E820.
>>
>> The problem was that in the XENMEM_set_memory_map there was
>> a check to make sure that the guest launched was not HVM.
>>
>> Also there was bug in the initial domain creation where
>> the spinlock was only set for PV and not for HVM.
>>
>>>
>>> 3) A patch based on Konrad's earlier in this thread, with
>>> a few additions and changes to make it all compile.
>>>
>>> Some peer review would be most welcome - this is my first
>>> venture into Xen code, so please do assume that I have
>>> no idea what I'm doing at the moment.
:) >>> >>> I added yet another E820MAX #define, this time to >>> tools/firmware/hvmloader/e820.h >>> >>> If there is a better place to #include that via from >>> e820.c, please point me in the right direction. >> >> I think I saw that #define in tools/libxc/xenctrl.h. But since >> the tools/firmware cannot link to the libxc (b/c it is a >> Minicontained >> OS) I believe just having the #define in hvmloader/e820.h is >> the right call. >> >> Good first pass. I altered it a bit and got in the HVM guest >> the E820 entries printed out. Here is a big giant diff: >> >> diff --git a/tools/firmware/hvmloader/e820.c >> b/tools/firmware/hvmloader/e820.c >> index 2e05e93..3c80241 100644 >> --- a/tools/firmware/hvmloader/e820.c >> +++ b/tools/firmware/hvmloader/e820.c >> @@ -22,6 +22,9 @@ >> >> #include "config.h" >> #include "util.h" >> +#include "hypercall.h" >> +#include <xen/memory.h> >> +#include <errno.h> >> >> void dump_e820_table(struct e820entry *e820, unsigned int nr) >> { >> @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, >> unsigned int bios_image_base) >> { >> unsigned int nr = 0; >> + struct xen_memory_map op; >> + struct e820entry map[E820MAX]; >> + int rc; >> >> if ( !lowmem_reserved_base ) >> lowmem_reserved_base = 0xA0000; >> >> + set_xen_guest_handle(op.buffer, map); >> + >> + rc = hypercall_memory_op ( XENMEM_memory_map, &op); >> + if ( rc != -ENOSYS) { /* It works!? 
*/ >> + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >> op.nr_entries); >> + dump_e820_table(&map[0], op.nr_entries); >> + } >> /* Lowmem must be at least 512K to keep Windows happy) */ >> ASSERT ( lowmem_reserved_base > 512<<10 ); >> >> diff --git a/tools/firmware/hvmloader/e820.h >> b/tools/firmware/hvmloader/e820.h >> index b2ead7f..2fa700d 100644 >> --- a/tools/firmware/hvmloader/e820.h >> +++ b/tools/firmware/hvmloader/e820.h >> @@ -8,6 +8,9 @@ >> #define E820_RESERVED 2 >> #define E820_ACPI 3 >> #define E820_NVS 4 >> +#define E820_UNUSABLE 5 >> + >> +#define E820MAX 128 >> >> struct e820entry { >> uint64_t addr; >> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c >> index 0c32d0b..d8e2346 100644 >> --- a/tools/libxl/libxl_create.c >> +++ b/tools/libxl/libxl_create.c >> @@ -208,6 +208,8 @@ int >> libxl__domain_build_info_setdefault(libxl__gc *gc, >> >> libxl_defbool_setdefault(&b_info->disable_migrate, false); >> >> + libxl_defbool_setdefault(&b_info->e820_host, false); >> + >> switch (b_info->type) { >> case LIBXL_DOMAIN_TYPE_HVM: >> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >> @@ -280,7 +282,6 @@ int >> libxl__domain_build_info_setdefault(libxl__gc *gc, >> >> break; >> case LIBXL_DOMAIN_TYPE_PV: >> - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); >> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >> b_info->shadow_memkb = 0; >> if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) >> diff --git a/tools/libxl/libxl_types.idl >> b/tools/libxl/libxl_types.idl >> index 85341a0..fd6389a 100644 >> --- a/tools/libxl/libxl_types.idl >> +++ b/tools/libxl/libxl_types.idl >> @@ -299,6 +299,8 @@ libxl_domain_build_info = >> Struct("domain_build_info",[ >> ("irqs", Array(uint32, "num_irqs")), >> ("iomem", Array(libxl_iomem_range, "num_iomem")), >> ("claim_mode", libxl_defbool), >> + # Use host's E820 for PCI passthrough. 
>> + ("e820_host", libxl_defbool), >> ("u", KeyedUnion(None, libxl_domain_type, "type", >> [("hvm", Struct(None, [("firmware", >> string), >> ("bios", >> libxl_bios_type), >> @@ -345,8 +347,6 @@ libxl_domain_build_info = >> Struct("domain_build_info",[ >> ("cmdline", string), >> ("ramdisk", string), >> ("features", string, >> {'const': True}), >> - # Use host's E820 for PCI >> passthrough. >> - ("e820_host", libxl_defbool), >> ])), >> ("invalid", Struct(None, [])), >> ], keyvar_init_val = >> "LIBXL_DOMAIN_TYPE_INVALID")), >> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >> index a78c91d..94515a5 100644 >> --- a/tools/libxl/libxl_x86.c >> +++ b/tools/libxl/libxl_x86.c >> @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, >> uint32_t domid, >> struct e820entry map[E820MAX]; >> libxl_domain_build_info *b_info; >> >> - if (d_config == NULL || d_config->c_info.type == >> LIBXL_DOMAIN_TYPE_HVM) >> - return ERROR_INVAL; >> - >> b_info = &d_config->b_info; >> - if (!libxl_defbool_val(b_info->u.pv.e820_host)) >> + if (!libxl_defbool_val(b_info->e820_host)) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >> __LINE__); >> return ERROR_INVAL; >> - >> + } >> rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >> if (rc < 0) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >> __LINE__); >> errno = rc; >> return ERROR_FAIL; >> } >> nr = rc; >> - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> - (b_info->max_memkb - b_info->target_memkb) + >> - b_info->u.pv.slack_memkb); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, >> __LINE__, nr); >> + if (d_config == NULL || d_config->c_info.type =>> LIBXL_DOMAIN_TYPE_HVM) { >> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - >> b_info->target_memkb)); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> + } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { >> + 
rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - >> b_info->target_memkb) + >> + b_info->u.pv.slack_memkb); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> + } >> + >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> if (rc) >> return ERROR_FAIL; >> >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, >> nr:%d",__func__, __LINE__, rc, nr); >> + >> rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); >> >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> if (rc < 0) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> errno = rc; >> return ERROR_FAIL; >> } >> @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, >> libxl_domain_config *d_config, >> xc_shadow_control(ctx->xch, domid, >> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); >> } >> >> - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && >> - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { >> + if (libxl_defbool_val(d_config->b_info.e820_host)) { >> ret = libxl__e820_alloc(gc, domid, d_config); >> if (ret) { >> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c >> index ed99622..d98ca24 100644 >> --- a/tools/libxl/xl_cmdimpl.c >> +++ b/tools/libxl/xl_cmdimpl.c >> @@ -1291,11 +1291,7 @@ skip_vfb: >> if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) >> pci_permissive = l; >> >> - /* To be reworked (automatically enabled) once the auto >> ballooning >> - * after guest starts is done (with PCI devices passed in). 
*/ >> - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { >> - xlu_cfg_get_defbool(config, "e820_host", >> &b_info->u.pv.e820_host, 0); >> - } >> + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, >> 0); >> >> if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { >> d_config->num_pcidevs = 0; >> @@ -1314,7 +1310,7 @@ skip_vfb: >> d_config->num_pcidevs++; >> } >> if (d_config->num_pcidevs && c_info->type == >> LIBXL_DOMAIN_TYPE_PV) >> - libxl_defbool_set(&b_info->u.pv.e820_host, true); >> + libxl_defbool_set(&b_info->e820_host, true); >> } >> >> switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { >> diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c >> index a16a025..f34f0ba 100644 >> --- a/tools/libxl/xl_sxp.c >> +++ b/tools/libxl/xl_sxp.c >> @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, >> libxl_domain_config *d_config) >> } >> } >> >> + printf("\t(e820_host %s)\n", >> + libxl_defbool_to_string(b_info->e820_host)); >> + >> + >> printf("\t(image\n"); >> switch (c_info->type) { >> case LIBXL_DOMAIN_TYPE_HVM: >> @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, >> libxl_domain_config *d_config) >> printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); >> printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); >> printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); >> - printf("\t\t\t(e820_host %s)\n", >> - libxl_defbool_to_string(b_info->u.pv.e820_host)); >> printf("\t\t)\n"); >> break; >> default: >> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c >> index 874742c..4796221 100644 >> --- a/xen/arch/x86/domain.c >> +++ b/xen/arch/x86/domain.c >> @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, >> unsigned int domcr_flags) >> { >> /* 64-bit PV guest by default. 
*/ >> d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; >> - >> - spin_lock_init(&d->arch.pv_domain.e820_lock); >> } >> >> + spin_lock_init(&d->arch.e820_lock); >> /* initialize default tsc behavior in case tools don't */ >> tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); >> spin_lock_init(&d->arch.vtsc_lock); >> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c >> index 54b1e6a..6c9b58c 100644 >> --- a/xen/arch/x86/hvm/hvm.c >> +++ b/xen/arch/x86/hvm/hvm.c >> @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> >> switch ( cmd & MEMOP_CMD_MASK ) >> { >> - case XENMEM_memory_map: >> case XENMEM_machine_memory_map: >> case XENMEM_machphys_mapping: >> return -ENOSYS; >> + case XENMEM_memory_map: >> case XENMEM_decrease_reservation: >> rc = do_memory_op(cmd, arg); >> current->domain->arch.hvm_domain.qemu_mapcache_invalidate = >> 1; >> @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> >> switch ( cmd & MEMOP_CMD_MASK ) >> { >> - case XENMEM_memory_map: >> case XENMEM_machine_memory_map: >> case XENMEM_machphys_mapping: >> return -ENOSYS; >> + case XENMEM_memory_map: >> case XENMEM_decrease_reservation: >> rc = compat_memory_op(cmd, arg); >> current->domain->arch.hvm_domain.qemu_mapcache_invalidate = >> 1; >> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c >> index e7f0e13..4c3ce9a 100644 >> --- a/xen/arch/x86/mm.c >> +++ b/xen/arch/x86/mm.c >> @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> return rc; >> } >> >> - if ( is_hvm_domain(d) ) >> - { >> - rcu_unlock_domain(d); >> - return -EPERM; >> - } >> - >> e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); >> if ( e820 == NULL ) >> { >> rcu_unlock_domain(d); >> return -ENOMEM; >> } >> - >> + >> if ( copy_from_guest(e820, fmap.map.buffer, >> fmap.map.nr_entries) ) >> { >> xfree(e820); >> @@ -4760,11 +4754,11 @@ long arch_memory_op(int op, >> 
XEN_GUEST_HANDLE_PARAM(void) arg) >> return -EFAULT; >> } >> >> - spin_lock(&d->arch.pv_domain.e820_lock); >> - xfree(d->arch.pv_domain.e820); >> - d->arch.pv_domain.e820 = e820; >> - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_lock(&d->arch.e820_lock); >> + xfree(d->arch.e820); >> + d->arch.e820 = e820; >> + d->arch.nr_e820 = fmap.map.nr_entries; >> + spin_unlock(&d->arch.e820_lock); >> >> rcu_unlock_domain(d); >> return rc; >> @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> if ( copy_from_guest(&map, arg, 1) ) >> return -EFAULT; >> >> - spin_lock(&d->arch.pv_domain.e820_lock); >> + spin_lock(&d->arch.e820_lock); >> >> /* Backwards compatibility. */ >> - if ( (d->arch.pv_domain.nr_e820 == 0) || >> - (d->arch.pv_domain.e820 == NULL) ) >> + if ( (d->arch.nr_e820 == 0) || >> + (d->arch.e820 == NULL) ) >> { >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return -ENOSYS; >> } >> >> - map.nr_entries = min(map.nr_entries, >> d->arch.pv_domain.nr_e820); >> - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, >> + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); >> + if ( copy_to_guest(map.buffer, d->arch.e820, >> map.nr_entries) || >> __copy_to_guest(arg, &map, 1) ) >> { >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return -EFAULT; >> } >> >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return 0; >> } >> >> diff --git a/xen/include/asm-x86/domain.h >> b/xen/include/asm-x86/domain.h >> index d79464d..c3f9f8e 100644 >> --- a/xen/include/asm-x86/domain.h >> +++ b/xen/include/asm-x86/domain.h >> @@ -234,11 +234,6 @@ struct pv_domain >> >> /* map_domain_page() mapping cache. */ >> struct mapcache_domain mapcache; >> - >> - /* Pseudophysical e820 map (XENMEM_memory_map). 
*/ >> - spinlock_t e820_lock; >> - struct e820entry *e820; >> - unsigned int nr_e820; >> }; >> >> struct arch_domain >> @@ -313,6 +308,11 @@ struct arch_domain >> (possibly other cases in the future >> */ >> uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ >> uint64_t vtsc_usercount; /* not used for hvm */ >> + >> + /* Pseudophysical e820 map (XENMEM_memory_map). */ >> + spinlock_t e820_lock; >> + struct e820entry *e820; >> + unsigned int nr_e820; >> } __cacheline_aligned; >> >> #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Gordan Bobic
2013-Sep-05 10:26 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> diff --git a/tools/firmware/hvmloader/e820.h > b/tools/firmware/hvmloader/e820.h > index b2ead7f..2fa700d 100644 > --- a/tools/firmware/hvmloader/e820.h > +++ b/tools/firmware/hvmloader/e820.h > @@ -8,6 +8,9 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSABLE 5 > + > +#define E820MAX 128 > > struct e820entry { > uint64_t addr;

I don't think we actually need +#define E820_UNUSABLE 5 any more because it is no longer used anywhere in the patch. Do we need that extra e820 hole type? I guess it's only useful if we want to explicitly signify that a memory hole is inherited from the host e820 map, rather than _really_ needed. Otherwise we could probably just use E820_RESERVED in its place. Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 12:36 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:> On Thu, 05 Sep 2013 10:41:09 +0100, Gordan Bobic <gordan@bobich.net> > wrote: >> Hmm... >> >> gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 >> -Wall -Wstrict-prototypes -Wdeclaration-after-statement >> -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common >> -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith >> -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default >> -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs >> -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables >> -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include >> /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI >> -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o >> debug.o >> gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 >> -Wall -Wstrict-prototypes -Wdeclaration-after-statement >> -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common >> -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith >> -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default >> -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs >> -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables >> -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include >> /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI >> -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o >> domain.o >> domain.c: In function ‘arch_domain_destroy’: >> domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’ >> make[4]: *** [domain.o] Error 1 >> >> It would seem you omitted this block from the original patch: >> >> ==>> @@ -592,8 +592,8 @@ void 
arch_domain_destroy(struct domain *d) >> { >> if ( is_hvm_domain(d) ) >> hvm_domain_destroy(d); >> - else >> - xfree(d->arch.pv_domain.e820); >> + >> + xfree(d->arch.e820); >> >> free_domain_pirqs(d); >> if ( !is_idle_domain(d) ) >> ==>> >> Was that intentional? Does that block look OK to you? Should I re-add > >> it? > > Just to clarify - re-adding this block fixes the build issue. > Will test tonight whether it runs. What I really wanted to > know is whether this is the correct way to handle the cleanup > in this case.It is correct. I must have messed up my tree after I tested it.> > Gordan > >> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk >> <konrad.wilk@oracle.com> wrote: >>> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote: >>>> I have this at the point where it actually builds. >>>> Otherwise completely untested (will do that later today). >>>> >>>> Attached are: >>>> >>>> 1) libxl patch >>>> Modified from the original patch to _not_ implicitly enable >>>> e820_host when PCI devices are passed. >>>> >>>> 2) Mukesh's hypervisor e820 patch from here: >>>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html >>>> Modified slightly to attempt to address Jan's comment on the same >>>> thread, and to adjust the diff line pointers to match against >>>> 4.3.0 release code. >>> >>> I think that was the old version. I spotted a bug in it that >>> was causing a hang. And also the one that explains why libxl >>> would refuse to setup the E820. >>> >>> The problem was that in the XENMEM_set_memory_map there was >>> a check to make sure that the guest launched was not HVM. >>> >>> Also there was bug in the initial domain creation where >>> the spinlock was only set for PV and not for HVM. >>> >>>> >>>> 3) A patch based on Konrad's earlier in this thread, with >>>> a few additions and changes to make it all compile. 
>>>> >>>> Some peer review would be most welcome - this is my first >>>> venture into Xen code, so please do assume that I have >>>> no idea what I'm doing at the moment. :) >>>> >>>> I added yet another E820MAX #define, this time to >>>> tools/firmware/hvmloader/e820.h >>>> >>>> If there is a better place to #include that via from >>>> e820.c, please point me in the right direction. >>> >>> I think I saw that #define in tools/libxc/xenctrl.h. But since >>> the tools/firmware cannot link to the libxc (b/c it is a >>> Minicontained >>> OS) I believe just having the #define in hvmloader/e820.h is >>> the right call. >>> >>> Good first pass. I altered it a bit and got in the HVM guest >>> the E820 entries printed out. Here is a big giant diff: >>> >>> diff --git a/tools/firmware/hvmloader/e820.c >>> b/tools/firmware/hvmloader/e820.c >>> index 2e05e93..3c80241 100644 >>> --- a/tools/firmware/hvmloader/e820.c >>> +++ b/tools/firmware/hvmloader/e820.c >>> @@ -22,6 +22,9 @@ >>> >>> #include "config.h" >>> #include "util.h" >>> +#include "hypercall.h" >>> +#include <xen/memory.h> >>> +#include <errno.h> >>> >>> void dump_e820_table(struct e820entry *e820, unsigned int nr) >>> { >>> @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, >>> unsigned int bios_image_base) >>> { >>> unsigned int nr = 0; >>> + struct xen_memory_map op; >>> + struct e820entry map[E820MAX]; >>> + int rc; >>> >>> if ( !lowmem_reserved_base ) >>> lowmem_reserved_base = 0xA0000; >>> >>> + set_xen_guest_handle(op.buffer, map); >>> + >>> + rc = hypercall_memory_op ( XENMEM_memory_map, &op); >>> + if ( rc != -ENOSYS) { /* It works!? 
*/ >>> + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >>> op.nr_entries); >>> + dump_e820_table(&map[0], op.nr_entries); >>> + } >>> /* Lowmem must be at least 512K to keep Windows happy) */ >>> ASSERT ( lowmem_reserved_base > 512<<10 ); >>> >>> diff --git a/tools/firmware/hvmloader/e820.h >>> b/tools/firmware/hvmloader/e820.h >>> index b2ead7f..2fa700d 100644 >>> --- a/tools/firmware/hvmloader/e820.h >>> +++ b/tools/firmware/hvmloader/e820.h >>> @@ -8,6 +8,9 @@ >>> #define E820_RESERVED 2 >>> #define E820_ACPI 3 >>> #define E820_NVS 4 >>> +#define E820_UNUSABLE 5 >>> + >>> +#define E820MAX 128 >>> >>> struct e820entry { >>> uint64_t addr; >>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c >>> index 0c32d0b..d8e2346 100644 >>> --- a/tools/libxl/libxl_create.c >>> +++ b/tools/libxl/libxl_create.c >>> @@ -208,6 +208,8 @@ int >>> libxl__domain_build_info_setdefault(libxl__gc *gc, >>> >>> libxl_defbool_setdefault(&b_info->disable_migrate, false); >>> >>> + libxl_defbool_setdefault(&b_info->e820_host, false); >>> + >>> switch (b_info->type) { >>> case LIBXL_DOMAIN_TYPE_HVM: >>> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >>> @@ -280,7 +282,6 @@ int >>> libxl__domain_build_info_setdefault(libxl__gc *gc, >>> >>> break; >>> case LIBXL_DOMAIN_TYPE_PV: >>> - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); >>> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >>> b_info->shadow_memkb = 0; >>> if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) >>> diff --git a/tools/libxl/libxl_types.idl >>> b/tools/libxl/libxl_types.idl >>> index 85341a0..fd6389a 100644 >>> --- a/tools/libxl/libxl_types.idl >>> +++ b/tools/libxl/libxl_types.idl >>> @@ -299,6 +299,8 @@ libxl_domain_build_info = >>> Struct("domain_build_info",[ >>> ("irqs", Array(uint32, "num_irqs")), >>> ("iomem", Array(libxl_iomem_range, "num_iomem")), >>> ("claim_mode", libxl_defbool), >>> + # Use host's E820 for PCI passthrough. 
>>> + ("e820_host", libxl_defbool), >>> ("u", KeyedUnion(None, libxl_domain_type, "type", >>> [("hvm", Struct(None, [("firmware", >>> string), >>> ("bios", >>> libxl_bios_type), >>> @@ -345,8 +347,6 @@ libxl_domain_build_info = >>> Struct("domain_build_info",[ >>> ("cmdline", string), >>> ("ramdisk", string), >>> ("features", string, >>> {'const': True}), >>> - # Use host's E820 for PCI >>> passthrough. >>> - ("e820_host", libxl_defbool), >>> ])), >>> ("invalid", Struct(None, [])), >>> ], keyvar_init_val = >>> "LIBXL_DOMAIN_TYPE_INVALID")), >>> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >>> index a78c91d..94515a5 100644 >>> --- a/tools/libxl/libxl_x86.c >>> +++ b/tools/libxl/libxl_x86.c >>> @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, >>> uint32_t domid, >>> struct e820entry map[E820MAX]; >>> libxl_domain_build_info *b_info; >>> >>> - if (d_config == NULL || d_config->c_info.type == >>> LIBXL_DOMAIN_TYPE_HVM) >>> - return ERROR_INVAL; >>> - >>> b_info = &d_config->b_info; >>> - if (!libxl_defbool_val(b_info->u.pv.e820_host)) >>> + if (!libxl_defbool_val(b_info->e820_host)) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >>> __LINE__); >>> return ERROR_INVAL; >>> - >>> + } >>> rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >>> if (rc < 0) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >>> __LINE__); >>> errno = rc; >>> return ERROR_FAIL; >>> } >>> nr = rc; >>> - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> - (b_info->max_memkb - b_info->target_memkb) + >>> - b_info->u.pv.slack_memkb); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, >>> __LINE__, nr); >>> + if (d_config == NULL || d_config->c_info.type =>>> LIBXL_DOMAIN_TYPE_HVM) { >>> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> + (b_info->max_memkb - >>> b_info->target_memkb)); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> + } 
else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { >>> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> + (b_info->max_memkb - >>> b_info->target_memkb) + >>> + b_info->u.pv.slack_memkb); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> + } >>> + >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> if (rc) >>> return ERROR_FAIL; >>> >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, >>> nr:%d",__func__, __LINE__, rc, nr); >>> + >>> rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); >>> >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> if (rc < 0) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> errno = rc; >>> return ERROR_FAIL; >>> } >>> @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, >>> libxl_domain_config *d_config, >>> xc_shadow_control(ctx->xch, domid, >>> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); >>> } >>> >>> - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && >>> - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { >>> + if (libxl_defbool_val(d_config->b_info.e820_host)) { >>> ret = libxl__e820_alloc(gc, domid, d_config); >>> if (ret) { >>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c >>> index ed99622..d98ca24 100644 >>> --- a/tools/libxl/xl_cmdimpl.c >>> +++ b/tools/libxl/xl_cmdimpl.c >>> @@ -1291,11 +1291,7 @@ skip_vfb: >>> if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) >>> pci_permissive = l; >>> >>> - /* To be reworked (automatically enabled) once the auto >>> ballooning >>> - * after guest starts is done (with PCI devices passed in). 
*/ >>> - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { >>> - xlu_cfg_get_defbool(config, "e820_host", >>> &b_info->u.pv.e820_host, 0); >>> - } >>> + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, >>> 0); >>> >>> if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { >>> d_config->num_pcidevs = 0; >>> @@ -1314,7 +1310,7 @@ skip_vfb: >>> d_config->num_pcidevs++; >>> } >>> if (d_config->num_pcidevs && c_info->type == >>> LIBXL_DOMAIN_TYPE_PV) >>> - libxl_defbool_set(&b_info->u.pv.e820_host, true); >>> + libxl_defbool_set(&b_info->e820_host, true); >>> } >>> >>> switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { >>> diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c >>> index a16a025..f34f0ba 100644 >>> --- a/tools/libxl/xl_sxp.c >>> +++ b/tools/libxl/xl_sxp.c >>> @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, >>> libxl_domain_config *d_config) >>> } >>> } >>> >>> + printf("\t(e820_host %s)\n", >>> + libxl_defbool_to_string(b_info->e820_host)); >>> + >>> + >>> printf("\t(image\n"); >>> switch (c_info->type) { >>> case LIBXL_DOMAIN_TYPE_HVM: >>> @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, >>> libxl_domain_config *d_config) >>> printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); >>> printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); >>> printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); >>> - printf("\t\t\t(e820_host %s)\n", >>> - libxl_defbool_to_string(b_info->u.pv.e820_host)); >>> printf("\t\t)\n"); >>> break; >>> default: >>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c >>> index 874742c..4796221 100644 >>> --- a/xen/arch/x86/domain.c >>> +++ b/xen/arch/x86/domain.c >>> @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, >>> unsigned int domcr_flags) >>> { >>> /* 64-bit PV guest by default. 
*/ >>> d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; >>> - >>> - spin_lock_init(&d->arch.pv_domain.e820_lock); >>> } >>> >>> + spin_lock_init(&d->arch.e820_lock); >>> /* initialize default tsc behavior in case tools don't */ >>> tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); >>> spin_lock_init(&d->arch.vtsc_lock); >>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c >>> index 54b1e6a..6c9b58c 100644 >>> --- a/xen/arch/x86/hvm/hvm.c >>> +++ b/xen/arch/x86/hvm/hvm.c >>> @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> >>> switch ( cmd & MEMOP_CMD_MASK ) >>> { >>> - case XENMEM_memory_map: >>> case XENMEM_machine_memory_map: >>> case XENMEM_machphys_mapping: >>> return -ENOSYS; >>> + case XENMEM_memory_map: >>> case XENMEM_decrease_reservation: >>> rc = do_memory_op(cmd, arg); >>> current->domain->arch.hvm_domain.qemu_mapcache_invalidate > >>> 1; >>> @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> >>> switch ( cmd & MEMOP_CMD_MASK ) >>> { >>> - case XENMEM_memory_map: >>> case XENMEM_machine_memory_map: >>> case XENMEM_machphys_mapping: >>> return -ENOSYS; >>> + case XENMEM_memory_map: >>> case XENMEM_decrease_reservation: >>> rc = compat_memory_op(cmd, arg); >>> current->domain->arch.hvm_domain.qemu_mapcache_invalidate > >>> 1; >>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c >>> index e7f0e13..4c3ce9a 100644 >>> --- a/xen/arch/x86/mm.c >>> +++ b/xen/arch/x86/mm.c >>> @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> return rc; >>> } >>> >>> - if ( is_hvm_domain(d) ) >>> - { >>> - rcu_unlock_domain(d); >>> - return -EPERM; >>> - } >>> - >>> e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); >>> if ( e820 == NULL ) >>> { >>> rcu_unlock_domain(d); >>> return -ENOMEM; >>> } >>> - >>> + >>> if ( copy_from_guest(e820, fmap.map.buffer, >>> fmap.map.nr_entries) ) >>> { >>> xfree(e820); >>> 
@@ -4760,11 +4754,11 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> return -EFAULT; >>> } >>> >>> - spin_lock(&d->arch.pv_domain.e820_lock); >>> - xfree(d->arch.pv_domain.e820); >>> - d->arch.pv_domain.e820 = e820; >>> - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_lock(&d->arch.e820_lock); >>> + xfree(d->arch.e820); >>> + d->arch.e820 = e820; >>> + d->arch.nr_e820 = fmap.map.nr_entries; >>> + spin_unlock(&d->arch.e820_lock); >>> >>> rcu_unlock_domain(d); >>> return rc; >>> @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> if ( copy_from_guest(&map, arg, 1) ) >>> return -EFAULT; >>> >>> - spin_lock(&d->arch.pv_domain.e820_lock); >>> + spin_lock(&d->arch.e820_lock); >>> >>> /* Backwards compatibility. */ >>> - if ( (d->arch.pv_domain.nr_e820 == 0) || >>> - (d->arch.pv_domain.e820 == NULL) ) >>> + if ( (d->arch.nr_e820 == 0) || >>> + (d->arch.e820 == NULL) ) >>> { >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return -ENOSYS; >>> } >>> >>> - map.nr_entries = min(map.nr_entries, >>> d->arch.pv_domain.nr_e820); >>> - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, >>> + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); >>> + if ( copy_to_guest(map.buffer, d->arch.e820, >>> map.nr_entries) || >>> __copy_to_guest(arg, &map, 1) ) >>> { >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return -EFAULT; >>> } >>> >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return 0; >>> } >>> >>> diff --git a/xen/include/asm-x86/domain.h >>> b/xen/include/asm-x86/domain.h >>> index d79464d..c3f9f8e 100644 >>> --- a/xen/include/asm-x86/domain.h >>> +++ b/xen/include/asm-x86/domain.h >>> @@ -234,11 +234,6 @@ struct pv_domain >>> >>> /* map_domain_page() mapping cache. 
*/ >>> struct mapcache_domain mapcache; >>> - >>> - /* Pseudophysical e820 map (XENMEM_memory_map). */ >>> - spinlock_t e820_lock; >>> - struct e820entry *e820; >>> - unsigned int nr_e820; >>> }; >>> >>> struct arch_domain >>> @@ -313,6 +308,11 @@ struct arch_domain >>> (possibly other cases in the future > >>> */ >>> uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ >>> uint64_t vtsc_usercount; /* not used for hvm */ >>> + >>> + /* Pseudophysical e820 map (XENMEM_memory_map). */ >>> + spinlock_t e820_lock; >>> + struct e820entry *e820; >>> + unsigned int nr_e820; >>> } __cacheline_aligned; >>> >>> #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) >>> >>> _______________________________________________ >>> Xen-devel mailing list >>> Xen-devel@lists.xen.org >>> http://lists.xen.org/xen-devel >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-05 12:38 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >> diff --git a/tools/firmware/hvmloader/e820.h >> b/tools/firmware/hvmloader/e820.h >> index b2ead7f..2fa700d 100644 >> --- a/tools/firmware/hvmloader/e820.h >> +++ b/tools/firmware/hvmloader/e820.h >> @@ -8,6 +8,9 @@ >> #define E820_RESERVED 2 >> #define E820_ACPI 3 >> #define E820_NVS 4 >> +#define E820_UNUSABLE 5 >> + >> +#define E820MAX 128 >> >> struct e820entry { >> uint64_t addr; > > I don't think we actually need > +#define E820_UNUSABLE 5 > > any more because it is no longer used anywhere > in the patch. Do we need that extra e820 hole type?

You could extend the dump_e820... code to print that type as well

> I guess it's only useful if we want to explicitly > signify that a memory hole is inherited from > the host e820 map, rather than _really_ needed. > Otherwise we could probably just use E820_RESERVED > in its place.

Originally it was used to cover areas that are RAM in the host but won't be RAM in the guest because the amount of memory the guest has is less than the physical amount.

> > Gordan
Gordan Bobic
2013-Sep-05 21:13 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Right, finally got around to trying this with the latest patch. With e820_host=0 things work as before: (XEN) HVM3: BIOS map: (XEN) HVM3: f0000-fffff: Main BIOS (XEN) HVM3: E820 table: (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM I seem to be getting two different E820 table dumps with e820_host=1: (XEN) HVM1: BIOS map: (XEN) HVM1: f0000-fffff: Main BIOS (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries (XEN) HVM1: E820 table: (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM (XEN) HVM1: E820 table: (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 (XEN) HVM1: [04]: 00000000:fc000000 - 
00000001:00000000: RESERVED (XEN) HVM1: Invoking ROMBIOS ... I cannot quite figure out what is going on here - these tables can't both be true. Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and goes more or less contiguously up to 0xfec8b000. Looking at dmesg on domU, the e820 map more or less matches the second dump above. So I guess that should work - the entire IOMEM area of the host is in fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't there be another usable RAM area after 00000001:00000000 ? Gordan
Gordan Bobic
2013-Sep-05 21:29 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:13 PM, Gordan Bobic wrote:> Right, finally got around to trying this with the latest patch. > > With e820_host=0 things work as before: > > (XEN) HVM3: BIOS map: > (XEN) HVM3: f0000-fffff: Main BIOS > (XEN) HVM3: E820 table: > (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > > > I seem to be getting two different E820 table dumps with e820_host=1: > > (XEN) HVM1: BIOS map: > (XEN) HVM1: f0000-fffff: Main BIOS > (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM1: [03]: 00000000:00100000 - 
00000000:a7800000: RAM > (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM1: Invoking ROMBIOS ... > > I cannot quite figure out what is going on here - these tables can't > both be true. > > Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and > goes more or less contiguously up to 0xfec8b000. > > Looking at dmesg on domU, the e820 map more or less matches the second > dump above. > > So I guess that should work - the entire IOMEM area of the host is in > fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't > there be another usable RAM area after 00000001:00000000 ?

I should probably also mention that the domU does in fact see 8GB of RAM, so clearly it is working. The PCI IOMEM reservations on the host are:

# lspci -vvv | grep Region | grep Memory | sed -e 's/.*Memory at //' | sort
a8000000 (64-bit, prefetchable) [disabled] [size=128M]
b0000000 (64-bit, prefetchable) [disabled] [size=64M]
b4000000 (64-bit, prefetchable) [size=64M]
b8000000 (64-bit, prefetchable) [size=128M]
c0000000 (64-bit, prefetchable) [size=256M]
d7efc000 (32-bit, non-prefetchable) [size=16K]
d8000000 (64-bit, non-prefetchable) [size=64K]
dc000000 (64-bit, non-prefetchable) [size=16K]
f3df4000 (64-bit, non-prefetchable) [size=16K]
f3df8000 (32-bit, non-prefetchable) [size=1K]
f3dfa000 (32-bit, non-prefetchable) [size=1K]
f3dfc000 (32-bit, non-prefetchable) [size=2K]
f3dfe000 (64-bit, non-prefetchable) [size=256]
f3edc000 (64-bit, non-prefetchable) [size=16K]
f3fdc000 (64-bit, non-prefetchable) [size=16K]
f4000000 (32-bit, non-prefetchable) [disabled] [size=32M]
f7ffc000 (32-bit, non-prefetchable) [disabled] [size=16K]
f8000000 (32-bit, non-prefetchable) [size=32M]
fbcfc000 (32-bit, non-prefetchable) [size=16K]
fbdfe000 (64-bit, non-prefetchable) [disabled] [size=8K]
fbeef000 (32-bit, non-prefetchable) [size=2K]
fbeefc00 (32-bit, non-prefetchable) [size=16]
fec8a000 (32-bit, non-prefetchable) [size=4K]

What is a little concerning is that my GPU in dom0 has its IOMEM mapped at E0000000-E7FFFFFF E8000000-EBFFFFFF EC000000-EDFFFFFF Granted, this fits into a convenient hole in the host map 0xdc004000-0xf3df4000, but I cannot see that hole being listed as such in the xl dmesg E820 table dump. Is this _really_ working, or is it working by pure luck? Gordan
Gordan Bobic
2013-Sep-05 21:46 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:29 PM, Gordan Bobic wrote:> On 09/05/2013 10:13 PM, Gordan Bobic wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can''t >> both be true. >> >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. >> >> So I guess that should work - the entire IOMEM area of the host is in >> fact not mapped. But since I''ve passed 8GB of RAM to domU, shouldn''t >> there be another usable RAM area after 00000001:00000000 ? > > I should probably also mention that the domU does in fact see 8GB of > RAM, so clearly it is working. > > The PCI IOMEM reservations on the host are: > # lspci -vvv | grep Region | grep Memory | sed -e ''s/.*Memory at //'' | sort > a8000000 (64-bit, prefetchable) [disabled] [size=128M] > b0000000 (64-bit, prefetchable) [disabled] [size=64M] > b4000000 (64-bit, prefetchable) [size=64M] > b8000000 (64-bit, prefetchable) [size=128M] > c0000000 (64-bit, prefetchable) [size=256M] > d7efc000 (32-bit, non-prefetchable) [size=16K] > d8000000 (64-bit, non-prefetchable) [size=64K] > dc000000 (64-bit, non-prefetchable) [size=16K] > f3df4000 (64-bit, non-prefetchable) [size=16K] > f3df8000 (32-bit, non-prefetchable) [size=1K] > f3dfa000 (32-bit, non-prefetchable) [size=1K] > f3dfc000 (32-bit, non-prefetchable) [size=2K] > f3dfe000 (64-bit, non-prefetchable) [size=256] > f3edc000 (64-bit, non-prefetchable) [size=16K] > f3fdc000 (64-bit, non-prefetchable) [size=16K] > f4000000 (32-bit, non-prefetchable) [disabled] [size=32M] > f7ffc000 (32-bit, non-prefetchable) [disabled] [size=16K] > f8000000 (32-bit, non-prefetchable) [size=32M] > fbcfc000 (32-bit, non-prefetchable) [size=16K] > fbdfe000 
(64-bit, non-prefetchable) [disabled] [size=8K] > fbeef000 (32-bit, non-prefetchable) [size=2K] > fbeefc00 (32-bit, non-prefetchable) [size=16] > fec8a000 (32-bit, non-prefetchable) [size=4K] > > > What is a little concerning is that my GPU in dom0 has its IOMEM mapped at > E0000000-E7FFFFFF > E8000000-EBFFFFFF > EC000000-EDFFFFFF > > Granted, this fits into a convenient hole in the host map > 0xdc004000-0xf3df4000 but I cannot see that hole being listed as such in > the xl dmesg E820 table dump. Is this _really_ working, or is it working > by pure luck?

Just doing a bit of testing at the moment. I haven't had a crash yet (it would have happened by now, as things were before). But - I am definitely getting the sort of graphical glitching/corruption in 3D applications that I saw before when assigning > 2688MB of RAM to the domU. That implies that there is still some memory overwriting happening somewhere.

Aaand just as I was typing that - I've just had a crash. :'( Back to the drawing board...

Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 22:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:>Right, finally got around to trying this with the latest patch. > >With e820_host=0 things work as before: > >(XEN) HVM3: BIOS map: >(XEN) HVM3: f0000-fffff: Main BIOS >(XEN) HVM3: E820 table: >(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > > >I seem to be getting two different E820 table dumps with e820_host=1: > >(XEN) HVM1: BIOS map: >(XEN) HVM1: f0000-fffff: Main BIOS >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >(XEN) HVM1: E820 table: >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >(XEN) HVM1: E820 table: >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >(XEN) HVM1: HOLE: 
00000000:a7800000 - 00000000:fc000000 >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >(XEN) HVM1: Invoking ROMBIOS ... > >I cannot quite figure out what is going on here - these tables can't >both be true. >

Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally.

The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done.

>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >goes more or less contiguously up to 0xfec8b000. > >Looking at dmesg on domU, the e820 map more or less matches the second >dump above.

Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

>So I guess that should work - the entire IOMEM area of the host is in >fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't >there be another usable RAM area after 00000001:00000000 ? > >Gordan
Gordan Bobic
2013-Sep-05 22:33 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:13 PM, Gordan Bobic wrote:> I seem to be getting two different E820 table dumps with e820_host=1: > > (XEN) HVM1: BIOS map: > (XEN) HVM1: f0000-fffff: Main BIOS > (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM

I get it - this is the host e820 map. In dom0, dmesg shows:

e820: BIOS-provided physical RAM map:
Xen: [mem 0x0000000000000000-0x000000000009cfff] usable
Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved
Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable
Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data
Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS
Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved
Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved
Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved
Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable

That tallies up with the above map exactly. So far so good. Not sure if the following is relevant, but here it is anyway just in case:

e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
[...]
e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000
e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000
[...]
Zone ranges:
  DMA      [mem 0x00001000-0x00ffffff]
  DMA32    [mem 0x01000000-0xffffffff]
  Normal   [mem 0x100000000-0xcbfffffff]
[...]
e820: [mem 0x40000000-0xfedfffff] available for PCI devices

> (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM1: Invoking ROMBIOS ...

Comparing this to the above, it seems that 9d000-9e000 is marked as reserved in dom0, but RAM in domU. Am I right in thinking that dom0(usable) == domU(RAM) in terms of meaning?

What does "HOLE" actually mean in domU? Does it mean this space is OK to map domU IOMEM into? Or something else? Either way, the full possible clash summary:

dom0: reserved 9d000-9e000
domU: RAM 9d000-9e000

dom0: reserved a0000-dffff
domU: HOLE a0000-dffff

dom0: ACPI data 3f790000-3f79dfff
dom0: ACPI NVS 3f79e000-3f7cffff
dom0: reserved 3f7d0000-3f7dffff
dom0: reserved
domU: RAM 00100000-a7800000

Then there seems to be a hole in dom0: 40000000-fedfffff, which tallies up with the dom0 dmesg output above about it being for the PCI devices, i.e. that's the IOMEM region (from 1GB to a little under 4GB).

But in domU, 40000000-a77fffff is available as RAM.

On the face of it, that's actually fine - my PCI IOMEM mappings show the lowest mapping (according to lspci -vvv) starts at a8000000, which falls into the domU area marked as "HOLE" (a7800000-fc000000). And this does in fact appear to be where domU maps the GPU in both of my VMs:

E0000000-E7FFFFFF
E8000000-EBFFFFFF
EC000000-EDFFFFFF

and this doesn't overlap with any mapped PCI IOMEM according to lspci.
If we assume that anything below a8000000 doesn't actually matter in this case (since if I give up to a8000000 of memory to a domU, everything works absolutely fine indefinitely), I am at a loss to explain what is actually going wrong and why the crash is still occurring - unless some other piece of hardware is having its domU IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is causing a memory overwrite.

I am just not seeing any obvious memory stomp at the moment...

Gordan
Gordan Bobic
2013-Sep-05 22:42 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:> Gordan Bobic <gordan@bobich.net> wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can't >> both be true. >> > > Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally. > > The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done. > > >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. > > Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

/me *facepalms* That indeed explains everything. :)

But having had a thorough look through the memory mappings (see my other long, rambling email), I don't actually see an obvious area where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" part isn't mapped as RAM in domU.

Or to summarize: dom0 PCI IOMEM actually has mappings from a8000000 onward, and giving domU up to that much memory works fine. So the memory stomp must be happening from a8000000 onward. But - the only things above that address in domU are the HOLE up to fc000000 and RESERVED up to ffffffff. So no domU memory is getting mapped into the IOMEM range anyway - which begs the question of what is _actually_ causing the crash. Stuff I haven't yet found in domU getting mapped into the a7800000-fc000000 hole overlapping dom0 IOMEM? SeaBIOS doing something odd in the fc000000-fec8b000 range marked RESERVED in domU? Or am I reading this all wrong?

Gordan
Gordan Bobic
2013-Sep-05 22:45 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:> Gordan Bobic <gordan@bobich.net> wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can't >> both be true. >> > > Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally. > > The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done. > > >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. > > Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

I thought this did that in hvmloader/e820.c:

hypercall_memory_op(XENMEM_memory_map, &op);

Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 23:01 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >> Gordan Bobic <gordan@bobich.net> wrote: >>> Right, finally got around to trying this with the latest patch. >>> >>> With e820_host=0 things work as before: >>> >>> (XEN) HVM3: BIOS map: >>> (XEN) HVM3: f0000-fffff: Main BIOS >>> (XEN) HVM3: E820 table: >>> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >>> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >>> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >>> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >>> >>> >>> I seem to be getting two different E820 table dumps with >e820_host=1: >>> >>> (XEN) HVM1: BIOS map: >>> (XEN) HVM1: f0000-fffff: Main BIOS >>> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >>> (XEN) HVM1: E820 table: >>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >>> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >>> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >>> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >>> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >>> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >>> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >>> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >>> (XEN) HVM1: E820 table: >>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>> 
(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >>> (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >>> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >>> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>> (XEN) HVM1: Invoking ROMBIOS ... >>> >>> I cannot quite figure out what is going on here - these tables can't >>> both be true. >>> >> >> Right. The code just prints the E820 that was constructed because of the >e820_host=1 parameter as the first output. The second one is >what was constructed originally. >> >> The code that would tie in the E820 from the hypercall and alter >how hvmloader sets it up is not yet done. >> >> >>> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >>> goes more or less contiguously up to 0xfec8b000. >>> >>> Looking at dmesg on domU, the e820 map more or less matches the >second >>> dump above. >> >> Right. That is correct, since the patch I sent just outputs stuff. >No real changes to the E820 yet. > >I thought this did that in hvmloader/e820.c: >hypercall_memory_op(XENMEM_memory_map, &op); > >Gordan

No. That just gets the E820 that is stashed in the hypervisor for the guest. A PV guest would use it, but hvmloader does not. This is what would need to be implemented to allow hvmloader to construct the E820 on its own.
Gordan Bobic
2013-Sep-06 12:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> Gordan Bobic <gordan@bobich.net> wrote: >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >>> Gordan Bobic <gordan@bobich.net> wrote: >>>> Right, finally got around to trying this with the latest patch. >>>> >>>> With e820_host=0 things work as before: >>>> >>>> (XEN) HVM3: BIOS map: >>>> (XEN) HVM3: f0000-fffff: Main BIOS >>>> (XEN) HVM3: E820 table: >>>> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>>> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>>> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >>>> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>>> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >>>> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >>>> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>>> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >>>> >>>> >>>> I seem to be getting two different E820 table dumps with >>e820_host=1: >>>> >>>> (XEN) HVM1: BIOS map: >>>> (XEN) HVM1: f0000-fffff: Main BIOS >>>> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >>>> (XEN) HVM1: E820 table: >>>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >>>> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >>>> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >>>> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >>>> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >>>> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >>>> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >>>> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >>>> (XEN) HVM1: E820 table: >>>> (XEN) 
HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>>> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >>>> (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>>> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >>>> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >>>> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>>> (XEN) HVM1: Invoking ROMBIOS ... >>>> >>>> I cannot quite figure out what is going on here - these tables >>>> can't >>>> both be true. >>>> >>> >>> Right. The code just prints the E820 that was constructed because of >>> the >>e820_host=1 parameter as the first output. The second one is >>what was constructed originally. >>> >>> The code that would tie in the E820 from the hypercall and >>> alter >>how hvmloader sets it up is not yet done. >>> >>> >>>> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 >>>> and >>>> goes more or less contiguously up to 0xfec8b000. >>>> >>>> Looking at dmesg on domU, the e820 map more or less matches the >>second >>>> dump above. >>> >>> Right. That is correct, since the patch I sent just outputs stuff. >>No real changes to the E820 yet. >> >>I thought this did that in hvmloader/e820.c: >>hypercall_memory_op(XENMEM_memory_map, &op); >> >>Gordan > > No. That just gets the E820 that is stashed in the hypervisor for > the guest. A PV guest would use it, but hvmloader does not. This is > what would need to be implemented to allow hvmloader to construct the > E820 on its own.

Right. So in hvmloader/e820.c we now have the host-based map in

struct e820entry map[E820MAX];

The rest of the function then goes and constructs the standard HVM e820 map in the passed-in

struct e820entry *e820

So all that needs to happen here is: if e820_host is set, fill e820[] by copying map[] up to hvm_info->low_mem_pgend (or hvm_info->high_mem_pgend if it is set).

I am guessing that SeaBIOS and other existing stuff might break if the host map is just copied in verbatim, so presumably I need to add/dedupe the non-RAM parts of the maps. Is that right? Nothing else needs to happen?

The following questions arise:

1) What to do in case of overlaps? On my specific hardware, the key difference in the end map will be that the hole at:

(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000

will end up being created in domU.

2) Do only the holes need to be pulled from the host, or the entire map? Would hvmloader/seabios/whatever know what to do if passed a map that is different from what they might expect (i.e. different from what the current hvmloader provides)? Or would this be likely to cause extensive further breakages?

3) At the moment I am leaning toward just pulling in the holes from the host e820, mirroring them in domU.

3.1) Marking them as "reserved" would likely fix the problem that was my primary motivation for doing this in the first place. Having said that - with all of the 1GB-3GB space marked as reserved, I'm not sure where the IOMEM would end up mapped in domU - things might just break. If marking the dom0 hole as a hole in domU without ensuring pBAR=vBAR, the PCI device in domU might get mapped where another device is in dom0, which might cause the same problem.

At the moment, I think the expedient thing to do is make domU map holes as per dom0 and ignore other non-RAM areas. This may (by luck) or may not fix my immediate problem (RAM in domU clobbering the host's mapped IOMEM), but at least it would cover the prerequisite hole mapping for the next step, which is vBAR=pBAR.

In light of this, however, depending on the answer to 2) above, it may not be practical for the e820_host option to do what it actually means for HVMs, at least not to the same extent as happens for PV. It would only do a part of it (initial vHOLE=pHOLE, to later be extended to the more specific case of vBAR=pBAR).

Does this sound reasonable?

Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, Sep 05, 2013 at 11:33:18PM +0100, Gordan Bobic wrote:> On 09/05/2013 10:13 PM, Gordan Bobic wrote: > > >I seem to be getting two different E820 table dumps with e820_host=1: > > > >(XEN) HVM1: BIOS map: > >(XEN) HVM1: f0000-fffff: Main BIOS > >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >(XEN) HVM1: E820 table: > >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > > I get it - this is the host e820 map. In dom0, dmesg shows: > > e820: BIOS-provided physical RAM map: > Xen: [mem 0x0000000000000000-0x000000000009cfff] usable > Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved > Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable > Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data > Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS > Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved > Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved > Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved > Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved > Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable > > That tallies up with the above map exactly. So far so good. 
Not sure > if the following is relevant, but here it is anyway just in case: > > e820: update [mem 0x00000000-0x00000fff] usable ==> reserved > e820: remove [mem 0x000a0000-0x000fffff] usable > [...] > e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000 > e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000 > [...] > Zone ranges: > DMA [mem 0x00001000-0x00ffffff] > DMA32 [mem 0x01000000-0xffffffff] > Normal [mem 0x100000000-0xcbfffffff] > [...] > e820: [mem 0x40000000-0xfedfffff] available for PCI devices > > > >(XEN) HVM1: E820 table: > >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >(XEN) HVM1: Invoking ROMBIOS ... > > Comparing this to the above, it seems that 9d000-9e000 is marked as > reserved in dom0, but RAM in domU. Am I right in thinking that > dom0(usable) == domU(RAM) in terms of meaning? > > What does "HOLE" actually mean in domU? Does it mean this space is > OK to map domU IOMEM into? Or something else? Either way, full > possible clash summary: > > dom0: reserved 9d000-9e000 > domU: RAM 9d000-9e000 > > dom0: reserved a0000-dffff > domU: HOLE a0000-dffff > > dom0: ACPI data 3f790000-3f79dfff > dom0: ACPI NVS 3f79e000-3f7cffff > dom0: reserved 3f7d0000-3f7dffff > dom0: reserved.. you are missing a range here.> domU: RAM 00100000-a7800000 > > Then there seems to be a hole in dom0: > 40000000-fedfffff which tallies up with the dom0 dmesg output above > about it being for the PCI devices, i.e. that's the IOMEM region > (from 1GB to a little under 4GB).
> > But in domU, the 40000000-a77fffff is available as RAM. OK, so that is the goal - make hvmloader construct the E820 memory layout and all of its pieces to fit that layout.> > On the face of it, that's actually fine - my PCI IOMEM mappings show > the lowest mapping (according to lspci -vvv) starts at a8000000, <surprise> > which falls into the domU area marked as "HOLE" (a7800000-fc000000). > And this does in fact appear to be where domU maps the GPU in both > of my VMs: > > E0000000-E7FFFFFF > E8000000-EBFFFFFF > EC000000-EDFFFFFF > > and this doesn't overlap with any mapped PCI IOMEM according to lspci. > > If we assume that anything below a8000000 doesn't actually matter in > this case (since if I give up to a8000000 memory to a domU > everything works absolutely fine indefinitely, I am at a loss to Just to make sure I am not leading you astray. You are getting _no_ crashes when you have a guest with 1GB? > explain what is actually going wrong and why the crash is still > occurring - unless some other piece of hardware is having its domU > IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is > causing a memory overwrite. > > I am just not seeing any obvious memory stomp at the moment... Neither am I. > > Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:09 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, Sep 05, 2013 at 11:42:38PM +0100, Gordan Bobic wrote:> On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: > >Gordan Bobic <gordan@bobich.net> wrote: > >>Right, finally got around to trying this with the latest patch. > >> > >>With e820_host=0 things work as before: > >> > >>(XEN) HVM3: BIOS map: > >>(XEN) HVM3: f0000-fffff: Main BIOS > >>(XEN) HVM3: E820 table: > >>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > >>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > >>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > >> > >> > >>I seem to be getting two different E820 table dumps with e820_host=1: > >> > >>(XEN) HVM1: BIOS map: > >>(XEN) HVM1: f0000-fffff: Main BIOS > >>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >>(XEN) HVM1: E820 table: > >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > >>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > >>(XEN) HVM1: E820 table: > >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>(XEN) HVM1: [01]: 00000000:0009e000 - 
00000000:000a0000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>(XEN) HVM1: Invoking ROMBIOS ... > >> > >>I cannot quite figure out what is going on here - these tables can''t > >>both be true. > >> > > > >Right. The code just prints the E820 that was constructed b/c of the e820_host =1 parameter as the first output. Then the second one is what was constructed originally. > > > >The code that would tie in the E820 from the hyper call and the alter how the hvmloader sets it up is not yet done. > > > > > >>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and > >>goes more or less contiguously up to 0xfec8b000. > >> > >>Looking at dmesg on domU, the e820 map more or less matches the second > >>dump above. > > > >Right. That is correct since the patch I sent just outputs stuff. No real changes to the E820 yet. > > /me *facepalms* > > That indeed explains everything. :) > > But having had a thorough look through the memory mappings (see my > other long, rambling email), I don''t actually see an obvious area > where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" > part isn''t mapped as RAM in domU. > > Or to summarize: > dom0 PCI IOMEM actually has mappings from a8000000 onward, and > giving domU up to that much memory works fine. So the memory stomp > must be happening from a8000000 onward. But - the only things above > that address in domU are the HOLE up to fc000000 and RESERVED up to > ffffffff. So no domU memory is getting mapped into the IOMEM range > anyway - which begs the question of what is _actually_ causing the > crash. Stuff I haven''t yet found in domU getting mapped into the > a7800000-fc000000 hole overlapping dom0 IOMEM? 
SeaBIOS doing > something odd in the fc000000-fec8b000 range marked RESERVED in domU? There were some assumptions with that region and that stuff could be stuck in there (like ACPI tables and SMBIOS I think). Perhaps a better question is - are any of the BARs of your card overlapping with the RESERVED range in the domU? Or if you grep through the hvmloader code, are there any addresses that look to be within that range? Incidentally could you send the output of lspci -vvvv from your output in the guest and in dom0 please? Thanks.> > Or am I reading this all wrong? You are on the right track I think. There is some assumption made about the RESERVED and HOLE that I think is conflicting with what the card thinks of. Another way to figure out what is happening is to crank up the verbosity of the driver in the domU. Specifically there is a CONFIG_MMIO_TRACE (or something like that) that will tell you the physical address the PCI cards are using and what it is writing in it. It could help in identifying _where_ the graphic card is writing/reading from. And also the last moment when it wrote something.> > Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:20 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, Sep 06, 2013 at 01:23:19PM +0100, Gordan Bobic wrote:> On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >Gordan Bobic <gordan@bobich.net> wrote: > >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: > >>>Gordan Bobic <gordan@bobich.net> wrote: > >>>>Right, finally got around to trying this with the latest patch. > >>>> > >>>>With e820_host=0 things work as before: > >>>> > >>>>(XEN) HVM3: BIOS map: > >>>>(XEN) HVM3: f0000-fffff: Main BIOS > >>>>(XEN) HVM3: E820 table: > >>>>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>>>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>>>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>>>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>>>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > >>>>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > >>>>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>>>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > >>>> > >>>> > >>>>I seem to be getting two different E820 table dumps with > >>e820_host=1: > >>>> > >>>>(XEN) HVM1: BIOS map: > >>>>(XEN) HVM1: f0000-fffff: Main BIOS > >>>>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >>>>(XEN) HVM1: E820 table: > >>>>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >>>>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >>>>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >>>>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >>>>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >>>>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >>>>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: 
RESERVED > >>>>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > >>>>(XEN) HVM1: E820 table: > >>>>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>>>(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>>>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>>>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >>>>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >>>>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>>>(XEN) HVM1: Invoking ROMBIOS ... > >>>> > >>>>I cannot quite figure out what is going on here - these > >>>>tables can''t > >>>>both be true. > >>>> > >>> > >>>Right. The code just prints the E820 that was constructed b/c > >>>of the > >>e820_host =1 parameter as the first output. Then the second one is > >>what was constructed originally. > >>> > >>>The code that would tie in the E820 from the hyper call and > >>>the alter > >>how the hvmloader sets it up is not yet done. > >>> > >>> > >>>>Looking at the IOMEM on the host, the IOMEM begins at > >>>>0xa8000000 and > >>>>goes more or less contiguously up to 0xfec8b000. > >>>> > >>>>Looking at dmesg on domU, the e820 map more or less matches the > >>second > >>>>dump above. > >>> > >>>Right. That is correct since the patch I sent just outputs stuff. > >>No real changes to the E820 yet. > >> > >>I thought this did that in hvmloader/e820c: > >>hypercall_memory_op ( XENMEM_memory_map, &op); > >> > >>Gordan > > > >No. They just gets the E820 that is stashed in the hypervisor for > >the guest. The PV guest would use it but hvmloader is not. This is > >what would needed to be implemented to allow hvmloader construct the > >E820 on its own. > > Right. 
So in hvmloader/e820.c we now have the host-based map in > struct e820entry map[E820MAX]; > > The rest of the function then goes and constructs the standard HVM > e820 map in the passed-in > struct e820entry *e820 > > So all that needs to happen here is, if e820_host is set, fill e820[] > by copying map[] up to the hvm_info->low_mem_pgend > (or hvm_info->high_mem_pgend if it is set). I am guessing that Right. And then the overflow would be put past 4GB. Or fill in the E820_RAM regions with it.> SeaBIOS and other existing stuff might break if the host map is > just copied in verbatim, so presumably I need to add/dedupe the > non-RAM parts of the maps. Probably. Or tweak SeaBIOS to use your E820. Also you need to figure out where hvmloader constructs the ACPI and SMBIOS tables and make sure they are within the E820_RESERVED regions.> > Is that right? Nothing else needs to happen? HA! You are going to hit some bugs probably :-)> > The following questions arise: > > 1) What to do in case of overlaps? On my specific hardware, > the key difference in the end map will be that the hole at: > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > will end up being created in domU. The hole is also known as the PCI gap or MMIO region. With e820_host in effect you should use the host's layout and its hole placement. That will replicate it and make domU's E820 hole look like the host's.> > 2) Do only the holes need to be pulled from the host or > the entire map? Would hvmloader/seabios/whatever know > what to do if passed a map that is different from what > they might expect (i.e. different from what the current > hvmloader provides)? Or would this be likely to cause > extensive further breakages? I think there are some assumptions made about where the hole starts.
Those would have to be made more dynamic to deal with a different E820 layout.> > 3) At the moment I am leaning toward just pulling in the > holes from the host e820, mirroring them in domU. <nods>> 3.1) Marking them as "reserved" would likely fix the > problem that was my primary motivation for doing this > in the first place. Having said that - with all of That unfortunately will make them neither gaps nor MMIO regions. Meaning the kernel will scream: "You have a BAR in an E820 reserved region! That is bad!", and won't set up the card. The hole needs to be replicated in the guest.> the 1GB-3GB space marked as reserved, I'm not sure where > the IOMEM would end up mapped in domU - things might just > break. If marking the dom0 hole as a hole in domU without > ensuring pBAR=vBAR, the PCI device in domU might get > mapped where another device is in dom0, which might > cause the same problem. Right. hvmloader could (I hadn't checked the code) scan the E820 and determine that the PCI BARs are within the E820_RESERVED regions and try to move them to a hole. Since no hole would be found below 4GB it would remap the PCI BAR above 4GB. That - depending on the device - could be disastrous for the device. That is, if it is only capable of 32-bit DMAs it will never do anything.> > At the moment, I think the expedient thing to do is make > domU map holes as per dom0 and ignore other non-RAM <nods>> areas. This may (by luck) or may not fix my immediate problem > (RAM in domU clobbering host's mapped IOMEM), but at > least it would cover the prerequisite hole mapping for > the next step which is vBAR=pBAR. <nods>> > In light of this, however, depending on the answer to 2) > above, it may not be practical for the e820_host option to do I think it will mean you need to look in the hvmloader directory a bit more and find all of the assumptions it makes about memory locations.
One excellent tool is to do 'git log -p tools/hvmloader' as it will tell you what changes have been done to address the memory layout construction.> what it actually means for HVMs, at least not to the same > extent as happens for PV. It would only do a part of it > (initial vHOLE=pHOLE, to later be extended to the more > specific case of vBAR=pBAR). > > Does this sound reasonable? Yes. I think the plan you outlined is sound. The difficulty is going to be cramming the E820 constructed by e820_host into hvmloader and making sure that all the other parts of it (SMBIOS, ACPI, BIOS) will be more dynamic and use dynamic locations instead of hard-coded values. Loads of printks can help with that :-) The awesome thing is that it will make hvmloader a lot more flexible. And one can extend e820_host to construct an E820 that is bizarre, for testing even more absurd memory layouts (say, no RAM below 4GB). Keep on digging! Thanks for the great analysis.> > Gordan
Gordan Bobic
2013-Sep-06 13:34 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:04:35 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Thu, Sep 05, 2013 at 11:33:18PM +0100, Gordan Bobic wrote: >> On 09/05/2013 10:13 PM, Gordan Bobic wrote: >> >> >I seem to be getting two different E820 table dumps with >> e820_host=1: >> > >> >(XEN) HVM1: BIOS map: >> >(XEN) HVM1: f0000-fffff: Main BIOS >> >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> >(XEN) HVM1: E820 table: >> >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> >> I get it - this is the host e820 map. In dom0, dmesg shows: >> >> e820: BIOS-provided physical RAM map: >> Xen: [mem 0x0000000000000000-0x000000000009cfff] usable >> Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved >> Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable >> Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data >> Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS >> Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved >> Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved >> Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved >> Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved >> Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable >> >> That tallies up with the above map exactly. So far so good. 
Not sure >> if the following is relevant, but here it is anyway just in case: >> >> e820: update [mem 0x00000000-0x00000fff] usable ==> reserved >> e820: remove [mem 0x000a0000-0x000fffff] usable >> [...] >> e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000 >> e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000 >> [...] >> Zone ranges: >> DMA [mem 0x00001000-0x00ffffff] >> DMA32 [mem 0x01000000-0xffffffff] >> Normal [mem 0x100000000-0xcbfffffff] >> [...] >> e820: [mem 0x40000000-0xfedfffff] available for PCI devices >> >> >> >(XEN) HVM1: E820 table: >> >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> >(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >(XEN) HVM1: Invoking ROMBIOS ... >> >> Comparing this to the above, it seems that 9d000-9e000 is marked as >> reserved in dom0, but RAM in domU. Am I right in thinking that >> dom0(usable) == domU(RAM) in terms of meaning? >> >> What does "HOLE" actually mean in domU? Does it mean this space is >> OK to map domU IOMEM into? Or something else? Either way full >> possible chasl summary: >> >> dom0: reserved 9d000-9e000 >> domU: RAM 9d000-9e000 >> >> dom0: reserved a0000-dffff >> domU: HOLE a0000-dffff >> >> dom0: ACPI data 3f790000-3f79dfff >> dom0: ACPI NVS 3f79e000-3f7cffff >> dom0: reserved 3f7d0000-3f7dffff >> dom0: reserved > > > .. you are missing a range here.It wasn''t meant as an exhaustive list, I was only looking at the interesting/overlapping areas.>> domU: RAM 00100000-a7800000 >> >> Then there seems to be a hole in dom0: >> 40000000-fedfffff which talles up with the dom0 dmesg output above >> about it being for the PCI devices, i.e. 
that's the IOMEM region >> (from 1GB to a little under 4GB). >> >> But in domU, the 40000000-a77fffff is available as RAM. > > OK, so that is the goal - make hvmloader construct the E820 memory > layout and all of its pieces to fit that layout. I am actually leaning toward only copying the holes from the host E820. The domU already seems to be successfully using various memory ranges that correspond to reserved and ACPI ranges, so it doesn't look like these are a problem.>> On the face of it, that's actually fine - my PCI IOMEM mappings show >> the lowest mapping (according to lspci -vvv) starts at a8000000, > > <surprise> Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM mapped between 1024M and 2688MB. Hence why I can get away with a domU memory allocation up to 2688MB.
> > Neither am I. I may have pasted the wrong domU e820. I have a sneaky suspicion that this above map was from a domU with 2688MB of RAM assigned, hence why there is no domU RAM in the map above a7800000. I'll re-check when I'm in front of that machine again. Are you OK with the plan to _only_ copy the holes from host E820 to the hvmloader E820? I think this would be sufficient and not cause any undue problems. The only things that would need to change are: 1) Enlarge the domU hole 2) Do something with the top reserved block, starting at RESERVED_MEMBASE=0xFC000000. What is this actually for? It overlaps with the host memory hole which extends all the way up to 0xfee00000. If it must be where it is, this could be problematic. What to do in this case? This does also bring up another question - is there any point in bothering with matching the host holes? I would hazard a guess that no physical hardware is likely to have a memory hole bigger than 3GB under the 4GB limit. So would it perhaps be neater, easier, more consistent and more debuggable to just make the hvmloader put in a hole between 0x40000000-0xffffffff (the whole 3GB) by default? Or is that deemed to be too crippling for 32-bit non-PAE domUs (and are there enough of these around to matter?)? Caveat - this alone wouldn't cover any other weirdness such as the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was this what you were thinking about when asking whether my domUs work OK with 1GB of RAM? Since that is just under the 1GB limit. To clarify, I am not suggesting just hard-coding a 3GB memory hole - I am suggesting defaulting to at least that and then mapping in any additional memory holes as well. My reasoning behind this suggestion is that it would make things more consistent between different (possibly dissimilar) hosts. Gordan
Gordan Bobic
2013-Sep-06 14:09 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:09:06 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Thu, Sep 05, 2013 at 11:42:38PM +0100, Gordan Bobic wrote: >> On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >> >Gordan Bobic <gordan@bobich.net> wrote: >> >>Right, finally got around to trying this with the latest patch. >> >> >> >>With e820_host=0 things work as before: >> >> >> >>(XEN) HVM3: BIOS map: >> >>(XEN) HVM3: f0000-fffff: Main BIOS >> >>(XEN) HVM3: E820 table: >> >>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> >>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> >>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> >> >> >>I seem to be getting two different E820 table dumps with >> e820_host=1: >> >> >> >>(XEN) HVM1: BIOS map: >> >>(XEN) HVM1: f0000-fffff: Main BIOS >> >>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> >>(XEN) HVM1: E820 table: >> >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> >>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> >>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> >>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> >>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> >>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> >>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> >>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> >>(XEN) 
HVM1: E820 table: >> >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >>(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> >>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> >>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >>(XEN) HVM1: Invoking ROMBIOS ... >> >> >> >>I cannot quite figure out what is going on here - these tables >> can''t >> >>both be true. >> >> >> > >> >Right. The code just prints the E820 that was constructed b/c of >> the e820_host =1 parameter as the first output. Then the second one >> is what was constructed originally. >> > >> >The code that would tie in the E820 from the hyper call and the >> alter how the hvmloader sets it up is not yet done. >> > >> > >> >>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 >> and >> >>goes more or less contiguously up to 0xfec8b000. >> >> >> >>Looking at dmesg on domU, the e820 map more or less matches the >> second >> >>dump above. >> > >> >Right. That is correct since the patch I sent just outputs stuff. >> No real changes to the E820 yet. >> >> /me *facepalms* >> >> That indeed explains everything. :) >> >> But having had a thorough look through the memory mappings (see my >> other long, rambling email), I don''t actually see an obvious area >> where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" >> part isn''t mapped as RAM in domU. >> >> Or to summarize: >> dom0 PCI IOMEM actually has mappings from a8000000 onward, and >> giving domU up to that much memory works fine. So the memory stomp >> must be happening from a8000000 onward. But - the only things above >> that address in domU are the HOLE up to fc000000 and RESERVED up to >> ffffffff. 
So no domU memory is getting mapped into the IOMEM range >> anyway - which begs the question of what is _actually_ causing the >> crash. Stuff I haven't yet found in domU getting mapped into the >> a7800000-fc000000 hole overlapping dom0 IOMEM? SeaBIOS doing >> something odd in the fc000000-fec8b000 range marked RESERVED in domU? > > There were some assumptions with that region and that stuff could > be stuck in there (like ACPI tables and SMBIOS I think). > > Perhaps a better question is - are any of the BARs of your card > overlapping > with the RESERVED range in the domU? > > Or if you grep through the hvmloader code are there any > addresses > that look to be within that range? > > Incidentally could you send the output of lspci -vvvv from your > output > in the guest and in dom0 please? Attached. The main point I'm trying to keep in mind here is that this needs to be generic and useful in different hardware cases, not just my own. If it were just about my own hardware and use case I'd have just opted for the approach of the old vBAR-pBAR patch, hard-coded the holes and been done with it.>> Or am I reading this all wrong? > > You are on the right track I think. There is some assumption made > about the RESERVED and HOLE that I think is conflicting with what the > card thinks of. Another way to figure out what is happening is to > crank > up the verbosity of the driver in the domU. Specifically there is > a CONFIG_MMIO_TRACE (or something like that) that will tell you the > physical address the PCI cards are using and what it is writing in > it. > > It could help in identifying _where_ the graphic card is > writing/reading > from. And also the last moment when it wrote something. That's a part of my problem - my domU with a reproducible crash is Windows which is a lot less debuggable. :( I have a Linux domU that I use for figuring out what the domU looks like from the inside, but I don't have a readily usable test-case for reproducing the crash there.
Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 14:32 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
> >>Then there seems to be a hole in dom0: > >>40000000-fedfffff which tallies up with the dom0 dmesg output above > >>about it being for the PCI devices, i.e. that's the IOMEM region > >>(from 1GB to a little under 4GB). > >> > >>But in domU, the 40000000-a77fffff is available as RAM. > > > >OK, so that is the goal - make hvmloader construct the E820 memory > >layout and all of its pieces to fit that layout. > > I am actually leaning toward only copying the holes from the > host E820. The domU already seems to be successfully using various > memory ranges that correspond to reserved and ACPI ranges, so > it doesn't look like these are a problem. OK.> > >>On the face of it, that's actually fine - my PCI IOMEM mappings show > >>the lowest mapping (according to lspci -vvv) starts at a8000000, > > > ><surprise> > > Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM > mapped between 1024M and 2688MB. Hence why I can get away with a > domU memory allocation up to 2688MB. When you say 'IOMEM' you mean /proc/iomem output?> > >>which falls into the domU area marked as "HOLE" (a7800000-fc000000). > >>And this does in fact appear to be where domU maps the GPU in both > >>of my VMs: > >> > >>E0000000-E7FFFFFF > >>E8000000-EBFFFFFF > >>EC000000-EDFFFFFF > >> > >>and this doesn't overlap with any mapped PCI IOMEM according to > >>lspci. > >> > >>If we assume that anything below a8000000 doesn't actually matter in > >>this case (since if I give up to a8000000 memory to a domU > >>everything works absolutely fine indefinitely, I am at a loss to > > > > > >Just to make sure I am not leading you astray. You are getting > >_no_ crashes > >when you have a guest with 1GB? > > I haven't tried limiting a guest to 1GB recently. My PCI passthrough > domUs all have 2688MB assigned, and this works fine. More than that > and they crash eventually. Does that answer your question? Or were > you after something very specific to the 1GB domU case? No no.
I just was too lazy to compute what a800000 came out in decimal.> > >>explain what is actually going wrong and why the crash is still > >>occuring - unless some other piece of hardware is having it''s domU > >>IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is > >>causing a memory overwrite. > >> > >>I am just not seeing any obvious memory stomp at the moment... > > > >Neither am I. > > I may have pasted the wrong domU e820. I have a sneaky suspicion > that this above map was from a domU with 2688MB of RAM assigned, > hence why there is on domU RAM in the map above a7800000. I''ll > re-check when I''m in front of that machine again. > > Are you OK with the plan to _only_ copy the holes from host E820 > to the hvmloader E820? I think this would be sufficient and not > cause any undue problems. The only things that would need to > change are: > 1) Enlarge the domU hole > 2) Do something with the top reserved block, starting at > RESERVED_MEMBASE=0xFC000000. What is this actually for? It > overlaps with the host memory hole which extends all the way up > to 0xfee00000. If it must be where it is, this could be > problematic. What to do in this case?I would do a git log or git annotate to find it. I recall some patches to move that - but I can''t recall the details.> > This does, also bring up another question - is there any point > in bothering with matching the host holes? I would hazard a > guess that no physical hardware is likely to have a memory > hole bigger than 3GB under the 4GB limit.3GB is about the max I have seen.> > So would it perhaps be neater, easier, more consistent and > more debuggable to just make the hvmloader put in a hole > between 0x40000000-0xffffffff (the whole 3GB) by default? > Or is that deemed to be too crippling for 32-bit non-PAE > domUs (and are there enough of these aroudn to matter?)?Correct. 
Also it would wreak havoc when migrating to other hvmloader''s which have a different layout.> > Caveat - this alone wouldn''t cover any other weirdness such as > the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was > this what you were thinking about when asking whether my domUs > work OK with 1GB of RAM? Since that is just under the 1GB > limit.So there are some issues with i915 IGD having to have a ''flush page''. Mainly some non-RAM region that they can tell the IGD to flush its pages. And it had to be non-RAM and somehow via magic IGD registers you can program the physical address in the card - so the card has it remapped to itself. Usually it is some gap (aka hole) that ends has to be faithfully reproduced in the guest. But you are using nvidia and are not playing those nasty tricks.> > To clarify, I am not suggesting just hard coding a 3GB memory > hole - I am suggesting defaulting to at least that and them > mapping in any additional memory holes as well. My reasoning > behind this suggestion is that it would make things more > consistent between different (possibly dissimilar) hosts.Potentially. The other option when thinking about migration and PCI - is to interogate _All_ of the hosts that will be involved in the migration and construct an E820 that covers all the right regions. Then use that for the guests and then you can unplug/plug the PCI devices without much trouble. That is where the e820_host=1 parameter can be used and also some extra code to slurp up an XML of the E820 could be implemented. The 3GB HOLE could do it, but what if the host has some odd layout where the HOLE is above 4GB? Then we are back at remapping. I think Stefano had some thoughts about enlaring the HOLE and it might be good to include him here.> > Gordan
Gordan Bobic
2013-Sep-06 14:45 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:20:50 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Sep 06, 2013 at 01:23:19PM +0100, Gordan Bobic wrote:
>> On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> >Gordan Bobic <gordan@bobich.net> wrote:
>> >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:
>> >>>Gordan Bobic <gordan@bobich.net> wrote:
>> >>>>Right, finally got around to trying this with the latest patch.
>> >>>>
>> >>>>With e820_host=0 things work as before:
>> >>>>
>> >>>>(XEN) HVM3: BIOS map:
>> >>>>(XEN) HVM3: f0000-fffff: Main BIOS
>> >>>>(XEN) HVM3: E820 table:
>> >>>>(XEN) HVM3:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> >>>>(XEN) HVM3:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> >>>>(XEN) HVM3:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> >>>>(XEN) HVM3:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> >>>>(XEN) HVM3:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
>> >>>>(XEN) HVM3:  HOLE: 00000000:e0000000 - 00000000:fc000000
>> >>>>(XEN) HVM3:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM3:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
>> >>>>
>> >>>>I seem to be getting two different E820 table dumps with e820_host=1:
>> >>>>
>> >>>>(XEN) HVM1: BIOS map:
>> >>>>(XEN) HVM1: f0000-fffff: Main BIOS
>> >>>>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries
>> >>>>(XEN) HVM1: E820 table:
>> >>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
>> >>>>(XEN) HVM1:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
>> >>>>(XEN) HVM1:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
>> >>>>(XEN) HVM1:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
>> >>>>(XEN) HVM1:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
>> >>>>(XEN) HVM1:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:fee01000 - 00000000:ffc00000
>> >>>>(XEN) HVM1:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM1:  [07]: 00000001:00000000 - 00000001:68870000: RAM
>> >>>>(XEN) HVM1: E820 table:
>> >>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> >>>>(XEN) HVM1:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> >>>>(XEN) HVM1:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> >>>>(XEN) HVM1:  [03]: 00000000:00100000 - 00000000:a7800000: RAM
>> >>>>(XEN) HVM1:  HOLE: 00000000:a7800000 - 00000000:fc000000
>> >>>>(XEN) HVM1:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM1: Invoking ROMBIOS ...
>> >>>>
>> >>>>I cannot quite figure out what is going on here - these
>> >>>>tables can't both be true.
>> >>>
>> >>>Right. The code just prints the E820 that was constructed b/c
>> >>>of the e820_host=1 parameter as the first output. Then the
>> >>>second one is what was constructed originally.
>> >>>
>> >>>The code that would tie in the E820 from the hypercall and
>> >>>alter how the hvmloader sets it up is not yet done.
>> >>>
>> >>>>Looking at the IOMEM on the host, the IOMEM begins at
>> >>>>0xa8000000 and goes more or less contiguously up to 0xfec8b000.
>> >>>>
>> >>>>Looking at dmesg on domU, the e820 map more or less matches
>> >>>>the second dump above.
>> >>>
>> >>>Right. That is correct since the patch I sent just outputs
>> >>>stuff. No real changes to the E820 yet.
>> >>
>> >>I thought this did that in hvmloader/e820.c:
>> >>hypercall_memory_op ( XENMEM_memory_map, &op);
>> >>
>> >>Gordan
>> >
>> >No. That just gets the E820 that is stashed in the hypervisor
>> >for the guest. The PV guest would use it, but hvmloader does
>> >not. This is what would need to be implemented to allow
>> >hvmloader to construct the E820 on its own.
>>
>> Right. So in hvmloader/e820.c we now have the host-based map in
>> struct e820entry map[E820MAX];
>>
>> The rest of the function then goes and constructs the standard
>> HVM e820 map in the passed-in
>> struct e820entry *e820
>>
>> So all that needs to happen here is: if e820_host is set, fill
>> e820[] by copying map[] up to hvm_info->low_mem_pgend
>> (or hvm_info->high_mem_pgend if it is set). I am guessing that
>
> Right. And then the overflow would be put past 4GB. Or fill in the
> E820_RAM regions with it.
>
>> SeaBIOS and other existing stuff might break if the host map is
>> just copied in verbatim, so presumably I need to add/dedupe the
>> non-RAM parts of the maps.
>
> Probably. Or tweak SeaBIOS to use your E820.

I don't think tweaking SeaBIOS to use a different specific map is
the way forward. As I said in the other email, my motivation is to
make something that will work in the general case, not for the
memory map in my dodgy hardware (I'm sure there are many other
poorly designed bits of hardware out there this might be useful
on ;) ).

> Also you need to figure out where hvmloader constructs the ACPI
> and SMBIOS tables and make sure they are within the E820_RESERVED
> regions.

This doesn't appear to have caused any problems - the only
problematic part is trampling over the host's _mapped_ parts of the
PCI MMIO hole. Having domU RAM everywhere else doesn't _appear_ to
cause any problems, hence why I would like to focus my effort on
making sure that the holes are mapped while breaking nothing else
if at all possible.

>> Is that right? Nothing else needs to happen?
>
> HA! You are going to hit some bugs probably :-)

Hey, some degree of optimism is required for perseverance. ;)

>> The following questions arise:
>>
>> 1) What to do in case of overlaps? On my specific hardware,
>> the key difference in the end map will be that the hole at:
>> (XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
>> will end up being created in domU.
>
> The hole is also known as the PCI gap or MMIO region. With
> e820_host in effect you should use the host's layout and
> its hole placement. That will replicate it and make the
> domU's E820 hole look like the host's.

Hmm... Now there's an idea. I _could_ just hard-code the memory
hole to match that, just to see if it fixes the problem. I rather
expect, however, that this will just move the problem.
Specifically, it is liable to make domU MMIO overlap (without
matching) the dom0 MMIO and crash the host quite spectacularly.
Unless domU decides to map MMIO from the bottom up, in which case
there's 1688MB of MMIO space between 0x40000000 and 0xa8000000
where MMIO will end up in domU, never overlapping the host's map,
and everything will, by pure chance, work just fine from there on.

>> 2) Do only the holes need to be pulled from the host, or
>> the entire map? Would hvmloader/seabios/whatever know
>> what to do if passed a map that is different from what
>> they might expect (i.e. different from what the current
>> hvmloader provides)? Or would this be likely to cause
>> extensive further breakages?
>
> I think there are some assumptions made about where the hole
> starts. Those would have to be made more dynamic to deal
> with a different E820 layout.

Assumptions made by what?

>> 3) At the moment I am leaning toward just pulling in the
>> holes from the host e820, mirroring them in domU.
>
> <nods>
>
>> 3.1) Marking them as "reserved" would likely fix the
>> problem that was my primary motivation for doing this
>> in the first place. Having said that - with all of
>
> That unfortunately will make them neither gaps nor MMIO regions.
> Meaning the kernel will scream: "You have a BAR in an E820
> reserved region! That is bad!", and won't set up the card.

What makes the decision in domU about where to map the PCI
devices' MMIO? SeaBIOS?

> The hole needs to be replicated in the guest.
>
>> the 1GB-3GB space marked as reserved, I'm not sure where
>> the IOMEM would end up mapped in domU - things might just
>> break. If marking the dom0 hole as a hole in domU without
>> ensuring pBAR=vBAR, the PCI device in domU might get
>> mapped where another device is in dom0, which might
>> cause the same problem.
>
> Right. hvmloader could (I hadn't checked the code) scan the
> E820, determine that the PCI BARs are within the E820_RESRV
> region, and try to move them to a hole. Since no hole would be
> found below 4GB it would remap the PCI BAR above 4GB. That -
> depending on the device - could be disastrous for the device.
> That is, if it is only capable of 32-bit DMAs it will never do
> anything.

Nvidia cards have a 32-bit 32MB BAR by default, and two 64-bit
BARs.

Looking at the different maps, I think I see what is actually
happening. In domU, the hole defaults to starting at e0000000, and
this is also where the BARs get mapped for the GPU in domU. That
implies that mirroring the host's hole at 1GB-4GB would actually
likely work (by a fluke), since the BARs would (hopefully) get
mapped at the bottom (plenty of hole before the host's mapping,
1688MB to be exact), and the rest of the hole would never get
touched, stealthily (or obliviously, depending on how you want to
look at it) avoiding trampling over the host's BARs.

OK, I'm convinced - I'll give this a try and see how I get on. :)

>> At the moment, I think the expedient thing to do is make
>> domU map holes as per dom0 and ignore other non-RAM
>
> <nods>
>
>> areas. This may (by luck) or may not fix my immediate problem
>> (RAM in domU clobbering the host's mapped IOMEM), but at
>> least it would cover the prerequisite hole mapping for
>> the next step, which is vBAR=pBAR.
>
> <nods>
>
>> In light of this, however, depending on the answer to 2)
>> above, it may not be practical for the e820_host option to do
>
> I think it will mean you need to look in the hvmloader directory
> a bit more and find all of the assumptions it makes about memory
> locations. One excellent tool is to do
> 'git log -p tools/firmware/hvmloader'
> as it will tell you what changes have been done to address
> the memory layout construction.

I'll have a dig.

>> what it actually means for HVMs, at least not to the same
>> extent as happens for PV. It would only do a part of it
>> (initial vHOLE=pHOLE, to later be extended to the more
>> specific case of vBAR=pBAR).
>>
>> Does this sound reasonable?
>
> Yes. I think the plan you outlined is sound. The difficulty is
> going to be cramming the E820 constructed by e820_host into
> hvmloader and making sure that all the other parts of it (SMBIOS,
> ACPI, BIOS) will be more dynamic and use dynamic locations
> instead of hard-coded values.
>
> Loads of printks can help with that :-)

This is my main concern - that other things are making assumptions
about where the holes are. At the moment it doesn't look too bad,
since the only areas of conflict between (_my_) host and current
hvmloader maps are in the RAM and HOLE areas, so coming up with a
generic solution that will work for my use (and hopefully for most
other people) ought to be fairly simple. Making it actually work in
the edge cases will be harder - but then again, for those cases it
doesn't work at the moment anyway, so erring on the side of
pragmatism may be the correct thing to do here.

> The awesome thing is that it will make hvmloader a lot more
> flexible. And one can extend e820_host to construct a bizarre
> E820 for testing even more absurd memory layouts (say, no RAM
> below 4GB).
>
> Keep on digging! Thanks for the great analysis.

Thanks, I appreciate it. :)

Gordan
Gordan Bobic
2013-Sep-06 16:30 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 10:32:23 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> >>On the face of it, that's actually fine - my PCI IOMEM mappings
>> >>show the lowest mapping (according to lspci -vvv) starts at
>> >>a8000000,
>> >
>> ><surprise>
>>
>> Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM
>> mapped between 1024M and 2688MB. Hence why I can get away with a
>> domU memory allocation up to 2688MB.
>
> When you say 'IOMEM' you mean /proc/iomem output?

I mean what lspci shows WRT where PCI device memory regions are
mapped.

>> >>explain what is actually going wrong and why the crash is still
>> >>occurring - unless some other piece of hardware is having its domU
>> >>IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is
>> >>causing a memory overwrite.
>> >>
>> >>I am just not seeing any obvious memory stomp at the moment...
>> >
>> >Neither am I.
>>
>> I may have pasted the wrong domU e820. I have a sneaking suspicion
>> that the above map was from a domU with 2688MB of RAM assigned,
>> hence why there is no domU RAM in the map above a7800000. I'll
>> re-check when I'm in front of that machine again.
>>
>> Are you OK with the plan to _only_ copy the holes from host E820
>> to the hvmloader E820? I think this would be sufficient and not
>> cause any undue problems. The only things that would need to
>> change are:
>> 1) Enlarge the domU hole
>> 2) Do something with the top reserved block, starting at
>> RESERVED_MEMBASE=0xFC000000. What is this actually for? It
>> overlaps with the host memory hole which extends all the way up
>> to 0xfee00000. If it must be where it is, this could be
>> problematic. What to do in this case?
>
> I would do a git log or git annotate to find it. I recall
> some patches to move that - but I can't recall the details.

Will do. But what could this possibly be for?

>> So would it perhaps be neater, easier, more consistent and
>> more debuggable to just make the hvmloader put in a hole
>> between 0x40000000-0xffffffff (the whole 3GB) by default?
>> Or is that deemed to be too crippling for 32-bit non-PAE
>> domUs (and are there enough of these around to matter?)?
>
> Correct. Also it would wreak havoc when migrating to other
> hvmloaders which have a different layout.

Two points that might be worth making here:
1) domUs with e820_host set aren't migratable anyway (including the
PV ones for which e820_host is currently implemented)
2) All of this is conditional on e820_host=1 being set in the
config. Since legacy hosts won't have this set anyway (since it
isn't implemented, and won't be until this patch set is completed),
surely any notion of backward compatibility for HVMs with
e820_host=1 set is null and void.

Thus - as a first-pass solution that would work in most cases where
this option is useful in the first place, setting the low RAM limit
to the beginning of the first memory hole above 0x100000 (1MB)
should be OK. Leave anything after that unmapped (that seems to be
what shows up as "HOLE" in the dumps) all the way up to
RESERVED_MEMBASE.

That would only leave the question of what it is (if anything) that
uses the memory between RESERVED_MEMBASE and 0xffffffff (4GB), and
under which circumstances. This could be somewhat important because
0xfec8a000 -> +4KB on my machine is actually the Intel I/O APIC. If
it is reserved and nothing uses it, no problem, it can stay as is.
If SeaBIOS or similar is known to write to it under some
circumstances, that could easily be quite crashtastic.

>> Caveat - this alone wouldn't cover any other weirdness such as
>> the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was
>> this what you were thinking about when asking whether my domUs
>> work OK with 1GB of RAM? Since that is just under the 1GB
>> limit.
>
> So there are some issues with i915 IGD having to have a 'flush
> page'. Mainly some non-RAM region that they can tell the IGD
> to flush its pages. And it had to be non-RAM, and somehow
> via magic IGD registers you can program the physical address
> in the card - so the card has it remapped to itself.
>
> Usually it is some gap (aka hole) that has to be
> faithfully reproduced in the guest. But you are using
> nvidia, which does not play those nasty tricks.

Merely a different set of nasty tricks instead. :) But yes, on the
whole, I agree. I will try to get the holes as similar as possible
for a "production"-level patch.

>> To clarify, I am not suggesting just hard-coding a 3GB memory
>> hole - I am suggesting defaulting to at least that and then
>> mapping in any additional memory holes as well. My reasoning
>> behind this suggestion is that it would make things more
>> consistent between different (possibly dissimilar) hosts.
>
> Potentially. The other option when thinking about migration
> and PCI is to interrogate _all_ of the hosts that will be
> involved in the migration and construct an E820 that covers all
> the right regions. Then use that for the guests, and then you
> can unplug/plug the PCI devices without much trouble.

That's possibly a step too far at this point.

> That is where the e820_host=1 parameter can be used, and
> some extra code to slurp up an XML of the E820 could also be
> implemented.
>
> The 3GB hole could do it, but what if the host has some
> odd layout where the hole is above 4GB? Then we are back at
> remapping.

Such a host would also only work with devices that _only_ require
64-bit BARs. But they do exist (e.g. ATI GPUs).

Gordan
Gordan Bobic
2013-Sep-06 19:54 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Here is a test patch I applied to:
tools/firmware/hvmloader/e820.c

==
--- e820.c.orig	2013-09-06 11:15:20.023337321 +0100
+++ e820.c	2013-09-06 19:53:00.141876019 +0100
@@ -79,6 +79,7 @@
     unsigned int nr = 0;
     struct xen_memory_map op;
     struct e820entry map[E820MAX];
+    int e820_host = 0;
     int rc;
 
     if ( !lowmem_reserved_base )
@@ -88,6 +89,7 @@
 
     rc = hypercall_memory_op ( XENMEM_memory_map, &op);
     if ( rc != -ENOSYS) { /* It works!? */
+        e820_host = 1;
         printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__,
                op.nr_entries);
         dump_e820_table(&map[0], op.nr_entries);
     }
@@ -133,7 +135,12 @@
     /* Low RAM goes here. Reserve space for special pages. */
     BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
     e820[nr].addr = 0x100000;
-    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
+
+    if (e820_host)
+        e820[nr].size = 0x3f7e0000 - e820[nr].addr;
+    else
+        e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
+
     e820[nr].type = E820_RAM;
     nr++;
==

I'm sure this doesn't need explicitly pointing out, but for the
record, it is a gross hack just to prove the concept.

The map dump with this patch applied and memory set to 8192 is:

==
(XEN) HVM5: BIOS map:
(XEN) HVM5: f0000-fffff: Main BIOS
(XEN) HVM5: build_e820_table:93 got 8 op.nr_entries
(XEN) HVM5: E820 table:
(XEN) HVM5:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
(XEN) HVM5:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
(XEN) HVM5:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
(XEN) HVM5:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
(XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
(XEN) HVM5:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
(XEN) HVM5:  HOLE: 00000000:40000000 - 00000000:fee00000
(XEN) HVM5:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
(XEN) HVM5:  HOLE: 00000000:fee01000 - 00000000:ffc00000
(XEN) HVM5:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
(XEN) HVM5:  [07]: 00000001:00000000 - 00000002:c0870000: RAM
(XEN) HVM5: E820 table:
(XEN) HVM5:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(XEN) HVM5:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(XEN) HVM5:  HOLE: 00000000:000a0000 - 00000000:000e0000
(XEN) HVM5:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(XEN) HVM5:  [03]: 00000000:00100000 - 00000000:3f7e0000: RAM
(XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:fc000000
(XEN) HVM5:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
(XEN) HVM5:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
(XEN) HVM5: Invoking ROMBIOS ...
==

Good observations:
It works! No crashes, no screen corruption! As an added bonus, it
fixes the problem of rebooting domUs causing them to lose GPU access
and eventually crash the host, even with memory allocation below the
first PCI MMIO block. I am suspecting that something in the
0x3f7e0000-0x3f7e7000 hole that isn't showing up in lspci might be
responsible.

I think that proves beyond any doubt what the problem was before.

Interesting observations:

1) GPU PCI MMIO is still mapped at E0000000, rather than at the
bottom of the memory hole. That implies that SeaBIOS (or whatever
does the mapping) makes assumptions about where the memory hole
begins. This will need to somehow be fixed / made dynamic. What
decides where to map PCI memory for each device?

2) The memory hole size difference counts toward the total guest
memory. I set
memory=8192
maxmem=8192
but Windows in domU only sees 5.48GB. What is particularly odd is
that the missing memory isn't 3GB, but 2.5GB - which implies that,
again, there are other things making assumptions about the size and
shape of the memory hole and moving the memory from the hole
elsewhere to make it usable. What does this?

My todo list, in order of priority (unless somebody here has a
better idea), is:

1) Tidy up the hole enlargement to make it dynamic, based on the
host hole locations. In cases where the host hole overlaps something
other than guest RAM/HOLE (i.e. RESERVED), the guest spec wins.

2) Fix whatever is causing the hole memory increase to reduce the
guest memory. The memory hole is a hole, not a shadow. I need some
pointers on where to look for whatever is responsible for this.

3) Fix what makes decisions on where to map devices' memory
apertures. Ideally, the fix should be to detect the host's pBAR and
make vBAR=pBAR. Again, I need some pointers on where to look for
whatever is responsible for doing this mapping.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-10 13:35 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, Sep 06, 2013 at 08:54:24PM +0100, Gordan Bobic wrote:
> Here is a test patch I applied to:
> tools/firmware/hvmloader/e820.c
>
> [...test patch and E820 dumps snipped...]
>
> My todo list, in order of priority (unless somebody here has a
> better idea), is:
>
> 1) Tidy up the hole enlargement to make it dynamic, based on the
> host hole locations. In cases where the host hole overlaps something
> other than guest RAM/HOLE (i.e. RESERVED), the guest spec wins.

guest spec is .. the default hvmloader behavior?

> 2) Fix whatever is causing the hole memory increase to reduce the
> guest memory. The memory hole is a hole, not a shadow. I need some
> pointers on where to look for whatever is responsible for this.

That is where git log tools/firmware/hvmloader might shed some light.

> 3) Fix what makes decisions on where to map devices' memory
> apertures. Ideally, the fix should be to detect the host's pBAR and
> make vBAR=pBAR. Again, I need some pointers on where to look for
> whatever is responsible for doing this mapping.

That should all be in tools/firmware/hvmloader, I believe - the
'pci_setup' function, where it says:

 /* Assign iomem and ioport resources in descending order of size. */

> Gordan
Gordan Bobic
2013-Sep-10 15:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 10 Sep 2013 09:35:59 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Fri, Sep 06, 2013 at 08:54:24PM +0100, Gordan Bobic wrote: >> Here is a test patch I applied to: >> /tools/firmware/hvmloader/e820.c >> >> ==>> --- e820.c.orig 2013-09-06 11:15:20.023337321 +0100 >> +++ e820.c 2013-09-06 19:53:00.141876019 +0100 >> @@ -79,6 +79,7 @@ >> unsigned int nr = 0; >> struct xen_memory_map op; >> struct e820entry map[E820MAX]; >> + int e820_host = 0; >> int rc; >> >> if ( !lowmem_reserved_base ) >> @@ -88,6 +89,7 @@ >> >> rc = hypercall_memory_op ( XENMEM_memory_map, &op); >> if ( rc != -ENOSYS) { /* It works!? */ >> + e820_host = 1; >> printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >> op.nr_entries); >> dump_e820_table(&map[0], op.nr_entries); >> } >> @@ -133,7 +135,12 @@ >> /* Low RAM goes here. Reserve space for special pages. */ >> BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)); >> e820[nr].addr = 0x100000; >> - e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - >> e820[nr].addr; >> + >> + if (e820_host) >> + e820[nr].size = 0x3f7e0000 - e820[nr].addr; >> + else >> + e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - >> e820[nr].addr; >> + >> e820[nr].type = E820_RAM; >> nr++; >> >> ==>> >> I''m sure this doesn''t need explicitly pointing out, but for the >> record, it is a gross hack just to prove the concept. 
>>
>> The map dump with this patch applied and memory set to 8192 is:
>>
>> ==
>> (XEN) HVM5: BIOS map:
>> (XEN) HVM5: f0000-fffff: Main BIOS
>> (XEN) HVM5: build_e820_table:93 got 8 op.nr_entries
>> (XEN) HVM5: E820 table:
>> (XEN) HVM5:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
>> (XEN) HVM5:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
>> (XEN) HVM5:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
>> (XEN) HVM5:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
>> (XEN) HVM5:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:40000000 - 00000000:fee00000
>> (XEN) HVM5:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:fee01000 - 00000000:ffc00000
>> (XEN) HVM5:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
>> (XEN) HVM5:  [07]: 00000001:00000000 - 00000002:c0870000: RAM
>> (XEN) HVM5: E820 table:
>> (XEN) HVM5:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> (XEN) HVM5:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> (XEN) HVM5:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> (XEN) HVM5:  [03]: 00000000:00100000 - 00000000:3f7e0000: RAM
>> (XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:fc000000
>> (XEN) HVM5:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> (XEN) HVM5:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
>> (XEN) HVM5: Invoking ROMBIOS ...
>> ==
>>
>> Good observations:
>> It works! No crashes, no screen corruption! As an added bonus, it
>> fixes the problem of rebooting domUs causing them to lose GPU access
>> and eventually crash the host even with memory allocation below the
>> first PCI MMIO block. I am suspecting that something in the
>> 0x3f7e0000-0x3f7e7000 hole that isn't showing up on lspci might be
>> responsible.
>>
>> I think that proves beyond any doubt what the problem was before.
>>
>> Interesting observations:
>> 1) GPU PCI MMIO is still mapped at E0000000, rather than at the
>> bottom of the memory hole. That implies that SeaBIOS (or whatever
>> does the mapping) makes assumptions about where the memory hole
>> begins. This will need to somehow be fixed / made dynamic. What
>> decides where to map PCI memory for each device?
>>
>> 2) The memory hole size difference counts toward the total guest
>> memory. I set
>> memory=8192
>> maxmem=8192
>> but Windows in domU only sees 5.48GB. What is particularly odd is
>> that the missing memory isn't 3GB, but 2.5GB - which implies
>> that, again, there are other things making assumptions about the
>> size and shape of the memory hole and moving the memory from the
>> hole elsewhere to make it usable. What does this?
>>
>> My todo list, in order of priority (unless somebody here has a
>> better idea) is:
>> 1) Tidy up the hole enlargement to make it dynamic, based on the
>> host hole locations. In cases where the host hole overlaps something
>> other than guest RAM/HOLE (i.e. RESERVED), guest spec wins.
>
> guest spec is .. the default hvmloader behavior?

Yes, that's exactly what I meant. At least until I can figure out
what necessitates the default HVM behaviour.

>> 2) Fix whatever is causing the hole memory increase to reduce the
>> guest memory. The memory hole is a hole, not a shadow. I need some
>> pointers on where to look for whatever is responsible for this.
>
> That is where "git log tools/firmware/hvmloader" might shed some light.

I grepped for low_mem_pgend and high_mem_pgend, and the only place
where I have found anything is in one place in libxc. Is this what
sets it? Is this common to xm and xl?

>> 3) Fix what makes decisions on where to map devices' memory
>> apertures. Ideally, the fix should be to detect the host's pBAR and make
>> vBAR=pBAR. Again, I need some pointers on where to look for whatever
>> is responsible for doing this mapping.
> That should be all in tools/firmware/hvmloader I believe.
> The 'pci_setup' function, where it says:
> /* Assign iomem and ioport resources in descending order of size. */

Thanks, will take a closer look there.

Gordan
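[Editor's note: on the vBAR=pBAR idea above, the host's physical BARs are visible in dom0 under /sys/bus/pci/devices/<BDF>/resource, one "start end flags" triple per line. The sketch below shows how a toolstack-side helper might collect them before handing hints to hvmloader; parse_resource_line is a hypothetical function, not existing Xen code.]

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse one line of a PCI device's sysfs
 * "resource" file, e.g.
 *   0x00000000e0000000 0x00000000efffffff 0x000000000004220c
 * Returns 0 for a populated BAR, 1 for an empty slot (all zeros),
 * and -1 on a parse error.
 */
static int parse_resource_line(const char *line, uint64_t *start, uint64_t *end)
{
    unsigned long long s, e, flags;

    if (sscanf(line, "%llx %llx %llx", &s, &e, &flags) != 3)
        return -1;
    *start = (uint64_t)s;
    *end   = (uint64_t)e;
    return (s == 0 && e == 0) ? 1 : 0;
}
```

With the physical start/end in hand, the toolstack could in principle pass them down so pci_setup places the guest's BAR at the same address, sidestepping the mismatch between the guest hole and the host hole.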