Gordan Bobic
2013-Jul-23 22:34 UTC
Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU passthrough without crashes. Unfortunately, the same crashes still happen. Massive frame buffer corruption on domU before it locks up solid. It seems the PCI memory stomp is still happening.

I am using qemu-dm, as I did on Xen 4.2.x. So whatever fix for this went into 4.3.0 didn't fix it for me. Passing less than 2GB of RAM to domU still works fine.

I have attached:

qemu-dm log for domU
xl dmesg

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Jul-24 14:08 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
> passthrough without crashes. Unfortunately, the same crashes still
> happen. Massive frame buffer corruption on domU before it locks up
> solid. It seems the PCI memory stomp is still happening.

If you boot Xen with guest_loglvl=all

and then run the guest, the console (xl dmesg) should also have the output from QEMU - that will help in seeing how it constructs the E820 (which was the problem last time).

Are you also able to get the serial log from the guest? (If this is Linux?) I usually have this in my guest config:

serial='pty'

and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug', which will output everything to the 'xl console <guest> | tee /tmp/log'.

> I am using qemu-dm, as I did on Xen 4.2.x.
>
> So whatever fix for this went into 4.3.0 didn't fix it for me.
> Passing less than 2GB of RAM to domU still works fine.
>
> I have attached:
>
> qemu-dm log for domU
> xl dmesg

> domid: 1
> Using file /dev/zvol/ssd/edi in read-write mode
> Watching /local/domain/0/device-model/1/logdirty/cmd
> Watching /local/domain/0/device-model/1/command
> Watching /local/domain/1/cpu
> char device redirected to /dev/pts/3
> qemu_map_cache_init nr_buckets = 10000 size 4194304
> shared page at pfn feffd
> buffered io page at pfn feffb
> Guest uuid = a57e6840-e9f5-4a14-a822-b2cc662c177f
> populating video RAM at ff000000
> mapping video RAM from ff000000
> Register xen platform.
> Done register platform.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> xs_read(/local/domain/0/device-model/1/xen_extended_power_mgmt): read error
> xs_read(): vncpasswd get error. /vm/a57e6840-e9f5-4a14-a822-b2cc662c177f/vncpasswd.
> Log-dirty: no command yet.
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> vcpu-set: watch node error.
> [xenstore_process_vcpu_set_event]: /local/domain/1/cpu has no CPU!
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> xs_read(/local/domain/1/log-throttling): read error
> qemu: ignoring not-understood drive `/local/domain/1/log-throttling'
> medium change watch on `/local/domain/1/log-throttling' - unknown device, ignored
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 00:1a.1 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1a.0x1
> pt_register_regions: IO region registered (size=0x00000020 base_addr=0x00009a01)
> pci_intx: intx=2
> register_real_device: Real physical device 00:1a.1 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 0d:00.0 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0xd:0x0.0x0
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xd7efc000)
> pci_intx: intx=1
> register_real_device: Real physical device 0d:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.0 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x0
> pt_register_regions: IO region registered (size=0x02000000 base_addr=0xf8000000)
> pt_register_regions: IO region registered (size=0x08000000 base_addr=0xb800000c)
> pt_register_regions: IO region registered (size=0x04000000 base_addr=0xb400000c)
> pt_register_regions: IO region registered (size=0x00000080 base_addr=0x0000df81)
> pt_register_regions: Expansion ROM registered (size=0x00080000 base_addr=0xfbd00000)
> pt_msi_setup: msi mapped with pirq 4f
> pci_intx: intx=1
> register_real_device: Real physical device 08:00.0 registered successfuly!
> IRQ type = MSI-INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.1 ...
> register_real_device: Enable MSI translation via per device option
> register_real_device: Disable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x1
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xfbdfc000)
> pt_msi_setup: msi mapped with pirq 4e
> pci_intx: intx=2
> register_real_device: Real physical device 08:00.1 registered successfuly!
> IRQ type = MSI-INTx
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=1
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=1
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=1
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=1
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=1
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=1
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=1
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.
> Unknown PV product 2 loaded in guest
> PV driver build 1
> region type 0 at [ef880000,ef8a0000).
> squash iomem [ef880000, ef8a0000).
> region type 1 at [c180,c1c0).
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_ioport_map: e_phys=ffff pio_base=9a00 len=32 index=4 first_map=0
> pt_pci_write_config: [00:05:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:06:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:08:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=9a00 len=32 index=4 first_map=0
> pt_ioport_map: e_phys=c1e0 pio_base=9a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a4000 maddr=fbdfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a0000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=df80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=df80 len=128 index=5 first_map=0

>  __  __            _  _    _____  ___    _      _  __
>  \ \/ /___ _ __   | || |  |___ / / _ \  / |  ___| |/ /_
>   \  // _ \ '_ \  | || |_   |_ \| | | |__| | / _ \ | '_ \
>   /  \  __/ | | | |__   _| ___) | |_| |__| ||  __/ | (_) |
>  /_/\_\___|_| |_|   |_|(_)____(_)___/ |_(_)___|_|\___/
>
> (XEN) Xen version 4.3.0 (root@shatteredsilicon.net) (gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6)) debug=n Tue Jul 23 14:28:40 BST 2013
> (XEN) Latest ChangeSet:
> (XEN) Bootloader: GNU GRUB 0.97
> (XEN) Command line: noreboot dom0_vcpus_pin
> (XEN) Video information:
> (XEN)  VGA is text mode 80x25, font 8x16
> (XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
> (XEN) Disc information:
> (XEN)  Found 4 MBR signatures
> (XEN)  Found 4 EDD information structures
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009d400 (usable)
> (XEN)  000000000009d400 - 00000000000a0000 (reserved)
> (XEN)  00000000000e0000 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 000000003f790000 (usable)
> (XEN)  000000003f790000 - 000000003f79e000 (ACPI data)
> (XEN)  000000003f79e000 - 000000003f7d0000 (ACPI NVS)
> (XEN)  000000003f7d0000 - 000000003f7e0000 (reserved)
> (XEN)  000000003f7e7000 - 0000000040000000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee01000 (reserved)
> (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 0000000cc0000000 (usable)
> (XEN) ACPI: RSDP 000F9F70, 0024 (r2 ACPIAM)
> (XEN) ACPI: XSDT 3F790100, 0064 (r1 042413 XSDT1438 20130424 MSFT 97)
> (XEN) ACPI: FACP 3F790290, 00F4 (r4 042413 FACP1438 20130424 MSFT 97)
> (XEN) ACPI: DSDT 3F7904F0, 58A3 (r2 1W555 1W555A58 A58 INTL 20051117)
> (XEN) ACPI: FACS 3F79E000, 0040
> (XEN) ACPI: APIC 3F790390, 0118 (r2 042413 APIC1438 20130424 MSFT 97)
> (XEN) ACPI: MCFG 3F7904B0, 003C (r1 042413 OEMMCFG 20130424 MSFT 97)
> (XEN) ACPI: OEMB 3F79E040, 0082 (r1 042413 OEMB1438 20130424 MSFT 97)
> (XEN) ACPI: SRAT 3F79A4F0, 0250 (r2 042413 OEMSRAT 1 INTL 1)
> (XEN) ACPI: HPET 3F79A740, 0038 (r1 042413 OEMHPET 20130424 MSFT 97)
> (XEN) ACPI: DMAR 3F79E0D0, 0120 (r1 AMI OEMDMAR 1 MSFT 97)
> (XEN) ACPI: SSDT 3F7A4C70, 0363 (r1 DpgPmm CpuPm 12 INTL 20051117)
> (XEN) System RAM: 49143MB (50322612kB)
> (XEN) Domain heap initialised DMA width 32 bits
> (XEN) Processor #0 6:12 APIC version 21
> (XEN) Processor #2 6:12 APIC version 21
> (XEN) Processor #4 6:12 APIC version 21
> (XEN) Processor #16 6:12 APIC version 21
> (XEN) Processor #18 6:12 APIC version 21
> (XEN) Processor #20 6:12 APIC version 21
> (XEN) Processor #32 6:12 APIC version 21
> (XEN) Processor #34 6:12 APIC version 21
> (XEN) Processor #36 6:12 APIC version 21
> (XEN) Processor #48 6:12 APIC version 21
> (XEN) Processor #50 6:12 APIC version 21
> (XEN) Processor #52 6:12 APIC version 21
> (XEN) Processor #1 6:12 APIC version 21
> (XEN) Processor #3 6:12 APIC version 21
> (XEN) Processor #5 6:12 APIC version 21
> (XEN) Processor #17 6:12 APIC version 21
> (XEN) Processor #19 6:12 APIC version 21
> (XEN) Processor #21 6:12 APIC version 21
> (XEN) Processor #33 6:12 APIC version 21
> (XEN) Processor #35 6:12 APIC version 21
> (XEN) Processor #37 6:12 APIC version 21
> (XEN) Processor #49 6:12 APIC version 21
> (XEN) Processor #51 6:12 APIC version 21
> (XEN) Processor #53 6:12 APIC version 21
> (XEN) IOAPIC[0]: apic_id 6, version 32, address 0xfec00000, GSI 0-23
> (XEN) IOAPIC[1]: apic_id 7, version 32, address 0xfec8a000, GSI 24-47
> (XEN) Enabling APIC mode: Phys. Using 2 I/O APICs
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Detected 3321.755 MHz processor.
> (XEN) Initing memory sharing.
> (XEN) PCI: Not using MCFG for segment 0000 bus 00-ff
> (XEN) Intel VT-d iommu 0 supported page sizes: 4kB.
> (XEN) Intel VT-d Snoop Control enabled.
> (XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
> (XEN) Intel VT-d Queued Invalidation enabled.
> (XEN) Intel VT-d Interrupt Remapping not enabled.
> (XEN) Intel VT-d Shared EPT tables not enabled.
> (XEN) I/O virtualisation enabled
> (XEN)  - Dom0 mode: Relaxed
> (XEN) Interrupt remapping disabled
> (XEN) Enabled directed EOI with ioapic_ack_old on!
> (XEN) ENABLING IO-APIC IRQs
> (XEN)  -> Using old ACK method
> (XEN) Platform timer is 14.318MHz HPET
> (XEN) Allocated console ring of 64 KiB.
> (XEN) VMX: Supported advanced features:
> (XEN)  - APIC MMIO access virtualisation
> (XEN)  - APIC TPR shadow
> (XEN)  - Extended Page Tables (EPT)
> (XEN)  - Virtual-Processor Identifiers (VPID)
> (XEN)  - Virtual NMI
> (XEN)  - MSR direct-access bitmap
> (XEN)  - Unrestricted Guest
> (XEN) HVM: ASIDs enabled.
> (XEN) HVM: VMX enabled
> (XEN) HVM: Hardware Assisted Paging (HAP) detected
> (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
> (XEN) Brought up 24 CPUs
> (XEN) verify_tsc_reliability: TSC warp detected, disabling TSC_RELIABLE
> (XEN) *** LOADING DOMAIN 0 ***
> (XEN) Xen kernel: 64-bit, lsb, compat32
> (XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1f70000
> (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (XEN)  Dom0 alloc.: 0000000420000000->0000000430000000 (12302085 pages to be allocated)
> (XEN)  Init. ramdisk: 0000000cbbdc3000->0000000cbffff400
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: ffffffff81000000->ffffffff81f70000
> (XEN)  Init. ramdisk: ffffffff81f70000->ffffffff861ac400
> (XEN)  Phys-Mach map: ffffffff861ad000->ffffffff8c029a10
> (XEN)  Start info: ffffffff8c02a000->ffffffff8c02a4b4
> (XEN)  Page tables: ffffffff8c02b000->ffffffff8c090000
> (XEN)  Boot stack: ffffffff8c090000->ffffffff8c091000
> (XEN)  TOTAL: ffffffff80000000->ffffffff8c400000
> (XEN)  ENTRY ADDRESS: ffffffff818091e0
> (XEN) Dom0 has maximum 24 VCPUs
> (XEN) Scrubbing Free RAM: .done.
> (XEN) Initial low memory virq threshold set at 0x4000 pages.
> (XEN) Std. Loglevel: Errors and warnings
> (XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
> (XEN) Xen is relinquishing VGA console.
> (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
> (XEN) Freed 272kB init memory.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
> (XEN) traps.c:2503:d0 Domain attempted WRMSR 00000000000001fc from 0x0000000000000002 to 0x0000000000000000.
Gordan Bobic
2013-Jul-24 14:17 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, 24 Jul 2013 10:08:13 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
>> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
>> passthrough without crashes. Unfortunately, the same crashes still
>> happen. Massive frame buffer corruption on domU before it locks up
>> solid. It seems the PCI memory stomp is still happening.
>
> If you boot Xen with guest_loglvl=all
>
> and then run the guest, the console (xl dmesg) should also have
> the output from QEMU - that will help in seeing how it constructs
> the E820 (which was the problem last time).

I will gather this tonight - apologies, I forgot that I removed the loglvl=all options from my boot config.

> Are you also able to get the serial log from the guest? (If this is
> Linux?) I usually have this in my guest config:
>
> serial='pty'
>
> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
> which will output everything to the 'xl console <guest> | tee
> /tmp/log'.

The intended guest is XP64. I will, however, get a Linux guest up and running with the exact same domU config (apart from the disk volume) for debugging this.

Gordan
Konrad Rzeszutek Wilk
2013-Jul-24 16:06 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 03:17:50PM +0100, Gordan Bobic wrote:
> On Wed, 24 Jul 2013 10:08:13 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Tue, Jul 23, 2013 at 11:34:00PM +0100, Gordan Bobic wrote:
>>> I just built 4.3.0 in order to get > 2GB of RAM in domU with GPU
>>> passthrough without crashes. Unfortunately, the same crashes still
>>> happen. Massive frame buffer corruption on domU before it locks up
>>> solid. It seems the PCI memory stomp is still happening.
>>
>> If you boot Xen with guest_loglvl=all
>>
>> and then run the guest, the console (xl dmesg) should also have
>> the output from QEMU - that will help in seeing how it constructs
>> the E820 (which was the problem last time).
>
> I will gather this tonight - apologies, I forgot that I removed
> the loglvl=all options from my boot config.

Take your time.

>> Are you also able to get the serial log from the guest? (If this is
>> Linux?) I usually have this in my guest config:
>>
>> serial='pty'
>>
>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>> which will output everything to the 'xl console <guest> | tee
>> /tmp/log'.
>
> The intended guest is XP64. I will, however, get a Linux guest up

Ah, I am not actually sure how Linux will work. I hadn't had a chance to test that recently :-(

> and running with the exact same domU config (apart from the disk
> volume) for debugging this.
>
> Gordan
Gordan Bobic
2013-Jul-24 16:14 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>>> Are you also able to get the serial log from the guest? (If this is
>>> Linux?) I usually have this in my guest config:
>>>
>>> serial='pty'
>>>
>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>> which will output everything to the 'xl console <guest> | tee
>>> /tmp/log'.
>>
>> The intended guest is XP64. I will, however, get a Linux guest up
>
> Ah, I am not actually sure how Linux will work. I hadn't had a chance
> to test that recently :-(

As long as it brings up the serial console, that should be sufficient, but working VNC to text console login would be convenient. The main thing I want to find on it is the BAR mapping addresses from lspci, to compare against the e820 map from dmesg.

I wouldn't expect the memory map provided by SeaBIOS and the BAR mappings configured by qemu-dm to differ depending on the domU OS. Or am I wrong here?

If there is any overlap, the problem should be obvious. If there is no overlap, then something even more bizarre is going on, but we can worry about that later. :)

Gordan
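[Editorial note: the BAR-vs-e820 comparison Gordan describes is a plain interval-intersection check. A minimal sketch follows; the BAR placements are the e_phys/len values from the pt_iomem_map lines in the qemu-dm log above, and the RAM ranges are illustrative guest E820 "usable" entries, not taken verbatim from any attached file.]

```python
def overlaps(a, b):
    """True if half-open ranges a=[a0,a1) and b=[b0,b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

# Illustrative guest E820 "usable" RAM ranges (start, end):
ram = [(0x00000000, 0x0009e000), (0x00100000, 0xe0000000)]

# Guest-physical BAR placements from the pt_iomem_map log entries
# (e_phys, e_phys + len):
bars = [(0xe0000000, 0xe0000000 + 0x08000000),  # 128M prefetchable BAR
        (0xe8000000, 0xe8000000 + 0x04000000),  # 64M prefetchable BAR
        (0xec000000, 0xec000000 + 0x02000000)]  # 32M non-prefetchable BAR

# Any (RAM, BAR) pair that intersects indicates a guest-side clash:
clashes = [(r, b) for r in ram for b in bars if overlaps(r, b)]
print(clashes)  # [] - BARs sit above the top of low RAM, no overlap
```

With these numbers the list is empty, which matches the later finding in the thread that the guest-side map itself has no overlaps.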
Konrad Rzeszutek Wilk
2013-Jul-24 16:31 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 05:14:32PM +0100, Gordan Bobic wrote:
> On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
>>>> Are you also able to get the serial log from the guest? (If this is
>>>> Linux?) I usually have this in my guest config:
>>>>
>>>> serial='pty'
>>>>
>>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>>> which will output everything to the 'xl console <guest> | tee
>>>> /tmp/log'.
>>>
>>> The intended guest is XP64. I will, however, get a Linux guest up
>>
>> Ah, I am not actually sure how Linux will work. I hadn't had a chance
>> to test that recently :-(
>
> As long as it brings up the serial console, that should be
> sufficient, but working VNC to text console login would be
> convenient. The main thing I want to find on it is the
> BAR mapping addresses from lspci and compare that to the
> e820 map from dmesg.

I see. That should work for you.

> I wouldn't expect the memory map provided by SeaBIOS and
> the BAR mappings configured by qemu-dm to differ
> depending on the domU OS. Or am I wrong here?

They might. The patches to fix the 2GB limit went in qemu-xen-traditional, meaning you have to use:

device_model_version = 'qemu-xen-traditional'

in your guest config (which I think you are already doing). I don't recall what the situation is with upstream SeaBIOS.

> If there is any overlap, the problem should be obvious.
> If there is no overlap, then something even more
> bizarre is going on, but we can worry about that
> later. :)
Gordan Bobic
2013-Jul-24 17:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/24/2013 05:31 PM, Konrad Rzeszutek Wilk wrote:
>> I wouldn't expect the memory map provided by SeaBIOS and
>> the BAR mappings configured by qemu-dm to differ
>> depending on the domU OS. Or am I wrong here?
>
> They might. The patches to fix the 2GB limit went in qemu-xen-traditional,
> meaning you have to use:
>
> device_model_version = 'qemu-xen-traditional'
>
> in your guest config (which I think you are already doing).

Yes and no. I am using a self-built 4.3.0 rpm based fairly closely on the CRC 4.2.x rpms for EL6. This includes a patch to only build qemu-dm and make it the default, which, presumably, means that I don't have to explicitly specify device_model_version. But maybe I'm wrong. I'll try specifying it explicitly and see if that helps.

Gordan
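[Editorial note: for readers following along, the options discussed so far combine into an xl guest config along these lines. This is a hedged sketch, not a config from the thread: the memory size, disk line, and vga setting are placeholders; the pci BDFs are the passed-through devices from the qemu-dm log.]

```
builder               = 'hvm'
memory                = 8192
device_model_version  = 'qemu-xen-traditional'   # the DM carrying the 2GB fix
serial                = 'pty'                    # guest serial -> xl console
vga                   = 'stdvga'
disk                  = [ 'phy:/dev/zvol/ssd/edi,hda,w' ]
pci                   = [ '08:00.0', '08:00.1', '0d:00.0', '00:1a.1' ]
```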
Gordan Bobic
2013-Jul-24 22:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB (screen corruption + domU crash + sometimes dom0 crashing with it).

I can see in the xl-dmesg log in the 8GB case that there is memory remapping going on to allow for the lowmem MMIO hole, but it doesn't seem to help.

I will get a Linux VM up and running tomorrow and get a comparison of domU BARs vs. e820 map.
George Dunlap
2013-Jul-25 19:18 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
> (screen corruption + domU crash + sometimes dom0 crashing with it).
>
> I can see in the xl-dmesg log in the 8GB case that there is memory remapping
> going on to allow for the lowmem MMIO hole, but it doesn't seem to help.

Gordan,

There's a possibility that it's actually got nothing to do with relocation, but with bugs in your hardware. Can you try:

* Set the guest memory to 3600
* Boot the guest, and check that xl dmesg shows memory was *not* relocated
* Report whether it crashes

If it's a bug in the hardware, I would expect to see that memory was not relocated, but that the system will lock up anyway.

Can you also do lspci -vvv in dom0 before assigning the device and attach the output?

The hardware bug we've seen is this: In order for the IOMMU to work properly, *all* DMA transactions must be passed up to the root bridge so the IOMMU can translate the addresses from guest address to host address. Unfortunately, an awful lot of bridges will not do this properly, which means that the address is not translated properly, which means that if a *guest* memory address overlaps a *host* MMIO range, badness ensues. There's nothing we can do about this in Xen other than make the guest MMIO hole the same size as the host MMIO hole.

Thanks,
 -George
Gordan Bobic
2013-Jul-25 21:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Attached are:

domU-2GB dmesg, lspci
domU-8GB dmesg, lspci
map-2GB - memory map, e820 + PCI
map-8GB - memory map, e820 + PCI

There are no overlaps. In fact, the map is identical with 2040MB and 8192MB, except for the top usable range being bigger. So according to this, there _shouldn't_ be any memory clobbering going on within domU. Which leads on to what George said earlier, which I will reply to in a separate email.

What puzzles me, however, is that I thought that in 4.3.0 all 64-bit BARs should automatically be re-mapped to memory > 4GB, and that doesn't appear to be happening here. Or is the remapping only happening if there is not enough 32-bit space for all the BARs?

Gordan

On 07/24/2013 05:31 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 24, 2013 at 05:14:32PM +0100, Gordan Bobic wrote:
>> On Wed, 24 Jul 2013 12:06:39 -0400, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>>
>>>>> Are you also able to get the serial log from the guest? (If this is
>>>>> Linux?) I usually have this in my guest config:
>>>>>
>>>>> serial='pty'
>>>>>
>>>>> and when Linux boots up I add 'console=ttyS0,115200 loglevel=8 debug'
>>>>> which will output everything to the 'xl console <guest> | tee
>>>>> /tmp/log'.
>>>>
>>>> The intended guest is XP64. I will, however, get a Linux guest up
>>>
>>> Ah, I am not actually sure how Linux will work. I hadn't had a chance
>>> to test that recently :-(
>>
>> As long as it brings up the serial console, that should be
>> sufficient, but working VNC to text console login would be
>> convenient. The main thing I want to find on it is the
>> BAR mapping addresses from lspci and compare that to the
>> e820 map from dmesg.
>
> I see. That should work for you.
>
>> I wouldn't expect the memory map provided by SeaBIOS and
>> the BAR mappings configured by qemu-dm to differ
>> depending on the domU OS. Or am I wrong here?
>
> They might. The patches to fix the 2GB limit went in qemu-xen-traditional,
> meaning you have to use:
>
> device_model_version = 'qemu-xen-traditional'
>
> in your guest config (which I think you are already doing).
>
> I don't recall what the situation is with upstream SeaBIOS.
>
>> If there is any overlap, the problem should be obvious.
>> If there is no overlap, then something even more
>> bizarre is going on, but we can worry about that
>> later. :)
Gordan Bobic
2013-Jul-25 21:48 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/25/2013 08:18 PM, George Dunlap wrote:
> On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
>> (screen corruption + domU crash + sometimes dom0 crashing with it).
>>
>> I can see in the xl-dmesg log in the 8GB case that there is memory remapping
>> going on to allow for the lowmem MMIO hole, but it doesn't seem to help.
>
> There's a possibility that it's actually got nothing to do with
> relocation, but with bugs in your hardware.

That wouldn't surprise me at all, unfortunately. :(

> Can you try:
> * Set the guest memory to 3600
> * Boot the guest, and check that xl dmesg shows memory was *not* relocated
> * Report whether it crashes

xl dmesg from booting a Linux domU with 3600MB is attached. The crash is never immediate; both Linux and Windows boot fine. But when a large 3D application like a game loads, there is frame buffer corruption immediately visible, and the domU will typically lock up some seconds later. Infrequently, it will take the host down with it.

> If it's a bug in the hardware, I would expect to see that memory was
> not relocated, but that the system will lock up anyway.

That is indeed what seems to happen - the memory map looks OK, with no overlaps between PCI memory and ROM ranges and the usable or reserved e820 regions.

> Can you also do lspci -vvv in dom0 before assigning the device and
> attach the output?

I have attached it, but not before assigning - I'll need to reboot for that. Do you expect there to be a difference in mapping in dom0 before and after assigning the device to domU?

> The hardware bug we've seen is this: In order for the IOMMU to work
> properly, *all* DMA transactions must be passed up to the root bridge
> so the IOMMU can translate the addresses from guest address to host
> address. Unfortunately, an awful lot of bridges will not do this
> properly, which means that the address is not translated properly,
> which means that if a *guest* memory address overlaps a *host*
> MMIO range, badness ensues.

Hmm, looking at xl dmesg vs dom0 lspci, that does appear to be the case:

xl dmesg:

(XEN) HVM24: E820 table:
(XEN) HVM24:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(XEN) HVM24:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(XEN) HVM24:  HOLE: 00000000:000a0000 - 00000000:000e0000
(XEN) HVM24:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(XEN) HVM24:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
(XEN) HVM24:  HOLE: 00000000:e0000000 - 00000000:fc000000
(XEN) HVM24:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
(XEN) HVM24:  [05]: 00000001:00000000 - 00000001:00800000: RAM

lspci:

08:00.0 VGA compatible controller: nVidia Corporation GF100
        Region 0: Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=32M]
        Region 1: Memory at b8000000 (64-bit, prefetchable) [disabled] [size=128M]
        Region 3: Memory at b4000000 (64-bit, prefetchable) [disabled] [size=64M]

Unless I'm reading this wrong, it means that physical GPU region 0 is in the domU reserved area, and GPU regions 1 and 3 are in the domU RAM area.

b4000000 = 2880MB

So in theory, that might mean that I should be able to get away with up to 2880MB of RAM for domU without encountering frame buffer corruption and the crash. I will test this shortly.

> There's nothing we can do about this in
> Xen other than make the guest MMIO hole the same size as the host MMIO
> hole.

Not sure I follow. Do you mean make it so that pBAR = vBAR?

Gordan
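[Editorial note: the arithmetic behind the "up to 2880MB" conclusion above can be checked directly from the dom0 lspci values quoted in the message; the hex-to-MiB conversion is the only step involved.]

```python
# Host-side BAR bases of the passed-through GPU, from dom0 lspci:
bar1 = 0xb8000000  # Region 1, 128M prefetchable
bar3 = 0xb4000000  # Region 3, 64M prefetchable (lowest host BAR)
bar0 = 0xf8000000  # Region 0, 32M non-prefetchable (in guest RESERVED area)

MiB = 2**20
print(bar3 // MiB)        # 2880 - so host MMIO starts at 2880MB
print(0xe0000000 // MiB)  # 3584 - top of guest low RAM per the E820 table

# Guest RAM pages in [2880MB, 3584MB) share guest-physical addresses with
# the host GPU BARs: any DMA that bypasses the IOMMU translation will hit
# the device instead of RAM, which is the failure mode George describes.
```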
Gordan Bobic
2013-Jul-25 22:23 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/25/2013 10:48 PM, Gordan Bobic wrote:
> On 07/25/2013 08:18 PM, George Dunlap wrote:
>> On Wed, Jul 24, 2013 at 11:15 PM, Gordan Bobic <gordan@bobich.net> wrote:
>>> Attached are the logs (loglvl=all) and configs for 2GB (working) and 8GB
>>> (screen corruption + domU crash + sometimes dom0 crashing with it).
>>>
>>> I can see in the xl-dmesg log in the 8GB case that there is memory
>>> remapping going on to allow for the lowmem MMIO hole, but it doesn't
>>> seem to help.
>>
>> There's a possibility that it's actually got nothing to do with
>> relocation, but with bugs in your hardware.
>
> That wouldn't surprise me at all, unfortunately. :(
>
>> Can you try:
>>  * Set the guest memory to 3600
>>  * Boot the guest, and check to make sure that xl dmesg shows it does
>>    *not* relocate memory?
>>  * Report whether it crashes?
>
> xl dmesg from booting a Linux domU with 3600MB is attached.
> The crash is never immediate; both Linux and Windows boot fine. But when
> a large 3D application like a game loads, there is frame buffer
> corruption immediately visible, and the domU will typically lock up some
> seconds later. Infrequently, it will take the host down with it.
>
>> If it's a bug in the hardware, I would expect to see that memory was
>> not relocated, but that the system will lock up anyway.
>
> That is indeed what seems to happen - the memory map looks OK, with no
> overlaps between the PCI memory and ROM ranges and the usable or
> reserved e820 regions.
>
>> Can you also do lspci -vvv in dom0 before assigning the device and
>> attach the output?
>
> I have attached it, but not before assigning - I'll need to reboot for
> that. Do you expect there to be a difference in mapping in dom0 before
> and after assigning the device to domU?
>
>> The hardware bug we've seen is this: In order for the IOMMU to work
>> properly, *all* DMA transactions must be passed up to the root bridge
>> so the IOMMU can translate the addresses from guest address to host
>> address. Unfortunately, an awful lot of bridges will not do this
>> properly, which means that the address is not translated properly,
>> which means that if a *guest* memory address overlaps a *host*
>> MMIO range, badness ensues.
>
> Hmm, looking at xl dmesg vs dom0 lspci, that does appear to be the case:
>
> xl dmesg:
> (XEN) HVM24: E820 table:
> (XEN) HVM24:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
> (XEN) HVM24:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
> (XEN) HVM24:  HOLE: 00000000:000a0000 - 00000000:000e0000
> (XEN) HVM24:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
> (XEN) HVM24:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
> (XEN) HVM24:  HOLE: 00000000:e0000000 - 00000000:fc000000
> (XEN) HVM24:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
> (XEN) HVM24:  [05]: 00000001:00000000 - 00000001:00800000: RAM
>
> lspci:
> 08:00.0 VGA compatible controller: nVidia Corporation GF100
>         Region 0: Memory at f8000000 (32-bit, non-prefetchable)
>         [disabled] [size=32M]
>         Region 1: Memory at b8000000 (64-bit, prefetchable) [disabled]
>         [size=128M]
>         Region 3: Memory at b4000000 (64-bit, prefetchable) [disabled]
>         [size=64M]
>
> Unless I'm reading this wrong, it means that physical GPU region 0 is in
> the domU reserved area, and GPU regions 1 and 2 are in the domU RAM area.
>
> b4000000 = 2880MB

Correction - my other GPU has a BAR mapped lower, at 0xa8000000, which
is 2688MB. So I upped my memory mapping to 2688MB, and lo and behold,
that doesn't crash and games work just fine without the frame buffer
getting corrupted.

Now, if I am understanding the basic nature of the problem correctly,
this _could_ be worked around by ensuring that vBAR = pBAR, since in
that case there is no room for the mis-mapped memory overwrites to
occur. Is that correct?

I guess I could test this easily enough by applying the vBAR = pBAR hack.

Gordan
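[Editorial note: the MB figures used above convert directly from the hex BAR bases; a throwaway arithmetic check, nothing Xen-specific:]

```python
# Check the MB figures quoted above: a BAR base address, read as a
# guest-physical RAM ceiling, converts to MiB by dividing by 2**20.

def addr_to_mib(addr):
    return addr // (1 << 20)

print(addr_to_mib(0xb4000000))  # lowest BAR of the first GPU  -> 2880
print(addr_to_mib(0xa8000000))  # lowest BAR of the other GPU  -> 2688
```

So the lowest-mapped BAR across both GPUs (0xa8000000) gives the 2688MB ceiling that was found to work.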
Ian Campbell
2013-Jul-26 00:21 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
> Now, if I am understanding the basic nature of the problem correctly,
> this _could_ be worked around by ensuring that vBAR = pBAR since in
> that case there is no room for the mis-mapped memory overwrites to
> occur. Is that correct?

AIUI (which is not very well...) it's not so much vBAR=pBAR but making
the guest e820 (memory map) have the same MMIO holes as the host, so
that there can't be any clash between v- or p-BAR and RAM in the guest.

> I guess I could test this easily enough by applying the vBAR = pBAR hack.

Does the e820_host=1 option help? That might be PV only though, I can't
remember...

Ian.
Andrew Bobulsky
2013-Jul-26 01:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>> Now, if I am understanding the basic nature of the problem correctly,
>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>> that case there is no room for the mis-mapped memory overwrites to
>> occur. Is that correct?
>
> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
> the guest e820 (memory map) have the same MMIO holes as the host, so
> that there can't be any clash between v- or p-BAR and RAM in the guest.
>
>> I guess I could test this easily enough by applying the vBAR = pBAR hack.
>
> Does the e820_host=1 option help? That might be PV only though, I can't
> remember...

Alas, yes. The man pages list it under "PV Guest Specific Options":
http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html

You got my hopes up! ;)

Carry on! I'll be sitting here metaphorically munching popcorn with
anticipation :P

-Andrew
Gordan Bobic
2013-Jul-26 09:23 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 26 Jul 2013 01:21:24 +0100, Ian Campbell
<ian.campbell@citrix.com> wrote:
> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>> Now, if I am understanding the basic nature of the problem correctly,
>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>> that case there is no room for the mis-mapped memory overwrites to
>> occur. Is that correct?
>
> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
> the guest e820 (memory map) have the same MMIO holes as the host, so
> that there can't be any clash between v- or p-BAR and RAM in the guest.

Sure, I understand that - but unless I am overlooking something,
vBAR=pBAR implicitly ensures that.

The question, then, is what happens in the null translation instance.
Specifically, if the PCIe bridge/router is broken (and NF200 is, it
seems), it would imply that when the driver talks to the device, the
operation will get sent to the vBAR (=pBAR, i.e. straight to the
hardware). This then gets translated to the pBAR. But - with a broken
bridge, and vBAR=pBAR, the MMIO request hits the pBAR directly from the
guest. Does it then still get intercepted by the hypervisor, translated
(null operation), and re-transmitted? If so, this would lead to the card
receiving everything twice, resulting either in things outright breaking
or going half as fast at best.

Now, all this could be a good thing or a bad thing, depending on how
exactly you spin it. If the bridge is broken and doesn't route all the
way back to the root bridge, this could actually be a performance
optimizing feature. If we set vBAR=pBAR and disable any translation
thereafter, this avoids the overhead of passing everything to/from the
root PCIe bridge, and we can just directly DMA everything. I'm sure
there are security implications here, but since NF200 doesn't do PCIe
ACS either, any concept of security goes out the window pre-emptively.
So, my question is:

1) If vBAR = pBAR, does the hypervisor still do any translation? I
presume it does, because it expects the traffic to pass up from the
root bridge, to the hypervisor and then back, to ensure security. If
indeed it does do this, where could I optionally disable it, and is
there an easy-to-follow bit of example code for how to plumb in a boot
parameter option for this?

2) Further, I'm finding myself motivated to write that auto-set (as
opposed to hard-coded) vBAR=pBAR patch discussed briefly a week or so
ago (have an init script read the BAR info from dom0 and put it in
xenstore, plus a patch to make the pBAR=vBAR reservations built
dynamically rather than statically, based on this data). Now, I'm quite
fluent in C, but my familiarity with the Xen source code is nearly
non-existent (limited to studying an old unsupported patch every now
and then in order to make it apply to a more recent code release). Can
anyone help me out with a high-level view WRT where this would be best
plumbed in (which files, and the flow of control between the affected
files)?

The added bonus of this (if it can be made to work) is that it might
just make unmodified GeForce cards work, too, which probably makes it
worthwhile on its own.

>> I guess I could test this easily enough by applying the vBAR = pBAR
>> hack.
>
> Does the e820_host=1 option help? That might be PV only though, I
> can't remember...

Thanks for pointing this one out, I just found this post in the
archives:
http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html

With a broken PCIe router, would I also need iommu=soft?

Gordan
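[Editorial note: for the init-script half of idea 2), dom0's kernel already exposes each device's regions in /sys/bus/pci/devices/<BDF>/resource as lines of "start end flags" in hex, which is easier to parse than lspci output. The sketch below only shows that parsing step; the xenstore key path in the comment is invented for illustration, not an existing interface.]

```python
# Sketch of the "init script reads BAR info from dom0" half of the idea
# discussed above. Linux exposes each PCI device's regions in
# /sys/bus/pci/devices/<BDF>/resource as lines of "start end flags" in
# hex; unused slots are all zeros. The xenstore side is only hinted at in
# a comment, since the key layout would be part of the proposed patch.

def parse_resource(text):
    """Parse the sysfs 'resource' file format into (start, end, flags)
    tuples, skipping empty slots."""
    bars = []
    for line in text.splitlines():
        start, end, flags = (int(field, 16) for field in line.split())
        if start or end:
            bars.append((start, end, flags))
    return bars

# Example input in the sysfs format, using the GF100 BARs quoted earlier
# in the thread (the flags values here are made up for illustration).
sample = """\
0x00000000f8000000 0x00000000f9ffffff 0x0000000000040200
0x00000000b8000000 0x00000000bfffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000b4000000 0x00000000b7ffffff 0x000000000014220c
"""

for start, end, flags in parse_resource(sample):
    # A real init script would then do something along the lines of:
    #   xenstore-write /local/domain/0/pci-bars/<BDF>/<n> "<start>,<end>"
    # (hypothetical key path; the toolstack side would define the layout)
    print(f"{start:#x}-{end:#x} ({(end - start + 1) >> 20} MiB)")
```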
Gordan Bobic
2013-Jul-26 09:28 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, 25 Jul 2013 21:15:10 -0400, Andrew Bobulsky
<rulerof@gmail.com> wrote:
> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
> <ian.campbell@citrix.com> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>
>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>> hack.
>>
>> Does the e820_host=1 option help? That might be PV only though, I
>> can't remember...
>
> Alas, yes. The man pages list it under "PV Guest Specific Options":
> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html

Now that is interesting - if this makes the memory holes the same
between the guest and the host, does it also implicitly make vBAR=pBAR?

Gordan
Gordan Bobic
2013-Jul-26 13:11 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 26 Jul 2013 10:28:12 +0100, Gordan Bobic <gordan@bobich.net> wrote:
> On Thu, 25 Jul 2013 21:15:10 -0400, Andrew Bobulsky
> <rulerof@gmail.com> wrote:
>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>> <ian.campbell@citrix.com> wrote:
>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>> that case there is no room for the mis-mapped memory overwrites to
>>>> occur. Is that correct?
>>>
>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>
>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>> hack.
>>>
>>> Does the e820_host=1 option help? That might be PV only though, I
>>> can't remember...
>>
>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>
> Now that is interesting - if this makes the memory holes the same
> between the guest and the host, does it also implicitly make vBAR=pBAR?

Another thing occurred to me that might be useful to check - it is
pretty easy to modify the BAR size on Nvidia cards. The defaults are
64MB and 128MB for the two BARs. They can be made much, much larger, and
there is often an advantage to enlarging them to at least be equal to
the VRAM size. Soooooo... if I boost the BAR from 128MB to 2GB, it being
a 64-bit BAR, that might make the BIOS do the sane thing and map it
above 4GB. With the other BAR also suitably enlarged, and the same done
on the second GPU as well, there is no obvious option but to map them
above 4GB (unless the BIOS is broken, which it may well be, in which
case all bets are off).

Which may just alleviate the memory issue, if not completely fix the
problem.

Will try this and see what happens.

Gordan
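[Editorial note: the reasoning above holds up arithmetically - the low MMIO hole in the guest map quoted earlier (0xe0000000-0xfc000000) is far too small for an enlarged BAR, so a sane BIOS has nowhere to place a 2GB 64-bit BAR except above 4GB. A quick check, plain arithmetic and no Xen code:]

```python
# Why enlarging the BARs should force a >4GB mapping: the low MMIO hole
# in the guest map quoted earlier runs from 0xe0000000 to 0xfc000000.
# A 2GB 64-bit BAR cannot possibly fit inside it.

hole_start, hole_end = 0xe0000000, 0xfc000000
hole_mib = (hole_end - hole_start) >> 20
print(hole_mib)  # only 448 MiB of low MMIO space

bar_size = 2 << 30  # a 2GB BAR
fits_low = bar_size <= (hole_end - hole_start)
print(fits_low)  # False: the only place left is above 4GB
```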
Konrad Rzeszutek Wilk
2013-Jul-28 10:26 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Andrew Bobulsky <rulerof@gmail.com> wrote:
> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com>
> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>
>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>> hack.
>>
>> Does the e820_host=1 option help? That might be PV only though, I
>> can't remember...
>
> Alas, yes. The man pages list it under "PV Guest Specific Options":
> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>
> You got my hopes up! ;)
>
> Carry on! I'll be sitting here metaphorically munching popcorn with
> anticipation :P
>
> -Andrew

We could implement that for HVM guests too. But I am not sure about the
consequences of this for migration (say you unplug the device beforehand
and then migrate to another host which has a different E820). That part
requires a bit of pondering.
Gordan Bobic
2013-Jul-28 21:24 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
> Andrew Bobulsky <rulerof@gmail.com> wrote:
>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell <ian.campbell@citrix.com>
>> wrote:
>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>> that case there is no room for the mis-mapped memory overwrites to
>>>> occur. Is that correct?
>>>
>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>
>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>> hack.
>>>
>>> Does the e820_host=1 option help? That might be PV only though, I
>>> can't remember...
>>
>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>
>> You got my hopes up! ;)
>>
>> Carry on! I'll be sitting here metaphorically munching popcorn with
>> anticipation :P
>
> We could implement that for HVM guests too. But I am not sure about the
> consequences of this for migration (say you unplug the device beforehand
> and then migrate to another host which has a different E820). That part
> requires a bit of pondering.

Just out of interest, what happens in the case where PV guests get
migrated with e820_host=1 set?

Gordan
Konrad Rzeszutek Wilk
2013-Jul-28 23:17 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
Gordan Bobic <gordan@bobich.net> wrote:
> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>> <ian.campbell@citrix.com> wrote:
>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>> occur. Is that correct?
>>>>
>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>
>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>> hack.
>>>>
>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>> can't remember...
>>>
>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>
>>> You got my hopes up! ;)
>>>
>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>> anticipation :P
>>
>> We could implement that for HVM guests too. But I am not sure about
>> the consequences of this for migration (say you unplug the device
>> beforehand and then migrate to another host which has a different
>> E820). That part requires a bit of pondering.
>
> Just out of interest, what happens in the case where PV guests get
> migrated with e820_host=1 set?
>
> Gordan

We disallow it (I think?) as there is no way we can guarantee the E820
map. I guess your point is that since we disallow this on PV with this
parameter, there is not much difference in allowing an HVM guest with
this.
Gordan Bobic
2013-Jul-28 23:30 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/29/2013 12:17 AM, Konrad Rzeszutek Wilk wrote:
> Gordan Bobic <gordan@bobich.net> wrote:
>> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>>> <ian.campbell@citrix.com> wrote:
>>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>>> occur. Is that correct?
>>>>>
>>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>>
>>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>>> hack.
>>>>>
>>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>>> can't remember...
>>>>
>>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>>
>>>> You got my hopes up! ;)
>>>>
>>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>>> anticipation :P
>>>
>>> We could implement that for HVM guests too. But I am not sure about
>>> the consequences of this for migration (say you unplug the device
>>> beforehand and then migrate to another host which has a different
>>> E820). That part requires a bit of pondering.
>>
>> Just out of interest, what happens in the case where PV guests get
>> migrated with e820_host=1 set?
>>
>> Gordan
>
> We disallow it (I think?) as there is no way we can guarantee the
> E820 map. I guess your point is that since we disallow this on
> PV with this parameter there is not much difference in allowing
> HVM guest with this.

That is indeed where I was pondering going with this, yes - apply the
same restriction in the HVM case that exists in the PV case.

Regarding the e820_host=1 case, which of the following is true:

1) The dom0 BAR areas are simply reserved/holes, and the domU still maps
   its own BARs elsewhere in the memory space?
2) domU is free to map BARs into any of the host E820 map holes of
   appropriate size?
3) vBAR=pBAR?
4) Other?

Thanks.

Gordan
Ian Campbell
2013-Jul-29 09:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Sun, 2013-07-28 at 19:17 -0400, Konrad Rzeszutek Wilk wrote:
> Gordan Bobic <gordan@bobich.net> wrote:
>> On 07/28/2013 11:26 AM, Konrad Rzeszutek Wilk wrote:
>>> Andrew Bobulsky <rulerof@gmail.com> wrote:
>>>> On Thu, Jul 25, 2013 at 8:21 PM, Ian Campbell
>>>> <ian.campbell@citrix.com> wrote:
>>>>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>>>>> Now, if I am understanding the basic nature of the problem correctly,
>>>>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>>>>> that case there is no room for the mis-mapped memory overwrites to
>>>>>> occur. Is that correct?
>>>>>
>>>>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>>>>> the guest e820 (memory map) have the same MMIO holes as the host, so
>>>>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>>>>>
>>>>>> I guess I could test this easily enough by applying the vBAR = pBAR
>>>>>> hack.
>>>>>
>>>>> Does the e820_host=1 option help? That might be PV only though, I
>>>>> can't remember...
>>>>
>>>> Alas, yes. The man pages list it under "PV Guest Specific Options":
>>>> http://xenbits.xen.org/docs/unstable/man/xl.cfg.5.html
>>>>
>>>> You got my hopes up! ;)
>>>>
>>>> Carry on! I'll be sitting here metaphorically munching popcorn with
>>>> anticipation :P
>>>
>>> We could implement that for HVM guests too. But I am not sure about
>>> the consequences of this for migration (say you unplug the device
>>> beforehand and then migrate to another host which has a different
>>> E820). That part requires a bit of pondering.
>>
>> Just out of interest, what happens in the case where PV guests get
>> migrated with e820_host=1 set?
>>
>> Gordan
>
> We disallow (I think?) as there is no way we can guarantee the E820
> map. I guess your point is that since we disallow this on PV with
> this parameter there is not much difference in allowing HVM guest with
> this.

Yes, I don't think it is unreasonable to disallow migration when
hardware-specific workarounds have been applied (which is really what
e820_host is, for either PV or HVM).

Ian.
Ian Campbell
2013-Jul-29 11:14 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, 2013-07-26 at 10:23 +0100, Gordan Bobic wrote:
> On Fri, 26 Jul 2013 01:21:24 +0100, Ian Campbell
> <ian.campbell@citrix.com> wrote:
>> On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
>>> Now, if I am understanding the basic nature of the problem correctly,
>>> this _could_ be worked around by ensuring that vBAR = pBAR since in
>>> that case there is no room for the mis-mapped memory overwrites to
>>> occur. Is that correct?
>>
>> AIUI (which is not very well...) it's not so much vBAR=pBAR but making
>> the guest e820 (memory map) have the same MMIO holes as the host, so
>> that there can't be any clash between v- or p-BAR and RAM in the guest.
>
> Sure, I understand that - but unless I am overlooking something,
> vBAR=pBAR implicitly ensures that.

Not quite, because you need to ensure that guest RAM and guest MMIO
space do not overlap. So setting vBAR=pBAR is not sufficient; you also
need to ensure that there is no RAM at those addresses.

Depending on your PCI bus topology/hardware functionality, it may be
sufficient to only ensure the memory map is the same as the host, so
long as the vBARs all fall within the MMIO regions. On other systems you
may require vBAR=pBAR in addition to that. Obviously doing both is most
likely to work.

> The question, then, is what happens in the null translation instance.
> Specifically, if the PCIe bridge/router is broken (and NF200 is, it
> seems), it would imply that when the driver talks to the device, the
> operation will get sent to the vBAR (=pBAR, i.e. straight to the
> hardware). This then gets translated to the pBAR. But - with a
> broken bridge, and vBAR=pBAR, the MMIO request hits the pBAR
> directly from the guest. Does it then still get intercepted by
> the hypervisor, translated (null operation), and re-transmitted?
> If so, this would lead to the card receiving everything twice,
> resulting either in things outright breaking or going half as
> fast at best.

AIUI the issue is not so much with a device seeing an IO access twice,
but with two devices seeing the same IO access (one sees it translated,
the other untranslated) and thinking it is for them, and who "wins" when
such shadowing occurs, which will differ depending on which device (or
the host CPU) is doing the IO.

It is not the hypervisor which is intercepting and translating, but the
hardware. A single bit of hardware should never see things twice.

Perhaps a diagram (intended to be more illustrative than "real"):

            CPU
             |
        MMU & IOMMU
             |
             |                  RAM
BUS 1:       `---+---------------'
                 |
               BRIDGE
                 |
BUS 2:           `--- BUS 2 -------------
                        |           |
                     DEVICE A    DEVICE B

vBAR->pBAR translation happens at the IOMMU. So if the CPU accesses a
RAM address, it will be translated by the MMU and go to the correct
address in RAM.

Lets assume that the bridge knows that accesses it forwards on need to
be translated. So if DEVICE A tries to access RAM then the BRIDGE will
translate things (by talking to the IOMMU) and the access will again go
to the right place. Likewise if the CPU tries to talk to DEVICE A then
the MMIO accesses will be translated and go to the right place.

However, lets imagine DEVICE B happens to have a pBAR which is the same
as the memory which DEVICE A is trying to access. Lets also assume that
the BRIDGE has a bug which would allow DEVICE B to see DEVICE A's
accesses directly instead of laundering them via the IOMMU (perhaps it
is really a shared bus like I've drawn it rather than a PCI-e thing with
lanes etc).

So now DEVICE A's memory access could be seen and acted on by both the
RAM (translated, probably) and DEVICE B. Weirdness will ensue; perhaps
the DMA read done via DEVICE A gets serviced by DEVICE B and not RAM, or
maybe the DMA write causes a side effect in DEVICE B. Furthermore, the
"winner" might even be different for an access from DEVICE A vs an
access from the CPU etc.

This is something vaguely like the real bug, but only vaguely, because
my understanding of the real bug is a bit vague. I hope it is
illustrative of the sort of issue we are talking about.

> Now, all this could be a good thing or a bad thing, depending on
> how exactly you spin it. If the bridge is broken and doesn't
> route all the way back to the root bridge, this could actually be
> a performance optimizing feature. If we set vBAR=pBAR and disable
> any translation thereafter, this avoids the overhead of passing
> everything to/from the root PCIe bridge, and we can just directly
> DMA everything.

I'm not sure how much perf overhead there is in practice since ISTR that
the translations can be cached in the bridge and need explicit flushing
etc when they are modified. Obviously there will be some overhead but I
don't think it will be anything like doubling the traffic.

> I'm sure there are security implications here, but since NF200
> doesn't do PCIe ACS either, any concept of security goes out
> the window pre-emptively.
>
> So, my question is:
> 1) If vBAR = pBAR, does the hypervisor still do any translation?

I would assume so.

> I presume it does because it expects the traffic to pass up
> from the root bridge, to the hypervisor and then back, to
> ensure security.

NB: Not to the hypervisor (software) but to some bit of hardware which
interprets a table provided by the hypervisor.

> If indeed it does do this, where could I
> optionally disable it, and is there an easy to follow bit of
> example code for how to plumb in a boot parameter option for
> this?

I'm afraid I've no clue... Perhaps if you started from the hypercall
which the toolstacks use to plumb stuff through you would be able to
trace it down? XEN_DOMCTL_memory_mapping perhaps?
(I'm wary of saying too much because there is every chance I am sending
you on some wild goose chase)

> 2) Further, I'm finding myself motivated to write that
> auto-set (as opposed to hard coded) vBAR=pBAR patch discussed
> briefly a week or so ago (have an init script read the BAR
> info from dom0 and put it in xenstore, plus a patch to
> make pBAR=vBAR reservations built dynamically rather than
> statically, based on this data. Now, I'm quite fluent in C,
> but my familiarity with Xen source code is nearly non-existent
> (limited to studying an old unsupported patch every now and then
> in order to make it apply to a more recent code release).
> Can anyone help me out with a high level view WRT where
> this would be best plumbed in (which files and the flow of
> control between the affected files)?

I'm not sure, but the places I would start are the bits of libxc which
call things like XEN_DOMCTL_memory_mapping and the bits of libxl which
call into them.

It would also be worth looking at the PCI setup code in hvmloader
(tools/firmware/hvmloader/); I have a feeling that is where the code
responsible for PCI BAR allocation/layout within the guest's memory map
lives.

Perhaps you might want to implement a mode where libxl/libxc end up
writing the desired vBAR (==pBAR, in your case) values into xenstore for
hvmloader to pick up and implement. Not being a maintainer for that area
I'm not sure if that would be acceptable or not.

> The added bonus of this (if it can be made to work) is that
> it might just make unmodified GeForce cards work, too,
> which probably makes it worthwhile on its own.
>
> >> I guess I could test this easily enough by applying the vBAR = pBAR
> >> hack.
> >
> > Does the e820_host=1 option help? That might be PV only though, I
> > can't remember...
>
> Thanks for pointing this one out, I just found this post in the
> archives:
> http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html
>
> With a broken PCIe router, would I also need iommu=soft?

I'm not sure that isn't also a PV only thing. Sorry :-/

> Gordan
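[Editorial note: Ian's bus diagram can be restated as a toy model. The code below is purely illustrative, like the diagram itself - it routes a single DMA from DEVICE A and shows how a bridge that skips the IOMMU lets a peer device's pBAR "shadow" the access. The addresses reuse the GF100 BAR values from earlier in the thread; the translation table entry is made up.]

```python
# Toy model of the bridge bug sketched in the diagram above (illustrative
# only). A correct bridge launders every upstream access through the
# IOMMU; the buggy one lets a peer device claim an untranslated address
# that happens to match its physical BAR.

IOMMU = {0xb8000000: 0x40000000}  # guest address -> host RAM address (made up)

DEVICE_B_BAR = (0xb8000000, 0xb8000000 + (128 << 20))  # peer device's pBAR

def access(addr, bridge_translates):
    """Route a DMA issued by DEVICE A: return which component services it."""
    if bridge_translates:
        addr = IOMMU.get(addr, addr)          # correct bridge: translate first
    elif DEVICE_B_BAR[0] <= addr < DEVICE_B_BAR[1]:
        return "DEVICE B"                     # bug: peer claims the raw access
    return "RAM"

print(access(0xb8000000, bridge_translates=True))   # RAM (translated)
print(access(0xb8000000, bridge_translates=False))  # DEVICE B (shadowed)
```

The same model also shows why capping guest RAM below the lowest pBAR works: if no guest RAM address ever coincides with a peer pBAR, the buggy untranslated path has nothing to shadow.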
Konrad Rzeszutek Wilk
2013-Jul-29 18:04 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
> So, my question is: > 1) If vBAR = pBAR, does the hypervisor still do any translation? > I presume it does because it expects the traffic to pass up > from the root bridge, to the hypervisor and then back, to > ensure security. If indeed it does do this, where could I > optionally disable it, and is there an easy to follow bit of > example code for how to plumb in a boot parameter option for > this?It should.> > 2) Further, I''m finding myself motivated to write that > auto-set (as opposed to hard coded) vBAR=pBAR patch discussed > briefly a week or so ago (have an init script read the BAR > info from dom0 and put it in xenstore, plus a patch to > make pBAR=vBAR reservations built dynamically rather than > statically, based on this data. Now, I''m quite fluent in C, > but my familiarity with Xen soruce code is nearly non-existant > (limited to studying an old unsupported patch every now and then > in order to make it apply to a more recent code release). > Can anyone help me out with a high level view WRT where > this would be best plumbed in (which files and the flow of > control between the affected files)?hvmloader probably and the libxl e820 code. What from a high view needs to happen is that: 1). Need to relax the check in libxl for e820_hole to also do it for HVM guests. Said code just iterates over the host E820 and sanitizes it a bit and makes a E820 hypercall to set it for the guest. 2). Figure out whether the E820 hypercall (which sets the E820 layout for a guest) can be run on HVM guests. I think it could not and Mukesh in his PVH patches posted a patch to enable that - "..Move e820 fields out of pv_domain struct" 2). Hvmloader should do an E820 get machine memory hypercall to see if there is anything there. If there is - that means the toolstack has request a "new" type of E820. Iterate over the E820 and make it look like that. You can look in the Linux arch/x86/xen/setup.c to see how it does that. 
The complication there is that hvmloader needs to fit the ACPI
code (the guest-type one) and such. Presumably you can just re-use
the existing spaces that the host has marked as E820_RESERVED or
E820_ACPI. Then the SMBIOS would need to move, and the BIOS might
need to be relocated - but I think those are relocatable in some form.

> The added bonus of this (if it can be made to work) is that
> it might just make unmodified GeForce cards work, too,
> which probably makes it worthwhile on its own.

Well, I am more than happy to help you with this.

> >> I guess I could test this easily enough by applying the vBAR
> >> pBAR hack.
> >
> > Does the e820_host=1 option help? That might be PV only though, I
> > can't remember...
>
> Thanks for pointing this one out, I just found this post in the
> archives:
> http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html
>
> With a broken PCIe router, would I also need iommu=soft?

No. iommu=soft is not needed with recent pvops Linux kernels.
But broken PCIe routers don't have much to do with the kernel -
it is the hypervisor's decision whether to allow a guest (either
PV or HVM) to have said device.

> Gordan
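The first step Konrad describes - iterating over the host E820 and sanitizing it into a guest map - can be sketched roughly as follows. All type and function names here are illustrative stand-ins, not the actual libxl/Xen definitions (the real E820 types live in Xen's public headers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for Xen's E820 types. */
#define E820_RAM      1
#define E820_RESERVED 2
#define E820_ACPI     3
#define E820_MAX      32

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Build a guest E820 that mirrors the host's holes: copy every
 * non-RAM entry verbatim, and re-emit RAM entries clamped to the
 * guest's memory size.  Returns the number of guest entries.
 * This is only a sketch of the "iterate and sanitize" step, not
 * the actual libxl code.
 */
static unsigned mirror_host_holes(const struct e820entry *host, unsigned nr,
                                  uint64_t guest_ram, struct e820entry *out)
{
    unsigned i, n = 0;
    uint64_t ram_left = guest_ram;

    for (i = 0; i < nr && n < E820_MAX; i++) {
        if (host[i].type != E820_RAM) {
            out[n++] = host[i];            /* keep the hole where it is */
        } else if (ram_left) {
            uint64_t sz = host[i].size < ram_left ? host[i].size : ram_left;
            out[n] = host[i];
            out[n].size = sz;              /* shrink RAM to fit the guest */
            ram_left -= sz;
            n++;
        }
    }
    return n;
}
```

The real code additionally has to keep the map sorted and merged, and leave room for the ACPI/SMBIOS regions hvmloader places.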
George Dunlap
2013-Jul-31 17:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> Now that is interesting - if this makes the memory holes the same between
>> the guest and the host, does it also implicitly vBAR=pBAR?
>
> Another thing that occurred to me might be useful to check - it is
> pretty easy to modify the BAR size on Nvidia cards. The defaults are
> 64MB and 128MB for the two BARs. They can be made much, much larger,
> and there is often an advantage to enlarging them to at least be equal
> to VRAM size. Soooooo... If I boost the BAR from 128MB to 2GB, it
> being a 64-bit BAR, it might make the BIOS do the sane thing and map
> it above 4GB. With the other BAR also suitably enlarged, and it being
> done on the second GPU as well, there is no obvious option but to map
> them above 4GB (unless the BIOS is broken, which it may well be, in
> which case all bets are off).
>
> Which may just alleviate the memory issue, if not completely fix
> the problem.
>
> Will try this and see what happens.

I believe XenServer has a patch that allows the toolstack (in this
case xapi) to set the default size of the MMIO hole. Andrew, did that
ever make it upstream?

Unfortunately, it is unlikely to work with upstream qemu until we fix
the memory relocation issue...

 -George
Andrew Cooper
2013-Jul-31 17:56 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 31/07/13 18:53, George Dunlap wrote:
> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
> I believe XenServer has a patch that allows the toolstack (in this
> case xapi) to set the default size of the MMIO hole. Andrew, did that
> ever make it upstream?
>
> Unfortunately, it is unlikely to work with upstream qemu until we fix
> the memory relocation issue...
>
> -George

I believe it did - the patch does not exist in our patch queue any more.

~Andrew
Gordan Bobic
2013-Jul-31 19:35 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/31/2013 06:53 PM, George Dunlap wrote:
> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
> I believe XenServer has a patch that allows the toolstack (in this
> case xapi) to set the default size of the MMIO hole. Andrew, did that
> ever make it upstream?
>
> Unfortunately, it is unlikely to work with upstream qemu until we fix
> the memory relocation issue...

Interesting you should mention something like this. I've been pondering
whether it might be easier (even if it is a bodge) to simply always set
the domU E820 map to have 0x80000000 - 0xFFFFFFFF (2GB->4GB) reserved.
I have not yet seen a motherboard that maps 32-bit BARs below 2GB.

Note: Admittedly, I haven't tested what happens when you have multiple
Nvidia cards, each with a 1GB 32-bit BAR; I fully expect weirdness. And
Nvidia cards can have the 32-bit BAR0 up to 2GB in size! But I cannot
see a good reason to use such a configuration, since it's the 64-bit
BAR1 (up to 64GB in size) that provides the direct VRAM mapping.

Anyway, if the whole 2GB->4GB area was reserved, then presumably Xen
would map the 32-bit BARs below 2GB, which, provided there's enough
memory for the OS kernel to load and the BARs, shouldn't be a problem
(I cannot think of a sane case where this wouldn't hold). 64-bit BARs
can get re-mapped somewhere sky-high in the domU address space; a
non-broken BIOS implementation (of which there seem to be fewer than
I'd like to believe) would probably set those just above the size of
RAM in the machine, so 2^48 minus the BAR size would possibly be a safe
place to map them.

Yes, I know it's a bodge. Yes, I know it wouldn't solve the GeForce
passthrough problem. Yes, host E820 with vBAR = pBAR (possibly without
IOMMU involvement) would be an awesome feature to have. But the bodge
of just punching a 2GB hole at 2GB might just be a lot easier to
implement as a quick fix.

Gordan
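The "bodge" described above - always reserving the 2GB->4GB range and relocating any displaced RAM above 4GB - would produce an E820 along these lines. The types and the function are illustrative only, not actual Xen code:

```c
#include <assert.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2

/* Illustrative stand-in for Xen's E820 entry type. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

#define HOLE_START 0x80000000ULL   /* 2 GiB */
#define HOLE_END   0x100000000ULL  /* 4 GiB */

/*
 * Given the guest's RAM size, emit an E820 with the whole 2GB->4GB
 * range reserved, relocating any RAM that would have landed in the
 * hole to above 4GB.  Returns the number of entries written.
 */
static unsigned punch_2g_hole(uint64_t ram_bytes, struct e820entry *out)
{
    unsigned n = 0;
    uint64_t low = ram_bytes < HOLE_START ? ram_bytes : HOLE_START;

    out[n++] = (struct e820entry){ 0, low, E820_RAM };
    out[n++] = (struct e820entry){ HOLE_START, HOLE_END - HOLE_START,
                                   E820_RESERVED };
    if (ram_bytes > low)   /* remainder goes above 4 GiB */
        out[n++] = (struct e820entry){ HOLE_END, ram_bytes - low, E820_RAM };
    return n;
}
```

A guest given 3GB of RAM would end up with 2GB below the hole and 1GB starting at the 4GB boundary - which is exactly why non-PAE 32-bit guests (as George notes later in the thread) could not see the relocated memory.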
Gordan Bobic
2013-Jul-31 19:36 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 07/31/2013 06:56 PM, Andrew Cooper wrote:
> On 31/07/13 18:53, George Dunlap wrote:
>> On Fri, Jul 26, 2013 at 2:11 PM, Gordan Bobic <gordan@bobich.net> wrote:
[snip]
>> I believe XenServer has a patch that allows the toolstack (in this
>> case xapi) to set the default size of the MMIO hole. Andrew, did that
>> ever make it upstream?
>>
>> Unfortunately, it is unlikely to work with upstream qemu until we fix
>> the memory relocation issue...
>
> I believe it did - the patch does not exist in our patch queue any more.

Can anyone point me at the relevant commit / docs on this patch?

Gordan
George Dunlap
2013-Aug-01 09:15 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 31/07/13 20:35, Gordan Bobic wrote:
> On 07/31/2013 06:53 PM, George Dunlap wrote:
[snip]
> Interesting you should mention something like this. I've been
> pondering whether it might be easier (even if it is a bodge) to simply
> always set the domU E820 map to have 0x80000000 - 0xFFFFFFFF
> (2GB->4GB) reserved. I have not yet seen a motherboard that maps
> 32-bit BARs below 2GB.

I'm pretty sure we've seen a memory hole larger than 2GiB, in a box
loaded up with a boatload of GPUs.

The main problem with doing this unconditionally is that the relocated
memory isn't available to non-PAE 32-bit guests.

I think we should have a work-around in place for 4.4 that will avoid a
collision between the host MMIO and guest memory addresses; but it will
need to be off by default, at least for guests that don't have a
passed-through device.

 -George
Fabio Fantoni
2013-Aug-01 13:10 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On 01/08/2013 11:15, George Dunlap wrote:
> On 31/07/13 20:35, Gordan Bobic wrote:
[snip]
> I'm pretty sure we've seen a memory hole larger than 2GiB, in a box
> loaded up with a boatload of GPUs.
>
> The main problem with doing this unconditionally is that the relocated
> memory isn't available to non-PAE 32-bit guests. I think we should
> have a work-around in place for 4.4 that will avoid a collision
> between the host MMIO and guest memory addresses; but it will need to
> be off by default, at least for guests that don't have a
> passed-through device.
>
> -George

I see this recent patch on qemu:
http://git.qemu.org/?p=qemu.git;a=commit;h=398489018183d613306ab022653552247d93919f
Is it related, and can it solve the problem, or am I wrong?
George Dunlap
2013-Aug-02 14:43 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Thu, Aug 1, 2013 at 2:10 PM, Fabio Fantoni <fabio.fantoni@m2r.biz> wrote:
> On 01/08/2013 11:15, George Dunlap wrote:
>> On 31/07/13 20:35, Gordan Bobic wrote:
[snip]
> I see this recent patch on qemu:
> http://git.qemu.org/?p=qemu.git;a=commit;h=398489018183d613306ab022653552247d93919f
> Is it related, and can it solve the problem, or am I wrong?

It doesn't look like it to me, but thanks for looking.

 -George
Gordan Bobic
2013-Sep-03 13:53 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Mon, 29 Jul 2013 14:04:31 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:

Hi Konrad,

Apologies it took me a month to get back to this.

>> Can anyone help me out with a high-level view WRT where
>> this would be best plumbed in (which files and the flow of
>> control between the affected files)?
>
> hvmloader probably, and the libxl e820 code. What from a
> high view needs to happen is:
> 1). Relax the check in libxl for e820_hole so that it is also
>     done for HVM guests. Said code just iterates over the
>     host E820, sanitizes it a bit, and makes an E820 hypercall to
>     set it for the guest.

I'm looking at the libxl code at the moment.

In cases where e820_host is seen as PV-specific, would the correct
thing to do be to move it out of the PV/HVM-specific blocks so that it
applies to both?

In libxl/libxl_x86.c:libxl__e820_alloc I have thus far changed the
code to remove the PV check, and, having moved the e820_host option to
be common to both VM types, I changed the e820-related instances from

    b_info->u.pv.e820_host

to

    b_info->e820_host

Is this the correct/preferred way this should be handled? Or would it
be better to have e820_host in both the PV and HVM options, and refer
to it as such (u.pv.e820_host / u.hvm.e820_host)?

The e820 sanitizer is called with the b_info->u.pv.slack_memkb
parameter. What does that parameter actually mean? I googled it and
couldn't find any documentation specific to it, and it doesn't appear
to be documented as settable in the config file. What would the
equivalent be in the case of HVM?

> 2). Figure out whether the E820 hypercall (which sets the E820
>     layout for a guest) can be run on HVM guests. I think it
>     could not, and Mukesh in his PVH patches posted a patch
>     to enable that - "..Move e820 fields out of pv_domain struct".
> 3). hvmloader should do an E820 get machine memory hypercall
>     to see if there is anything there. If there is - that means
>     the toolstack has requested a "new" type of E820. Iterate
>     over the E820 and make it look like that.
>     You can look in the Linux arch/x86/xen/setup.c to see how
>     it does that.
>
> The complication there is that hvmloader needs to fit the
> ACPI code (the guest-type one) and such.
> Presumably you can just re-use the existing spaces that
> the host has marked as E820_RESERVED or E820_ACPI..

Yup, I get it. Not only that, but it should also ideally (not strictly
necessary, but it'd be handy) map the IOMEM for the devices it is
passed so that pBAR=vBAR (as opposed to just leaving all the host e820
reserved areas well alone - which would work for most things).

> Then there is the SMBIOS would need to move and the BIOS
> might need to be relocated - but I think those are relocatable
> in some form.

OK, I'll look at that once I have a workable patch for the libxl part.

>> The added bonus of this (if it can be made to work) is that
>> it might just make unmodified GeForce cards work, too,
>> which probably makes it worthwhile on its own.
>
> Well, I am more than happy to help you with this.

Thanks, much appreciated. :)

Gordan
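The refactor being discussed - hoisting e820_host out of the PV-only union so the same check works for both guest types - can be mocked up like this. The struct here is a hypothetical stand-in for libxl_domain_build_info, not the real type (which is generated from libxl_types.idl):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of libxl's build info, for illustration only. */
typedef enum { GUEST_PV, GUEST_HVM } guest_type;

struct build_info {
    guest_type type;
    bool e820_host;     /* hoisted out of the u.pv union, as discussed */
};

/*
 * With e820_host made common, the "is this a PV guest?" guard goes
 * away, and the decision to mirror the host E820 depends only on the
 * option itself - for PV and HVM alike.
 */
static bool want_host_e820(const struct build_info *b_info)
{
    return b_info->e820_host;   /* no b_info->type check any more */
}
```

This mirrors the first alternative Gordan proposes (one common field) rather than the second (duplicating it in u.pv and u.hvm).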
Konrad Rzeszutek Wilk
2013-Sep-03 14:59 UTC
Re: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0
On Tue, Sep 03, 2013 at 02:53:06PM +0100, Gordan Bobic wrote:
> On Mon, 29 Jul 2013 14:04:31 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
> Hi Konrad,
>
> Apologies it took me a month to get back to this.

Hey Gordan,

That is OK. Time flies fast!

> I'm looking at the libxl code at the moment.
>
> In cases where e820_host is seen as PV-specific, would the
> correct thing to do be to move it out of the PV/HVM-specific
> blocks so it applies to both?

Yes.

> In libxl/libxl_x86.c:libxl__e820_alloc
>
> I have thus far changed the code to remove the PV check,
> and, having moved the e820_host option to be common to both VM
> types, I changed the e820-related instances from
>     b_info->u.pv.e820_host
> to
>     b_info->e820_host
>
> Is this the correct/preferred way this should be handled?

Yes.

> Or would it be better to make e820_host be in both PV and
> HVM options, and refer to it as such
> (u.pv.e820_host / u.hvm.e820_host)?

No. Let's make it work across the board.

> The e820 sanitizer is called with the b_info->u.pv.slack_memkb
> parameter. What does that parameter actually mean? I googled
> it and couldn't find any documentation specific to it, and
> it doesn't appear to be documented as settable in the config
> file. What would the equivalent be in the case of HVM?

0. If my memory serves me right, it is some amount of memory that a PV
guest has that it does not use normally. It is used by the frontend and
backend drivers to communicate. Kind of like a shadow memory. Only
ancient kernels use it, but those still have to be supported.

> Yup, I get it. Not only that, but it should also ideally (not
> strictly necessary, but it'd be handy) map the IOMEM for the devices
> it is passed so that pBAR=vBAR (as opposed to just leaving all
> the host e820 reserved areas well alone - which would work for
> most things).

Yes. That is an extra complication that could be done in subsequent
patches. But in theory, if you have the E820 mirrored from the host,
pBAR=vBAR should be easy enough, as the values from the host BARs can
easily fit in the E820 gaps.

> > Then there is the SMBIOS would need to move and the BIOS
> > might need to be relocated - but I think those are relocatable
> > in some form.
>
> OK, I'll look at that once I have a workable patch for the libxl
> part.

Aye.

> >> The added bonus of this (if it can be made to work) is that
> >> it might just make unmodified GeForce cards work, too,
> >> which probably makes it worthwhile on its own.
> >
> > Well, I am more than happy to help you with this.
>
> Thanks, much appreciated. :)

Yeeey! Vict^H^H^H^volunteer :-)! <maniacal laughter in the background>

I am also reachable on IRC (FreeNode mostly) as either darnok or konrad
if that would be more convenient to discuss this.

> Gordan
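Konrad's point that the host BAR values should "easily fit in the E820 gaps" of a mirrored map amounts to a simple containment check, sketched here with illustrative types (not Xen's actual definitions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

#define E820_RAM      1
#define E820_RESERVED 2

/* Illustrative stand-in for Xen's E820 entry type. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * A physical BAR can be placed at its host address (pBAR=vBAR) only
 * if it lies entirely within one of the non-RAM gaps of the guest's
 * (host-mirrored) E820.
 */
static bool bar_fits_in_hole(const struct e820entry *map, unsigned nr,
                             uint64_t bar_addr, uint64_t bar_size)
{
    for (unsigned i = 0; i < nr; i++) {
        if (map[i].type == E820_RAM)
            continue;
        if (bar_addr >= map[i].addr &&
            bar_addr + bar_size <= map[i].addr + map[i].size)
            return true;   /* whole BAR inside this hole */
    }
    return false;
}
```

If the guest E820 is a faithful mirror of the host's, this check holds by construction, since the host firmware already placed every BAR inside a host hole.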
Gordan Bobic
2013-Sep-03 19:47 UTC
HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 03:59 PM, Konrad Rzeszutek Wilk wrote:

>>> hvmloader probably, and the libxl e820 code. What from a
>>> high view needs to happen is:
>>> 1). Relax the check in libxl for e820_hole so that it is also
>>>     done for HVM guests. Said code just iterates over the
>>>     host E820, sanitizes it a bit, and makes an E820 hypercall to
>>>     set it for the guest.

[snip]

OK, I have attached a preliminary patch against 4.3.0 for the libxl
part. It compiles. I haven't tried running it to see if it actually
works or does something, but my packages build.

Please let me know if I've missed anything. On its own, I don't think
this patch will do much (apart from maybe break HVM hosts with
e820_host=1 set).

>>> 2). Figure out whether the E820 hypercall (which sets the E820
>>>     layout for a guest) can be run on HVM guests. I think it
>>>     could not, and Mukesh in his PVH patches posted a patch
>>>     to enable that - "..Move e820 fields out of pv_domain struct".

Is this already in 4.3.0, or is this an out-of-tree patch? Do you have
a link to it handy?

>>> 3). hvmloader should do an E820 get machine memory hypercall
>>>     to see if there is anything there. If there is - that means
>>>     the toolstack has requested a "new" type of E820. Iterate
>>>     over the E820 and make it look like that.
>>>     You can look in the Linux arch/x86/xen/setup.c to see how
>>>     it does that.
>>>
>>> The complication there is that hvmloader needs to fit the
>>> ACPI code (the guest-type one) and such.
>>> Presumably you can just re-use the existing spaces that
>>> the host has marked as E820_RESERVED or E820_ACPI..
>>
>> Yup, I get it. Not only that, but it should also ideally (not
>> strictly necessary, but it'd be handy) map the IOMEM for the devices
>> it is passed so that pBAR=vBAR (as opposed to just leaving all
>> the host e820 reserved areas well alone - which would work for
>> most things).
>
> Yes. That is an extra complication that could be done in subsequent
> patches. But in theory, if you have the E820 mirrored from the host,
> pBAR=vBAR should be easy enough, as the values from the host BARs can
> easily fit in the E820 gaps.

Agreed. Let's leave the pBAR=vBAR part for a separate patch set. I'll
have to figure out a sensible way to query the IOMEM regions for each
of the devices passed to the VM and make sure they are in the same
hole.

>>> Then there is the SMBIOS would need to move and the BIOS
>>> might need to be relocated - but I think those are relocatable
>>> in some form.

[bit above left for later reference]

>>> Well, I am more than happy to help you with this.
>>
>> Thanks, much appreciated. :)
>
> Yeeey! Vict^H^H^H^volunteer :-)! <maniacal laughter in the background>
>
> I am also reachable on IRC (FreeNode mostly) as either darnok or konrad
> if that would be more convenient to discuss this.

Thanks. I'll keep that in mind. :)

Gordan
Gordan Bobic
2013-Sep-03 20:35 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
First attempt at a test run predictably failed. I added e820_host=1 to
a VM config and tried starting it:

[root@normandy ~]# xl create /etc/xen/edi
Parsing config from /etc/xen/edi
libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed while collecting E820 with: -3 (errno:-1)
libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot (re-)build domain: -3
libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not find device-model's pid for dom 1
libxl: error: libxl.c:1415:libxl__destroy_domid: libxl__destroy_device_model failed for 1

xl-edi.log and qemu-dm-edi.log are attached. Both actually look
identical to the previous logs from before the patch.

Is this something that is clearly a consequence of the patch being
incomplete? Or did I break something?

Gordan

On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> On 09/03/2013 03:59 PM, Konrad Rzeszutek Wilk wrote:
[snip]
Gordan Bobic
2013-Sep-03 20:49 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
I spoke too soon - even with e820_host=0, the same error occurs. What did I break? The code in question is this:

if (libxl_defbool_val(d_config->b_info.e820_host)) {
    ret = libxl__e820_alloc(gc, domid, d_config);
    if (ret) {
        LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
                         "Failed while collecting E820 with: %d (errno:%d)\n",
                         ret, errno);
    }
}

With e820_host=0, that outer block should evaluate to false, should it not? In libxl_create.c, if I am understanding the code correctly, e820_host is defaulted to false, too. What am I missing?

Gordan

On 09/03/2013 09:35 PM, Gordan Bobic wrote:
> First attempt at a test run predictably failed. I added e820_host=1 to a
> VM config and tried starting it:
>
> [root@normandy ~]# xl create /etc/xen/edi
> Parsing config from /etc/xen/edi
> libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed while
> collecting E820 with: -3 (errno:-1)
>
> libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot
> (re-)build domain: -3
> libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not
> find device-model's pid for dom 1
> libxl: error: libxl.c:1415:libxl__destroy_domid:
> libxl__destroy_device_model failed for 1
>
> xl-edi.log, qemu-dm-edi.log attached.
> Both actually look identical to previous logs before the patch.
>
> Is this something that is clearly a consequence of the patch being
> incomplete? Or did I break something?
>
> Gordan
>
> On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> [snip]
Konrad Rzeszutek Wilk
2013-Sep-03 21:08 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 09:35:50PM +0100, Gordan Bobic wrote:
> First attempt at a test run predictably failed. I added e820_host=1
> to a VM config and tried starting it:
>
> [root@normandy ~]# xl create /etc/xen/edi
> Parsing config from /etc/xen/edi
> libxl: error: libxl_x86.c:307:libxl__arch_domain_create: Failed
> while collecting E820 with: -3 (errno:-1)
>
> libxl: error: libxl_create.c:901:domcreate_rebuild_done: cannot
> (re-)build domain: -3
> libxl: error: libxl_dm.c:1300:libxl__destroy_device_model: could not
> find device-model's pid for dom 1
> libxl: error: libxl.c:1415:libxl__destroy_domid:
> libxl__destroy_device_model failed for 1
>
> xl-edi.log, qemu-dm-edi.log attached.
> Both actually look identical to previous logs before the patch.
>
> Is this something that is clearly a consequence of the patch being
> incomplete? Or did I break something?

You are missing the hypervisor patch to set the E820 for HVM guests:

http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html

That should make it possible to "stash" the E820 in the hypervisor. Then, after that, you will need to implement the XENMEM_memory_map hypercall in hvmloader to get the E820 and do something with it.
Oh, and something like this probably should do it - not compile tested in any way:

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 1fcaed0..7b38890 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = do_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
@@ -3216,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)

     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = compat_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 2e05e93..86fb20a 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, unsigned int nr)
     }
 }

+static const char *e820_names(int type)
+{
+    switch (type) {
+    case E820_RAM: return "RAM";
+    case E820_RESERVED: return "Reserved";
+    case E820_ACPI: return "ACPI";
+    case E820_NVS: return "ACPI NVS";
+    case E820_UNUSABLE: return "Unusable";
+    default: break;
+    }
+    return "Unknown";
+}
+
 /* Create an E820 table based on memory parameters provided in hvm_info. */
 int build_e820_table(struct e820entry *e820,
                      unsigned int lowmem_reserved_base,
                      unsigned int bios_image_base)
 {
     unsigned int nr = 0;
+    struct xen_memory_map op;
+    struct e820entry map[E820MAX];
+    int rc;

     if ( !lowmem_reserved_base )
         lowmem_reserved_base = 0xA0000;

+    op.nr_entries = E820MAX;
+    set_xen_guest_handle(op.buffer, map);
+
+    rc = hypercall_memory_op ( XENMEM_memory_map, &op);
+    if ( rc != -ENOSYS) { /* It works!? */
+        int i;
+        for ( i = 0; i < op.nr_entries; i++ )
+            printf(" %lx -> %lx %s\n", map[i].addr >> 12,
+                   (map[i].addr + map[i].size) >> 12, e820_names(map[i].type));
+    }
+
     /* Lowmem must be at least 512K to keep Windows happy) */
     ASSERT ( lowmem_reserved_base > 512<<10 );

>
> Gordan
>
> On 09/03/2013 08:47 PM, Gordan Bobic wrote:
> [snip]
>
> domid: 1
> Using file /dev/zvol/ssd/edi in read-write mode
> Watching /local/domain/0/device-model/1/logdirty/cmd
> Watching /local/domain/0/device-model/1/command
> Watching /local/domain/1/cpu
> char device redirected to /dev/pts/3
> qemu_map_cache_init nr_buckets = 10000 size 4194304
> shared page at pfn feffd
> buffered io page at pfn feffb
> Guest uuid = a57e6840-e9f5-4a14-a822-b2cc662c177f
> populating video RAM at ff000000
> mapping video RAM from ff000000
> Register xen platform.
> Done register platform.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> xs_read(/local/domain/0/device-model/1/xen_extended_power_mgmt): read error
> xs_read(): vncpasswd get error. /vm/a57e6840-e9f5-4a14-a822-b2cc662c177f/vncpasswd.
> Log-dirty: no command yet.
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> vcpu-set: watch node error.
> [xenstore_process_vcpu_set_event]: /local/domain/1/cpu has no CPU!
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> xs_read(/local/domain/1/log-throttling): read error
> qemu: ignoring not-understood drive `/local/domain/1/log-throttling'
> medium change watch on `/local/domain/1/log-throttling' - unknown device, ignored
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.0 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x0
> pt_register_regions: IO region registered (size=0x02000000 base_addr=0xf8000000)
> pt_register_regions: IO region registered (size=0x08000000 base_addr=0xb800000c)
> pt_register_regions: IO region registered (size=0x04000000 base_addr=0xb400000c)
> pt_register_regions: IO region registered (size=0x00000080 base_addr=0x0000cf81)
> pt_register_regions: Expansion ROM registered (size=0x00080000 base_addr=0xfbc00000)
> pci_intx: intx=1
> register_real_device: Real physical device 08:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 08:00.1 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x8:0x0.0x1
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xfbcfc000)
> pci_intx: intx=2
> register_real_device: Real physical device 08:00.1 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 0c:00.0 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0xc:0x0.0x0
> pt_register_regions: IO region registered (size=0x00004000 base_addr=0xd7efc000)
> pci_intx: intx=1
> register_real_device: Real physical device 0c:00.0 registered successfuly!
> IRQ type = INTx
> dm-command: hot insert pass-through pci dev
> register_real_device: Assigning real physical device 00:1a.1 ...
> register_real_device: Disable MSI translation via per device option
> register_real_device: Enable power management
> pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x0:0x1a.0x1
> pt_register_regions: IO region registered (size=0x00000020 base_addr=0x00008a01)
> pci_intx: intx=2
> register_real_device: Real physical device 00:1a.1 registered successfuly!
> IRQ type = INTx
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=1
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=1
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=1
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=1
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=1
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=1
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=1
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
> platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.
> Unknown PV product 2 loaded in guest
> PV driver build 1
> region type 0 at [ef880000,ef8a0000).
> squash iomem [ef880000, ef8a0000).
> region type 1 at [c180,c1c0).
> vga s->lfb_addr = ef000000 s->lfb_end = ef800000
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:06:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_pci_write_config: [00:07:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> pt_pci_write_config: [00:08:0] Warning: Guest attempt to set address to unused Base Address Register. [Offset:30h][Length:4]
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ec000000 maddr=f8000000 type=0 len=33554432 index=0 first_map=0
> pt_iomem_map: e_phys=e0000000 maddr=b8000000 type=8 len=134217728 index=1 first_map=0
> pt_iomem_map: e_phys=e8000000 maddr=b4000000 type=8 len=67108864 index=3 first_map=0
> pt_ioport_map: e_phys=c100 pio_base=cf80 len=128 index=5 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a0000 maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> pt_ioport_map: e_phys=c1e0 pio_base=8a00 len=32 index=4 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ef8a4000 maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=fbcfc000 type=0 len=16384 index=0 first_map=0
> pt_iomem_map: e_phys=ffffffff maddr=d7efc000 type=0 len=16384 index=0 first_map=0
> pt_ioport_map: e_phys=ffff pio_base=8a00 len=32 index=4 first_map=0
> shutdown requested in cpu_handle_ioreq
> Issued domain 1 poweroff

> Waiting for domain edi (domid 1) to die [pid 8363]
> Domain 1 has shut down, reason code 0 0x0
> Action for shutdown reason code 0 is destroy
> Domain 1 needs to be cleaned up: destroying the domain
> libxl: error: libxl_pci.c:990:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:08:00.0
> libxl: error: libxl_pci.c:990:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:08:00.1
> Done. Exiting now
Konrad Rzeszutek Wilk
2013-Sep-03 21:10 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
> I spoke too soon - even with e820_host=0, the same error occurs.
> What did I break? The code in question is this:
>
> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>     ret = libxl__e820_alloc(gc, domid, d_config);
>     if (ret) {
>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>                          ret, errno);
>     }
> }
>
> With e820_host=0, that outer block should evaluate to false, should
> it not? In libxl_create.c, if I am understanding the code correctly,
> e820_host is defaulted to false, too. What am I missing?

Just sent you an email, but I believe what is failing is:

241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);

You can add some extra LIBXL__LOG_ERRNO calls to check each 'rc' to see which one of them failed.

Hm, perhaps it might make sense to have libxl__e820_alloc itself use LIBXL__LOG_ERRNO to log more details..

> Gordan
>
> On 09/03/2013 09:35 PM, Gordan Bobic wrote:
> [snip]
Gordan Bobic
2013-Sep-03 21:24 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>> I spoke too soon - even with e820_host=0, the same error occurs.
>> What did I break? The code in question is this:
>>
>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>     if (ret) {
>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>>                          ret, errno);
>>     }
>> }
>>
>> With e820_host=0, that outer block should evaluate to false, should
>> it not? In libxl_create.c, if I am understanding the code correctly,
>> e820_host is defaulted to false, too. What am I missing?
>
> Just sent you an email but I believe what is failing is:
>
> 241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);

But with e820_host=0 set in the config, libxl__e820_alloc() should not be getting called in the first place. That function only gets called from line 303, inside that if block I pasted above. That is what is puzzling me.

> You can add some extra LIBXL__LOG_ERRNO to check each 'rc' to see
> which one of them failed.
>
> Hm, perhaps it might make sense to actually have the libxl__e820_alloc
> also use the LIBXL__LOG_ERRNO to log more details..

OK, I'll add some debug and see what I find.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-03 21:30 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote:
> On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>>> I spoke too soon - even with e820_host=0, the same error occurs.
>>> What did I break? The code in question is this:
>>>
>>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>>     if (ret) {
>>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>>                          "Failed while collecting E820 with: %d (errno:%d)\n",
>>>                          ret, errno);
>>>     }
>>> }
>>>
>>> With e820_host=0, that outer block should evaluate to false, should
>>> it not? In libxl_create.c, if I am understanding the code correctly,
>>> e820_host is defaulted to false, too. What am I missing?

Does your config have 'pci' in it? The patch you sent had this:

+    if (d_config->num_pcidevs)
+        libxl_defbool_set(&b_info->e820_host, true);

Which means that even if you did not set e820_host in the config, it will be automatically enabled if you have PCI devices.

>>
>> Just sent you an email but I believe what is failing is:
>>
>> 241         rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
>
> But with e820_host=0 set in the config, libxl__e820_alloc() should
> not be getting called in the first place. That function only gets
> called from line 303, inside that if block I pasted above. That is
> what is puzzling me.
>
>> You can add some extra LIBXL__LOG_ERRNO to check each 'rc' to see
>> which one of them failed.
>>
>> Hm, perhaps it might make sense to actually have the libxl__e820_alloc
>> also use the LIBXL__LOG_ERRNO to log more details..
>
> OK, I'll add some debug and see what I find.
>
> Gordan
Gordan Bobic
2013-Sep-04 00:18 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote:
>> On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote:
>>>> I spoke too soon - even with e820_host=0, the same error occurs.
>>>> What did I break? The code in question is this:
>>>>
>>>> if (libxl_defbool_val(d_config->b_info.e820_host)) {
>>>>     ret = libxl__e820_alloc(gc, domid, d_config);
>>>>     if (ret) {
>>>>         LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
>>>>             "Failed while collecting E820 with: %d (errno:%d)\n",
>>>>             ret, errno);
>>>>     }
>>>> }
>>>>
>>>> With e820_host=0, that outer block should evaluate to false, should
>>>> it not? In libxl_create.c, if I am understanding the code correctly,
>>>> e820_host is defaulted to false, too. What am I missing?
>
> Does your config have 'pci' in it? The patch you sent had this:
>
> +    if (d_config->num_pcidevs)
> +        libxl_defbool_set(&b_info->e820_host, true);
>
> Which means that even if you did not have e820_host it will be automatically
> set if you have PCI devices.

OK - that was embarrassing. Caffeine underflow error. :(
I backed out that block. I don't think e820_host should be implicit
in HVM when PCI devices are passed.

That makes the adjusted patch fragment:

--- xl_cmdimpl.c.orig	2013-09-04 00:42:57.424337503 +0100
+++ xl_cmdimpl.c	2013-09-04 00:43:21.213886356 +0100
@@ -1293,7 +1293,7 @@
             d_config->num_pcidevs++;
         }
         if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
-            libxl_defbool_set(&b_info->u.pv.e820_host, true);
+            libxl_defbool_set(&b_info->e820_host, true);
     }

     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {

This should maintain the old behaviour for backward compatibility
when e820_host is not set. I just tested it and it works (with
e820_host=1 I get the previous error; with e820_host=0, everything
works fine).

I will have a play with the other two patches tomorrow.

Gordan
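[Editor's note: the policy the adjusted fragment aims for can be expressed as a small decision function. This is a hedged sketch of the intended behaviour, not libxl code: PV guests with passed-through PCI devices keep the historical implicit e820_host=1 regardless of the config; HVM guests get it only when explicitly requested.]

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TYPE_PV, TYPE_HVM } domain_type;

/* Effective e820_host policy after the adjusted patch (illustrative):
 * - PV + PCI passthrough: always on (preserves the old PV behaviour);
 * - otherwise: whatever the user explicitly set, defaulting to off. */
static bool want_host_e820(domain_type t, int num_pcidevs,
                           bool explicit_set, bool explicit_val)
{
    if (t == TYPE_PV && num_pcidevs > 0)
        return true;                  /* historical implicit enable */
    return explicit_set ? explicit_val : false;
}
```

Under this policy an HVM guest with pci=... but no e820_host setting behaves exactly as before the patch, which matches the backward-compatibility claim above.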
Gordan Bobic
2013-Sep-04 09:21 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> You are missing the hypervisor patch to set the E820 for HVM guests.
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>
> And that should make it possible to "stash" the E820 in the
> hypervisor.

Regarding Jan's comment on the thread here:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01649.html

Should this, instead of:

==
@@ -595,7 +595,7 @@ void arch_domain_destroy(struct domain *d)
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
     else
-        xfree(d->arch.pv_domain.e820);
+        xfree(d->arch.e820);

     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

be something like:

==
@@ -595,7 +595,6 @@ void arch_domain_destroy(struct domain *d)
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+    xfree(d->arch.e820);

     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

The question I have is: will d->arch.e820 always be there and set even
with e820_host=0? Or does there need to be an extra check here?

> Then after that you will need to implement in the hvmloader.c the
> XENMEM_memory_map hypercall to get the E820 and do something with it.
> > > Oh, and something like this probably should do it - not compile > tested > in any way: > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 1fcaed0..7b38890 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > @@ -3216,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = compat_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > > diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..86fb20a 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, > unsigned int nr) > } > } > > +static const char *e820_names(int type) > +{ > + switch (type) { > + case E820_RAM: return "RAM"; > + case E820_RESERVED: return "Reserved"; > + case E820_ACPI: return "ACPI"; > + case E820_NVS: return "ACPI NVS"; > + case E820_UNUSABLE: return "Unusable"; > + default: break; > + } > + return "Unknown"; > +} > + > + > /* Create an E820 table based on memory parameters provided in > hvm_info. 
*/ > int build_e820_table(struct e820entry *e820, > unsigned int lowmem_reserved_base, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_op, &op); > + if ( rc != -ENOSYS) { /* It works!? */ > + int i; > + for ( i = 0; i < op.nr_entries; i++ ) > + printf(" %lx -> %lx %s\n", map[i].addr >> 12, > + (map[i].addr + map[i].size) >> 12, > e820_names(map[i].type)); > + } > /* Lowmem must be at least 512K to keep Windows happy) */ > ASSERT ( lowmem_reserved_base > 512<<10 );Thanks. :) Will try that when I''ve verified the first two patches (mine and Mukesh''s) build cleanly in my 4.3.0 package build. Gordan
Gordan Bobic
2013-Sep-04 11:01 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> Oh, and something like this probably should do it - not compile > tested > in any way: > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 1fcaed0..7b38890 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1;This seems to work better. :) --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) switch ( cmd & MEMOP_CMD_MASK ) { - case XENMEM_memory_map: case XENMEM_machine_memory_map: case XENMEM_machphys_mapping: return -ENOSYS; + case XENMEM_memory_map: case XENMEM_decrease_reservation: rc = do_memory_op(cmd, arg); current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;> diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..86fb20a 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, > unsigned int nr) > } > } > > +static const char *e820_names(int type) > +{ > + switch (type) { > + case E820_RAM: return "RAM"; > + case E820_RESERVED: return "Reserved"; > + case E820_ACPI: return "ACPI"; > + case E820_NVS: return "ACPI NVS"; > + case E820_UNUSABLE: return "Unusable"; > + default: break; > + } > + return "Unknown"; > +}To make this work I also added: --- tools/firmware/hvmloader/e820.h.orig 2013-09-04 10:55:38.317275183 +0100 +++ tools/firmware/hvmloader/e820.h 2013-09-04 10:56:14.374595809 +0100 @@ -8,6 +8,7 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define 
E820_NVS 4 +#define E820_UNUSABLE 5 struct e820entry { uint64_t addr; Is that OK?> /* Create an E820 table based on memory parameters provided in > hvm_info. */ > int build_e820_table(struct e820entry *e820, > unsigned int lowmem_reserved_base, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_op, &op);Where is XENMEM_memory_op defined? Should that be XENMEM_memory_map? Or maybe XENMEM_populate_physmap? Gordan
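[Editor's note: the E820 type codes used by the e820_names() helper follow the standard BIOS/ACPI address-range convention, in which type 5 is "unusable" - so the new #define should read E820_UNUSABLE with that value. A standalone, compilable version of the helper (illustrative, not the hvmloader source) looks like:]

```c
#include <assert.h>
#include <string.h>

/* Standard E820 address-range type codes (BIOS/ACPI convention),
 * matching the values in tools/firmware/hvmloader/e820.h. */
#define E820_RAM       1
#define E820_RESERVED  2
#define E820_ACPI      3
#define E820_NVS       4
#define E820_UNUSABLE  5

/* Map an E820 type code to a human-readable name for log output. */
static const char *e820_name(int type)
{
    switch (type) {
    case E820_RAM:      return "RAM";
    case E820_RESERVED: return "Reserved";
    case E820_ACPI:     return "ACPI";
    case E820_NVS:      return "ACPI NVS";
    case E820_UNUSABLE: return "Unusable";
    default:            return "Unknown";
    }
}
```

Unknown codes fall through to "Unknown" rather than faulting, which is the safe choice when dumping a table handed over by the hypervisor.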
Gordan Bobic
2013-Sep-04 13:11 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
I have this at the point where it actually builds.
Otherwise completely untested (will do that later today).

Attached are:

1) libxl patch
Modified from the original patch to _not_ implicitly enable
e820_host when PCI devices are passed.

2) Mukesh's hypervisor e820 patch from here:
http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
Modified slightly to attempt to address Jan's comment on the same
thread, and to adjust the diff line pointers to match against
4.3.0 release code.

3) A patch based on Konrad's earlier in this thread, with
a few additions and changes to make it all compile.

Some peer review would be most welcome - this is my first
venture into Xen code, so please do assume that I have
no idea what I'm doing at the moment. :)

I added yet another E820MAX #define, this time to
tools/firmware/hvmloader/e820.h

If there is a better place to #include that from in e820.c,
please point me in the right direction.

Gordan

On Wed, 04 Sep 2013 12:01:09 +0100, Gordan Bobic <gordan@bobich.net> wrote:
> On Tue, 3 Sep 2013 17:08:33 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>
>> Oh, and something like this probably should do it - not compile
>> tested in any way:
>>
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 1fcaed0..7b38890 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -3146,6 +3146,7 @@ static long hvm_memory_op(int cmd,
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>     case XENMEM_machine_memory_map:
>>     case XENMEM_machphys_mapping:
>>         return -ENOSYS;
>> +    case XENMEM_memory_map:
>>     case XENMEM_decrease_reservation:
>>         rc = do_memory_op(cmd, arg);
>>         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
>
> This seems to work better.
:) > > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > > >> diff --git a/tools/firmware/hvmloader/e820.c >> b/tools/firmware/hvmloader/e820.c >> index 2e05e93..86fb20a 100644 >> --- a/tools/firmware/hvmloader/e820.c >> +++ b/tools/firmware/hvmloader/e820.c >> @@ -68,16 +68,42 @@ void dump_e820_table(struct e820entry *e820, >> unsigned int nr) >> } >> } >> >> +static const char *e820_names(int type) >> +{ >> + switch (type) { >> + case E820_RAM: return "RAM"; >> + case E820_RESERVED: return "Reserved"; >> + case E820_ACPI: return "ACPI"; >> + case E820_NVS: return "ACPI NVS"; >> + case E820_UNUSABLE: return "Unusable"; >> + default: break; >> + } >> + return "Unknown"; >> +} > > To make this work I also added: > > --- tools/firmware/hvmloader/e820.h.orig 2013-09-04 > 10:55:38.317275183 +0100 > +++ tools/firmware/hvmloader/e820.h 2013-09-04 10:56:14.374595809 > +0100 > @@ -8,6 +8,7 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSBLE 5 > > struct e820entry { > uint64_t addr; > > It that OK? > >> /* Create an E820 table based on memory parameters provided in >> hvm_info. */ >> int build_e820_table(struct e820entry *e820, >> unsigned int lowmem_reserved_base, >> unsigned int bios_image_base) >> { >> unsigned int nr = 0; >> + struct xen_memory_map op; >> + struct e820entry map[E820MAX]; >> + int rc; >> >> if ( !lowmem_reserved_base ) >> lowmem_reserved_base = 0xA0000; >> >> + set_xen_guest_handle(op.buffer, map); >> + >> + rc = hypercall_memory_op ( XENMEM_memory_op, &op); > > Where is XENMEM_memory_op defined? 
> Should that be XENMEM_memory_map? Or maybe XENMEM_populate_physmap?
>
> Gordan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-04 14:08 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote:> On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: > >On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: > >>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: > >>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: > >>>>I spoke too soon - even with e820_host=0, the same error occurs. > >>>>What did I break? The code in question is this: > >>>> > >>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { > >>>> ret = libxl__e820_alloc(gc, domid, d_config); > >>>> if (ret) { > >>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > >>>> "Failed while collecting E820 with: %d (errno:%d)\n", > >>>> ret, errno); > >>>> } > >>>>} > >>>> > >>>>With e820_host=0, that outer black should evaluate to false, should > >>>>it not? In libxl_create.c, if I am understanding the code correctly, > >>>>e820_host is defaulted to false, too. What am I missing? > > > >Does your config have ''pci'' in it? The patch you sent had this: > > > >+ if (d_config->num_pcidevs) > >+ libxl_defbool_set(&b_info->e820_host, true); > > > >Which means that even if you did not have e820_host it will be automatically > >set if you have PCI devices. > > OK - that was embarrasing. Caffeine underflow error. :( > I backed out that block. I don''t think e820_host should be implicit > in hvm when PCI devices are passed. > > That makes the adjusted patch fragment: > --- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 > +++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 > @@ -1293,7 +1293,7 @@ > d_config->num_pcidevs++; > } > if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)I think you also want to get rid of the c_info->type check?> - libxl_defbool_set(&b_info->u.pv.e820_host, true); > + libxl_defbool_set(&b_info->e820_host, true); > } > > switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { > > > This should maintain the old behaviour for backward compatibility > when e820_host is not set. 
I just tested it and it works (with
> e820_host=1 I get the previous error; with e820_host=0, everything
> works fine).

I think it might make sense to relax the PV check. That way the only
way the e820_host capability gets activated is if the guest config
has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
be done.

Or maybe a negative check - if a 'pci' stanza is there we automatically
turn on e820_host=1 (right now that is how it works). If the user
has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
That way if something is odd we can turn this off?

> I will have a play with the other two patches tomorrow.
>
> Gordan
Gordan Bobic
2013-Sep-04 14:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, 4 Sep 2013 10:08:37 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote: >> On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: >> >On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: >> >>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: >> >>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: >> >>>>I spoke too soon - even with e820_host=0, the same error occurs. >> >>>>What did I break? The code in question is this: >> >>>> >> >>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { >> >>>> ret = libxl__e820_alloc(gc, domid, d_config); >> >>>> if (ret) { >> >>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >> >>>> "Failed while collecting E820 with: %d >> (errno:%d)\n", >> >>>> ret, errno); >> >>>> } >> >>>>} >> >>>> >> >>>>With e820_host=0, that outer black should evaluate to false, >> should >> >>>>it not? In libxl_create.c, if I am understanding the code >> correctly, >> >>>>e820_host is defaulted to false, too. What am I missing? >> > >> >Does your config have ''pci'' in it? The patch you sent had this: >> > >> >+ if (d_config->num_pcidevs) >> >+ libxl_defbool_set(&b_info->e820_host, true); >> > >> >Which means that even if you did not have e820_host it will be >> automatically >> >set if you have PCI devices. >> >> OK - that was embarrasing. Caffeine underflow error. :( >> I backed out that block. I don''t think e820_host should be implicit >> in hvm when PCI devices are passed. >> >> That makes the adjusted patch fragment: >> --- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 >> +++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 >> @@ -1293,7 +1293,7 @@ >> d_config->num_pcidevs++; >> } >> if (d_config->num_pcidevs && c_info->type == >> LIBXL_DOMAIN_TYPE_PV) > > I think you also want to get rid of the c_info->type check?That would alter the current PV behaviour of implicitly enabling e820_host with PCI devices passed, would it not? 
I was hoping to maintain current behaviours intact, and
only affect what happens when e820_host=1 is set for HVMs.

>> -            libxl_defbool_set(&b_info->u.pv.e820_host, true);
>> +            libxl_defbool_set(&b_info->e820_host, true);
>>     }
>>
>>     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
>>
>> This should maintain the old behaviour for backward compatibility
>> when e820_host is not set. I just tested it and it works (with
>> e820_host=1 I get the previous error, with e820_host=0, everything
>> works fine).
>
> I think it might make sense to relax the PV check. That way the only
> way the e820_host capability gets activated is if the guest config
> has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
> be done.

While I think these two checks should be separate in both cases,
I don't know that this won't break something for PV instances. And
I would prefer not to have to also debug that code path at this
point. :)

> Or maybe a negative check - if a 'pci' stanza is there we automatically
> turn on e820_host=1 (right now that is how it works). If the user
> has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
> That way if something is odd we can turn this off?

I am not disagreeing at all - I just really don't want to change
the current PV behaviour since that will potentially require
extra debugging. Current PV behaviour seems to be that if
PCI devices are passed, e820_host=1 is always set regardless
of whether it is explicitly enabled or disabled in the config.

And I have no idea what will happen with a PV domain with
PCI devices if e820_host is explicitly disabled.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-04 18:00 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 03:23:40PM +0100, Gordan Bobic wrote:> On Wed, 4 Sep 2013 10:08:37 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >On Wed, Sep 04, 2013 at 01:18:39AM +0100, Gordan Bobic wrote: > >>On 09/03/2013 10:30 PM, Konrad Rzeszutek Wilk wrote: > >>>On Tue, Sep 03, 2013 at 10:24:44PM +0100, Gordan Bobic wrote: > >>>>On 09/03/2013 10:10 PM, Konrad Rzeszutek Wilk wrote: > >>>>>On Tue, Sep 03, 2013 at 09:49:40PM +0100, Gordan Bobic wrote: > >>>>>>I spoke too soon - even with e820_host=0, the same error occurs. > >>>>>>What did I break? The code in question is this: > >>>>>> > >>>>>>if (libxl_defbool_val(d_config->b_info.e820_host)) { > >>>>>> ret = libxl__e820_alloc(gc, domid, d_config); > >>>>>> if (ret) { > >>>>>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > >>>>>> "Failed while collecting E820 with: %d > >>(errno:%d)\n", > >>>>>> ret, errno); > >>>>>> } > >>>>>>} > >>>>>> > >>>>>>With e820_host=0, that outer black should evaluate to false, > >>should > >>>>>>it not? In libxl_create.c, if I am understanding the code > >>correctly, > >>>>>>e820_host is defaulted to false, too. What am I missing? > >>> > >>>Does your config have ''pci'' in it? The patch you sent had this: > >>> > >>>+ if (d_config->num_pcidevs) > >>>+ libxl_defbool_set(&b_info->e820_host, true); > >>> > >>>Which means that even if you did not have e820_host it will be > >>automatically > >>>set if you have PCI devices. > >> > >>OK - that was embarrasing. Caffeine underflow error. :( > >>I backed out that block. I don''t think e820_host should be implicit > >>in hvm when PCI devices are passed. > >> > >>That makes the adjusted patch fragment: > >>--- xl_cmdimpl.c.orig 2013-09-04 00:42:57.424337503 +0100 > >>+++ xl_cmdimpl.c 2013-09-04 00:43:21.213886356 +0100 > >>@@ -1293,7 +1293,7 @@ > >> d_config->num_pcidevs++; > >> } > >> if (d_config->num_pcidevs && c_info->type => >>LIBXL_DOMAIN_TYPE_PV) > > > >I think you also want to get rid of the c_info->type check? 
> That would alter the current PV behaviour of implicitly
> enabling e820_host with PCI devices passed, would it not?
> I was hoping to maintain current behaviours intact, and
> only affect what happens when e820_host=1 is set for HVMs.
>
>>> -            libxl_defbool_set(&b_info->u.pv.e820_host, true);
>>> +            libxl_defbool_set(&b_info->e820_host, true);
>>>     }
>>>
>>>     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
>>>
>>> This should maintain the old behaviour for backward compatibility
>>> when e820_host is not set. I just tested it and it works (with
>>> e820_host=1 I get the previous error, with e820_host=0, everything
>>> works fine).
>>
>> I think it might make sense to relax the PV check. That way the only
>> way the e820_host capability gets activated is if the guest config
>> has a pci=X stanza. But perhaps that _and_ e820_host=1 is what should
>> be done.
>
> While I think these two checks should be separate in both cases,
> I don't know that this won't break something for PV instances. And
> I would prefer not to have to also debug that code path at this
> point. :)

OK.

>> Or maybe a negative check - if a 'pci' stanza is there we automatically
>> turn on e820_host=1 (right now that is how it works). If the user
>> has set 'e820_host=0' and 'pci=xxx' then we would turn the E820 off?
>> That way if something is odd we can turn this off?
>
> I am not disagreeing at all - I just really don't want to change
> the current PV behaviour since that will potentially require
> extra debugging. Current PV behaviour seems to be that if
> PCI devices are passed, e820_host=1 is always set regardless
> of whether it is explicitly enabled or disabled in the config.

Right.

> And I have no idea what will happen with a PV domain with
> PCI devices if e820_host is explicitly disabled.

It will boot - but if you have more than 2GB the PCI devices will
most likely not work.

> Gordan
Gordan Bobic
2013-Sep-04 20:18 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
OK, I have done some preliminary testing. Details below.

On 09/04/2013 02:11 PM, Gordan Bobic wrote:
> I have this at the point where it actually builds.
> Otherwise completely untested (will do that later today).
>
> Attached are:
>
> 1) libxl patch
> Modified from the original patch to _not_ implicitly enable
> e820_host when PCI devices are passed.

Builds, works with e820_host=0.

> 2) Mukesh's hypervisor e820 patch from here:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
> Modified slightly to attempt to address Jan's comment on the same
> thread, and to adjust the diff line pointers to match against
> 4.3.0 release code.

Builds, works with e820_host=0.

> 3) A patch based on Konrad's earlier in this thread, with
> a few additions and changes to make it all compile.

Causes the domU to fail to start. No obvious errors in any logs, but
the qemu-dm log simply stops before the usual point. There is a blank
white screen on the VNC console. It looks like the domU crashes before
it even starts loading the OS.

I have attached two qemu-dm logs:

qemu-dm-edi.log - without patch 3
qemu-dm-edi.log.2 - with patch 3

I also attached the output of xl dmesg in each case. With the 3rd
patch applied, everything seems to stop just as the hypervisor is
about to log the E820 table for HVM1 (obvious if you diff them).

This may be related to what I did to get your patch to build, Konrad.
The map never gets output, so either rc=-ENOSYS, or it crashes during
the hypercall. With e820_host=0, the e820 map should be exactly the
same as it would have been anyway, but something seems to go wrong
during:

rc = hypercall_memory_op ( XENMEM_memory_map, &op);

Thoughts?

Gordan
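[Editor's note: for context on what a correct guest E820 must guarantee here - the original ">2GB with passthrough" corruption is guest RAM overlapping the PCI MMIO hole, which a host-derived (sanitized) E820 is meant to prevent. The following self-contained checker is illustrative only (it is not Xen's e820_sanitize) and expresses that invariant: no RAM entry may overlap a non-RAM entry.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal E820 entry, matching the layout used in hvmloader. */
struct e820entry { uint64_t addr, size; uint32_t type; };

#define E820_RAM 1

/* Half-open interval overlap test: [a0,a1) intersects [b0,b1)? */
static int ranges_overlap(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1)
{
    return a0 < b1 && b0 < a1;
}

/* Returns 1 if no RAM entry overlaps any reserved/ACPI/etc. entry -
 * the property whose violation shows up as PCI "memory stomp". */
static int e820_ram_clear_of_reserved(const struct e820entry *map, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (map[i].type != E820_RAM)
            continue;
        for (int j = 0; j < nr; j++) {
            if (j == i || map[j].type == E820_RAM)
                continue;
            if (ranges_overlap(map[i].addr, map[i].addr + map[i].size,
                               map[j].addr, map[j].addr + map[j].size))
                return 0;   /* RAM stomps a reserved region */
        }
    }
    return 1;
}
```

A guest given 3GB of RAM with a flat map starting at 0 fails this check against a reserved hole below 4GB, while a map that relocates the RAM above the hole passes - which is the effect the host-E820 patches are trying to achieve.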
Konrad Rzeszutek Wilk
2013-Sep-05 02:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
> I have this at the point where it actually builds.
> Otherwise completely untested (will do that later today).
>
> Attached are:
>
> 1) libxl patch
> Modified from the original patch to _not_ implicitly enable
> e820_host when PCI devices are passed.
>
> 2) Mukesh's hypervisor e820 patch from here:
> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
> Modified slightly to attempt to address Jan's comment on the same
> thread, and to adjust the diff line pointers to match against
> 4.3.0 release code.

I think that was the old version. I spotted a bug in it that was
causing a hang, and also one that explains why libxl would refuse to
set up the E820. The problem was that in XENMEM_set_memory_map there
was a check to make sure that the guest launched was not HVM. Also,
there was a bug in the initial domain creation where the spinlock was
only set for PV and not for HVM.

> 3) A patch based on Konrad's earlier in this thread, with
> a few additions and changes to make it all compile.
>
> Some peer review would be most welcome - this is my first
> venture into Xen code, so please do assume that I have
> no idea what I'm doing at the moment. :)
>
> I added yet another E820MAX #define, this time to
> tools/firmware/hvmloader/e820.h
>
> If there is a better place to #include that from in e820.c,
> please point me in the right direction.

I think I saw that #define in tools/libxc/xenctrl.h. But since
tools/firmware cannot link to libxc (b/c it is a mini contained OS)
I believe just having the #define in hvmloader/e820.h is the right
call.

Good first pass. I altered it a bit and got the E820 entries printed
out in the HVM guest.
Here is a big giant diff: diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c index 2e05e93..3c80241 100644 --- a/tools/firmware/hvmloader/e820.c +++ b/tools/firmware/hvmloader/e820.c @@ -22,6 +22,9 @@ #include "config.h" #include "util.h" +#include "hypercall.h" +#include <xen/memory.h> +#include <errno.h> void dump_e820_table(struct e820entry *e820, unsigned int nr) { @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, unsigned int bios_image_base) { unsigned int nr = 0; + struct xen_memory_map op; + struct e820entry map[E820MAX]; + int rc; if ( !lowmem_reserved_base ) lowmem_reserved_base = 0xA0000; + set_xen_guest_handle(op.buffer, map); + + rc = hypercall_memory_op ( XENMEM_memory_map, &op); + if ( rc != -ENOSYS) { /* It works!? */ + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, op.nr_entries); + dump_e820_table(&map[0], op.nr_entries); + } /* Lowmem must be at least 512K to keep Windows happy) */ ASSERT ( lowmem_reserved_base > 512<<10 ); diff --git a/tools/firmware/hvmloader/e820.h b/tools/firmware/hvmloader/e820.h index b2ead7f..2fa700d 100644 --- a/tools/firmware/hvmloader/e820.h +++ b/tools/firmware/hvmloader/e820.h @@ -8,6 +8,9 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define E820_NVS 4 +#define E820_UNUSABLE 5 + +#define E820MAX 128 struct e820entry { uint64_t addr; diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 0c32d0b..d8e2346 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -208,6 +208,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, libxl_defbool_setdefault(&b_info->disable_migrate, false); + libxl_defbool_setdefault(&b_info->e820_host, false); + switch (b_info->type) { case LIBXL_DOMAIN_TYPE_HVM: if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) @@ -280,7 +282,6 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, break; case LIBXL_DOMAIN_TYPE_PV: - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); if 
(b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT)
            b_info->shadow_memkb = 0;
        if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 85341a0..fd6389a 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -299,6 +299,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("irqs", Array(uint32, "num_irqs")),
     ("iomem", Array(libxl_iomem_range, "num_iomem")),
     ("claim_mode", libxl_defbool),
+    # Use host's E820 for PCI passthrough.
+    ("e820_host", libxl_defbool),
     ("u", KeyedUnion(None, libxl_domain_type, "type",
                 [("hvm", Struct(None, [("firmware", string),
                                        ("bios", libxl_bios_type),
@@ -345,8 +347,6 @@ libxl_domain_build_info = Struct("domain_build_info",[
                                        ("cmdline", string),
                                        ("ramdisk", string),
                                        ("features", string, {'const': True}),
-                                       # Use host's E820 for PCI passthrough.
-                                       ("e820_host", libxl_defbool),
                                        ])),
                  ("invalid", Struct(None, [])),
                  ], keyvar_init_val = "LIBXL_DOMAIN_TYPE_INVALID")),
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index a78c91d..94515a5 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
     struct e820entry map[E820MAX];
     libxl_domain_build_info *b_info;
 
-    if (d_config == NULL || d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM)
-        return ERROR_INVAL;
-
     b_info = &d_config->b_info;
-    if (!libxl_defbool_val(b_info->u.pv.e820_host))
+    if (!libxl_defbool_val(b_info->e820_host)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, __LINE__);
         return ERROR_INVAL;
-
+    }
     rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
     if (rc < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, __LINE__);
         errno = rc;
         return ERROR_FAIL;
     }
     nr = rc;
-    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
-                       (b_info->max_memkb - b_info->target_memkb) +
-                       b_info->u.pv.slack_memkb);
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, __LINE__, nr);
+    if (d_config == NULL || d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
+        rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                           (b_info->max_memkb - b_info->target_memkb));
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
+    } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) {
+        rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                           (b_info->max_memkb - b_info->target_memkb) +
+                           b_info->u.pv.slack_memkb);
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
+    }
+
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
     if (rc)
         return ERROR_FAIL;
 
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, nr:%d",__func__, __LINE__, rc, nr);
+
     rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
 
+    LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
     if (rc < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d",__func__, __LINE__, rc);
         errno = rc;
         return ERROR_FAIL;
     }
@@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
         xc_shadow_control(ctx->xch, domid, XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL);
     }
 
-    if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV &&
-        libxl_defbool_val(d_config->b_info.u.pv.e820_host)) {
+    if (libxl_defbool_val(d_config->b_info.e820_host)) {
         ret = libxl__e820_alloc(gc, domid, d_config);
         if (ret) {
             LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR,
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index ed99622..d98ca24 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1291,11 +1291,7 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0))
         pci_permissive = l;
 
-    /* To be reworked (automatically enabled) once the auto ballooning
-     * after guest starts is done (with PCI devices passed in). */
-    if (c_info->type == LIBXL_DOMAIN_TYPE_PV) {
-        xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
-    }
+    xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, 0);
 
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         d_config->num_pcidevs = 0;
@@ -1314,7 +1310,7 @@ skip_vfb:
             d_config->num_pcidevs++;
         }
         if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
-            libxl_defbool_set(&b_info->u.pv.e820_host, true);
+            libxl_defbool_set(&b_info->e820_host, true);
     }
 
     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {
diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c
index a16a025..f34f0ba 100644
--- a/tools/libxl/xl_sxp.c
+++ b/tools/libxl/xl_sxp.c
@@ -87,6 +87,10 @@ void printf_info_sexp(int domid, libxl_domain_config *d_config)
         }
     }
 
+    printf("\t(e820_host %s)\n",
+           libxl_defbool_to_string(b_info->e820_host));
+
+
     printf("\t(image\n");
     switch (c_info->type) {
     case LIBXL_DOMAIN_TYPE_HVM:
@@ -150,8 +154,6 @@ void printf_info_sexp(int domid, libxl_domain_config *d_config)
         printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel);
         printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline);
         printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk);
-        printf("\t\t\t(e820_host %s)\n",
-               libxl_defbool_to_string(b_info->u.pv.e820_host));
         printf("\t\t)\n");
         break;
     default:
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 874742c..4796221 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
-
-        spin_lock_init(&d->arch.pv_domain.e820_lock);
     }
 
+    spin_lock_init(&d->arch.e820_lock);
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
     spin_lock_init(&d->arch.vtsc_lock);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 54b1e6a..6c9b58c 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = do_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
@@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
+    case XENMEM_memory_map:
     case XENMEM_decrease_reservation:
         rc = compat_memory_op(cmd, arg);
         current->domain->arch.hvm_domain.qemu_mapcache_invalidate = 1;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7f0e13..4c3ce9a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4740,19 +4740,13 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return rc;
         }
 
-        if ( is_hvm_domain(d) )
-        {
-            rcu_unlock_domain(d);
-            return -EPERM;
-        }
-
         e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries);
         if ( e820 == NULL )
         {
             rcu_unlock_domain(d);
             return -ENOMEM;
         }
-
+
         if ( copy_from_guest(e820, fmap.map.buffer, fmap.map.nr_entries) )
         {
             xfree(e820);
@@ -4760,11 +4754,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -EFAULT;
         }
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
-        xfree(d->arch.pv_domain.e820);
-        d->arch.pv_domain.e820 = e820;
-        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
+        xfree(d->arch.e820);
+        d->arch.e820 = e820;
+        d->arch.nr_e820 = fmap.map.nr_entries;
+        spin_unlock(&d->arch.e820_lock);
 
         rcu_unlock_domain(d);
         return rc;
@@ -4778,26 +4772,26 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
        if ( copy_from_guest(&map, arg, 1) )
            return -EFAULT;
 
-        spin_lock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
 
        /* Backwards compatibility. */
-        if ( (d->arch.pv_domain.nr_e820 == 0) ||
-             (d->arch.pv_domain.e820 == NULL) )
+        if ( (d->arch.nr_e820 == 0) ||
+             (d->arch.e820 == NULL) )
        {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
            return -ENOSYS;
        }
 
-        map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820);
-        if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820,
+        map.nr_entries = min(map.nr_entries, d->arch.nr_e820);
+        if ( copy_to_guest(map.buffer, d->arch.e820,
                            map.nr_entries) ||
            __copy_to_guest(arg, &map, 1) )
        {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
            return -EFAULT;
        }
 
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_unlock(&d->arch.e820_lock);
        return 0;
    }
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index d79464d..c3f9f8e 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -234,11 +234,6 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
-
-    /* Pseudophysical e820 map (XENMEM_memory_map). */
-    spinlock_t e820_lock;
-    struct e820entry *e820;
-    unsigned int nr_e820;
 };
 
 struct arch_domain
@@ -313,6 +308,11 @@ struct arch_domain
                                         (possibly other cases in the future */
     uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */
     uint64_t vtsc_usercount; /* not used for hvm */
+
+    /* Pseudophysical e820 map (XENMEM_memory_map). */
+    spinlock_t e820_lock;
+    struct e820entry *e820;
+    unsigned int nr_e820;
 } __cacheline_aligned;
 
 #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list))
Gordan Bobic
2013-Sep-05 09:41 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Hmm...

gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes -Wdeclaration-after-statement -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o debug.o
gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes -Wdeclaration-after-statement -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o domain.o
domain.c: In function ‘arch_domain_destroy’:
domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’
make[4]: *** [domain.o] Error 1

It would seem you omitted this block from the original patch:

==
@@ -592,8 +592,8 @@ void arch_domain_destroy(struct domain *d)
 {
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+
+    xfree(d->arch.e820);
 
     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
==

Was that intentional?
Does that block look OK to you? Should I re-add it?

Gordan

On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
>> I have this at the point where it actually builds.
>> Otherwise completely untested (will do that later today).
>>
>> Attached are:
>>
>> 1) libxl patch
>> Modified from the original patch to _not_ implicitly enable
>> e820_host when PCI devices are passed.
>>
>> 2) Mukesh's hypervisor e820 patch from here:
>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>> Modified slightly to attempt to address Jan's comment on the same
>> thread, and to adjust the diff line pointers to match against
>> 4.3.0 release code.
>
> I think that was the old version. I spotted a bug in it that
> was causing a hang. And also the one that explains why libxl
> would refuse to setup the E820.
>
> The problem was that in the XENMEM_set_memory_map there was
> a check to make sure that the guest launched was not HVM.
>
> Also there was bug in the initial domain creation where
> the spinlock was only set for PV and not for HVM.
>
>> 3) A patch based on Konrad's earlier in this thread, with
>> a few additions and changes to make it all compile.
>>
>> Some peer review would be most welcome - this is my first
>> venture into Xen code, so please do assume that I have
>> no idea what I'm doing at the moment. :)
>>
>> I added yet another E820MAX #define, this time to
>> tools/firmware/hvmloader/e820.h
>>
>> If there is a better place to #include that via from
>> e820.c, please point me in the right direction.
>
> I think I saw that #define in tools/libxc/xenctrl.h. But since
> the tools/firmware cannot link to the libxc (b/c it is a Minicontained
> OS) I believe just having the #define in hvmloader/e820.h is
> the right call.
>
> Good first pass. I altered it a bit and got in the HVM guest
> the E820 entries printed out.
Here is a big giant diff: > > diff --git a/tools/firmware/hvmloader/e820.c > b/tools/firmware/hvmloader/e820.c > index 2e05e93..3c80241 100644 > --- a/tools/firmware/hvmloader/e820.c > +++ b/tools/firmware/hvmloader/e820.c > @@ -22,6 +22,9 @@ > > #include "config.h" > #include "util.h" > +#include "hypercall.h" > +#include <xen/memory.h> > +#include <errno.h> > > void dump_e820_table(struct e820entry *e820, unsigned int nr) > { > @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, > unsigned int bios_image_base) > { > unsigned int nr = 0; > + struct xen_memory_map op; > + struct e820entry map[E820MAX]; > + int rc; > > if ( !lowmem_reserved_base ) > lowmem_reserved_base = 0xA0000; > > + set_xen_guest_handle(op.buffer, map); > + > + rc = hypercall_memory_op ( XENMEM_memory_map, &op); > + if ( rc != -ENOSYS) { /* It works!? */ > + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, > op.nr_entries); > + dump_e820_table(&map[0], op.nr_entries); > + } > /* Lowmem must be at least 512K to keep Windows happy) */ > ASSERT ( lowmem_reserved_base > 512<<10 ); > > diff --git a/tools/firmware/hvmloader/e820.h > b/tools/firmware/hvmloader/e820.h > index b2ead7f..2fa700d 100644 > --- a/tools/firmware/hvmloader/e820.h > +++ b/tools/firmware/hvmloader/e820.h > @@ -8,6 +8,9 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSABLE 5 > + > +#define E820MAX 128 > > struct e820entry { > uint64_t addr; > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > index 0c32d0b..d8e2346 100644 > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -208,6 +208,8 @@ int libxl__domain_build_info_setdefault(libxl__gc > *gc, > > libxl_defbool_setdefault(&b_info->disable_migrate, false); > > + libxl_defbool_setdefault(&b_info->e820_host, false); > + > switch (b_info->type) { > case LIBXL_DOMAIN_TYPE_HVM: > if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) > @@ -280,7 +282,6 @@ int 
libxl__domain_build_info_setdefault(libxl__gc > *gc, > > break; > case LIBXL_DOMAIN_TYPE_PV: > - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); > if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) > b_info->shadow_memkb = 0; > if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) > diff --git a/tools/libxl/libxl_types.idl > b/tools/libxl/libxl_types.idl > index 85341a0..fd6389a 100644 > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -299,6 +299,8 @@ libxl_domain_build_info = > Struct("domain_build_info",[ > ("irqs", Array(uint32, "num_irqs")), > ("iomem", Array(libxl_iomem_range, "num_iomem")), > ("claim_mode", libxl_defbool), > + # Use host's E820 for PCI passthrough. > + ("e820_host", libxl_defbool), > ("u", KeyedUnion(None, libxl_domain_type, "type", > [("hvm", Struct(None, [("firmware", string), > ("bios", > libxl_bios_type), > @@ -345,8 +347,6 @@ libxl_domain_build_info = > Struct("domain_build_info",[ > ("cmdline", string), > ("ramdisk", string), > ("features", string, {'const': > True}), > - # Use host's E820 for PCI > passthrough. 
> - ("e820_host", libxl_defbool), > ])), > ("invalid", Struct(None, [])), > ], keyvar_init_val = "LIBXL_DOMAIN_TYPE_INVALID")), > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index a78c91d..94515a5 100644 > --- a/tools/libxl/libxl_x86.c > +++ b/tools/libxl/libxl_x86.c > @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, > uint32_t domid, > struct e820entry map[E820MAX]; > libxl_domain_build_info *b_info; > > - if (d_config == NULL || d_config->c_info.type == > LIBXL_DOMAIN_TYPE_HVM) > - return ERROR_INVAL; > - > b_info = &d_config->b_info; > - if (!libxl_defbool_val(b_info->u.pv.e820_host)) > + if (!libxl_defbool_val(b_info->e820_host)) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, > __LINE__); > return ERROR_INVAL; > - > + } > rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > if (rc < 0) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, > __LINE__); > errno = rc; > return ERROR_FAIL; > } > nr = rc; > - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > - (b_info->max_memkb - b_info->target_memkb) + > - b_info->u.pv.slack_memkb); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, > __LINE__, nr); > + if (d_config == NULL || d_config->c_info.type => LIBXL_DOMAIN_TYPE_HVM) { > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - > b_info->target_memkb)); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > + } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - > b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > + } > + > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > if (rc) > return ERROR_FAIL; > > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, > nr:%d",__func__, __LINE__, rc, nr); 
> + > rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); > > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > if (rc < 0) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, > "%s:%d.rc%d",__func__, __LINE__, rc); > errno = rc; > return ERROR_FAIL; > } > @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, > libxl_domain_config *d_config, > xc_shadow_control(ctx->xch, domid, > XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); > } > > - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && > - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { > + if (libxl_defbool_val(d_config->b_info.e820_host)) { > ret = libxl__e820_alloc(gc, domid, d_config); > if (ret) { > LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index ed99622..d98ca24 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -1291,11 +1291,7 @@ skip_vfb: > if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) > pci_permissive = l; > > - /* To be reworked (automatically enabled) once the auto > ballooning > - * after guest starts is done (with PCI devices passed in). 
*/ > - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { > - xlu_cfg_get_defbool(config, "e820_host", > &b_info->u.pv.e820_host, 0); > - } > + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, 0); > > if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { > d_config->num_pcidevs = 0; > @@ -1314,7 +1310,7 @@ skip_vfb: > d_config->num_pcidevs++; > } > if (d_config->num_pcidevs && c_info->type == > LIBXL_DOMAIN_TYPE_PV) > - libxl_defbool_set(&b_info->u.pv.e820_host, true); > + libxl_defbool_set(&b_info->e820_host, true); > } > > switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { > diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c > index a16a025..f34f0ba 100644 > --- a/tools/libxl/xl_sxp.c > +++ b/tools/libxl/xl_sxp.c > @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, > libxl_domain_config *d_config) > } > } > > + printf("\t(e820_host %s)\n", > + libxl_defbool_to_string(b_info->e820_host)); > + > + > printf("\t(image\n"); > switch (c_info->type) { > case LIBXL_DOMAIN_TYPE_HVM: > @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, > libxl_domain_config *d_config) > printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); > printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); > printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); > - printf("\t\t\t(e820_host %s)\n", > - libxl_defbool_to_string(b_info->u.pv.e820_host)); > printf("\t\t)\n"); > break; > default: > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index 874742c..4796221 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, > unsigned int domcr_flags) > { > /* 64-bit PV guest by default. 
*/ > d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; > - > - spin_lock_init(&d->arch.pv_domain.e820_lock); > } > > + spin_lock_init(&d->arch.e820_lock); > /* initialize default tsc behavior in case tools don't */ > tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); > spin_lock_init(&d->arch.vtsc_lock); > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 54b1e6a..6c9b58c 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = do_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, > XEN_GUEST_HANDLE_PARAM(void) arg) > > switch ( cmd & MEMOP_CMD_MASK ) > { > - case XENMEM_memory_map: > case XENMEM_machine_memory_map: > case XENMEM_machphys_mapping: > return -ENOSYS; > + case XENMEM_memory_map: > case XENMEM_decrease_reservation: > rc = compat_memory_op(cmd, arg); > current->domain->arch.hvm_domain.qemu_mapcache_invalidate = > 1; > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index e7f0e13..4c3ce9a 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > return rc; > } > > - if ( is_hvm_domain(d) ) > - { > - rcu_unlock_domain(d); > - return -EPERM; > - } > - > e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); > if ( e820 == NULL ) > { > rcu_unlock_domain(d); > return -ENOMEM; > } > - > + > if ( copy_from_guest(e820, fmap.map.buffer, > fmap.map.nr_entries) ) > { > xfree(e820); > @@ -4760,11 +4754,11 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > return -EFAULT; > } > > - 
spin_lock(&d->arch.pv_domain.e820_lock); > - xfree(d->arch.pv_domain.e820); > - d->arch.pv_domain.e820 = e820; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > + xfree(d->arch.e820); > + d->arch.e820 = e820; > + d->arch.nr_e820 = fmap.map.nr_entries; > + spin_unlock(&d->arch.e820_lock); > > rcu_unlock_domain(d); > return rc; > @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, > XEN_GUEST_HANDLE_PARAM(void) arg) > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > - spin_lock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > > /* Backwards compatibility. */ > - if ( (d->arch.pv_domain.nr_e820 == 0) || > - (d->arch.pv_domain.e820 == NULL) ) > + if ( (d->arch.nr_e820 == 0) || > + (d->arch.e820 == NULL) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -ENOSYS; > } > > - map.nr_entries = min(map.nr_entries, > d->arch.pv_domain.nr_e820); > - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, > + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); > + if ( copy_to_guest(map.buffer, d->arch.e820, > map.nr_entries) || > __copy_to_guest(arg, &map, 1) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -EFAULT; > } > > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return 0; > } > > diff --git a/xen/include/asm-x86/domain.h > b/xen/include/asm-x86/domain.h > index d79464d..c3f9f8e 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -234,11 +234,6 @@ struct pv_domain > > /* map_domain_page() mapping cache. */ > struct mapcache_domain mapcache; > - > - /* Pseudophysical e820 map (XENMEM_memory_map). 
*/ > - spinlock_t e820_lock; > - struct e820entry *e820; > - unsigned int nr_e820; > }; > > struct arch_domain > @@ -313,6 +308,11 @@ struct arch_domain > (possibly other cases in the future > */ > uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ > uint64_t vtsc_usercount; /* not used for hvm */ > + > + /* Pseudophysical e820 map (XENMEM_memory_map). */ > + spinlock_t e820_lock; > + struct e820entry *e820; > + unsigned int nr_e820; > } __cacheline_aligned; > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Gordan Bobic
2013-Sep-05 10:00 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, 05 Sep 2013 10:41:09 +0100, Gordan Bobic <gordan@bobich.net> wrote:> Hmm... > > gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common > -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith > -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default > -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs > -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables > -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include > /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI > -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o > debug.o > gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common > -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith > -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic > -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default > -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs > -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables > -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include > /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI > -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o > domain.o > domain.c: In function ‘arch_domain_destroy’: > domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’ > make[4]: *** [domain.o] Error 1 > > It would seem you omitted this block from the original patch: > > ==> @@ -592,8 +592,8 @@ void arch_domain_destroy(struct domain *d) > { > if ( is_hvm_domain(d) ) > 
        hvm_domain_destroy(d);
> -    else
> -        xfree(d->arch.pv_domain.e820);
> +
> +    xfree(d->arch.e820);
>
>     free_domain_pirqs(d);
>     if ( !is_idle_domain(d) )
> ==
>
> Was that intentional? Does that block look OK to you? Should I re-add
> it?

Just to clarify - re-adding this block fixes the build issue.
Will test tonight whether it runs.

What I really wanted to know is whether this is the correct way
to handle the cleanup in this case.

Gordan

> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote:
>>> I have this at the point where it actually builds.
>>> Otherwise completely untested (will do that later today).
>>>
>>> Attached are:
>>>
>>> 1) libxl patch
>>> Modified from the original patch to _not_ implicitly enable
>>> e820_host when PCI devices are passed.
>>>
>>> 2) Mukesh's hypervisor e820 patch from here:
>>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html
>>> Modified slightly to attempt to address Jan's comment on the same
>>> thread, and to adjust the diff line pointers to match against
>>> 4.3.0 release code.
>>
>> I think that was the old version. I spotted a bug in it that
>> was causing a hang. And also the one that explains why libxl
>> would refuse to setup the E820.
>>
>> The problem was that in the XENMEM_set_memory_map there was
>> a check to make sure that the guest launched was not HVM.
>>
>> Also there was bug in the initial domain creation where
>> the spinlock was only set for PV and not for HVM.
>>
>>>
>>> 3) A patch based on Konrad's earlier in this thread, with
>>> a few additions and changes to make it all compile.
>>>
>>> Some peer review would be most welcome - this is my first
>>> venture into Xen code, so please do assume that I have
>>> no idea what I'm doing at the moment.
:) >>> >>> I added yet another E820MAX #define, this time to >>> tools/firmware/hvmloader/e820.h >>> >>> If there is a better place to #include that via from >>> e820.c, please point me in the right direction. >> >> I think I saw that #define in tools/libxc/xenctrl.h. But since >> the tools/firmware cannot link to the libxc (b/c it is a >> Minicontained >> OS) I believe just having the #define in hvmloader/e820.h is >> the right call. >> >> Good first pass. I altered it a bit and got in the HVM guest >> the E820 entries printed out. Here is a big giant diff: >> >> diff --git a/tools/firmware/hvmloader/e820.c >> b/tools/firmware/hvmloader/e820.c >> index 2e05e93..3c80241 100644 >> --- a/tools/firmware/hvmloader/e820.c >> +++ b/tools/firmware/hvmloader/e820.c >> @@ -22,6 +22,9 @@ >> >> #include "config.h" >> #include "util.h" >> +#include "hypercall.h" >> +#include <xen/memory.h> >> +#include <errno.h> >> >> void dump_e820_table(struct e820entry *e820, unsigned int nr) >> { >> @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, >> unsigned int bios_image_base) >> { >> unsigned int nr = 0; >> + struct xen_memory_map op; >> + struct e820entry map[E820MAX]; >> + int rc; >> >> if ( !lowmem_reserved_base ) >> lowmem_reserved_base = 0xA0000; >> >> + set_xen_guest_handle(op.buffer, map); >> + >> + rc = hypercall_memory_op ( XENMEM_memory_map, &op); >> + if ( rc != -ENOSYS) { /* It works!? 
*/ >> + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >> op.nr_entries); >> + dump_e820_table(&map[0], op.nr_entries); >> + } >> /* Lowmem must be at least 512K to keep Windows happy) */ >> ASSERT ( lowmem_reserved_base > 512<<10 ); >> >> diff --git a/tools/firmware/hvmloader/e820.h >> b/tools/firmware/hvmloader/e820.h >> index b2ead7f..2fa700d 100644 >> --- a/tools/firmware/hvmloader/e820.h >> +++ b/tools/firmware/hvmloader/e820.h >> @@ -8,6 +8,9 @@ >> #define E820_RESERVED 2 >> #define E820_ACPI 3 >> #define E820_NVS 4 >> +#define E820_UNUSABLE 5 >> + >> +#define E820MAX 128 >> >> struct e820entry { >> uint64_t addr; >> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c >> index 0c32d0b..d8e2346 100644 >> --- a/tools/libxl/libxl_create.c >> +++ b/tools/libxl/libxl_create.c >> @@ -208,6 +208,8 @@ int >> libxl__domain_build_info_setdefault(libxl__gc *gc, >> >> libxl_defbool_setdefault(&b_info->disable_migrate, false); >> >> + libxl_defbool_setdefault(&b_info->e820_host, false); >> + >> switch (b_info->type) { >> case LIBXL_DOMAIN_TYPE_HVM: >> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >> @@ -280,7 +282,6 @@ int >> libxl__domain_build_info_setdefault(libxl__gc *gc, >> >> break; >> case LIBXL_DOMAIN_TYPE_PV: >> - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); >> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >> b_info->shadow_memkb = 0; >> if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) >> diff --git a/tools/libxl/libxl_types.idl >> b/tools/libxl/libxl_types.idl >> index 85341a0..fd6389a 100644 >> --- a/tools/libxl/libxl_types.idl >> +++ b/tools/libxl/libxl_types.idl >> @@ -299,6 +299,8 @@ libxl_domain_build_info = >> Struct("domain_build_info",[ >> ("irqs", Array(uint32, "num_irqs")), >> ("iomem", Array(libxl_iomem_range, "num_iomem")), >> ("claim_mode", libxl_defbool), >> + # Use host's E820 for PCI passthrough. 
>> + ("e820_host", libxl_defbool), >> ("u", KeyedUnion(None, libxl_domain_type, "type", >> [("hvm", Struct(None, [("firmware", >> string), >> ("bios", >> libxl_bios_type), >> @@ -345,8 +347,6 @@ libxl_domain_build_info = >> Struct("domain_build_info",[ >> ("cmdline", string), >> ("ramdisk", string), >> ("features", string, >> {'const': True}), >> - # Use host's E820 for PCI >> passthrough. >> - ("e820_host", libxl_defbool), >> ])), >> ("invalid", Struct(None, [])), >> ], keyvar_init_val = >> "LIBXL_DOMAIN_TYPE_INVALID")), >> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >> index a78c91d..94515a5 100644 >> --- a/tools/libxl/libxl_x86.c >> +++ b/tools/libxl/libxl_x86.c >> @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, >> uint32_t domid, >> struct e820entry map[E820MAX]; >> libxl_domain_build_info *b_info; >> >> - if (d_config == NULL || d_config->c_info.type == >> LIBXL_DOMAIN_TYPE_HVM) >> - return ERROR_INVAL; >> - >> b_info = &d_config->b_info; >> - if (!libxl_defbool_val(b_info->u.pv.e820_host)) >> + if (!libxl_defbool_val(b_info->e820_host)) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >> __LINE__); >> return ERROR_INVAL; >> - >> + } >> rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >> if (rc < 0) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >> __LINE__); >> errno = rc; >> return ERROR_FAIL; >> } >> nr = rc; >> - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> - (b_info->max_memkb - b_info->target_memkb) + >> - b_info->u.pv.slack_memkb); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, >> __LINE__, nr); >> + if (d_config == NULL || d_config->c_info.type =>> LIBXL_DOMAIN_TYPE_HVM) { >> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - >> b_info->target_memkb)); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> + } else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { >> + 
rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - >> b_info->target_memkb) + >> + b_info->u.pv.slack_memkb); >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> + } >> + >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> if (rc) >> return ERROR_FAIL; >> >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, >> nr:%d",__func__, __LINE__, rc, nr); >> + >> rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); >> >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> if (rc < 0) { >> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >> "%s:%d.rc%d",__func__, __LINE__, rc); >> errno = rc; >> return ERROR_FAIL; >> } >> @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, >> libxl_domain_config *d_config, >> xc_shadow_control(ctx->xch, domid, >> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); >> } >> >> - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && >> - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { >> + if (libxl_defbool_val(d_config->b_info.e820_host)) { >> ret = libxl__e820_alloc(gc, domid, d_config); >> if (ret) { >> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c >> index ed99622..d98ca24 100644 >> --- a/tools/libxl/xl_cmdimpl.c >> +++ b/tools/libxl/xl_cmdimpl.c >> @@ -1291,11 +1291,7 @@ skip_vfb: >> if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) >> pci_permissive = l; >> >> - /* To be reworked (automatically enabled) once the auto >> ballooning >> - * after guest starts is done (with PCI devices passed in). 
*/ >> - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { >> - xlu_cfg_get_defbool(config, "e820_host", >> &b_info->u.pv.e820_host, 0); >> - } >> + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, >> 0); >> >> if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { >> d_config->num_pcidevs = 0; >> @@ -1314,7 +1310,7 @@ skip_vfb: >> d_config->num_pcidevs++; >> } >> if (d_config->num_pcidevs && c_info->type == >> LIBXL_DOMAIN_TYPE_PV) >> - libxl_defbool_set(&b_info->u.pv.e820_host, true); >> + libxl_defbool_set(&b_info->e820_host, true); >> } >> >> switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { >> diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c >> index a16a025..f34f0ba 100644 >> --- a/tools/libxl/xl_sxp.c >> +++ b/tools/libxl/xl_sxp.c >> @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, >> libxl_domain_config *d_config) >> } >> } >> >> + printf("\t(e820_host %s)\n", >> + libxl_defbool_to_string(b_info->e820_host)); >> + >> + >> printf("\t(image\n"); >> switch (c_info->type) { >> case LIBXL_DOMAIN_TYPE_HVM: >> @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, >> libxl_domain_config *d_config) >> printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); >> printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); >> printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); >> - printf("\t\t\t(e820_host %s)\n", >> - libxl_defbool_to_string(b_info->u.pv.e820_host)); >> printf("\t\t)\n"); >> break; >> default: >> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c >> index 874742c..4796221 100644 >> --- a/xen/arch/x86/domain.c >> +++ b/xen/arch/x86/domain.c >> @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, >> unsigned int domcr_flags) >> { >> /* 64-bit PV guest by default. 
*/ >> d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; >> - >> - spin_lock_init(&d->arch.pv_domain.e820_lock); >> } >> >> + spin_lock_init(&d->arch.e820_lock); >> /* initialize default tsc behavior in case tools don't */ >> tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); >> spin_lock_init(&d->arch.vtsc_lock); >> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c >> index 54b1e6a..6c9b58c 100644 >> --- a/xen/arch/x86/hvm/hvm.c >> +++ b/xen/arch/x86/hvm/hvm.c >> @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> >> switch ( cmd & MEMOP_CMD_MASK ) >> { >> - case XENMEM_memory_map: >> case XENMEM_machine_memory_map: >> case XENMEM_machphys_mapping: >> return -ENOSYS; >> + case XENMEM_memory_map: >> case XENMEM_decrease_reservation: >> rc = do_memory_op(cmd, arg); >> current->domain->arch.hvm_domain.qemu_mapcache_invalidate = >> 1; >> @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> >> switch ( cmd & MEMOP_CMD_MASK ) >> { >> - case XENMEM_memory_map: >> case XENMEM_machine_memory_map: >> case XENMEM_machphys_mapping: >> return -ENOSYS; >> + case XENMEM_memory_map: >> case XENMEM_decrease_reservation: >> rc = compat_memory_op(cmd, arg); >> current->domain->arch.hvm_domain.qemu_mapcache_invalidate = >> 1; >> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c >> index e7f0e13..4c3ce9a 100644 >> --- a/xen/arch/x86/mm.c >> +++ b/xen/arch/x86/mm.c >> @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> return rc; >> } >> >> - if ( is_hvm_domain(d) ) >> - { >> - rcu_unlock_domain(d); >> - return -EPERM; >> - } >> - >> e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); >> if ( e820 == NULL ) >> { >> rcu_unlock_domain(d); >> return -ENOMEM; >> } >> - >> + >> if ( copy_from_guest(e820, fmap.map.buffer, >> fmap.map.nr_entries) ) >> { >> xfree(e820); >> @@ -4760,11 +4754,11 @@ long arch_memory_op(int op, >> 
XEN_GUEST_HANDLE_PARAM(void) arg) >> return -EFAULT; >> } >> >> - spin_lock(&d->arch.pv_domain.e820_lock); >> - xfree(d->arch.pv_domain.e820); >> - d->arch.pv_domain.e820 = e820; >> - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_lock(&d->arch.e820_lock); >> + xfree(d->arch.e820); >> + d->arch.e820 = e820; >> + d->arch.nr_e820 = fmap.map.nr_entries; >> + spin_unlock(&d->arch.e820_lock); >> >> rcu_unlock_domain(d); >> return rc; >> @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, >> XEN_GUEST_HANDLE_PARAM(void) arg) >> if ( copy_from_guest(&map, arg, 1) ) >> return -EFAULT; >> >> - spin_lock(&d->arch.pv_domain.e820_lock); >> + spin_lock(&d->arch.e820_lock); >> >> /* Backwards compatibility. */ >> - if ( (d->arch.pv_domain.nr_e820 == 0) || >> - (d->arch.pv_domain.e820 == NULL) ) >> + if ( (d->arch.nr_e820 == 0) || >> + (d->arch.e820 == NULL) ) >> { >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return -ENOSYS; >> } >> >> - map.nr_entries = min(map.nr_entries, >> d->arch.pv_domain.nr_e820); >> - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, >> + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); >> + if ( copy_to_guest(map.buffer, d->arch.e820, >> map.nr_entries) || >> __copy_to_guest(arg, &map, 1) ) >> { >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return -EFAULT; >> } >> >> - spin_unlock(&d->arch.pv_domain.e820_lock); >> + spin_unlock(&d->arch.e820_lock); >> return 0; >> } >> >> diff --git a/xen/include/asm-x86/domain.h >> b/xen/include/asm-x86/domain.h >> index d79464d..c3f9f8e 100644 >> --- a/xen/include/asm-x86/domain.h >> +++ b/xen/include/asm-x86/domain.h >> @@ -234,11 +234,6 @@ struct pv_domain >> >> /* map_domain_page() mapping cache. */ >> struct mapcache_domain mapcache; >> - >> - /* Pseudophysical e820 map (XENMEM_memory_map). 
*/ >> - spinlock_t e820_lock; >> - struct e820entry *e820; >> - unsigned int nr_e820; >> }; >> >> struct arch_domain >> @@ -313,6 +308,11 @@ struct arch_domain >> (possibly other cases in the future >> */ >> uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ >> uint64_t vtsc_usercount; /* not used for hvm */ >> + >> + /* Pseudophysical e820 map (XENMEM_memory_map). */ >> + spinlock_t e820_lock; >> + struct e820entry *e820; >> + unsigned int nr_e820; >> } __cacheline_aligned; >> >> #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Gordan Bobic
2013-Sep-05 10:26 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> diff --git a/tools/firmware/hvmloader/e820.h > b/tools/firmware/hvmloader/e820.h > index b2ead7f..2fa700d 100644 > --- a/tools/firmware/hvmloader/e820.h > +++ b/tools/firmware/hvmloader/e820.h > @@ -8,6 +8,9 @@ > #define E820_RESERVED 2 > #define E820_ACPI 3 > #define E820_NVS 4 > +#define E820_UNUSABLE 5 > + > +#define E820MAX 128 > > struct e820entry { > uint64_t addr;

I don't think we actually need +#define E820_UNUSABLE 5 any more because it is no longer used anywhere in the patch. Do we need that extra e820 hole type? I guess it's only useful if we want to explicitly signify that a memory hole is inherited from the host e820 map, rather than _really_ needed. Otherwise we could probably just use E820_RESERVED in its place. Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 12:36 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:> On Thu, 05 Sep 2013 10:41:09 +0100, Gordan Bobic <gordan@bobich.net> > wrote: >> Hmm... >> >> gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 >> -Wall -Wstrict-prototypes -Wdeclaration-after-statement >> -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common >> -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith >> -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default >> -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs >> -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables >> -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include >> /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI >> -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .debug.o.d -c debug.c -o >> debug.o >> gcc -O2 -fomit-frame-pointer -m64 -fno-strict-aliasing -std=gnu99 >> -Wall -Wstrict-prototypes -Wdeclaration-after-statement >> -Wno-unused-but-set-variable -DNDEBUG -fno-builtin -fno-common >> -Wredundant-decls -iwithprefix include -Werror -Wno-pointer-arith >> -pipe -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-generic >> -I/root/rpmbuild/BUILD/xen-4.3.0/xen/include/asm-x86/mach-default >> -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs >> -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables >> -DGCC_HAS_VISIBILITY_ATTRIBUTE -nostdinc -g -D__XEN__ -include >> /root/rpmbuild/BUILD/xen-4.3.0/xen/include/xen/config.h -DHAS_ACPI >> -DHAS_GDBSX -DHAS_PASSTHROUGH -MMD -MF .domain.o.d -c domain.c -o >> domain.o >> domain.c: In function ‘arch_domain_destroy’: >> domain.c:595: error: ‘struct pv_domain’ has no member named ‘e820’ >> make[4]: *** [domain.o] Error 1 >> >> It would seem you omitted this block from the original patch: >> >> ==>> @@ -592,8 +592,8 @@ void 
arch_domain_destroy(struct domain *d) >> { >> if ( is_hvm_domain(d) ) >> hvm_domain_destroy(d); >> - else >> - xfree(d->arch.pv_domain.e820); >> + >> + xfree(d->arch.e820); >> >> free_domain_pirqs(d); >> if ( !is_idle_domain(d) ) >> ==>> >> Was that intentional? Does that block look OK to you? Should I re-add > >> it? > > Just to clarify - re-adding this block fixes the build issue. > Will test tonight whether it runs. What I really wanted to > know is whether this is the correct way to handle the cleanup > in this case.It is correct. I must have messed up my tree after I tested it.> > Gordan > >> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk >> <konrad.wilk@oracle.com> wrote: >>> On Wed, Sep 04, 2013 at 02:11:06PM +0100, Gordan Bobic wrote: >>>> I have this at the point where it actually builds. >>>> Otherwise completely untested (will do that later today). >>>> >>>> Attached are: >>>> >>>> 1) libxl patch >>>> Modified from the original patch to _not_ implicitly enable >>>> e820_host when PCI devices are passed. >>>> >>>> 2) Mukesh's hypervisor e820 patch from here: >>>> http://lists.xen.org/archives/html/xen-devel/2013-05/msg01603.html >>>> Modified slightly to attempt to address Jan's comment on the same >>>> thread, and to adjust the diff line pointers to match against >>>> 4.3.0 release code. >>> >>> I think that was the old version. I spotted a bug in it that >>> was causing a hang. And also the one that explains why libxl >>> would refuse to setup the E820. >>> >>> The problem was that in the XENMEM_set_memory_map there was >>> a check to make sure that the guest launched was not HVM. >>> >>> Also there was bug in the initial domain creation where >>> the spinlock was only set for PV and not for HVM. >>> >>>> >>>> 3) A patch based on Konrad's earlier in this thread, with >>>> a few additions and changes to make it all compile. 
>>>> >>>> Some peer review would be most welcome - this is my first >>>> venture into Xen code, so please do assume that I have >>>> no idea what I'm doing at the moment. :) >>>> >>>> I added yet another E820MAX #define, this time to >>>> tools/firmware/hvmloader/e820.h >>>> >>>> If there is a better place to #include that via from >>>> e820.c, please point me in the right direction. >>> >>> I think I saw that #define in tools/libxc/xenctrl.h. But since >>> the tools/firmware cannot link to the libxc (b/c it is a >>> Minicontained >>> OS) I believe just having the #define in hvmloader/e820.h is >>> the right call. >>> >>> Good first pass. I altered it a bit and got in the HVM guest >>> the E820 entries printed out. Here is a big giant diff: >>> >>> diff --git a/tools/firmware/hvmloader/e820.c >>> b/tools/firmware/hvmloader/e820.c >>> index 2e05e93..3c80241 100644 >>> --- a/tools/firmware/hvmloader/e820.c >>> +++ b/tools/firmware/hvmloader/e820.c >>> @@ -22,6 +22,9 @@ >>> >>> #include "config.h" >>> #include "util.h" >>> +#include "hypercall.h" >>> +#include <xen/memory.h> >>> +#include <errno.h> >>> >>> void dump_e820_table(struct e820entry *e820, unsigned int nr) >>> { >>> @@ -74,10 +77,20 @@ int build_e820_table(struct e820entry *e820, >>> unsigned int bios_image_base) >>> { >>> unsigned int nr = 0; >>> + struct xen_memory_map op; >>> + struct e820entry map[E820MAX]; >>> + int rc; >>> >>> if ( !lowmem_reserved_base ) >>> lowmem_reserved_base = 0xA0000; >>> >>> + set_xen_guest_handle(op.buffer, map); >>> + >>> + rc = hypercall_memory_op ( XENMEM_memory_map, &op); >>> + if ( rc != -ENOSYS) { /* It works!? 
*/ >>> + printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >>> op.nr_entries); >>> + dump_e820_table(&map[0], op.nr_entries); >>> + } >>> /* Lowmem must be at least 512K to keep Windows happy) */ >>> ASSERT ( lowmem_reserved_base > 512<<10 ); >>> >>> diff --git a/tools/firmware/hvmloader/e820.h >>> b/tools/firmware/hvmloader/e820.h >>> index b2ead7f..2fa700d 100644 >>> --- a/tools/firmware/hvmloader/e820.h >>> +++ b/tools/firmware/hvmloader/e820.h >>> @@ -8,6 +8,9 @@ >>> #define E820_RESERVED 2 >>> #define E820_ACPI 3 >>> #define E820_NVS 4 >>> +#define E820_UNUSABLE 5 >>> + >>> +#define E820MAX 128 >>> >>> struct e820entry { >>> uint64_t addr; >>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c >>> index 0c32d0b..d8e2346 100644 >>> --- a/tools/libxl/libxl_create.c >>> +++ b/tools/libxl/libxl_create.c >>> @@ -208,6 +208,8 @@ int >>> libxl__domain_build_info_setdefault(libxl__gc *gc, >>> >>> libxl_defbool_setdefault(&b_info->disable_migrate, false); >>> >>> + libxl_defbool_setdefault(&b_info->e820_host, false); >>> + >>> switch (b_info->type) { >>> case LIBXL_DOMAIN_TYPE_HVM: >>> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >>> @@ -280,7 +282,6 @@ int >>> libxl__domain_build_info_setdefault(libxl__gc *gc, >>> >>> break; >>> case LIBXL_DOMAIN_TYPE_PV: >>> - libxl_defbool_setdefault(&b_info->u.pv.e820_host, false); >>> if (b_info->shadow_memkb == LIBXL_MEMKB_DEFAULT) >>> b_info->shadow_memkb = 0; >>> if (b_info->u.pv.slack_memkb == LIBXL_MEMKB_DEFAULT) >>> diff --git a/tools/libxl/libxl_types.idl >>> b/tools/libxl/libxl_types.idl >>> index 85341a0..fd6389a 100644 >>> --- a/tools/libxl/libxl_types.idl >>> +++ b/tools/libxl/libxl_types.idl >>> @@ -299,6 +299,8 @@ libxl_domain_build_info = >>> Struct("domain_build_info",[ >>> ("irqs", Array(uint32, "num_irqs")), >>> ("iomem", Array(libxl_iomem_range, "num_iomem")), >>> ("claim_mode", libxl_defbool), >>> + # Use host's E820 for PCI passthrough. 
>>> + ("e820_host", libxl_defbool), >>> ("u", KeyedUnion(None, libxl_domain_type, "type", >>> [("hvm", Struct(None, [("firmware", >>> string), >>> ("bios", >>> libxl_bios_type), >>> @@ -345,8 +347,6 @@ libxl_domain_build_info = >>> Struct("domain_build_info",[ >>> ("cmdline", string), >>> ("ramdisk", string), >>> ("features", string, >>> {'const': True}), >>> - # Use host's E820 for PCI >>> passthrough. >>> - ("e820_host", libxl_defbool), >>> ])), >>> ("invalid", Struct(None, [])), >>> ], keyvar_init_val = >>> "LIBXL_DOMAIN_TYPE_INVALID")), >>> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >>> index a78c91d..94515a5 100644 >>> --- a/tools/libxl/libxl_x86.c >>> +++ b/tools/libxl/libxl_x86.c >>> @@ -216,28 +216,41 @@ static int libxl__e820_alloc(libxl__gc *gc, >>> uint32_t domid, >>> struct e820entry map[E820MAX]; >>> libxl_domain_build_info *b_info; >>> >>> - if (d_config == NULL || d_config->c_info.type == >>> LIBXL_DOMAIN_TYPE_HVM) >>> - return ERROR_INVAL; >>> - >>> b_info = &d_config->b_info; >>> - if (!libxl_defbool_val(b_info->u.pv.e820_host)) >>> + if (!libxl_defbool_val(b_info->e820_host)) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >>> __LINE__); >>> return ERROR_INVAL; >>> - >>> + } >>> rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >>> if (rc < 0) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.",__func__, >>> __LINE__); >>> errno = rc; >>> return ERROR_FAIL; >>> } >>> nr = rc; >>> - rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> - (b_info->max_memkb - b_info->target_memkb) + >>> - b_info->u.pv.slack_memkb); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.nr:%d",__func__, >>> __LINE__, nr); >>> + if (d_config == NULL || d_config->c_info.type =>>> LIBXL_DOMAIN_TYPE_HVM) { >>> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> + (b_info->max_memkb - >>> b_info->target_memkb)); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> + } 
else if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV) { >>> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >>> + (b_info->max_memkb - >>> b_info->target_memkb) + >>> + b_info->u.pv.slack_memkb); >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> + } >>> + >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> if (rc) >>> return ERROR_FAIL; >>> >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "%s:%d.rc%d, >>> nr:%d",__func__, __LINE__, rc, nr); >>> + >>> rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); >>> >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> if (rc < 0) { >>> + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, >>> "%s:%d.rc%d",__func__, __LINE__, rc); >>> errno = rc; >>> return ERROR_FAIL; >>> } >>> @@ -296,8 +309,7 @@ int libxl__arch_domain_create(libxl__gc *gc, >>> libxl_domain_config *d_config, >>> xc_shadow_control(ctx->xch, domid, >>> XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION, NULL, 0, &shadow, 0, NULL); >>> } >>> >>> - if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_PV && >>> - libxl_defbool_val(d_config->b_info.u.pv.e820_host)) { >>> + if (libxl_defbool_val(d_config->b_info.e820_host)) { >>> ret = libxl__e820_alloc(gc, domid, d_config); >>> if (ret) { >>> LIBXL__LOG_ERRNO(gc->owner, LIBXL__LOG_ERROR, >>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c >>> index ed99622..d98ca24 100644 >>> --- a/tools/libxl/xl_cmdimpl.c >>> +++ b/tools/libxl/xl_cmdimpl.c >>> @@ -1291,11 +1291,7 @@ skip_vfb: >>> if (!xlu_cfg_get_long (config, "pci_permissive", &l, 0)) >>> pci_permissive = l; >>> >>> - /* To be reworked (automatically enabled) once the auto >>> ballooning >>> - * after guest starts is done (with PCI devices passed in). 
*/ >>> - if (c_info->type == LIBXL_DOMAIN_TYPE_PV) { >>> - xlu_cfg_get_defbool(config, "e820_host", >>> &b_info->u.pv.e820_host, 0); >>> - } >>> + xlu_cfg_get_defbool(config, "e820_host", &b_info->e820_host, >>> 0); >>> >>> if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { >>> d_config->num_pcidevs = 0; >>> @@ -1314,7 +1310,7 @@ skip_vfb: >>> d_config->num_pcidevs++; >>> } >>> if (d_config->num_pcidevs && c_info->type == >>> LIBXL_DOMAIN_TYPE_PV) >>> - libxl_defbool_set(&b_info->u.pv.e820_host, true); >>> + libxl_defbool_set(&b_info->e820_host, true); >>> } >>> >>> switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { >>> diff --git a/tools/libxl/xl_sxp.c b/tools/libxl/xl_sxp.c >>> index a16a025..f34f0ba 100644 >>> --- a/tools/libxl/xl_sxp.c >>> +++ b/tools/libxl/xl_sxp.c >>> @@ -87,6 +87,10 @@ void printf_info_sexp(int domid, >>> libxl_domain_config *d_config) >>> } >>> } >>> >>> + printf("\t(e820_host %s)\n", >>> + libxl_defbool_to_string(b_info->e820_host)); >>> + >>> + >>> printf("\t(image\n"); >>> switch (c_info->type) { >>> case LIBXL_DOMAIN_TYPE_HVM: >>> @@ -150,8 +154,6 @@ void printf_info_sexp(int domid, >>> libxl_domain_config *d_config) >>> printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel); >>> printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); >>> printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk); >>> - printf("\t\t\t(e820_host %s)\n", >>> - libxl_defbool_to_string(b_info->u.pv.e820_host)); >>> printf("\t\t)\n"); >>> break; >>> default: >>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c >>> index 874742c..4796221 100644 >>> --- a/xen/arch/x86/domain.c >>> +++ b/xen/arch/x86/domain.c >>> @@ -566,10 +566,9 @@ int arch_domain_create(struct domain *d, >>> unsigned int domcr_flags) >>> { >>> /* 64-bit PV guest by default. 
*/ >>> d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; >>> - >>> - spin_lock_init(&d->arch.pv_domain.e820_lock); >>> } >>> >>> + spin_lock_init(&d->arch.e820_lock); >>> /* initialize default tsc behavior in case tools don't */ >>> tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); >>> spin_lock_init(&d->arch.vtsc_lock); >>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c >>> index 54b1e6a..6c9b58c 100644 >>> --- a/xen/arch/x86/hvm/hvm.c >>> +++ b/xen/arch/x86/hvm/hvm.c >>> @@ -3142,10 +3142,10 @@ static long hvm_memory_op(int cmd, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> >>> switch ( cmd & MEMOP_CMD_MASK ) >>> { >>> - case XENMEM_memory_map: >>> case XENMEM_machine_memory_map: >>> case XENMEM_machphys_mapping: >>> return -ENOSYS; >>> + case XENMEM_memory_map: >>> case XENMEM_decrease_reservation: >>> rc = do_memory_op(cmd, arg); >>> current->domain->arch.hvm_domain.qemu_mapcache_invalidate > >>> 1; >>> @@ -3217,10 +3217,10 @@ static long hvm_memory_op_compat32(int cmd, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> >>> switch ( cmd & MEMOP_CMD_MASK ) >>> { >>> - case XENMEM_memory_map: >>> case XENMEM_machine_memory_map: >>> case XENMEM_machphys_mapping: >>> return -ENOSYS; >>> + case XENMEM_memory_map: >>> case XENMEM_decrease_reservation: >>> rc = compat_memory_op(cmd, arg); >>> current->domain->arch.hvm_domain.qemu_mapcache_invalidate > >>> 1; >>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c >>> index e7f0e13..4c3ce9a 100644 >>> --- a/xen/arch/x86/mm.c >>> +++ b/xen/arch/x86/mm.c >>> @@ -4740,19 +4740,13 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> return rc; >>> } >>> >>> - if ( is_hvm_domain(d) ) >>> - { >>> - rcu_unlock_domain(d); >>> - return -EPERM; >>> - } >>> - >>> e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries); >>> if ( e820 == NULL ) >>> { >>> rcu_unlock_domain(d); >>> return -ENOMEM; >>> } >>> - >>> + >>> if ( copy_from_guest(e820, fmap.map.buffer, >>> fmap.map.nr_entries) ) >>> { >>> xfree(e820); >>> 
@@ -4760,11 +4754,11 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> return -EFAULT; >>> } >>> >>> - spin_lock(&d->arch.pv_domain.e820_lock); >>> - xfree(d->arch.pv_domain.e820); >>> - d->arch.pv_domain.e820 = e820; >>> - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_lock(&d->arch.e820_lock); >>> + xfree(d->arch.e820); >>> + d->arch.e820 = e820; >>> + d->arch.nr_e820 = fmap.map.nr_entries; >>> + spin_unlock(&d->arch.e820_lock); >>> >>> rcu_unlock_domain(d); >>> return rc; >>> @@ -4778,26 +4772,26 @@ long arch_memory_op(int op, >>> XEN_GUEST_HANDLE_PARAM(void) arg) >>> if ( copy_from_guest(&map, arg, 1) ) >>> return -EFAULT; >>> >>> - spin_lock(&d->arch.pv_domain.e820_lock); >>> + spin_lock(&d->arch.e820_lock); >>> >>> /* Backwards compatibility. */ >>> - if ( (d->arch.pv_domain.nr_e820 == 0) || >>> - (d->arch.pv_domain.e820 == NULL) ) >>> + if ( (d->arch.nr_e820 == 0) || >>> + (d->arch.e820 == NULL) ) >>> { >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return -ENOSYS; >>> } >>> >>> - map.nr_entries = min(map.nr_entries, >>> d->arch.pv_domain.nr_e820); >>> - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, >>> + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); >>> + if ( copy_to_guest(map.buffer, d->arch.e820, >>> map.nr_entries) || >>> __copy_to_guest(arg, &map, 1) ) >>> { >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return -EFAULT; >>> } >>> >>> - spin_unlock(&d->arch.pv_domain.e820_lock); >>> + spin_unlock(&d->arch.e820_lock); >>> return 0; >>> } >>> >>> diff --git a/xen/include/asm-x86/domain.h >>> b/xen/include/asm-x86/domain.h >>> index d79464d..c3f9f8e 100644 >>> --- a/xen/include/asm-x86/domain.h >>> +++ b/xen/include/asm-x86/domain.h >>> @@ -234,11 +234,6 @@ struct pv_domain >>> >>> /* map_domain_page() mapping cache. 
*/ >>> struct mapcache_domain mapcache; >>> - >>> - /* Pseudophysical e820 map (XENMEM_memory_map). */ >>> - spinlock_t e820_lock; >>> - struct e820entry *e820; >>> - unsigned int nr_e820; >>> }; >>> >>> struct arch_domain >>> @@ -313,6 +308,11 @@ struct arch_domain >>> (possibly other cases in the future > >>> */ >>> uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ >>> uint64_t vtsc_usercount; /* not used for hvm */ >>> + >>> + /* Pseudophysical e820 map (XENMEM_memory_map). */ >>> + spinlock_t e820_lock; >>> + struct e820entry *e820; >>> + unsigned int nr_e820; >>> } __cacheline_aligned; >>> >>> #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) >>> >>> _______________________________________________ >>> Xen-devel mailing list >>> Xen-devel@lists.xen.org >>> http://lists.xen.org/xen-devel >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Sep-05 12:38 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:> On Wed, 4 Sep 2013 22:04:42 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >> diff --git a/tools/firmware/hvmloader/e820.h >> b/tools/firmware/hvmloader/e820.h >> index b2ead7f..2fa700d 100644 >> --- a/tools/firmware/hvmloader/e820.h >> +++ b/tools/firmware/hvmloader/e820.h >> @@ -8,6 +8,9 @@ >> #define E820_RESERVED 2 >> #define E820_ACPI 3 >> #define E820_NVS 4 >> +#define E820_UNUSABLE 5 >> + >> +#define E820MAX 128 >> >> struct e820entry { >> uint64_t addr; > > I don't think we actually need > +#define E820_UNUSABLE 5 > > any more because it is no longer used anywhere > in the patch. Do we need that extra e820 hole type?

You could extend the dump_e820... code to print that type as well

> I guess it's only useful if we want to explicitly > signify that a memory hole is inherited from > the host e820 map, rather than _really_ needed. > Otherwise we could probably just use E820_RESERVED > in its place.

Originally it was used to cover areas that are RAM in the host but won't be RAM in the guest because the amount of memory the guest has is less than the physical amount.

> > Gordan
Gordan Bobic
2013-Sep-05 21:13 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Right, finally got around to trying this with the latest patch. With e820_host=0 things work as before: (XEN) HVM3: BIOS map: (XEN) HVM3: f0000-fffff: Main BIOS (XEN) HVM3: E820 table: (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM I seem to be getting two different E820 table dumps with e820_host=1: (XEN) HVM1: BIOS map: (XEN) HVM1: f0000-fffff: Main BIOS (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries (XEN) HVM1: E820 table: (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM (XEN) HVM1: E820 table: (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 (XEN) HVM1: [04]: 00000000:fc000000 - 
00000001:00000000: RESERVED (XEN) HVM1: Invoking ROMBIOS ... I cannot quite figure out what is going on here - these tables can't both be true. Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and goes more or less contiguously up to 0xfec8b000. Looking at dmesg on domU, the e820 map more or less matches the second dump above. So I guess that should work - the entire IOMEM area of the host is in fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't there be another usable RAM area after 00000001:00000000 ? Gordan
Gordan Bobic
2013-Sep-05 21:29 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:13 PM, Gordan Bobic wrote:> Right, finally got around to trying this with the latest patch. > > With e820_host=0 things work as before: > > (XEN) HVM3: BIOS map: > (XEN) HVM3: f0000-fffff: Main BIOS > (XEN) HVM3: E820 table: > (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > > > I seem to be getting two different E820 table dumps with e820_host=1: > > (XEN) HVM1: BIOS map: > (XEN) HVM1: f0000-fffff: Main BIOS > (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM1: [03]: 00000000:00100000 - 
00000000:a7800000: RAM > (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM1: Invoking ROMBIOS ... > > I cannot quite figure out what is going on here - these tables can't > both be true. > > Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and > goes more or less contiguously up to 0xfec8b000. > > Looking at dmesg on domU, the e820 map more or less matches the second > dump above. > > So I guess that should work - the entire IOMEM area of the host is in > fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't > there be another usable RAM area after 00000001:00000000 ?

I should probably also mention that the domU does in fact see 8GB of RAM, so clearly it is working. The PCI IOMEM reservations on the host are:

# lspci -vvv | grep Region | grep Memory | sed -e 's/.*Memory at //' | sort
a8000000 (64-bit, prefetchable) [disabled] [size=128M]
b0000000 (64-bit, prefetchable) [disabled] [size=64M]
b4000000 (64-bit, prefetchable) [size=64M]
b8000000 (64-bit, prefetchable) [size=128M]
c0000000 (64-bit, prefetchable) [size=256M]
d7efc000 (32-bit, non-prefetchable) [size=16K]
d8000000 (64-bit, non-prefetchable) [size=64K]
dc000000 (64-bit, non-prefetchable) [size=16K]
f3df4000 (64-bit, non-prefetchable) [size=16K]
f3df8000 (32-bit, non-prefetchable) [size=1K]
f3dfa000 (32-bit, non-prefetchable) [size=1K]
f3dfc000 (32-bit, non-prefetchable) [size=2K]
f3dfe000 (64-bit, non-prefetchable) [size=256]
f3edc000 (64-bit, non-prefetchable) [size=16K]
f3fdc000 (64-bit, non-prefetchable) [size=16K]
f4000000 (32-bit, non-prefetchable) [disabled] [size=32M]
f7ffc000 (32-bit, non-prefetchable) [disabled] [size=16K]
f8000000 (32-bit, non-prefetchable) [size=32M]
fbcfc000 (32-bit, non-prefetchable) [size=16K]
fbdfe000 (64-bit, non-prefetchable) [disabled] [size=8K]
fbeef000 (32-bit, non-prefetchable) [size=2K]
fbeefc00 (32-bit, non-prefetchable) [size=16]
fec8a000 (32-bit, non-prefetchable) [size=4K]

What is a little concerning is that my GPU in dom0 has its IOMEM mapped at E0000000-E7FFFFFF E8000000-EBFFFFFF EC000000-EDFFFFFF Granted, this fits into a convenient hole in the host map 0xdc004000-0xf3df4000, but I cannot see that hole being listed as such in the xl dmesg E820 table dump. Is this _really_ working, or is it working by pure luck? Gordan
Gordan Bobic
2013-Sep-05 21:46 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:29 PM, Gordan Bobic wrote:> On 09/05/2013 10:13 PM, Gordan Bobic wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can''t >> both be true. >> >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. >> >> So I guess that should work - the entire IOMEM area of the host is in >> fact not mapped. But since I''ve passed 8GB of RAM to domU, shouldn''t >> there be another usable RAM area after 00000001:00000000 ? > > I should probably also mention that the domU does in fact see 8GB of > RAM, so clearly it is working. > > The PCI IOMEM reservations on the host are: > # lspci -vvv | grep Region | grep Memory | sed -e ''s/.*Memory at //'' | sort > a8000000 (64-bit, prefetchable) [disabled] [size=128M] > b0000000 (64-bit, prefetchable) [disabled] [size=64M] > b4000000 (64-bit, prefetchable) [size=64M] > b8000000 (64-bit, prefetchable) [size=128M] > c0000000 (64-bit, prefetchable) [size=256M] > d7efc000 (32-bit, non-prefetchable) [size=16K] > d8000000 (64-bit, non-prefetchable) [size=64K] > dc000000 (64-bit, non-prefetchable) [size=16K] > f3df4000 (64-bit, non-prefetchable) [size=16K] > f3df8000 (32-bit, non-prefetchable) [size=1K] > f3dfa000 (32-bit, non-prefetchable) [size=1K] > f3dfc000 (32-bit, non-prefetchable) [size=2K] > f3dfe000 (64-bit, non-prefetchable) [size=256] > f3edc000 (64-bit, non-prefetchable) [size=16K] > f3fdc000 (64-bit, non-prefetchable) [size=16K] > f4000000 (32-bit, non-prefetchable) [disabled] [size=32M] > f7ffc000 (32-bit, non-prefetchable) [disabled] [size=16K] > f8000000 (32-bit, non-prefetchable) [size=32M] > fbcfc000 (32-bit, non-prefetchable) [size=16K] > fbdfe000 
(64-bit, non-prefetchable) [disabled] [size=8K] > fbeef000 (32-bit, non-prefetchable) [size=2K] > fbeefc00 (32-bit, non-prefetchable) [size=16] > fec8a000 (32-bit, non-prefetchable) [size=4K] > > > What is a little concerning is that my GPU in dom0 has its IOMEM mapped at > E0000000-E7FFFFFF > E8000000-EBFFFFFF > EC000000-EDFFFFFF > > Granted, this fits into a convenient hole in the host map > 0xdc004000-0xf3df4000 but I cannot see that hole being listed as such in > the xl dmesg E820 table dump. Is this _really_ working, or is it working > by pure luck?

Just doing a bit of testing at the moment. I haven't had a crash yet (it would have happened by now, as things were before). But - I am definitely getting the sort of graphical glitching/corruption in 3D applications that I saw before when assigning > 2688MB of RAM to the domU. That implies that there is still some memory overwriting happening somewhere.

Aaand just as I was typing that - I've just had a crash. :'( Back to the drawing board...

Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 22:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:>Right, finally got around to trying this with the latest patch. > >With e820_host=0 things work as before: > >(XEN) HVM3: BIOS map: >(XEN) HVM3: f0000-fffff: Main BIOS >(XEN) HVM3: E820 table: >(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > > >I seem to be getting two different E820 table dumps with e820_host=1: > >(XEN) HVM1: BIOS map: >(XEN) HVM1: f0000-fffff: Main BIOS >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >(XEN) HVM1: E820 table: >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >(XEN) HVM1: E820 table: >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >(XEN) HVM1: HOLE: 
00000000:a7800000 - 00000000:fc000000 >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >(XEN) HVM1: Invoking ROMBIOS ... > >I cannot quite figure out what is going on here - these tables can't >both be true. >

Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally.

The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done.

>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >goes more or less contiguously up to 0xfec8b000. > >Looking at dmesg on domU, the e820 map more or less matches the second >dump above.

Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

>So I guess that should work - the entire IOMEM area of the host is in >fact not mapped. But since I've passed 8GB of RAM to domU, shouldn't >there be another usable RAM area after 00000001:00000000 ? > >Gordan
Gordan Bobic
2013-Sep-05 22:33 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 10:13 PM, Gordan Bobic wrote:> I seem to be getting two different E820 table dumps with e820_host=1: > > (XEN) HVM1: BIOS map: > (XEN) HVM1: f0000-fffff: Main BIOS > (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM

I get it - this is the host e820 map. In dom0, dmesg shows:

e820: BIOS-provided physical RAM map:
Xen: [mem 0x0000000000000000-0x000000000009cfff] usable
Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved
Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable
Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data
Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS
Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved
Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved
Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved
Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable

That tallies up with the above map exactly. So far so good. Not sure if the following is relevant, but here it is anyway just in case:

e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
[...]
e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000
e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000
[...]
Zone ranges:
  DMA      [mem 0x00001000-0x00ffffff]
  DMA32    [mem 0x01000000-0xffffffff]
  Normal   [mem 0x100000000-0xcbfffffff]
[...]
e820: [mem 0x40000000-0xfedfffff] available for PCI devices

> (XEN) HVM1: E820 table: > (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > (XEN) HVM1: Invoking ROMBIOS ...

Comparing this to the above, it seems that 9d000-9e000 is marked as reserved in dom0, but RAM in domU. Am I right in thinking that dom0(usable) == domU(RAM) in terms of meaning?

What does "HOLE" actually mean in domU? Does it mean this space is OK to map domU IOMEM into? Or something else? Either way, the full possible clash summary:

dom0: reserved 9d000-9e000
domU: RAM 9d000-9e000

dom0: reserved a0000-dffff
domU: HOLE a0000-dffff

dom0: ACPI data 3f790000-3f79dfff
dom0: ACPI NVS 3f79e000-3f7cffff
dom0: reserved 3f7d0000-3f7dffff
dom0: reserved
domU: RAM 00100000-a7800000

Then there seems to be a hole in dom0: 40000000-fedfffff, which tallies up with the dom0 dmesg output above about it being for the PCI devices, i.e. that's the IOMEM region (from 1GB to a little under 4GB).

But in domU, 40000000-a77fffff is available as RAM.

On the face of it, that's actually fine - my PCI IOMEM mappings show the lowest mapping (according to lspci -vvv) starts at a8000000, which falls into the domU area marked as "HOLE" (a7800000-fc000000). And this does in fact appear to be where domU maps the GPU in both of my VMs:

E0000000-E7FFFFFF
E8000000-EBFFFFFF
EC000000-EDFFFFFF

and this doesn't overlap with any mapped PCI IOMEM according to lspci.
If we assume that anything below a8000000 doesn't actually matter in this case (since if I give up to a8000000 of memory to a domU, everything works absolutely fine indefinitely), I am at a loss to explain what is actually going wrong and why the crash is still occurring - unless some other piece of hardware is having its domU IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is causing a memory overwrite.

I am just not seeing any obvious memory stomp at the moment...

Gordan
Gordan Bobic
2013-Sep-05 22:42 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:> Gordan Bobic <gordan@bobich.net> wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can't >> both be true. >> > > Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally. > > The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done. > > >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. > > Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

/me *facepalms* That indeed explains everything. :)

But having had a thorough look through the memory mappings (see my other long, rambling email), I don't actually see an obvious area where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" part isn't mapped as RAM in domU.

Or to summarize: dom0 PCI IOMEM actually has mappings from a8000000 onward, and giving domU up to that much memory works fine. So the memory stomp must be happening from a8000000 onward. But - the only things above that address in domU are the HOLE up to fc000000 and RESERVED up to ffffffff. So no domU memory is getting mapped into the IOMEM range anyway - which begs the question of what is _actually_ causing the crash. Stuff I haven't yet found in domU getting mapped into the a7800000-fc000000 hole overlapping dom0 IOMEM? SeaBIOS doing something odd in the fc000000-fec8b000 range marked RESERVED in domU? Or am I reading this all wrong?

Gordan
Gordan Bobic
2013-Sep-05 22:45 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:> Gordan Bobic <gordan@bobich.net> wrote: >> Right, finally got around to trying this with the latest patch. >> >> With e820_host=0 things work as before: >> >> (XEN) HVM3: BIOS map: >> (XEN) HVM3: f0000-fffff: Main BIOS >> (XEN) HVM3: E820 table: >> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> I seem to be getting two different E820 table dumps with e820_host=1: >> >> (XEN) HVM1: BIOS map: >> (XEN) HVM1: f0000-fffff: Main BIOS >> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> (XEN) HVM1: E820 table: >> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> (XEN) HVM1: [02]: 
00000000:000e0000 - 00000000:00100000: RESERVED >> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> (XEN) HVM1: Invoking ROMBIOS ... >> >> I cannot quite figure out what is going on here - these tables can't >> both be true. >> > > Right. The code just prints the E820 that was constructed because of the e820_host=1 parameter as the first output. The second one is what was constructed originally. > > The code that would tie in the E820 from the hypercall and alter how hvmloader sets it up is not yet done. > > >> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >> goes more or less contiguously up to 0xfec8b000. >> >> Looking at dmesg on domU, the e820 map more or less matches the second >> dump above. > > Right. That is correct, since the patch I sent just outputs stuff. No real changes to the E820 yet.

I thought this did that in hvmloader/e820.c:

hypercall_memory_op(XENMEM_memory_map, &op);

Gordan
Konrad Rzeszutek Wilk
2013-Sep-05 23:01 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Gordan Bobic <gordan@bobich.net> wrote:>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >> Gordan Bobic <gordan@bobich.net> wrote: >>> Right, finally got around to trying this with the latest patch. >>> >>> With e820_host=0 things work as before: >>> >>> (XEN) HVM3: BIOS map: >>> (XEN) HVM3: f0000-fffff: Main BIOS >>> (XEN) HVM3: E820 table: >>> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >>> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >>> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >>> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >>> >>> >>> I seem to be getting two different E820 table dumps with >e820_host=1: >>> >>> (XEN) HVM1: BIOS map: >>> (XEN) HVM1: f0000-fffff: Main BIOS >>> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >>> (XEN) HVM1: E820 table: >>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >>> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >>> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >>> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >>> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >>> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >>> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >>> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >>> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >>> (XEN) HVM1: E820 table: >>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>> 
(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >>> (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >>> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >>> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>> (XEN) HVM1: Invoking ROMBIOS ... >>> >>> I cannot quite figure out what is going on here - these tables can't >>> both be true. >>> >> >> Right. The code just prints the E820 that was constructed because of the >e820_host=1 parameter as the first output. The second one is >what was constructed originally. >> >> The code that would tie in the E820 from the hypercall and alter >how hvmloader sets it up is not yet done. >> >> >>> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and >>> goes more or less contiguously up to 0xfec8b000. >>> >>> Looking at dmesg on domU, the e820 map more or less matches the >second >>> dump above. >> >> Right. That is correct, since the patch I sent just outputs stuff. >No real changes to the E820 yet. > >I thought this did that in hvmloader/e820.c: >hypercall_memory_op(XENMEM_memory_map, &op); > >Gordan

No. That just gets the E820 that is stashed in the hypervisor for the guest. A PV guest would use it, but hvmloader does not. This is what would need to be implemented to allow hvmloader to construct the E820 on its own.
Gordan Bobic
2013-Sep-06 12:23 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> Gordan Bobic <gordan@bobich.net> wrote: >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >>> Gordan Bobic <gordan@bobich.net> wrote: >>>> Right, finally got around to trying this with the latest patch. >>>> >>>> With e820_host=0 things work as before: >>>> >>>> (XEN) HVM3: BIOS map: >>>> (XEN) HVM3: f0000-fffff: Main BIOS >>>> (XEN) HVM3: E820 table: >>>> (XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>>> (XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>>> (XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >>>> (XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>>> (XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >>>> (XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >>>> (XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>>> (XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >>>> >>>> >>>> I seem to be getting two different E820 table dumps with >>e820_host=1: >>>> >>>> (XEN) HVM1: BIOS map: >>>> (XEN) HVM1: f0000-fffff: Main BIOS >>>> (XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >>>> (XEN) HVM1: E820 table: >>>> (XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >>>> (XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >>>> (XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >>>> (XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >>>> (XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >>>> (XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >>>> (XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >>>> (XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >>>> (XEN) HVM1: E820 table: >>>> (XEN) 
HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >>>> (XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >>>> (XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >>>> (XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >>>> (XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >>>> (XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >>>> (XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >>>> (XEN) HVM1: Invoking ROMBIOS ... >>>> >>>> I cannot quite figure out what is going on here - these tables >>>> can't >>>> both be true. >>>> >>> >>> Right. The code just prints the E820 that was constructed because of >>> the >>e820_host=1 parameter as the first output. The second one is >>what was constructed originally. >>> >>> The code that would tie in the E820 from the hypercall and >>> alter >>how hvmloader sets it up is not yet done. >>> >>> >>>> Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 >>>> and >>>> goes more or less contiguously up to 0xfec8b000. >>>> >>>> Looking at dmesg on domU, the e820 map more or less matches the >>second >>>> dump above. >>> >>> Right. That is correct, since the patch I sent just outputs stuff. >>No real changes to the E820 yet. >> >>I thought this did that in hvmloader/e820.c: >>hypercall_memory_op(XENMEM_memory_map, &op); >> >>Gordan > > No. That just gets the E820 that is stashed in the hypervisor for > the guest. A PV guest would use it, but hvmloader does not. This is > what would need to be implemented to allow hvmloader to construct the > E820 on its own.

Right. So in hvmloader/e820.c we now have the host-based map in

struct e820entry map[E820MAX];

The rest of the function then goes and constructs the standard HVM e820 map in the passed-in

struct e820entry *e820

So all that needs to happen here is: if e820_host is set, fill e820[] by copying map[] up to hvm_info->low_mem_pgend (or hvm_info->high_mem_pgend if it is set).

I am guessing that SeaBIOS and other existing stuff might break if the host map is just copied in verbatim, so presumably I need to add/dedupe the non-RAM parts of the maps. Is that right? Nothing else needs to happen?

The following questions arise:

1) What to do in case of overlaps? On my specific hardware, the key difference in the end map will be that the hole at:

(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000

will end up being created in domU.

2) Do only the holes need to be pulled from the host, or the entire map? Would hvmloader/seabios/whatever know what to do if passed a map that is different from what they might expect (i.e. different from what the current hvmloader provides)? Or would this be likely to cause extensive further breakages?

3) At the moment I am leaning toward just pulling in the holes from the host e820, mirroring them in domU.

3.1) Marking them as "reserved" would likely fix the problem that was my primary motivation for doing this in the first place. Having said that - with all of the 1GB-3GB space marked as reserved, I'm not sure where the IOMEM would end up mapped in domU - things might just break. If marking the dom0 hole as a hole in domU without ensuring pBAR=vBAR, the PCI device in domU might get mapped where another device is in dom0, which might cause the same problem.

At the moment, I think the expedient thing to do is make domU map holes as per dom0 and ignore other non-RAM areas. This may (by luck) or may not fix my immediate problem (RAM in domU clobbering the host's mapped IOMEM), but at least it would cover the prerequisite hole mapping for the next step, which is vBAR=pBAR.

In light of this, however, depending on the answer to 2) above, it may not be practical for the e820_host option to do what it actually means for HVMs, at least not to the same extent as happens for PV. It would only do a part of it (initial vHOLE=pHOLE, to later be extended to the more specific case of vBAR=pBAR).

Does this sound reasonable?

Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, Sep 05, 2013 at 11:33:18PM +0100, Gordan Bobic wrote:> On 09/05/2013 10:13 PM, Gordan Bobic wrote: > > >I seem to be getting two different E820 table dumps with e820_host=1: > > > >(XEN) HVM1: BIOS map: > >(XEN) HVM1: f0000-fffff: Main BIOS > >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >(XEN) HVM1: E820 table: > >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > > I get it - this is the host e820 map. In dom0, dmesg shows: > > e820: BIOS-provided physical RAM map: > Xen: [mem 0x0000000000000000-0x000000000009cfff] usable > Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved > Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable > Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data > Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS > Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved > Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved > Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved > Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved > Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable > > That tallies up with the above map exactly. So far so good. 
Not sure > if the following is relevant, but here it is anyway just in case: > > e820: update [mem 0x00000000-0x00000fff] usable ==> reserved > e820: remove [mem 0x000a0000-0x000fffff] usable > [...] > e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000 > e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000 > [...] > Zone ranges: > DMA [mem 0x00001000-0x00ffffff] > DMA32 [mem 0x01000000-0xffffffff] > Normal [mem 0x100000000-0xcbfffffff] > [...] > e820: [mem 0x40000000-0xfedfffff] available for PCI devices > > > >(XEN) HVM1: E820 table: > >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >(XEN) HVM1: Invoking ROMBIOS ... > > Comparing this to the above, it seems that 9d000-9e000 is marked as > reserved in dom0, but RAM in domU. Am I right in thinking that > dom0(usable) == domU(RAM) in terms of meaning? > > What does "HOLE" actually mean in domU? Does it mean this space is > OK to map domU IOMEM into? Or something else? Either way, full > possible clash summary: > > dom0: reserved 9d000-9e000 > domU: RAM 9d000-9e000 > > dom0: reserved a0000-dffff > domU: HOLE a0000-dffff > > dom0: ACPI data 3f790000-3f79dfff > dom0: ACPI NVS 3f79e000-3f7cffff > dom0: reserved 3f7d0000-3f7dffff > dom0: reserved.. you are missing a range here.> domU: RAM 00100000-a7800000 > > Then there seems to be a hole in dom0: > 40000000-fedfffff which tallies up with the dom0 dmesg output above > about it being for the PCI devices, i.e. that's the IOMEM region > (from 1GB to a little under 4GB).
> > But in domU, the 40000000-a77fffff is available as RAM. OK, so that is the goal - make hvmloader construct the E820 memory layout and all of its pieces to fit that layout.> > On the face of it, that's actually fine - my PCI IOMEM mappings show > the lowest mapping (according to lspci -vvv) starts at a8000000, <surprise> > which falls into the domU area marked as "HOLE" (a7800000-fc000000). > And this does in fact appear to be where domU maps the GPU in both > of my VMs: > > E0000000-E7FFFFFF > E8000000-EBFFFFFF > EC000000-EDFFFFFF > > and this doesn't overlap with any mapped PCI IOMEM according to lspci. > > If we assume that anything below a8000000 doesn't actually matter in > this case (since if I give up to a8000000 memory to a domU > everything works absolutely fine indefinitely, I am at a loss to Just to make sure I am not leading you astray. You are getting _no_ crashes when you have a guest with 1GB? > explain what is actually going wrong and why the crash is still > occurring - unless some other piece of hardware is having its domU > IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is > causing a memory overwrite. > > I am just not seeing any obvious memory stomp at the moment... Neither am I. > > Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:09 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Thu, Sep 05, 2013 at 11:42:38PM +0100, Gordan Bobic wrote:> On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: > >Gordan Bobic <gordan@bobich.net> wrote: > >>Right, finally got around to trying this with the latest patch. > >> > >>With e820_host=0 things work as before: > >> > >>(XEN) HVM3: BIOS map: > >>(XEN) HVM3: f0000-fffff: Main BIOS > >>(XEN) HVM3: E820 table: > >>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > >>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > >>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > >> > >> > >>I seem to be getting two different E820 table dumps with e820_host=1: > >> > >>(XEN) HVM1: BIOS map: > >>(XEN) HVM1: f0000-fffff: Main BIOS > >>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >>(XEN) HVM1: E820 table: > >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED > >>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > >>(XEN) HVM1: E820 table: > >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>(XEN) HVM1: [01]: 00000000:0009e000 - 
00000000:000a0000: RESERVED > >>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>(XEN) HVM1: Invoking ROMBIOS ... > >> > >>I cannot quite figure out what is going on here - these tables can''t > >>both be true. > >> > > > >Right. The code just prints the E820 that was constructed b/c of the e820_host =1 parameter as the first output. Then the second one is what was constructed originally. > > > >The code that would tie in the E820 from the hyper call and the alter how the hvmloader sets it up is not yet done. > > > > > >>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 and > >>goes more or less contiguously up to 0xfec8b000. > >> > >>Looking at dmesg on domU, the e820 map more or less matches the second > >>dump above. > > > >Right. That is correct since the patch I sent just outputs stuff. No real changes to the E820 yet. > > /me *facepalms* > > That indeed explains everything. :) > > But having had a thorough look through the memory mappings (see my > other long, rambling email), I don''t actually see an obvious area > where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" > part isn''t mapped as RAM in domU. > > Or to summarize: > dom0 PCI IOMEM actually has mappings from a8000000 onward, and > giving domU up to that much memory works fine. So the memory stomp > must be happening from a8000000 onward. But - the only things above > that address in domU are the HOLE up to fc000000 and RESERVED up to > ffffffff. So no domU memory is getting mapped into the IOMEM range > anyway - which begs the question of what is _actually_ causing the > crash. Stuff I haven''t yet found in domU getting mapped into the > a7800000-fc000000 hole overlapping dom0 IOMEM? 
SeaBIOS doing > something odd in the fc000000-fec8b000 range marked RESERVED in domU? There were some assumptions with that region and that stuff could be stuck in there (like ACPI tables and SMBIOS I think). Perhaps a better question is - are any of the BARs of your card overlapping with the RESERVED range in the domU? Or if you grep through the hvmloader code, are there any addresses that look to be within that range? Incidentally could you send the output of lspci -vvvv from your output in the guest and in dom0 please? Thanks.> > Or am I reading this all wrong? You are on the right track I think. There is some assumption made about the RESERVED and HOLE that I think is conflicting with what the card thinks of. Another way to figure out what is happening is to crank up the verbosity of the driver in the domU. Specifically there is a CONFIG_MMIO_TRACE (or something like that) that will tell you the physical address the PCI cards are using and what it is writing in it. It could help in identifying _where_ the graphic card is writing/reading from. And also the last moment when it wrote something.> > Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 13:20 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, Sep 06, 2013 at 01:23:19PM +0100, Gordan Bobic wrote:> On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >Gordan Bobic <gordan@bobich.net> wrote: > >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: > >>>Gordan Bobic <gordan@bobich.net> wrote: > >>>>Right, finally got around to trying this with the latest patch. > >>>> > >>>>With e820_host=0 things work as before: > >>>> > >>>>(XEN) HVM3: BIOS map: > >>>>(XEN) HVM3: f0000-fffff: Main BIOS > >>>>(XEN) HVM3: E820 table: > >>>>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>>>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>>>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>>>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>>>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM > >>>>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 > >>>>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>>>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM > >>>> > >>>> > >>>>I seem to be getting two different E820 table dumps with > >>e820_host=1: > >>>> > >>>>(XEN) HVM1: BIOS map: > >>>>(XEN) HVM1: f0000-fffff: Main BIOS > >>>>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries > >>>>(XEN) HVM1: E820 table: > >>>>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM > >>>>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI > >>>>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS > >>>>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 > >>>>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > >>>>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 > >>>>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: 
RESERVED > >>>>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM > >>>>(XEN) HVM1: E820 table: > >>>>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM > >>>>(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED > >>>>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 > >>>>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED > >>>>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM > >>>>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 > >>>>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED > >>>>(XEN) HVM1: Invoking ROMBIOS ... > >>>> > >>>>I cannot quite figure out what is going on here - these > >>>>tables can''t > >>>>both be true. > >>>> > >>> > >>>Right. The code just prints the E820 that was constructed b/c > >>>of the > >>e820_host =1 parameter as the first output. Then the second one is > >>what was constructed originally. > >>> > >>>The code that would tie in the E820 from the hyper call and > >>>the alter > >>how the hvmloader sets it up is not yet done. > >>> > >>> > >>>>Looking at the IOMEM on the host, the IOMEM begins at > >>>>0xa8000000 and > >>>>goes more or less contiguously up to 0xfec8b000. > >>>> > >>>>Looking at dmesg on domU, the e820 map more or less matches the > >>second > >>>>dump above. > >>> > >>>Right. That is correct since the patch I sent just outputs stuff. > >>No real changes to the E820 yet. > >> > >>I thought this did that in hvmloader/e820c: > >>hypercall_memory_op ( XENMEM_memory_map, &op); > >> > >>Gordan > > > >No. They just gets the E820 that is stashed in the hypervisor for > >the guest. The PV guest would use it but hvmloader is not. This is > >what would needed to be implemented to allow hvmloader construct the > >E820 on its own. > > Right. 
So in hvmloader/e820.c we now have the host-based map in > struct e820entry map[E820MAX]; > > The rest of the function then goes and constructs the standard HVM > e820 map in the passed-in > struct e820entry *e820 > > So all that needs to happen here is, if e820_host is set, fill e820[] > by copying map[] up to the hvm_info->low_mem_pgend > (or hvm_info->high_mem_pgend if it is set). I am guessing that Right. And then the overflow would be put past 4GB. Or fill in the E820_RAM regions with it.> SeaBIOS and other existing stuff might break if the host map is > just copied in verbatim, so presumably I need to add/dedupe the > non-RAM parts of the maps. Probably. Or tweak SeaBIOS to use your E820. Also you need to figure out where hvmloader constructs the ACPI and SMBIOS tables and make sure they are within the E820_RESERVED regions.> > Is that right? Nothing else needs to happen? HA! You are going to hit some bugs probably :-)> > The following questions arise: > > 1) What to do in case of overlaps? On my specific hardware, > the key difference in the end map will be that the hole at: > (XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 > will end up being created in domU. The hole is also known as the PCI gap or MMIO region. With e820_host in effect you should use the host's layout and its hole placement. That will replicate it and make domU's E820 hole look like the host's.> > 2) Do only the holes need to be pulled from the host or > the entire map? Would hvmloader/seabios/whatever know > what to do if passed a map that is different from what > they might expect (i.e. different from what the current > hvmloader provides)? Or would this be likely to cause > extensive further breakages? I think there are some assumptions made about where the hole starts.
Those would have to be made more dynamic to deal with a different E820 layout.> > 3) At the moment I am leaning toward just pulling in the > holes from the host e820, mirroring them in domU. <nods>> 3.1) Marking them as "reserved" would likely fix the > problem that was my primary motivation for doing this > in the first place. Having said that - with all of That unfortunately will make them neither gaps nor MMIO regions. Meaning the kernel will scream: "You have a BAR in an E820 reserved region! That is bad!", and won't set up the card. The hole needs to be replicated in the guest.> the 1GB-3GB space marked as reserved, I'm not sure where > the IOMEM would end up mapped in domU - things might just > break. If marking the dom0 hole as a hole in domU without > ensuring pBAR=vBAR, the PCI device in domU might get > mapped where another device is in dom0, which might > cause the same problem. Right. hvmloader could (I hadn't checked the code) scan the E820 and determine that the PCI BARs are within the E820_RESERVED regions and try to move them to a hole. Since no hole would be found below 4GB it would remap the PCI BAR above 4GB. That - depending on the device - could be disastrous for the device. That is, if it is only capable of 32-bit DMAs it will never do anything.> > At the moment, I think the expedient thing to do is make > domU map holes as per dom0 and ignore other non-RAM <nods>> areas. This may (by luck) or may not fix my immediate problem > (RAM in domU clobbering host's mapped IOMEM), but at > least it would cover the prerequisite hole mapping for > the next step which is vBAR=pBAR. <nods>> > In light of this, however, depending on the answer to 2) > above, it may not be practical for the e820_host option to do I think it will mean you need to look in the hvmloader directory a bit more and find all of the assumptions it makes about memory locations.
One excellent tool is to do 'git log -p tools/hvmloader' as it will tell you what changes have been done to address the memory layout construction.> what it actually means for HVMs, at least not to the same > extent as happens for PV. It would only do a part of it > (initial vHOLE=pHOLE, to later be extended to the more > specific case of vBAR=pBAR). > > Does this sound reasonable? Yes. I think the plan you outlined is sound. The difficulty is going to be cramming the E820 constructed by e820_host into hvmloader and making sure that all the other parts of it (SMBIOS, ACPI, BIOS) will be more dynamic and use dynamic locations instead of hard-coded values. Loads of printks can help with that :-) The awesome thing is that it will make hvmloader a lot more flexible. And one can extend e820_host to construct an E820 that is bizarre, for testing even more absurd memory layouts (say, no RAM below 4GB). Keep on digging! Thanks for the great analysis.> > Gordan
Gordan Bobic
2013-Sep-06 13:34 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:04:35 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Thu, Sep 05, 2013 at 11:33:18PM +0100, Gordan Bobic wrote: >> On 09/05/2013 10:13 PM, Gordan Bobic wrote: >> >> >I seem to be getting two different E820 table dumps with >> e820_host=1: >> > >> >(XEN) HVM1: BIOS map: >> >(XEN) HVM1: f0000-fffff: Main BIOS >> >(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> >(XEN) HVM1: E820 table: >> >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> >(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> >(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> >(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> >(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> >(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> >(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> >(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> >> I get it - this is the host e820 map. In dom0, dmesg shows: >> >> e820: BIOS-provided physical RAM map: >> Xen: [mem 0x0000000000000000-0x000000000009cfff] usable >> Xen: [mem 0x000000000009d000-0x00000000000fffff] reserved >> Xen: [mem 0x0000000000100000-0x000000003f78ffff] usable >> Xen: [mem 0x000000003f790000-0x000000003f79dfff] ACPI data >> Xen: [mem 0x000000003f79e000-0x000000003f7cffff] ACPI NVS >> Xen: [mem 0x000000003f7d0000-0x000000003f7dffff] reserved >> Xen: [mem 0x000000003f7e7000-0x000000003fffffff] reserved >> Xen: [mem 0x00000000fee00000-0x00000000fee00fff] reserved >> Xen: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved >> Xen: [mem 0x0000000100000000-0x0000000cbfffffff] usable >> >> That tallies up with the above map exactly. So far so good. 
Not sure >> if the following is relevant, but here it is anyway just in case: >> >> e820: update [mem 0x00000000-0x00000fff] usable ==> reserved >> e820: remove [mem 0x000a0000-0x000fffff] usable >> [...] >> e820: last_pfn = 0xcc0000 max_arch_pfn = 0x400000000 >> e820: last_pfn = 0x3f790 max_arch_pfn = 0x400000000 >> [...] >> Zone ranges: >> DMA [mem 0x00001000-0x00ffffff] >> DMA32 [mem 0x01000000-0xffffffff] >> Normal [mem 0x100000000-0xcbfffffff] >> [...] >> e820: [mem 0x40000000-0xfedfffff] available for PCI devices >> >> >> >(XEN) HVM1: E820 table: >> >(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> >(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> >(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >(XEN) HVM1: Invoking ROMBIOS ... >> >> Comparing this to the above, it seems that 9d000-9e000 is marked as >> reserved in dom0, but RAM in domU. Am I right in thinking that >> dom0(usable) == domU(RAM) in terms of meaning? >> >> What does "HOLE" actually mean in domU? Does it mean this space is >> OK to map domU IOMEM into? Or something else? Either way full >> possible chasl summary: >> >> dom0: reserved 9d000-9e000 >> domU: RAM 9d000-9e000 >> >> dom0: reserved a0000-dffff >> domU: HOLE a0000-dffff >> >> dom0: ACPI data 3f790000-3f79dfff >> dom0: ACPI NVS 3f79e000-3f7cffff >> dom0: reserved 3f7d0000-3f7dffff >> dom0: reserved > > > .. you are missing a range here.It wasn''t meant as an exhaustive list, I was only looking at the interesting/overlapping areas.>> domU: RAM 00100000-a7800000 >> >> Then there seems to be a hole in dom0: >> 40000000-fedfffff which talles up with the dom0 dmesg output above >> about it being for the PCI devices, i.e. 
that's the IOMEM region >> (from 1GB to a little under 4GB). >> >> But in domU, the 40000000-a77fffff is available as RAM. > > OK, so that is the goal - make hvmloader construct the E820 memory > layout and all of its pieces to fit that layout. I am actually leaning toward only copying the holes from the host E820. The domU already seems to be successfully using various memory ranges that correspond to reserved and ACPI ranges, so it doesn't look like these are a problem.>> On the face of it, that's actually fine - my PCI IOMEM mappings show >> the lowest mapping (according to lspci -vvv) starts at a8000000, > > <surprise> Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM mapped between 1024M and 2688MB. Hence why I can get away with a domU memory allocation up to 2688MB.
> > Neither am I. I may have pasted the wrong domU e820. I have a sneaky suspicion that this above map was from a domU with 2688MB of RAM assigned, hence why there is no domU RAM in the map above a7800000. I'll re-check when I'm in front of that machine again. Are you OK with the plan to _only_ copy the holes from host E820 to the hvmloader E820? I think this would be sufficient and not cause any undue problems. The only things that would need to change are: 1) Enlarge the domU hole 2) Do something with the top reserved block, starting at RESERVED_MEMBASE=0xFC000000. What is this actually for? It overlaps with the host memory hole which extends all the way up to 0xfee00000. If it must be where it is, this could be problematic. What to do in this case? This does also bring up another question - is there any point in bothering with matching the host holes? I would hazard a guess that no physical hardware is likely to have a memory hole bigger than 3GB under the 4GB limit. So would it perhaps be neater, easier, more consistent and more debuggable to just make the hvmloader put in a hole between 0x40000000-0xffffffff (the whole 3GB) by default? Or is that deemed to be too crippling for 32-bit non-PAE domUs (and are there enough of these around to matter?)? Caveat - this alone wouldn't cover any other weirdness such as the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was this what you were thinking about when asking whether my domUs work OK with 1GB of RAM? Since that is just under the 1GB limit. To clarify, I am not suggesting just hard-coding a 3GB memory hole - I am suggesting defaulting to at least that and then mapping in any additional memory holes as well. My reasoning behind this suggestion is that it would make things more consistent between different (possibly dissimilar) hosts. Gordan
Gordan Bobic
2013-Sep-06 14:09 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:09:06 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Thu, Sep 05, 2013 at 11:42:38PM +0100, Gordan Bobic wrote: >> On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote: >> >Gordan Bobic <gordan@bobich.net> wrote: >> >>Right, finally got around to trying this with the latest patch. >> >> >> >>With e820_host=0 things work as before: >> >> >> >>(XEN) HVM3: BIOS map: >> >>(XEN) HVM3: f0000-fffff: Main BIOS >> >>(XEN) HVM3: E820 table: >> >>(XEN) HVM3: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >>(XEN) HVM3: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >>(XEN) HVM3: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >>(XEN) HVM3: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >>(XEN) HVM3: [03]: 00000000:00100000 - 00000000:e0000000: RAM >> >>(XEN) HVM3: HOLE: 00000000:e0000000 - 00000000:fc000000 >> >>(XEN) HVM3: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >>(XEN) HVM3: [05]: 00000001:00000000 - 00000002:1f800000: RAM >> >> >> >> >> >>I seem to be getting two different E820 table dumps with >> e820_host=1: >> >> >> >>(XEN) HVM1: BIOS map: >> >>(XEN) HVM1: f0000-fffff: Main BIOS >> >>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries >> >>(XEN) HVM1: E820 table: >> >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:3f790000: RAM >> >>(XEN) HVM1: [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI >> >>(XEN) HVM1: [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS >> >>(XEN) HVM1: [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:3f7e0000 - 00000000:3f7e7000 >> >>(XEN) HVM1: [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:40000000 - 00000000:fee00000 >> >>(XEN) HVM1: [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:fee01000 - 00000000:ffc00000 >> >>(XEN) HVM1: [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED >> >>(XEN) HVM1: [07]: 00000001:00000000 - 00000001:68870000: RAM >> >>(XEN) 
HVM1: E820 table: >> >>(XEN) HVM1: [00]: 00000000:00000000 - 00000000:0009e000: RAM >> >>(XEN) HVM1: [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED >> >>(XEN) HVM1: HOLE: 00000000:000a0000 - 00000000:000e0000 >> >>(XEN) HVM1: [02]: 00000000:000e0000 - 00000000:00100000: RESERVED >> >>(XEN) HVM1: [03]: 00000000:00100000 - 00000000:a7800000: RAM >> >>(XEN) HVM1: HOLE: 00000000:a7800000 - 00000000:fc000000 >> >>(XEN) HVM1: [04]: 00000000:fc000000 - 00000001:00000000: RESERVED >> >>(XEN) HVM1: Invoking ROMBIOS ... >> >> >> >>I cannot quite figure out what is going on here - these tables >> can''t >> >>both be true. >> >> >> > >> >Right. The code just prints the E820 that was constructed b/c of >> the e820_host =1 parameter as the first output. Then the second one >> is what was constructed originally. >> > >> >The code that would tie in the E820 from the hyper call and the >> alter how the hvmloader sets it up is not yet done. >> > >> > >> >>Looking at the IOMEM on the host, the IOMEM begins at 0xa8000000 >> and >> >>goes more or less contiguously up to 0xfec8b000. >> >> >> >>Looking at dmesg on domU, the e820 map more or less matches the >> second >> >>dump above. >> > >> >Right. That is correct since the patch I sent just outputs stuff. >> No real changes to the E820 yet. >> >> /me *facepalms* >> >> That indeed explains everything. :) >> >> But having had a thorough look through the memory mappings (see my >> other long, rambling email), I don''t actually see an obvious area >> where RAM might overwrite a dom0 IOMEM range - assuming the "HOLE" >> part isn''t mapped as RAM in domU. >> >> Or to summarize: >> dom0 PCI IOMEM actually has mappings from a8000000 onward, and >> giving domU up to that much memory works fine. So the memory stomp >> must be happening from a8000000 onward. But - the only things above >> that address in domU are the HOLE up to fc000000 and RESERVED up to >> ffffffff. 
So no domU memory is getting mapped into the IOMEM range >> anyway - which begs the question of what is _actually_ causing the >> crash. Stuff I haven't yet found in domU getting mapped into the >> a7800000-fc000000 hole overlapping dom0 IOMEM? SeaBIOS doing >> something odd in the fc000000-fec8b000 range marked RESERVED in domU? > > There were some assumptions with that region and that stuff could > be stuck in there (like ACPI tables and SMBIOS I think). > > Perhaps a better question is - are any of the BARs of your card > overlapping > with the RESERVED range in the domU? > > Or if you grep through the hvmloader code are there any > addresses > that look to be within that range? > > Incidentally could you send the output of lspci -vvvv from your > output > in the guest and in dom0 please? Attached. The main point I'm trying to keep in mind here is that this needs to be generic and useful in different hardware cases, not just my own. If it were just about my own hardware and use case I'd have just opted for the approach of the old vBAR-pBAR patch, hard-coded the holes and been done with it.>> Or am I reading this all wrong? > > You are on the right track I think. There is some assumption made > about the RESERVED and HOLE that I think is conflicting with what the > card thinks of. Another way to figure out what is happening is to > crank > up the verbosity of the driver in the domU. Specifically there is > a CONFIG_MMIO_TRACE (or something like that) that will tell you the > physical address the PCI cards are using and what it is writing in > it. > > It could help in identifying _where_ the graphic card is > writing/reading > from. And also the last moment when it wrote something. That's a part of my problem - my domU with a reproducible crash is Windows which is a lot less debuggable. :( I have a Linux domU that I use for figuring out what the domU looks like from the inside, but I don't have a readily usable test-case for reproducing the crash there.
Gordan
Konrad Rzeszutek Wilk
2013-Sep-06 14:32 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
> >>Then there seems to be a hole in dom0: > >>40000000-fedfffff which tallies up with the dom0 dmesg output above > >>about it being for the PCI devices, i.e. that's the IOMEM region > >>(from 1GB to a little under 4GB). > >> > >>But in domU, the 40000000-a77fffff is available as RAM. > > > >OK, so that is the goal - make hvmloader construct the E820 memory > >layout and all of its pieces to fit that layout. > > I am actually leaning toward only copying the holes from the > host E820. The domU already seems to be successfully using various > memory ranges that correspond to reserved and ACPI ranges, so > it doesn't look like these are a problem. OK.> > >>On the face of it, that's actually fine - my PCI IOMEM mappings show > >>the lowest mapping (according to lspci -vvv) starts at a8000000, > > > ><surprise> > > Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM > mapped between 1024M and 2688MB. Hence why I can get away with a > domU memory allocation up to 2688MB. When you say 'IOMEM' you mean /proc/iomem output?> > >>which falls into the domU area marked as "HOLE" (a7800000-fc000000). > >>And this does in fact appear to be where domU maps the GPU in both > >>of my VMs: > >> > >>E0000000-E7FFFFFF > >>E8000000-EBFFFFFF > >>EC000000-EDFFFFFF > >> > >>and this doesn't overlap with any mapped PCI IOMEM according to > >>lspci. > >> > >>If we assume that anything below a8000000 doesn't actually matter in > >>this case (since if I give up to a8000000 memory to a domU > >>everything works absolutely fine indefinitely, I am at a loss to > > > > > >Just to make sure I am not leading you astray. You are getting > >_no_ crashes > >when you have a guest with 1GB? > > I haven't tried limiting a guest to 1GB recently. My PCI passthrough > domUs all have 2688MB assigned, and this works fine. More than that > and they crash eventually. Does that answer your question? Or were > you after something very specific to the 1GB domU case? No no.
I just was too lazy to compute what a800000 came out in decimal.> > >>explain what is actually going wrong and why the crash is still > >>occuring - unless some other piece of hardware is having it''s domU > >>IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is > >>causing a memory overwrite. > >> > >>I am just not seeing any obvious memory stomp at the moment... > > > >Neither am I. > > I may have pasted the wrong domU e820. I have a sneaky suspicion > that this above map was from a domU with 2688MB of RAM assigned, > hence why there is on domU RAM in the map above a7800000. I''ll > re-check when I''m in front of that machine again. > > Are you OK with the plan to _only_ copy the holes from host E820 > to the hvmloader E820? I think this would be sufficient and not > cause any undue problems. The only things that would need to > change are: > 1) Enlarge the domU hole > 2) Do something with the top reserved block, starting at > RESERVED_MEMBASE=0xFC000000. What is this actually for? It > overlaps with the host memory hole which extends all the way up > to 0xfee00000. If it must be where it is, this could be > problematic. What to do in this case?I would do a git log or git annotate to find it. I recall some patches to move that - but I can''t recall the details.> > This does, also bring up another question - is there any point > in bothering with matching the host holes? I would hazard a > guess that no physical hardware is likely to have a memory > hole bigger than 3GB under the 4GB limit.3GB is about the max I have seen.> > So would it perhaps be neater, easier, more consistent and > more debuggable to just make the hvmloader put in a hole > between 0x40000000-0xffffffff (the whole 3GB) by default? > Or is that deemed to be too crippling for 32-bit non-PAE > domUs (and are there enough of these aroudn to matter?)?Correct. 
Also it would wreak havoc when migrating to other hvmloader''s which have a different layout.> > Caveat - this alone wouldn''t cover any other weirdness such as > the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was > this what you were thinking about when asking whether my domUs > work OK with 1GB of RAM? Since that is just under the 1GB > limit.So there are some issues with i915 IGD having to have a ''flush page''. Mainly some non-RAM region that they can tell the IGD to flush its pages. And it had to be non-RAM and somehow via magic IGD registers you can program the physical address in the card - so the card has it remapped to itself. Usually it is some gap (aka hole) that ends has to be faithfully reproduced in the guest. But you are using nvidia and are not playing those nasty tricks.> > To clarify, I am not suggesting just hard coding a 3GB memory > hole - I am suggesting defaulting to at least that and them > mapping in any additional memory holes as well. My reasoning > behind this suggestion is that it would make things more > consistent between different (possibly dissimilar) hosts.Potentially. The other option when thinking about migration and PCI - is to interogate _All_ of the hosts that will be involved in the migration and construct an E820 that covers all the right regions. Then use that for the guests and then you can unplug/plug the PCI devices without much trouble. That is where the e820_host=1 parameter can be used and also some extra code to slurp up an XML of the E820 could be implemented. The 3GB HOLE could do it, but what if the host has some odd layout where the HOLE is above 4GB? Then we are back at remapping. I think Stefano had some thoughts about enlaring the HOLE and it might be good to include him here.> > Gordan
Gordan Bobic
2013-Sep-06 14:45 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 09:20:50 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Sep 06, 2013 at 01:23:19PM +0100, Gordan Bobic wrote:
>> On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> >Gordan Bobic <gordan@bobich.net> wrote:
>> >>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:
>> >>>Gordan Bobic <gordan@bobich.net> wrote:
>> >>>>Right, finally got around to trying this with the latest patch.
>> >>>>
>> >>>>With e820_host=0 things work as before:
>> >>>>
>> >>>>(XEN) HVM3: BIOS map:
>> >>>>(XEN) HVM3: f0000-fffff: Main BIOS
>> >>>>(XEN) HVM3: E820 table:
>> >>>>(XEN) HVM3:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> >>>>(XEN) HVM3:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> >>>>(XEN) HVM3:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> >>>>(XEN) HVM3:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> >>>>(XEN) HVM3:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
>> >>>>(XEN) HVM3:  HOLE: 00000000:e0000000 - 00000000:fc000000
>> >>>>(XEN) HVM3:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM3:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
>> >>>>
>> >>>>I seem to be getting two different E820 table dumps with e820_host=1:
>> >>>>
>> >>>>(XEN) HVM1: BIOS map:
>> >>>>(XEN) HVM1: f0000-fffff: Main BIOS
>> >>>>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries
>> >>>>(XEN) HVM1: E820 table:
>> >>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
>> >>>>(XEN) HVM1:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
>> >>>>(XEN) HVM1:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
>> >>>>(XEN) HVM1:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
>> >>>>(XEN) HVM1:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
>> >>>>(XEN) HVM1:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:fee01000 - 00000000:ffc00000
>> >>>>(XEN) HVM1:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM1:  [07]: 00000001:00000000 - 00000001:68870000: RAM
>> >>>>(XEN) HVM1: E820 table:
>> >>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> >>>>(XEN) HVM1:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> >>>>(XEN) HVM1:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> >>>>(XEN) HVM1:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> >>>>(XEN) HVM1:  [03]: 00000000:00100000 - 00000000:a7800000: RAM
>> >>>>(XEN) HVM1:  HOLE: 00000000:a7800000 - 00000000:fc000000
>> >>>>(XEN) HVM1:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> >>>>(XEN) HVM1: Invoking ROMBIOS ...
>> >>>>
>> >>>>I cannot quite figure out what is going on here - these
>> >>>>tables can't both be true.
>> >>>
>> >>>Right. The code just prints the E820 that was constructed b/c
>> >>>of the e820_host=1 parameter as the first output. Then the
>> >>>second one is what was constructed originally.
>> >>>
>> >>>The code that would tie in the E820 from the hypercall and
>> >>>alter how the hvmloader sets it up is not yet done.
>> >>>
>> >>>>Looking at the IOMEM on the host, the IOMEM begins at
>> >>>>0xa8000000 and goes more or less contiguously up to 0xfec8b000.
>> >>>>
>> >>>>Looking at dmesg on domU, the e820 map more or less matches
>> >>>>the second dump above.
>> >>>
>> >>>Right. That is correct since the patch I sent just outputs
>> >>>stuff. No real changes to the E820 yet.
>> >>
>> >>I thought this did that in hvmloader/e820.c:
>> >>hypercall_memory_op ( XENMEM_memory_map, &op);
>> >>
>> >>Gordan
>> >
>> >No. That just gets the E820 that is stashed in the hypervisor
>> >for the guest. The PV guest would use it, but hvmloader does
>> >not. This is what would need to be implemented to allow
>> >hvmloader to construct the E820 on its own.
>>
>> Right. So in hvmloader/e820.c we now have the host-based map in
>> struct e820entry map[E820MAX];
>>
>> The rest of the function then goes and constructs the standard
>> HVM e820 map in the passed-in
>> struct e820entry *e820
>>
>> So all that needs to happen here is: if e820_host is set, fill
>> e820[] by copying map[] up to hvm_info->low_mem_pgend
>> (or hvm_info->high_mem_pgend if it is set). I am guessing that
>
> Right. And then the overflow would be put past 4GB. Or fill in the
> E820_RAM regions with it.
>
>> SeaBIOS and other existing stuff might break if the host map is
>> just copied in verbatim, so presumably I need to add/dedupe the
>> non-RAM parts of the maps.
>
> Probably. Or tweak SeaBIOS to use your E820.

I don't think tweaking SeaBIOS to use a different specific map is
the way forward. As I said in the other email, my motivation is to
make something that will work in the general case, not for the
memory map in my dodgy hardware (I'm sure there are many other
poorly designed bits of hardware out there this might be useful
on ;) ).

> Also you need to figure out where hvmloader constructs the ACPI
> and SMBIOS tables and make sure they are within the E820_RESERVED
> regions.

This doesn't appear to have caused any problems - the only
problematic part is trampling over the host's _mapped_ parts of the
PCI MMIO hole. Having domU RAM everywhere else doesn't _appear_ to
cause any problems, hence why I would like to focus my effort on
making sure that the holes are mapped while breaking nothing else
if at all possible.

>> Is that right? Nothing else needs to happen?
>
> HA! You are going to hit some bugs probably :-)

Hey, some degree of optimism is required for perseverance. ;)

>> The following questions arise:
>>
>> 1) What to do in case of overlaps? On my specific hardware,
>> the key difference in the end map will be that the hole at:
>> (XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
>> will end up being created in domU.
>
> The hole is also known as the PCI gap or MMIO region. With
> e820_host in effect you should use the host's layout and
> its hole placement. That will replicate it and make the
> domU's E820 hole look like the host's.

Hmm... Now there's an idea. I _could_ just hard-code the memory
hole to match that, just to see if it fixes the problem. I rather
expect, however, that this will just move the problem.
Specifically, it is liable to make domU MMIO overlap (without
matching) the dom0 MMIO and crash the host quite spectacularly.
Unless domU decides to map MMIO from the bottom up, in which case
there's 1688MB of MMIO space between 0x40000000 and 0xa8000000
where MMIO will end up in domU, never overlapping the host's map,
and everything will, by pure chance, work just fine from there on.

>> 2) Do only the holes need to be pulled from the host, or
>> the entire map? Would hvmloader/seabios/whatever know
>> what to do if passed a map that is different from what
>> they might expect (i.e. different from what the current
>> hvmloader provides)? Or would this be likely to cause
>> extensive further breakages?
>
> I think there are some assumptions made about where the hole
> starts. Those would have to be made more dynamic to deal
> with a different E820 layout.

Assumptions made by what?

>> 3) At the moment I am leaning toward just pulling in the
>> holes from the host e820, mirroring them in domU.
>
> <nods>
>
>> 3.1) Marking them as "reserved" would likely fix the
>> problem that was my primary motivation for doing this
>> in the first place. Having said that - with all of
>
> That unfortunately will make them neither gaps nor MMIO regions.
> Meaning the kernel will scream: "You have a BAR in an E820
> reserved region! That is bad!", and won't set up the card.

What makes the decision in domU about where to map the PCI
devices' MMIO? SeaBIOS?

> The hole needs to be replicated in the guest.
>
>> the 1GB-3GB space marked as reserved, I'm not sure where
>> the IOMEM would end up mapped in domU - things might just
>> break. If marking the dom0 hole as a hole in domU without
>> ensuring pBAR=vBAR, the PCI device in domU might get
>> mapped where another device is in dom0, which might
>> cause the same problem.
>
> Right. hvmloader could (I hadn't checked the code) scan the
> E820, determine that the PCI BARs are within the E820_RESRV
> region, and try to move them to a hole. Since no hole would be
> found below 4GB it would remap the PCI BAR above 4GB. That -
> depending on the device - could be disastrous for the device.
> That is, if it is only capable of 32-bit DMAs it will never do
> anything.

Nvidia cards have a 32-bit 32MB BAR by default, and two 64-bit
BARs.

Looking at the different maps, I think I see what is actually
happening. In domU, the hole defaults to starting at e0000000, and
this is also where the BARs get mapped for the GPU in domU. That
implies that mirroring the host's hole at 1GB-4GB would actually
likely work (by a fluke), since the BARs would (hopefully) get
mapped at the bottom (plenty of hole before the host's mapping,
1688MB to be exact), and the rest of the hole would never get
touched, stealthily (or obliviously, depending on how you want to
look at it) avoiding trampling over the host's BARs.

OK, I'm convinced - I'll give this a try and see how I get on. :)

>> At the moment, I think the expedient thing to do is make
>> domU map holes as per dom0 and ignore other non-RAM
>
> <nods>
>
>> areas. This may (by luck) or may not fix my immediate problem
>> (RAM in domU clobbering the host's mapped IOMEM), but at
>> least it would cover the prerequisite hole mapping for
>> the next step, which is vBAR=pBAR.
>
> <nods>
>
>> In light of this, however, depending on the answer to 2)
>> above, it may not be practical for the e820_host option to do
>
> I think it will mean you need to look in the hvmloader directory
> a bit more and find all of the assumptions it makes about memory
> locations. One excellent tool is to do
> 'git log -p tools/firmware/hvmloader'
> as it will tell you what changes have been done to address
> the memory layout construction.

I'll have a dig.

>> what it actually means for HVMs, at least not to the same
>> extent as happens for PV. It would only do a part of it
>> (initial vHOLE=pHOLE, to later be extended to the more
>> specific case of vBAR=pBAR).
>>
>> Does this sound reasonable?
>
> Yes. I think the plan you outlined is sound. The difficulty is
> going to be cramming the E820 constructed by e820_host into
> hvmloader and making sure that all the other parts of it (SMBIOS,
> ACPI, BIOS) will be more dynamic and use dynamic locations
> instead of hard-coded values.
>
> Loads of printks can help with that :-)

This is my main concern - that other things are making assumptions
about where the holes are. At the moment it doesn't look too bad,
since the only areas of conflict between (_my_) host and current
hvmloader maps are in the RAM and HOLE areas, so coming up with a
generic solution that will work for my use (and hopefully for most
other people) ought to be fairly simple. Making it actually work in
the edge cases will be harder - but then again, for those cases it
doesn't work at the moment anyway, so erring on the side of
pragmatism may be the correct thing to do here.

> The awesome thing is that it will make hvmloader a lot more
> flexible. And one can extend e820_host to construct a bizarre
> E820 for testing even more absurd memory layouts (say, no RAM
> below 4GB).
>
> Keep on digging! Thanks for the great analysis.

Thanks, I appreciate it. :)

Gordan
Gordan Bobic
2013-Sep-06 16:30 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, 6 Sep 2013 10:32:23 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> >>On the face of it, that's actually fine - my PCI IOMEM mappings
>> >>show the lowest mapping (according to lspci -vvv) starts at
>> >>a8000000,
>> >
>> ><surprise>
>>
>> Indeed - on the host, the hole is 1GB-4GB, but there is no IOMEM
>> mapped between 1024M and 2688MB. Hence why I can get away with a
>> domU memory allocation up to 2688MB.
>
> When you say 'IOMEM' you mean /proc/iomem output?

I mean what lspci shows WRT where PCI device memory regions are
mapped.

>> >>explain what is actually going wrong and why the crash is still
>> >>occurring - unless some other piece of hardware is having its domU
>> >>IOMEM mapped somewhere in the range f3df4000-fec8b000 and that is
>> >>causing a memory overwrite.
>> >>
>> >>I am just not seeing any obvious memory stomp at the moment...
>> >
>> >Neither am I.
>>
>> I may have pasted the wrong domU e820. I have a sneaking suspicion
>> that the above map was from a domU with 2688MB of RAM assigned,
>> hence why there is no domU RAM in the map above a7800000. I'll
>> re-check when I'm in front of that machine again.
>>
>> Are you OK with the plan to _only_ copy the holes from host E820
>> to the hvmloader E820? I think this would be sufficient and not
>> cause any undue problems. The only things that would need to
>> change are:
>> 1) Enlarge the domU hole
>> 2) Do something with the top reserved block, starting at
>> RESERVED_MEMBASE=0xFC000000. What is this actually for? It
>> overlaps with the host memory hole which extends all the way up
>> to 0xfee00000. If it must be where it is, this could be
>> problematic. What to do in this case?
>
> I would do a git log or git annotate to find it. I recall
> some patches to move that - but I can't recall the details.

Will do. But what could this possibly be for?

>> So would it perhaps be neater, easier, more consistent and
>> more debuggable to just make the hvmloader put in a hole
>> between 0x40000000-0xffffffff (the whole 3GB) by default?
>> Or is that deemed to be too crippling for 32-bit non-PAE
>> domUs (and are there enough of these around to matter?)?
>
> Correct. Also it would wreak havoc when migrating to other
> hvmloaders which have a different layout.

Two points that might be worth making here:
1) domUs with e820_host set aren't migratable anyway (including the
PV ones for which e820_host is currently implemented)
2) All of this is conditional on e820_host=1 being set in the
config. Since legacy hosts won't have this set anyway (since it
isn't implemented, and won't be until this patch set is completed),
surely any notion of backward compatibility for HVMs with
e820_host=1 set is null and void.

Thus - as a first-pass solution that would work in most cases where
this option is useful in the first place, setting the low RAM limit
to the beginning of the first memory hole above 0x100000 (1MB)
should be OK. Leave anything after that unmapped (that seems to be
what shows up as "HOLE" in the dumps) all the way up to
RESERVED_MEMBASE.

That would only leave the question of what it is (if anything) that
uses the memory between RESERVED_MEMBASE and 0xffffffff (4GB), and
under which circumstances. This could be somewhat important because
0xfec8a000 -> +4KB on my machine is actually the Intel I/O APIC. If
it is reserved and nothing uses it, no problem, it can stay as is.
If SeaBIOS or similar is known to write to it under some
circumstances, that could easily be quite crashtastic.

>> Caveat - this alone wouldn't cover any other weirdness such as
>> the odd memory hole 0x3f7e0000-0x3f7e7000 on my hardware. Was
>> this what you were thinking about when asking whether my domUs
>> work OK with 1GB of RAM? Since that is just under the 1GB
>> limit.
>
> So there are some issues with i915 IGD having to have a 'flush
> page'. Mainly some non-RAM region that they can tell the IGD
> to flush its pages. And it had to be non-RAM, and somehow
> via magic IGD registers you can program the physical address
> in the card - so the card has it remapped to itself.
>
> Usually it is some gap (aka hole) that has to be
> faithfully reproduced in the guest. But you are using
> nvidia, which does not play those nasty tricks.

Merely a different set of nasty tricks instead. :) But yes, on the
whole, I agree. I will try to get the holes as similar as possible
for a "production"-level patch.

>> To clarify, I am not suggesting just hard-coding a 3GB memory
>> hole - I am suggesting defaulting to at least that and then
>> mapping in any additional memory holes as well. My reasoning
>> behind this suggestion is that it would make things more
>> consistent between different (possibly dissimilar) hosts.
>
> Potentially. The other option when thinking about migration
> and PCI is to interrogate _all_ of the hosts that will be
> involved in the migration and construct an E820 that covers all
> the right regions. Then use that for the guests, and then you
> can unplug/plug the PCI devices without much trouble.

That's possibly a step too far at this point.

> That is where the e820_host=1 parameter can be used, and
> some extra code to slurp up an XML of the E820 could also be
> implemented.
>
> The 3GB hole could do it, but what if the host has some
> odd layout where the hole is above 4GB? Then we are back at
> remapping.

Such a host would also only work with devices that _only_ require
64-bit BARs. But they do exist (e.g. ATI GPUs).

Gordan
Gordan Bobic
2013-Sep-06 19:54 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
Here is a test patch I applied to:
tools/firmware/hvmloader/e820.c

==
--- e820.c.orig	2013-09-06 11:15:20.023337321 +0100
+++ e820.c	2013-09-06 19:53:00.141876019 +0100
@@ -79,6 +79,7 @@
     unsigned int nr = 0;
     struct xen_memory_map op;
     struct e820entry map[E820MAX];
+    int e820_host = 0;
     int rc;
 
     if ( !lowmem_reserved_base )
@@ -88,6 +89,7 @@
 
     rc = hypercall_memory_op ( XENMEM_memory_map, &op);
     if ( rc != -ENOSYS) { /* It works!? */
+        e820_host = 1;
         printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__,
                op.nr_entries);
         dump_e820_table(&map[0], op.nr_entries);
     }
@@ -133,7 +135,12 @@
     /* Low RAM goes here. Reserve space for special pages. */
     BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
     e820[nr].addr = 0x100000;
-    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
+
+    if (e820_host)
+        e820[nr].size = 0x3f7e0000 - e820[nr].addr;
+    else
+        e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
+
     e820[nr].type = E820_RAM;
     nr++;
==

I'm sure this doesn't need explicitly pointing out, but for the
record, it is a gross hack just to prove the concept.

The map dump with this patch applied and memory set to 8192 is:

==
(XEN) HVM5: BIOS map:
(XEN) HVM5: f0000-fffff: Main BIOS
(XEN) HVM5: build_e820_table:93 got 8 op.nr_entries
(XEN) HVM5: E820 table:
(XEN) HVM5:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
(XEN) HVM5:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
(XEN) HVM5:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
(XEN) HVM5:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
(XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
(XEN) HVM5:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
(XEN) HVM5:  HOLE: 00000000:40000000 - 00000000:fee00000
(XEN) HVM5:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
(XEN) HVM5:  HOLE: 00000000:fee01000 - 00000000:ffc00000
(XEN) HVM5:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
(XEN) HVM5:  [07]: 00000001:00000000 - 00000002:c0870000: RAM
(XEN) HVM5: E820 table:
(XEN) HVM5:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(XEN) HVM5:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(XEN) HVM5:  HOLE: 00000000:000a0000 - 00000000:000e0000
(XEN) HVM5:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(XEN) HVM5:  [03]: 00000000:00100000 - 00000000:3f7e0000: RAM
(XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:fc000000
(XEN) HVM5:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
(XEN) HVM5:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
(XEN) HVM5: Invoking ROMBIOS ...
==

Good observations:
It works! No crashes, no screen corruption! As an added bonus, it
fixes the problem of rebooting domUs causing them to lose GPU access
and eventually crash the host, even with memory allocation below the
first PCI MMIO block. I am suspecting that something in the
0x3f7e0000-0x3f7e7000 hole that isn't showing up in lspci might be
responsible.

I think that proves beyond any doubt what the problem was before.

Interesting observations:

1) GPU PCI MMIO is still mapped at E0000000, rather than at the
bottom of the memory hole. That implies that SeaBIOS (or whatever
does the mapping) makes assumptions about where the memory hole
begins. This will need to somehow be fixed / made dynamic. What
decides where to map PCI memory for each device?

2) The memory hole size difference counts toward the total guest
memory. I set
memory=8192
maxmem=8192
but Windows in domU only sees 5.48GB. What is particularly odd is
that the missing memory isn't 3GB, but 2.5GB - which implies that,
again, there are other things making assumptions about the size and
shape of the memory hole and moving the memory from the hole
elsewhere to make it usable. What does this?

My todo list, in order of priority (unless somebody here has a
better idea), is:

1) Tidy up the hole enlargement to make it dynamic, based on the
host hole locations. In cases where the host hole overlaps something
other than guest RAM/HOLE (i.e. RESERVED), the guest spec wins.

2) Fix whatever is causing the hole memory increase to reduce the
guest memory. The memory hole is a hole, not a shadow. I need some
pointers on where to look for whatever is responsible for this.

3) Fix what makes decisions on where to map devices' memory
apertures. Ideally, the fix should be to detect the host's pBAR and
make vBAR=pBAR. Again, I need some pointers on where to look for
whatever is responsible for doing this mapping.

Gordan
Konrad Rzeszutek Wilk
2013-Sep-10 13:35 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Fri, Sep 06, 2013 at 08:54:24PM +0100, Gordan Bobic wrote:
> Here is a test patch I applied to:
> tools/firmware/hvmloader/e820.c
>
> [...test patch and E820 dumps snipped...]
>
> My todo list, in order of priority (unless somebody here has a
> better idea), is:
>
> 1) Tidy up the hole enlargement to make it dynamic, based on the
> host hole locations. In cases where the host hole overlaps something
> other than guest RAM/HOLE (i.e. RESERVED), the guest spec wins.

guest spec is .. the default hvmloader behavior?

> 2) Fix whatever is causing the hole memory increase to reduce the
> guest memory. The memory hole is a hole, not a shadow. I need some
> pointers on where to look for whatever is responsible for this.

That is where git log tools/firmware/hvmloader might shed some light.

> 3) Fix what makes decisions on where to map devices' memory
> apertures. Ideally, the fix should be to detect the host's pBAR and
> make vBAR=pBAR. Again, I need some pointers on where to look for
> whatever is responsible for doing this mapping.

That should all be in tools/firmware/hvmloader, I believe - the
'pci_setup' function, where it says:

 /* Assign iomem and ioport resources in descending order of size. */

> Gordan
Gordan Bobic
2013-Sep-10 15:04 UTC
Re: HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)
On Tue, 10 Sep 2013 09:35:59 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> On Fri, Sep 06, 2013 at 08:54:24PM +0100, Gordan Bobic wrote: >> Here is a test patch I applied to: >> /tools/firmware/hvmloader/e820.c >> >> ==>> --- e820.c.orig 2013-09-06 11:15:20.023337321 +0100 >> +++ e820.c 2013-09-06 19:53:00.141876019 +0100 >> @@ -79,6 +79,7 @@ >> unsigned int nr = 0; >> struct xen_memory_map op; >> struct e820entry map[E820MAX]; >> + int e820_host = 0; >> int rc; >> >> if ( !lowmem_reserved_base ) >> @@ -88,6 +89,7 @@ >> >> rc = hypercall_memory_op ( XENMEM_memory_map, &op); >> if ( rc != -ENOSYS) { /* It works!? */ >> + e820_host = 1; >> printf("%s:%d got %d op.nr_entries \n", __func__, __LINE__, >> op.nr_entries); >> dump_e820_table(&map[0], op.nr_entries); >> } >> @@ -133,7 +135,12 @@ >> /* Low RAM goes here. Reserve space for special pages. */ >> BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)); >> e820[nr].addr = 0x100000; >> - e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - >> e820[nr].addr; >> + >> + if (e820_host) >> + e820[nr].size = 0x3f7e0000 - e820[nr].addr; >> + else >> + e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - >> e820[nr].addr; >> + >> e820[nr].type = E820_RAM; >> nr++; >> >> ==>> >> I''m sure this doesn''t need explicitly pointing out, but for the >> record, it is a gross hack just to prove the concept. 
>>
>> The map dump with this patch applied and memory set to 8192 is:
>>
>> ==
>> (XEN) HVM5: BIOS map:
>> (XEN) HVM5: f0000-fffff: Main BIOS
>> (XEN) HVM5: build_e820_table:93 got 8 op.nr_entries
>> (XEN) HVM5: E820 table:
>> (XEN) HVM5:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
>> (XEN) HVM5:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
>> (XEN) HVM5:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
>> (XEN) HVM5:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
>> (XEN) HVM5:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:40000000 - 00000000:fee00000
>> (XEN) HVM5:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:fee01000 - 00000000:ffc00000
>> (XEN) HVM5:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
>> (XEN) HVM5:  [07]: 00000001:00000000 - 00000002:c0870000: RAM
>> (XEN) HVM5: E820 table:
>> (XEN) HVM5:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> (XEN) HVM5:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> (XEN) HVM5:  HOLE: 00000000:000a0000 - 00000000:000e0000
>> (XEN) HVM5:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> (XEN) HVM5:  [03]: 00000000:00100000 - 00000000:3f7e0000: RAM
>> (XEN) HVM5:  HOLE: 00000000:3f7e0000 - 00000000:fc000000
>> (XEN) HVM5:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>> (XEN) HVM5:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
>> (XEN) HVM5: Invoking ROMBIOS ...
>> ==
>>
>> Good observations:
>> It works! No crashes, no screen corruption! As an added bonus, it
>> fixes the problem of rebooting domUs causing them to lose GPU access
>> and eventually crash the host even with memory allocation below the
>> first PCI MMIO block. I am suspecting that something in the
>> 0x3f7e0000-0x3f7e7000 hole that isn't showing up on lspci might be
>> responsible.
>>
>> I think that proves beyond any doubt what the problem was before.
>>
>> Interesting observations:
>> 1) GPU PCI MMIO is still mapped at E0000000, rather than at the
>> bottom of the memory hole. That implies that SeaBIOS (or whatever
>> does the mapping) makes assumptions about where the memory hole
>> begins. This will need to somehow be fixed / made dynamic. What
>> decides where to map PCI memory for each device?
>>
>> 2) The memory hole size difference counts toward the total guest
>> memory. I set
>> memory=8192
>> maxmem=8192
>> but Windows in domU only sees 5.48GB. What is particularly odd is
>> that the missing memory isn't 3GB, but 2.5GB - which implies
>> that, again, there are other things making assumptions about the
>> size and shape of the memory hole and moving the memory from the
>> hole elsewhere to make it usable. What does this?
>>
>> My todo list, in order of priority (unless somebody here has a
>> better idea) is:
>> 1) Tidy up the hole enlargement to make it dynamic, based on the
>> host hole locations. In cases where the host hole overlaps something
>> other than guest RAM/HOLE (i.e. RESERVED), guest spec wins.
>
> guest spec is .. the default hvmloader behavior?

Yes, that's exactly what I meant. At least until I can figure out
what necessitates the default HVM behaviour.

>> 2) Fix whatever is causing the hole memory increase to reduce the
>> guest memory. The memory hole is a hole, not a shadow. I need some
>> pointers on where to look for whatever is responsible for this.
>
> That is where "git log tools/firmware/hvmloader" might shed some light.

I grepped for low_mem_pgend and high_mem_pgend, and the only place
where I have found anything is in one place in libxc. Is this what
sets it? Is this common to xm and xl?

>> 3) Fix what makes decisions on where to map devices' memory
>> apertures. Ideally, the fix should be to detect the host's pBAR and make
>> vBAR=pBAR. Again, I need some pointers on where to look for whatever
>> is responsible for doing this mapping.
> That should be all in tools/firmware/hvmloader I believe.
> The 'pci_setup' function, where it says:
> /* Assign iomem and ioport resources in descending order of size. */

Thanks, will take a closer look there.

Gordan
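[Editor's note: on the vBAR=pBAR idea above, the host's physical BARs are visible in dom0 under /sys/bus/pci/devices/<BDF>/resource, one "start end flags" triple per line. The sketch below shows how a toolstack-side helper might collect them before handing hints to hvmloader; parse_resource_line is a hypothetical function, not existing Xen code.]

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse one line of a PCI device's sysfs
 * "resource" file, e.g.
 *   0x00000000e0000000 0x00000000efffffff 0x000000000004220c
 * Returns 0 for a populated BAR, 1 for an empty slot (all zeros),
 * and -1 on a parse error.
 */
static int parse_resource_line(const char *line, uint64_t *start, uint64_t *end)
{
    unsigned long long s, e, flags;

    if (sscanf(line, "%llx %llx %llx", &s, &e, &flags) != 3)
        return -1;
    *start = (uint64_t)s;
    *end   = (uint64_t)e;
    return (s == 0 && e == 0) ? 1 : 0;
}
```

With the physical start/end in hand, the toolstack could in principle pass them down so pci_setup places the guest's BAR at the same address, sidestepping the mismatch between the guest hole and the host hole.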