Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 0 of 5] Patches for PCI passthrough with modified E820.
Hello, This set of RFC patches allows a PV domain to see the machine''s E820 and figure out where the "PCI I/O" gap is and match it with the reality. The short end is that with these patches a PV domain can: - Use the correct PCI I/O gap. Before these patches, Linux guest would boot up and would tell: [ 0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000) while in actuality the PCI I/O gap should have been: [ 0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000) - The PV domain with PCI devices was limited to 3GB. It now can be booted with 4GB, 8GB, or whatever number you want. The PCI devices will now _not_ conflict with System RAM. Meaning the drivers can load. - With 2.6.39 kernels (which has the 1-1 mapping code), the VM_IO flag will be now automatically applied to regions that are considerd PCI I/O regions. You can find out which those are by looking for ''1-1'' in the kernel bootup. To use this patchset, the guest config file has to have the parameter ''pci_hole=1'' enabled (hmm, any ideas for a better name?) This has been tested with 2.6.36, 2.6.37, 2.6.38, and 2.6.39 kernels, and even thought the earlier ones don''t have the 1-1 mapping code they still parse the E820 nicely, and the "Allocating PCI resource.." is now correct. Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests with success. *There is a bug in the Linux kernel so that if you want to save/restore a PV guest that has 1-1 mapping it won''t restore. Will post the patches shortly for that. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 1 of 5] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302194186 14400 # Node ID decab6c21cc3d7ce4d4dad949d34ba35d4600490 # Parent 97763efc41f9b664cf6f7db653c9c3f51e50b358 tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls. The later retrieves the E820 as seen by the hypervisor (completly unchanged) and the second call sets the E820 for a specific guest. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 97763efc41f9 -r decab6c21cc3 tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c Tue Apr 05 18:23:54 2011 +0100 +++ b/tools/libxc/xc_domain.c Thu Apr 07 12:36:26 2011 -0400 @@ -510,6 +510,55 @@ return rc; } + +int xc_domain_set_memory_map(xc_interface *xch, + uint32_t domid, + struct e820entry entries[], + uint32_t nr_entries) +{ + int rc; + struct xen_foreign_memory_map fmap = { + .domid = domid, + .map = { .nr_entries = nr_entries } + }; + DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry), + XC_HYPERCALL_BUFFER_BOUNCE_IN); + + if ( !entries || xc_hypercall_bounce_pre(xch, entries) ) + return -1; + + set_xen_guest_handle(fmap.map.buffer, entries); + + rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap)); + + xc_hypercall_bounce_post(xch, entries); + + return rc; +} + +int xc_get_machine_memory_map(xc_interface *xch, + struct e820entry entries[], + uint32_t max_entries) +{ + int rc; + struct xen_memory_map memmap = { + .nr_entries = max_entries + }; + DECLARE_HYPERCALL_BOUNCE(entries, sizeof(struct e820entry) * max_entries, + XC_HYPERCALL_BUFFER_BOUNCE_OUT); + + if ( !entries || xc_hypercall_bounce_pre(xch, entries) || max_entries <= 1) + return -1; + + + set_xen_guest_handle(memmap.buffer, entries); + + rc = do_memory_op(xch, XENMEM_machine_memory_map, &memmap, sizeof(memmap)); + + xc_hypercall_bounce_post(xch, entries); + + return rc ? rc : memmap.nr_entries; +} #else int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, @@ -519,6 +568,23 @@ errno = ENOSYS; return -1; } +int xc_domain_set_memory_map(xc_interface *xch, + uint32_t domid, + struct e820entry entries[], + uint32_t nr_entries) +{ + PERROR("Function not implemented"); + errno = ENOSYS; + return -1; +} +int xc_get_machine_memory_map(xc_interface *xch, + struct e820entry entries[], + uint32_t max_entries) +{ + PERROR("Function not implemented"); + errno = ENOSYS; + return -1; +} #endif int xc_domain_set_time_offset(xc_interface *xch, diff -r 97763efc41f9 -r decab6c21cc3 tools/libxc/xc_e820.h --- a/tools/libxc/xc_e820.h Tue Apr 05 18:23:54 2011 +0100 +++ b/tools/libxc/xc_e820.h Thu Apr 07 12:36:26 2011 -0400 @@ -26,6 +26,7 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define E820_NVS 4 +#define E820_UNUSABLE 5 struct e820entry { uint64_t addr; diff -r 97763efc41f9 -r decab6c21cc3 tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h Tue Apr 05 18:23:54 2011 +0100 +++ b/tools/libxc/xenctrl.h Thu Apr 07 12:36:26 2011 -0400 @@ -53,6 +53,7 @@ #include <xen/foreign/x86_32.h> #include <xen/foreign/x86_64.h> #include <xen/arch-x86/xen-mca.h> +#include "xc_e820.h" #endif #ifdef __ia64__ @@ -966,6 +967,15 @@ uint32_t domid, unsigned long map_limitkb); +int xc_domain_set_memory_map(xc_interface *xch, + uint32_t domid, + struct e820entry entries[], + uint32_t nr_entries); + +int xc_get_machine_memory_map(xc_interface *xch, + struct e820entry entries[], + uint32_t max_entries); + int xc_domain_set_time_offset(xc_interface *xch, uint32_t domid, int32_t time_offset_seconds); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 2 of 5] x86: make the pv-only e820 array be dynamic
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302202697 14400 # Node ID 01d0d338b97491a3aa816dab43cc709a234214f7 # Parent decab6c21cc3d7ce4d4dad949d34ba35d4600490 x86: make the pv-only e820 array be dynamic. During creation of the PV domain we allocate the E820 structure to be the E820MAX. This will allow the tool stack to fill the E820 with more than three entries. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r decab6c21cc3 -r 01d0d338b974 xen/arch/x86/domain.c --- a/xen/arch/x86/domain.c Thu Apr 07 12:36:26 2011 -0400 +++ b/xen/arch/x86/domain.c Thu Apr 07 14:58:17 2011 -0400 @@ -634,14 +634,23 @@ d->arch.pirq_emuirq[i] = IRQ_UNBOUND; for (i = 0; i < nr_irqs; i++) d->arch.emuirq_pirq[i] = IRQ_UNBOUND; + } else + { + d->arch.pv_domain.e820 = xmalloc_array(struct e820entry, E820MAX); + + if ( !d->arch.pv_domain.e820 ) + goto fail; + + memset(d->arch.pv_domain.e820, 0, + E820MAX * sizeof(*d->arch.pv_domain.e820)); } - if ( (rc = iommu_domain_init(d)) != 0 ) goto fail; /* For Guest vMCE MSRs virtualization */ vmce_init_msr(d); + } if ( is_hvm_domain(d) ) @@ -668,6 +677,10 @@ fail: d->is_dying = DOMDYING_dead; vmce_destroy_msr(d); + if ( !is_hvm_domain(d) ) + { + xfree(d->arch.pv_domain.e820); + } xfree(d->arch.pirq_irq); xfree(d->arch.irq_pirq); xfree(d->arch.pirq_emuirq); @@ -696,6 +709,8 @@ if ( is_hvm_domain(d) ) hvm_domain_destroy(d); + else + xfree(d->arch.pv_domain.e820); vmce_destroy_msr(d); pci_release_devices(d); diff -r decab6c21cc3 -r 01d0d338b974 xen/arch/x86/mm.c --- a/xen/arch/x86/mm.c Thu Apr 07 12:36:26 2011 -0400 +++ b/xen/arch/x86/mm.c Thu Apr 07 14:58:17 2011 -0400 @@ -4710,7 +4710,7 @@ if ( copy_from_guest(&fmap, arg, 1) ) return -EFAULT; - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) + if ( fmap.map.nr_entries > E820MAX ) return -EINVAL; rc = rcu_lock_target_domain_by_id(fmap.domid, &d); @@ -4730,9 +4730,16 @@ return -EPERM; } + if ( d->arch.pv_domain.e820 == NULL ) + { + rcu_unlock_domain(d); + return -EINVAL; + } rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, fmap.map.nr_entries) ? -EFAULT : 0; - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; + + if ( rc == 0 ) + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; rcu_unlock_domain(d); return rc; @@ -4747,6 +4754,9 @@ if ( d->arch.pv_domain.nr_e820 == 0 ) return -ENOSYS; + if ( d->arch.pv_domain.e820 == NULL ) + return -ENOSYS; + if ( copy_from_guest(&map, arg, 1) ) return -EFAULT; diff -r decab6c21cc3 -r 01d0d338b974 xen/include/asm-x86/domain.h --- a/xen/include/asm-x86/domain.h Thu Apr 07 12:36:26 2011 -0400 +++ b/xen/include/asm-x86/domain.h Thu Apr 07 14:58:17 2011 -0400 @@ -238,7 +238,7 @@ unsigned long pirq_eoi_map_mfn; /* Pseudophysical e820 map (XENMEM_memory_map). */ - struct e820entry e820[3]; + struct e820entry *e820; unsigned int nr_e820; }; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 3 of 5] x86: adjust the size of the e820 for pv guest to be dynamic
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302203926 14400 # Node ID 546d8a03d5cbe0ceddadf701174f2417a0b72891 # Parent 01d0d338b97491a3aa816dab43cc709a234214f7 x86: adjust the size of the e820 for pv guest to be dynamic. Instead of using the E820MAX we will use the amount of E820 entries on the machine, plus the number three. Considering the use cases, which is the toolstack retrieving the E820, sanitizing it, and then setting it for the PV guest (for PCI passthrough), this dynamic number of E820 is just right. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 01d0d338b974 -r 546d8a03d5cb xen/arch/x86/domain.c --- a/xen/arch/x86/domain.c Thu Apr 07 14:58:17 2011 -0400 +++ b/xen/arch/x86/domain.c Thu Apr 07 15:18:46 2011 -0400 @@ -636,13 +636,13 @@ d->arch.emuirq_pirq[i] = IRQ_UNBOUND; } else { - d->arch.pv_domain.e820 = xmalloc_array(struct e820entry, E820MAX); + d->arch.pv_domain.e820 = xmalloc_array(struct e820entry, E820NR); if ( !d->arch.pv_domain.e820 ) goto fail; memset(d->arch.pv_domain.e820, 0, - E820MAX * sizeof(*d->arch.pv_domain.e820)); + E820NR * sizeof(*d->arch.pv_domain.e820)); } if ( (rc = iommu_domain_init(d)) != 0 ) diff -r 01d0d338b974 -r 546d8a03d5cb xen/arch/x86/mm.c --- a/xen/arch/x86/mm.c Thu Apr 07 14:58:17 2011 -0400 +++ b/xen/arch/x86/mm.c Thu Apr 07 15:18:46 2011 -0400 @@ -4710,7 +4710,7 @@ if ( copy_from_guest(&fmap, arg, 1) ) return -EFAULT; - if ( fmap.map.nr_entries > E820MAX ) + if ( fmap.map.nr_entries > E820NR ) return -EINVAL; rc = rcu_lock_target_domain_by_id(fmap.domid, &d); diff -r 01d0d338b974 -r 546d8a03d5cb xen/include/asm-x86/e820.h --- a/xen/include/asm-x86/e820.h Thu Apr 07 14:58:17 2011 -0400 +++ b/xen/include/asm-x86/e820.h Thu Apr 07 15:18:46 2011 -0400 @@ -39,4 +39,6 @@ #define e820_raw bootsym(e820map) #define e820_raw_nr bootsym(e820nr) +/* The +3 is for guest RAM entries */ +#define E820NR (e820.nr_map + 3) #endif /*__E820_HEADER*/ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302206363 14400 # Node ID 2e464234c94cfd29a98a9011a46d76846b87f7f8 # Parent 546d8a03d5cbe0ceddadf701174f2417a0b72891 libxl: Add support for passing in the machine''s E820 for PCI passthrough. The code (libxl_e820_alloc) calls the xc_get_machine_memory_map to retrieve the systems E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as well remove any E820_RAM or E820_UNUSED regions as the guest does not need to know abou them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any gap in the E820 is considered PCI I/O space which means that if we pass in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the gap between 2GB and 3GB will be considered as PCI I/O space. To guard against that we also create an E820_UNUSABLE between the region of ''target_kb'' (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. Memory that is slack or for balloon (so ''maxmem'' in guest configuration) is put behind the machine E820. Which in most cases is after the 4GB. The reason for doing the fetching of the E820 using the hypercall in the toolstack (instead of the guest doing it) is that when a Linux guest would do a hypercall to ''XENMEM_machine_memory_map'' it would retrieve an E820 with I/O range caps added in. Meaning that the region after 4GB up to end of possible memory would be marked as unusuable and the Linux kernel would not have any space to allocate a balloon region. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.h --- a/tools/libxl/libxl.h Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/libxl.h Thu Apr 07 15:59:23 2011 -0400 @@ -204,6 +204,14 @@ } libxl_file_reference; void libxl_file_reference_destroy(libxl_file_reference *p); +#include <xc_e820.h> +typedef struct { + uint32_t nr_entries; + struct e820entry *entry; +} libxl_e820; + +void libxl_e820_destroy(libxl_e820 *p); + /* libxl_cpuid_policy_list is a dynamic array storing CPUID policies * for multiple leafs. It is terminated with an entry holding * XEN_CPUID_INPUT_UNUSED in input[0] diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.idl --- a/tools/libxl/libxl.idl Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/libxl.idl Thu Apr 07 15:59:23 2011 -0400 @@ -22,6 +22,7 @@ libxl_hwcap = Builtin("hwcap") +libxl_e820 = Builtin("e820", destructor_fn="libxl_e820_destroy", passby=PASS_BY_REFERENCE) # # Complex libxl types # @@ -112,6 +113,7 @@ ])), ("pv", "!%s", Struct(None, [("slack_memkb", uint32), + ("e820", libxl_e820), ("bootloader", string), ("bootloader_args", string), ("cmdline", string), diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/libxl_dom.c Thu Apr 07 15:59:23 2011 -0400 @@ -72,9 +72,23 @@ libxl_ctx *ctx = libxl__gc_owner(gc); xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); - if (!info->hvm) + if (!info->hvm) { + if (info->u.pv.e820.nr_entries) { + int rc; + rc = libxl_e820_sanitize(ctx, info->target_memkb, + (info->max_memkb - info->target_memkb) + info->u.pv.slack_memkb, + &info->u.pv.e820); + if (rc) + return rc; + xc_domain_set_memory_map(ctx->xch, domid, + info->u.pv.e820.entry, + info->u.pv.e820.nr_entries); + } + else { xc_domain_set_memmap_limit(ctx->xch, domid, (info->max_memkb + info->u.pv.slack_memkb)); + } + } xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0); if ( info->disable_migrate ) xc_domain_disable_migrate(ctx->xch, domid); diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl_pci.c --- a/tools/libxl/libxl_pci.c Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/libxl_pci.c Thu Apr 07 15:59:23 2011 -0400 @@ -37,6 +37,8 @@ #include "libxl_internal.h" #include "flexarray.h" +#include <xc_e820.h> + #define PCI_BDF "%04x:%02x:%02x.%01x" #define PCI_BDF_SHORT "%02x:%02x.%01x" #define PCI_BDF_VDEVFN "%04x:%02x:%02x.%01x@%02x" @@ -1047,3 +1049,154 @@ free(pcidevs); return 0; } + +#define E820MAX (128) +int libxl_e820_sanitize(libxl_ctx *ctx, unsigned long map_limitkb, + unsigned long balloon_kb, libxl_e820 *p) +{ + uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end; + uint32_t i, idx = 0, nr; + struct e820entry e820[E820MAX]; + struct e820entry *src; + + if (!p || !p->entry || !map_limitkb) + return ERROR_FAIL; + + if (p->nr_entries <= 1) + return ERROR_FAIL; + + src = p->entry; + nr = p->nr_entries; + + /* Weed out anything under 16MB */ + for (i = 0; i < nr; i++) { + if (src[i].addr > 0x100000) + continue; + + src[i].type = 0; + src[i].size = 0; + src[i].addr = -1ULL; + } + + /* Find the lowest and highest entry in E820, skipping over + * undersired entries. */ + start = -1ULL; + last = 0; + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + + start = src[i].addr < start ? src[i].addr : start; + last = src[i].addr + src[i].size > last ? + src[i].addr + src[i].size > last : last; + } + if (start > 1024) + start_kb = start >> 10; + + /* Add the memory RAM region for the guest */ + e820[idx].addr = 0; + e820[idx].size = (uint64_t)map_limitkb << 10; + e820[idx].type = E820_RAM; + + /* .. and trim if neccessary */ + if (start_kb && map_limitkb > start_kb) { + delta_kb = map_limitkb - start_kb; + if (delta_kb) + e820[idx].size -= (uint64_t)(delta_kb << 10); + } + /* Note: We don''t touch balloon_kb here. Will add it at the end. */ + ram_end = e820[idx].addr + e820[idx].size; + idx ++; + + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %ldkB End of RAM: 0x%lx (PFN) " \ + "Delta: %ldkB, PCI start: %ldkB (0x%lx PFN), Balloon %ldkB\n", + map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, + balloon_kb); + + /* Check if there is a region between ram_end and start. */ + if (start > ram_end) { + /* .. and if not present, add it in. This is to guard against + the Linux guest assuming that the gap between the end of + RAM region and the start of the E820_[ACPI,NVS,RESERVED] + is PCI I/O space. Which it certainly is _not_. */ + e820[idx].type = E820_UNUSABLE; + e820[idx].addr = ram_end; + e820[idx].size = start - ram_end; + idx++; + } + /* Almost done: copy them over, ignoring the undesireable ones */ + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + e820[idx].type = src[i].type; + e820[idx].addr = src[i].addr; + e820[idx].size = src[i].size; + idx++; + } + + /* At this point we have the mapped RAM + E820 entries from src. */ + if (balloon_kb) { + /* and if we truncated the RAM region, then add it to the end. */ + e820[idx].type = E820_RAM; + e820[idx].addr = (uint64_t)(1ULL << 32) > last ? (uint64_t)(1ULL << 32) : last; + /* also add the balloon memory to the end. */ + e820[idx].size = (uint64_t)(delta_kb << 10) + (uint64_t)(balloon_kb << 10); + idx++; + + } + nr = idx; + + for (i = 0; i < nr; i++) { + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]", + e820[i].type == E820_RAM ? "RAM " : + (e820[i].type == E820_RESERVED ? "RSV " : + e820[i].type == E820_ACPI ? "ACPI" : + (e820[i].type == E820_NVS ? "NVS " : + (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))), + e820[i].addr >> 12, + (e820[i].addr + e820[i].size) >> 12); + } + + /* Done: copy the sanitized version. */ + if (nr > p->nr_entries) { + libxl_e820_destroy(p); + p->entry = calloc(nr, sizeof(struct e820entry)); + if (!p->entry) + return ERROR_NOMEM; + } + p->nr_entries = nr; + memcpy(p->entry, e820, nr * sizeof(struct e820entry)); + return 0; +} + + +int libxl_e820_alloc(libxl_ctx *ctx, libxl_e820 *p) +{ + int nr; + struct e820entry map[E820MAX]; + + nr = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + if (nr < 0) { + errno = nr; + return ERROR_FAIL; + } + p->nr_entries = nr; + p->entry = calloc(nr, sizeof(struct e820entry)); + if (!p->entry) + return ERROR_NOMEM; + + memcpy(p->entry, map, nr * sizeof(struct e820entry)); + return 0; +} + +void libxl_e820_destroy(libxl_e820 *p) +{ + if (p->entry) + free(p->entry); + p->entry = NULL; + p->nr_entries = 0; +} diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl_utils.h --- a/tools/libxl/libxl_utils.h Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/libxl_utils.h Thu Apr 07 15:59:23 2011 -0400 @@ -89,5 +89,8 @@ return (s + 1023) / 1024; } +int libxl_e820_alloc(libxl_ctx *ctx, libxl_e820 *p); +int libxl_e820_sanitize(libxl_ctx *ctx, unsigned long map_limitkb, + unsigned long balloon_kb, libxl_e820 *p); #endif diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c Thu Apr 07 15:18:46 2011 -0400 +++ b/tools/libxl/xl_cmdimpl.c Thu Apr 07 15:59:23 2011 -0400 @@ -1009,6 +1009,13 @@ if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l)) pci_power_mgmt = l; + if (!xlu_cfg_get_long (config, "pci_hole", &l)) { + if (l == 1) { + int rc = libxl_e820_alloc(&ctx, &b_info->u.pv.e820); + if (rc < 0) + fprintf(stderr, "failed while collecting E820 with: %d\n", rc); + } + } if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) { int i; d_config->num_pcidevs = 0; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-07 20:25 UTC
[Xen-devel] [PATCH 5 of 5] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302207699 14400 # Node ID 2c08a4f9b2123a2310cb1122d9269c4e46d09b68 # Parent 2e464234c94cfd29a98a9011a46d76846b87f7f8 libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate. Most machines after the RAM regions in the e802 have a couple of E820_RESERVED, with E820_ACPI and E820_NVS. On some Intel machines, the E820 looks like swiss cheese: (XEN) Initial Xen-e820 RAM map: (XEN) 0000000000000000 - 000000000009d000 (usable) (XEN) 000000000009d000 - 00000000000a0000 (reserved) (XEN) 00000000000e0000 - 0000000000100000 (reserved) (XEN) 0000000000100000 - 000000009cf66000 (usable) (XEN) 000000009cf66000 - 000000009d102000 (ACPI NVS) (XEN) 000000009d102000 - 000000009f6bd000 (usable) <-- (XEN) 000000009f6bd000 - 000000009f6bf000 (reserved) (XEN) 000000009f6bf000 - 000000009f714000 (usable) <-- (XEN) 000000009f714000 - 000000009f7bf000 (ACPI NVS) (XEN) 000000009f7bf000 - 000000009f7e0000 (usable) <-- (XEN) 000000009f7e0000 - 000000009f7ff000 (ACPI data) (XEN) 000000009f7ff000 - 000000009f800000 (usable) <-- (XEN) 000000009f800000 - 00000000a0000000 (reserved) (XEN) 00000000a0000000 - 00000000b0000000 (reserved) (XEN) 00000000fc000000 - 00000000fd000000 (reserved) (XEN) 00000000ffe00000 - 0000000100000000 (reserved) (XEN) 0000000100000000 - 0000000160000000 (usable) Which means we have to pay attention to the E820_RAM that are between the E820_[ACPI,NVS,RESERVED]. If we remove those E820_RAM (b/c the amount of memory passed to the guest is less that where those E820 regions reside) from the E820, the Linux kernel interprets those "gaps" as PCI I/O space. This is what we are currently doing. This can be disastrous if we pass in an Intel IGD card which tries to use the first available PCI I/O space - and ends up using the MFNs which are actually RAM instead of being the PCI I/O space. To make this work, we convert all E820_RAM that are above the ''target_kb'' (those that overlap the ''target_kb'' are truncated appropriately) to be E820_UNUSABLE. We also limit this alternation up to 4GB. This means that an E820 for a guest from this (target_kb=1024, maxmem=2048): [ 0.000000] Set 405658 page(s) to 1-1 mapping. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable) [ 0.000000] Xen: 0000000040000000 - 000000009cf66000 (unusable) [ 0.000000] Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS) [ 0.000000] Xen: 000000009f6bd000 - 000000009f6bf000 (reserved) [ 0.000000] Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS) [ 0.000000] Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data) [ 0.000000] Xen: 000000009f800000 - 00000000b0000000 (reserved) [ 0.000000] Xen: 00000000fc000000 - 00000000fd000000 (reserved) [ 0.000000] Xen: 00000000fec00000 - 00000000fec01000 (reserved) [ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000] Xen: 00000000ffe00000 - 0000000100000000 (reserved) [ 0.000000] Xen: 0000000100000000 - 0000000140800000 (usable) Will look as so: [ 0.000000] Set 395880 page(s) to 1-1 mapping. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable) [ 0.000000] Xen: 0000000040000000 - 000000009cf66000 (unusable) [ 0.000000] Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS) [ 0.000000] Xen: 000000009d102000 - 000000009f6bd000 (unusable) [ 0.000000] Xen: 000000009f6bd000 - 000000009f6bf000 (reserved) [ 0.000000] Xen: 000000009f6bf000 - 000000009f714000 (unusable) [ 0.000000] Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS) [ 0.000000] Xen: 000000009f7bf000 - 000000009f7e0000 (unusable) [ 0.000000] Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data) [ 0.000000] Xen: 000000009f7ff000 - 000000009f800000 (unusable) [ 0.000000] Xen: 000000009f800000 - 00000000b0000000 (reserved) [ 0.000000] Xen: 00000000fc000000 - 00000000fd000000 (reserved) [ 0.000000] Xen: 00000000fec00000 - 00000000fec01000 (reserved) [ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000] Xen: 00000000ffe00000 - 0000000100000000 (reserved) [ 0.000000] Xen: 0000000100000000 - 0000000140800000 (usable) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 2e464234c94c -r 2c08a4f9b212 tools/libxl/libxl_pci.c --- a/tools/libxl/libxl_pci.c Thu Apr 07 15:59:23 2011 -0400 +++ b/tools/libxl/libxl_pci.c Thu Apr 07 16:21:39 2011 -0400 @@ -1115,22 +1115,98 @@ map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, balloon_kb); + /* This whole code below is to guard against if the Intel IGD is passed into + * the guest. If we don''t pass in IGD, this whole code can be ignored. + * + * The reason for this code is that Intel boxes fill their E820 with + * E820_RAM amongst E820_RESERVED and we can''t just ditch the E820_RAM. + * That is b/c any "gaps" in the E820 is considered PCI I/O space by + * Linux and it would be utilized by the Intel IGD as I/O space while + * in reality it was an RAM region. + * + * What this means is that we have to walk the E820 and for any region + * that is RAM and below 4GB and above ram_end, needs to change its type + * to E820_UNUSED. We also need to move some of the E820_RAM regions if + * the overlap with ram_end. */ + for (i = 0; i < nr; i++) { + uint64_t end = src[i].addr + src[i].size; + + /* We don''t care about E820_UNUSABLE, but we need to + * change the type to zero b/c the loop after this + * sticks E820_UNUSABLE on the guest''s E820 but ignores + * the ones with type zero. */ + if ((src[i].type == E820_UNUSABLE) || + /* Any region that is within the "RAM region" can + * be safely ditched. */ + (end < ram_end)) { + src[i].type = 0; + continue; + } + + /* Look only at RAM regions. */ + if (src[i].type != E820_RAM) + continue; + + /* We only care about RAM regions below 4GB. */ + if (src[i].addr >= (1ULL<<32)) + continue; + + /* E820_RAM overlaps with our RAM region. Move it */ + if (src[i].addr < ram_end) { + uint64_t delta; + + src[i].type = E820_UNUSABLE; + delta = ram_end - src[i].addr; + /* The end < ram_end should weed this out */ + if (src[i].size - delta < 0) + src[i].type = 0; + else { + src[i].size -= delta; + src[i].addr = ram_end; + } + if (src[i].addr + src[i].size != end) { + /* We messed up somewhere */ + src[i].type = 0; + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Computed E820 wrongly. Continuing on."); + } + } + /* Lastly, convert the RAM to UNSUABLE. Look in the Linux kernel + at git commit 2f14ddc3a7146ea4cd5a3d1ecd993f85f2e4f948 + "xen/setup: Inhibit resource API from using System RAM E820 + gaps as PCI mem gaps" for full explanation. */ + if (end > ram_end) + src[i].type = E820_UNUSABLE; + } + /* Check if there is a region between ram_end and start. */ if (start > ram_end) { + int add_unusable = 1; + for (i = 0; i < nr && add_unusable; i++) { + if (src[i].type != E820_UNUSABLE) + continue; + if (ram_end != src[i].addr) + continue; + if (start != src[i].addr + src[i].size) + /* there is one, adjust it */ + src[i].size = start - src[i].addr; + + add_unusable = 0; + } /* .. and if not present, add it in. This is to guard against the Linux guest assuming that the gap between the end of RAM region and the start of the E820_[ACPI,NVS,RESERVED] is PCI I/O space. Which it certainly is _not_. */ - e820[idx].type = E820_UNUSABLE; - e820[idx].addr = ram_end; - e820[idx].size = start - ram_end; - idx++; + if (add_unusable) { + e820[idx].type = E820_UNUSABLE; + e820[idx].addr = ram_end; + e820[idx].size = start - ram_end; + idx++; + } } - /* Almost done: copy them over, ignoring the undesireable ones */ - for (i = 0; i < nr; i++) { + /* Almost done: copy them over, ignoring the undesireable ones */ + for (i = 0; i < nr; i++) { if ((src[i].type == E820_RAM) || - (src[i].type == E820_UNUSABLE) || - (src[i].type == 0)) + (src[i].type == 0)) continue; e820[idx].type = src[i].type; e820[idx].addr = src[i].addr; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 08:18 UTC
Re: [Xen-devel] [PATCH 1 of 5] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls
On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote:> # HG changeset patch > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > # Date 1302194186 14400 > # Node ID decab6c21cc3d7ce4d4dad949d34ba35d4600490 > # Parent 97763efc41f9b664cf6f7db653c9c3f51e50b358 > tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls. > > The later retrieves the E820 as seen by the hypervisor (completly > unchanged) and the second call sets the E820 for a specific guest. > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > diff -r 97763efc41f9 -r decab6c21cc3 tools/libxc/xc_domain.c > --- a/tools/libxc/xc_domain.c Tue Apr 05 18:23:54 2011 +0100 > +++ b/tools/libxc/xc_domain.c Thu Apr 07 12:36:26 2011 -0400 > @@ -510,6 +510,55 @@Please add: [diff] showfunc = True to your ~/.hgrc> > return rc; > } > + > +int xc_domain_set_memory_map(xc_interface *xch, > + uint32_t domid, > + struct e820entry entries[], > + uint32_t nr_entries) > +{ > + int rc; > + struct xen_foreign_memory_map fmap = { > + .domid = domid, > + .map = { .nr_entries = nr_entries } > + }; > + DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry), > + XC_HYPERCALL_BUFFER_BOUNCE_IN); > + > + if ( !entries || xc_hypercall_bounce_pre(xch, entries) ) > + return -1; > + > + set_xen_guest_handle(fmap.map.buffer, entries); > + > + rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap)); > + > + xc_hypercall_bounce_post(xch, entries); > + > + return rc; > +}Should probably reimplement xc_domain_set_memmap_limit() as a wrapper around this new function. And/or nuke it if that''s an option. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 08:22 UTC
Re: [Xen-devel] [PATCH 2 of 5] x86: make the pv-only e820 array be dynamic
On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote:> # HG changeset patch > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > # Date 1302202697 14400 > # Node ID 01d0d338b97491a3aa816dab43cc709a234214f7 > # Parent decab6c21cc3d7ce4d4dad949d34ba35d4600490 > x86: make the pv-only e820 array be dynamic. > > During creation of the PV domain we allocate the E820 structure to be > the E820MAX. This will allow the tool stack to fill the E820 with more > than three entries.Is it possible to defer this allocation to set_memory_map time so we can allocate only the required size? I guess E820MAX (128) entries is only 2.5k but a typical e820 is capped at more like 20 entries in practice (<0.5k).> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > diff -r decab6c21cc3 -r 01d0d338b974 xen/arch/x86/domain.c > --- a/xen/arch/x86/domain.c Thu Apr 07 12:36:26 2011 -0400 > +++ b/xen/arch/x86/domain.c Thu Apr 07 14:58:17 2011 -0400 > @@ -634,14 +634,23 @@ > d->arch.pirq_emuirq[i] = IRQ_UNBOUND; > for (i = 0; i < nr_irqs; i++) > d->arch.emuirq_pirq[i] = IRQ_UNBOUND; > + } else > + { > + d->arch.pv_domain.e820 = xmalloc_array(struct e820entry, E820MAX); > + > + if ( !d->arch.pv_domain.e820 ) > + goto fail; > + > + memset(d->arch.pv_domain.e820, 0, > + E820MAX * sizeof(*d->arch.pv_domain.e820)); > } > > - > if ( (rc = iommu_domain_init(d)) != 0 ) > goto fail; > > /* For Guest vMCE MSRs virtualization */ > vmce_init_msr(d); > + > } > > if ( is_hvm_domain(d) ) > @@ -668,6 +677,10 @@ > fail: > d->is_dying = DOMDYING_dead; > vmce_destroy_msr(d); > + if ( !is_hvm_domain(d) ) > + { > + xfree(d->arch.pv_domain.e820); > + } > xfree(d->arch.pirq_irq); > xfree(d->arch.irq_pirq); > xfree(d->arch.pirq_emuirq); > @@ -696,6 +709,8 @@ > > if ( is_hvm_domain(d) ) > hvm_domain_destroy(d); > + else > + xfree(d->arch.pv_domain.e820); > > vmce_destroy_msr(d); > pci_release_devices(d); > diff -r decab6c21cc3 -r 01d0d338b974 xen/arch/x86/mm.c > --- a/xen/arch/x86/mm.c Thu Apr 07 12:36:26 2011 -0400 > +++ b/xen/arch/x86/mm.c Thu Apr 07 14:58:17 2011 -0400 > @@ -4710,7 +4710,7 @@ > if ( copy_from_guest(&fmap, arg, 1) ) > return -EFAULT; > > - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) > + if ( fmap.map.nr_entries > E820MAX ) > return -EINVAL; > > rc = rcu_lock_target_domain_by_id(fmap.domid, &d); > @@ -4730,9 +4730,16 @@ > return -EPERM; > } > > + if ( d->arch.pv_domain.e820 == NULL ) > + { > + rcu_unlock_domain(d); > + return -EINVAL; > + } > rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, > fmap.map.nr_entries) ? -EFAULT : 0; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > + > + if ( rc == 0 ) > + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > > rcu_unlock_domain(d); > return rc; > @@ -4747,6 +4754,9 @@ > if ( d->arch.pv_domain.nr_e820 == 0 ) > return -ENOSYS; > > + if ( d->arch.pv_domain.e820 == NULL ) > + return -ENOSYS; > + > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > diff -r decab6c21cc3 -r 01d0d338b974 xen/include/asm-x86/domain.h > --- a/xen/include/asm-x86/domain.h Thu Apr 07 12:36:26 2011 -0400 > +++ b/xen/include/asm-x86/domain.h Thu Apr 07 14:58:17 2011 -0400 > @@ -238,7 +238,7 @@ > unsigned long pirq_eoi_map_mfn; > > /* Pseudophysical e820 map (XENMEM_memory_map). */ > - struct e820entry e820[3]; > + struct e820entry *e820; > unsigned int nr_e820; > }; > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 08:36 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote:> # HG changeset patch > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > # Date 1302206363 14400 > # Node ID 2e464234c94cfd29a98a9011a46d76846b87f7f8 > # Parent 546d8a03d5cbe0ceddadf701174f2417a0b72891 > libxl: Add support for passing in the machine''s E820 for PCI passthrough. > > The code (libxl_e820_alloc) calls the xc_get_machine_memory_map to > retrieve the systems E820. Then the E820 is sanitized to weed out E820 > entries below 16MB, and as well remove any E820_RAM or E820_UNUSED > regions as the guest does not need to know abou them. The guest > only needs the E820_ACPI, E820_NVS, E820_RESERVED to get an idea of > where the PCI I/O space is. Mostly.. The Linux kernel assumes that any > gap in the E820 is considered PCI I/O space which means that if we pass > in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the > gap between 2GB and 3GB will be considered as PCI I/O space. To guard against > that we also create an E820_UNUSABLE between the region of ''target_kb'' > (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. > > Memory that is slack or for balloon (so ''maxmem'' in guest configuration) > is put behind the machine E820. Which in most cases is after the 4GB. > > The reason for doing the fetching of the E820 using the hypercall in > the toolstack (instead of the guest doing it) is that when a Linux > guest would do a hypercall to ''XENMEM_machine_memory_map'' it would > retrieve an E820 with I/O range caps added in. Meaning that the > region after 4GB up to end of possible memory would be marked as unusuable > and the Linux kernel would not have any space to allocate a balloon > region. > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.h > --- a/tools/libxl/libxl.h Thu Apr 07 15:18:46 2011 -0400 > +++ b/tools/libxl/libxl.h Thu Apr 07 15:59:23 2011 -0400 > @@ -204,6 +204,14 @@ > } libxl_file_reference; > void libxl_file_reference_destroy(libxl_file_reference *p); > > +#include <xc_e820.h>We are trying to avoid exposing libxc headers to libxl users. struct e820entry is well defined and unchanging so I don''t think there is much harm in duplicating it.> +typedef struct { > + uint32_t nr_entries; > + struct e820entry *entry; > +} libxl_e820;BTW does the e820entry array need to be packed at some point before getting returned to the guest? (I''m not sure what the expected alignment is, nor how a 20 byte struct containing two u64s and a u32 naturally aligns on x86_64).> diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.idl > --- a/tools/libxl/libxl.idl Thu Apr 07 15:18:46 2011 -0400 > +++ b/tools/libxl/libxl.idl Thu Apr 07 15:59:23 2011 -0400 > @@ -22,6 +22,7 @@ > > libxl_hwcap = Builtin("hwcap") > > +libxl_e820 = Builtin("e820", destructor_fn="libxl_e820_destroy", passby=PASS_BY_REFERENCE)Hmmm, We are starting to collect a lot of builtins of the form {len, pointer}. Time to start thinking of a new type class abstraction, I think... (not your problem in this series though)> diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl_pci.c > --- a/tools/libxl/libxl_pci.c Thu Apr 07 15:18:46 2011 -0400 > +++ b/tools/libxl/libxl_pci.c Thu Apr 07 15:59:23 2011 -0400 > @@ -37,6 +37,8 @@ > #include "libxl_internal.h" > #include "flexarray.h" > > +#include <xc_e820.h> > + > #define PCI_BDF "%04x:%02x:%02x.%01x" > #define PCI_BDF_SHORT "%02x:%02x.%01x" > #define PCI_BDF_VDEVFN "%04x:%02x:%02x.%01x@%02x" > @@ -1047,3 +1049,154 @@ > free(pcidevs); > return 0; > } > + > +#define E820MAX (128)Should be somewhere public for the benefit of libxl users who want to setup an e820? Even if it should be private it should be in a .h. Are we expecting that libxl users might want to modify the e820? If not then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just add a flag to the libxl interface which says whether or not to provide a host-derived e820. If users are expected to modify it then I''m in two minds about doing the sanitize step at domain build time instead of in libxl_e820_alloc. If we provide a default sanitized e820 to the caller and they want to modify it to be non-santized should we let them? I guess the question here is whether we are sanitizing actually invalid e820 maps or if we are also enforcing some higher level policy based on a specific class Linux guest''s requirements. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 08:42 UTC
Re: [Xen-devel] [PATCH 0 of 5] Patches for PCI passthrough with modified E820.
On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote:> This set of RFC patches allows a PV domain to see the machine''s > E820 and figure out where the "PCI I/O" gap is and match it with the > reality.Does the domain builder obey this memory map at all or is it a PV guests responsibility to take the linear p2m allocation it starts with a move stuff around to fit the map?> To use this patchset, the guest config file has to have the parameter > ''pci_hole=1'' enabled (hmm, any ideas for a better name?)Is there any harm in just doing this for any guest configuration which has a "pci" option specified? (including the empty list "pci=[]" to handle guests which only want hotplug capabilities not an initial set of devices). Or could we even go so far as to consider always doing this unconditionally? Will older pvops and/or classic-Xen kernels or other PV OSes misbehave if we do either of these? is having a default-on option which these users need to force off better or worse than a default-off option which the opposite set of people need to enable? Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Jackson
2011-Apr-08 10:56 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
Ian Campbell writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"):> Are we expecting that libxl users might want to modify the e820? If not > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > add a flag to the libxl interface which says whether or not to provide a > host-derived e820.Well, also, why do we need that flag at all ? Are we trying to do something different with domains which might get pci passthrough ? If so then that''s what should be specified at the libxl api, surely, rather than some opaque "write this rune to make it work" option. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 13:19 UTC
Re: [Xen-devel] [PATCH 1 of 5] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls
On Fri, Apr 08, 2011 at 09:18:59AM +0100, Ian Campbell wrote:> On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote: > > # HG changeset patch > > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > # Date 1302194186 14400 > > # Node ID decab6c21cc3d7ce4d4dad949d34ba35d4600490 > > # Parent 97763efc41f9b664cf6f7db653c9c3f51e50b358 > > tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls. > > > > The later retrieves the E820 as seen by the hypervisor (completly > > unchanged) and the second call sets the E820 for a specific guest. > > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > diff -r 97763efc41f9 -r decab6c21cc3 tools/libxc/xc_domain.c > > --- a/tools/libxc/xc_domain.c Tue Apr 05 18:23:54 2011 +0100 > > +++ b/tools/libxc/xc_domain.c Thu Apr 07 12:36:26 2011 -0400 > > @@ -510,6 +510,55 @@ > > Please add: > [diff] > showfunc = True > to your ~/.hgrcAye aye> > > > > return rc; > > } > > + > > +int xc_domain_set_memory_map(xc_interface *xch, > > + uint32_t domid, > > + struct e820entry entries[], > > + uint32_t nr_entries) > > +{ > > + int rc; > > + struct xen_foreign_memory_map fmap = { > > + .domid = domid, > > + .map = { .nr_entries = nr_entries } > > + }; > > + DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry), > > + XC_HYPERCALL_BUFFER_BOUNCE_IN); > > + > > + if ( !entries || xc_hypercall_bounce_pre(xch, entries) ) > > + return -1; > > + > > + set_xen_guest_handle(fmap.map.buffer, entries); > > + > > + rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap)); > > + > > + xc_hypercall_bounce_post(xch, entries); > > + > > + return rc; > > +} > > Should probably reimplement xc_domain_set_memmap_limit() as a wrapper > around this new function. And/or nuke it if that''s an option.Yes! Will do that. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 13:21 UTC
Re: [Xen-devel] [PATCH 2 of 5] x86: make the pv-only e820 array be dynamic
On Fri, Apr 08, 2011 at 09:22:05AM +0100, Ian Campbell wrote:> On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote: > > # HG changeset patch > > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > # Date 1302202697 14400 > > # Node ID 01d0d338b97491a3aa816dab43cc709a234214f7 > > # Parent decab6c21cc3d7ce4d4dad949d34ba35d4600490 > > x86: make the pv-only e820 array be dynamic. > > > > During creation of the PV domain we allocate the E820 structure to be > > the E820MAX. This will allow the tool stack to fill the E820 with more > > than three entries. > > Is it possible to defer this allocation to set_memory_map time so we can > allocate only the required size? I guess E820MAX (128) entries is only > 2.5k but a typical e820 is capped at more like 20 entries in practice > (<0.5k).The later patch does that. If I figure out how to "squash" the hg commits together I can post it as one patch.. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 13:24 UTC
Re: [Xen-devel] [PATCH 0 of 5] Patches for PCI passthrough with modified E820.
On Fri, Apr 08, 2011 at 09:42:04AM +0100, Ian Campbell wrote:> On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote: > > > This set of RFC patches allows a PV domain to see the machine''s > > E820 and figure out where the "PCI I/O" gap is and match it with the > > reality. > > Does the domain builder obey this memory map at all or is it a PV guests > responsibility to take the linear p2m allocation it starts with a move > stuff around to fit the map?The PV guest is responsible.> > > To use this patchset, the guest config file has to have the parameter > > ''pci_hole=1'' enabled (hmm, any ideas for a better name?) > > Is there any harm in just doing this for any guest configuration which > has a "pci" option specified? (including the empty list "pci=[]" to > handle guests which only want hotplug capabilities not an initial set of > devices).Good idea. Will key of from both of them (since you can do hotplug PCI without having the ''pci'' option present).> > Or could we even go so far as to consider always doing this > unconditionally?Tempting. I would need to test the other older kernels to make sure I am not breaking anything.> > Will older pvops and/or classic-Xen kernels or other PV OSes misbehaveOlder pvops work. Will test the 2.6.32, as the earliest one I tested was the 2.6.36 and that worked quite well.> if we do either of these? is having a default-on option which these > users need to force off better or worse than a default-off option which > the opposite set of people need to enable?No idea. I just don''t want to cause regressions so choose the more conservative path... but that has the pitfalls of bit-rotting. Let me do some more testing and see what happens.> > Ian._______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 13:33 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, Apr 08, 2011 at 09:36:45AM +0100, Ian Campbell wrote:> On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote: > > # HG changeset patch > > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > # Date 1302206363 14400 > > # Node ID 2e464234c94cfd29a98a9011a46d76846b87f7f8 > > # Parent 546d8a03d5cbe0ceddadf701174f2417a0b72891 > > libxl: Add support for passing in the machine''s E820 for PCI passthrough. > > > > The code (libxl_e820_alloc) calls the xc_get_machine_memory_map to > > retrieve the systems E820. Then the E820 is sanitized to weed out E820 > > entries below 16MB, and as well remove any E820_RAM or E820_UNUSED > > regions as the guest does not need to know abou them. The guest > > only needs the E820_ACPI, E820_NVS, E820_RESERVED to get an idea of > > where the PCI I/O space is. Mostly.. The Linux kernel assumes that any > > gap in the E820 is considered PCI I/O space which means that if we pass > > in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the > > gap between 2GB and 3GB will be considered as PCI I/O space. To guard against > > that we also create an E820_UNUSABLE between the region of ''target_kb'' > > (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. > > > > Memory that is slack or for balloon (so ''maxmem'' in guest configuration) > > is put behind the machine E820. Which in most cases is after the 4GB. > > > > The reason for doing the fetching of the E820 using the hypercall in > > the toolstack (instead of the guest doing it) is that when a Linux > > guest would do a hypercall to ''XENMEM_machine_memory_map'' it would > > retrieve an E820 with I/O range caps added in. Meaning that the > > region after 4GB up to end of possible memory would be marked as unusuable > > and the Linux kernel would not have any space to allocate a balloon > > region. > > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > > diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.h > > --- a/tools/libxl/libxl.h Thu Apr 07 15:18:46 2011 -0400 > > +++ b/tools/libxl/libxl.h Thu Apr 07 15:59:23 2011 -0400 > > @@ -204,6 +204,14 @@ > > } libxl_file_reference; > > void libxl_file_reference_destroy(libxl_file_reference *p); > > > > +#include <xc_e820.h> > > We are trying to avoid exposing libxc headers to libxl users. struct > e820entry is well defined and unchanging so I don''t think there is much > harm in duplicating it.OK.> > > +typedef struct { > > + uint32_t nr_entries; > > + struct e820entry *entry; > > +} libxl_e820; > > BTW does the e820entry array need to be packed at some point before > getting returned to the guest? (I''m not sure what the expected alignmentThe struct e820entry has the __packed__ attribute on it.> is, nor how a 20 byte struct containing two u64s and a u32 naturally > aligns on x86_64).It seems to work OK.. (I booted a 32-bit guest under a 64-bit hypervisor with a 64-bit dom0).> > > diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl.idl > > --- a/tools/libxl/libxl.idl Thu Apr 07 15:18:46 2011 -0400 > > +++ b/tools/libxl/libxl.idl Thu Apr 07 15:59:23 2011 -0400 > > @@ -22,6 +22,7 @@ > > > > libxl_hwcap = Builtin("hwcap") > > > > +libxl_e820 = Builtin("e820", destructor_fn="libxl_e820_destroy", passby=PASS_BY_REFERENCE) > > Hmmm, We are starting to collect a lot of builtins of the form {len, > pointer}. Time to start thinking of a new type class abstraction, I > think... (not your problem in this series though) > > > diff -r 546d8a03d5cb -r 2e464234c94c tools/libxl/libxl_pci.c > > --- a/tools/libxl/libxl_pci.c Thu Apr 07 15:18:46 2011 -0400 > > +++ b/tools/libxl/libxl_pci.c Thu Apr 07 15:59:23 2011 -0400 > > @@ -37,6 +37,8 @@ > > #include "libxl_internal.h" > > #include "flexarray.h" > > > > +#include <xc_e820.h> > > + > > #define PCI_BDF "%04x:%02x:%02x.%01x" > > #define PCI_BDF_SHORT "%02x:%02x.%01x" > > #define PCI_BDF_VDEVFN "%04x:%02x:%02x.%01x@%02x" > > @@ -1047,3 +1049,154 @@ > > free(pcidevs); > > return 0; > > } > > + > > +#define E820MAX (128) > > Should be somewhere public for the benefit of libxl users who want to > setup an e820? Even if it should be private it should be in a .h.Aye aye.> > Are we expecting that libxl users might want to modify the e820? If not > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > add a flag to the libxl interface which says whether or not to provide a > host-derived e820.OK. Where should I put the definitions of these two functions? (They are called from libxl_dom.c and xl_cmdimpl.c) Is there a internal header file?> > If users are expected to modify it then I''m in two minds about doing the > sanitize step at domain build time instead of in libxl_e820_alloc. If weThe issue I had was that to construct the e820, I needed the ''target_kb'' and ''max_memkb'' and ''slack_memkb''. If I can get those two values when the guest config file is parsed (so xl_cmdimpl.c) I think moving the sanitization in there (and not doing it during domain build time) is the right thing to do. Let me see if I can do that.> provide a default sanitized e820 to the caller and they want to modify > it to be non-santized should we let them? I guess the question here is > whether we are sanitizing actually invalid e820 maps or if we are also > enforcing some higher level policy based on a specific class Linux > guest''s requirements.Which might be quite different if the guest is a BSD one. I guess it is time to start launching those NetBSD guests :-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 13:35 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, Apr 08, 2011 at 11:56:52AM +0100, Ian Jackson wrote:> Ian Campbell writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"): > > Are we expecting that libxl users might want to modify the e820? If not > > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > > add a flag to the libxl interface which says whether or not to provide a > > host-derived e820. > > Well, also, why do we need that flag at all ? Are we trying to do > something different with domains which might get pci passthrough ? IfYes. Well, it does work OK even if you are _not_ doing PCI passthrough. But the main users for this are the ones doing PCI passthrough.> so then that''s what should be specified at the libxl api, surely, > rather than some opaque "write this rune to make it work" option.OK. If I am understanding you guys right, you are saying, latch it of the ''pci'' option instead of this ''pci_hole'' option. And do not expose thse libxl_e820_* functions, but keep them internal to the libxl code. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 13:55 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, 2011-04-08 at 14:35 +0100, Konrad Rzeszutek Wilk wrote:> On Fri, Apr 08, 2011 at 11:56:52AM +0100, Ian Jackson wrote: > > Ian Campbell writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"): > > > Are we expecting that libxl users might want to modify the e820? If not > > > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > > > add a flag to the libxl interface which says whether or not to provide a > > > host-derived e820. > > > > Well, also, why do we need that flag at all ? Are we trying to do > > something different with domains which might get pci passthrough ? If > > Yes. Well, it does work OK even if you are _not_ doing PCI passthrough. > But the main users for this are the ones doing PCI passthrough. > > > so then that''s what should be specified at the libxl api, surely, > > rather than some opaque "write this rune to make it work" option. > > OK. If I am understanding you guys right, you are saying, latch > it of the ''pci'' option instead of this ''pci_hole'' option.ACK. For the pure-hotplug case an empty pci=[] ought to suffice to trigger this functionality.> And do not expose thse libxl_e820_* functions, but keep them > internal to the libxl code.Yes, assuming we aren''t expecting anyone to modify the table between getting it from libxl and handing it back to libxl we should simply not expose it. We can always expose it later if need be... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 14:00 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, 2011-04-08 at 14:33 +0100, Konrad Rzeszutek Wilk wrote:> On Fri, Apr 08, 2011 at 09:36:45AM +0100, Ian Campbell wrote: > > On Thu, 2011-04-07 at 21:25 +0100, Konrad Rzeszutek Wilk wrote:> > > +typedef struct { > > > + uint32_t nr_entries; > > > + struct e820entry *entry; > > > +} libxl_e820; > > > > BTW does the e820entry array need to be packed at some point before > > getting returned to the guest? (I''m not sure what the expected alignment > > The struct e820entry has the __packed__ attribute on it.OK, so you might end up with the unpacked version in the libxl getting repacked into the xc version internally to libxl for consumption by libxc, which is annoying but better than exposing these things to libxl users, IMHO.> > is, nor how a 20 byte struct containing two u64s and a u32 naturally > > aligns on x86_64). > > It seems to work OK.. (I booted a 32-bit guest under a 64-bit hypervisor > with a 64-bit dom0).If the libxc version of the struct is __packed__ it''ll do the right thing, so that''s good.> > Are we expecting that libxl users might want to modify the e820? If not > > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > > add a flag to the libxl interface which says whether or not to provide a > > host-derived e820. > > OK. Where should I put the definitions of these two functions? (They are > called from libxl_dom.c and xl_cmdimpl.c) Is there a internal header file?libxl_internal.h is the internal hdr although that isn''t available to xl_cmdimpl.c, obviously, but I think we are talking about not having this call in xl anyway.> > If users are expected to modify it then I''m in two minds about doing the > > sanitize step at domain build time instead of in libxl_e820_alloc. If we > > The issue I had was that to construct the e820, I needed the > ''target_kb'' and ''max_memkb'' and ''slack_memkb''. If I can get those two values > when the guest config file is parsed (so xl_cmdimpl.c) I think moving > the sanitization in there (and not doing it during domain build time) > is the right thing to do. Let me see if I can do that. > > > provide a default sanitized e820 to the caller and they want to modify > > it to be non-santized should we let them? I guess the question here is > > whether we are sanitizing actually invalid e820 maps or if we are also > > enforcing some higher level policy based on a specific class Linux > > guest''s requirements. > > Which might be quite different if the guest is a BSD one. I guess > it is time to start launching those NetBSD guests :-)Yeah ... I''m perpetually trying (and failing) to find a minute to try that again... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Tim Deegan
2011-Apr-08 14:09 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
At 14:55 +0100 on 08 Apr (1302274509), Ian Campbell wrote:> On Fri, 2011-04-08 at 14:35 +0100, Konrad Rzeszutek Wilk wrote: > > On Fri, Apr 08, 2011 at 11:56:52AM +0100, Ian Jackson wrote: > > > Ian Campbell writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"): > > > > Are we expecting that libxl users might want to modify the e820? If not > > > > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > > > > add a flag to the libxl interface which says whether or not to provide a > > > > host-derived e820. > > > > > > Well, also, why do we need that flag at all ? Are we trying to do > > > something different with domains which might get pci passthrough ? If > > > > Yes. Well, it does work OK even if you are _not_ doing PCI passthrough. > > But the main users for this are the ones doing PCI passthrough. > > > > > so then that''s what should be specified at the libxl api, surely, > > > rather than some opaque "write this rune to make it work" option. > > > > OK. If I am understanding you guys right, you are saying, latch > > it of the ''pci'' option instead of this ''pci_hole'' option. > > ACK. For the pure-hotplug case an empty pci=[] ought to suffice to > trigger this functionality.That''s a _horrible_ config syntax. We should either do this for all guests or give it its own option (which can default to 1 for guests with non-empty passthrough lists if you like). Cheers, Tim. -- Tim Deegan <Tim.Deegan@citrix.com> Principal Software Engineer, Xen Platform Team Citrix Systems UK Ltd. (Company #02937203, SL9 0BG) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 14:17 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, 2011-04-08 at 15:09 +0100, Tim Deegan wrote:> At 14:55 +0100 on 08 Apr (1302274509), Ian Campbell wrote: > > On Fri, 2011-04-08 at 14:35 +0100, Konrad Rzeszutek Wilk wrote: > > > On Fri, Apr 08, 2011 at 11:56:52AM +0100, Ian Jackson wrote: > > > > Ian Campbell writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"): > > > > > Are we expecting that libxl users might want to modify the e820? If not > > > > > then why expose libxl_e820_alloc and libxl_e820_sanitize at all, just > > > > > add a flag to the libxl interface which says whether or not to provide a > > > > > host-derived e820. > > > > > > > > Well, also, why do we need that flag at all ? Are we trying to do > > > > something different with domains which might get pci passthrough ? If > > > > > > Yes. Well, it does work OK even if you are _not_ doing PCI passthrough. > > > But the main users for this are the ones doing PCI passthrough. > > > > > > > so then that''s what should be specified at the libxl api, surely, > > > > rather than some opaque "write this rune to make it work" option. > > > > > > OK. If I am understanding you guys right, you are saying, latch > > > it of the ''pci'' option instead of this ''pci_hole'' option. > > > > ACK. For the pure-hotplug case an empty pci=[] ought to suffice to > > trigger this functionality. > > That''s a _horrible_ config syntax.fair enough. I thought it nicely represented the "has an empty PCI bus" myself...> We should either do this for all > guests or give it its own option (which can default to 1 for guests with > non-empty passthrough lists if you like).Ack, if it''s an option it should definitely default to on for non-empty pci passthrough lists. I think we can probably get away with default on always though and allow it to be explicitly disabled. That''s really just a case of trying on a few PV OSes and seeing how much they barf... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Tim Deegan
2011-Apr-08 14:25 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote:> > That''s a _horrible_ config syntax. > > fair enough. I thought it nicely represented the "has an empty PCI bus" > myself...Sure, but it''s really not obvious that that means "has a memory layout that matches the host on which it was last booted".> > We should either do this for all > > guests or give it its own option (which can default to 1 for guests with > > non-empty passthrough lists if you like). > > Ack, if it''s an option it should definitely default to on for non-empty > pci passthrough lists. I think we can probably get away with default on > always though and allow it to be explicitly disabled. That''s really just > a case of trying on a few PV OSes and seeing how much they barf...Agreed. My guess is there won''t be any trouble. Tim. -- Tim Deegan <Tim.Deegan@citrix.com> Principal Software Engineer, Xen Platform Team Citrix Systems UK Ltd. (Company #02937203, SL9 0BG) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 14:33 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, 2011-04-08 at 15:25 +0100, Tim Deegan wrote:> At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > That''s a _horrible_ config syntax. > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > myself... > > Sure, but it''s really not obvious that that means "has a memory layout > that matches the host on which it was last booted".Hmm, yes, this proposal doesn''t really cover migration does it... You obviously can''t migrate with a device passed through but it should be possible to unplug, migrate and replug. Maybe having a sensible hole on the first host is sufficient... Especially if the annoying case is GFX passthrough -- you are much less like to want/be able to unplug your gfx over a migration... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 14:34 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, Apr 08, 2011 at 03:25:36PM +0100, Tim Deegan wrote:> At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > That''s a _horrible_ config syntax. > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > myself... > > Sure, but it''s really not obvious that that means "has a memory layout > that matches the host on which it was last booted". > > > > We should either do this for all > > > guests or give it its own option (which can default to 1 for guests with > > > non-empty passthrough lists if you like). > > > > Ack, if it''s an option it should definitely default to on for non-empty > > pci passthrough lists. I think we can probably get away with default on > > always though and allow it to be explicitly disabled. That''s really just > > a case of trying on a few PV OSes and seeing how much they barf... > > Agreed. My guess is there won''t be any trouble.Whoa. I just booted an SLES11 32 bit and it booted so nicely :-) (2.6.27) Time to try out some RHEL guests. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-08 14:42 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, 2011-04-08 at 15:34 +0100, Konrad Rzeszutek Wilk wrote:> On Fri, Apr 08, 2011 at 03:25:36PM +0100, Tim Deegan wrote: > > At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > > That''s a _horrible_ config syntax. > > > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > > myself... > > > > Sure, but it''s really not obvious that that means "has a memory layout > > that matches the host on which it was last booted". > > > > > > We should either do this for all > > > > guests or give it its own option (which can default to 1 for guests with > > > > non-empty passthrough lists if you like). > > > > > > Ack, if it''s an option it should definitely default to on for non-empty > > > pci passthrough lists. I think we can probably get away with default on > > > always though and allow it to be explicitly disabled. That''s really just > > > a case of trying on a few PV OSes and seeing how much they barf... > > > > Agreed. My guess is there won''t be any trouble. > > Whoa. I just booted an SLES11 32 bit and it booted so nicely :-) > (2.6.27)With all of its memory, or did it loose the bits from the holes?> Time to try out some RHEL guests.Good idea. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 14:54 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, Apr 08, 2011 at 03:42:48PM +0100, Ian Campbell wrote:> On Fri, 2011-04-08 at 15:34 +0100, Konrad Rzeszutek Wilk wrote: > > On Fri, Apr 08, 2011 at 03:25:36PM +0100, Tim Deegan wrote: > > > At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > > > That''s a _horrible_ config syntax. > > > > > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > > > myself... > > > > > > Sure, but it''s really not obvious that that means "has a memory layout > > > that matches the host on which it was last booted". > > > > > > > > We should either do this for all > > > > > guests or give it its own option (which can default to 1 for guests with > > > > > non-empty passthrough lists if you like). > > > > > > > > Ack, if it''s an option it should definitely default to on for non-empty > > > > pci passthrough lists. I think we can probably get away with default on > > > > always though and allow it to be explicitly disabled. That''s really just > > > > a case of trying on a few PV OSes and seeing how much they barf... > > > > > > Agreed. My guess is there won''t be any trouble. > > > > Whoa. I just booted an SLES11 32 bit and it booted so nicely :-) > > (2.6.27) > > With all of its memory, or did it loose the bits from the holes?All memory intact. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-08 15:00 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
On Fri, Apr 08, 2011 at 03:33:47PM +0100, Ian Campbell wrote:> On Fri, 2011-04-08 at 15:25 +0100, Tim Deegan wrote: > > At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > > That''s a _horrible_ config syntax. > > > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > > myself... > > > > Sure, but it''s really not obvious that that means "has a memory layout > > that matches the host on which it was last booted". > > Hmm, yes, this proposal doesn''t really cover migration does it... > > You obviously can''t migrate with a device passed through but it should > be possible to unplug, migrate and replug. Maybe having a sensible holeYes.> on the first host is sufficient... Especially if the annoying case is > GFX passthrough -- you are much less like to want/be able to unplug your > gfx over a migration...So one use case is SR-IOV and depending on what machine you are on, the BARs can be different. If you are migrating between the same type of machines, where the BARs and E820 are the same machine it "ought" to work (hadn''t tested it as I only have one machine with SR-IOV ..but I should be able to test the save/restore on that machine). What I did test was save/restore on a single machine (where I unload the NIC module''s before hand and load it after a restore) and it worked (with some Linux kernel patches that I need to post soon). But the right way is to augment the Linux kernel code to deal with a new E820 _after_ a resume. This is something I plan on doing during the summer, but not right now. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Jackson
2011-Apr-08 16:01 UTC
Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough
Tim Deegan writes ("Re: [Xen-devel] [PATCH 4 of 5] libxl: Add support for passing in the machine''s E820 for PCI passthrough"):> At 15:17 +0100 on 08 Apr (1302275848), Ian Campbell wrote: > > > That''s a _horrible_ config syntax. > > > > fair enough. I thought it nicely represented the "has an empty PCI bus" > > myself... > > Sure, but it''s really not obvious that that means "has a memory layout > that matches the host on which it was last booted".No, that''s what it _does_. What it _means_ is "please make this domain work for pci passthrough hotplug", which AIUI is the circumstance in which it it relevant. That this has an effect on the guest memory map is an implementation detail. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel