Konrad Rzeszutek Wilk
2011-Apr-11 21:35 UTC
[Xen-devel] [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v2).
Hello, This set of v2 patches allows a PV domain to see the machine''s E820 and figure out where the "PCI I/O" gap is and match it with the reality. Changelog since v1 posting: - Squashed the "x86: make the pv-only e820 array be dynamic" and "x86: adjust the size of the e820 for pv guest to be dynamic" together. - Made xc_domain_set_memmap_limit use the ''xc_domain_set_memory_map'' - Moved ''libxl_e820_alloc'' and ''libxl_e820_sanitize'' to be an internal operation and called from ''libxl_device_pci_parse_bdf''. - Expanded ''libxl_device_pci_parse_bdf'' API call to have an extra argument (optional). The short end is that with these patches a PV domain can: - Use the correct PCI I/O gap. Before these patches, Linux guest would boot up and would tell: [ 0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000) while in actuality the PCI I/O gap should have been: [ 0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000) - The PV domain with PCI devices was limited to 3GB. It now can be booted with 4GB, 8GB, or whatever number you want. The PCI devices will now _not_ conflict with System RAM. Meaning the drivers can load. - With 2.6.39 kernels (which has the 1-1 mapping code), the VM_IO flag will be now automatically applied to regions that are considerd PCI I/O regions. You can find out which those are by looking for ''1-1'' in the kernel bootup. To use this patchset, the guest config file has to have the parameter ''pci=[''<BDF>'',...]'' enabled. This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38, and 2.6.39 kernels. Also tested with PV NetBSD 5.1. Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests with success. P.S. *There is a bug in the Linux kernel so that if you want to save/restore a PV guest that has 1-1 mapping it won''t restore. Will post the patches shortly for that. Diffstat: tools/libxc/xc_domain.c | 77 +++++++---- tools/libxc/xc_e820.h | 1 tools/libxc/xenctrl.h | 11 + tools/libxl/libxl.h | 11 + tools/libxl/libxl.idl | 2 tools/libxl/libxl_dom.c | 10 + tools/libxl/libxl_internal.h | 1 tools/libxl/libxl_pci.c | 259 +++++++++++++++++++++++++++++++++++++- tools/libxl/xl_cmdimpl.c | 6 tools/python/xen/lowlevel/xl/xl.c | 2 xen/arch/x86/domain.c | 17 ++ xen/arch/x86/mm.c | 14 +- xen/include/asm-x86/domain.h | 2 xen/include/asm-x86/e820.h | 2 14 files changed, 379 insertions(+), 36 deletions(-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-11 21:35 UTC
[Xen-devel] [PATCH 1 of 4] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only)
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302557409 14400 # Node ID 8190b1eb9a8cfe63ee0aecc35416c4e8300c45f4 # Parent 4e7016852f35b44af6d5f82737ceb47015ef00a6 tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86,amd64 only). The later retrieves the E820 as seen by the hypervisor (completely unchanged) and the second call sets the E820 for the specified guest. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 4e7016852f35 -r 8190b1eb9a8c tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c Mon Apr 11 17:25:15 2011 -0400 +++ b/tools/libxc/xc_domain.c Mon Apr 11 17:30:09 2011 -0400 @@ -478,37 +478,64 @@ int xc_domain_pin_memory_cacheattr(xc_in } #if defined(__i386__) || defined(__x86_64__) -#include "xc_e820.h" +int xc_domain_set_memory_map(xc_interface *xch, + uint32_t domid, + struct e820entry entries[], + uint32_t nr_entries) +{ + int rc; + struct xen_foreign_memory_map fmap = { + .domid = domid, + .map = { .nr_entries = nr_entries } + }; + DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry), + XC_HYPERCALL_BUFFER_BOUNCE_IN); + + if ( !entries || xc_hypercall_bounce_pre(xch, entries) ) + return -1; + + set_xen_guest_handle(fmap.map.buffer, entries); + + rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap)); + + xc_hypercall_bounce_post(xch, entries); + + return rc; +} +int xc_get_machine_memory_map(xc_interface *xch, + struct e820entry entries[], + uint32_t max_entries) +{ + int rc; + struct xen_memory_map memmap = { + .nr_entries = max_entries + }; + DECLARE_HYPERCALL_BOUNCE(entries, sizeof(struct e820entry) * max_entries, + XC_HYPERCALL_BUFFER_BOUNCE_OUT); + + if ( !entries || xc_hypercall_bounce_pre(xch, entries) || max_entries <= 1) + return -1; + + + set_xen_guest_handle(memmap.buffer, entries); + + rc = do_memory_op(xch, XENMEM_machine_memory_map, &memmap, sizeof(memmap)); + + xc_hypercall_bounce_post(xch, entries); + + return rc ? rc : memmap.nr_entries; +} int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, unsigned long map_limitkb) { - int rc; - struct xen_foreign_memory_map fmap = { - .domid = domid, - .map = { .nr_entries = 1 } - }; - DECLARE_HYPERCALL_BUFFER(struct e820entry, e820); + struct e820entry e820; - e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820)); + e820.addr = 0; + e820.size = (uint64_t)map_limitkb << 10; + e820.type = E820_RAM; - if ( e820 == NULL ) - { - PERROR("Could not allocate memory for xc_domain_set_memmap_limit hypercall"); - return -1; - } - - e820->addr = 0; - e820->size = (uint64_t)map_limitkb << 10; - e820->type = E820_RAM; - - set_xen_guest_handle(fmap.map.buffer, e820); - - rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap)); - - xc_hypercall_buffer_free(xch, e820); - - return rc; + return xc_domain_set_memory_map(xch, domid, &e820, 1); } #else int xc_domain_set_memmap_limit(xc_interface *xch, diff -r 4e7016852f35 -r 8190b1eb9a8c tools/libxc/xc_e820.h --- a/tools/libxc/xc_e820.h Mon Apr 11 17:25:15 2011 -0400 +++ b/tools/libxc/xc_e820.h Mon Apr 11 17:30:09 2011 -0400 @@ -26,6 +26,7 @@ #define E820_RESERVED 2 #define E820_ACPI 3 #define E820_NVS 4 +#define E820_UNUSABLE 5 struct e820entry { uint64_t addr; diff -r 4e7016852f35 -r 8190b1eb9a8c tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h Mon Apr 11 17:25:15 2011 -0400 +++ b/tools/libxc/xenctrl.h Mon Apr 11 17:30:09 2011 -0400 @@ -966,6 +966,17 @@ int xc_domain_set_memmap_limit(xc_interf uint32_t domid, unsigned long map_limitkb); +#if defined(__i386__) || defined(__x86_64__) +#include "xc_e820.h" +int xc_domain_set_memory_map(xc_interface *xch, + uint32_t domid, + struct e820entry entries[], + uint32_t nr_entries); + +int xc_get_machine_memory_map(xc_interface *xch, + struct e820entry entries[], + uint32_t max_entries); +#endif int xc_domain_set_time_offset(xc_interface *xch, uint32_t domid, int32_t time_offset_seconds); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-11 21:35 UTC
[Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302295108 14400 # Node ID e7057fec103ba69776d157d470d630ce99dbc540 # Parent 0f279f43f41ffa8549e076f2a30f499bd8d6cc1d x86: make the pv-only e820 array be dynamic. During creation of the PV domain we allocate the E820 structure to have the amount of E820 entries on the machine, plus the number three. This will allow the tool stack to fill the E820 with more than three entries. Specifically the use cases is , where the toolstack retrieves the E820, sanitizes it, and then sets it for the PV guest (for PCI passthrough), this dynamic number of E820 is just right. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 0f279f43f41f -r e7057fec103b xen/arch/x86/domain.c --- a/xen/arch/x86/domain.c Fri Apr 08 16:35:49 2011 -0400 +++ b/xen/arch/x86/domain.c Fri Apr 08 16:38:28 2011 -0400 @@ -634,14 +634,23 @@ int arch_domain_create(struct domain *d, d->arch.pirq_emuirq[i] = IRQ_UNBOUND; for (i = 0; i < nr_irqs; i++) d->arch.emuirq_pirq[i] = IRQ_UNBOUND; + } else + { + d->arch.pv_domain.e820 = xmalloc_array(struct e820entry, E820NR); + + if ( !d->arch.pv_domain.e820 ) + goto fail; + + memset(d->arch.pv_domain.e820, 0, + E820NR * sizeof(*d->arch.pv_domain.e820)); } - if ( (rc = iommu_domain_init(d)) != 0 ) goto fail; /* For Guest vMCE MSRs virtualization */ vmce_init_msr(d); + } if ( is_hvm_domain(d) ) @@ -668,6 +677,10 @@ int arch_domain_create(struct domain *d, fail: d->is_dying = DOMDYING_dead; vmce_destroy_msr(d); + if ( !is_hvm_domain(d) ) + { + xfree(d->arch.pv_domain.e820); + } xfree(d->arch.pirq_irq); xfree(d->arch.irq_pirq); xfree(d->arch.pirq_emuirq); @@ -696,6 +709,8 @@ void arch_domain_destroy(struct domain * if ( is_hvm_domain(d) ) hvm_domain_destroy(d); + else + xfree(d->arch.pv_domain.e820); vmce_destroy_msr(d); pci_release_devices(d); diff -r 0f279f43f41f -r e7057fec103b xen/arch/x86/mm.c --- a/xen/arch/x86/mm.c Fri Apr 08 16:35:49 2011 -0400 +++ b/xen/arch/x86/mm.c Fri Apr 08 16:38:28 2011 -0400 @@ -4710,7 +4710,7 @@ long arch_memory_op(int op, XEN_GUEST_HA if ( copy_from_guest(&fmap, arg, 1) ) return -EFAULT; - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) + if ( fmap.map.nr_entries > E820NR ) return -EINVAL; rc = rcu_lock_target_domain_by_id(fmap.domid, &d); @@ -4730,9 +4730,16 @@ long arch_memory_op(int op, XEN_GUEST_HA return -EPERM; } + if ( d->arch.pv_domain.e820 == NULL ) + { + rcu_unlock_domain(d); + return -EINVAL; + } rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, fmap.map.nr_entries) ? -EFAULT : 0; - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; + + if ( rc == 0 ) + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; rcu_unlock_domain(d); return rc; @@ -4747,6 +4754,9 @@ long arch_memory_op(int op, XEN_GUEST_HA if ( d->arch.pv_domain.nr_e820 == 0 ) return -ENOSYS; + if ( d->arch.pv_domain.e820 == NULL ) + return -ENOSYS; + if ( copy_from_guest(&map, arg, 1) ) return -EFAULT; diff -r 0f279f43f41f -r e7057fec103b xen/include/asm-x86/domain.h --- a/xen/include/asm-x86/domain.h Fri Apr 08 16:35:49 2011 -0400 +++ b/xen/include/asm-x86/domain.h Fri Apr 08 16:38:28 2011 -0400 @@ -238,7 +238,7 @@ struct pv_domain unsigned long pirq_eoi_map_mfn; /* Pseudophysical e820 map (XENMEM_memory_map). */ - struct e820entry e820[3]; + struct e820entry *e820; unsigned int nr_e820; }; diff -r 0f279f43f41f -r e7057fec103b xen/include/asm-x86/e820.h --- a/xen/include/asm-x86/e820.h Fri Apr 08 16:35:49 2011 -0400 +++ b/xen/include/asm-x86/e820.h Fri Apr 08 16:38:28 2011 -0400 @@ -39,4 +39,6 @@ extern unsigned int lowmem_kb, highmem_k #define e820_raw bootsym(e820map) #define e820_raw_nr bootsym(e820nr) +/* The +3 is for guest RAM entries */ +#define E820NR (e820.nr_map + 3) #endif /*__E820_HEADER*/ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-11 21:35 UTC
[Xen-devel] [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302556434 14400 # Node ID 75e24fb720fa9a1c529bc7e6b6eb9e1afc554130 # Parent e7057fec103ba69776d157d470d630ce99dbc540 libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf. The code that populates E820 is unconditionally triggered by the guest configuration having "pci=[''<BDF>,..'']" and being an PV guest. The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf will not trigger libxl__e820_alloc being called (unless the first call to libxl__e820_alloc failed). libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as well remove any E820_RAM or E820_UNUSED regions as the guest does not need to know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any gap in the E820 is considered PCI I/O space which means that if we pass in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the gap between 2GB and 3GB will be considered as PCI I/O space. To guard against that we also create an E820_UNUSABLE between the region of ''target_kb'' (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. When tested with another PV guest (NetBSD 5.1) the modified E820 gave it no trouble. The code has also been tested with older "classic" Xen Linux and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid, Debian Squeeze, 2.6.37, 2.6.38, 2.6.39). Memory that is slack or for balloon (so ''maxmem'' in guest configuration) is put behind the machine E820. Which in most cases is after the 4GB. The reason for doing the fetching of the E820 using the hypercall in the toolstack (instead of the guest doing it) is that when a guest would do a hypercall to ''XENMEM_machine_memory_map'' it would retrieve an E820 with I/O range caps added in. Meaning that the region after 4GB up to end of possible memory would be marked as unusable and the kernel would not have any space to allocate a balloon region. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r e7057fec103b -r 75e24fb720fa tools/libxl/libxl.h --- a/tools/libxl/libxl.h Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/libxl.h Mon Apr 11 17:13:54 2011 -0400 @@ -204,6 +204,14 @@ typedef struct { } libxl_file_reference; void libxl_file_reference_destroy(libxl_file_reference *p); +#define E820MAX (128) +typedef struct { + uint32_t nr_entries; + struct e820entry *entry; +} libxl_e820; + +void libxl_e820_destroy(libxl_e820 *p); + /* libxl_cpuid_policy_list is a dynamic array storing CPUID policies * for multiple leafs. It is terminated with an entry holding * XEN_CPUID_INPUT_UNUSED in input[0] @@ -452,7 +460,8 @@ int libxl_device_pci_remove(libxl_ctx *c int libxl_device_pci_shutdown(libxl_ctx *ctx, uint32_t domid); int libxl_device_pci_list_assigned(libxl_ctx *ctx, libxl_device_pci **list, uint32_t domid, int *num); int libxl_device_pci_list_assignable(libxl_ctx *ctx, libxl_device_pci **list, int *num); -int libxl_device_pci_parse_bdf(libxl_ctx *ctx, libxl_device_pci *pcidev, const char *str); +int libxl_device_pci_parse_bdf(libxl_ctx *ctx, libxl_device_pci *pcidev, const char *str, + libxl_domain_config *d_config); int libxl_cpuid_parse_config(libxl_cpuid_policy_list *cpuid, const char* str); int libxl_cpuid_parse_config_xend(libxl_cpuid_policy_list *cpuid, const char* str); diff -r e7057fec103b -r 75e24fb720fa tools/libxl/libxl.idl --- a/tools/libxl/libxl.idl Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/libxl.idl Mon Apr 11 17:13:54 2011 -0400 @@ -22,6 +22,7 @@ libxl_file_reference = Builtin("file_ref libxl_hwcap = Builtin("hwcap") +libxl_e820 = Builtin("e820", destructor_fn="libxl_e820_destroy", passby=PASS_BY_REFERENCE) # # Complex libxl types # @@ -112,6 +113,7 @@ libxl_domain_build_info = Struct("domain ])), ("pv", "!%s", Struct(None, [("slack_memkb", uint32), + ("e820", libxl_e820), ("bootloader", string), ("bootloader_args", string), ("cmdline", string), diff -r e7057fec103b -r 75e24fb720fa tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/libxl_dom.c Mon Apr 11 17:13:54 2011 -0400 @@ -72,9 +72,17 @@ int libxl__build_pre(libxl__gc *gc, uint libxl_ctx *ctx = libxl__gc_owner(gc); xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); - if (!info->hvm) + if (!info->hvm) { + if (info->u.pv.e820.nr_entries) { + xc_domain_set_memory_map(ctx->xch, domid, + info->u.pv.e820.entry, + info->u.pv.e820.nr_entries); + } + else { xc_domain_set_memmap_limit(ctx->xch, domid, (info->max_memkb + info->u.pv.slack_memkb)); + } + } xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0); if ( info->disable_migrate ) xc_domain_disable_migrate(ctx->xch, domid); diff -r e7057fec103b -r 75e24fb720fa tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/libxl_internal.h Mon Apr 11 17:13:54 2011 -0400 @@ -335,4 +335,5 @@ _hidden int libxl__error_set(libxl__gc * _hidden int libxl__file_reference_map(libxl_file_reference *f); _hidden int libxl__file_reference_unmap(libxl_file_reference *f); +_hidden int libxl__e820_alloc(libxl_ctx *ctx, libxl_domain_build_info *b_info); #endif diff -r e7057fec103b -r 75e24fb720fa tools/libxl/libxl_pci.c --- a/tools/libxl/libxl_pci.c Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/libxl_pci.c Mon Apr 11 17:13:54 2011 -0400 @@ -87,7 +87,8 @@ static int hex_convert(const char *str, #define STATE_OPTIONS_K 6 #define STATE_OPTIONS_V 7 #define STATE_TERMINAL 8 -int libxl_device_pci_parse_bdf(libxl_ctx *ctx, libxl_device_pci *pcidev, const char *str) +int libxl_device_pci_parse_bdf(libxl_ctx *ctx, libxl_device_pci *pcidev, const char *str, + libxl_domain_config *d_config) { unsigned state = STATE_DOMAIN; unsigned dom, bus, dev, func, vslot = 0; @@ -202,6 +203,25 @@ int libxl_device_pci_parse_bdf(libxl_ctx pcidev_init(pcidev, dom, bus, dev, func, vslot << 3); + if (d_config && !d_config->c_info.hvm) { + libxl_domain_build_info *b_info = &d_config->b_info; + if (b_info->u.pv.e820.entry == NULL) { + /* Note: libxl_init_build_info sets the slack_memkb, and the other + * value: max_memkb, target_memkb, are set by parse_config_data. */ + if (b_info->max_memkb == 0 || b_info->target_memkb == 0 || + b_info->u.pv.slack_memkb == 0) { + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "Failed to construct E820: no memory data!"); + } else { + int rc = 0; + rc = libxl__e820_alloc(ctx, b_info); + if (rc) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "Failed while collecting E820 with: %d (errno:%d)\n", + rc, errno); + } + } + } return 0; parse_error: @@ -1047,3 +1067,164 @@ int libxl_device_pci_shutdown(libxl_ctx free(pcidevs); return 0; } + +static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[], + uint32_t *nr_entries, + unsigned long map_limitkb, + unsigned long balloon_kb) +{ + uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end; + uint32_t i, idx = 0, nr; + struct e820entry e820[E820MAX]; + + if (!src || !map_limitkb || !balloon_kb || !nr_entries) + return ERROR_INVAL; + + nr = *nr_entries; + if (!nr) + return ERROR_INVAL; + + if (nr > E820MAX) + return ERROR_NOMEM; + + /* Weed out anything under 16MB */ + for (i = 0; i < nr; i++) { + if (src[i].addr > 0x100000) + continue; + + src[i].type = 0; + src[i].size = 0; + src[i].addr = -1ULL; + } + + /* Find the lowest and highest entry in E820, skipping over + * undersired entries. */ + start = -1ULL; + last = 0; + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + + start = src[i].addr < start ? src[i].addr : start; + last = src[i].addr + src[i].size > last ? + src[i].addr + src[i].size > last : last; + } + if (start > 1024) + start_kb = start >> 10; + + /* Add the memory RAM region for the guest */ + e820[idx].addr = 0; + e820[idx].size = (uint64_t)map_limitkb << 10; + e820[idx].type = E820_RAM; + + /* .. and trim if neccessary */ + if (start_kb && map_limitkb > start_kb) { + delta_kb = map_limitkb - start_kb; + if (delta_kb) + e820[idx].size -= (uint64_t)(delta_kb << 10); + } + /* Note: We don''t touch balloon_kb here. Will add it at the end. */ + ram_end = e820[idx].addr + e820[idx].size; + idx ++; + + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %ldkB End of RAM: 0x%lx (PFN) " \ + "Delta: %ldkB, PCI start: %ldkB (0x%lx PFN), Balloon %ldkB\n", + map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, + balloon_kb); + + /* Check if there is a region between ram_end and start. */ + if (start > ram_end) { + /* .. and if not present, add it in. This is to guard against + the Linux guest assuming that the gap between the end of + RAM region and the start of the E820_[ACPI,NVS,RESERVED] + is PCI I/O space. Which it certainly is _not_. */ + e820[idx].type = E820_UNUSABLE; + e820[idx].addr = ram_end; + e820[idx].size = start - ram_end; + idx++; + } + /* Almost done: copy them over, ignoring the undesireable ones */ + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + e820[idx].type = src[i].type; + e820[idx].addr = src[i].addr; + e820[idx].size = src[i].size; + idx++; + } + + /* At this point we have the mapped RAM + E820 entries from src. */ + if (balloon_kb) { + /* and if we truncated the RAM region, then add it to the end. */ + e820[idx].type = E820_RAM; + e820[idx].addr = (uint64_t)(1ULL << 32) > last ? (uint64_t)(1ULL << 32) : last; + /* also add the balloon memory to the end. */ + e820[idx].size = (uint64_t)(delta_kb << 10) + (uint64_t)(balloon_kb << 10); + idx++; + + } + nr = idx; + + for (i = 0; i < nr; i++) { + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]", + e820[i].type == E820_RAM ? "RAM " : + (e820[i].type == E820_RESERVED ? "RSV " : + e820[i].type == E820_ACPI ? "ACPI" : + (e820[i].type == E820_NVS ? "NVS " : + (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))), + e820[i].addr >> 12, + (e820[i].addr + e820[i].size) >> 12); + } + + /* Done: copy the sanitized version. */ + *nr_entries = nr; + memcpy(src, e820, nr * sizeof(struct e820entry)); + return 0; +} + + +int libxl__e820_alloc(libxl_ctx *ctx, libxl_domain_build_info *b_info) +{ + int rc; + uint32_t nr; + libxl_e820 *p; + struct e820entry map[E820MAX]; + + if (b_info == NULL || b_info->hvm) + return ERROR_INVAL; + + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + if (rc < 0) { + errno = rc; + return ERROR_FAIL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + return ERROR_FAIL; + + p = &b_info->u.pv.e820; + p->nr_entries = nr; + p->entry = calloc(nr, sizeof(struct e820entry)); + if (!p->entry) + return ERROR_NOMEM; + + memcpy(p->entry, map, nr * sizeof(struct e820entry)); + return 0; +} + +void libxl_e820_destroy(libxl_e820 *p) +{ + if (p) { + if (p->entry) + free(p->entry); + p->entry = NULL; + p->nr_entries = 0; + } +} diff -r e7057fec103b -r 75e24fb720fa tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/libxl/xl_cmdimpl.c Mon Apr 11 17:13:54 2011 -0400 @@ -1022,7 +1022,7 @@ skip_vfb: pcidev->msitranslate = pci_msitranslate; pcidev->power_mgmt = pci_power_mgmt; - if (!libxl_device_pci_parse_bdf(&ctx, pcidev, buf)) + if (!libxl_device_pci_parse_bdf(&ctx, pcidev, buf, d_config)) d_config->num_pcidevs++; } } @@ -2120,7 +2120,7 @@ static void pcidetach(const char *dom, c find_domain(dom); memset(&pcidev, 0x00, sizeof(pcidev)); - if (libxl_device_pci_parse_bdf(&ctx, &pcidev, bdf)) { + if (libxl_device_pci_parse_bdf(&ctx, &pcidev, bdf, NULL)) { fprintf(stderr, "pci-detach: malformed BDF specification \"%s\"\n", bdf); exit(2); } @@ -2165,7 +2165,7 @@ static void pciattach(const char *dom, c find_domain(dom); memset(&pcidev, 0x00, sizeof(pcidev)); - if (libxl_device_pci_parse_bdf(&ctx, &pcidev, bdf)) { + if (libxl_device_pci_parse_bdf(&ctx, &pcidev, bdf, NULL)) { fprintf(stderr, "pci-attach: malformed BDF specification \"%s\"\n", bdf); exit(2); } diff -r e7057fec103b -r 75e24fb720fa tools/python/xen/lowlevel/xl/xl.c --- a/tools/python/xen/lowlevel/xl/xl.c Fri Apr 08 16:38:28 2011 -0400 +++ b/tools/python/xen/lowlevel/xl/xl.c Mon Apr 11 17:13:54 2011 -0400 @@ -544,7 +544,7 @@ static PyObject *pyxl_pci_parse(XlObject return NULL; } - if ( libxl_device_pci_parse_bdf(&self->ctx, &pci->obj, str) ) { + if ( libxl_device_pci_parse_bdf(&self->ctx, &pci->obj, str, NULL) ) { PyErr_SetString(xl_error_obj, "cannot parse pci device spec (BDF)"); Py_DECREF(pci); return NULL; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-11 21:35 UTC
[Xen-devel] [PATCH 4 of 4] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate
# HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302556594 14400 # Node ID e5a250cf820eb433b904fffe6b21dbdfe260a115 # Parent 75e24fb720fa9a1c529bc7e6b6eb9e1afc554130 libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate. Most machines after the RAM regions in the e802 have a couple of E820_RESERVED, with E820_ACPI and E820_NVS. On some Intel machines, the E820 looks like swiss cheese: (XEN) Initial Xen-e820 RAM map: (XEN) 0000000000000000 - 000000000009d000 (usable) (XEN) 000000000009d000 - 00000000000a0000 (reserved) (XEN) 00000000000e0000 - 0000000000100000 (reserved) (XEN) 0000000000100000 - 000000009cf66000 (usable) (XEN) 000000009cf66000 - 000000009d102000 (ACPI NVS) (XEN) 000000009d102000 - 000000009f6bd000 (usable) <-- (XEN) 000000009f6bd000 - 000000009f6bf000 (reserved) (XEN) 000000009f6bf000 - 000000009f714000 (usable) <-- (XEN) 000000009f714000 - 000000009f7bf000 (ACPI NVS) (XEN) 000000009f7bf000 - 000000009f7e0000 (usable) <-- (XEN) 000000009f7e0000 - 000000009f7ff000 (ACPI data) (XEN) 000000009f7ff000 - 000000009f800000 (usable) <-- (XEN) 000000009f800000 - 00000000a0000000 (reserved) (XEN) 00000000a0000000 - 00000000b0000000 (reserved) (XEN) 00000000fc000000 - 00000000fd000000 (reserved) (XEN) 00000000ffe00000 - 0000000100000000 (reserved) (XEN) 0000000100000000 - 0000000160000000 (usable) Which means we have to pay attention to the E820_RAM that are between the E820_[ACPI,NVS,RESERVED]. If we remove those E820_RAM (b/c the amount of memory passed to the guest is less that where those E820 regions reside) from the E820, the Linux kernel interprets those "gaps" as PCI I/O space. This is what we are currently doing. This can be disastrous if we pass in an Intel IGD card which tries to use the first available PCI I/O space - and ends up using the MFNs which are actually RAM instead of being the PCI I/O space. To make this work, we convert all E820_RAM that are above the ''target_kb'' (those that overlap the ''target_kb'' are truncated appropriately) to be E820_UNUSABLE. We also limit this alternation up to 4GB. This means that an E820 for a guest from this (target_kb=1024, maxmem=2048): [ 0.000000] Set 405658 page(s) to 1-1 mapping. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable) [ 0.000000] Xen: 0000000040000000 - 000000009cf66000 (unusable) [ 0.000000] Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS) [ 0.000000] Xen: 000000009f6bd000 - 000000009f6bf000 (reserved) [ 0.000000] Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS) [ 0.000000] Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data) [ 0.000000] Xen: 000000009f800000 - 00000000b0000000 (reserved) [ 0.000000] Xen: 00000000fc000000 - 00000000fd000000 (reserved) [ 0.000000] Xen: 00000000fec00000 - 00000000fec01000 (reserved) [ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000] Xen: 00000000ffe00000 - 0000000100000000 (reserved) [ 0.000000] Xen: 0000000100000000 - 0000000140800000 (usable) Will look as so: [ 0.000000] Set 395880 page(s) to 1-1 mapping. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable) [ 0.000000] Xen: 0000000040000000 - 000000009cf66000 (unusable) [ 0.000000] Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS) [ 0.000000] Xen: 000000009d102000 - 000000009f6bd000 (unusable) [ 0.000000] Xen: 000000009f6bd000 - 000000009f6bf000 (reserved) [ 0.000000] Xen: 000000009f6bf000 - 000000009f714000 (unusable) [ 0.000000] Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS) [ 0.000000] Xen: 000000009f7bf000 - 000000009f7e0000 (unusable) [ 0.000000] Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data) [ 0.000000] Xen: 000000009f7ff000 - 000000009f800000 (unusable) [ 0.000000] Xen: 000000009f800000 - 00000000b0000000 (reserved) [ 0.000000] Xen: 00000000fc000000 - 00000000fd000000 (reserved) [ 0.000000] Xen: 00000000fec00000 - 00000000fec01000 (reserved) [ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000] Xen: 00000000ffe00000 - 0000000100000000 (reserved) [ 0.000000] Xen: 0000000100000000 - 0000000140800000 (usable) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 75e24fb720fa -r e5a250cf820e tools/libxl/libxl_pci.c --- a/tools/libxl/libxl_pci.c Mon Apr 11 17:13:54 2011 -0400 +++ b/tools/libxl/libxl_pci.c Mon Apr 11 17:16:34 2011 -0400 @@ -1134,22 +1134,98 @@ static int e820_sanitize(libxl_ctx *ctx, map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, balloon_kb); + /* This whole code below is to guard against if the Intel IGD is passed into + * the guest. If we don''t pass in IGD, this whole code can be ignored. + * + * The reason for this code is that Intel boxes fill their E820 with + * E820_RAM amongst E820_RESERVED and we can''t just ditch the E820_RAM. + * That is b/c any "gaps" in the E820 is considered PCI I/O space by + * Linux and it would be utilized by the Intel IGD as I/O space while + * in reality it was an RAM region. + * + * What this means is that we have to walk the E820 and for any region + * that is RAM and below 4GB and above ram_end, needs to change its type + * to E820_UNUSED. We also need to move some of the E820_RAM regions if + * the overlap with ram_end. */ + for (i = 0; i < nr; i++) { + uint64_t end = src[i].addr + src[i].size; + + /* We don''t care about E820_UNUSABLE, but we need to + * change the type to zero b/c the loop after this + * sticks E820_UNUSABLE on the guest''s E820 but ignores + * the ones with type zero. */ + if ((src[i].type == E820_UNUSABLE) || + /* Any region that is within the "RAM region" can + * be safely ditched. */ + (end < ram_end)) { + src[i].type = 0; + continue; + } + + /* Look only at RAM regions. */ + if (src[i].type != E820_RAM) + continue; + + /* We only care about RAM regions below 4GB. */ + if (src[i].addr >= (1ULL<<32)) + continue; + + /* E820_RAM overlaps with our RAM region. Move it */ + if (src[i].addr < ram_end) { + uint64_t delta; + + src[i].type = E820_UNUSABLE; + delta = ram_end - src[i].addr; + /* The end < ram_end should weed this out */ + if (src[i].size - delta < 0) + src[i].type = 0; + else { + src[i].size -= delta; + src[i].addr = ram_end; + } + if (src[i].addr + src[i].size != end) { + /* We messed up somewhere */ + src[i].type = 0; + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Computed E820 wrongly. Continuing on."); + } + } + /* Lastly, convert the RAM to UNSUABLE. Look in the Linux kernel + at git commit 2f14ddc3a7146ea4cd5a3d1ecd993f85f2e4f948 + "xen/setup: Inhibit resource API from using System RAM E820 + gaps as PCI mem gaps" for full explanation. */ + if (end > ram_end) + src[i].type = E820_UNUSABLE; + } + /* Check if there is a region between ram_end and start. */ if (start > ram_end) { + int add_unusable = 1; + for (i = 0; i < nr && add_unusable; i++) { + if (src[i].type != E820_UNUSABLE) + continue; + if (ram_end != src[i].addr) + continue; + if (start != src[i].addr + src[i].size) + /* there is one, adjust it */ + src[i].size = start - src[i].addr; + + add_unusable = 0; + } /* .. and if not present, add it in. This is to guard against the Linux guest assuming that the gap between the end of RAM region and the start of the E820_[ACPI,NVS,RESERVED] is PCI I/O space. Which it certainly is _not_. */ - e820[idx].type = E820_UNUSABLE; - e820[idx].addr = ram_end; - e820[idx].size = start - ram_end; - idx++; + if (add_unusable) { + e820[idx].type = E820_UNUSABLE; + e820[idx].addr = ram_end; + e820[idx].size = start - ram_end; + idx++; + } } - /* Almost done: copy them over, ignoring the undesireable ones */ - for (i = 0; i < nr; i++) { + /* Almost done: copy them over, ignoring the undesireable ones */ + for (i = 0; i < nr; i++) { if ((src[i].type == E820_RAM) || - (src[i].type == E820_UNUSABLE) || - (src[i].type == 0)) + (src[i].type == 0)) continue; e820[idx].type = src[i].type; e820[idx].addr = src[i].addr; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-12 12:19 UTC
[Xen-devel] Re: [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
On Mon, 2011-04-11 at 22:35 +0100, Konrad Rzeszutek Wilk wrote:> > > The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when > it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf > will > not trigger libxl__e820_alloc being called (unless the first call to > libxl__e820_alloc failed).That sounds like a very odd non-intuitive location for that allocation. Why not do it in libxl_domain_create or somewhere like that? I think the e820 map added to the idl should become a simple boolean flag and this should all be taken care of internally based on that. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2011-Apr-12 12:32 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On 11/04/2011 22:35, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:> # HG changeset patch > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > # Date 1302295108 14400 > # Node ID e7057fec103ba69776d157d470d630ce99dbc540 > # Parent 0f279f43f41ffa8549e076f2a30f499bd8d6cc1d > x86: make the pv-only e820 array be dynamic. > > During creation of the PV domain we allocate the E820 structure to > have the amount of E820 entries on the machine, plus the number three.How cunning. Why wouldn''t you just allocate exactly the right size of array in XENMEM_set_memory_map? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 12:53 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On Tue, Apr 12, 2011 at 01:32:01PM +0100, Keir Fraser wrote:> On 11/04/2011 22:35, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > > > # HG changeset patch > > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > # Date 1302295108 14400 > > # Node ID e7057fec103ba69776d157d470d630ce99dbc540 > > # Parent 0f279f43f41ffa8549e076f2a30f499bd8d6cc1d > > x86: make the pv-only e820 array be dynamic. > > > > During creation of the PV domain we allocate the E820 structure to > > have the amount of E820 entries on the machine, plus the number three. > > How cunning. > > Why wouldn''t you just allocate exactly the right size of array in > XENMEM_set_memory_map?I was thinking about it, but the mm.c code did not have the xen/xmalloc.h header, nor any references to xmalloc_array. Is it OK to make an xmalloc_array during a hypercall? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2011-Apr-12 13:06 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On 12/04/2011 13:53, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:>> How cunning. >> >> Why wouldn''t you just allocate exactly the right size of array in >> XENMEM_set_memory_map? > > I was thinking about it, but the mm.c code did not have the > xen/xmalloc.h header, nor any references to xmalloc_array. > > Is it OK to make an xmalloc_array during a hypercall?Yes. I think the toolstack should be able to clean up on the newly-possible ENOMEM return from this hypercall. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 17:21 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On Tue, Apr 12, 2011 at 02:06:05PM +0100, Keir Fraser wrote:> On 12/04/2011 13:53, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > > >> How cunning. > >> > >> Why wouldn''t you just allocate exactly the right size of array in > >> XENMEM_set_memory_map? > > > > I was thinking about it, but the mm.c code did not have the > > xen/xmalloc.h header, nor any references to xmalloc_array. > > > > Is it OK to make an xmalloc_array during a hypercall? > > Yes. I think the toolstack should be able to clean up on the newly-possible > ENOMEM return from this hypercall.Hm, not sure what I am hitting, but I can''t seem to be able to copy over the contents to the newly allocated array from the guest (this works fine with the previous version of the patch). This is what I get (XEN) mm.c:4734:d0 We want 8 E820 entries. We have: 1. (XEN) mm.c:4754:d0 Copying E820 8 entries. (XEN) ----[ Xen-4.2-110412 : 00000000000000a0 (XEN) rdx: 0000000000000000 rsi: 0000000000680004 rdi: 0000000000000000 (XEN) rbp: ffff82c48028fc68 rsp: ffff82c48028fc58 r8: ffff82c4802c8bd0 (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 (XEN) r12: 00000000000000a0 r13: 0000000000000000 r14: 0000000000680004 (XEN) r15: 0000000000000008 cr0: 0000000080050033 cr4: 00000000000026f0 (XEN) cr3: 00000001056af000 cr2: 0000000000000000 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff82c48028fc58: (XEN) 00000000000000a0 00000000000000a0 ffff82c48028fc98 ffff82c48017fa7d (XEN) 0000000000000008 0000000000000000 0000000000000008 ffff88005c51d7c0 (XEN) ffff82c48028fd38 ffff82c480169857 ffff82c48028fcb8 ffff82c480167d9a (XEN) 000000008028fd08 ffff82f60214be50 0000000000000001 0000000000000008 (XEN) 0000000000680004 0000000000000001 ffff88005d8aa1f8 ffff8301056e0000 (XEN) ffff83011f9b3000 0000000000000000 ffff82c48028fe28 ffff8300cf0fd000 (XEN) 000000000000000d 00007fff2d629060 ffff88005c51d7c0 0000000000000008 (XEN) ffff82c48028fef8 ffff82c48011399c ffff82c4802c9940 000000008028ff18 (XEN) ffff82c48028fd88 0000000000123412 ffff82f602468250 ffff82c48028ff18 (XEN) ffff82c48028ff18 ffff82c4802c8e38 000000008028fde8 ffff82f602468250 (XEN) 00000000001068f6 ffff83011f9b3000 0000000000000070 0000000123412000 (XEN) ffff82f602468240 ffff88005d8a9220 0000000000000000 0000000000682004 (XEN) ffff82c48028fe28 ffff82c48016aa14 8000000127365065 000000018015fd21 (XEN) 0000000000000000 ffff8300cf0fd1c8 ffff8300cf0fd1c8 ffff82c48028fe88 (XEN) aaaaaaaaaaaaaaaa ffff88005d8a9220 ffff82c48028fef8 ffff82c480113cb7 (XEN) ffff8300cf0fd1d0 ffff8300cf0fd000 000000018028fef8 ffff82c48028ff18 (XEN) ffff82c48028fe68 ffff82c48028ff18 ffff8300cf0fd000 00007fff2d6290b0 (XEN) 0000000000000001 0000000000000008 ffff82c48028fef8 ffff82c4802038c3 (XEN) 0000000000680000 00007fb540012cd0 0000000000000000 0000000000000000 (XEN) 00007fb540012e60 000000000000e033 ffff82c48028fed8 ffff8300cf0fd000 (XEN) Xen call trace: (XEN) [<ffff82c48020a3c5>] vmac+0x5da/0x927 (XEN) [<ffff82c48017fa7d>] copy_from_user+0x72/0x9f (XEN) [<ffff82c480169857>] arch_memory_op+0x7cc/0xf05 (XEN) [<ffff82c48011399c>] do_memory_op+0x1dca/0x1df2 (XEN) [<ffff82c4802082c8>] syscall_enter+0xc8/0x122 And the patch is: diff -r 97763efc41f9 xen/arch/x86/domain.c --- a/xen/arch/x86/domain.c Tue Apr 05 18:23:54 2011 +0100 +++ b/xen/arch/x86/domain.c Tue Apr 12 13:18:34 2011 -0400 @@ -696,6 +696,8 @@ void arch_domain_destroy(struct domain * if ( is_hvm_domain(d) ) hvm_domain_destroy(d); + else + xfree(d->arch.pv_domain.e820); vmce_destroy_msr(d); pci_release_devices(d); diff -r 97763efc41f9 xen/arch/x86/mm.c --- a/xen/arch/x86/mm.c Tue Apr 05 18:23:54 2011 +0100 +++ b/xen/arch/x86/mm.c Tue Apr 12 13:18:34 2011 -0400 @@ -100,6 +100,7 @@ #include <xen/iocap.h> #include <xen/guest_access.h> #include <xen/pfn.h> +#include <xen/xmalloc.h> #include <asm/paging.h> #include <asm/shadow.h> #include <asm/page.h> @@ -4710,7 +4711,7 @@ long arch_memory_op(int op, XEN_GUEST_HA if ( copy_from_guest(&fmap, arg, 1) ) return -EFAULT; - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) + if ( fmap.map.nr_entries > E820MAX ) return -EINVAL; rc = rcu_lock_target_domain_by_id(fmap.domid, &d); @@ -4730,9 +4731,39 @@ long arch_memory_op(int op, XEN_GUEST_HA return -EPERM; } + gdprintk(XENLOG_INFO, "We want %d E820 entries. We have: %d.\n", fmap.map.nr_entries, d->arch.pv_domain.nr_e820); + if ( d->arch.pv_domain.e820 ) + { + if ( fmap.map.nr_entries >= d->arch.pv_domain.nr_e820 ) + { + xfree(d->arch.pv_domain.e820); + d->arch.pv_domain.e820 = NULL; + d->arch.pv_domain.nr_e820 = 0; + } + } else { + d->arch.pv_domain.e820 = xmalloc_array(e820entry_t, + fmap.map.nr_entries + 1); + if ( !d->arch.pv_domain.e820 ) + { + rcu_unlock_domain(d); + return -ENOMEM; + } + gdprintk(XENLOG_INFO, "Allocated E820 for %d entries.\n", fmap.map.nr_entries); + memset(d->arch.pv_domain.e820, 0, (fmap.map.nr_entries + 1) * sizeof(e820entry_t)); + } + gdprintk(XENLOG_INFO, "Copying E820 %d entries.\n", fmap.map.nr_entries); rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, fmap.map.nr_entries) ? -EFAULT : 0; - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; + + if ( rc == 0 ) + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; + else { + /* Don''t free the E820 if it had been set before and we failed. */ + if ( !d->arch.pv_domain.nr_e820 ) { + xfree(d->arch.pv_domain.e820); + d->arch.pv_domain.e820 = NULL; + } + } rcu_unlock_domain(d); return rc; @@ -4747,6 +4778,9 @@ long arch_memory_op(int op, XEN_GUEST_HA if ( d->arch.pv_domain.nr_e820 == 0 ) return -ENOSYS; + if ( d->arch.pv_domain.e820 == NULL ) + return -ENOSYS; + if ( copy_from_guest(&map, arg, 1) ) return -EFAULT; diff -r 97763efc41f9 xen/include/asm-x86/domain.h --- a/xen/include/asm-x86/domain.h Tue Apr 05 18:23:54 2011 +0100 +++ b/xen/include/asm-x86/domain.h Tue Apr 12 13:18:34 2011 -0400 @@ -238,7 +238,7 @@ struct pv_domain unsigned long pirq_eoi_map_mfn; /* Pseudophysical e820 map (XENMEM_memory_map). */ - struct e820entry e820[3]; + struct e820entry *e820; unsigned int nr_e820; }; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 17:53 UTC
Re: [Xen-devel] Re: [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
On Tue, Apr 12, 2011 at 01:19:43PM +0100, Ian Campbell wrote:> On Mon, 2011-04-11 at 22:35 +0100, Konrad Rzeszutek Wilk wrote: > > > > > > The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when > > it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf > > will > > not trigger libxl__e820_alloc being called (unless the first call to > > libxl__e820_alloc failed). > > That sounds like a very odd non-intuitive location for that allocation. > Why not do it in libxl_domain_create or somewhere like that? > > I think the e820 map added to the idl should become a simple boolean > flag and this should all be taken care of internally based on that.Like this? # HG changeset patch # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> # Date 1302617081 14400 # Node ID 945a34bcc2bd4bb341296f3ce84be6c42f24a4c9 # Parent 4527044d3d1cebdc393e596072bc02d947a8de51 libxl: Add support for passing in the machine''s E820 for PCI passthrough The code that populates E820 is unconditionally triggered by the guest configuration having "pci=[''<BDF>,..'']", being a PV guest, and if b_info->u.pv.no_machine_e820 is not set. The code do_domain_create calls the libxl__e820_alloc when it notices that the guest is PV, has at least one PCI devices, and does not have the no_machine_e820 flag set.. libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as well remove any E820_RAM or E820_UNUSED regions as the guest does not need to know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any gap in the E820 is considered PCI I/O space which means that if we pass in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the gap between 2GB and 3GB will be considered as PCI I/O space. To guard against that we also create an E820_UNUSABLE between the region of ''target_kb'' (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. Lastly, the xc_domain_set_memory_map is called to install the new E820. When tested with another PV guest (NetBSD 5.1) the modified E820 gave it no trouble. The code has also been tested with older "classic" Xen Linux and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid, Debian Squeeze, 2.6.37, 2.6.38, 2.6.39). Memory that is slack or for balloon (so ''maxmem'' in guest configuration) is put behind the machine E820. Which in most cases is after the 4GB. The reason for doing the fetching of the E820 using the hypercall in the toolstack (instead of the guest doing it) is that when a guest would do a hypercall to ''XENMEM_machine_memory_map'' it would retrieve an E820 with I/O range caps added in. Meaning that the region after 4GB up to end of possible memory would be marked as unusable and the kernel would not have any space to allocate a balloon region. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl.idl --- a/tools/libxl/libxl.idl Tue Apr 12 09:59:59 2011 -0400 +++ b/tools/libxl/libxl.idl Tue Apr 12 10:04:41 2011 -0400 @@ -117,6 +117,7 @@ libxl_domain_build_info = Struct("domain ("cmdline", string), ("ramdisk", libxl_file_reference), ("features", string, True), + ("no_machine_e820", bool), ])), ])), ], diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_create.c --- a/tools/libxl/libxl_create.c Tue Apr 12 09:59:59 2011 -0400 +++ b/tools/libxl/libxl_create.c Tue Apr 12 10:04:41 2011 -0400 @@ -536,6 +536,14 @@ static int do_domain_create(libxl__gc *g for (i = 0; i < d_config->num_pcidevs; i++) libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1); + if (!d_config->c_info.hvm && !d_config->b_info.u.pv.no_machine_e820) { + int rc = 0; + rc = libxl__e820_alloc(ctx, domid, d_config); + if (rc) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "Failed while collecting E820 with: %d (errno:%d)\n", + rc, errno); + } if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) { if ( (*cb)(ctx, domid, priv) ) goto error_out; diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h Tue Apr 12 09:59:59 2011 -0400 +++ b/tools/libxl/libxl_internal.h Tue Apr 12 10:04:41 2011 -0400 @@ -335,4 +335,5 @@ _hidden int libxl__error_set(libxl__gc * _hidden int libxl__file_reference_map(libxl_file_reference *f); _hidden int libxl__file_reference_unmap(libxl_file_reference *f); +_hidden int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config); #endif diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_pci.c --- a/tools/libxl/libxl_pci.c Tue Apr 12 09:59:59 2011 -0400 +++ b/tools/libxl/libxl_pci.c Tue Apr 12 10:04:41 2011 -0400 @@ -1047,3 +1047,157 @@ int libxl_device_pci_shutdown(libxl_ctx free(pcidevs); return 0; } + +static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[], + uint32_t *nr_entries, + unsigned long map_limitkb, + unsigned long balloon_kb) +{ + uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end; + uint32_t i, idx = 0, nr; + struct e820entry e820[E820MAX]; + + if (!src || !map_limitkb || !balloon_kb || !nr_entries) + return ERROR_INVAL; + + nr = *nr_entries; + if (!nr) + return ERROR_INVAL; + + if (nr > E820MAX) + return ERROR_NOMEM; + + /* Weed out anything under 16MB */ + for (i = 0; i < nr; i++) { + if (src[i].addr > 0x100000) + continue; + + src[i].type = 0; + src[i].size = 0; + src[i].addr = -1ULL; + } + + /* Find the lowest and highest entry in E820, skipping over + * undersired entries. */ + start = -1ULL; + last = 0; + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + + start = src[i].addr < start ? src[i].addr : start; + last = src[i].addr + src[i].size > last ? + src[i].addr + src[i].size > last : last; + } + if (start > 1024) + start_kb = start >> 10; + + /* Add the memory RAM region for the guest */ + e820[idx].addr = 0; + e820[idx].size = (uint64_t)map_limitkb << 10; + e820[idx].type = E820_RAM; + + /* .. and trim if neccessary */ + if (start_kb && map_limitkb > start_kb) { + delta_kb = map_limitkb - start_kb; + if (delta_kb) + e820[idx].size -= (uint64_t)(delta_kb << 10); + } + /* Note: We don''t touch balloon_kb here. Will add it at the end. */ + ram_end = e820[idx].addr + e820[idx].size; + idx ++; + + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %ldkB End of RAM: 0x%lx (PFN) " \ + "Delta: %ldkB, PCI start: %ldkB (0x%lx PFN), Balloon %ldkB\n", + map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, + balloon_kb); + + /* Check if there is a region between ram_end and start. */ + if (start > ram_end) { + /* .. and if not present, add it in. This is to guard against + the Linux guest assuming that the gap between the end of + RAM region and the start of the E820_[ACPI,NVS,RESERVED] + is PCI I/O space. Which it certainly is _not_. */ + e820[idx].type = E820_UNUSABLE; + e820[idx].addr = ram_end; + e820[idx].size = start - ram_end; + idx++; + } + /* Almost done: copy them over, ignoring the undesireable ones */ + for (i = 0; i < nr; i++) { + if ((src[i].type == E820_RAM) || + (src[i].type == E820_UNUSABLE) || + (src[i].type == 0)) + continue; + e820[idx].type = src[i].type; + e820[idx].addr = src[i].addr; + e820[idx].size = src[i].size; + idx++; + } + + /* At this point we have the mapped RAM + E820 entries from src. */ + if (balloon_kb) { + /* and if we truncated the RAM region, then add it to the end. */ + e820[idx].type = E820_RAM; + e820[idx].addr = (uint64_t)(1ULL << 32) > last ? (uint64_t)(1ULL << 32) : last; + /* also add the balloon memory to the end. */ + e820[idx].size = (uint64_t)(delta_kb << 10) + (uint64_t)(balloon_kb << 10); + idx++; + + } + nr = idx; + + for (i = 0; i < nr; i++) { + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]", + e820[i].type == E820_RAM ? "RAM " : + (e820[i].type == E820_RESERVED ? "RSV " : + e820[i].type == E820_ACPI ? "ACPI" : + (e820[i].type == E820_NVS ? "NVS " : + (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))), + e820[i].addr >> 12, + (e820[i].addr + e820[i].size) >> 12); + } + + /* Done: copy the sanitized version. */ + *nr_entries = nr; + memcpy(src, e820, nr * sizeof(struct e820entry)); + return 0; +} + + +int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config) +{ + int rc; + uint32_t nr; + struct e820entry map[E820MAX]; + libxl_domain_build_info *b_info; + + if (d_config == NULL || d_config->c_info.hvm) + return ERROR_INVAL; + + b_info = &d_config->b_info; + if (b_info->u.pv.no_machine_e820) + return ERROR_INVAL; + + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + if (rc < 0) { + errno = rc; + return ERROR_FAIL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + return ERROR_FAIL; + + rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); + + if (rc < 0) { + errno = rc; + return ERROR_FAIL; + } + return 0; +} diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c Tue Apr 12 09:59:59 2011 -0400 +++ b/tools/libxl/xl_cmdimpl.c Tue Apr 12 10:04:41 2011 -0400 @@ -371,6 +371,7 @@ static void printf_info(int domid, printf("\t\t\t(kernel %s)\n", b_info->kernel.path); printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk.path); + printf("\t\t\t(no_machine_e820 %d)\n", b_info->u.pv.no_machine_e820); printf("\t\t)\n"); } printf("\t)\n"); @@ -1025,7 +1026,9 @@ skip_vfb: if (!libxl_device_pci_parse_bdf(&ctx, pcidev, buf)) d_config->num_pcidevs++; } - } + } else + if (!c_info->hvm) + b_info->u.pv.no_machine_e820 = 1; switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { case 0: _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-12 18:44 UTC
Re: [Xen-devel] Re: [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
On Tue, 2011-04-12 at 18:53 +0100, Konrad Rzeszutek Wilk wrote:> On Tue, Apr 12, 2011 at 01:19:43PM +0100, Ian Campbell wrote: > > On Mon, 2011-04-11 at 22:35 +0100, Konrad Rzeszutek Wilk wrote: > > > > > > > > > The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when > > > it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf > > > will > > > not trigger libxl__e820_alloc being called (unless the first call to > > > libxl__e820_alloc failed). > > > > That sounds like a very odd non-intuitive location for that allocation. > > Why not do it in libxl_domain_create or somewhere like that? > > > > I think the e820 map added to the idl should become a simple boolean > > flag and this should all be taken care of internally based on that. > > Like this?Why no_machine_e820 instead of just machine_e820? You can set a non-zero default in the appropriate libxl_foo_init function if need be. Ian.> # HG changeset patch > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > # Date 1302617081 14400 > # Node ID 945a34bcc2bd4bb341296f3ce84be6c42f24a4c9 > # Parent 4527044d3d1cebdc393e596072bc02d947a8de51 > libxl: Add support for passing in the machine''s E820 for PCI passthrough > > The code that populates E820 is unconditionally triggered by the guest > configuration having "pci=[''<BDF>,..'']", being a PV guest, and if > b_info->u.pv.no_machine_e820 is not set. > > The code do_domain_create calls the libxl__e820_alloc when > it notices that the guest is PV, has at least one PCI devices, and does not > have the no_machine_e820 flag set.. > > libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems > E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as > well remove any E820_RAM or E820_UNUSED regions as the guest does not need to > know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to > get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any > gap in the E820 is considered PCI I/O space which means that if we pass > in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the > gap between 2GB and 3GB will be considered as PCI I/O space. To guard against > that we also create an E820_UNUSABLE between the region of ''target_kb'' > (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region. > Lastly, the xc_domain_set_memory_map is called to install the new E820. > > When tested with another PV guest (NetBSD 5.1) the modified E820 gave > it no trouble. The code has also been tested with older "classic" Xen Linux > and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid, > Debian Squeeze, 2.6.37, 2.6.38, 2.6.39). > > Memory that is slack or for balloon (so ''maxmem'' in guest configuration) > is put behind the machine E820. Which in most cases is after the 4GB. > > The reason for doing the fetching of the E820 using the hypercall in > the toolstack (instead of the guest doing it) is that when a guest > would do a hypercall to ''XENMEM_machine_memory_map'' it would > retrieve an E820 with I/O range caps added in. Meaning that the > region after 4GB up to end of possible memory would be marked as unusable > and the kernel would not have any space to allocate a balloon > region. > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl.idl > --- a/tools/libxl/libxl.idl Tue Apr 12 09:59:59 2011 -0400 > +++ b/tools/libxl/libxl.idl Tue Apr 12 10:04:41 2011 -0400 > @@ -117,6 +117,7 @@ libxl_domain_build_info = Struct("domain > ("cmdline", string), > ("ramdisk", libxl_file_reference), > ("features", string, True), > + ("no_machine_e820", bool), > ])), > ])), > ], > diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_create.c > --- a/tools/libxl/libxl_create.c Tue Apr 12 09:59:59 2011 -0400 > +++ b/tools/libxl/libxl_create.c Tue Apr 12 10:04:41 2011 -0400 > @@ -536,6 +536,14 @@ static int do_domain_create(libxl__gc *g > for (i = 0; i < d_config->num_pcidevs; i++) > libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1); > > + if (!d_config->c_info.hvm && !d_config->b_info.u.pv.no_machine_e820) { > + int rc = 0; > + rc = libxl__e820_alloc(ctx, domid, d_config); > + if (rc) > + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, > + "Failed while collecting E820 with: %d (errno:%d)\n", > + rc, errno); > + } > if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) { > if ( (*cb)(ctx, domid, priv) ) > goto error_out; > diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h Tue Apr 12 09:59:59 2011 -0400 > +++ b/tools/libxl/libxl_internal.h Tue Apr 12 10:04:41 2011 -0400 > @@ -335,4 +335,5 @@ _hidden int libxl__error_set(libxl__gc * > _hidden int libxl__file_reference_map(libxl_file_reference *f); > _hidden int libxl__file_reference_unmap(libxl_file_reference *f); > > +_hidden int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config); > #endif > diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/libxl_pci.c > --- a/tools/libxl/libxl_pci.c Tue Apr 12 09:59:59 2011 -0400 > +++ b/tools/libxl/libxl_pci.c Tue Apr 12 10:04:41 2011 -0400 > @@ -1047,3 +1047,157 @@ int libxl_device_pci_shutdown(libxl_ctx > free(pcidevs); > return 0; > } > + > +static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[], > + uint32_t *nr_entries, > + unsigned long map_limitkb, > + unsigned long balloon_kb) > +{ > + uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end; > + uint32_t i, idx = 0, nr; > + struct e820entry e820[E820MAX]; > + > + if (!src || !map_limitkb || !balloon_kb || !nr_entries) > + return ERROR_INVAL; > + > + nr = *nr_entries; > + if (!nr) > + return ERROR_INVAL; > + > + if (nr > E820MAX) > + return ERROR_NOMEM; > + > + /* Weed out anything under 16MB */ > + for (i = 0; i < nr; i++) { > + if (src[i].addr > 0x100000) > + continue; > + > + src[i].type = 0; > + src[i].size = 0; > + src[i].addr = -1ULL; > + } > + > + /* Find the lowest and highest entry in E820, skipping over > + * undersired entries. */ > + start = -1ULL; > + last = 0; > + for (i = 0; i < nr; i++) { > + if ((src[i].type == E820_RAM) || > + (src[i].type == E820_UNUSABLE) || > + (src[i].type == 0)) > + continue; > + > + start = src[i].addr < start ? src[i].addr : start; > + last = src[i].addr + src[i].size > last ? > + src[i].addr + src[i].size > last : last; > + } > + if (start > 1024) > + start_kb = start >> 10; > + > + /* Add the memory RAM region for the guest */ > + e820[idx].addr = 0; > + e820[idx].size = (uint64_t)map_limitkb << 10; > + e820[idx].type = E820_RAM; > + > + /* .. and trim if neccessary */ > + if (start_kb && map_limitkb > start_kb) { > + delta_kb = map_limitkb - start_kb; > + if (delta_kb) > + e820[idx].size -= (uint64_t)(delta_kb << 10); > + } > + /* Note: We don''t touch balloon_kb here. Will add it at the end. */ > + ram_end = e820[idx].addr + e820[idx].size; > + idx ++; > + > + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %ldkB End of RAM: 0x%lx (PFN) " \ > + "Delta: %ldkB, PCI start: %ldkB (0x%lx PFN), Balloon %ldkB\n", > + map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12, > + balloon_kb); > + > + /* Check if there is a region between ram_end and start. */ > + if (start > ram_end) { > + /* .. and if not present, add it in. This is to guard against > + the Linux guest assuming that the gap between the end of > + RAM region and the start of the E820_[ACPI,NVS,RESERVED] > + is PCI I/O space. Which it certainly is _not_. */ > + e820[idx].type = E820_UNUSABLE; > + e820[idx].addr = ram_end; > + e820[idx].size = start - ram_end; > + idx++; > + } > + /* Almost done: copy them over, ignoring the undesireable ones */ > + for (i = 0; i < nr; i++) { > + if ((src[i].type == E820_RAM) || > + (src[i].type == E820_UNUSABLE) || > + (src[i].type == 0)) > + continue; > + e820[idx].type = src[i].type; > + e820[idx].addr = src[i].addr; > + e820[idx].size = src[i].size; > + idx++; > + } > + > + /* At this point we have the mapped RAM + E820 entries from src. */ > + if (balloon_kb) { > + /* and if we truncated the RAM region, then add it to the end. */ > + e820[idx].type = E820_RAM; > + e820[idx].addr = (uint64_t)(1ULL << 32) > last ? (uint64_t)(1ULL << 32) : last; > + /* also add the balloon memory to the end. */ > + e820[idx].size = (uint64_t)(delta_kb << 10) + (uint64_t)(balloon_kb << 10); > + idx++; > + > + } > + nr = idx; > + > + for (i = 0; i < nr; i++) { > + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]", > + e820[i].type == E820_RAM ? "RAM " : > + (e820[i].type == E820_RESERVED ? "RSV " : > + e820[i].type == E820_ACPI ? "ACPI" : > + (e820[i].type == E820_NVS ? "NVS " : > + (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))), > + e820[i].addr >> 12, > + (e820[i].addr + e820[i].size) >> 12); > + } > + > + /* Done: copy the sanitized version. */ > + *nr_entries = nr; > + memcpy(src, e820, nr * sizeof(struct e820entry)); > + return 0; > +} > + > + > +int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config) > +{ > + int rc; > + uint32_t nr; > + struct e820entry map[E820MAX]; > + libxl_domain_build_info *b_info; > + > + if (d_config == NULL || d_config->c_info.hvm) > + return ERROR_INVAL; > + > + b_info = &d_config->b_info; > + if (b_info->u.pv.no_machine_e820) > + return ERROR_INVAL; > + > + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > + if (rc < 0) { > + errno = rc; > + return ERROR_FAIL; > + } > + nr = rc; > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + if (rc) > + return ERROR_FAIL; > + > + rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr); > + > + if (rc < 0) { > + errno = rc; > + return ERROR_FAIL; > + } > + return 0; > +} > diff -r 4527044d3d1c -r 945a34bcc2bd tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c Tue Apr 12 09:59:59 2011 -0400 > +++ b/tools/libxl/xl_cmdimpl.c Tue Apr 12 10:04:41 2011 -0400 > @@ -371,6 +371,7 @@ static void printf_info(int domid, > printf("\t\t\t(kernel %s)\n", b_info->kernel.path); > printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline); > printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk.path); > + printf("\t\t\t(no_machine_e820 %d)\n", b_info->u.pv.no_machine_e820); > printf("\t\t)\n"); > } > printf("\t)\n"); > @@ -1025,7 +1026,9 @@ skip_vfb: > if (!libxl_device_pci_parse_bdf(&ctx, pcidev, buf)) > d_config->num_pcidevs++; > } > - } > + } else > + if (!c_info->hvm) > + b_info->u.pv.no_machine_e820 = 1; > > switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) { > case 0:_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 19:00 UTC
Re: [Xen-devel] Re: [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
On Tue, Apr 12, 2011 at 07:44:48PM +0100, Ian Campbell wrote:> On Tue, 2011-04-12 at 18:53 +0100, Konrad Rzeszutek Wilk wrote: > > On Tue, Apr 12, 2011 at 01:19:43PM +0100, Ian Campbell wrote: > > > On Mon, 2011-04-11 at 22:35 +0100, Konrad Rzeszutek Wilk wrote: > > > > > > > > > > > > The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when > > > > it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf > > > > will > > > > not trigger libxl__e820_alloc being called (unless the first call to > > > > libxl__e820_alloc failed). > > > > > > That sounds like a very odd non-intuitive location for that allocation. > > > Why not do it in libxl_domain_create or somewhere like that? > > > > > > I think the e820 map added to the idl should become a simple boolean > > > flag and this should all be taken care of internally based on that. > > > > Like this? > > Why no_machine_e820 instead of just machine_e820? You can set a non-zero > default in the appropriate libxl_foo_init function if need be.I tried to ("machine_e820", bool, True, "Use machine''s E820 for PCI passthrough."), and it made the the machine_e820 variable be an ''const bool'' so that you couldn''t change it anymore. xl_cmdimpl.c:990: error: assignment of read-only member ‘machine_e820’ 112 char * cmdline; 113 libxl_file_reference ramdisk; 114 const char * features; 115 /* 116 * Use machine''s E820 for PCI passthrough. 117 */ 118 const bool machine_e820; 119 } pv; 120 } u; 121 } libxl_domain_build_info; If I make an ''integer'' I get the same thing. Ah, if I modify the python code: konrad@phenom:~/hg/xen-unstable.hg/tools/libxl$ hg diff diff -r cc01692f4b41 tools/libxl/gentypes.py --- a/tools/libxl/gentypes.py Tue Apr 12 14:46:05 2011 -0400 +++ b/tools/libxl/gentypes.py Tue Apr 12 14:58:24 2011 -0400 @@ -48,7 +48,7 @@ def libxl_C_type_define(ty, indent = "") s += format_comment(4, f.comment) x = libxl_C_instance_of(f.type, f.name) if f.const: - x = "const " + x + x = " " + x x = x.replace("\n", "\n ") s += " " + x + ";\n" if ty.typename is None: It starts working, but that does not look like the correct fix. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-Apr-12 19:29 UTC
Re: [Xen-devel] Re: [PATCH 3 of 4] libxl: Add support for passing in the machine''s E820 for PCI passthrough in libxl_device_pci_parse_bdf
On Tue, 2011-04-12 at 20:00 +0100, Konrad Rzeszutek Wilk wrote:> On Tue, Apr 12, 2011 at 07:44:48PM +0100, Ian Campbell wrote: > > On Tue, 2011-04-12 at 18:53 +0100, Konrad Rzeszutek Wilk wrote: > > > On Tue, Apr 12, 2011 at 01:19:43PM +0100, Ian Campbell wrote: > > > > On Mon, 2011-04-11 at 22:35 +0100, Konrad Rzeszutek Wilk wrote: > > > > > > > > > > > > > > > The code libxl_device_pci_parse_bdf calls the libxl__e820_alloc when > > > > > it has parsed one BDF. Subsequent calls to libxl_device_pci_parse_bdf > > > > > will > > > > > not trigger libxl__e820_alloc being called (unless the first call to > > > > > libxl__e820_alloc failed). > > > > > > > > That sounds like a very odd non-intuitive location for that allocation. > > > > Why not do it in libxl_domain_create or somewhere like that? > > > > > > > > I think the e820 map added to the idl should become a simple boolean > > > > flag and this should all be taken care of internally based on that. > > > > > > Like this? > > > > Why no_machine_e820 instead of just machine_e820? You can set a non-zero > > default in the appropriate libxl_foo_init function if need be. > > I tried to > ("machine_e820", bool, True, "Use machine''s E820 for PCI passthrough."), > > and it made the the machine_e820 variable be an ''const bool'' so that you couldn''t > change it anymore.The "True" in the above is the const-ness, not the initial value. You want: ("machine_e820", bool, False, "Use machine''s E820 for PCI passthrough."), + the appropriate initialisation in libxl_init_build_info. Ian.> > xl_cmdimpl.c:990: error: assignment of read-only member ‘machine_e820’ > > 112 char * cmdline; > 113 libxl_file_reference ramdisk; > 114 const char * features; > 115 /* > 116 * Use machine''s E820 for PCI passthrough. > 117 */ > 118 const bool machine_e820; > 119 } pv; > 120 } u; > 121 } libxl_domain_build_info; > > If I make an ''integer'' I get the same thing. Ah, if I modify the python code: > konrad@phenom:~/hg/xen-unstable.hg/tools/libxl$ hg diff > diff -r cc01692f4b41 tools/libxl/gentypes.py > --- a/tools/libxl/gentypes.py Tue Apr 12 14:46:05 2011 -0400 > +++ b/tools/libxl/gentypes.py Tue Apr 12 14:58:24 2011 -0400 > @@ -48,7 +48,7 @@ def libxl_C_type_define(ty, indent = "") > s += format_comment(4, f.comment) > x = libxl_C_instance_of(f.type, f.name) > if f.const: > - x = "const " + x > + x = " " + x > x = x.replace("\n", "\n ") > s += " " + x + ";\n" > if ty.typename is None: > > It starts working, but that does not look like the correct fix._______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2011-Apr-13 07:34 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
>>> On 12.04.11 at 19:21, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Tue, Apr 12, 2011 at 02:06:05PM +0100, Keir Fraser wrote: >> On 12/04/2011 13:53, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: >> >> >> How cunning. >> >> >> >> Why wouldn''t you just allocate exactly the right size of array in >> >> XENMEM_set_memory_map? >> > >> > I was thinking about it, but the mm.c code did not have the >> > xen/xmalloc.h header, nor any references to xmalloc_array. >> > >> > Is it OK to make an xmalloc_array during a hypercall? >> >> Yes. I think the toolstack should be able to clean up on the newly-possible >> ENOMEM return from this hypercall. > > Hm, not sure what I am hitting, but I can''t seem to be able to copy over the > contents to the newly allocated array from the guest (this works > fine with the previous version of the patch). This is what I get > > (XEN) mm.c:4734:d0 We want 8 E820 entries. We have: 1. > (XEN) mm.c:4754:d0 Copying E820 8 entries. > (XEN) ----[ Xen-4.2-110412 : 00000000000000a0 > (XEN) rdx: 0000000000000000 rsi: 0000000000680004 rdi: 0000000000000000 > (XEN) rbp: ffff82c48028fc68 rsp: ffff82c48028fc58 r8: ffff82c4802c8bd0 > (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 > (XEN) r12: 00000000000000a0 r13: 0000000000000000 r14: 0000000000680004 > (XEN) r15: 0000000000000008 cr0: 0000000080050033 cr4: 00000000000026f0 > (XEN) cr3: 00000001056af000 cr2: 0000000000000000De-referencing NULL, because of ...> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 > (XEN) Xen stack trace from rsp=ffff82c48028fc58: > (XEN) 00000000000000a0 00000000000000a0 ffff82c48028fc98 ffff82c48017fa7d > (XEN) 0000000000000008 0000000000000000 0000000000000008 ffff88005c51d7c0 > (XEN) ffff82c48028fd38 ffff82c480169857 ffff82c48028fcb8 ffff82c480167d9a > (XEN) 000000008028fd08 ffff82f60214be50 0000000000000001 0000000000000008 > (XEN) 0000000000680004 0000000000000001 ffff88005d8aa1f8 ffff8301056e0000 > (XEN) ffff83011f9b3000 0000000000000000 ffff82c48028fe28 ffff8300cf0fd000 > (XEN) 000000000000000d 00007fff2d629060 ffff88005c51d7c0 0000000000000008 > (XEN) ffff82c48028fef8 ffff82c48011399c ffff82c4802c9940 000000008028ff18 > (XEN) ffff82c48028fd88 0000000000123412 ffff82f602468250 ffff82c48028ff18 > (XEN) ffff82c48028ff18 ffff82c4802c8e38 000000008028fde8 ffff82f602468250 > (XEN) 00000000001068f6 ffff83011f9b3000 0000000000000070 0000000123412000 > (XEN) ffff82f602468240 ffff88005d8a9220 0000000000000000 0000000000682004 > (XEN) ffff82c48028fe28 ffff82c48016aa14 8000000127365065 000000018015fd21 > (XEN) 0000000000000000 ffff8300cf0fd1c8 ffff8300cf0fd1c8 ffff82c48028fe88 > (XEN) aaaaaaaaaaaaaaaa ffff88005d8a9220 ffff82c48028fef8 ffff82c480113cb7 > (XEN) ffff8300cf0fd1d0 ffff8300cf0fd000 000000018028fef8 ffff82c48028ff18 > (XEN) ffff82c48028fe68 ffff82c48028ff18 ffff8300cf0fd000 00007fff2d6290b0 > (XEN) 0000000000000001 0000000000000008 ffff82c48028fef8 ffff82c4802038c3 > (XEN) 0000000000680000 00007fb540012cd0 0000000000000000 0000000000000000 > (XEN) 00007fb540012e60 000000000000e033 ffff82c48028fed8 ffff8300cf0fd000 > (XEN) Xen call trace: > (XEN) [<ffff82c48020a3c5>] vmac+0x5da/0x927 > (XEN) [<ffff82c48017fa7d>] copy_from_user+0x72/0x9f > (XEN) [<ffff82c480169857>] arch_memory_op+0x7cc/0xf05 > (XEN) [<ffff82c48011399c>] do_memory_op+0x1dca/0x1df2 > (XEN) [<ffff82c4802082c8>] syscall_enter+0xc8/0x122 > > And the patch is: > > diff -r 97763efc41f9 xen/arch/x86/domain.c > --- a/xen/arch/x86/domain.c Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/arch/x86/domain.c Tue Apr 12 13:18:34 2011 -0400 > @@ -696,6 +696,8 @@ void arch_domain_destroy(struct domain * > > if ( is_hvm_domain(d) ) > hvm_domain_destroy(d); > + else > + xfree(d->arch.pv_domain.e820); > > vmce_destroy_msr(d); > pci_release_devices(d); > diff -r 97763efc41f9 xen/arch/x86/mm.c > --- a/xen/arch/x86/mm.c Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/arch/x86/mm.c Tue Apr 12 13:18:34 2011 -0400 > @@ -100,6 +100,7 @@ > #include <xen/iocap.h> > #include <xen/guest_access.h> > #include <xen/pfn.h> > +#include <xen/xmalloc.h> > #include <asm/paging.h> > #include <asm/shadow.h> > #include <asm/page.h> > @@ -4710,7 +4711,7 @@ long arch_memory_op(int op, XEN_GUEST_HA > if ( copy_from_guest(&fmap, arg, 1) ) > return -EFAULT; > > - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) > + if ( fmap.map.nr_entries > E820MAX ) > return -EINVAL; > > rc = rcu_lock_target_domain_by_id(fmap.domid, &d); > @@ -4730,9 +4731,39 @@ long arch_memory_op(int op, XEN_GUEST_HA > return -EPERM; > } > > + gdprintk(XENLOG_INFO, "We want %d E820 entries. We have: %d.\n", fmap.map.nr_entries, d->arch.pv_domain.nr_e820); > + if ( d->arch.pv_domain.e820 ) > + { > + if ( fmap.map.nr_entries >= d->arch.pv_domain.nr_e820 ) > + { > + xfree(d->arch.pv_domain.e820); > + d->arch.pv_domain.e820 = NULL; > + d->arch.pv_domain.nr_e820 = 0; > + } > + } else {... this "else". If you have an e820 table already, and it''s too small, you want to not only free it, but also allocate a new one. So instead of an "else" you''ll need a second "if" here, recovering from the table possibly having got freed. Since xfree(NULL) is okay, you could actually simply use the inner conditional on the first check. Jan> + d->arch.pv_domain.e820 = xmalloc_array(e820entry_t, > + fmap.map.nr_entries + 1); > + if ( !d->arch.pv_domain.e820 ) > + { > + rcu_unlock_domain(d); > + return -ENOMEM; > + } > + gdprintk(XENLOG_INFO, "Allocated E820 for %d entries.\n", fmap.map.nr_entries); > + memset(d->arch.pv_domain.e820, 0, (fmap.map.nr_entries + 1) * sizeof(e820entry_t)); > + } > + gdprintk(XENLOG_INFO, "Copying E820 %d entries.\n", fmap.map.nr_entries); > rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, > fmap.map.nr_entries) ? -EFAULT : 0; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > + > + if ( rc == 0 ) > + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > + else { > + /* Don''t free the E820 if it had been set before and we failed. */ > + if ( !d->arch.pv_domain.nr_e820 ) { > + xfree(d->arch.pv_domain.e820); > + d->arch.pv_domain.e820 = NULL; > + } > + } > > rcu_unlock_domain(d); > return rc; > @@ -4747,6 +4778,9 @@ long arch_memory_op(int op, XEN_GUEST_HA > if ( d->arch.pv_domain.nr_e820 == 0 ) > return -ENOSYS; > > + if ( d->arch.pv_domain.e820 == NULL ) > + return -ENOSYS; > + > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > diff -r 97763efc41f9 xen/include/asm-x86/domain.h > --- a/xen/include/asm-x86/domain.h Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/include/asm-x86/domain.h Tue Apr 12 13:18:34 2011 -0400 > @@ -238,7 +238,7 @@ struct pv_domain > unsigned long pirq_eoi_map_mfn; > > /* Pseudophysical e820 map (XENMEM_memory_map). */ > - struct e820entry e820[3]; > + struct e820entry *e820; > unsigned int nr_e820; > }; > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2011-Apr-13 08:46 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On 12/04/2011 18:21, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:> On Tue, Apr 12, 2011 at 02:06:05PM +0100, Keir Fraser wrote: >> On 12/04/2011 13:53, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: >> >>>> How cunning. >>>> >>>> Why wouldn''t you just allocate exactly the right size of array in >>>> XENMEM_set_memory_map? >>> >>> I was thinking about it, but the mm.c code did not have the >>> xen/xmalloc.h header, nor any references to xmalloc_array. >>> >>> Is it OK to make an xmalloc_array during a hypercall? >> >> Yes. I think the toolstack should be able to clean up on the newly-possible >> ENOMEM return from this hypercall. > > Hm, not sure what I am hitting, but I can''t seem to be able to copy over the > contents to the newly allocated array from the guest (this works > fine with the previous version of the patch). This is what I getYour patch is not really quite right. I''ve attached a modified version for you to try. -- Keir> (XEN) mm.c:4734:d0 We want 8 E820 entries. We have: 1. > (XEN) mm.c:4754:d0 Copying E820 8 entries. > (XEN) ----[ Xen-4.2-110412 : 00000000000000a0 > (XEN) rdx: 0000000000000000 rsi: 0000000000680004 rdi: 0000000000000000 > (XEN) rbp: ffff82c48028fc68 rsp: ffff82c48028fc58 r8: ffff82c4802c8bd0 > (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 > (XEN) r12: 00000000000000a0 r13: 0000000000000000 r14: 0000000000680004 > (XEN) r15: 0000000000000008 cr0: 0000000080050033 cr4: 00000000000026f0 > (XEN) cr3: 00000001056af000 cr2: 0000000000000000 > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 > (XEN) Xen stack trace from rsp=ffff82c48028fc58: > (XEN) 00000000000000a0 00000000000000a0 ffff82c48028fc98 ffff82c48017fa7d > (XEN) 0000000000000008 0000000000000000 0000000000000008 ffff88005c51d7c0 > (XEN) ffff82c48028fd38 ffff82c480169857 ffff82c48028fcb8 ffff82c480167d9a > (XEN) 000000008028fd08 ffff82f60214be50 0000000000000001 0000000000000008 > (XEN) 0000000000680004 0000000000000001 ffff88005d8aa1f8 ffff8301056e0000 > (XEN) ffff83011f9b3000 0000000000000000 ffff82c48028fe28 ffff8300cf0fd000 > (XEN) 000000000000000d 00007fff2d629060 ffff88005c51d7c0 0000000000000008 > (XEN) ffff82c48028fef8 ffff82c48011399c ffff82c4802c9940 000000008028ff18 > (XEN) ffff82c48028fd88 0000000000123412 ffff82f602468250 ffff82c48028ff18 > (XEN) ffff82c48028ff18 ffff82c4802c8e38 000000008028fde8 ffff82f602468250 > (XEN) 00000000001068f6 ffff83011f9b3000 0000000000000070 0000000123412000 > (XEN) ffff82f602468240 ffff88005d8a9220 0000000000000000 0000000000682004 > (XEN) ffff82c48028fe28 ffff82c48016aa14 8000000127365065 000000018015fd21 > (XEN) 0000000000000000 ffff8300cf0fd1c8 ffff8300cf0fd1c8 ffff82c48028fe88 > (XEN) aaaaaaaaaaaaaaaa ffff88005d8a9220 ffff82c48028fef8 ffff82c480113cb7 > (XEN) ffff8300cf0fd1d0 ffff8300cf0fd000 000000018028fef8 ffff82c48028ff18 > (XEN) ffff82c48028fe68 ffff82c48028ff18 ffff8300cf0fd000 00007fff2d6290b0 > (XEN) 0000000000000001 0000000000000008 ffff82c48028fef8 ffff82c4802038c3 > (XEN) 0000000000680000 00007fb540012cd0 0000000000000000 0000000000000000 > (XEN) 00007fb540012e60 000000000000e033 ffff82c48028fed8 ffff8300cf0fd000 > (XEN) Xen call trace: > (XEN) [<ffff82c48020a3c5>] vmac+0x5da/0x927 > (XEN) [<ffff82c48017fa7d>] copy_from_user+0x72/0x9f > (XEN) [<ffff82c480169857>] arch_memory_op+0x7cc/0xf05 > (XEN) [<ffff82c48011399c>] do_memory_op+0x1dca/0x1df2 > (XEN) [<ffff82c4802082c8>] syscall_enter+0xc8/0x122 > > And the patch is: > > diff -r 97763efc41f9 xen/arch/x86/domain.c > --- a/xen/arch/x86/domain.c Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/arch/x86/domain.c Tue Apr 12 13:18:34 2011 -0400 > @@ -696,6 +696,8 @@ void arch_domain_destroy(struct domain * > > if ( is_hvm_domain(d) ) > hvm_domain_destroy(d); > + else > + xfree(d->arch.pv_domain.e820); > > vmce_destroy_msr(d); > pci_release_devices(d); > diff -r 97763efc41f9 xen/arch/x86/mm.c > --- a/xen/arch/x86/mm.c Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/arch/x86/mm.c Tue Apr 12 13:18:34 2011 -0400 > @@ -100,6 +100,7 @@ > #include <xen/iocap.h> > #include <xen/guest_access.h> > #include <xen/pfn.h> > +#include <xen/xmalloc.h> > #include <asm/paging.h> > #include <asm/shadow.h> > #include <asm/page.h> > @@ -4710,7 +4711,7 @@ long arch_memory_op(int op, XEN_GUEST_HA > if ( copy_from_guest(&fmap, arg, 1) ) > return -EFAULT; > > - if ( fmap.map.nr_entries > ARRAY_SIZE(d->arch.pv_domain.e820) ) > + if ( fmap.map.nr_entries > E820MAX ) > return -EINVAL; > > rc = rcu_lock_target_domain_by_id(fmap.domid, &d); > @@ -4730,9 +4731,39 @@ long arch_memory_op(int op, XEN_GUEST_HA > return -EPERM; > } > > + gdprintk(XENLOG_INFO, "We want %d E820 entries. We have: %d.\n", > fmap.map.nr_entries, d->arch.pv_domain.nr_e820); > + if ( d->arch.pv_domain.e820 ) > + { > + if ( fmap.map.nr_entries >= d->arch.pv_domain.nr_e820 ) > + { > + xfree(d->arch.pv_domain.e820); > + d->arch.pv_domain.e820 = NULL; > + d->arch.pv_domain.nr_e820 = 0; > + } > + } else { > + d->arch.pv_domain.e820 = xmalloc_array(e820entry_t, > + fmap.map.nr_entries + 1); > + if ( !d->arch.pv_domain.e820 ) > + { > + rcu_unlock_domain(d); > + return -ENOMEM; > + } > + gdprintk(XENLOG_INFO, "Allocated E820 for %d entries.\n", > fmap.map.nr_entries); > + memset(d->arch.pv_domain.e820, 0, (fmap.map.nr_entries + 1) * > sizeof(e820entry_t)); > + } > + gdprintk(XENLOG_INFO, "Copying E820 %d entries.\n", > fmap.map.nr_entries); > rc = copy_from_guest(d->arch.pv_domain.e820, fmap.map.buffer, > fmap.map.nr_entries) ? -EFAULT : 0; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > + > + if ( rc == 0 ) > + d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > + else { > + /* Don''t free the E820 if it had been set before and we failed. */ > + if ( !d->arch.pv_domain.nr_e820 ) { > + xfree(d->arch.pv_domain.e820); > + d->arch.pv_domain.e820 = NULL; > + } > + } > > rcu_unlock_domain(d); > return rc; > @@ -4747,6 +4778,9 @@ long arch_memory_op(int op, XEN_GUEST_HA > if ( d->arch.pv_domain.nr_e820 == 0 ) > return -ENOSYS; > > + if ( d->arch.pv_domain.e820 == NULL ) > + return -ENOSYS; > + > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > diff -r 97763efc41f9 xen/include/asm-x86/domain.h > --- a/xen/include/asm-x86/domain.h Tue Apr 05 18:23:54 2011 +0100 > +++ b/xen/include/asm-x86/domain.h Tue Apr 12 13:18:34 2011 -0400 > @@ -238,7 +238,7 @@ struct pv_domain > unsigned long pirq_eoi_map_mfn; > > /* Pseudophysical e820 map (XENMEM_memory_map). */ > - struct e820entry e820[3]; > + struct e820entry *e820; > unsigned int nr_e820; > }; >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-13 13:31 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
> > ... this "else". If you have an e820 table already, and it''s too small, > you want to not only free it, but also allocate a new one. So > instead of an "else" you''ll need a second "if" here, recovering from > the table possibly having got freed. Since xfree(NULL) is okay, you > could actually simply use the inner conditional on the first check.Grrr.. Right. How embarrassing - I will blame it on the lack of sleep. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-13 13:32 UTC
Re: [Xen-devel] [PATCH 2 of 4] x86: make the pv-only e820 array be dynamic
On Wed, Apr 13, 2011 at 09:46:15AM +0100, Keir Fraser wrote:> On 12/04/2011 18:21, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > > > On Tue, Apr 12, 2011 at 02:06:05PM +0100, Keir Fraser wrote: > >> On 12/04/2011 13:53, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > >> > >>>> How cunning. > >>>> > >>>> Why wouldn''t you just allocate exactly the right size of array in > >>>> XENMEM_set_memory_map? > >>> > >>> I was thinking about it, but the mm.c code did not have the > >>> xen/xmalloc.h header, nor any references to xmalloc_array. > >>> > >>> Is it OK to make an xmalloc_array during a hypercall? > >> > >> Yes. I think the toolstack should be able to clean up on the newly-possible > >> ENOMEM return from this hypercall. > > > > Hm, not sure what I am hitting, but I can''t seem to be able to copy over the > > contents to the newly allocated array from the guest (this works > > fine with the previous version of the patch). This is what I get > > Your patch is not really quite right. I''ve attached a modified version for > you to try.Excellent, and it works quite nicely. Thank you. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel